Recently, we debugged an issue where a user had a UIMA-AS client running on Windows, connecting to a UIMA-AS service running on Linux in the cloud.

The Linux box was set up with LANG, etc., set to UTF-8.  The Windows machine did not have any special configuration.

After a successful service deployment on Linux, the Windows client sent a GetMeta request, received the reply as a "message string" from the transport, and tried to parse it with the XML parser, which returned an error:

org.xml.sax.SAXParseException: Invalid byte 1 of 1-byte UTF-8 sequence.
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.uima.util.impl.XMLParser_impl.parse(XMLParser_impl.java:202)
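
A minimal sketch of how this class of error can arise, assuming the message string was converted to bytes with a single-byte charset while the parser expected UTF-8. This is not UIMA-AS code: the class and method names are hypothetical, ISO-8859-1 stands in for a Windows default encoding, and the parser is whatever JAXP implementation the JDK supplies, so the exact exception type and message may differ from the trace above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

final class EncodingMismatchDemo {
    // XML that declares UTF-8 and contains one non-ASCII character (e with acute accent).
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>caf\u00e9</name>";

    // Returns true if the bytes parse cleanly, false if parsing fails for any reason.
    static boolean parses(byte[] bytes) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(bytes), new DefaultHandler());
            return true;
        } catch (Exception e) {
            // Depending on the parser, malformed UTF-8 surfaces as a
            // SAXParseException or as an IOException subclass.
            System.out.println("parse failed: " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        // Bytes that match the declared encoding.
        System.out.println("UTF-8 bytes parse: "
            + parses(XML.getBytes(StandardCharsets.UTF_8)));
        // Bytes produced with a single-byte charset, then read as UTF-8.
        System.out.println("ISO-8859-1 bytes parse: "
            + parses(XML.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The pure-ASCII part of the message is identical in both encodings, which is why a mismatch like this can go unnoticed until the first non-ASCII character shows up.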

Eventually the user worked around this by launching the Windows client JVM with the extra parameter

 -D"file.encoding=UTF-8"

which made this problem go away (but may introduce other issues).

Should UIMA-AS communication protocols specify UTF-8 explicitly, instead of relying on "platform defaults", which seem to cause issues when the defaults on the two ends aren't compatible?
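
To make the question concrete, here is a rough sketch of what "specify UTF-8 explicitly" could look like at the point where strings become bytes and back. The class, the WIRE constant, and the toWire/fromWire helpers are all hypothetical names for illustration, not existing UIMA-AS API; the only real calls are String.getBytes(Charset) and new String(byte[], Charset).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {
    // Hypothetical wire charset agreed on by both ends, independent of file.encoding.
    static final Charset WIRE = StandardCharsets.UTF_8;

    // Encode with the agreed charset instead of the no-argument getBytes().
    static byte[] toWire(String message) {
        return message.getBytes(WIRE);
    }

    // Decode with the same agreed charset instead of new String(bytes).
    static String fromWire(byte[] bytes) {
        return new String(bytes, WIRE);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";
        byte[] bytes = toWire(original);
        System.out.println("round trip ok: " + original.equals(fromWire(bytes)));

        // Decoding with a different charset than the sender used garbles the text.
        String garbled = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println("mismatched decode ok: " + original.equals(garbled));
    }
}
```

With both ends pinned to one charset, the result no longer depends on LANG on Linux or on the Windows code page, and no JVM-wide file.encoding override is needed.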

-Marshall
