Recently, we debugged an issue where a user had a UIMA-AS client running on
Windows, connecting to a UIMA-AS service running on Linux in the cloud.
The Linux box was set up with LANG etc. set to UTF-8. The Windows machine did
not have any special configuration.
After a successful service deployment on Linux, the Windows client sent a
GetMeta request, received a "message string" from the transport, and tried to
parse it with the XML parser, which returned this error:
org.xml.sax.SAXParseException: Invalid byte 1 of 1-byte UTF-8 sequence.
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.uima.util.impl.XMLParser_impl.parse(XMLParser_impl.java:202)
Eventually the user worked around this by launching the Windows client JVM
with the extra parameter
-D"file.encoding=UTF-8"
which made this problem go away (but may introduce other issues).
Should UIMA-AS communication protocols specify UTF-8 explicitly, instead of
relying on "platform defaults", which seem to cause issues when the defaults
on the two ends aren't compatible?
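
To illustrate the kind of mismatch I suspect, here is a minimal sketch. It is
not taken from the UIMA-AS code; the class and method names are made up for
this example. It assumes one side turns the XML string into bytes with a
non-UTF-8 charset (ISO-8859-1 stands in here for a typical Windows default)
while the other side decodes those bytes strictly as UTF-8, the way an XML
parser would. Naming the charset explicitly on both sides avoids the failure.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of UIMA-AS.
public class EncodingMismatchSketch {

    // Decode bytes strictly as UTF-8, reporting malformed input instead of
    // silently substituting replacement characters.
    public static String decodeStrictUtf8(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Encode the text with the sender's charset, then check whether a strict
    // UTF-8 receiver gets the same text back.
    public static boolean roundTrips(String text, Charset senderCharset) {
        byte[] wire = text.getBytes(senderCharset);
        try {
            return text.equals(decodeStrictUtf8(wire));
        } catch (CharacterCodingException e) {
            return false; // the receiver rejected the bytes as invalid UTF-8
        }
    }

    public static void main(String[] args) {
        // Sample payload with one non-ASCII character (e with acute accent).
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>caf\u00e9</name>";

        // Assumption: ISO-8859-1 stands in for a non-UTF-8 platform default.
        System.out.println("ISO-8859-1 sender: " + roundTrips(xml, StandardCharsets.ISO_8859_1));
        // Naming UTF-8 explicitly does not depend on the platform default.
        System.out.println("UTF-8 sender:      " + roundTrips(xml, StandardCharsets.UTF_8));
    }
}
```

Pure ASCII text round-trips either way, which may be why a problem like this
only shows up once a descriptor contains a non-ASCII character.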
-Marshall