: The XML spec says that XML parsers are only required to support
: UTF-8, UTF-16, ISO 8859-1, and US-ASCII. If you use a different
: encoding for XML, there is no guarantee that a conforming parser
: will accept it.

There may not be a guarantee -- but shouldn't we at least try to respect
the client's wishes?

: Ultraseek has been indexing XML for the past nine years, and
: I remember a single customer that had XML in a non-standard
: encoding. Effectively all real-world XML is in one of the
: standard encodings.

That may be, but Solr was only publicly available for 9 months before we
had someone running into confusion because they were trying to post an XML
file that wasn't UTF-8 :)

    http://www.nabble.com/wana-use-CJKAnalyzer-tf2303256.html#a6498685

: The right spec for XML over HTTP is RFC 3023. For text/xml
: with no charset spec, the XML must be interpreted as US-ASCII.

I can go along with that ... if there is a specification for a file format
that says which charset should be assumed when it can't be determined, then
I agree, that's a case where it makes sense to hardcode "UTF-8" or
"US-ASCII" in Solr ... but that's not a justification for using something
like request.setCharacterEncoding("UTF-8") in the SolrDispatcher where it
applies to everything -- it's a justification for hardcoding a default of
US-ASCII or UTF-8 in the XmlUpdateRequestHandler.
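
To make that concrete, here is a rough sketch of what a handler-level
default could look like. The class and method names are made up for
illustration (this is not existing Solr code), and it assumes the handler
can see the raw Content-Type header: an explicit charset from the client
wins, text/xml with no charset falls back to US-ASCII per RFC 3023, and
anything else returns null so the XML parser can work it out from the BOM
or the XML declaration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Solr: picks a charset for an XML
// request body based only on the Content-Type header value.
final class XmlCharsetDefaults {

    private XmlCharsetDefaults() {}

    /**
     * Returns the charset named in the Content-Type header if there is one,
     * US-ASCII for text/xml without a charset parameter (RFC 3023), or null
     * to mean "no opinion, let the XML parser detect the encoding".
     * Charset.forName throws an unchecked exception for unknown names.
     */
    static Charset resolve(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        String mediaType = parts[0].trim().toLowerCase(Locale.ROOT);

        // Respect the client's wishes first: look for charset=...
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                String name = param.substring(eq + 1).trim();
                // The value may be a quoted string, e.g. charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                return Charset.forName(name);
            }
        }

        // Format-specific default, applied only by the XML handler.
        if (mediaType.equals("text/xml")) {
            return StandardCharsets.US_ASCII;
        }
        return null;
    }
}
```

With something like this, "text/xml; charset=ISO-8859-1" is read as
ISO-8859-1, a bare "text/xml" is read as US-ASCII, and "application/xml"
is left to the parser, all without touching the encoding of any other
request type.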

As a general rule, it seems like trusting the ServletContainer for the
default is the right thing to do.





-Hoss
