: The XML spec says that XML parsers are only required to support
: UTF-8, UTF-16, ISO 8859-1, and US-ASCII. If you use a different
: encoding for XML, there is no guarantee that a conforming parser
: will accept it.
There may not be a guarantee -- but shouldn't we at least try to respect the client's wishes?

: Ultraseek has been indexing XML for the past nine years, and
: I remember a single customer that had XML in a non-standard
: encoding. Effectively all real-world XML is in one of the
: standard encodings.

That may be, but Solr was only publicly available for 9 months before we had someone running into confusion because they were trying to post an XML file that wasn't UTF-8 :)

http://www.nabble.com/wana-use-CJKAnalyzer-tf2303256.html#a6498685

: The right spec for XML over HTTP is RFC 3023. For text/xml
: with no charset spec, the XML must be interpreted as US-ASCII.

I can go along with that ... if there is a specification for a file format that says which charset should be assumed when it can't be determined, then I agree that's a case where it makes sense to hardcode "UTF-8" or "US-ASCII" in Solr ... but that's not a justification for using something like request.setCharacterEncoding("UTF-8") in the SolrDispatcher, where it applies to everything -- it's a justification for hardcoding a default of US-ASCII or UTF-8 in the XmlUpdateRequestHandler.

As a general rule, it seems like trusting the ServletContainer for the default is the right thing to do.

-Hoss
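P.S. just to make the distinction concrete, here is a rough sketch of what I mean by honoring the charset the client declared and only falling back to a handler-level default. The class and method names (XmlCharsetSketch, xmlReader) are made up for illustration -- this is not the actual XmlUpdateRequestHandler or SolrDispatcher code, just the general shape of it:

  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.io.Reader;
  import java.nio.charset.Charset;

  import javax.servlet.http.HttpServletRequest;

  public class XmlCharsetSketch {

    // Fallback for text/xml bodies that carry no charset parameter (RFC 3023).
    private static final Charset FALLBACK = Charset.forName("US-ASCII");

    /**
     * Build a Reader for a posted XML body, honoring the charset the client
     * declared in its Content-Type header and only falling back to the
     * handler-level default when none was sent -- instead of calling
     * request.setCharacterEncoding("UTF-8") globally in the dispatcher.
     */
    public static Reader xmlReader(HttpServletRequest req) throws IOException {
      String declared = req.getCharacterEncoding(); // charset= from Content-Type, or null
      Charset charset = (declared != null) ? Charset.forName(declared) : FALLBACK;
      return new InputStreamReader(req.getInputStream(), charset);
    }
  }

With something like that, the dispatcher never has to force one encoding on every request, and a handler that really does have a spec-defined default (like US-ASCII for text/xml per RFC 3023) can supply it for itself.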