On 2/1/07 3:18 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> As for XML, or any other format a user might POST to solr (or ask solr
> to fetch from a remote source) what possible reason would we have to only
> supporting UTF-8? .. why do you suggest that the XML standard "specify
> UTF-8, [so] we should use UTF-8" ... doesn't the XML standard say we
> should use the charset specified in the content-type if there is one, and
> if not use the encoding specified in the xml header, ie...
>
> <?xml encoding='EUC-JP'?>
The XML spec only requires parsers to support UTF-8 and UTF-16. If you use
a different encoding for XML, there is no guarantee that a conforming
parser will accept it. Ultraseek has been indexing XML for the past nine
years, and in that time I can recall only a single customer whose XML was
in a non-standard encoding. Effectively all real-world XML is in one of
the standard encodings.

The right spec for XML over HTTP is RFC 3023. For text/xml with no
charset parameter, the XML must be interpreted as US-ASCII. From section
8.5:

    Omitting the charset parameter is NOT RECOMMENDED for text/xml. For
    example, even if the contents of the XML MIME entity are UTF-16 or
    UTF-8, or the XML MIME entity has an explicit encoding declaration,
    XML and MIME processors MUST assume the charset is "us-ascii".

wunder
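P.S. Those RFC 3023 rules reduce to a few lines of logic. A rough sketch
(a hypothetical helper, not Solr code; the method name and parameters are
mine, and it only covers the cases discussed here):

```java
// Sketch of RFC 3023 charset selection for XML received over HTTP.
public class XmlCharset {

    // mimeType:        media type from the Content-Type header
    // charsetParam:    charset parameter from Content-Type, or null
    // xmlDeclEncoding: encoding from the <?xml ...?> declaration, or null
    static String charsetFor(String mimeType, String charsetParam,
                             String xmlDeclEncoding) {
        if (charsetParam != null) {
            return charsetParam;       // an explicit charset parameter wins
        }
        if (mimeType.startsWith("text/")) {
            return "us-ascii";         // RFC 3023: text/xml with no charset
        }
        // application/xml: fall back to the XML declaration, else UTF-8
        return xmlDeclEncoding != null ? xmlDeclEncoding : "UTF-8";
    }

    public static void main(String[] args) {
        // Hostetter's example: the declaration is ignored for text/xml
        System.out.println(charsetFor("text/xml", null, "EUC-JP"));
        // ...but honored for application/xml
        System.out.println(charsetFor("application/xml", null, "EUC-JP"));
    }
}
```

Note that this is exactly why the charset in the Content-Type header, not
the XML declaration, has to be checked first: for text/xml the declaration
is irrelevant.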