: > Solr, in my opinion, shouldn't have the string "UTF-8" hardcoded in it
: > anywhere -- not even in the example config: new users shouldn't need to
: > know about or have any special solrconfig options that must be (un)set to get
: > Solr to use their servlet container / system default charset.
:
: I strongly disagree. When we use standards like URIs and XML which
: specify UTF-8, we should use UTF-8.

I'm confused:

As far as URIs/URLs go, Solr isn't the one decoding them, and as I said: nothing in the servlet spec suggests that an app has any say over how the servlet container will decode them, presumably because they *must* be UTF-8 ... so this is not our problem, and we shouldn't go out of our way to try and force the servlet container to treat the URLs as UTF-8.

As for XML, or any other format a user might POST to Solr (or ask Solr to fetch from a remote source): what possible reason would we have for only supporting UTF-8? Why do you suggest that the XML standard "specify UTF-8, [so] we should use UTF-8"? Doesn't the XML standard say we should use the charset specified in the Content-Type if there is one, and if not, use the encoding specified in the XML declaration, i.e. ...

    <?xml version='1.0' encoding='EUC-JP'?>

... the only real question in my mind is what to do if user-supplied data has *NO* charset information of any kind. For XML the spec seems very clear that in that case you test for UTF-8 or UTF-16, but for arbitrary streams of character data in other formats (CSV, JSON, etc.) it seems like trusting the servlet container to tell us the default encoding is the right way to go.

-Hoss
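
The lookup order argued for above can be sketched roughly as follows. This is a minimal Python illustration, not Solr code: the function names, the `is_xml` flag, and the `container_default` parameter are all hypothetical, and the XML check is simplified to byte order marks plus an ASCII-readable XML declaration (the full detection rules in the XML spec cover more cases, such as UTF-16 without a BOM).

```python
import codecs
import re
from typing import Optional

# Hypothetical helpers for illustration only; none of these names exist in Solr.
_CHARSET_PARAM = re.compile(r'charset\s*=\s*"?([^\s;"]+)"?', re.IGNORECASE)
_XML_DECL_ENCODING = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def charset_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type header, if present."""
    if not content_type:
        return None
    match = _CHARSET_PARAM.search(content_type)
    return match.group(1) if match else None


def charset_from_xml_prolog(head: bytes) -> Optional[str]:
    """Simplified sniffing: look for a BOM, then an XML declaration."""
    if head.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if head.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "UTF-16"
    match = _XML_DECL_ENCODING.match(head)
    return match.group(1).decode("ascii") if match else None


def resolve_charset(content_type: Optional[str], head: bytes,
                    is_xml: bool, container_default: str) -> str:
    """Pick a charset: explicit header first, then format rules, then default."""
    explicit = charset_from_content_type(content_type)
    if explicit:
        return explicit
    if is_xml:
        # With no charset information at all, XML is treated as UTF-8.
        return charset_from_xml_prolog(head) or "UTF-8"
    # Assumption: for CSV, JSON, etc. we trust whatever the container reports.
    return container_default
```

For example, `resolve_charset("text/xml", b"<?xml version='1.0' encoding='EUC-JP'?><add/>", True, "windows-1252")` picks the declared encoding, while a CSV body with a bare `text/csv` header falls through to the container default.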