[ https://issues.apache.org/jira/browse/SOLR-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990132#comment-12990132 ]
Uwe Schindler commented on SOLR-2347:
-------------------------------------

To repost the most important information from the XML spec on how to handle charset detection:

{quote}
XML parsers by definition only take a byte stream and a charset hint and use the XML 1.0 spec to determine the charset: http://www.w3.org/TR/REC-xml/#charencoding and http://www.w3.org/TR/REC-xml/#sec-guessing
{quote}

> Use InputStream and not Reader for XML parsing
> ----------------------------------------------
>
>                 Key: SOLR-2347
>                 URL: https://issues.apache.org/jira/browse/SOLR-2347
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>
> Followup to SOLR-96:
> Solr mostly uses java.io.Reader and passes this Reader to the XML parser.
> According to the XML spec, an XML file should initially be treated as a binary
> stream with a default charset of UTF-8 or another charset given by the network
> protocol (like the Content-Type header in HTTP). Importantly, this default
> charset is only a "hint" to the parser; the charset from the XML header
> processing instruction is mandatory. Because of this, the parser must be able
> to change the charset when reading the XML headers (possibly also when seeing
> BOM markers). This is not possible if the XML parser gets a java.io.Reader
> instead of a java.io.InputStream. SOLR-96 already fixed this for the
> XmlUpdateRequestHandler and the DocumentAnalysisRequestHandler. This issue
> should fix the rest to conform to the XML spec (open schema.xml and
> config.xml as InputStream, not Reader, and others).
> This change would not break anything in Solr (perhaps only backwards
> compatibility in the API), as the default used by XML parsers is UTF-8.

-- 
This message is automatically generated by JIRA.
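
The difference between handing the parser an InputStream and handing it a Reader, as described in the issue, can be sketched in Java as follows. This is only a minimal illustration and not code from Solr: the class XmlCharsetDemo, its two methods and the sample document are hypothetical, and it assumes the JDK's built-in JAXP parser. The document is encoded in ISO-8859-1 and says so in its XML header. Given the raw bytes, the parser can honour that header; given a Reader that was already decoded as UTF-8, it can no longer correct the charset.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of Solr.
class XmlCharsetDemo {

    // Sample document (an assumption for illustration): the XML header
    // declares ISO-8859-1 and the bytes really are ISO-8859-1.
    static final byte[] XML =
        "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><doc>caf\u00e9</doc>"
            .getBytes(StandardCharsets.ISO_8859_1);

    // The parser receives raw bytes and detects the charset itself,
    // following the encoding named in the XML header.
    static String parseFromStream() throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(XML));
        return doc.getDocumentElement().getTextContent();
    }

    // The caller decodes the bytes up front with a guessed charset (UTF-8).
    // The parser only ever sees characters, so the encoding named in the
    // XML header cannot be applied any more.
    static String parseFromReader() throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Reader reader = new InputStreamReader(
            new ByteArrayInputStream(XML), StandardCharsets.UTF_8);
        Document doc = builder.parse(new InputSource(reader));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("InputStream: " + parseFromStream());
        System.out.println("Reader:      " + parseFromReader());
    }
}
```

With the InputStream variant the accented character should survive intact. With the Reader variant the byte for that character has already been decoded as UTF-8 before the parser sees it, so the text content is expected to differ from the original.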