[ 
https://issues.apache.org/jira/browse/SOLR-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990505#comment-12990505
 ] 

Uwe Schindler commented on SOLR-2347:
-------------------------------------

The same problem applies to DIH; this may be the cause of SOLR-2346!

In DIH, DataSource is always generified to <Reader>, so all input from files or 
URLs is decoded with a charset before it reaches the consumer. When you stream 
XML data through DIH, the XML parser is therefore also unable to use the 
encoding declared in the XML file itself. In my opinion, all DataSources should 
simply be generified to DataSource<ContentStream>, which would also simplify a 
lot of code and let the consumer choose between bytes and chars.
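
To illustrate the idea, here is a minimal sketch of a data source that hands out a byte-oriented handle and leaves the bytes-or-chars decision to the consumer. RawContent, RawDataSource and InMemoryDataSource are hypothetical names made up for this example; they are not DIH's or Solr's actual types, and the real ContentStream interface may look different.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a byte-oriented content handle (not Solr's API).
interface RawContent {
    // Raw bytes, e.g. for an XML parser that detects the encoding itself.
    InputStream openStream() throws IOException;

    // Characters, for consumers that know which charset applies.
    default Reader openReader(Charset charset) throws IOException {
        return new InputStreamReader(openStream(), charset);
    }
}

// Hypothetical data source that returns the handle instead of a Reader.
interface RawDataSource {
    RawContent getData(String query) throws IOException;
}

// Toy implementation backed by a byte array; the query is ignored.
class InMemoryDataSource implements RawDataSource {
    private final byte[] bytes;

    InMemoryDataSource(byte[] bytes) {
        this.bytes = bytes.clone();
    }

    @Override
    public RawContent getData(String query) {
        // Each call to openStream() yields a fresh stream over the same bytes.
        return () -> new ByteArrayInputStream(bytes);
    }
}
```

An XML consumer would call openStream() and pass the bytes to the parser, while a line-oriented consumer could call openReader(...) with the charset it expects.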

> Use InputStream and not Reader for XML parsing
> ----------------------------------------------
>
>                 Key: SOLR-2347
>                 URL: https://issues.apache.org/jira/browse/SOLR-2347
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>
> Followup to SOLR-96:
> Solr mostly uses java.io.Reader and passes this Reader to the XML parser. 
> According to the XML spec, an XML file should initially be treated as a binary 
> stream with a default charset of UTF-8 or another charset given by the network 
> protocol (like the Content-Type header in HTTP). Importantly, this default 
> charset is only a "hint" to the parser; the charset given in the XML 
> declaration at the top of the file is mandatory. Because of this, the parser 
> must be able to change the charset when reading the XML header (possibly also 
> when seeing BOM markers). This is not possible if the XML parser gets a 
> java.io.Reader instead of a java.io.InputStream. SOLR-96 already fixed this 
> for the XmlUpdateRequestHandler and the DocumentAnalysisRequestHandler. This 
> issue should make the rest conform to the XML spec (open schema.xml and 
> config.xml as InputStream rather than Reader, and others).
> This change would not break anything in Solr (perhaps only backwards 
> compatibility in the API), as the default used by XML parsers is UTF-8.
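
The difference described above can be reproduced with the JDK's own parser. The following is a minimal sketch, not Solr code: the class name XmlEncodingDemo and the sample document are made up for illustration. It parses the same ISO-8859-1 bytes once as an InputStream and once through a Reader that was (wrongly) opened as UTF-8. When an org.xml.sax.InputSource carries a character stream, the parser reads the characters directly and disregards the encoding declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XmlEncodingDemo {

    // Sample document (an assumption for this sketch): encoded as ISO-8859-1
    // and declaring that encoding in its XML header.
    static byte[] latin1Document() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><doc>caf\u00e9</doc>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // The parser sees raw bytes and can honor the declared encoding.
    static String parseFromStream(byte[] bytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            Document doc = builder.parse(new InputSource(in));
            return doc.getDocumentElement().getTextContent();
        }
    }

    // The bytes are decoded as UTF-8 before the parser ever sees them, so the
    // encoding declaration in the header can no longer have any effect.
    static String parseFromReader(byte[] bytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            Document doc = builder.parse(new InputSource(reader));
            return doc.getDocumentElement().getTextContent();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = latin1Document();
        System.out.println("from InputStream: " + parseFromStream(bytes));
        System.out.println("from Reader:      " + parseFromReader(bytes));
    }
}
```

With the InputStream the element text comes back intact, while the Reader variant returns text in which the non-ASCII character has already been damaged by the wrong charset.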

