[ 
https://issues.apache.org/jira/browse/SOLR-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505056#comment-13505056
 ] 

Uwe Schindler commented on SOLR-2347:
-------------------------------------

It is not only XML files. In general, the encoding of textual content should
be determined by the parser. E.g., if you write a DIH instance reading from a
network stream, the encoding might be defined by the protocol headers (e.g.
HTTP). In the case of XML it is defined by both the headers and the data
itself (the <?xml ...?> header). The data import handler should in this case
work with InputStreams, so the encoding can be determined later (e.g. when
reading unknown text files, ICU4J could autodetect the encoding from the
language, etc.). This would also make DIH fit better with TIKA processing.

My proposal is to let DIH take InputStreams and let the encoding be determined 
in a later stage of processing.
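
To illustrate the difference, here is a minimal sketch using only the JDK's
JAXP parser. It is not DIH code: the class EncodingDemo, its method names and
the sample document are hypothetical and only show the principle. The same
ISO-8859-1 bytes are parsed once as an InputStream, where the parser can read
the encoding from the <?xml ...?> header, and once through a Reader that was
already decoded with an assumed charset (UTF-8 here), where the parser no
longer has a chance to correct it.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class EncodingDemo {
    // Hypothetical sample: declares ISO-8859-1 and contains a u-umlaut (\u00fc).
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><doc>M\u00fcnchen</doc>";

    // The document as it would arrive on the wire: raw ISO-8859-1 bytes.
    static byte[] sampleBytes() {
        return XML.getBytes(StandardCharsets.ISO_8859_1);
    }

    // The parser sees raw bytes and picks the charset from the XML header.
    static String parseFromStream(byte[] data) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (InputStream in = new ByteArrayInputStream(data)) {
            return builder.parse(in).getDocumentElement().getTextContent();
        }
    }

    // The caller decodes up front (assumed UTF-8); the parser only sees chars
    // and cannot apply the charset named in the XML header any more.
    static String parseFromReader(byte[] data) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            return builder.parse(new InputSource(reader))
                          .getDocumentElement().getTextContent();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = sampleBytes();
        System.out.println("stream: " + parseFromStream(data));
        System.out.println("reader: " + parseFromReader(data));
    }
}
```

With the InputStream variant the text content comes back intact; with the
Reader variant the umlaut is already lost before the parser sees the data.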
                
> Use InputStream and not Reader for XML parsing
> ----------------------------------------------
>
>                 Key: SOLR-2347
>                 URL: https://issues.apache.org/jira/browse/SOLR-2347
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.1
>
>
> Followup to SOLR-96:
> Solr mostly uses java.io.Reader and passes this Reader to the XML parser.
> According to the XML spec, an XML file should initially be seen as a binary
> stream with a default charset of UTF-8 or another charset given by the
> network protocol (like the Content-Type header in HTTP). But, very
> importantly, this default charset is only a "hint" to the parser - what is
> mandatory is the charset from the XML header processing instruction. Because
> of this, the parser must be able to change the charset when reading the XML
> header (possibly also when seeing BOM markers). This is not possible if the
> XML parser gets a java.io.Reader instead of a java.io.InputStream. SOLR-96
> already fixed this for the XmlUpdateRequestHandler and the
> DocumentAnalysisRequestHandler. This issue should fix the rest to conform to
> the XML spec (open schema.xml and config.xml as InputStream, not Reader, and
> others).
> This change would not break anything in Solr (perhaps only backwards
> compatibility in the API), as the default used by XML parsers is UTF-8.

