Thanks, Uwe. Then I think we should at least wrap the InputStream in a BufferedInputStream in EnwikiDocMaker? Buffering is what I wanted to achieve in the first place by reusing LineDocMaker's BufferedReader.
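
Roughly what I have in mind, as a minimal standalone sketch rather than the actual EnwikiDocMaker code. The class name, the parseText helper and the 64 KB buffer size are made up for illustration. The point is that the bytes are buffered but never decoded before they reach the parser, so the parser can still honor the encoding declared in the XML header:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BufferedXmlSketch {

    /** Parses XML from a raw byte stream and returns the concatenated character data. */
    static String parseText(InputStream raw) throws Exception {
        final StringBuilder text = new StringBuilder();
        // Buffer the bytes only; decoding is left to the XML parser.
        // The buffer size is an arbitrary choice for this sketch.
        InputStream buffered = new BufferedInputStream(raw, 1 << 16);
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        // InputSource(InputStream) lets the parser detect the encoding itself.
        SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(buffered), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // A document whose declared encoding is not the platform default or UTF-8.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><doc>caf\u00e9</doc>";
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseText(new ByteArrayInputStream(bytes)));
    }
}
```

In the doc maker the ByteArrayInputStream would be replaced by the existing FileInputStream; the rest of the SAX setup there stays as it is.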

On Fri, Apr 10, 2009 at 10:22 AM, Uwe Schindler <u...@thetaphi.de> wrote:

> Hi Shai,
>
> with XML parsers you should generally avoid using Readers, unless you know
> exactly that the underlying XML encoding is really the one given to the
> Reader. Readers as parameters should only be used for sources that are
> invariant of the encoding (like Java Strings containing XML, and without
> encoding declaration!!!!).
>
> Good examples of correctly using a Reader are:
>
> - new InputSource(new StringReader("<tag>….</tag>")); // no xml declaration
> - An XML stream serialized from a SAX/DOM to a Writer itself (so it is
>   without encoding), e.g. stored in a Lucene Stored String.
>
> But documents from an unknown source should always be handled as byte
> streams. The XML parser must be able to switch the encoding according to
> the declaration it finds in the XML header; this is not possible with
> Readers.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> ------------------------------
>
> *From:* Shai Erera [mailto:ser...@gmail.com]
> *Sent:* Friday, April 10, 2009 8:47 AM
> *To:* java-dev@lucene.apache.org
> *Subject:* Benchmark: EnwikiDocMaker does not use fileIn (BufferedReader)
>
> I started working on the patch for 1591, and noticed EnwikiDocMaker uses
> the FileInputStream instance from LineDocMaker and not the BufferedReader. I
> don't see any reason for this, as InputSource accepts a Reader. I can change
> it as part of 1591, unless you think I'm missing something.