Are you using MappingCharFilter? It unfortunately has known bugs which require controversial API changes to fix: https://issues.apache.org/jira/browse/LUCENE-6595
Mike McCandless

http://blog.mikemccandless.com

On Sat, Oct 3, 2015 at 6:02 PM, Uwe Schindler <[email protected]> wrote:
> Hi,
>
> Lucene does not remove the \r\n while indexing or storing fields. The
> Analyzer just splits, e.g., at whitespace (depending on the Analyzer). So if
> your original data has \r\n, then the offsets would be according to that (it
> counts as 2 chars).
>
> Could it be that you read it using a BufferedReader per line and pass the
> lines on as Strings?
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
>
>
>> -----Original Message-----
>> From: Ziqi Zhang [mailto:[email protected]]
>> Sent: Saturday, October 03, 2015 5:01 PM
>> To: [email protected]
>> Subject: lucene deliberately removes \r (windows carriage char)
>>
>> Hi
>>
>> I am trying to pinpoint a mismatch between the offsets produced by the
>> Lucene indexing process and the original document content: when I use the
>> offsets to substring from the original content, they do not line up.
>>
>> I have debugged as far as I can, but I lose track of Lucene at line 298
>> of DefaultIndexingChain (Lucene 5.3.0):
>>
>> for (IndexableField field : docState.doc) {
>>   fieldCount = processField(field, fieldGen, fieldCount);
>> }
>>
>> At this point I can see that the content field (one of the
>> IndexableFields) I am interested in has already had every "\r" removed
>> from the "\r\n" newline sequences (Windows) in the content. But I am
>> unable to trace how these IndexableFields are generated, or how the raw
>> content is passed to them.
>>
>> I am certain that my program did pass strings with lots of "\r\n".
>>
>> So the question is: is this (i.e., removing \r) deliberate?
>>
>> Thanks
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
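
Uwe's BufferedReader question can be checked without Lucene at all. The following is a minimal sketch, assuming the content is read line by line and re-joined with "\n" before it is handed to the indexer; the class name LineEndingOffsets and the method readLinesJoined are hypothetical and only for illustration. It uses nothing but java.io and shows how offsets computed on the re-joined text drift from the original by one char per "\r\n".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class LineEndingOffsets {

    // Hypothetical helper: reads text line by line and re-joins it with "\n".
    // BufferedReader.readLine() strips the line terminator ("\r\n", "\n" or "\r"),
    // so every "\r\n" in the input becomes a single "\n" in the result.
    static String readLinesJoined(String text) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                if (!first) {
                    sb.append('\n');
                }
                sb.append(line);
                first = false;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String original = "first line\r\nsecond line\r\nthird";
        String rejoined = readLinesJoined(original);

        // Offset of the same token in each version of the text.
        int inOriginal = original.indexOf("third");
        int inRejoined = rejoined.indexOf("third");
        System.out.println("offset in original: " + inOriginal);
        System.out.println("offset in rejoined: " + inRejoined);

        // Applying the rejoined offset to the original text misses the token.
        String wrong = original.substring(inRejoined, inRejoined + "third".length());
        System.out.println("substring matches token: " + wrong.equals("third"));
    }
}
```

If the two offsets differ by the number of preceding line breaks, the "\r" was most likely dropped before the text reached Lucene, which would match Uwe's explanation rather than a deliberate removal during indexing.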
