Are you using MappingCharFilter?

It unfortunately has known bugs which require controversial API
changes to fix: https://issues.apache.org/jira/browse/LUCENE-6595

Mike McCandless

http://blog.mikemccandless.com

On Sat, Oct 3, 2015 at 6:02 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> Hi,
>
> Lucene does not remove the \r\n while indexing or storing fields. The
> Analyzer just splits the text, e.g. at whitespace (depending on the
> Analyzer). So if your original data has \r\n, the offsets will reflect
> that (it counts as 2 chars).
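
To make the offset arithmetic above concrete, here is a minimal sketch in
plain Java. It makes no Lucene calls: whitespaceOffsets is a hypothetical
stand-in for a whitespace-splitting tokenizer, and OffsetDemo is just an
illustrative class name. It only shows that offsets computed on the
\r\n text slice the original correctly, while offsets computed on a copy
with \r stripped are one char short per preceding line break.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetDemo {
    // Hypothetical stand-in for a whitespace tokenizer (not Lucene code):
    // returns {start, end} char offsets for each run of non-whitespace.
    static List<int[]> whitespaceOffsets(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = -1;
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary = i == text.length()
                    || Character.isWhitespace(text.charAt(i));
            if (!boundary && start < 0) {
                start = i;                       // a token begins here
            } else if (boundary && start >= 0) {
                spans.add(new int[] {start, i}); // a token ends here
                start = -1;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String original = "first line\r\nsecond line";
        String stripped = original.replace("\r", "");

        int[] onOriginal = whitespaceOffsets(original).get(2); // "second"
        int[] onStripped = whitespaceOffsets(stripped).get(2); // "second"

        // Start offsets differ by one: the \r before the token is counted
        // in the original but not in the stripped copy.
        System.out.println(onOriginal[0] + " vs " + onStripped[0]);

        // Offsets from the original text slice the original correctly.
        System.out.println(original.substring(onOriginal[0], onOriginal[1]));

        // Offsets from the stripped copy are shifted when applied to the
        // original: the slice starts at the \n and loses the last char.
        System.out.println(original.substring(onStripped[0], onStripped[1]));
    }
}
```

With more line breaks before a token the shift grows by one char per
\r\n, which is the kind of drift described in the original question.
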
>
> Could it be that you read it line by line using a BufferedReader and
> pass the lines on as Strings?
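
For anyone checking this possibility: BufferedReader.readLine() returns
each line without its terminator, so text rebuilt from those lines no
longer contains the \r. A minimal sketch follows, using only java.io;
ReadLineDemo, rejoinLines and readFully are illustrative names, not part
of Lucene or of the original poster's program.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReadLineDemo {
    // Lossy: readLine() drops "\r\n", "\n" and "\r" alike, so rejoining
    // with '\n' silently turns every "\r\n" into a single char.
    static String rejoinLines(Reader in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                if (!first) {
                    sb.append('\n');
                }
                sb.append(line);
                first = false;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Lossless: copies chars in blocks, so line terminators are untouched.
    static String readFully(Reader in) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "a\r\nb\r\nc";
        String rejoined = rejoinLines(new StringReader(original));
        String copied = readFully(new StringReader(original));

        // The rejoined text is shorter by one char per "\r\n".
        System.out.println(original.length() + " -> " + rejoined.length());
        System.out.println(copied.equals(original));
    }
}
```

If the content reaches the indexed field through something like
rejoinLines, the offsets will match that shorter text rather than the
file on disk.
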
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -----Original Message-----
>> From: Ziqi Zhang [mailto:ziqi.zh...@sheffield.ac.uk]
>> Sent: Saturday, October 03, 2015 5:01 PM
>> To: java-user@lucene.apache.org
>> Subject: lucene deliberately removes \r (windows carriage char)
>>
>> Hi
>>
>> I am trying to pinpoint a mismatch that appears when I use the offsets
>> produced by the Lucene indexing process to substring the original
>> document content.
>>
>> I tried to debug as far as I could, but I lost track of Lucene at
>> line 298 of DefaultIndexingChain (Lucene 5.3.0):
>>
>> for (IndexableField field : docState.doc) {
>>     fieldCount = processField(field, fieldGen, fieldCount);
>> }
>>
>> At this point I can see that the content field (one of the
>> IndexableFields) I am interested in already has every "\r" removed from
>> the "\r\n" (Windows) newline sequences in its content. But I am unable
>> to trace how these IndexableFields are generated, or how the raw content
>> is passed to them.
>>
>> I am certain that my program passed in strings containing lots of "\r\n".
>>
>> So the question is: is this (i.e., removing \r) deliberate?
>>
>> Thanks
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
