Re: lucene deliberately removes \r (windows carriage char)

Ziqi Zhang Sat, 03 Oct 2015 14:01:45 -0700

Well, this is very strange then. If I knew exactly where those "IndexableField" objects are constructed in the pipeline, I could possibly pin down the bug...

In any case, no, I did not use MappingCharFilter or a BufferedReader. The way I pass content for analysis is straightforward:

>>>
SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.addField("content", "ok\r\nhere is the text\r\n");
......

The schema for the field "content" to be analysed begins by taking the text content of the field for tokenization:

>>>
<analyzer type="index">
  <tokenizer class="org.apache.lucene.analysis.opennlp.OpenNLPTokenizerFactory"
             sentenceModel="en-sent.bin"
             tokenizerModel="en-token.bin"/>
.............

Here OpenNLPTokenizerFactory creates an OpenNLPTokenizer, which is identical to the code provided at

https://issues.apache.org/jira/browse/LUCENE-2899
except that I adapted it to Lucene 5.3.

Looking at the source code of OpenNLPTokenizer, I can see it is using the "input" variable (type Reader) of the superclass Tokenizer to get the text content to be analyzed. At runtime, through debugging, I see that "input" is instantiated as a "ReusableStringReader", and you can see the string value has already become "ok\nhere is the text\n".


Any other thoughts please?


On 03/10/2015 17:37, Michael McCandless wrote:

Are you using MappingCharFilter?

It unfortunately has known bugs which require controversial API
changes to fix: https://issues.apache.org/jira/browse/LUCENE-6595

Mike McCandless

http://blog.mikemccandless.com

On Sat, Oct 3, 2015 at 6:02 PM, Uwe Schindler <u...@thetaphi.de> wrote:

Hi,

Lucene does not remove the \r\n while indexing or storing fields. The Analyzer
just splits, e.g., at whitespace (depending on the Analyzer). So if your original
data has \r\n, then the offsets would reflect that (it counts as 2 chars).
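
To make the 2-chars-versus-1-char point concrete, here is a minimal sketch in plain Java (no Lucene involved). It assumes the text the tokenizer saw had every '\r' removed, as in the example string from this thread; the class name OffsetDriftDemo and the helper toOriginalOffset are hypothetical names made up for illustration, not part of any Lucene API.

```java
final class OffsetDriftDemo {
    /**
     * Hypothetical helper: maps an offset computed on a copy of the text with
     * every '\r' removed back to the matching offset in the original text.
     */
    static int toOriginalOffset(String original, int strippedOffset) {
        int seen = 0; // characters counted so far in the '\r'-free view
        for (int i = 0; i < original.length(); i++) {
            if (original.charAt(i) == '\r') {
                continue; // not present in the stripped view
            }
            if (seen == strippedOffset) {
                return i;
            }
            seen++;
        }
        return original.length(); // offset at or past the end of the stripped text
    }

    public static void main(String[] args) {
        String original = "ok\r\nhere is the text\r\n";
        String stripped = original.replace("\r", "");

        // Offsets of the token "here" as computed on the stripped text.
        int start = stripped.indexOf("here");
        int end = start + "here".length();

        // Applied directly to the original, the offsets are shifted by one,
        // because one '\r' precedes the token.
        System.out.println(original.substring(start, end).equals("here")); // false

        // After mapping back, the substring matches again.
        int s = toOriginalOffset(original, start);
        int e = toOriginalOffset(original, end);
        System.out.println(original.substring(s, e).equals("here")); // true
    }
}
```

The drift grows by one character for every '\r' that precedes a token, so offsets near the end of a long Windows-style document can be far off.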

Could it be that you read it line by line using a BufferedReader and pass the
lines on as Strings?
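
For reference, a small sketch of what that scenario would look like, using only the JDK. BufferedReader.readLine() returns each line without its terminator, so a loop that re-joins the lines with "\n" silently turns "\r\n" into "\n". The class and method names (LineEndingDemo, rejoinLines) are hypothetical, and the re-join with '\n' is an assumption about how such a loop is typically written.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class LineEndingDemo {
    /** Reads the text line by line and re-joins the lines with '\n'. */
    static String rejoinLines(String text) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            // readLine() strips the line terminator, whether "\n", "\r" or "\r\n".
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "ok\r\nhere is the text\r\n";
        String rejoined = rejoinLines(original);
        System.out.println(original.length()); // 22
        System.out.println(rejoined.length()); // 20
        System.out.println(rejoined.indexOf('\r')); // -1, no carriage returns left
    }
}
```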

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Ziqi Zhang [mailto:ziqi.zh...@sheffield.ac.uk]
Sent: Saturday, October 03, 2015 5:01 PM
To: java-user@lucene.apache.org
Subject: lucene deliberately removes \r (windows carriage char)

Hi

I am trying to pinpoint a mismatch in the offsets produced by the Lucene
indexing process when I use those offsets to take substrings of the original
document content.

I debugged as far as I could, but I lost track of Lucene at line 298
of DefaultIndexingChain (Lucene 5.3.0):

for (IndexableField field : docState.doc) {
  fieldCount = processField(field, fieldGen, fieldCount);
}

Basically, at this point I can see that the content field (one of the
IndexableFields) I am interested in has already had every "\r" removed from
the "\r\n" (Windows) newline sequences in its content. But I am unable to
trace how these IndexableFields are generated, or how the raw content is
passed to them.

I am certain that my program did pass strings with lots of "\r\n".

So the question is: is this (i.e., removing \r) deliberate?

Thanks

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




--
Ziqi Zhang
Research Associate
Department of Computer Science
University of Sheffield

