Hi
I am trying to pinpoint a mismatch between the offsets produced by the Solr
indexing process and the original document content, which shows up when I use
those offsets to substring the original content. It seems that if the text
contains "\r" (the Windows carriage return), Solr silently removes it, so
"ok\r\nthis is the text\r\nand..." becomes "ok\nthis is the text\nand...",
and as a result the offsets created by Solr indexing no longer line up with
the original content.
I asked about this issue on the Lucene mailing list and was told that it is
likely Solr that causes this.
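The mismatch itself can be reproduced outside Solr. Here is a minimal
self-contained sketch (using the same example string; this only imitates the
normalization I observe in the debugger, it is not Solr's actual code) showing
how offsets computed against the \r-stripped text point at the wrong
characters in the original:

```java
// Demonstrates the offset drift: offsets computed on the normalized
// ("\r"-stripped) text do not line up with the original CRLF text.
public class OffsetMismatchDemo {
    public static void main(String[] args) {
        String original = "ok\r\nthis is the text\r\nand...";
        // What the tokenizer appears to see at indexing time:
        String normalized = original.replace("\r\n", "\n");

        // The token "text" starts at index 15 in the normalized copy.
        int start = normalized.indexOf("text");
        String fromNormalized = normalized.substring(start, start + 4);
        String fromOriginal = original.substring(start, start + 4);

        System.out.println(fromNormalized); // prints "text"
        System.out.println(fromOriginal);   // prints " tex" - shifted by the removed '\r'
    }
}
```

Each "\r\n" before a token shifts its apparent offset by one character, so the
drift grows with every line break that precedes the token.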
*To reproduce this issue, here is what I have done:*
1. Compile OpenNLPTokenizer.java and OpenNLPTokenizerFactory.java (in the
attachment), which I use to analyse a text field. OpenNLPTokenizer.java is
almost identical to the one at
https://issues.apache.org/jira/browse/LUCENE-6595 except that I adapted it to
Lucene 5.3.0. If you look at line 74 of OpenNLPTokenizer, it takes the
"input" variable (of type Reader) from its superclass Tokenizer and tokenizes
its content. At runtime, in the debugger, I can see that the string content
held by this variable already has the "\r" characters removed (details
below).
2. Configure solrconfig.xml and schema.xml to use the above tokenizer.
In solrconfig.xml, add something like the following and place the compiled
classes into the referenced folder:
<lib dir="${solr.install.dir:../../../..}/classes" regex=".*\.class" />
In schema.xml define a new field type:
<fieldType name="testFieldType" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<tokenizer
class="org.apache.lucene.analysis.opennlp.OpenNLPTokenizerFactory"
sentenceModel=".../your_path/en-sent.bin"
tokenizerModel=".../your_path/en-token.bin"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
Download "en-sent.bin" and "en-token.bin" from the links below, place them
somewhere, and then change the sentenceModel and tokenizerModel params above
to point to them:
http://opennlp.sourceforge.net/models-1.5/en-token.bin
http://opennlp.sourceforge.net/models-1.5/en-sent.bin
Then define a new field in the schema:
<field name="content" type="testFieldType" indexed="true" stored="false"
multiValued="false" termVectors="true" termPositions="true"
termOffsets="true"/>
3. Run the testing class TestIndexing.java (attachment) in debugging
mode; *you need to place a breakpoint on line 74 of OpenNLPTokenizer*.
*To see the problem, notice that:*
- Line 19 of TestIndexing.java passes the raw string "ok\r\nthis is the
text\r\nand..." to be added to the field "content", which is analyzed by the
"testFieldType" defined above and therefore triggers the OpenNLPTokenizer
class.
- When you hit line 74 of OpenNLPTokenizer, inspect the value of the
variable "input". It is instantiated as a *ReusableStringReader*, and its
value is now "ok\nthis is the text\nand..."; all "\r" characters have been
removed.
*In an attempt to solve the problem, I have learnt that:*
- (suggested by a Lucene developer) the ReusableStringReader I see is
caused by the way Solr sets the field contents (as a String). If the
StringReader no longer contains any \r, then it is Solr's fault.
- following the debugger, I pinpointed line 299 of DefaultIndexingChain,
shown below:
for (IndexableField field : docState.doc) {
fieldCount = processField(field, fieldGen, fieldCount);
}
Again during debugging, I can see that the field "content" is encapsulated
in an "IndexableField" object and its content already has the "\r"
characters removed. However, at this point I cannot trace further to find
out how such IndexableFields are created by Solr or Lucene...
Any thoughts on this would be much appreciated!
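In the meantime, as a possible client-side workaround, here is a sketch under
the assumption that the normalization is exactly "drop every '\r'" (which
matches what I see in the debugger): build a map from normalized offsets back
to original offsets, and apply it before substringing the original document.

```java
// Maps offsets produced against the "\r"-stripped text back to positions
// in the original CRLF document, so indexed offsets remain usable.
public class OffsetRemapDemo {
    /** Returns map where map[i] = position in the original text of normalized char i. */
    static int[] buildOffsetMap(String original) {
        int[] map = new int[original.length()];
        int n = 0;
        for (int i = 0; i < original.length(); i++) {
            if (original.charAt(i) == '\r') continue; // dropped by normalization
            map[n++] = i;
        }
        return java.util.Arrays.copyOf(map, n);
    }

    public static void main(String[] args) {
        String original = "ok\r\nthis is the text\r\nand...";
        int[] map = buildOffsetMap(original);
        // Offsets of the token "text" as computed on the normalized text:
        int normStart = 15, normEnd = 19;
        // Remap the half-open [start, end) range onto the original string.
        String token = original.substring(map[normStart], map[normEnd - 1] + 1);
        System.out.println(token); // prints "text"
    }
}
```

This avoids touching the analysis chain at all, at the cost of keeping one
int per character of each document around at query time.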
package org.apache.lucene.analysis.opennlp;
/**
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;
import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.util.Span;
import org.apache.commons.io.IOUtils;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeFactory;
/**
* Run OpenNLP SentenceDetector and Tokenizer.
* Must have Sentence and/or Tokenizer.
*/
public final class OpenNLPTokenizer extends Tokenizer {
private static final int DEFAULT_BUFFER_SIZE = 256;
private int finalOffset;
private final CharTermAttribute termAtt =
addAttribute(CharTermAttribute.class);
private final OffsetAttribute offsetAtt =
addAttribute(OffsetAttribute.class);
//
private Span[] sentences = null;
private Span[][] words = null;
private Span[] wordSet = null;
boolean first = true;
int indexSentence = 0;
int indexWord = 0;
private char[] fullText;
private SentenceDetector sentenceOp = null;
private opennlp.tools.tokenize.Tokenizer tokenizerOp = null;
public OpenNLPTokenizer(AttributeFactory factory, SentenceDetector
sentenceOp, opennlp.tools.tokenize.Tokenizer tokenizerOp) {
super(factory);
termAtt.resizeBuffer(DEFAULT_BUFFER_SIZE);
if (sentenceOp == null && tokenizerOp == null) {
throw new IllegalArgumentException(
"OpenNLPTokenizer: need one or both of Sentence Detector and Tokenizer");
}
this.sentenceOp = sentenceOp;
this.tokenizerOp = tokenizerOp;
}
// OpenNLP ops run all-at-once. Have to cache sentence and/or word spans
// and feed them out.
// Cache entire input buffer - don't know if this is the right implementation.
// Or if the CharTermAttribute can cache it across multiple increments?
@Override
public final boolean incrementToken() throws IOException {
if (first) {
loadAll();
restartAtBeginning();
first = false;
}
if (sentences.length == 0) {
first = true;
return false;
}
int sentenceOffset = sentences[indexSentence].getStart();
if (wordSet == null) {
wordSet = words[indexSentence];
}
clearAttributes();
while (indexSentence < sentences.length) {
while (indexWord == wordSet.length) {
indexSentence++;
if (indexSentence < sentences.length) {
wordSet = words[indexSentence];
indexWord = 0;
sentenceOffset = sentences[indexSentence].getStart();
} else {
first = true;
return false;
}
}
// set termAtt from private buffer
Span sentence = sentences[indexSentence];
Span word = wordSet[indexWord];
int spot = sentence.getStart() + word.getStart();
termAtt.setEmpty();
int termLength = word.getEnd() - word.getStart();
if (termAtt.buffer().length < termLength) {
termAtt.resizeBuffer(termLength);
}
termAtt.setLength(termLength);
char[] buffer = termAtt.buffer();
finalOffset = correctOffset(sentenceOffset + word.getEnd());
int start=correctOffset(word.getStart() + sentenceOffset);
offsetAtt.setOffset(start, finalOffset);
for (int i = 0; i < termLength; i++) {
buffer[i] = fullText[spot + i];
}
indexWord++;
return true;
}
first = true;
return false;
}
void restartAtBeginning() throws IOException {
indexSentence = 0;
indexWord = 0;
finalOffset = 0;
wordSet = null;
}
void loadAll() throws IOException {
fillBuffer();
detectSentences();
words = new Span[sentences.length][];
for (int i = 0; i < sentences.length; i++) {
splitWords(i);
}
}
void splitWords(int i) {
Span current = sentences[i];
String sentence = String.copyValueOf(fullText, current.getStart(),
current.getEnd() - current.getStart());
words[i] = tokenizerOp.tokenizePos(sentence);
}
// read all text, turn into sentences
void detectSentences() throws IOException {
sentences = sentenceOp.sentPosDetect(new String(fullText));
}
void fillBuffer() throws IOException {
fullText = IOUtils.toCharArray(input);
/*int offset = 0;
int size = 10000;
fullText = new char[size];
int length = input.read(fullText);
while(length == size) {
// fullText = IOUtils.toCharArray(input);
fullText = Arrays.copyOf(fullText, offset + size);
offset += size;
length = input.read(fullText, offset, size);
}
fullText = Arrays.copyOf(fullText, offset + length);*/
}
@Override
public final void end() throws IOException {
super.end();
// set final offset
offsetAtt.setOffset(finalOffset, finalOffset);
}
// public void reset(Reader input) throws IOException {
// super.reset(input);
// fullText = null;
// sentences = null;
// words = null;
// first = true;
// }
@Override
public void reset() throws IOException {
super.reset();
clearAttributes();
restartAtBeginning();
}
}
package org.apache.lucene.analysis.opennlp;
import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import org.apache.commons.lang.exception.ExceptionUtils;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;
import java.io.File;
import java.util.Map;
/**
 * Factory for {@link OpenNLPTokenizer}.
 */
public class OpenNLPTokenizerFactory extends TokenizerFactory {
private final int maxTokenLength;
private SentenceDetector sentenceOp = null;
private opennlp.tools.tokenize.Tokenizer tokenizerOp = null;
/** Creates a new OpenNLPTokenizerFactory */
public OpenNLPTokenizerFactory(Map<String,String> args) {
super(args);
maxTokenLength = getInt(args, "maxTokenLength",
StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
String sentModel = args.get("sentenceModel");
String tokenizerModel = args.get("tokenizerModel");
try {
sentenceOp = new SentenceDetectorME(new SentenceModel(new File(sentModel)));
} catch (Exception e) {
StringBuilder msg = new StringBuilder("Required parameter invalid: ");
msg.append("sentenceModel=").append(sentModel).append("\n");
msg.append(ExceptionUtils.getFullStackTrace(e));
throw new IllegalArgumentException(msg.toString());
}
try {
tokenizerOp = new TokenizerME(new TokenizerModel(new File(tokenizerModel)));
} catch (Exception e) {
StringBuilder msg = new StringBuilder("Required parameter invalid: ");
msg.append("tokenizerModel=").append(tokenizerModel).append("\n");
msg.append(ExceptionUtils.getFullStackTrace(e));
throw new IllegalArgumentException(msg.toString());
}
}
@Override
public Tokenizer create(AttributeFactory factory) {
OpenNLPTokenizer tokenizer = new OpenNLPTokenizer(factory, sentenceOp,
tokenizerOp);
return tokenizer;
}
}
package mypackage;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;
import java.io.IOException;
import java.nio.file.Paths;
public class TestIndexing {
public static void main(String[] args) throws IOException,
SolrServerException {
SolrClient solrClient = new EmbeddedSolrServer(
Paths.get("D:\\solr-5.3.0_\\server\\solr"), "core1");
SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.addField("id", "01");
String realContent= "ok\r\nthis is the text\r\nand...";
solrDoc.addField("content", realContent);
solrClient.add(solrDoc);
solrClient.commit();
solrClient.close();
}
}