Hi

I am trying to pinpoint a mismatch between the offsets produced by the Solr indexing process and the original document content, which shows up when I use those offsets to take substrings of the original text. It appears that if the content contains "\r" (the Windows carriage return), Solr silently removes it, so "ok\r\nthis is the text\r\nand..." becomes "ok\nthis is the text\nand...", and as a result the offsets created during indexing no longer line up with the original content.
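
Just to illustrate the mismatch concretely, here is a minimal sketch (not part of the attached code) of what happens when an offset computed against the \r-stripped text is applied back to the original string:

    String original     = "ok\r\nthis is the text\r\nand...";
    String whatSolrSees = original.replace("\r", "");

    // "text" starts at offset 15 in the \r-less string the analyzer sees
    int start = whatSolrSees.indexOf("text");                 // 15
    int end   = start + "text".length();                      // 19

    System.out.println(whatSolrSees.substring(start, end));   // "text"
    // against the original content the same offsets are shifted by the number
    // of '\r' characters removed before that position (here: 1)
    System.out.println(original.substring(start, end));       // " tex"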

I asked about this on the Lucene mailing list and was told that it is most likely Solr, not Lucene, that causes this.

*To reproduce this issue, here is what I have done:*

1. Compile OpenNLPTokenizer.java and OpenNLPTokenizerFactory.java (both attached), which I use to analyse a text field. OpenNLPTokenizer.java is almost identical to the one at https://issues.apache.org/jira/browse/LUCENE-6595, except that I adapted it to Lucene 5.3.0. If you look at line 74 of OpenNLPTokenizer, it takes the "input" variable (of type Reader) from its superclass Tokenizer and tokenizes its content. At runtime, in the debugger, I can see that the string content held by this variable has already had "\r" removed (details below).

2. Configure solrconfig.xml and schema.xml to use the above tokenizer.
In solrconfig.xml, add something like the following and place the compiled classes into that folder:
<lib dir="${solr.install.dir:../../../..}/classes" regex=".*\.class" />

In schema.xml define a new field type:
<fieldType name="testFieldType" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="org.apache.lucene.analysis.opennlp.OpenNLPTokenizerFactory"
                   sentenceModel=".../your_path/en-sent.bin"
                   tokenizerModel=".../your_path/en-token.bin"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
    </analyzer>
</fieldType>

Download "en-sent.bin" and "en-token.bin" from below and place it somewhere and then change the sentenceModel and tokenizerModel params above to point to them:
http://opennlp.sourceforge.net/models-1.5/en-token.bin
http://opennlp.sourceforge.net/models-1.5/en-sent.bin

Then define a new field in the schema:
<field name="content" type="testFieldType" indexed="true" stored="false" multiValued="false" termVectors="true" termPositions="true" termOffsets="true"/>

3. Run the testing class TestIndexing.java (attached) in debug mode; *you need to place a breakpoint on line 74 of OpenNLPTokenizer*.

*To see the problem, notice that:*
- Line 19 of TestIndexing.java passes the raw string "ok\r\nthis is the text\r\nand..." to the field "content", which is analysed by the "testFieldType" defined above, so it triggers the OpenNLPTokenizer class.
- When you stop at line 74 of OpenNLPTokenizer, inspect the value of the variable "input". It is instantiated as a *ReusableStringReader*, and its value is now "ok\nthis is the text\nand..."; every "\r" has been removed (the snippet below shows how this can be confirmed without the debugger).
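
As a quick programmatic check (a hypothetical addition to fillBuffer(), not in the attached file), the same thing can be confirmed without the debugger; for the document added by TestIndexing.java this should print false, matching what I see when inspecting "input":

    // hypothetical debugging aid inside OpenNLPTokenizer.fillBuffer(): after the
    // Reader has been drained into fullText, check whether any '\r' survived
    void fillBuffer() throws IOException {
        fullText = IOUtils.toCharArray(input);
        boolean hasCarriageReturn = new String(fullText).indexOf('\r') >= 0;
        System.out.println("tokenizer sees \\r: " + hasCarriageReturn);
    }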


*In an attempt to solve the problem, I have learnt that:*
- (suggested by a Lucene developer) the ReusableStringReader I see comes from the way Solr sets the field contents (as a String). If that StringReader no longer contains \r, then it is Solr that removed it.
- Following the debugger, I pinpointed line 299 of DefaultIndexingChain, shown below:

      for (IndexableField field : docState.doc) {
        fieldCount = processField(field, fieldGen, fieldCount);
      }

And again during debugging, I can see that the field "content" is wrapped in an IndexableField object whose value already has the "\r" removed. However, at this point I cannot trace any further to find out where Solr (or Lucene) creates these IndexableFields...
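
For what it is worth, a small check (hypothetical, not in the attached TestIndexing.java) suggests the carriage returns are still intact on the client side just before solrClient.add() is called, so the stripping would have to happen somewhere between SolrJ handing the document over and the analyzer receiving the Reader:

    // hypothetical check placed in TestIndexing.java right before solrClient.add(solrDoc)
    String valueInDoc = (String) solrDoc.getFieldValue("content");
    System.out.println("client still has \\r: " + valueInDoc.contains("\r"));  // expected: true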


Any thoughts on this would be much appreciated!

package org.apache.lucene.analysis.opennlp;

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;

import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.util.Span;

import org.apache.commons.io.IOUtils;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeFactory;

/**
 * Run OpenNLP SentenceDetector and Tokenizer.
 * Must have Sentence and/or Tokenizer.
 */
public final class OpenNLPTokenizer extends Tokenizer {
    private static final int DEFAULT_BUFFER_SIZE = 256;

    private int finalOffset;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

    //
    private Span[] sentences = null;
    private Span[][] words = null;
    private Span[] wordSet = null;
    boolean first = true;
    int indexSentence = 0;
    int indexWord = 0;
    private char[] fullText;

    private SentenceDetector sentenceOp = null;
    private opennlp.tools.tokenize.Tokenizer tokenizerOp = null;

    public OpenNLPTokenizer(AttributeFactory factory, SentenceDetector sentenceOp,
                            opennlp.tools.tokenize.Tokenizer tokenizerOp) {
        super(factory);
        termAtt.resizeBuffer(DEFAULT_BUFFER_SIZE);
        if (sentenceOp == null && tokenizerOp == null) {
            throw new IllegalArgumentException(
                    "OpenNLPTokenizer: need one or both of Sentence Detector and Tokenizer");
        }
        this.sentenceOp = sentenceOp;
        this.tokenizerOp = tokenizerOp;
    }

    // OpenNLP ops run all at once. Have to cache sentence and/or word spans and feed them out.
    // Cache the entire input buffer - don't know if this is the right implementation,
    // or if the CharTermAttribute can cache it across multiple increments.

    @Override
    public final boolean incrementToken() throws IOException {
        if (first) {
            loadAll();
            restartAtBeginning();
            first = false;
        }
        if (sentences.length == 0) {
            first = true;
            return false;
        }
        int sentenceOffset = sentences[indexSentence].getStart();
        if (wordSet == null) {
            wordSet = words[indexSentence];
        }
        clearAttributes();
        while (indexSentence < sentences.length) {
            while (indexWord == wordSet.length) {
                indexSentence++;
                if (indexSentence < sentences.length) {
                    wordSet = words[indexSentence];
                    indexWord = 0;
                    sentenceOffset = sentences[indexSentence].getStart();
                } else {
                    first = true;
                    return false;
                }
            }
            // set termAtt from private buffer
            Span sentence = sentences[indexSentence];
            Span word = wordSet[indexWord];
            int spot = sentence.getStart() + word.getStart();
            termAtt.setEmpty();
            int termLength = word.getEnd() - word.getStart();
            if (termAtt.buffer().length < termLength) {
                termAtt.resizeBuffer(termLength);
            }
            termAtt.setLength(termLength);
            char[] buffer = termAtt.buffer();

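            // NB: these offsets are relative to fullText, i.e. whatever was read from
            // "input"; if Solr has already stripped "\r" from that text, they will not
            // line up with the original document content (the mismatch described above)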
            finalOffset = correctOffset(sentenceOffset + word.getEnd());
            int start=correctOffset(word.getStart() + sentenceOffset);
            offsetAtt.setOffset(start, finalOffset);
            for (int i = 0; i < termLength; i++) {
                buffer[i] = fullText[spot + i];
            }

            indexWord++;
            return true;
        }
        first = true;
        return false;
    }

    void restartAtBeginning() throws IOException {
        indexWord = 0;
        indexSentence = 0;
        indexWord = 0;
        finalOffset = 0;
        wordSet = null;
    }

    void loadAll() throws IOException {
        fillBuffer();
        detectSentences();
        words = new Span[sentences.length][];
        for (int i = 0; i < sentences.length; i++) {
            splitWords(i);
        }
    }

    void splitWords(int i) {
        Span current = sentences[i];
        String sentence = String.copyValueOf(fullText, current.getStart(),
                current.getEnd() - current.getStart());
        words[i] = tokenizerOp.tokenizePos(sentence);
    }

    // read all text, turn into sentences
    void detectSentences() throws IOException {
        fullText.hashCode();
        sentences = sentenceOp.sentPosDetect(new String(fullText));
    }

    void fillBuffer() throws IOException {
        fullText = IOUtils.toCharArray(input);
        /*int offset = 0;
        int size = 10000;
        fullText = new char[size];
        int length = input.read(fullText);
        while(length == size) {
//    fullText = IOUtils.toCharArray(input);
            fullText = Arrays.copyOf(fullText, offset + size);
            offset += size;
            length = input.read(fullText, offset, size);
        }
        fullText = Arrays.copyOf(fullText, offset + length);*/
    }

    @Override
    public final void end() throws IOException {
        super.end(); // required by the TokenStream contract
        // set final offset
        offsetAtt.setOffset(finalOffset, finalOffset);
    }

//  public void reset(Reader input) throws IOException {
//    super.reset(input);
//    fullText = null;
//    sentences = null;
//    words = null;
//    first = true;
//  }

    @Override
    public void reset() throws IOException {
        super.reset();
        clearAttributes();
        restartAtBeginning();
    }
}
package org.apache.lucene.analysis.opennlp;

import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import org.apache.commons.lang.exception.ExceptionUtils;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;

import java.io.File;
import java.util.Map;

/** Factory for {@link OpenNLPTokenizer}. */
public class OpenNLPTokenizerFactory extends TokenizerFactory {
    private final int maxTokenLength;
    private SentenceDetector sentenceOp = null;
    private opennlp.tools.tokenize.Tokenizer tokenizerOp = null;

    /** Creates a new OpenNLPTokenizerFactory */
    public OpenNLPTokenizerFactory(Map<String,String> args) {
        super(args);
        maxTokenLength = getInt(args, "maxTokenLength",
                StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
        String sentModel = args.get("sentenceModel");
        String tokenizerModel = args.get("tokenizerModel");
        try {
            sentenceOp = new SentenceDetectorME(new SentenceModel(new File(sentModel)));
        } catch (Exception e) {
            StringBuilder msg = new StringBuilder("Required parameter invalid:");
            msg.append("sentenceModel=").append(sentModel).append("\n");
            msg.append(ExceptionUtils.getFullStackTrace(e));
            throw new IllegalArgumentException(msg.toString());
        }
        try {
            tokenizerOp = new TokenizerME(new TokenizerModel(new File(tokenizerModel)));
        } catch (Exception e) {
            StringBuilder msg = new StringBuilder("Required parameter invalid:");
            msg.append("tokenizerModel=").append(tokenizerModel).append("\n");
            msg.append(ExceptionUtils.getFullStackTrace(e));
            throw new IllegalArgumentException(msg.toString());
        }
    }

    @Override
    public Tokenizer create(AttributeFactory factory) {
        OpenNLPTokenizer tokenizer = new OpenNLPTokenizer(factory, sentenceOp, tokenizerOp);
        return tokenizer;
    }
}
package mypackage;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;

import java.io.IOException;
import java.nio.file.Paths;


public class TestIndexing {
    public static void main(String[] args) throws IOException, SolrServerException {
        SolrClient solrClient = new EmbeddedSolrServer(
                Paths.get("D:\\solr-5.3.0_\\server\\solr"), "core1");
        SolrInputDocument solrDoc = new SolrInputDocument();
        solrDoc.addField("id", "01");
        String realContent = "ok\r\nthis is the text\r\nand...";
        solrDoc.addField("content", realContent);
        solrClient.add(solrDoc);
        solrClient.commit();
        solrClient.close();
    }
}
