Hi
I am trying to pinpoint a mismatch between the offsets produced by the Solr
indexing process and the original document content, which shows up when I use
those offsets to substring the original content. It seems that if the text
contains "\r" (the Windows carriage return), Solr silently removes it, so
"ok\r\nthis is the text\r\nand..." becomes "ok\nthis is the text\nand...",
and as a result the offsets created by Solr indexing no longer line up with
the original content.
I asked about this issue on the Lucene mailing list and was told that it is
likely Solr that causes this.
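The mismatch itself can be reproduced outside Solr. Here is a minimal
self-contained sketch (using the same example string; this only imitates the
normalization I observe in the debugger, it is not Solr's actual code) showing
how offsets computed against the \r-stripped text point at the wrong
characters in the original:

```java
// Demonstrates the offset drift: offsets computed on the normalized
// ("\r"-stripped) text do not line up with the original CRLF text.
public class OffsetMismatchDemo {
    public static void main(String[] args) {
        String original = "ok\r\nthis is the text\r\nand...";
        // What the tokenizer appears to see at indexing time:
        String normalized = original.replace("\r\n", "\n");

        // The token "text" starts at index 15 in the normalized copy.
        int start = normalized.indexOf("text");
        String fromNormalized = normalized.substring(start, start + 4);
        String fromOriginal = original.substring(start, start + 4);

        System.out.println(fromNormalized); // prints "text"
        System.out.println(fromOriginal);   // prints " tex" - shifted by the removed '\r'
    }
}
```

Each "\r\n" before a token shifts its apparent offset by one character, so the
drift grows with every line break that precedes the token.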
*To reproduce this issue, here is what I have done:*
1. Compile OpenNLPTokenizer.java and OpenNLPTokenizerFactory.java (in the
attachment), which I use to analyse a text field. OpenNLPTokenizer.java is
almost identical to the one at
https://issues.apache.org/jira/browse/LUCENE-6595 except that I adapted it to
Lucene 5.3.0. If you look at line 74 of OpenNLPTokenizer, it takes the
"input" variable (of type Reader) from its superclass Tokenizer and tokenizes
its content. At runtime, in the debugger, I can see that the string content
held by this variable already has the "\r" characters removed (details
below).
2. Configure solrconfig.xml and schema.xml to use the above tokenizer.
In solrconfig.xml, add something like the following and place the compiled
classes into the referenced folder:
<lib dir="${solr.install.dir:../../../..}/classes" regex=".*\.class" />
In schema.xml define a new field type:
<fieldType name="testFieldType" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<tokenizer
class="org.apache.lucene.analysis.opennlp.OpenNLPTokenizerFactory"
sentenceModel=".../your_path/en-sent.bin"
tokenizerModel=".../your_path/en-token.bin"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
Download "en-sent.bin" and "en-token.bin" from the links below, place them
somewhere, and then change the sentenceModel and tokenizerModel params above
to point to them:
http://opennlp.sourceforge.net/models-1.5/en-token.bin
http://opennlp.sourceforge.net/models-1.5/en-sent.bin
Then define a new field in the schema:
<field name="content" type="testFieldType" indexed="true" stored="false"
multiValued="false" termVectors="true" termPositions="true"
termOffsets="true"/>
3. Run the testing class TestIndexing.java (attachment) in debugging
mode; *you need to place a breakpoint on line 74 of OpenNLPTokenizer*.
*To see the problem, notice that:*
- Line 19 of TestIndexing.java passes the raw string "ok\r\nthis is the
text\r\nand..." to be added to the field "content", which is analyzed by the
"testFieldType" defined above and therefore triggers the OpenNLPTokenizer
class.
- When you hit line 74 of OpenNLPTokenizer, inspect the value of the
variable "input". It is instantiated as a *ReusableStringReader*, and its
value is now "ok\nthis is the text\nand..."; all "\r" characters have been
removed.
*In an attempt to solve the problem, I have learnt that:*
- (suggested by a Lucene developer) the ReusableStringReader I see is
caused by the way Solr sets the field contents (as a String). If the
StringReader no longer contains any \r, then it is Solr's fault.
- following the debugger, I pinpointed line 299 of DefaultIndexingChain,
shown below:
for (IndexableField field : docState.doc) {
fieldCount = processField(field, fieldGen, fieldCount);
}
Again during debugging, I can see that the field "content" is encapsulated
in an "IndexableField" object and its content already has the "\r"
characters removed. However, at this point I cannot trace further to find
out how such IndexableFields are created by Solr or Lucene...
Any thoughts on this would be much appreciated!
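In the meantime, as a possible client-side workaround, here is a sketch under
the assumption that the normalization is exactly "drop every '\r'" (which
matches what I see in the debugger): build a map from normalized offsets back
to original offsets, and apply it before substringing the original document.

```java
// Maps offsets produced against the "\r"-stripped text back to positions
// in the original CRLF document, so indexed offsets remain usable.
public class OffsetRemapDemo {
    /** Returns map where map[i] = position in the original text of normalized char i. */
    static int[] buildOffsetMap(String original) {
        int[] map = new int[original.length()];
        int n = 0;
        for (int i = 0; i < original.length(); i++) {
            if (original.charAt(i) == '\r') continue; // dropped by normalization
            map[n++] = i;
        }
        return java.util.Arrays.copyOf(map, n);
    }

    public static void main(String[] args) {
        String original = "ok\r\nthis is the text\r\nand...";
        int[] map = buildOffsetMap(original);
        // Offsets of the token "text" as computed on the normalized text:
        int normStart = 15, normEnd = 19;
        // Remap the half-open [start, end) range onto the original string.
        String token = original.substring(map[normStart], map[normEnd - 1] + 1);
        System.out.println(token); // prints "text"
    }
}
```

This avoids touching the analysis chain at all, at the cost of keeping one
int per character of each document around at query time.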
package org.apache.lucene.analysis.opennlp;
/**
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;
import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.util.Span;
import org.apache.commons.io.IOUtils;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeFactory;
/**
* Run OpenNLP SentenceDetector and Tokenizer.
* Must have Sentence and/or Tokenizer.
*/
public final class OpenNLPTokenizer extends Tokenizer {
private static final int DEFAULT_BUFFER_SIZE = 256;
private int finalOffset;
private final CharTermAttribute termAtt =
addAttribute(CharTermAttribute.class);
private final OffsetAttribute offsetAtt =
addAttribute(OffsetAttribute.class);
//
private Span[] sentences = null;
private Span[][] words = null;
private Span[] wordSet = null;
boolean first = true;
int indexSentence = 0;
int indexWord = 0;
private char[] fullText;
private SentenceDetector sentenceOp = null;
private opennlp.tools.tokenize.Tokenizer tokenizerOp = null;
public OpenNLPTokenizer(AttributeFactory factory, SentenceDetector
sentenceOp, opennlp.tools.tokenize.Tokenizer tokenizerOp) {
super(factory);
termAtt.resizeBuffer(DEFAULT_BUFFER_SIZE);
if (sentenceOp == null && tokenizerOp == null) {
throw new IllegalArgumentException(
"OpenNLPTokenizer: need one or both of Sentence Detector and Tokenizer");
}
this.sentenceOp = sentenceOp;
this.tokenizerOp = tokenizerOp;
}
// OpenNLP ops run all-at-once. Have to cache sentence and/or word spans
// and feed them out.
// Cache entire input buffer - don't know if this is the right implementation.
// Or if the CharTermAttribute can cache it across multiple increments?
@Override
public final boolean incrementToken() throws IOException {
if (first) {
loadAll();
restartAtBeginning();
first = false;
}
if (sentences.length == 0) {
first = true;
return false;
}
int sentenceOffset = sentences[indexSentence].getStart();
if (wordSet == null) {
wordSet = words[indexSentence];
}
clearAttributes();
while (indexSentence < sentences.length) {
while (indexWord == wordSet.length) {
indexSentence++;
if (indexSentence < sentences.length) {
wordSet = words[indexSentence];
indexWord = 0;
sentenceOffset = sentences[indexSentence].getStart();
} else {
first = true;
return false;
}
}
// set termAtt from private buffer
Span sentence = sentences[indexSentence];
Span word = wordSet[indexWord];
int spot = sentence.getStart() + word.getStart();
termAtt.setEmpty();
int termLength = word.getEnd() - word.getStart();
if (termAtt.buffer().length < termLength) {
termAtt.resizeBuffer(termLength);
}
termAtt.setLength(termLength);
char[] buffer = termAtt.buffer();
finalOffset = correctOffset(sentenceOffset + word.getEnd());
int start=correctOffset(word.getStart() + sentenceOffset);
offsetAtt.setOffset(start, finalOffset);
for (int i = 0; i < termLength; i++) {
buffer[i] = fullText[spot + i];
}
indexWord++;
return true;
}
first = true;
return false;
}
void restartAtBeginning() throws IOException {
indexSentence = 0;
indexWord = 0;
finalOffset = 0;
wordSet = null;
}
void loadAll() throws IOException {
fillBuffer();
detectSentences();
words = new Span[sentences.length][];
for (int i = 0; i < sentences.length; i++) {
splitWords(i);
}
}
void splitWords(int i) {
Span current = sentences[i];
String sentence = String.copyValueOf(fullText, current.getStart(),
current.getEnd() - current.getStart());
words[i] = tokenizerOp.tokenizePos(sentence);
}
// read all text, turn into sentences
void detectSentences() throws IOException {
sentences = sentenceOp.sentPosDetect(new String(fullText));
}
void fillBuffer() throws IOException {
fullText = IOUtils.toCharArray(input);
/*int offset = 0;
int size = 10000;
fullText = new char[size];
int length = input.read(fullText);
while(length == size) {
// fullText = IOUtils.toCharArray(input);
fullText = Arrays.copyOf(fullText, offset + size);
offset += size;
length = input.read(fullText, offset, size);
}
fullText = Arrays.copyOf(fullText, offset + length);*/
}
@Override
public final void end() throws IOException {
super.end();
// set final offset
offsetAtt.setOffset(finalOffset, finalOffset);
}
// public void reset(Reader input) throws IOException {
// super.reset(input);
// fullText = null;
// sentences = null;
// words = null;
// first = true;
// }
@Override
public void reset() throws IOException {
super.reset();
clearAttributes();
restartAtBeginning();
}
}
package org.apache.lucene.analysis.opennlp;
import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import org.apache.commons.lang.exception.ExceptionUtils;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;
import java.io.File;
import java.util.Map;
/**
 * Factory for {@link OpenNLPTokenizer}.
 */
public class OpenNLPTokenizerFactory extends TokenizerFactory {
private final int maxTokenLength;
private SentenceDetector sentenceOp = null;
private opennlp.tools.tokenize.Tokenizer tokenizerOp = null;
/** Creates a new OpenNLPTokenizerFactory */
public OpenNLPTokenizerFactory(Map<String,String> args) {
super(args);
maxTokenLength = getInt(args, "maxTokenLength",
StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
String sentModel = args.get("sentenceModel");
String tokenizerModel = args.get("tokenizerModel");
try {
sentenceOp = new SentenceDetectorME(new SentenceModel(new File(sentModel)));
} catch (Exception e) {
StringBuilder msg = new StringBuilder("Required parameter invalid: ");
msg.append("sentenceModel=").append(sentModel).append("\n");
msg.append(ExceptionUtils.getFullStackTrace(e));
throw new IllegalArgumentException(msg.toString());
}
try {
tokenizerOp = new TokenizerME(new TokenizerModel(new File(tokenizerModel)));
} catch (Exception e) {
StringBuilder msg = new StringBuilder("Required parameter invalid: ");
msg.append("tokenizerModel=").append(tokenizerModel).append("\n");
msg.append(ExceptionUtils.getFullStackTrace(e));
throw new IllegalArgumentException(msg.toString());
}
}
@Override
public Tokenizer create(AttributeFactory factory) {
OpenNLPTokenizer tokenizer = new OpenNLPTokenizer(factory, sentenceOp,
tokenizerOp);
return tokenizer;
}
}
package mypackage;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;
import java.io.IOException;
import java.nio.file.Paths;
public class TestIndexing {
public static void main(String[] args) throws IOException,
SolrServerException {
SolrClient solrClient = new EmbeddedSolrServer(
Paths.get("D:\\solr-5.3.0_\\server\\solr"), "core1");
SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.addField("id", "01");
String realContent= "ok\r\nthis is the text\r\nand...";
solrDoc.addField("content", realContent);
solrClient.add(solrDoc);
solrClient.commit();
solrClient.close();
}
}