Re: Token implementation

DM Smith Fri, 11 Jul 2008 17:42:38 -0700

Michael McCandless wrote:

Maybe we should un-deprecate the termText() method but add javadocs explaining that for better performance you should use the char[] reuse methods instead?

I think so, too. Should we leave it as deprecated until 3.0? With the performance note and the encouragement to go for re-use, but also with a note that it is the current implementation that is deprecated, not the interface.

That's not quite what deprecated means. My thought on this is that it will give everyone a heads up that the current implementation is going away and that the replacement is sub-optimal.

(I use Eclipse and have it set to flag all deprecated uses. This helps me look for places to change.)


I think that this will make migration to 3.0 much easier.

With this, changing Term to add Term(String, Token) won't be necessary.

-- DM

Mike

DM Smith wrote:
Michael McCandless wrote:
DM Smith wrote:
Shouldn't Term have constructors that take a Token?
I think that makes sense, though normally Token appears during analysis and Term during searching (I think?) -- how often would you need to make a Term from a Token?
The problem I'm addressing is that tokens are used in contexts that need String and not char[].
The call to the deprecated
 String termText = token.termText();
needs to be replaced with:
 String termText = new String(token.termBuffer(), 0, token.termLength());
There are over 170 calls to token.termText(); each of these places has to be modified. In some, perhaps many, of these cases it may be possible to use char[] directly to get a performance gain.
In the case of Term, changing it to work with char[] buffer, int start, int length does not seem quite right. I think the ripple would keep getting bigger. But logically, the Term's text is the text of a Token.
To me it makes sense to have a method that returns the token as a String, but that method is deprecated and the suggested replacement is to directly use the buffer. So this leads to the above construct. Perhaps it would be good to add a new method and document that as one of two replacements.
 public String term() {
   // As a method on Token itself, read the token's own buffer and length.
   return termText != null ? termText : new String(termBuffer(), 0, termLength());
 }
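
The proposed accessor can be tried in isolation with a stand-in class. What follows is only a minimal sketch, not Lucene's Token: SketchToken, its fields, and its two setters are hypothetical names made up for illustration. It returns the cached String when one was set, and otherwise builds a String from the char[] buffer.

```java
// Hypothetical stand-in for a token; NOT Lucene's Token class.
class SketchToken {
    private String termText;                 // non-null only on the legacy String path
    private char[] termBuffer = new char[0]; // reusable character storage
    private int termLength;                  // number of valid chars in termBuffer

    // Legacy path: the term is supplied as a String.
    void setTermText(String text) {
        termText = text;
        termBuffer = text.toCharArray();
        termLength = termBuffer.length;
    }

    // Reuse path: the term lives in a char[] that the caller may refill later.
    void setTermBuffer(char[] buffer, int length) {
        termText = null;
        termBuffer = buffer;
        termLength = length;
    }

    char[] termBuffer() { return termBuffer; }

    int termLength() { return termLength; }

    // Sketch of the proposed accessor: prefer the cached String,
    // otherwise copy the valid part of the buffer into a new String.
    String term() {
        return termText != null ? termText : new String(termBuffer, 0, termLength);
    }
}

class SketchTokenDemo {
    public static void main(String[] args) {
        SketchToken t = new SketchToken();
        char[] buf = {'l', 'u', 'c', 'e', 'n', 'e', 'x', 'x'};
        t.setTermBuffer(buf, 6);        // only the first 6 chars are valid
        System.out.println(t.term());   // prints "lucene"
    }
}
```

Callers that need a String would call term(); callers that can work on char[] directly would keep using termBuffer() and termLength() and skip the allocation.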
Here is an example from QueryParser that has 5 instances, each calling the deprecated t.termText() method. The example constructs a query from a token stream.
Each of the problem lines follows the pattern:
 TermQuery currentQuery = new TermQuery(new Term(field, t.termText()));
To remove the deprecated call to t.termText(), the Token's buffer needs to be marshalled with something like:
 String termText = new String(token.termBuffer(), 0, token.termLength());
 TermQuery currentQuery = new TermQuery(new Term(field, termText));

/**
 * @exception ParseException throw in overridden method to disallow
 */
protected Query getFieldQuery(String field, String queryText) throws ParseException {
  // Use the analyzer to get all the tokens, and then build a TermQuery,
  // PhraseQuery, or nothing based on the term count
  TokenStream source = analyzer.tokenStream(field, new StringReader(queryText));
  Vector v = new Vector();
  org.apache.lucene.analysis.Token t;
  int positionCount = 0;
  boolean severalTokensAtSamePosition = false;

  while (true) {
    try {
      t = source.next();
    }
    catch (IOException e) {
      t = null;
    }
    if (t == null)
      break;
    v.addElement(t);
    if (t.getPositionIncrement() != 0)
      positionCount += t.getPositionIncrement();
    else
      severalTokensAtSamePosition = true;
  }
  try {
    source.close();
  }
  catch (IOException e) {
    // ignore
  }

  if (v.size() == 0)
    return null;
  else if (v.size() == 1) {
    t = (org.apache.lucene.analysis.Token) v.elementAt(0);
    return new TermQuery(new Term(field, t.termText()));
  } else {
    if (severalTokensAtSamePosition) {
      if (positionCount == 1) {
        // no phrase query:
        BooleanQuery q = new BooleanQuery(true);
        for (int i = 0; i < v.size(); i++) {
          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
          TermQuery currentQuery = new TermQuery(
              new Term(field, t.termText()));
          q.add(currentQuery, BooleanClause.Occur.SHOULD);
        }
        return q;
      }
      else {
        // phrase query:
        MultiPhraseQuery mpq = new MultiPhraseQuery();
        mpq.setSlop(phraseSlop);
        List multiTerms = new ArrayList();
        int position = -1;
        for (int i = 0; i < v.size(); i++) {
          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
          if (t.getPositionIncrement() > 0 && multiTerms.size() > 0) {
            if (enablePositionIncrements) {
              mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
            } else {
              mpq.add((Term[])multiTerms.toArray(new Term[0]));
            }
            multiTerms.clear();
          }
          position += t.getPositionIncrement();
          multiTerms.add(new Term(field, t.termText()));
        }
        if (enablePositionIncrements) {
          mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
        } else {
          mpq.add((Term[])multiTerms.toArray(new Term[0]));
        }
        return mpq;
      }
    }
    else {
      PhraseQuery pq = new PhraseQuery();
      pq.setSlop(phraseSlop);
      int position = -1;
      for (int i = 0; i < v.size(); i++) {
        t = (org.apache.lucene.analysis.Token) v.elementAt(i);
        if (enablePositionIncrements) {
          position += t.getPositionIncrement();
          pq.add(new Term(field, t.termText()),position);
        } else {
          pq.add(new Term(field, t.termText()));
        }
      }
      return pq;
    }
  }
}
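
Since all five call sites need the same conversion, it could be written once. The following is a minimal sketch, assuming a hypothetical helper named toTermText (not part of Lucene) that takes the buffer and length a token would report. It also shows why the copy matters under reuse: new String(char[], int, int) copies the characters, so the resulting String is unaffected when the shared buffer is overwritten for the next token.

```java
class TermTextSketch {
    // Hypothetical helper: copies the first `length` chars of `buffer`
    // into a new, independent String.
    static String toTermText(char[] buffer, int length) {
        return new String(buffer, 0, length);
    }

    public static void main(String[] args) {
        char[] shared = new char[16];          // one buffer reused for every token

        "first".getChars(0, 5, shared, 0);     // fill the buffer with token 1
        String a = toTermText(shared, 5);

        "second".getChars(0, 6, shared, 0);    // overwrite it with token 2
        String b = toTermText(shared, 6);

        // `a` was copied before the overwrite, so it still holds token 1.
        System.out.println(a + " " + b);       // prints "first second"
    }
}
```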


Here is an example that works around the deprecated code:
public void testShingleAnalyzerWrapperPhraseQuery() throws Exception {
  Analyzer analyzer = new ShingleAnalyzerWrapper(new WhitespaceAnalyzer(), 2);
  searcher = setUpSearcher(analyzer);

  PhraseQuery q = new PhraseQuery();

  TokenStream ts = analyzer.tokenStream("content",
      new StringReader("this sentence"));
  Token token;
  int j = -1;
  while ((token = ts.next()) != null) {
    j += token.getPositionIncrement();
    String termText = new String(token.termBuffer(), 0, token.termLength());
    q.add(new Term("content", termText), j);
  }

  Hits hits = searcher.search(q);
  int[] ranks = new int[] { 0 };
  compareRanks(hits, ranks);
}

-- DM

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]