Re: Token implementation

DM Smith Fri, 11 Jul 2008 20:13:46 -0700


On Jul 11, 2008, at 9:42 PM, Hiroaki Kawai wrote:

Another suggestion from me:
How about making the Token object a singleton?


Would that work for a multi-threaded application?

Maybe we should un-deprecate the termText() method but add javadocs
explaining that for better performance you should use the char[] reuse
methods instead?

Mike

DM Smith wrote:

Michael McCandless wrote:


DM Smith wrote:

Shouldn't Term have constructors that take a Token?


I think that makes sense, though normally Token appears during
analysis and Term during searching (I think?) -- how often would
you need to make a Term from a Token?

The problem I'm addressing is that tokens are used in contexts that
need String and not char[].
The call to the deprecated
String termText = token.termText();
needs to be replaced with:
String termText = new String(token.termBuffer(), 0,
token.termLength());

There are over 170 calls to token.termText(), and each of these places
has to be modified. In some, perhaps many, of these cases it may be
possible to use char[] directly to get a performance gain.
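
As a rough illustration of what using char[] directly could look like, the sketch below checks a token against a known word in two ways: once by building a String, and once by reading the buffer in place. BufferToken is a hypothetical stand-in that only mimics the termBuffer()/termLength() pair; it is not Lucene's Token, and the helper names matchesViaString and matchesViaBuffer are invented for this example.

```java
// Hypothetical stand-in for a token backed by a reusable char[] buffer.
// It only mimics termBuffer()/termLength(); it is NOT Lucene's Token class.
final class BufferToken {
    private char[] buffer = new char[16];
    private int length;

    // Copy the term into the internal buffer, growing it only when needed.
    void setTermBuffer(char[] src, int offset, int len) {
        if (len > buffer.length) {
            buffer = new char[len];
        }
        System.arraycopy(src, offset, buffer, 0, len);
        length = len;
    }

    char[] termBuffer() { return buffer; }

    int termLength() { return length; }
}

final class BufferTokenDemo {
    // String-based check: allocates a new String for every token.
    static boolean matchesViaString(BufferToken t, String word) {
        String termText = new String(t.termBuffer(), 0, t.termLength());
        return termText.equals(word);
    }

    // char[]-based check: reads the buffer in place, with no allocation.
    static boolean matchesViaBuffer(BufferToken t, char[] word) {
        if (t.termLength() != word.length) {
            return false;
        }
        char[] buf = t.termBuffer();
        for (int i = 0; i < word.length; i++) {
            if (buf[i] != word[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BufferToken t = new BufferToken();
        char[] text = "the quick".toCharArray();
        t.setTermBuffer(text, 0, 3); // the token is "the"
        System.out.println(matchesViaString(t, "the"));
        System.out.println(matchesViaBuffer(t, "the".toCharArray()));
    }
}
```

Note that only the first termLength() characters of the buffer are valid, which is why both checks take the length into account.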

In the case of Term, changing it to work with (char[] buffer, int
start, int length) does not seem quite right. I think the ripple
would keep getting bigger. But logically, the Term's text is the
text of a Token.
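
To make the idea of a Token-taking constructor concrete, here is a minimal sketch. SketchToken and SketchTerm are hypothetical stand-ins, not Lucene's Token and Term, and the (field, token) constructor is an assumption about what such an API might look like rather than anything that exists today.

```java
// Hypothetical stand-ins; these are NOT Lucene's Token or Term classes.
final class SketchToken {
    private final char[] buffer;
    private final int length;

    SketchToken(String text) {
        // Over-allocate on purpose: the buffer may be longer than the term.
        buffer = new char[text.length() + 8];
        text.getChars(0, text.length(), buffer, 0);
        length = text.length();
    }

    char[] termBuffer() { return buffer; }

    int termLength() { return length; }
}

final class SketchTerm {
    private final String field;
    private final String text;

    SketchTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }

    // Assumed convenience constructor: the String conversion happens
    // here, once, instead of at every call site.
    SketchTerm(String field, SketchToken token) {
        this(field, new String(token.termBuffer(), 0, token.termLength()));
    }

    String field() { return field; }

    String text() { return text; }
}

final class SketchTermDemo {
    public static void main(String[] args) {
        SketchTerm term = new SketchTerm("content", new SketchToken("sentence"));
        System.out.println(term.field() + ":" + term.text());
    }
}
```

With something like this, Term keeps its String-based internals and the char[] details stay inside the one constructor.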

To me it makes sense to have a method that returns the token as a
String, but that method is deprecated and the suggested replacement
is to directly use the buffer. So this leads to the above construct.
Perhaps it would be good to add a new method and document that as
one of two replacements.
public String term() {
  return termText != null ? termText
      : new String(termBuffer(), 0, termLength());
}
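
For reference, a self-contained sketch of how the proposed term() method could behave. CachingToken is a hypothetical stand-in, not Lucene's Token; the assumption here is that a token holds either a String or a char[] buffer, depending on how it was last set.

```java
// Hypothetical stand-in for a token that may hold either a String or a
// char[] buffer. It is NOT Lucene's Token class.
final class CachingToken {
    private String termText;               // null when only the buffer is set
    private char[] termBuffer = new char[0];
    private int termLength;

    void setTermText(String text) {
        termText = text;
    }

    void setTermBuffer(char[] src, int offset, int len) {
        termBuffer = new char[len];
        System.arraycopy(src, offset, termBuffer, 0, len);
        termLength = len;
        termText = null;                   // the String view is now stale
    }

    // Return the existing String when there is one; otherwise build one
    // from the valid part of the buffer.
    String term() {
        return termText != null
            ? termText
            : new String(termBuffer, 0, termLength);
    }
}

final class CachingTokenDemo {
    public static void main(String[] args) {
        CachingToken t = new CachingToken();
        t.setTermText("lucene");
        System.out.println(t.term());      // reuses the stored String

        char[] text = "token stream".toCharArray();
        t.setTermBuffer(text, 0, 5);
        System.out.println(t.term());      // built from the buffer
    }
}
```

Callers that need a String would call term(), and callers that care about performance would keep using the buffer.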

Here is an example from QueryParser that has 5 instances, each
calling the deprecated t.termText() method. In this example, there
is the construction of a query from a token stream.
Each of the problem lines follows this pattern:
TermQuery currentQuery = new TermQuery(new Term(field,
t.termText()));

To remove the deprecated call to t.termText(), the Token's buffer
needs to be marshalled with something like:
String termText = new String(t.termBuffer(), 0,
t.termLength());
TermQuery currentQuery = new TermQuery(new Term(field, termText));

/**
* @exception ParseException throw in overridden method to disallow
*/
protected Query getFieldQuery(String field, String queryText)
throws ParseException {
 // Use the analyzer to get all the tokens, and then build a TermQuery,
 // PhraseQuery, or nothing based on the term count

 TokenStream source = analyzer.tokenStream(field,
     new StringReader(queryText));
 Vector v = new Vector();
 org.apache.lucene.analysis.Token t;
 int positionCount = 0;
 boolean severalTokensAtSamePosition = false;

 while (true) {
   try {
     t = source.next();
   }
   catch (IOException e) {
     t = null;
   }
   if (t == null)
     break;
   v.addElement(t);
   if (t.getPositionIncrement() != 0)
     positionCount += t.getPositionIncrement();
   else
     severalTokensAtSamePosition = true;
 }
 try {
   source.close();
 }
 catch (IOException e) {
   // ignore
 }

 if (v.size() == 0)
   return null;
 else if (v.size() == 1) {
   t = (org.apache.lucene.analysis.Token) v.elementAt(0);
   return new TermQuery(new Term(field, t.termText()));
 } else {
   if (severalTokensAtSamePosition) {
     if (positionCount == 1) {
       // no phrase query:
       BooleanQuery q = new BooleanQuery(true);
       for (int i = 0; i < v.size(); i++) {
         t = (org.apache.lucene.analysis.Token) v.elementAt(i);
         TermQuery currentQuery = new TermQuery(
             new Term(field, t.termText()));
         q.add(currentQuery, BooleanClause.Occur.SHOULD);
       }
       return q;
     }
     else {
       // phrase query:
       MultiPhraseQuery mpq = new MultiPhraseQuery();
       mpq.setSlop(phraseSlop);
       List multiTerms = new ArrayList();
       int position = -1;
       for (int i = 0; i < v.size(); i++) {
         t = (org.apache.lucene.analysis.Token) v.elementAt(i);

          if (t.getPositionIncrement() > 0 && multiTerms.size() > 0) {
           if (enablePositionIncrements) {
              mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
           } else {
             mpq.add((Term[])multiTerms.toArray(new Term[0]));
           }
           multiTerms.clear();
         }
         position += t.getPositionIncrement();
         multiTerms.add(new Term(field, t.termText()));
       }
       if (enablePositionIncrements) {
         mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
       } else {
         mpq.add((Term[])multiTerms.toArray(new Term[0]));
       }
       return mpq;
     }
   }
   else {
     PhraseQuery pq = new PhraseQuery();
     pq.setSlop(phraseSlop);
     int position = -1;
     for (int i = 0; i < v.size(); i++) {
       t = (org.apache.lucene.analysis.Token) v.elementAt(i);
       if (enablePositionIncrements) {
         position += t.getPositionIncrement();
         pq.add(new Term(field, t.termText()),position);
       } else {
         pq.add(new Term(field, t.termText()));
       }
     }
     return pq;
   }
 }
}


Here is an example that works around the deprecated code:

public void testShingleAnalyzerWrapperPhraseQuery() throws Exception {

 Analyzer analyzer = new ShingleAnalyzerWrapper(new
WhitespaceAnalyzer(), 2);
 searcher = setUpSearcher(analyzer);

 PhraseQuery q = new PhraseQuery();

 TokenStream ts = analyzer.tokenStream("content",
                                       new StringReader("this sentence"));
 Token token;
 int j = -1;
 while ((token = ts.next()) != null) {
   j += token.getPositionIncrement();
   String termText = new String(token.termBuffer(), 0,
token.termLength());
   q.add(new Term("content", termText), j);
 }

 Hits hits = searcher.search(q);
 int[] ranks = new int[] { 0 };
 compareRanks(hits, ranks);
}

-- DM

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


