Re: Token implementation

DM Smith Fri, 11 Jul 2008 12:43:24 -0700

Michael McCandless wrote:

DM Smith wrote:
 Shouldn't Term have constructors that take a Token?
I think that makes sense, though normally Token appears during analysis and Term during searching (I think?) -- how often would you need to make a Term from a Token?

The problem I'm addressing is that tokens are used in contexts that need String and not char[].

The call to the deprecated
  String termText = token.termText();
needs to be replaced with:
  String termText = new String(token.termBuffer(), 0, token.termLength());
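
A minimal sketch of why the offset and length arguments matter in that replacement. SketchToken is a hypothetical stand-in that models only the two accessors used above; it is not Lucene's Token. It assumes the buffer can be larger than the term it holds, which is what makes the explicit length necessary.

```java
// Hypothetical stand-in for a Token; models only termBuffer() and termLength().
final class SketchToken {
    private final char[] buffer;
    private final int length;

    // 'capacity' simulates a reusable buffer that is larger than the term.
    SketchToken(String text, int capacity) {
        buffer = new char[Math.max(capacity, text.length())];
        text.getChars(0, text.length(), buffer, 0);
        length = text.length();
    }

    char[] termBuffer() { return buffer; }

    int termLength() { return length; }
}

final class TermTextDemo {
    // The replacement for the deprecated termText(): copy only the valid prefix.
    static String termText(SketchToken token) {
        return new String(token.termBuffer(), 0, token.termLength());
    }

    public static void main(String[] args) {
        SketchToken token = new SketchToken("lucene", 16);
        System.out.println(termText(token));                          // prints "lucene"
        System.out.println(new String(token.termBuffer()).length());  // prints 16
    }
}
```

Converting the whole buffer would carry the unused tail along with the term, so each of the replaced call sites has to pass both the start and the length.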

There are over 170 calls to token.termText(), and each of these places has to be modified. In some, perhaps many, of these cases it may be possible to use char[] directly to get a performance gain.

In the case of Term, changing it to work with char[] buffer, int start, int length does not seem quite right. I think the ripple would keep getting bigger. But logically, the Term's text is the text of a Token.
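
To make the idea concrete, here is a rough sketch of a Term-like class with a convenience constructor that takes a token. Both CharToken and SketchTerm are hypothetical stand-ins written for illustration; they are not the Lucene classes, and the extra constructor is only the shape being discussed, not an existing API.

```java
// Hypothetical stand-in for a Token backed by a char[] buffer.
final class CharToken {
    private final char[] buffer;
    private final int length;

    CharToken(char[] buffer, int length) {
        this.buffer = buffer;
        this.length = length;
    }

    char[] termBuffer() { return buffer; }

    int termLength() { return length; }
}

// Hypothetical stand-in for Term: a field name plus the term text.
final class SketchTerm {
    private final String field;
    private final String text;

    SketchTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }

    // Convenience constructor: the buffer-to-String conversion lives in one
    // place, so the class keeps storing a String internally.
    SketchTerm(String field, CharToken token) {
        this(field, new String(token.termBuffer(), 0, token.termLength()));
    }

    String field() { return field; }

    String text() { return text; }
}
```

With something like this, a call site would read new SketchTerm(field, token), and the char[] handling would not ripple into the rest of the class.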

To me it makes sense to have a method that returns the token as a String, but that method is deprecated and the suggested replacement is to directly use the buffer. So this leads to the above construct. Perhaps it would be good to add a new method and document that as one of two replacements.

public String term() {
  return termText != null ? termText : new String(termBuffer(), 0, termLength());
}
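
A self-contained sketch of how such a method could behave. DualToken is a hypothetical class that assumes a token holds either a legacy String or a char[] buffer with a length; the field names and factory methods are made up for illustration.

```java
// Hypothetical token that stores its text either as a String or as a buffer.
final class DualToken {
    private final String termText;   // legacy String form, null when the buffer is used
    private final char[] termBuffer; // buffer form, null when the String is used
    private final int termLength;    // number of valid chars in termBuffer

    private DualToken(String termText, char[] termBuffer, int termLength) {
        this.termText = termText;
        this.termBuffer = termBuffer;
        this.termLength = termLength;
    }

    static DualToken ofText(String text) {
        return new DualToken(text, null, 0);
    }

    static DualToken ofBuffer(char[] buffer, int length) {
        return new DualToken(null, buffer, length);
    }

    // Return the existing String if there is one, otherwise build it from
    // the valid prefix of the buffer.
    String term() {
        return termText != null ? termText : new String(termBuffer, 0, termLength);
    }
}
```

Callers that need a String would use term(), and callers that can work on char[] directly would keep using the buffer accessors.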

Here is an example from QueryParser that has 5 instances, each calling the deprecated t.termText() method. In this example, there is the construction of a query from a token stream.

Each of the problem lines is of the pattern:
  TermQuery currentQuery = new TermQuery(new Term(field, t.termText()));

To remove the deprecated call to t.termText(), the Token's buffer needs to be marshalled with something like:

  String termText = new String(token.termBuffer(), 0, token.termLength());
  TermQuery currentQuery = new TermQuery(new Term(field, termText));

 /**
  * @exception ParseException throw in overridden method to disallow
  */

 protected Query getFieldQuery(String field, String queryText) throws ParseException {

   // Use the analyzer to get all the tokens, and then build a TermQuery,
   // PhraseQuery, or nothing based on the term count

   TokenStream source = analyzer.tokenStream(field, new StringReader(queryText));

   Vector v = new Vector();
   org.apache.lucene.analysis.Token t;
   int positionCount = 0;
   boolean severalTokensAtSamePosition = false;

   while (true) {
     try {
       t = source.next();
     }
     catch (IOException e) {
       t = null;
     }
     if (t == null)
       break;
     v.addElement(t);
     if (t.getPositionIncrement() != 0)
       positionCount += t.getPositionIncrement();
     else
       severalTokensAtSamePosition = true;
   }
   try {
     source.close();
   }
   catch (IOException e) {
     // ignore
   }

   if (v.size() == 0)
     return null;
   else if (v.size() == 1) {
     t = (org.apache.lucene.analysis.Token) v.elementAt(0);
     return new TermQuery(new Term(field, t.termText()));
   } else {
     if (severalTokensAtSamePosition) {
       if (positionCount == 1) {
         // no phrase query:
         BooleanQuery q = new BooleanQuery(true);
         for (int i = 0; i < v.size(); i++) {
           t = (org.apache.lucene.analysis.Token) v.elementAt(i);
           TermQuery currentQuery = new TermQuery(
               new Term(field, t.termText()));
           q.add(currentQuery, BooleanClause.Occur.SHOULD);
         }
         return q;
       }
       else {
         // phrase query:
         MultiPhraseQuery mpq = new MultiPhraseQuery();
         mpq.setSlop(phraseSlop);
         List multiTerms = new ArrayList();
         int position = -1;
         for (int i = 0; i < v.size(); i++) {
           t = (org.apache.lucene.analysis.Token) v.elementAt(i);
           if (t.getPositionIncrement() > 0 && multiTerms.size() > 0) {
             if (enablePositionIncrements) {
               mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
             } else {
               mpq.add((Term[])multiTerms.toArray(new Term[0]));
             }
             multiTerms.clear();
           }
           position += t.getPositionIncrement();
           multiTerms.add(new Term(field, t.termText()));
         }
         if (enablePositionIncrements) {
           mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
         } else {
           mpq.add((Term[])multiTerms.toArray(new Term[0]));
         }
         return mpq;
       }
     }
     else {
       PhraseQuery pq = new PhraseQuery();
       pq.setSlop(phraseSlop);
       int position = -1;
       for (int i = 0; i < v.size(); i++) {
         t = (org.apache.lucene.analysis.Token) v.elementAt(i);
         if (enablePositionIncrements) {
           position += t.getPositionIncrement();
           pq.add(new Term(field, t.termText()),position);
         } else {
           pq.add(new Term(field, t.termText()));
         }
       }
       return pq;
     }
   }
 }


Here is an example that works around the deprecated code:
 public void testShingleAnalyzerWrapperPhraseQuery() throws Exception {

   Analyzer analyzer = new ShingleAnalyzerWrapper(new WhitespaceAnalyzer(), 2);

   searcher = setUpSearcher(analyzer);

   PhraseQuery q = new PhraseQuery();

   TokenStream ts = analyzer.tokenStream("content",
       new StringReader("this sentence"));

   Token token;
   int j = -1;
   while ((token = ts.next()) != null) {
     j += token.getPositionIncrement();

     String termText = new String(token.termBuffer(), 0, token.termLength());

     q.add(new Term("content", termText), j);
   }

   Hits hits = searcher.search(q);
   int[] ranks = new int[] { 0 };
   compareRanks(hits, ranks);
 }

-- DM
