Re: Token implementation

Hiroaki Kawai Fri, 11 Jul 2008 18:43:32 -0700

Another suggestion from me:
How about making the Token object a singleton?
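
One possible reading of this suggestion is a single mutable token instance that the stream refills for every term, so no per-term object is allocated. The sketch below is only an illustration under that assumption: ReusableToken, SingleTokenStream and ReusableTokenDemo are hypothetical stand-ins, not Lucene classes, and the termBuffer()/termLength()/term() names merely mirror the methods discussed in the quoted thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a token: one mutable instance, refilled per term.
final class ReusableToken {
    private char[] buffer = new char[16];
    private int length;

    // Overwrites the previous term text in place instead of allocating a new token.
    void setTerm(String text) {
        if (text.length() > buffer.length) {
            buffer = new char[text.length()];
        }
        text.getChars(0, text.length(), buffer, 0);
        length = text.length();
    }

    char[] termBuffer() { return buffer; }

    int termLength() { return length; }

    // Materializes a String copy; callers that keep the text must call this
    // before the next token is produced.
    String term() { return new String(buffer, 0, length); }
}

// Hypothetical whitespace tokenizer that always hands back the same token instance.
final class SingleTokenStream {
    private final ReusableToken shared = new ReusableToken();
    private final String[] words;
    private int pos;

    SingleTokenStream(String text) {
        String trimmed = text.trim();
        words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Returns the shared token filled with the next word, or null at the end.
    ReusableToken next() {
        if (pos >= words.length) {
            return null;
        }
        shared.setTerm(words[pos++]);
        return shared;
    }
}

final class ReusableTokenDemo {
    // Collects term texts; copying via term() is needed because the token is reused.
    static List<String> collectTerms(String text) {
        SingleTokenStream ts = new SingleTokenStream(text);
        List<String> terms = new ArrayList<>();
        for (ReusableToken t = ts.next(); t != null; t = ts.next()) {
            terms.add(t.term());
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(collectTerms("this sentence")); // prints [this, sentence]
    }
}
```

In this sketch, code that stores the token reference itself (for example in a Vector, as the quoted getFieldQuery does) would end up with every entry pointing at the same object, so such callers would have to copy the text out first.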


> Maybe we should un-deprecate the termText() method but add javadocs  
> explaining that for better performance you should use the char[] reuse  
> methods instead?
> 
> Mike
> 
> DM Smith wrote:
> 
> > Michael McCandless wrote:
> >>
> >> DM Smith wrote:
> >>
> >>> Shouldn't Term have constructors that take a Token?
> >>
> >> I think that makes sense, though normally Token appears during  
> >> analysis and Term during searching (I think?) -- how often would  
> >> you need to make a Term from a Token?
> >>
> > The problem I'm addressing is that tokens are used in contexts that  
> > need String and not char[].
> > The call to the deprecated
> >  String termText = token.termText();
> > needs to be replaced with:
> >  String termText = new String(token.termBuffer(), 0, token.termLength());
> >
> > There are over 170 calls to token.termText(), and each of them has
> > to be modified. In some, perhaps many, of these cases it may be
> > possible to use char[] directly to get a performance gain.
> >
> > Changing Term to work with char[] buffer, int start, int length
> > does not seem quite right; I think the ripple would keep getting
> > bigger. But logically, the Term's text is the text of a Token.
> >
> > To me it makes sense to have a method that returns the token as a  
> > String, but that method is deprecated and the suggested replacement  
> > is to directly use the buffer. So this leads to the above construct.  
> > Perhaps it would be good to add a new method and document that as  
> > one of two replacements.
> > public String term() {
> >   // Inside Token itself, so the buffer accessors are called unqualified
> >   return termText != null ? termText
> >       : new String(termBuffer(), 0, termLength());
> > }
> >
> > Here is an example from QueryParser that has 5 instances, each  
> > calling the deprecated t.termText() method. In this example, there  
> > is the construction of a query from a token stream.
> > Each of the problem lines follows the pattern:
> >  TermQuery currentQuery = new TermQuery(new Term(field, t.termText()));
> >
> > To remove the deprecated call to t.termText(), the Token's buffer  
> > needs to be marshalled with something like:
> >  String termText = new String(token.termBuffer(), 0, token.termLength());
> >  TermQuery currentQuery = new TermQuery(new Term(field, termText));
> >
> > /**
> >  * @exception ParseException throw in overridden method to disallow
> >  */
> > protected Query getFieldQuery(String field, String queryText) throws ParseException {
> >   // Use the analyzer to get all the tokens, and then build a TermQuery,
> >   // PhraseQuery, or nothing based on the term count
> >
> >   TokenStream source = analyzer.tokenStream(field, new StringReader(queryText));
> >   Vector v = new Vector();
> >   org.apache.lucene.analysis.Token t;
> >   int positionCount = 0;
> >   boolean severalTokensAtSamePosition = false;
> >
> >   while (true) {
> >     try {
> >       t = source.next();
> >     }
> >     catch (IOException e) {
> >       t = null;
> >     }
> >     if (t == null)
> >       break;
> >     v.addElement(t);
> >     if (t.getPositionIncrement() != 0)
> >       positionCount += t.getPositionIncrement();
> >     else
> >       severalTokensAtSamePosition = true;
> >   }
> >   try {
> >     source.close();
> >   }
> >   catch (IOException e) {
> >     // ignore
> >   }
> >
> >   if (v.size() == 0)
> >     return null;
> >   else if (v.size() == 1) {
> >     t = (org.apache.lucene.analysis.Token) v.elementAt(0);
> >     return new TermQuery(new Term(field, t.termText()));
> >   } else {
> >     if (severalTokensAtSamePosition) {
> >       if (positionCount == 1) {
> >         // no phrase query:
> >         BooleanQuery q = new BooleanQuery(true);
> >         for (int i = 0; i < v.size(); i++) {
> >           t = (org.apache.lucene.analysis.Token) v.elementAt(i);
> >           TermQuery currentQuery = new TermQuery(
> >               new Term(field, t.termText()));
> >           q.add(currentQuery, BooleanClause.Occur.SHOULD);
> >         }
> >         return q;
> >       }
> >       else {
> >         // phrase query:
> >         MultiPhraseQuery mpq = new MultiPhraseQuery();
> >         mpq.setSlop(phraseSlop);
> >         List multiTerms = new ArrayList();
> >         int position = -1;
> >         for (int i = 0; i < v.size(); i++) {
> >           t = (org.apache.lucene.analysis.Token) v.elementAt(i);
> >           if (t.getPositionIncrement() > 0 && multiTerms.size() > 0) {
> >             if (enablePositionIncrements) {
> >               mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
> >             } else {
> >               mpq.add((Term[])multiTerms.toArray(new Term[0]));
> >             }
> >             multiTerms.clear();
> >           }
> >           position += t.getPositionIncrement();
> >           multiTerms.add(new Term(field, t.termText()));
> >         }
> >         if (enablePositionIncrements) {
> >           mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
> >         } else {
> >           mpq.add((Term[])multiTerms.toArray(new Term[0]));
> >         }
> >         return mpq;
> >       }
> >     }
> >     else {
> >       PhraseQuery pq = new PhraseQuery();
> >       pq.setSlop(phraseSlop);
> >       int position = -1;
> >       for (int i = 0; i < v.size(); i++) {
> >         t = (org.apache.lucene.analysis.Token) v.elementAt(i);
> >         if (enablePositionIncrements) {
> >           position += t.getPositionIncrement();
> >           pq.add(new Term(field, t.termText()),position);
> >         } else {
> >           pq.add(new Term(field, t.termText()));
> >         }
> >       }
> >       return pq;
> >     }
> >   }
> > }
> >
> >
> > Here is an example that works around the deprecated code:
> > public void testShingleAnalyzerWrapperPhraseQuery() throws Exception {
> >   Analyzer analyzer = new ShingleAnalyzerWrapper(new WhitespaceAnalyzer(), 2);
> >   searcher = setUpSearcher(analyzer);
> >
> >   PhraseQuery q = new PhraseQuery();
> >
> >   TokenStream ts = analyzer.tokenStream("content",
> >                                         new StringReader("this sentence"));
> >   Token token;
> >   int j = -1;
> >   while ((token = ts.next()) != null) {
> >     j += token.getPositionIncrement();
> >     String termText = new String(token.termBuffer(), 0, token.termLength());
> >     q.add(new Term("content", termText), j);
> >   }
> >
> >   Hits hits = searcher.search(q);
> >   int[] ranks = new int[] { 0 };
> >   compareRanks(hits, ranks);
> > }
> >
> > -- DM
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
