Re: Token implementation

Hiroaki Kawai Sat, 12 Jul 2008 07:38:16 -0700

DM Smith <[EMAIL PROTECTED]> wrote:
> On Jul 11, 2008, at 9:42 PM, Hiroaki Kawai wrote:
> 
> > Another suggestion from me:
> > How about making the token object a singleton?
> 
> Would that work for a multi-threaded application?


Of course. We should make it a thread-local singleton.
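
As a rough sketch of what I mean: a per-thread instance could be held in a java.lang.ThreadLocal. The ReusableToken class and the token() accessor below are made up for illustration and are not Lucene's Token API; the sketch also assumes a JDK with ThreadLocal.withInitial (Java 8 or later).

```java
public class ThreadLocalTokenSketch {

    /** Hypothetical stand-in for a reusable token; NOT Lucene's Token class. */
    static final class ReusableToken {
        private char[] buffer = new char[16];
        private int length;

        /** Copies the text into the internal buffer, growing it if needed. */
        void setTerm(String text) {
            if (text.length() > buffer.length) {
                buffer = new char[text.length()];
            }
            text.getChars(0, text.length(), buffer, 0);
            length = text.length();
        }

        /** Returns the current term as a String (allocates a new String). */
        String term() {
            return new String(buffer, 0, length);
        }
    }

    // One instance per thread, created lazily on that thread's first get().
    private static final ThreadLocal<ReusableToken> TOKEN =
            ThreadLocal.withInitial(ReusableToken::new);

    /** Returns the calling thread's own token instance. */
    static ReusableToken token() {
        return TOKEN.get();
    }

    public static void main(String[] args) throws InterruptedException {
        ReusableToken mine = token();
        mine.setTerm("hello");

        // A second thread gets a different instance and cannot disturb ours.
        final ReusableToken[] other = new ReusableToken[1];
        Thread worker = new Thread(() -> {
            other[0] = token();
            other[0].setTerm("world");
        });
        worker.start();
        worker.join();

        System.out.println(mine == token());    // same thread, same instance
        System.out.println(mine != other[0]);   // other thread, other instance
        System.out.println(mine.term());        // still our own term
    }
}
```

Each thread reuses its own instance, so no locking is needed. The catch is that a caller must not hold on to the token across calls that might overwrite it on the same thread.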


> >
> >
> >
> >> Maybe we should un-deprecate the termText() method but add javadocs
> >> explaining that for better performance you should use the char[] reuse
> >> methods instead?
> >>
> >> Mike
> >>
> >> DM Smith wrote:
> >>
> >>> Michael McCandless wrote:
> >>>>
> >>>> DM Smith wrote:
> >>>>
> >>>>> Shouldn't Term have constructors that take a Token?
> >>>>
> >>>> I think that makes sense, though normally Token appears during
> >>>> analysis and Term during searching (I think?) -- how often would
> >>>> you need to make a Term from a Token?
> >>>>
> >>> The problem I'm addressing is that tokens are used in contexts that
> >>> need String and not char[].
> >>> The call to the deprecated
> >>> String termText = token.termText();
> >>> needs to be replaced with:
> >>> String termText = new String(token.termBuffer(), 0,
> >>> token.termLength());
> >>>
> >>> There are over 170 calls to token.termText(), and each of these places
> >>> has to be modified. In some, perhaps many, of these cases it may be
> >>> possible to use char[] directly to get a performance gain.
> >>>
> >>> In the case of Term changing it to work with char[] buffer, int
> >>> start, int length, does not seem quite right. I think the ripple
> >>> would keep getting bigger. But logically, the Term's text is the
> >>> text of a Token.
> >>>
> >>> To me it makes sense to have a method that returns the token as a
> >>> String, but that method is deprecated and the suggested replacement
> >>> is to directly use the buffer. So this leads to the above construct.
> >>> Perhaps it would be good to add a new method and document that as
> >>> one of two replacements.
> >>> public String term() {
> >>>   return termText != null ? termText
> >>>       : new String(termBuffer(), 0, termLength());
> >>> }
> >>>
> >>> Here is an example from QueryParser that has 5 instances, each
> >>> calling the deprecated t.termText() method. In this example, there
> >>> is the construction of a query from a token stream.
> >>> Each of the problem lines are of the pattern:
> >>> TermQuery currentQuery = new TermQuery(new Term(field,
> >>> t.termText()));
> >>>
> >>> To remove the deprecated call to t.termText(), the Token's buffer
> >>> needs to be marshalled with something like:
> >>> String termText = new String(token.termBuffer(), 0,
> >>> token.termLength());
> >>> TermQuery currentQuery = new TermQuery(new Term(field, termText));
> >>>
> >>> /**
> >>> * @exception ParseException throw in overridden method to disallow
> >>> */
> >>> protected Query getFieldQuery(String field, String queryText)
> >>> throws ParseException {
> >>>  // Use the analyzer to get all the tokens, and then build a TermQuery,
> >>>  // PhraseQuery, or nothing based on the term count
> >>>
> >>>  TokenStream source = analyzer.tokenStream(field, new
> >>> StringReader(queryText));
> >>>  Vector v = new Vector();
> >>>  org.apache.lucene.analysis.Token t;
> >>>  int positionCount = 0;
> >>>  boolean severalTokensAtSamePosition = false;
> >>>
> >>>  while (true) {
> >>>    try {
> >>>      t = source.next();
> >>>    }
> >>>    catch (IOException e) {
> >>>      t = null;
> >>>    }
> >>>    if (t == null)
> >>>      break;
> >>>    v.addElement(t);
> >>>    if (t.getPositionIncrement() != 0)
> >>>      positionCount += t.getPositionIncrement();
> >>>    else
> >>>      severalTokensAtSamePosition = true;
> >>>  }
> >>>  try {
> >>>    source.close();
> >>>  }
> >>>  catch (IOException e) {
> >>>    // ignore
> >>>  }
> >>>
> >>>  if (v.size() == 0)
> >>>    return null;
> >>>  else if (v.size() == 1) {
> >>>    t = (org.apache.lucene.analysis.Token) v.elementAt(0);
> >>>    return new TermQuery(new Term(field, t.termText()));
> >>>  } else {
> >>>    if (severalTokensAtSamePosition) {
> >>>      if (positionCount == 1) {
> >>>        // no phrase query:
> >>>        BooleanQuery q = new BooleanQuery(true);
> >>>        for (int i = 0; i < v.size(); i++) {
> >>>          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
> >>>          TermQuery currentQuery = new TermQuery(
> >>>              new Term(field, t.termText()));
> >>>          q.add(currentQuery, BooleanClause.Occur.SHOULD);
> >>>        }
> >>>        return q;
> >>>      }
> >>>      else {
> >>>        // phrase query:
> >>>        MultiPhraseQuery mpq = new MultiPhraseQuery();
> >>>        mpq.setSlop(phraseSlop);
> >>>        List multiTerms = new ArrayList();
> >>>        int position = -1;
> >>>        for (int i = 0; i < v.size(); i++) {
> >>>          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
> >>>          if (t.getPositionIncrement() > 0 && multiTerms.size() > 0) {
> >>>            if (enablePositionIncrements) {
> >>>              mpq.add((Term[])multiTerms.toArray(new
> >>> Term[0]),position);
> >>>            } else {
> >>>              mpq.add((Term[])multiTerms.toArray(new Term[0]));
> >>>            }
> >>>            multiTerms.clear();
> >>>          }
> >>>          position += t.getPositionIncrement();
> >>>          multiTerms.add(new Term(field, t.termText()));
> >>>        }
> >>>        if (enablePositionIncrements) {
> >>>          mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
> >>>        } else {
> >>>          mpq.add((Term[])multiTerms.toArray(new Term[0]));
> >>>        }
> >>>        return mpq;
> >>>      }
> >>>    }
> >>>    else {
> >>>      PhraseQuery pq = new PhraseQuery();
> >>>      pq.setSlop(phraseSlop);
> >>>      int position = -1;
> >>>      for (int i = 0; i < v.size(); i++) {
> >>>        t = (org.apache.lucene.analysis.Token) v.elementAt(i);
> >>>        if (enablePositionIncrements) {
> >>>          position += t.getPositionIncrement();
> >>>          pq.add(new Term(field, t.termText()),position);
> >>>        } else {
> >>>          pq.add(new Term(field, t.termText()));
> >>>        }
> >>>      }
> >>>      return pq;
> >>>    }
> >>>  }
> >>> }
> >>>
> >>>
> >>> Here is an example that works around the deprecated code:
> >>> public void testShingleAnalyzerWrapperPhraseQuery() throws  
> >>> Exception {
> >>>  Analyzer analyzer = new ShingleAnalyzerWrapper(new
> >>> WhitespaceAnalyzer(), 2);
> >>>  searcher = setUpSearcher(analyzer);
> >>>
> >>>  PhraseQuery q = new PhraseQuery();
> >>>
> >>>  TokenStream ts = analyzer.tokenStream("content",
> >>>                                        new StringReader("this sentence"));
> >>>  Token token;
> >>>  int j = -1;
> >>>  while ((token = ts.next()) != null) {
> >>>    j += token.getPositionIncrement();
> >>>    String termText = new String(token.termBuffer(), 0,
> >>> token.termLength());
> >>>    q.add(new Term("content", termText), j);
> >>>  }
> >>>
> >>>  Hits hits = searcher.search(q);
> >>>  int[] ranks = new int[] { 0 };
> >>>  compareRanks(hits, ranks);
> >>> }
> >>>
> >>> -- DM
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>> For additional commands, e-mail: [EMAIL PROTECTED]
> >>>
