DM Smith <[EMAIL PROTECTED]> wrote:
> On Jul 11, 2008, at 9:42 PM, Hiroaki Kawai wrote:
>
> > Another suggestion from me:
> > How about making the token object a singleton?
>
> Would that work for a multi-threaded application?

Of course. We should make it a thread-local singleton.

> >
> >> Maybe we should un-deprecate the termText() method but add javadocs
> >> explaining that for better performance you should use the char[]
> >> reuse methods instead?
> >>
> >> Mike
> >>
> >> DM Smith wrote:
> >>
> >>> Michael McCandless wrote:
> >>>>
> >>>> DM Smith wrote:
> >>>>
> >>>>> Shouldn't Term have constructors that take a Token?
> >>>>
> >>>> I think that makes sense, though normally Token appears during
> >>>> analysis and Term during searching (I think?) -- how often would
> >>>> you need to make a Term from a Token?
> >>>>
> >>> The problem I'm addressing is that tokens are used in contexts that
> >>> need String and not char[].
> >>> The call to the deprecated
> >>>   String termText = token.termText();
> >>> needs to be replaced with:
> >>>   String termText = new String(token.termBuffer(), 0,
> >>>       token.termLength());
> >>>
> >>> There are over 170 calls to token.termText(), and each of these
> >>> places has to be modified. In some, perhaps many, of these cases it
> >>> may be possible to use char[] directly to get a performance gain.
> >>>
> >>> In the case of Term, changing it to work with char[] buffer, int
> >>> start, int length does not seem quite right. I think the ripple
> >>> would keep getting bigger. But logically, the Term's text is the
> >>> text of a Token.
> >>>
> >>> To me it makes sense to have a method that returns the token as a
> >>> String, but that method is deprecated and the suggested replacement
> >>> is to directly use the buffer. So this leads to the above construct.
> >>> Perhaps it would be good to add a new method and document that as
> >>> one of two replacements.
> >>>   public String term() {
> >>>     return termText != null ? termText
> >>>         : new String(termBuffer(), 0, termLength());
> >>>   }
> >>>
> >>> Here is an example from QueryParser that has 5 instances, each
> >>> calling the deprecated t.termText() method.
> >>> In this example, there is the construction of a query from a
> >>> token stream.
> >>> Each of the problem lines is of the pattern:
> >>>   TermQuery currentQuery = new TermQuery(new Term(field,
> >>>       t.termText()));
> >>>
> >>> To remove the deprecated call to t.termText(), the Token's buffer
> >>> needs to be marshalled with something like:
> >>>   String termText = new String(token.termBuffer(), 0,
> >>>       token.termLength());
> >>>   TermQuery currentQuery = new TermQuery(new Term(field, termText));
> >>>
> >>>   /**
> >>>    * @exception ParseException throw in overridden method to disallow
> >>>    */
> >>>   protected Query getFieldQuery(String field, String queryText)
> >>>       throws ParseException {
> >>>     // Use the analyzer to get all the tokens, and then build a
> >>>     // TermQuery, PhraseQuery, or nothing based on the term count
> >>>
> >>>     TokenStream source = analyzer.tokenStream(field,
> >>>         new StringReader(queryText));
> >>>     Vector v = new Vector();
> >>>     org.apache.lucene.analysis.Token t;
> >>>     int positionCount = 0;
> >>>     boolean severalTokensAtSamePosition = false;
> >>>
> >>>     while (true) {
> >>>       try {
> >>>         t = source.next();
> >>>       }
> >>>       catch (IOException e) {
> >>>         t = null;
> >>>       }
> >>>       if (t == null)
> >>>         break;
> >>>       v.addElement(t);
> >>>       if (t.getPositionIncrement() != 0)
> >>>         positionCount += t.getPositionIncrement();
> >>>       else
> >>>         severalTokensAtSamePosition = true;
> >>>     }
> >>>     try {
> >>>       source.close();
> >>>     }
> >>>     catch (IOException e) {
> >>>       // ignore
> >>>     }
> >>>
> >>>     if (v.size() == 0)
> >>>       return null;
> >>>     else if (v.size() == 1) {
> >>>       t = (org.apache.lucene.analysis.Token) v.elementAt(0);
> >>>       return new TermQuery(new Term(field, t.termText()));
> >>>     } else {
> >>>       if (severalTokensAtSamePosition) {
> >>>         if (positionCount == 1) {
> >>>           // no phrase query:
> >>>           BooleanQuery q = new BooleanQuery(true);
> >>>           for (int i = 0; i < v.size(); i++) {
> >>>             t = (org.apache.lucene.analysis.Token) v.elementAt(i);
> >>>             TermQuery currentQuery = new TermQuery(
> >>>                 new Term(field, t.termText()));
> >>>             q.add(currentQuery, BooleanClause.Occur.SHOULD);
> >>>           }
> >>>           return q;
> >>>         }
> >>>         else {
> >>>           // phrase query:
> >>>           MultiPhraseQuery mpq = new MultiPhraseQuery();
> >>>           mpq.setSlop(phraseSlop);
> >>>           List multiTerms = new ArrayList();
> >>>           int position = -1;
> >>>           for (int i = 0; i < v.size(); i++) {
> >>>             t = (org.apache.lucene.analysis.Token) v.elementAt(i);
> >>>             if (t.getPositionIncrement() > 0 && multiTerms.size() > 0) {
> >>>               if (enablePositionIncrements) {
> >>>                 mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
> >>>               } else {
> >>>                 mpq.add((Term[])multiTerms.toArray(new Term[0]));
> >>>               }
> >>>               multiTerms.clear();
> >>>             }
> >>>             position += t.getPositionIncrement();
> >>>             multiTerms.add(new Term(field, t.termText()));
> >>>           }
> >>>           if (enablePositionIncrements) {
> >>>             mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
> >>>           } else {
> >>>             mpq.add((Term[])multiTerms.toArray(new Term[0]));
> >>>           }
> >>>           return mpq;
> >>>         }
> >>>       }
> >>>       else {
> >>>         PhraseQuery pq = new PhraseQuery();
> >>>         pq.setSlop(phraseSlop);
> >>>         int position = -1;
> >>>         for (int i = 0; i < v.size(); i++) {
> >>>           t = (org.apache.lucene.analysis.Token) v.elementAt(i);
> >>>           if (enablePositionIncrements) {
> >>>             position += t.getPositionIncrement();
> >>>             pq.add(new Term(field, t.termText()),position);
> >>>           } else {
> >>>             pq.add(new Term(field, t.termText()));
> >>>           }
> >>>         }
> >>>         return pq;
> >>>       }
> >>>     }
> >>>   }
> >>>
> >>> Here is an example that works around the deprecated code:
> >>>   public void testShingleAnalyzerWrapperPhraseQuery() throws Exception {
> >>>     Analyzer analyzer = new ShingleAnalyzerWrapper(
> >>>         new WhitespaceAnalyzer(), 2);
> >>>     searcher = setUpSearcher(analyzer);
> >>>
> >>>     PhraseQuery q = new PhraseQuery();
> >>>
> >>>     TokenStream ts = analyzer.tokenStream("content",
> >>>         new StringReader("this sentence"));
> >>>     Token token;
> >>>     int j = -1;
> >>>     while ((token = ts.next()) != null) {
> >>>       j += token.getPositionIncrement();
> >>>       String termText = new String(token.termBuffer(), 0,
> >>>           token.termLength());
> >>>       q.add(new Term("content", termText), j);
> >>>     }
> >>>
> >>>     Hits hits = searcher.search(q);
> >>>     int[] ranks = new int[] { 0 };
> >>>     compareRanks(hits, ranks);
> >>>   }
> >>>
> >>> -- DM
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>> For additional commands, e-mail: [EMAIL PROTECTED]
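
For what it's worth, here is a rough sketch of what a thread-local singleton token could look like. It is only an illustration under some assumptions: ReusableToken, ThreadLocalTokenSketch and current() are hypothetical names made up for this mail and are not part of Lucene; only the method names termBuffer(), termLength() and term() follow the ones used in this thread. It also assumes Java 8 or later for ThreadLocal.withInitial.

```java
import java.util.concurrent.atomic.AtomicReference;

class ThreadLocalTokenSketch {

    /** Hypothetical stand-in for a reusable token; NOT Lucene's Token class. */
    static final class ReusableToken {
        private char[] buffer = new char[16];
        private int length;

        /** Copies the text into the reusable buffer, growing it if needed. */
        void setTerm(String text) {
            if (text.length() > buffer.length) {
                buffer = new char[text.length()];
            }
            text.getChars(0, text.length(), buffer, 0);
            length = text.length();
        }

        char[] termBuffer() { return buffer; }

        int termLength() { return length; }

        /** Materializes a String only when a caller really needs one. */
        String term() { return new String(buffer, 0, length); }
    }

    // One instance per thread: the "thread-local singleton".
    private static final ThreadLocal<ReusableToken> TOKEN =
        ThreadLocal.withInitial(ReusableToken::new);

    /** Returns the calling thread's own token instance. */
    static ReusableToken current() { return TOKEN.get(); }

    public static void main(String[] args) throws InterruptedException {
        ReusableToken mine = current();
        mine.setTerm("sentence");

        // A second thread asks for "the" token and gets its own instance.
        AtomicReference<ReusableToken> other = new AtomicReference<>();
        Thread worker = new Thread(() -> {
            ReusableToken theirs = current();
            theirs.setTerm("shingle");
            other.set(theirs);
        });
        worker.start();
        worker.join();

        System.out.println(mine == current());    // same thread, same instance
        System.out.println(mine != other.get());  // worker thread had its own
        System.out.println(mine.term());          // not touched by the worker
    }
}
```

One caveat with this approach: code that stores tokens for later, like the Vector v in the getFieldQuery example quoted above, would have to copy the text out (for example via term()) before the token is reused, since otherwise every stored element would point at the same instance.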