Re: Analyzer on query question

Ian Lea Fri, 03 Aug 2012 14:04:21 -0700

I still don't see what Bill gains by doing the term analysis himself
rather than letting QueryParser do the hard work, in a portable
non-lucene-version-specific way.



--
Ian.


On Fri, Aug 3, 2012 at 9:39 PM, Robert Muir <rcm...@gmail.com> wrote:
> you must call reset() before consuming any tokenstream.
>
> On Fri, Aug 3, 2012 at 4:03 PM, Jack Krupansky <j...@basetechnology.com> wrote:
>> Simon gave sample code for analyzing a multi-term string.
>>
>> Here's some pseudo-code (hasn't been compiled to check it) to analyze a
>> single term with Lucene 3.6:
>>
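>> import java.io.IOException;
>> import java.io.StringReader;
>> import org.apache.lucene.analysis.Analyzer;
>> import org.apache.lucene.analysis.TokenStream;
>> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>> import org.apache.lucene.index.Term;
>>
>> public Term analyzeTerm(Analyzer analyzer, String field, String termString)
>>     throws IOException {
>>   TokenStream stream = analyzer.tokenStream(field, new StringReader(termString));
>>   CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
>>   stream.reset();  // must be called before the first incrementToken()
>>   try {
>>     // TODO: Handle terms that analyze into multiple terms (e.g., embedded punctuation)
>>     return stream.incrementToken() ? new Term(field, termAtt.toString()) : null;
>>   } finally {
>>     stream.close();  // releases the stream and the underlying StringReader
>>   }
>> }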
>>
>> And here's the corresponding code for Lucene 4.0:
>>
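>> import java.io.IOException;
>> import java.io.StringReader;
>> import org.apache.lucene.analysis.Analyzer;
>> import org.apache.lucene.analysis.TokenStream;
>> import org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute;
>> import org.apache.lucene.index.Term;
>> import org.apache.lucene.util.BytesRef;
>>
>> public Term analyzeTerm(Analyzer analyzer, String field, String termString)
>>     throws IOException {
>>   TokenStream stream = analyzer.tokenStream(field, new StringReader(termString));
>>   TermToBytesRefAttribute termAtt = stream.getAttribute(TermToBytesRefAttribute.class);
>>   stream.reset();  // must be called before the first incrementToken()
>>   try {
>>     if (!stream.incrementToken()) return null;
>>     termAtt.fillBytesRef();  // 4.0 API: populate the BytesRef for the current token
>>     // TODO: Handle terms that analyze into multiple terms (e.g., embedded punctuation)
>>     return new Term(field, BytesRef.deepCopyOf(termAtt.getBytesRef()));
>>   } finally {
>>     stream.close();  // releases the stream and the underlying StringReader
>>   }
>> }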
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Bill Chesky
>> Sent: Friday, August 03, 2012 2:55 PM
>> To: java-user@lucene.apache.org
>>
>> Subject: RE: Analyzer on query question
>>
>> Ian/Jack,
>>
>> Ok, thanks for the help.  I certainly don't want to take a cheap way out,
>> hence my original question about whether this is the right way to do this.
>> Jack, you say the right way is to do Term analysis before creating the Term.
>> If anybody has any information on how to accomplish this I'd greatly
>> appreciate it.
>>
>> regards,
>>
>> Bill
>>
>> -----Original Message-----
>> From: Jack Krupansky [mailto:j...@basetechnology.com]
>> Sent: Friday, August 03, 2012 1:22 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Analyzer on query question
>>
>> Bill, the re-parse of Query.toString will work provided that your query
>> terms are either un-analyzed or their analyzer is "idempotent" (can be
>> applied repeatedly without changing the output terms.) In your case, you are
>> doing the former.
>>
>> The bottom line: 1) if it works for you, great, 2) for other readers, please
>> do not depend on this approach if your input data is filtered in any way -
>> if your index analyzer "filters" terms (e.g., stemming, case changes,
>> term-splitting), your Term/TermQuery should be analyzed/filtered comparably,
>> in which case the extra parse (to cause term analysis such as stemming)
>> becomes unnecessary and risky if you are not very careful or very lucky.
>>
>> -- Jack Krupansky
>>
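The idempotence condition Jack describes can be checked mechanically: analyze a term once, analyze the result again, and compare. The sketch below uses two toy functions (`lowercase` and `stripOneS`, both hypothetical and not part of Lucene or Snowball) purely to illustrate the property; it makes no claim about how any real stemmer behaves.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

final class IdempotenceCheck {
    // Toy "analyzer" 1: lowercasing. Applying it twice equals applying it once.
    static String lowercase(String term) {
        return term.toLowerCase(Locale.ROOT);
    }

    // Toy "analyzer" 2: drop one trailing 's'. This is NOT a real stemmer.
    static String stripOneS(String term) {
        return term.endsWith("s") ? term.substring(0, term.length() - 1) : term;
    }

    // True when re-analyzing the already-analyzed term leaves it unchanged.
    static boolean isIdempotentOn(UnaryOperator<String> analyzer, String term) {
        String once = analyzer.apply(term);
        return analyzer.apply(once).equals(once);
    }

    public static void main(String[] args) {
        System.out.println(isIdempotentOn(IdempotenceCheck::lowercase, "Cells")); // true
        System.out.println(isIdempotentOn(IdempotenceCheck::stripOneS, "cells")); // true
        System.out.println(isIdempotentOn(IdempotenceCheck::stripOneS, "boss"));  // false
    }
}
```

With the toy suffix stripper, "boss" becomes "bos" on the first pass and "bo" on the second, so a term that was analyzed at index time and then analyzed again during a re-parse would no longer match what is in the index.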
>> -----Original Message----- From: Ian Lea
>> Sent: Friday, August 03, 2012 1:12 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Analyzer on query question
>>
>> Bill
>>
>>
>> You're getting the snowball stemming either way which I guess is good,
>> and if you get the same results either way maybe it doesn't matter which
>> technique you use.  I'd be a bit worried about parsing the result of
>> query.toString() because you aren't guaranteed to get back, in text,
>> what you put in.
>>
>> My way seems better to me, but then it would.  If you prefer your way
>> I won't argue with you.
>>
>>
>> --
>> Ian.
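One way a toString() round trip can lose information is sketched below with toy classes, not Lucene's own. `ToyTerm` and `parse` are hypothetical stand-ins, and the assumption that the text form applies no quoting or escaping is made only for this illustration: a single term whose text contains a space comes back from the parser as two terms, one of them on the default field.

```java
import java.util.ArrayList;
import java.util.List;

final class ToStringRoundTrip {
    // Hypothetical stand-in for a term query: one field, one exact term.
    static final class ToyTerm {
        final String field;
        final String text;

        ToyTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }

        // Renders field:text with no quoting or escaping (assumption of this sketch).
        @Override
        public String toString() {
            return field + ":" + text;
        }
    }

    // Hypothetical parser: whitespace separates clauses; a clause without
    // "field:" falls back to the default field.
    static List<ToyTerm> parse(String query, String defaultField) {
        List<ToyTerm> terms = new ArrayList<>();
        for (String clause : query.trim().split("\\s+")) {
            if (clause.isEmpty()) {
                continue;
            }
            int colon = clause.indexOf(':');
            if (colon < 0) {
                terms.add(new ToyTerm(defaultField, clause));
            } else {
                terms.add(new ToyTerm(clause.substring(0, colon), clause.substring(colon + 1)));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        ToyTerm original = new ToyTerm("description", "new york");
        // Prints [description:new, title:york]: one term went in, two came out.
        System.out.println(parse(original.toString(), "title"));
    }
}
```

Whether a given real query survives the round trip depends on the query classes and parser involved, which is why building the query objects directly, or parsing the user's original text once, is the safer design.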
>>
>>
>> On Fri, Aug 3, 2012 at 5:57 PM, Bill Chesky <bill.che...@learninga-z.com>
>> wrote:
>>>
>>> Ian,
>>>
>>> I gave this method a try, at least the way I understood your suggestion.
>>> E.g. to search for the phrase "cells combine" I built up a string like:
>>>
>>> title:"cells combine" description:"cells combine" text:"cells combine"
>>>
>>> then I passed that to the queryParser.parse() method (where queryParser is
>>> an instance of QueryParser constructed using SnowballAnalyzer) and added
>>> the result as a MUST clause in my final BooleanQuery.
>>>
>>> When I print the resulting query out as a string I get:
>>>
>>> +(title:"cell combin" description:"cell combin" keywords:"cell combin")
>>>
>>> So it looks like the SnowballAnalyzer is doing some stemming for me.  But
>>> this is the exact same result I'd get doing it the way I described in my
>>> original email.  I just built the unanalyzed string on my own rather than
>>> using the various query classes like PhraseQuery, etc.
>>>
>>> So I don't see the advantage to doing it this way over the original
>>> method.  I just don't know if the original way I described is wrong or
>>> will give me bad results.
>>>
>>> thanks for the help,
>>>
>>> Bill
>>>
>>> -----Original Message-----
>>> From: Ian Lea [mailto:ian....@gmail.com]
>>> Sent: Friday, August 03, 2012 9:32 AM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: Analyzer on query question
>>>
>>> You can add parsed queries to a BooleanQuery.  Would that help in this
>>> case?
>>>
>>> SnowballAnalyzer sba = whatever();
>>> QueryParser qp = new QueryParser(..., sba);
>>> Query q1 = qp.parse("some snowball string");
>>> Query q2 = qp.parse("some other snowball string");
>>>
>>> BooleanQuery bq = new BooleanQuery();
>>> bq.add(q1, ...);
>>> bq.add(q2, ...);
>>> bq.add(loads of other stuff);
>>>
>>>
>>> --
>>> ian.
>>>
>>>
>>> On Fri, Aug 3, 2012 at 2:19 PM, Bill Chesky <bill.che...@learninga-z.com>
>>> wrote:
>>>>
>>>> Thanks Simon,
>>>>
>>>> Unfortunately, I'm using Lucene 3.0.1 and CharTermAttribute doesn't seem
>>>> to have been introduced until 3.1.0.  Similarly my version of Lucene does
>>>> not have a BooleanQuery.addClause(BooleanClause) method.  Maybe you meant
>>>> BooleanQuery.add(BooleanClause).
>>>>
>>>> In any case, most of what you're doing there, I'm just not familiar with.
>>>> Seems very low level.  I've never had to use TokenStreams to build a
>>>> query before and I'm not really sure what is going on there.  Also, I
>>>> don't know what PositionIncrementAttribute is or how it would be used to
>>>> create a PhraseQuery.   The way I'm currently creating PhraseQuerys is
>>>> very straightforward and intuitive.  E.g. to search for the term "foo
>>>> bar" I'd build the query like this:
>>>>
>>>>     PhraseQuery phraseQuery = new PhraseQuery();
>>>>     phraseQuery.add(new Term("title", "foo"));
>>>>     phraseQuery.add(new Term("title", "bar"));
>>>>
>>>> Is there really no easier way to associate the correct analyzer with
>>>> these types of queries?
>>>>
>>>> Bill
>>>>
>>>> -----Original Message-----
>>>> From: Simon Willnauer [mailto:simon.willna...@gmail.com]
>>>> Sent: Friday, August 03, 2012 3:43 AM
>>>> To: java-user@lucene.apache.org; Bill Chesky
>>>> Subject: Re: Analyzer on query question
>>>>
>>>> On Thu, Aug 2, 2012 at 11:09 PM, Bill Chesky
>>>> <bill.che...@learninga-z.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I understand that generally speaking you should use the same analyzer on
>>>>> querying as was used on indexing.  In my code I am using the
>>>>> SnowballAnalyzer on index creation.  However, on the query side I am
>>>>> building up a complex BooleanQuery from other BooleanQuerys and/or
>>>>> PhraseQuerys on several fields.  None of these require specifying an
>>>>> analyzer anywhere.  This is causing some odd results, I think, because a
>>>>> different analyzer (or no analyzer?) is being used for the query.
>>>>>
>>>>> Question: how do I build my boolean and phrase queries using the
>>>>> SnowballAnalyzer?
>>>>>
>>>>> One thing I did that seemed to kind of work was to build my complex
>>>>> query normally then build a snowball-analyzed query using a QueryParser
>>>>> instantiated with a SnowballAnalyzer.  To do this, I simply pass the
>>>>> string value of the complex query to the QueryParser.parse() method to
>>>>> get the new query.  Something like this:
>>>>>
>>>>>     // build a complex query from other BooleanQuerys and PhraseQuerys
>>>>>     BooleanQuery fullQuery = buildComplexQuery();
>>>>>     QueryParser parser = new QueryParser(Version.LUCENE_30, "title",
>>>>>         new SnowballAnalyzer(Version.LUCENE_30, "English"));
>>>>>     Query snowballAnalyzedQuery = parser.parse(fullQuery.toString());
>>>>>
>>>>>     TopScoreDocCollector collector = TopScoreDocCollector.create(10000, true);
>>>>>     indexSearcher.search(snowballAnalyzedQuery, collector);
>>>>
>>>>
>>>> you can just use the analyzer directly like this:
>>>> Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
>>>>
>>>> TokenStream stream = analyzer.tokenStream("title",
>>>>     new StringReader(fullQuery.toString()));
>>>> CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
>>>> stream.reset();
>>>> BooleanQuery q = new BooleanQuery();
>>>> while (stream.incrementToken()) {
>>>>   // one required TermQuery per analyzed token
>>>>   q.add(new TermQuery(new Term("title", termAttr.toString())), Occur.MUST);
>>>> }
>>>> stream.close();
>>>>
>>>> you also have access to the token positions if you want to create
>>>> phrase queries etc. just add a PositionIncrementAttribute like this:
>>>> PositionIncrementAttribute posAttr =
>>>>     stream.addAttribute(PositionIncrementAttribute.class);
>>>>
>>>> Please double-check the code; it's straight off the top of my head.
>>>>
>>>> simon
>>>>
>>>>>
>>>>> Like I said, this seems to kind of work but it doesn't feel right.  Does
>>>>> this make sense?  Is there a better way?
>>>>>
>>>>> thanks in advance,
>>>>>
>>>>> Bill
>>>>
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
>
> --
> lucidimagination.com
