On Fri, Sep 17, 2010 at 7:34 PM, Scott Smith <[email protected]> wrote:
> First, let me say that I didn't think the problem was in QueryParser and I
> apologize if that's how it sounded. QueryParser is a central method to
> Lucene. 1 of me having problems with QueryParser, 1000's of others not. Is
> the problem more likely in my code or lucene. We'll all agree on the answer
> to that question.
Don't worry :)
>
> As further proof, I ran the following code. The first part is from Simon's
> email (thanks for that snippet) and the second part is from LIA2.
>
> // code from Willnauer email
> Analyzer a = new MyAnalyzer(Version.LUCENE_30);
> TokenStream stream = a.reusableTokenStream("body", new
> StringReader("Europabörsen"));
> TermAttribute attr = stream.addAttribute(TermAttribute.class);
> while(stream.incrementToken())
> {
> System.out.println(attr.term());
> }
>
> // code from LIA2
> stream = a.tokenStream("body", new StringReader("Europabörsen"));
> TermAttribute term = stream.addAttribute(TermAttribute.class);
> while (stream.incrementToken())
> {
> System.out.print(term.term());
> }
>
>
> The answer I got back was:
> europabörsen
> europaborsen
>
> I realized the difference between these two was whether I was getting the
> reusableTokeStream or the tokenStream. In looking at my code, the
> ASCIIFoldingFilter was not in the filter setup for the
> resusableTokenStream(). It was for the tokenStream(). I added it to the
> reusableTokenStream and I now get the result I wanted. The above code
> snippet generates the word without the umlaut in both cases. So, problem
> solved.
>
> Thanks to Simon for putting on the right track.
you are using lucene 3.0? If so take a look at ReusableAnalyzerBase
which makes it much easier to build Analyzers and prevents code
duplication.
simon
>
> Scott
>
>
> -----Original Message-----
> From: Simon Willnauer [mailto:[email protected]]
> Sent: Friday, September 17, 2010 1:03 AM
> To: [email protected]
> Subject: Re: QueryParser in 3.x
>
> On Fri, Sep 17, 2010 at 1:06 AM, Scott Smith <[email protected]>
> wrote:
>> I recently upgraded to Lucene 3.0 and am seeing some new behavior that I
>> don't understand. Perhaps someone can explain why.
>>
>>
>>
>> I have a custom analyzer. Part of the analyzer uses the AsciiFoldingFilter.
>> If I run a word with an umlaut through that analyzer using the AnalyzerDemo
>> code in LIA2, as expected, I get the same word except that the umlauted
>> letter is now a simple ascii letter (no umlaut). That's what I would expect
>> and want.
>>
>>
>>
>> If I create a Queryparser using the call "new QueryParser(LUCENE_30, "body",
>> myAnalyzer) and then call the parse() method passing the same word, I can
>> see that the query parser has not removed the umlaut. The string it has is
>> "+body: Europabörsen".
>>
> This seems to be an issue with your analyzer rather than with the
> QueryParser. Since QueryParser didn't really change its behavior in
> 3.0 except of some default values. Can you provide more info what you
> did with your analyzer? Did you try running the term with umlaut chars
> through your Analyzer / Tokenstream directly? Something like that:
>
> Analyzer a = new MyAnalyzer();
> TokenStream stream = a.reusableTokenStream("body", new
> StringReader("Europabörsen"));
> TermAttribute attr = stream.addAttribute(TermAttribute.class);
> while(stream.incrementToken())
> System.out.println(attr.term());
>
> simon
>>
>>
>> I know I had to make a number of changes to the analyzer and the tokenizer
>> to upgrade to 3.x. Is there something very different from the 2.x version
>> that I'm likely missing.
>>
>>
>>
>> Anyone have any thoughts?
>>
>>
>>
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]