Re: Lucene QueryParser/Analyzer inconsistency

2014-06-19 Thread Luis Pureza
Unfortunately I spoke too soon. While the original example seems to have
been fixed, I'm still getting some unexpected results.

As per your suggestion, I modified the Analyzer to:

@Override
protected TokenStreamComponents createComponents(String field, Reader in) {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("/", " "); // Transform all forward slashes into whitespace
    Reader mappingFilter = new MappingCharFilter(builder.build(), in);

    Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
    return new TokenStreamComponents(tokenizer);
}

When I try this:

QueryParser parser = new QueryParser(Version.LUCENE_48, "f",
        new MyAnalyzer(Version.LUCENE_48));
System.err.println(parser.parse(QueryParser.escape("one/two")));

I get

f:one f:two

as expected.

However, if I change the text to "hello one/two", I get:

f:hello f:one/two

I can't figure out what's going on. My custom tokenizer seems to work well,
but I'd rather use Lucene's built-ins.

Thank you,

Luis



On Wed, Jun 18, 2014 at 3:38 PM, Luis Pureza  wrote:

> Thanks, that did work.
>
>
>
> On Tue, Jun 17, 2014 at 8:49 PM, Jack Krupansky 
> wrote:
>
>> Yeah, this is kind of tricky and confusing! Here's what happens:
>>
>> 1. The query parser "parses" the input string into individual source
>> terms, each delimited by white space. The escape is removed in this
>> process, but... no analyzer has been called at this stage.
>>
>> 2. The query parser (generator) calls the analyzer for each source term.
>> Your analyzer is called at this stage, but... the escape is already gone,
>> so... the "\/" mapping rule is not triggered, leaving the
>> slash recorded in the source term from step 1.
>>
>> You do need the backslash in your original query because a slash
>> introduces a regex query term. It is added by the escape method you call,
>> but the escaping will be gone by the time your analyzer is called.
>>
>> So, just try a simple, unescaped slash in your char mapping table.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Luis Pureza
>> Sent: Tuesday, June 17, 2014 1:43 PM
>> To: java-user@lucene.apache.org
>> Subject: Lucene QueryParser/Analyzer inconsistency
>>
>>
>> Hi,
>>
>> I'm experiencing a puzzling behaviour with the QueryParser and was hoping
>> someone around here can help me.
>>
>> I have a very simple Analyzer that tries to replace forward slashes (/) by
>> spaces. Because QueryParser forces me to escape strings with slashes
>> before
>> parsing, I added a MappingCharFilter to the analyzer that replaces "\/"
>> with a single space. The analyzer is defined as follows:
>>
>> @Override
>> protected TokenStreamComponents createComponents(String field, Reader in) {
>>     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
>>     builder.add("\\/", " ");
>>     Reader mappingFilter = new MappingCharFilter(builder.build(), in);
>>
>>     Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
>>     return new TokenStreamComponents(tokenizer);
>> }
>>
>> Then I use this analyzer in the QueryParser to parse a string with slashes:
>>
>> String text = QueryParser.escape("one/two");
>> QueryParser parser = new QueryParser(Version.LUCENE_48, "f",
>>         new MyAnalyzer(Version.LUCENE_48));
>> System.err.println(parser.parse(text));
>>
>> The expected output would be
>>
>> f:one f:two
>>
>> However, I get:
>>
>> f:one/two
>>
>> The puzzling thing is that when I debug the analyzer, it tokenizes the
>> input string correctly, returning two tokens instead of one.
>>
>> What is going on?
>>
>> Many thanks,
>>
>> Luís Pureza
>>
>> P.S.: I was able to fix this issue temporarily by creating my own
>> tokenizer
>> that tokenizes on whitespace and slashes. However, I still don't
>> understand
>> what's going on.
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>


Re: Lucene QueryParser/Analyzer inconsistency

2014-06-18 Thread Luis Pureza
Thanks, that did work.



On Tue, Jun 17, 2014 at 8:49 PM, Jack Krupansky 
wrote:

> Yeah, this is kind of tricky and confusing! Here's what happens:
>
> 1. The query parser "parses" the input string into individual source
> terms, each delimited by white space. The escape is removed in this
> process, but... no analyzer has been called at this stage.
>
> 2. The query parser (generator) calls the analyzer for each source term.
> Your analyzer is called at this stage, but... the escape is already gone,
> so... the "\/" mapping rule is not triggered, leaving the
> slash recorded in the source term from step 1.
>
> You do need the backslash in your original query because a slash
> introduces a regex query term. It is added by the escape method you call,
> but the escaping will be gone by the time your analyzer is called.
>
> So, just try a simple, unescaped slash in your char mapping table.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Luis Pureza
> Sent: Tuesday, June 17, 2014 1:43 PM
> To: java-user@lucene.apache.org
> Subject: Lucene QueryParser/Analyzer inconsistency
>
>
> Hi,
>
> I'm experiencing a puzzling behaviour with the QueryParser and was hoping
> someone around here can help me.
>
> I have a very simple Analyzer that tries to replace forward slashes (/) by
> spaces. Because QueryParser forces me to escape strings with slashes before
> parsing, I added a MappingCharFilter to the analyzer that replaces "\/"
> with a single space. The analyzer is defined as follows:
>
> @Override
> protected TokenStreamComponents createComponents(String field, Reader in) {
>     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
>     builder.add("\\/", " ");
>     Reader mappingFilter = new MappingCharFilter(builder.build(), in);
>
>     Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
>     return new TokenStreamComponents(tokenizer);
> }
>
> Then I use this analyzer in the QueryParser to parse a string with slashes:
>
> String text = QueryParser.escape("one/two");
> QueryParser parser = new QueryParser(Version.LUCENE_48, "f",
>         new MyAnalyzer(Version.LUCENE_48));
> System.err.println(parser.parse(text));
>
> The expected output would be
>
> f:one f:two
>
> However, I get:
>
> f:one/two
>
> The puzzling thing is that when I debug the analyzer, it tokenizes the
> input string correctly, returning two tokens instead of one.
>
> What is going on?
>
> Many thanks,
>
> Luís Pureza
>
> P.S.: I was able to fix this issue temporarily by creating my own tokenizer
> that tokenizes on whitespace and slashes. However, I still don't understand
> what's going on.
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
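The two-stage process Jack describes above can be sketched with a self-contained simulation. This is not Lucene's actual code; `escape`, `parseTerms`, and `analyze` are simplified stand-ins for `QueryParser.escape`, the parser's whitespace splitting with escape removal, and the char-mapping analyzer, respectively. It shows why a mapping rule keyed on the two-character sequence `\/` can never fire by the time the analyzer runs, while a rule keyed on a bare `/` does:

```java
import java.util.ArrayList;
import java.util.List;

public class EscapeDemo {
    // Stand-in for QueryParser.escape: backslash-escape the slash
    static String escape(String s) {
        return s.replace("/", "\\/");
    }

    // Stage 1 (parser): split on whitespace, then discard escape backslashes
    static List<String> parseTerms(String query) {
        List<String> terms = new ArrayList<>();
        for (String raw : query.trim().split("\\s+")) {
            terms.add(raw.replace("\\", ""));
        }
        return terms;
    }

    // Stage 2 (analyzer): apply the char mapping, then whitespace-tokenize
    static List<String> analyze(String term, String from, String to) {
        List<String> tokens = new ArrayList<>();
        for (String tok : term.replace(from, to).trim().split("\\s+")) {
            if (!tok.isEmpty()) tokens.add(tok);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Stage 1 already stripped the backslash: the term is "one/two"
        String term = parseTerms(escape("one/two")).get(0);
        System.out.println(analyze(term, "\\/", " ")); // [one/two] - the \/ rule never matches
        System.out.println(analyze(term, "/", " "));   // [one, two] - the bare-slash rule works
    }
}
```

This is why mapping the unescaped `/` (as in the fixed analyzer) works: the backslash that `escape()` adds exists only during stage 1 and is gone before any char filter sees the text.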


Lucene QueryParser/Analyzer inconsistency

2014-06-17 Thread Luis Pureza
Hi,

I'm experiencing a puzzling behaviour with the QueryParser and was hoping
someone around here can help me.

I have a very simple Analyzer that tries to replace forward slashes (/) by
spaces. Because QueryParser forces me to escape strings with slashes before
parsing, I added a MappingCharFilter to the analyzer that replaces "\/"
with a single space. The analyzer is defined as follows:

@Override
protected TokenStreamComponents createComponents(String field, Reader in) {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("\\/", " ");
    Reader mappingFilter = new MappingCharFilter(builder.build(), in);

    Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
    return new TokenStreamComponents(tokenizer);
}

Then I use this analyzer in the QueryParser to parse a string with slashes:

String text = QueryParser.escape("one/two");
QueryParser parser = new QueryParser(Version.LUCENE_48, "f",
        new MyAnalyzer(Version.LUCENE_48));
System.err.println(parser.parse(text));

The expected output would be

f:one f:two

However, I get:

f:one/two

The puzzling thing is that when I debug the analyzer, it tokenizes the
input string correctly, returning two tokens instead of one.

What is going on?

Many thanks,

Luís Pureza

P.S.: I was able to fix this issue temporarily by creating my own tokenizer
that tokenizes on whitespace and slashes. However, I still don't understand
what's going on.
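The debugging observation above is consistent with the analyzer receiving two different inputs. When the analyzer is exercised in isolation it is typically handed the still-escaped string (containing `\/`), so the mapping rule fires; via QueryParser the escape has already been discarded before analysis, so the rule cannot match. A minimal stand-in simulation of that difference (not Lucene's actual code; the class and method names are illustrative):

```java
import java.util.Arrays;
import java.util.List;

public class DebugVsParser {
    // Stand-in for the analyzer: map the two-char sequence \/ to a space, then split
    static List<String> analyzeWithEscapedRule(String input) {
        String mapped = input.replace("\\/", " ");
        return Arrays.asList(mapped.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // Debugging in isolation: the analyzer sees the still-escaped string,
        // so the \/ rule fires and two tokens come out
        System.out.println(analyzeWithEscapedRule("one\\/two")); // [one, two]

        // Via QueryParser: the escape was discarded during parsing, so the
        // analyzer sees a bare slash and the rule never matches
        System.out.println(analyzeWithEscapedRule("one/two"));   // [one/two]
    }
}
```

Same analyzer, different inputs, hence the apparent inconsistency between the debugger and the parsed query.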