Re: Lucene QueryParser/Analyzer inconsistency

2014-06-19 Thread Luis Pureza
Unfortunately I spoke too soon. While the original example seems to have
been fixed, I'm still getting some unexpected results.

As per your suggestion, I modified the Analyzer to:

@Override
protected TokenStreamComponents createComponents(String field, Reader in) {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("/", " "); // Transform all forward slashes into whitespace
    Reader mappingFilter = new MappingCharFilter(builder.build(), in);

    Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
    return new TokenStreamComponents(tokenizer);
}

When I try this:

QueryParser parser = new QueryParser(Version.LUCENE_48, "f",
        new MyAnalyzer(Version.LUCENE_48));
System.err.println(parser.parse(QueryParser.escape("one/two")));

I get

f:one f:two

as expected.

However, if I change the text to "hello one/two", I get:

f:hello f:one/two

I can't figure out what's going on. My custom tokenizer seems to work well,
but I'd rather use Lucene's built-ins.
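
One thing that may be related (just a sketch, not something confirmed anywhere
in this thread): as far as I can tell, in Lucene 4.x an Analyzer reuses its
TokenStreamComponents between calls, and only readers returned from initReader()
are re-applied on every reuse; a CharFilter created inside createComponents() is
applied only the first time the components are built. Attaching the mapping
through initReader() would look roughly like this (the class layout here is
assumed, since the full class isn't shown above):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public final class MyAnalyzer extends Analyzer {
    private final Version version;

    public MyAnalyzer(Version version) {
        this.version = version;
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Applied on every call to the analyzer, including reused components.
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("/", " "); // plain slash; the escape is gone before analysis
        return new MappingCharFilter(builder.build(), reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String field, Reader in) {
        // "in" has already been wrapped by initReader at this point.
        return new TokenStreamComponents(new WhitespaceTokenizer(version, in));
    }
}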

Thank you,

Luis



On Wed, Jun 18, 2014 at 3:38 PM, Luis Pureza  wrote:

> Thanks, that did work.
>
>
>
> On Tue, Jun 17, 2014 at 8:49 PM, Jack Krupansky 
> wrote:
>
>> Yeah, this is kind of tricky and confusing! Here's what happens:
>>
>> 1. The query parser "parses" the input string into individual source
>> terms, each delimited by white space. The escape is removed in this
>> process, but... no analyzer has been called at this stage.
>>
>> 2. The query parser (generator) calls the analyzer for each source term.
>> Your analyzer is called at this stage, but... the escape is already gone,
>> so... the "\/" mapping rule is not triggered, leaving the
>> slash recorded in the source term from step 1.
>>
>> You do need the backslash in your original query because a slash
>> introduces a regex query term. It is added by the escape method you call,
>> but the escaping will be gone by the time your analyzer is called.
>>
>> So, just try a simple, unescaped slash in your char mapping table.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Luis Pureza
>> Sent: Tuesday, June 17, 2014 1:43 PM
>> To: java-user@lucene.apache.org
>> Subject: Lucene QueryParser/Analyzer inconsistency
>>
>>
>> Hi,
>>
>> I'm experiencing puzzling behaviour with the QueryParser and was hoping
>> someone around here could help me.
>>
>> I have a very simple Analyzer that tries to replace forward slashes (/)
>> with spaces. Because QueryParser forces me to escape strings containing
>> slashes before parsing, I added a MappingCharFilter to the analyzer that
>> replaces "\/" with a single space. The analyzer is defined as follows:
>>
>> @Override
>> protected TokenStreamComponents createComponents(String field, Reader in) {
>>     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
>>     builder.add("\\/", " ");
>>     Reader mappingFilter = new MappingCharFilter(builder.build(), in);
>>
>>     Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
>>     return new TokenStreamComponents(tokenizer);
>> }
>>
>> Then I use this analyzer in the QueryParser to parse a string with slashes:
>>
>> String text = QueryParser.escape("one/two");
>> QueryParser parser = new QueryParser(Version.LUCENE_48, "f",
>>         new MyAnalyzer(Version.LUCENE_48));
>> System.err.println(parser.parse(text));
>>
>> The expected output would be
>>
>> f:one f:two
>>
>> However, I get:
>>
>> f:one/two
>>
>> The puzzling thing is that when I debug the analyzer, it tokenizes the
>> input string correctly, returning two tokens instead of one.
>>
>> What is going on?
>>
>> Many thanks,
>>
>> Luís Pureza
>>
>> P.S.: I was able to fix this issue temporarily by creating my own
>> tokenizer
>> that tokenizes on whitespace and slashes. However, I still don't
>> understand
>> what's going on.
>>


Re: Lucene QueryParser/Analyzer inconsistency

2014-06-18 Thread Luis Pureza
Thanks, that did work.



On Tue, Jun 17, 2014 at 8:49 PM, Jack Krupansky 
wrote:

> Yeah, this is kind of tricky and confusing! Here's what happens:
>
> 1. The query parser "parses" the input string into individual source
> terms, each delimited by white space. The escape is removed in this
> process, but... no analyzer has been called at this stage.
>
> 2. The query parser (generator) calls the analyzer for each source term.
> Your analyzer is called at this stage, but... the escape is already gone,
> so... the "\/" mapping rule is not triggered, leaving the
> slash recorded in the source term from step 1.
>
> You do need the backslash in your original query because a slash
> introduces a regex query term. It is added by the escape method you call,
> but the escaping will be gone by the time your analyzer is called.
>
> So, just try a simple, unescaped slash in your char mapping table.
>
> -- Jack Krupansky
>
> -Original Message- From: Luis Pureza
> Sent: Tuesday, June 17, 2014 1:43 PM
> To: java-user@lucene.apache.org
> Subject: Lucene QueryParser/Analyzer inconsistency
>
>
> Hi,
>
> I'm experiencing puzzling behaviour with the QueryParser and was hoping
> someone around here could help me.
>
> I have a very simple Analyzer that tries to replace forward slashes (/)
> with spaces. Because QueryParser forces me to escape strings containing
> slashes before parsing, I added a MappingCharFilter to the analyzer that
> replaces "\/" with a single space. The analyzer is defined as follows:
>
> @Override
> protected TokenStreamComponents createComponents(String field, Reader in) {
>NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
>builder.add("\\/", " ");
>Reader mappingFilter = new MappingCharFilter(builder.build(), in);
>
>Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
>return new TokenStreamComponents(tokenizer);
> }
>
> Then I use this analyzer in the QueryParser to parse a string with slashes:
>
> String text = QueryParser.escape("one/two");
> QueryParser parser = new QueryParser(Version.LUCENE_48, "f",
>         new MyAnalyzer(Version.LUCENE_48));
> System.err.println(parser.parse(text));
>
> The expected output would be
>
> f:one f:two
>
> However, I get:
>
> f:one/two
>
> The puzzling thing is that when I debug the analyzer, it tokenizes the
> input string correctly, returning two tokens instead of one.
>
> What is going on?
>
> Many thanks,
>
> Luís Pureza
>
> P.S.: I was able to fix this issue temporarily by creating my own tokenizer
> that tokenizes on whitespace and slashes. However, I still don't understand
> what's going on.
>


Re: Lucene QueryParser/Analyzer inconsistency

2014-06-17 Thread Jack Krupansky

Yeah, this is kind of tricky and confusing! Here's what happens:

1. The query parser "parses" the input string into individual source terms, 
each delimited by white space. The escape is removed in this process, but... 
no analyzer has been called at this stage.


2. The query parser (generator) calls the analyzer for each source term. 
Your analyzer is called at this stage, but... the escape is already gone, 
so... the "\/" mapping rule is not triggered, leaving the
slash recorded in the source term from step 1.


You do need the backslash in your original query because a slash introduces 
a regex query term. It is added by the escape method you call, but the 
escaping will be gone by the time your analyzer is called.


So, just try a simple, unescaped slash in your char mapping table.
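
To make the two steps concrete, a small sketch (the intermediate values in the
comments follow the description above rather than output captured from this
thread):

String escaped = QueryParser.escape("one/two");   // becomes "one\/two"
// Step 1: the parser splits on whitespace and strips the escape, so the text
// handed to the analyzer is "one/two" -- a "\\/" entry in the NormalizeCharMap
// therefore never matches. Without the escape, the slash would instead start
// a regular-expression term.
// Step 2: with a plain "/" -> " " mapping, the char filter fires and the
// whitespace tokenizer emits two tokens, giving f:one f:two.
Query q = new QueryParser(Version.LUCENE_48, "f",
        new MyAnalyzer(Version.LUCENE_48)).parse(escaped);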

-- Jack Krupansky

-Original Message- 
From: Luis Pureza

Sent: Tuesday, June 17, 2014 1:43 PM
To: java-user@lucene.apache.org
Subject: Lucene QueryParser/Analyzer inconsistency

Hi,

I'm experiencing puzzling behaviour with the QueryParser and was hoping
someone around here could help me.

I have a very simple Analyzer that tries to replace forward slashes (/)
with spaces. Because QueryParser forces me to escape strings containing
slashes before parsing, I added a MappingCharFilter to the analyzer that
replaces "\/" with a single space. The analyzer is defined as follows:

@Override
protected TokenStreamComponents createComponents(String field, Reader in) {
   NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
   builder.add("\\/", " ");
   Reader mappingFilter = new MappingCharFilter(builder.build(), in);

   Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
   return new TokenStreamComponents(tokenizer);
}

Then I use this analyzer in the QueryParser to parse a string with slashes:

String text = QueryParser.escape("one/two");
QueryParser parser = new QueryParser(Version.LUCENE_48, "f",
        new MyAnalyzer(Version.LUCENE_48));
System.err.println(parser.parse(text));

The expected output would be

f:one f:two

However, I get:

f:one/two

The puzzling thing is that when I debug the analyzer, it tokenizes the
input string correctly, returning two tokens instead of one.

What is going on?

Many thanks,

Luís Pureza

P.S.: I was able to fix this issue temporarily by creating my own tokenizer
that tokenizes on whitespace and slashes. However, I still don't understand
what's going on. 






Lucene QueryParser/Analyzer inconsistency

2014-06-17 Thread Luis Pureza
Hi,

I'm experiencing puzzling behaviour with the QueryParser and was hoping
someone around here could help me.

I have a very simple Analyzer that tries to replace forward slashes (/)
with spaces. Because QueryParser forces me to escape strings containing
slashes before parsing, I added a MappingCharFilter to the analyzer that
replaces "\/" with a single space. The analyzer is defined as follows:

@Override
protected TokenStreamComponents createComponents(String field, Reader in) {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("\\/", " ");
    Reader mappingFilter = new MappingCharFilter(builder.build(), in);

    Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
    return new TokenStreamComponents(tokenizer);
}

Then I use this analyzer in the QueryParser to parse a string with slashes:

String text = QueryParser.escape("one/two");
QueryParser parser = new QueryParser(Version.LUCENE_48, "f",
        new MyAnalyzer(Version.LUCENE_48));
System.err.println(parser.parse(text));

The expected output would be

f:one f:two

However, I get:

f:one/two

The puzzling thing is that when I debug the analyzer, it tokenizes the
input string correctly, returning two tokens instead of one.
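
(For reference, the debugging is done with a loop roughly like the one below;
the helper itself is a sketch of mine, not part of Lucene.)

void dumpTokens(Analyzer analyzer, String text) throws IOException {
    // Feed the text straight to the analyzer and print every token it emits.
    try (TokenStream ts = analyzer.tokenStream("f", new StringReader(text))) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println("token: " + term.toString());
        }
        ts.end();
    }
}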

What is going on?

Many thanks,

Luís Pureza

P.S.: I was able to fix this issue temporarily by creating my own tokenizer
that tokenizes on whitespace and slashes. However, I still don't understand
what's going on.
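
For reference, the temporary tokenizer is along these lines (a sketch built on
CharTokenizer; the exact code may differ):

public final class WhitespaceSlashTokenizer extends CharTokenizer {
    public WhitespaceSlashTokenizer(Version matchVersion, Reader in) {
        super(matchVersion, in);
    }

    @Override
    protected boolean isTokenChar(int c) {
        // A character belongs to a token unless it is whitespace or a '/'.
        return !Character.isWhitespace(c) && c != '/';
    }
}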