Re: Lucene QueryParser/Analyzer inconsistency

2014-06-19 Thread Luis Pureza
Unfortunately I spoke too soon. While the original example seems to have
been fixed, I'm still getting some unexpected results.

As per your suggestion, I modified the Analyzer to:

@Override
protected TokenStreamComponents createComponents(String field, Reader in) {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("/", " "); // Transform all forward slashes into whitespace
    Reader mappingFilter = new MappingCharFilter(builder.build(), in);

    Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
    return new TokenStreamComponents(tokenizer);
}

When I try this:

QueryParser parser = new QueryParser(Version.LUCENE_48, "f",
        new MyAnalyzer(Version.LUCENE_48));
System.err.println(parser.parse(QueryParser.escape("one/two")));

I get

f:one f:two

as expected.

However, if I change the text to "hello one/two", I get:

f:hello f:one/two

I can't figure out what's going on. My custom tokenizer seems to work well,
but I'd rather use Lucene's built-ins.
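One way to narrow this down is to run the analyzer by itself, outside the QueryParser. A minimal sketch, assuming the analyzer above is called MyAnalyzer and the field is "f" as in the snippets (exception handling omitted):

Analyzer analyzer = new MyAnalyzer(Version.LUCENE_48);
try (TokenStream ts = analyzer.tokenStream("f", new StringReader("hello one/two"))) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        // The char filter maps '/' to a space, so this should print
        // "hello", "one" and "two" on separate lines.
        System.err.println(term.toString());
    }
    ts.end();
}

If the analyzer prints three tokens here but the parsed query still keeps one/two together, the next thing to check is what string the parser actually passes to the analyzer for that clause.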

Thank you,

Luis



On Wed, Jun 18, 2014 at 3:38 PM, Luis Pureza pur...@gmail.com wrote:

 Thanks, that did work.



 On Tue, Jun 17, 2014 at 8:49 PM, Jack Krupansky j...@basetechnology.com
 wrote:

 Yeah, this is kind of tricky and confusing! Here's what happens:

 1. The query parser parses the input string into individual source
 terms, each delimited by white space. The escape is removed in this
 process, but... no analyzer has been called at this stage.

 2. The query parser (generator) calls the analyzer for each source term.
 Your analyzer is called at this stage, but... the escape is already gone,
 so... the backslash-slash ("\/") mapping rule is not triggered, leaving the
 slash intact in the source term from step 1.

 You do need the backslash in your original query because a slash
 introduces a regex query term. It is added by the escape method you call,
 but the escaping will be gone by the time your analyzer is called.

 So, just try a simple, unescaped slash in your char mapping table.

 -- Jack Krupansky

 -----Original Message----- From: Luis Pureza
 Sent: Tuesday, June 17, 2014 1:43 PM
 To: java-user@lucene.apache.org
 Subject: Lucene QueryParser/Analyzer inconsistency


 Hi,

 I'm experiencing a puzzling behaviour with the QueryParser and was hoping
 someone around here can help me.

 I have a very simple Analyzer that tries to replace forward slashes (/)
 with spaces. Because QueryParser forces me to escape strings with slashes
 before parsing, I added a MappingCharFilter to the analyzer that replaces
 "\/" with a single space. The analyzer is defined as follows:

 @Override
 protected TokenStreamComponents createComponents(String field, Reader in) {
     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
     builder.add("\\/", " ");
     Reader mappingFilter = new MappingCharFilter(builder.build(), in);

     Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
     return new TokenStreamComponents(tokenizer);
 }

 Then I use this analyzer in the QueryParser to parse a string with slashes:

 String text = QueryParser.escape("one/two");
 QueryParser parser = new QueryParser(Version.LUCENE_48, "f",
         new MyAnalyzer(Version.LUCENE_48));
 System.err.println(parser.parse(text));

 The expected output would be

 f:one f:two

 However, I get:

 f:one/two

 The puzzling thing is that when I debug the analyzer, it tokenizes the
 input string correctly, returning two tokens instead of one.

 What is going on?

 Many thanks,

 Luís Pureza

 P.S.: I was able to fix this issue temporarily by creating my own tokenizer
 that tokenizes on whitespace and slashes. However, I still don't understand
 what's going on.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org





Re: Lucene QueryParser/Analyzer inconsistency

2014-06-18 Thread Luis Pureza
Thanks, that did work.



On Tue, Jun 17, 2014 at 8:49 PM, Jack Krupansky j...@basetechnology.com
wrote:

 Yeah, this is kind of tricky and confusing! Here's what happens:

 1. The query parser parses the input string into individual source
 terms, each delimited by white space. The escape is removed in this
 process, but... no analyzer has been called at this stage.

 2. The query parser (generator) calls the analyzer for each source term.
 Your analyzer is called at this stage, but... the escape is already gone,
 so... the backslash-slash ("\/") mapping rule is not triggered, leaving the
 slash intact in the source term from step 1.

 You do need the backslash in your original query because a slash
 introduces a regex query term. It is added by the escape method you call,
 but the escaping will be gone by the time your analyzer is called.

 So, just try a simple, unescaped slash in your char mapping table.

 -- Jack Krupansky

 -----Original Message----- From: Luis Pureza
 Sent: Tuesday, June 17, 2014 1:43 PM
 To: java-user@lucene.apache.org
 Subject: Lucene QueryParser/Analyzer inconsistency


 Hi,

 I'm experiencing a puzzling behaviour with the QueryParser and was hoping
 someone around here can help me.

 I have a very simple Analyzer that tries to replace forward slashes (/)
 with spaces. Because QueryParser forces me to escape strings with slashes
 before parsing, I added a MappingCharFilter to the analyzer that replaces
 "\/" with a single space. The analyzer is defined as follows:

 @Override
 protected TokenStreamComponents createComponents(String field, Reader in) {
     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
     builder.add("\\/", " ");
     Reader mappingFilter = new MappingCharFilter(builder.build(), in);

     Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
     return new TokenStreamComponents(tokenizer);
 }

 Then I use this analyzer in the QueryParser to parse a string with slashes:

 String text = QueryParser.escape("one/two");
 QueryParser parser = new QueryParser(Version.LUCENE_48, "f",
         new MyAnalyzer(Version.LUCENE_48));
 System.err.println(parser.parse(text));

 The expected output would be

 f:one f:two

 However, I get:

 f:one/two

 The puzzling thing is that when I debug the analyzer, it tokenizes the
 input string correctly, returning two tokens instead of one.

 What is going on?

 Many thanks,

 Luís Pureza

 P.S.: I was able to fix this issue temporarily by creating my own tokenizer
 that tokenizes on whitespace and slashes. However, I still don't understand
 what's going on.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Lucene QueryParser/Analyzer inconsistency

2014-06-17 Thread Luis Pureza
Hi,

I'm experiencing a puzzling behaviour with the QueryParser and was hoping
someone around here can help me.

I have a very simple Analyzer that tries to replace forward slashes (/)
with spaces. Because QueryParser forces me to escape strings with slashes
before parsing, I added a MappingCharFilter to the analyzer that replaces
"\/" with a single space. The analyzer is defined as follows:

@Override
protected TokenStreamComponents createComponents(String field, Reader in) {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("\\/", " ");
    Reader mappingFilter = new MappingCharFilter(builder.build(), in);

    Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
    return new TokenStreamComponents(tokenizer);
}

Then I use this analyzer in the QueryParser to parse a string with slashes:

String text = QueryParser.escape("one/two");
QueryParser parser = new QueryParser(Version.LUCENE_48, "f",
        new MyAnalyzer(Version.LUCENE_48));
System.err.println(parser.parse(text));

The expected output would be

f:one f:two

However, I get:

f:one/two

The puzzling thing is that when I debug the analyzer, it tokenizes the
input string correctly, returning two tokens instead of one.

What is going on?

Many thanks,

Luís Pureza

P.S.: I was able to fix this issue temporarily by creating my own tokenizer
that tokenizes on whitespace and slashes. However, I still don't understand
what's going on.
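For reference, a tokenizer of the kind the P.S. describes could be sketched as a CharTokenizer that ends a token on whitespace or on a slash (the class name here is illustrative, not the actual code):

public final class SlashOrWhitespaceTokenizer extends CharTokenizer {

    public SlashOrWhitespaceTokenizer(Version version, Reader in) {
        super(version, in);
    }

    @Override
    protected boolean isTokenChar(int c) {
        // A token continues as long as the character is neither whitespace nor '/',
        // so "one/two" comes out as the two tokens "one" and "two".
        return !Character.isWhitespace(c) && c != '/';
    }
}

Used in place of WhitespaceTokenizer inside createComponents, it splits on slashes without any char mapping.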


Re: Lucene QueryParser/Analyzer inconsistency

2014-06-17 Thread Jack Krupansky

Yeah, this is kind of tricky and confusing! Here's what happens:

1. The query parser parses the input string into individual source terms, 
each delimited by white space. The escape is removed in this process, but... 
no analyzer has been called at this stage.


2. The query parser (generator) calls the analyzer for each source term.
Your analyzer is called at this stage, but... the escape is already gone,
so... the backslash-slash ("\/") mapping rule is not triggered, leaving the
slash intact in the source term from step 1.


You do need the backslash in your original query because a slash introduces 
a regex query term. It is added by the escape method you call, but the 
escaping will be gone by the time your analyzer is called.


So, just try a simple, unescaped slash in your char mapping table.
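
A rough sketch of that sequence, with illustrative values (builder refers to the NormalizeCharMap.Builder in the code quoted below):

String escaped = QueryParser.escape("one/two");   // the escaped string is one\/two
// Step 1: the parser splits the query string on whitespace and strips the
//         escape, so the term handed to the analyzer is "one/two" again.
// Step 2: a mapping keyed on "\\/" therefore never matches; a mapping keyed
//         on the plain slash does:
builder.add("/", " ");   // instead of builder.add("\\/", " ")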

-- Jack Krupansky

-----Original Message-----
From: Luis Pureza

Sent: Tuesday, June 17, 2014 1:43 PM
To: java-user@lucene.apache.org
Subject: Lucene QueryParser/Analyzer inconsistency

Hi,

I'm experiencing a puzzling behaviour with the QueryParser and was hoping
someone around here can help me.

I have a very simple Analyzer that tries to replace forward slashes (/)
with spaces. Because QueryParser forces me to escape strings with slashes
before parsing, I added a MappingCharFilter to the analyzer that replaces
"\/" with a single space. The analyzer is defined as follows:

@Override
protected TokenStreamComponents createComponents(String field, Reader in) {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("\\/", " ");
    Reader mappingFilter = new MappingCharFilter(builder.build(), in);

    Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
    return new TokenStreamComponents(tokenizer);
}

Then I use this analyzer in the QueryParser to parse a string with slashes:

String text = QueryParser.escape("one/two");
QueryParser parser = new QueryParser(Version.LUCENE_48, "f",
        new MyAnalyzer(Version.LUCENE_48));
System.err.println(parser.parse(text));

The expected output would be

f:one f:two

However, I get:

f:one/two

The puzzling thing is that when I debug the analyzer, it tokenizes the
input string correctly, returning two tokens instead of one.

What is going on?

Many thanks,

Luís Pureza

P.S.: I was able to fix this issue temporarily by creating my own tokenizer
that tokenizes on whitespace and slashes. However, I still don't understand
what's going on. 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org