Re: Lucene QueryParser/Analyzer inconsistency
Unfortunately I spoke too soon. While the original example seems to have been fixed, I'm still getting some unexpected results. As per your suggestion, I modified the Analyzer to:

    @Override
    protected TokenStreamComponents createComponents(String field, Reader in) {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("/", " "); // Transform all forward slashes into whitespace
        Reader mappingFilter = new MappingCharFilter(builder.build(), in);
        Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
        return new TokenStreamComponents(tokenizer);
    }

When I try this:

    QueryParser parser = new QueryParser(Version.LUCENE_48, "f",
        new MyAnalyzer(Version.LUCENE_48));
    System.err.println(parser.parse(QueryParser.escape("one/two")));

I get "f:one f:two" as expected. However, if I change the text to "hello one/two", I get:

    f:hello f:one/two

I can't figure out what's going on. My custom tokenizer seems to work well, but I'd rather use Lucene's built-ins.

Thank you,

Luis

On Wed, Jun 18, 2014 at 3:38 PM, Luis Pureza wrote:
> Thanks, that did work.
>
> On Tue, Jun 17, 2014 at 8:49 PM, Jack Krupansky wrote:
>> [earlier quoted text trimmed; Jack's full reply appears later in this thread]
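[Editor's aside: the char-filter-before-tokenizer pipeline in the analyzer above can be mimicked in plain Java. The classes below are toy stand-ins for MappingCharFilter and WhitespaceTokenizer, invented here for illustration only; they show that mapping '/' to a space before whitespace tokenization should split "hello one/two" into three tokens, which is why the observed f:one/two result is surprising.]

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class PipelineSketch {
    // Toy stand-in for MappingCharFilter: rewrites '/' to ' ' on the fly,
    // before any tokenizer sees the character stream.
    static class SlashToSpaceReader extends FilterReader {
        SlashToSpaceReader(Reader in) { super(in); }

        @Override public int read() throws IOException {
            int c = super.read();
            return c == '/' ? ' ' : c;
        }

        @Override public int read(char[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            for (int i = off; i < off + n; i++) {
                if (buf[i] == '/') buf[i] = ' ';
            }
            return n;
        }
    }

    // Toy stand-in for WhitespaceTokenizer: emits whitespace-separated tokens.
    static List<String> tokenize(Reader in) {
        List<String> tokens = new ArrayList<>();
        try (Scanner sc = new Scanner(in)) {
            while (sc.hasNext()) tokens.add(sc.next());
        }
        return tokens;
    }

    public static void main(String[] args) {
        Reader filtered = new SlashToSpaceReader(new StringReader("hello one/two"));
        System.out.println(tokenize(filtered)); // [hello, one, two]
    }
}
```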
Re: Lucene QueryParser/Analyzer inconsistency
Thanks, that did work.

On Tue, Jun 17, 2014 at 8:49 PM, Jack Krupansky wrote:
> [quoted reply trimmed; Jack's full message appears later in this thread]
Re: Lucene QueryParser/Analyzer inconsistency
Yeah, this is kind of tricky and confusing! Here's what happens:

1. The query parser "parses" the input string into individual source terms, each delimited by white space. The escape is removed in this process, but... no analyzer has been called at this stage.

2. The query parser (generator) calls the analyzer for each source term. Your analyzer is called at this stage, but... the escape is already gone, so... the mapping rule is not triggered, leaving the slash recorded in the source term from step 1.

You do need the backslash in your original query because a slash introduces a regex query term. It is added by the escape method you call, but the escaping will be gone by the time your analyzer is called.

So, just try a simple, unescaped slash in your char mapping table.

-- Jack Krupansky

-Original Message-
From: Luis Pureza
Sent: Tuesday, June 17, 2014 1:43 PM
To: java-user@lucene.apache.org
Subject: Lucene QueryParser/Analyzer inconsistency

[original message trimmed; it appears in full later in this thread]

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
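[Editor's aside: Jack's two-step sequence can be sketched in plain Java, with no Lucene dependency. The whitespace splitting and escape stripping below are deliberately simplified stand-ins for what QueryParser actually does; they only demonstrate the ordering problem, i.e. that a mapping rule keyed on "\/" can never fire because the backslash is gone before the analyzer runs.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EscapeOrderSketch {
    // Step 1 (simplified): split the query on whitespace into source terms,
    // stripping backslash escapes. This happens BEFORE any analysis.
    static List<String> parseTerms(String query) {
        List<String> terms = new ArrayList<>();
        for (String raw : query.split("\\s+")) {
            terms.add(raw.replace("\\", "")); // escape removed here
        }
        return terms;
    }

    // Step 2 (simplified): per-term "analysis" with one char-mapping rule,
    // followed by whitespace tokenization.
    static List<String> analyze(String term, String from, String to) {
        return Arrays.asList(term.replace(from, to).trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String escaped = "one\\/two"; // what QueryParser.escape("one/two") yields
        for (String term : parseTerms(escaped)) {
            // A rule keyed on "\/" never matches: the backslash is already gone.
            System.out.println(analyze(term, "\\/", " ")); // [one/two]
            // A rule keyed on a plain "/" does match, yielding two tokens.
            System.out.println(analyze(term, "/", " "));   // [one, two]
        }
    }
}
```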
Lucene QueryParser/Analyzer inconsistency
Hi,

I'm experiencing a puzzling behaviour with the QueryParser and was hoping someone around here can help me.

I have a very simple Analyzer that tries to replace forward slashes (/) with spaces. Because QueryParser forces me to escape strings with slashes before parsing, I added a MappingCharFilter to the analyzer that replaces "\/" with a single space. The analyzer is defined as follows:

    @Override
    protected TokenStreamComponents createComponents(String field, Reader in) {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("\\/", " ");
        Reader mappingFilter = new MappingCharFilter(builder.build(), in);
        Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
        return new TokenStreamComponents(tokenizer);
    }

Then I use this analyzer in the QueryParser to parse a string with slashes:

    String text = QueryParser.escape("one/two");
    QueryParser parser = new QueryParser(Version.LUCENE_48, "f",
        new MyAnalyzer(Version.LUCENE_48));
    System.err.println(parser.parse(text));

The expected output would be

    f:one f:two

However, I get:

    f:one/two

The puzzling thing is that when I debug the analyzer, it tokenizes the input string correctly, returning two tokens instead of one.

What is going on?

Many thanks,

Luís Pureza

P.S.: I was able to fix this issue temporarily by creating my own tokenizer that tokenizes on whitespace and slashes. However, I still don't understand what's going on.
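[Editor's aside on the escaping step discussed above: QueryParser.escape simply backslash-prefixes the parser's special characters. A plain-Java approximation follows; the character set is taken from my reading of the Lucene 4.x source and should be double-checked against your version, so treat it as illustrative rather than authoritative.]

```java
public class EscapeSketch {
    // Approximation of QueryParserBase.escape (Lucene 4.x): prefix each
    // special character with a backslash. Note that '/' is in the set,
    // which is why "one/two" becomes "one\/two".
    static String escape(String s) {
        String specials = "\\+-!():^[]\"{}~*?|&/";
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (specials.indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("one/two")); // one\/two
    }
}
```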