Re: Lucene QueryParser/Analyzer inconsistency
Unfortunately I spoke too soon. While the original example seems to have been fixed, I'm still getting some unexpected results. As per your suggestion, I modified the Analyzer to:

@Override
protected TokenStreamComponents createComponents(String field, Reader in) {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("/", " "); // Transform all forward slashes into whitespace
    Reader mappingFilter = new MappingCharFilter(builder.build(), in);

    Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
    return new TokenStreamComponents(tokenizer);
}

When I try this:

QueryParser parser = new QueryParser(Version.LUCENE_48, "f", new MyAnalyzer(Version.LUCENE_48));
System.err.println(parser.parse(QueryParser.escape("one/two")));

I get "f:one f:two", as expected. However, if I change the text to "hello one/two", I get:

f:hello f:one/two

I can't figure out what's going on. My custom tokenizer seems to work well, but I'd rather use Lucene's built-ins.

Thank you,

Luis

On Wed, Jun 18, 2014 at 3:38 PM, Luis Pureza wrote:
> Thanks, that did work.
>
> On Tue, Jun 17, 2014 at 8:49 PM, Jack Krupansky wrote:
>
>> Yeah, this is kind of tricky and confusing! Here's what happens:
>>
>> 1. The query parser "parses" the input string into individual source
>> terms, each delimited by white space. The escape is removed in this
>> process, but... no analyzer has been called at this stage.
>>
>> 2. The query parser (generator) calls the analyzer for each source term.
>> Your analyzer is called at this stage, but... the escape is already gone,
>> so... the mapping rule is not triggered, leaving the slash recorded in
>> the source term from step 1.
>>
>> You do need the backslash in your original query because a slash
>> introduces a regex query term. It is added by the escape method you call,
>> but the escaping will be gone by the time your analyzer is called.
>>
>> So, just try a simple, unescaped slash in your char mapping table.
>> -- Jack Krupansky
>>
>> -Original Message- From: Luis Pureza
>> Sent: Tuesday, June 17, 2014 1:43 PM
>> To: java-user@lucene.apache.org
>> Subject: Lucene QueryParser/Analyzer inconsistency
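For what it's worth, the plain "/" mapping itself treats both inputs identically when applied to the raw text. A Lucene-free sketch of the intended char-filter-plus-whitespace-tokenizer behaviour (a simulation, not the real MappingCharFilter):

```java
import java.util.Arrays;
import java.util.List;

public class MappingCheck {
    // Simulate the "/" -> " " char mapping followed by whitespace tokenization.
    static List<String> tokens(String text) {
        return Arrays.asList(text.replace('/', ' ').trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokens("one/two"));       // [one, two]
        System.out.println(tokens("hello one/two")); // [hello, one, two]
    }
}
```

Both inputs split cleanly here, which suggests the difference Luis sees arises in the query-parsing stage before the analyzer ever receives the text, not in the analysis chain itself.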
Re: Lucene QueryParser/Analyzer inconsistency
Thanks, that did work.

On Tue, Jun 17, 2014 at 8:49 PM, Jack Krupansky wrote:
> Yeah, this is kind of tricky and confusing! Here's what happens:
>
> 1. The query parser "parses" the input string into individual source
> terms, each delimited by white space. The escape is removed in this
> process, but... no analyzer has been called at this stage.
>
> 2. The query parser (generator) calls the analyzer for each source term.
> Your analyzer is called at this stage, but... the escape is already gone,
> so... the mapping rule is not triggered, leaving the slash recorded in
> the source term from step 1.
>
> You do need the backslash in your original query because a slash
> introduces a regex query term. It is added by the escape method you call,
> but the escaping will be gone by the time your analyzer is called.
>
> So, just try a simple, unescaped slash in your char mapping table.
>
> -- Jack Krupansky
>
> -Original Message- From: Luis Pureza
> Sent: Tuesday, June 17, 2014 1:43 PM
> To: java-user@lucene.apache.org
> Subject: Lucene QueryParser/Analyzer inconsistency
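Jack's two-stage description can be sketched without Lucene at all. A plain-Java simulation (an illustration of the mechanism, not Lucene's actual code), assuming stage 1 splits on whitespace and strips backslash escapes, and stage 2 applies the char-mapping rule to each term separately:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TwoStageSketch {
    // Stage 1: whitespace split, then escape removal (no analyzer involved).
    static List<String> parseTerms(String input) {
        return Arrays.stream(input.split("\\s+"))
                     .map(t -> t.replace("\\", ""))
                     .collect(Collectors.toList());
    }

    // Stage 2: per-term analysis -- the mapping rule, then whitespace tokenization.
    static List<String> analyze(String term, String mappingKey) {
        return Arrays.asList(term.replace(mappingKey, " ").trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String escaped = "one\\/two"; // what QueryParser.escape("one/two") yields
        for (String term : parseTerms(escaped)) {
            // A rule keyed on "\/" never fires: the escape is gone before stage 2.
            System.out.println(analyze(term, "\\/")); // [one/two]
            // A rule keyed on a plain "/" does fire.
            System.out.println(analyze(term, "/"));   // [one, two]
        }
    }
}
```

This is why the fix is simply to key the mapping table on an unescaped slash: by the time the analyzer runs, the backslash no longer exists.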
Lucene QueryParser/Analyzer inconsistency
Hi,

I'm experiencing a puzzling behaviour with the QueryParser and was hoping someone around here could help me.

I have a very simple Analyzer that tries to replace forward slashes (/) with spaces. Because QueryParser forces me to escape strings with slashes before parsing, I added a MappingCharFilter to the analyzer that replaces "\/" with a single space. The analyzer is defined as follows:

@Override
protected TokenStreamComponents createComponents(String field, Reader in) {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("\\/", " ");
    Reader mappingFilter = new MappingCharFilter(builder.build(), in);

    Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
    return new TokenStreamComponents(tokenizer);
}

Then I use this analyzer in the QueryParser to parse a string containing a slash:

String text = QueryParser.escape("one/two");
QueryParser parser = new QueryParser(Version.LUCENE_48, "f", new MyAnalyzer(Version.LUCENE_48));
System.err.println(parser.parse(text));

The expected output would be

f:one f:two

However, I get:

f:one/two

The puzzling thing is that when I debug the analyzer, it tokenizes the input string correctly, returning two tokens instead of one.

What is going on?

Many thanks,

Luís Pureza

P.S.: I was able to fix this issue temporarily by creating my own tokenizer that tokenizes on whitespace and slashes. However, I still don't understand what's going on.
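One way the debug-vs-parser discrepancy described in the P.S. can arise: a mapping rule keyed on the literal "\/" fires when the analyzer is fed the escaped text directly (as typically happens when debugging it in isolation), but not when it receives a term whose escape has already been stripped. A hedged plain-Java sketch of that difference (a simulation, not the real MappingCharFilter):

```java
import java.util.Arrays;
import java.util.List;

public class DebugVsParser {
    // Simulate the char-filter rule: literal "\/" -> " ", then whitespace split.
    static List<String> applyMapping(String text) {
        return Arrays.asList(text.replace("\\/", " ").trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(applyMapping("one\\/two")); // escaped input, rule fires: [one, two]
        System.out.println(applyMapping("one/two"));   // escape already gone: [one/two]
    }
}
```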