[jira] [Updated] (LUCENE-8186) CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms
[ https://issues.apache.org/jira/browse/LUCENE-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-8186:
--------------------------------
    Attachment: LUCENE-8186.patch

> CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-8186
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8186
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: LUCENE-8186.patch
>
> While working on SOLR-12034, a unit test that relied on the LowerCaseTokenizerFactory failed.
> After some digging, I was able to replicate this at the Lucene level.
> Unit test:
> {noformat}
> @Test
> public void testLCTokenizerFactoryNormalize() throws Exception {
>   Analyzer analyzer = CustomAnalyzer.builder().withTokenizer(LowerCaseTokenizerFactory.class).build();
>   //fails
>   assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));
>
>   //now try an integration test with the classic query parser
>   QueryParser p = new QueryParser("f", analyzer);
>   Query q = p.parse("Hello");
>   //passes
>   assertEquals(new TermQuery(new Term("f", "hello")), q);
>   q = p.parse("Hello*");
>   //fails
>   assertEquals(new PrefixQuery(new Term("f", "hello")), q);
>   q = p.parse("Hel*o");
>   //fails
>   assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
> }
> {noformat}
> The problem is that the CustomAnalyzer iterates through the tokenfilters, but does not call the tokenizer, which, in the case of the LowerCaseTokenizer, does the filtering work.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
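The failure mode described in the report can be sketched without Lucene on the classpath. The class and method names below are hypothetical stand-ins, not Lucene's actual API: `buggyNormalize` mirrors the pre-patch behavior, where normalization walks only the token-filter chain, so a chain whose tokenizer does the lower-casing (as LowerCaseTokenizer does) leaves multiterms untouched.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Simplified, Lucene-free model of the bug (hypothetical names).
public class NormalizeSketch {

    // Stand-in for LowerCaseTokenizer: a tokenizer that also lower-cases.
    static final UnaryOperator<String> LOWERCASING_TOKENIZER = s -> s.toLowerCase();

    // Buggy normalize: applies the filters but never consults the tokenizer,
    // so with an empty filter chain the term passes through unchanged.
    static String buggyNormalize(List<UnaryOperator<String>> filters, String term) {
        String s = term;
        for (UnaryOperator<String> f : filters) {
            s = f.apply(s);
        }
        return s;
    }

    // Fixed normalize: the tokenizer's character-level work is applied first,
    // then the filters, mirroring what happens at index time.
    static String fixedNormalize(UnaryOperator<String> tokenizer,
                                 List<UnaryOperator<String>> filters, String term) {
        return buggyNormalize(filters, tokenizer.apply(term));
    }

    public static void main(String[] args) {
        System.out.println(buggyNormalize(List.of(), "Hello"));                          // stays "Hello" (the bug)
        System.out.println(fixedNormalize(LOWERCASING_TOKENIZER, List.of(), "Hello"));   // "hello"
    }
}
```

This is why `p.parse("Hello")` still passed in the test above (the full analysis chain runs the tokenizer) while `p.parse("Hello*")` and `p.parse("Hel*o")` failed: multiterm queries go through `normalize`, which skipped the tokenizer.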
[jira] [Updated] (LUCENE-8186) CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms
[ https://issues.apache.org/jira/browse/LUCENE-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated LUCENE-8186:
--------------------------------
    Description: 
While working on SOLR-12034, a unit test that relied on the LowerCaseTokenizerFactory failed. After some digging, I was able to replicate this at the Lucene level.

Unit test:
{noformat}
@Test
public void testLCTokenizerFactoryNormalize() throws Exception {
  Analyzer analyzer = CustomAnalyzer.builder().withTokenizer(LowerCaseTokenizerFactory.class).build();
  //fails
  assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));

  //now try an integration test with the classic query parser
  QueryParser p = new QueryParser("f", analyzer);
  Query q = p.parse("Hello");
  //passes
  assertEquals(new TermQuery(new Term("f", "hello")), q);
  q = p.parse("Hello*");
  //fails
  assertEquals(new PrefixQuery(new Term("f", "hello")), q);
  q = p.parse("Hel*o");
  //fails
  assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
}
{noformat}
The problem is that the CustomAnalyzer iterates through the tokenfilters, but does not call the tokenizer, which, in the case of the LowerCaseTokenizer, does the filtering work.

  was:
While working on SOLR-12034, a unit test that relied on the LowerCaseTokenizerFactory failed. After some digging, I was able to replicate this at the Lucene level.

Unit test:
{noformat}
@Test
public void testLCTokenizerFactoryNormalize() throws Exception {
  Analyzer analyzer = CustomAnalyzer.builder().withTokenizer(new LowerCaseTokenizerFactory(Collections.EMPTY_MAP)).build();
  //fails
  assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));

  //now try an integration test with the classic query parser
  QueryParser p = new QueryParser("f", analyzer);
  Query q = p.parse("Hello");
  //passes
  assertEquals(new TermQuery(new Term("f", "hello")), q);
  q = p.parse("Hello*");
  //fails
  assertEquals(new PrefixQuery(new Term("f", "hello")), q);
  q = p.parse("Hel*o");
  //fails
  assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
}
{noformat}
The problem is that the CustomAnalyzer iterates through the tokenfilters, but does not call the tokenizer, which, in the case of the LowerCaseTokenizer, does the filtering work.

> CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-8186
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8186
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Minor
>
> While working on SOLR-12034, a unit test that relied on the LowerCaseTokenizerFactory failed.
> After some digging, I was able to replicate this at the Lucene level.
> Unit test:
> {noformat}
> @Test
> public void testLCTokenizerFactoryNormalize() throws Exception {
>   Analyzer analyzer = CustomAnalyzer.builder().withTokenizer(LowerCaseTokenizerFactory.class).build();
>   //fails
>   assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));
>
>   //now try an integration test with the classic query parser
>   QueryParser p = new QueryParser("f", analyzer);
>   Query q = p.parse("Hello");
>   //passes
>   assertEquals(new TermQuery(new Term("f", "hello")), q);
>   q = p.parse("Hello*");
>   //fails
>   assertEquals(new PrefixQuery(new Term("f", "hello")), q);
>   q = p.parse("Hel*o");
>   //fails
>   assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
> }
> {noformat}
> The problem is that the CustomAnalyzer iterates through the tokenfilters, but does not call the tokenizer, which, in the case of the LowerCaseTokenizer, does the filtering work.
[jira] [Updated] (LUCENE-8186) CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms
[ https://issues.apache.org/jira/browse/LUCENE-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated LUCENE-8186:
--------------------------------
    Description: 
While working on SOLR-12034, a unit test that relied on the LowerCaseTokenizerFactory failed. After some digging, I was able to replicate this at the Lucene level.

Unit test:
{noformat}
@Test
public void testLCTokenizerFactoryNormalize() throws Exception {
  Analyzer analyzer = CustomAnalyzer.builder().withTokenizer(new LowerCaseTokenizerFactory(Collections.EMPTY_MAP)).build();
  //fails
  assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));

  //now try an integration test with the classic query parser
  QueryParser p = new QueryParser("f", analyzer);
  Query q = p.parse("Hello");
  //passes
  assertEquals(new TermQuery(new Term("f", "hello")), q);
  q = p.parse("Hello*");
  //fails
  assertEquals(new PrefixQuery(new Term("f", "hello")), q);
  q = p.parse("Hel*o");
  //fails
  assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
}
{noformat}
The problem is that the CustomAnalyzer iterates through the tokenfilters, but does not call the tokenizer, which, in the case of the LowerCaseTokenizer, does the filtering work.

  was:
While working on SOLR-12034, a unit test that relied on the LowerCaseTokenizerFactory failed. After some digging, I was able to replicate this at the Lucene level.

Unit test:
{noformat}
@Test
public void testLCTokenizerFactoryNormalize() throws Exception {
  Analyzer analyzer = CustomAnalyzer.builder().withTokenizer(new LowerCaseTokenizerFactory(Collections.EMPTY_MAP)).build();
  //fails
  assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));

  //now try an integration test with the classic query parser
  QueryParser p = new QueryParser("f", analyzer);
  Query q = p.parse("Hello");
  //passes
  assertEquals(new TermQuery(new Term("f", "hello")), q);
  q = p.parse("Hello*");
  //fails
  assertEquals(new PrefixQuery(new Term("f", "hello")), q);
  q = p.parse("Hel*o");
  //fails
  assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
}
{noformat}
The problem is that the CustomAnalyzer iterates through the tokenfilters, but does not call the tokenizer, which, in the case of the LowerCaseAnalyzer, does the filtering work.

> CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-8186
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8186
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Minor
>
> While working on SOLR-12034, a unit test that relied on the LowerCaseTokenizerFactory failed.
> After some digging, I was able to replicate this at the Lucene level.
> Unit test:
> {noformat}
> @Test
> public void testLCTokenizerFactoryNormalize() throws Exception {
>   Analyzer analyzer = CustomAnalyzer.builder().withTokenizer(new LowerCaseTokenizerFactory(Collections.EMPTY_MAP)).build();
>   //fails
>   assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));
>
>   //now try an integration test with the classic query parser
>   QueryParser p = new QueryParser("f", analyzer);
>   Query q = p.parse("Hello");
>   //passes
>   assertEquals(new TermQuery(new Term("f", "hello")), q);
>   q = p.parse("Hello*");
>   //fails
>   assertEquals(new PrefixQuery(new Term("f", "hello")), q);
>   q = p.parse("Hel*o");
>   //fails
>   assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
> }
> {noformat}
> The problem is that the CustomAnalyzer iterates through the tokenfilters, but does not call the tokenizer, which, in the case of the LowerCaseTokenizer, does the filtering work.