[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035783#comment-16035783 ]

ASF subversion and git services commented on LUCENE-7705:
---------------------------------------------------------

Commit e4a43cf59a12ca39eb8278cc2533d409d792185a in lucene-solr's branch refs/heads/branch_6x from Erick
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e4a43cf ]

LUCENE-7705: Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token len (test fix)
(cherry picked from commit 15a8a24)

> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
> ---------------------------------------------------------------------------------------------
>
>          Key: LUCENE-7705
>          URL: https://issues.apache.org/jira/browse/LUCENE-7705
>      Project: Lucene - Core
>   Issue Type: Improvement
>     Reporter: Amrit Sarkar
>     Assignee: Erick Erickson
>     Priority: Minor
>      Fix For: master (7.0), 6.7
>
>  Attachments: LUCENE-7705, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256-character limit for the CharTokenizer? Changing that limit currently requires copy/pasting incrementToken into a new class, since incrementToken is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but doing so requires code rather than a schema setting.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) and their factories, it would take adding a c'tor to the base class in Lucene and using it in the factory.
> Any objections?

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
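For context, the schema-level configurability this issue delivers would look roughly like the following sketch. The field type name is hypothetical and the value 512 is arbitrary; the `maxTokenLen` attribute name reflects the committed factories, but verify it against the tokenizer factory javadocs of the release you run:

```xml
<!-- Hypothetical Solr field type: maxTokenLen raises the 256-char default -->
<fieldType name="long_keyword" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory" maxTokenLen="512"/>
  </analyzer>
</fieldType>
```

Before this change, raising the limit meant writing a custom tokenizer in Java; afterward it is one attribute in the schema.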
[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035782#comment-16035782 ]

ASF subversion and git services commented on LUCENE-7705:
---------------------------------------------------------

Commit 2eacf13def4dc9fbea1de9c79150c05682b0cdec in lucene-solr's branch refs/heads/master from Erick
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2eacf13 ]

LUCENE-7705: Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length, fix test failure.
[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035781#comment-16035781 ]

ASF subversion and git services commented on LUCENE-7705:
---------------------------------------------------------

Commit 15a8a2415280d50c982fcd4fca893a3c3224da14 in lucene-solr's branch refs/heads/master from Erick
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=15a8a24 ]

LUCENE-7705: Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token len (test fix)
[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030223#comment-16030223 ]

Erick Erickson commented on LUCENE-7705:
----------------------------------------

OK, I see what's happening. I noted earlier that, the way this has always been implemented, multiple tokens are emitted when the token length is exceeded. In this case the token sent in the doc is "letter", so two tokens are emitted, "let" and "ter", with positions incremented between them, I think. The search is against "lett". For some reason the parsed query in 6x is:

PhraseQuery(letter0:"let t")

while in master it's:

letter0:let letter0:t

Even this is wrong; it just happens to succeed because the default operator is OR, so even though the tokens in the index do not include a bare "t", the doc is found by chance, not design.

I think the right solution is to stop emitting tokens for a particular value once maxTokenLen is exceeded. I'll raise a new JIRA and we can debate it there. This is _not_ a change in behavior resulting from the changes in this JIRA; the tests just expose something that's always been the case but nobody noticed.
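The splitting behavior Erick describes can be sketched in isolation. This is a plain-Java illustration of the chunking (ChunkDemo and chunk are illustrative names, not Lucene code; the real logic lives in CharTokenizer.incrementToken):

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkDemo {
    // Mimics the historical CharTokenizer behavior: a run of token
    // characters longer than maxTokenLen is emitted as several
    // consecutive tokens, not as one truncated token.
    static List<String> chunk(String token, int maxTokenLen) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < token.length(); start += maxTokenLen) {
            out.add(token.substring(start, Math.min(start + maxTokenLen, token.length())));
        }
        return out;
    }

    public static void main(String[] args) {
        // With a max token length of 3, "letter" is indexed as "let" + "ter",
        // so neither indexed token matches a query term like "lett".
        System.out.println(chunk("letter", 3)); // [let, ter]
    }
}
```

This makes it clear why the test's query for "lett" can only match by accident: no emitted token ever exceeds maxTokenLen, so no indexed term starts with the four-character prefix.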
Re: [jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
Ah, OK, that one fails for me too on 6x; I was trying on trunk. Thanks... digging.

On Tue, May 30, 2017 at 12:44 PM, Steve Rowe (JIRA) wrote:
>
> Steve Rowe commented on LUCENE-7705:
>
> My seed reproduces for me both on Linux and on my Macbook Pro (Sierra
> 10.12.5, Oracle JDK 1.8.0_112). Note that the original failure was on
> branch_6x (and that's where I repro'd).
[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030016#comment-16030016 ]

Steve Rowe commented on LUCENE-7705:
------------------------------------

My seed reproduces for me both on Linux and on my Macbook Pro (Sierra 10.12.5, Oracle JDK 1.8.0_112). Note that the original failure was on branch_6x (and that's where I repro'd).
[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029981#comment-16029981 ]

Erick Erickson commented on LUCENE-7705:
----------------------------------------

Hmmm, Steve's seed succeeds on my Macbook Pro, and I beasted this test 100 times on my Mac Pro without failures. Not quite sure what to do next... I wonder if SOLR-10562 is biting us again? Although that doesn't really make sense, as the line before this failure finds the same document. I'll beast this on a different system I have access to when I can (it's shared).
[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029872#comment-16029872 ]

Steve Rowe commented on LUCENE-7705:
------------------------------------

Here's another reproducing failure [https://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Linux/3612]:

{noformat}
   [junit4]   2> NOTE: reproduce with: ant test -Dtestcase=TestMaxTokenLenTokenizer -Dtests.method=testSingleFieldSameAnalyzers -Dtests.seed=FE4BE1CA39C9E0DA -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=vi-VN -Dtests.timezone=Australia/NSW -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.07s J1 | TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers <<<
   [junit4]    > Throwable #1: java.lang.RuntimeException: Exception during query
   [junit4]    >   at __randomizedtesting.SeedInfo.seed([FE4BE1CA39C9E0DA:9499DEA5612A3015]:0)
   [junit4]    >   at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:895)
   [junit4]    >   at org.apache.solr.util.TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers(TestMaxTokenLenTokenizer.java:104)
   [junit4]    >   at java.lang.Thread.run(Thread.java:748)
   [junit4]    > Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//result[@numFound=1]
   [junit4]    >   xml response was: 00
   [junit4]    >   request was: q=letter0:lett&wt=xml
   [junit4]    >   at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:888)
   [...]
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene62): {lowerCase0=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128))), whiteSpace=Lucene50(blocksize=128), letter=BlockTreeOrds(blocksize=128), lowerCase=Lucene50(blocksize=128), unicodeWhiteSpace=BlockTreeOrds(blocksize=128), letter0=FST50, unicodeWhiteSpace0=FST50, keyword0=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128))), id=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128))), keyword=Lucene50(blocksize=128), whiteSpace0=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128)))}, docValues:{}, maxPointsInLeafNode=579, maxMBSortInHeap=5.430197160407458, sim=RandomSimilarity(queryNorm=false,coord=crazy): {}, locale=vi-VN, timezone=Australia/NSW
   [junit4]   2> NOTE: Linux 4.10.0-21-generic amd64/Oracle Corporation 1.8.0_131 (64-bit)/cpus=8,threads=1,free=234332328,total=536870912
{noformat}
Re: [jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
Let me take a look. Tests passed for me, probably some of the random stuff?

On Tue, May 30, 2017 at 9:39 AM, Tomás Fernández Löbbe (JIRA) wrote:
>
> Tomás Fernández Löbbe commented on LUCENE-7705:
>
> TestMaxTokenLenTokenizer seems to be failing in Jenkins
> {noformat}
> 1 tests failed.
> FAILED: org.apache.solr.util.TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers
>
> Error Message:
> Exception during query
> [...]
> {noformat}
[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029694#comment-16029694 ]

Tomás Fernández Löbbe commented on LUCENE-7705:
-----------------------------------------------

TestMaxTokenLenTokenizer seems to be failing in Jenkins

{noformat}
1 tests failed.
FAILED:  org.apache.solr.util.TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers

Error Message:
Exception during query

Stack Trace:
java.lang.RuntimeException: Exception during query
	at __randomizedtesting.SeedInfo.seed([FE4BE1CA39C9E0DA:9499DEA5612A3015]:0)
	at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:895)
	at org.apache.solr.util.TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers(TestMaxTokenLenTokenizer.java:104)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1713)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:907)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:943)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:957)
	at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:916)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:802)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:852)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:863)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//result[@numFound=1]
xml r
{noformat}
[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027974#comment-16027974 ]

Erick Erickson commented on LUCENE-7705:
----------------------------------------

Oh bother, the commit numbers:

Commit to master: 906679adc80f0fad1e5c311b03023c7bd95633d7
Commit to 6x: 3db93a1f0980bd098af892630c6dc95c4c962e14
[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025386#comment-16025386 ]

Robert Muir commented on LUCENE-7705:
-------------------------------------

Tests such as this are not effective: if no exception is thrown, the test still passes.

{noformat}
+    try {
+      new LetterTokenizer(newAttributeFactory(), 0);
+    } catch (Exception e) {
+      assertEquals("maxTokenLen must be greater than 0 and less than 1048576 passed: 0", e.getMessage());
+    }
{noformat}

I would use expectThrows in this case instead.
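The point can be seen with a minimal stand-in for the helper: expectThrows fails when no exception is thrown, whereas the try/catch pattern above silently passes. This sketch is self-contained; Lucene's real helper lives in LuceneTestCase, and in an actual test the lambda body would construct the tokenizer (e.g. `new LetterTokenizer(newAttributeFactory(), 0)`):

```java
public class ExpectThrowsDemo {
    interface ThrowingRunnable { void run() throws Throwable; }

    // Minimal stand-in for LuceneTestCase.expectThrows: asserts that an
    // exception of the expected type IS thrown, and returns it so the
    // caller can assert on its message.
    static <T extends Throwable> T expectThrows(Class<T> expectedType, ThrowingRunnable runnable) {
        try {
            runnable.run();
        } catch (Throwable t) {
            if (expectedType.isInstance(t)) {
                return expectedType.cast(t);
            }
            throw new AssertionError("unexpected exception type: " + t.getClass().getName(), t);
        }
        // This is the case the try/catch pattern misses: no exception at all.
        throw new AssertionError("expected " + expectedType.getSimpleName() + " was not thrown");
    }

    public static void main(String[] args) {
        IllegalArgumentException e = expectThrows(IllegalArgumentException.class, () -> {
            throw new IllegalArgumentException(
                "maxTokenLen must be greater than 0 and less than 1048576 passed: 0");
        });
        System.out.println(e.getMessage());
    }
}
```

If the constructor under test stops throwing, a test written this way fails loudly instead of passing vacuously.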
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006260#comment-16006260 ] Robert Muir commented on LUCENE-7705: - I don't think it's within the scope of this issue.
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005894#comment-16005894 ] Erick Erickson commented on LUCENE-7705: Fair enough, I suppose you could get log files 20 bazillion bytes long. WDYT about generating multiple tokens when the length is exceeded?
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005892#comment-16005892 ] Robert Muir commented on LUCENE-7705: - Sorry, I think it's a very bad idea to log tokens in Lucene's analysis chains for any reason.
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005888#comment-16005888 ] Erick Erickson commented on LUCENE-7705: One nit, one suggestion, and one question in addition to Robert's comments. The nit: there's a pattern of a bunch of these: updateJ("{\"add\":{\"doc\": {\"id\":1,\"letter\":\"letter\"}},\"commit\":{}}",null); followed by: assertU(commit()); It's unnecessary to do the commit with the updateJ calls; the commit at the end will take care of it all. It's a little less efficient to commit with each doc. Frankly I doubt that'd be measurable, performance-wise, but let's take them out anyway. The suggestion: when we do stop accruing characters, e.g. in CharTokenizer, let's log an INFO-level message to that effect, something like log.info("Splitting token at {} chars", maxTokenLen); That way people will have a clue where to look. I think INFO is appropriate rather than WARN or ERROR since it's perfectly legitimate to truncate input; I'm thinking of OCR text, for instance. Maybe dump the token we've accumulated so far? I worded it as "splitting" because (and there are tests for this) the next token picks up where the first left off. So if the max length is 3 and the input is "whitespace", we get several tokens as a result: "whi", "tes", "pac", and "e". I suppose that means the offsets are also incremented. Is that really what we want here? Or should we instead throw away the extra tokens? [~rcmuir], what do you think is correct? This is not a _change_ in behavior; the current code does the same thing, just with a hard-coded 255 limit. I'm checking whether this is intended behavior. If we do want to throw away the extra, we could spin through the buffer until we encountered a non-matching char, then return the part < maxTokenLen. If we did that, we could also log the entire token and the truncated version if we wanted. Otherwise precommit passes and all tests pass. 
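The splitting behavior Erick describes can be sketched outside Lucene. This is a hypothetical chunker, not the actual CharTokenizer code; it only illustrates the "next token picks up where the first left off" semantics:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Emit successive tokens of at most maxTokenLen chars; the next token
    // starts where the previous one ended, as described for CharTokenizer.
    static List<String> split(String input, int maxTokenLen) {
        List<String> tokens = new ArrayList<>();
        for (int start = 0; start < input.length(); start += maxTokenLen) {
            tokens.add(input.substring(start, Math.min(start + maxTokenLen, input.length())));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // With maxTokenLen = 3, "whitespace" yields "whi", "tes", "pac", "e".
        System.out.println(split("whitespace", 3));  // [whi, tes, pac, e]
    }
}
```

The alternative Erick raises (throwing away the remainder) would simply drop everything after the first chunk instead of emitting the follow-on tokens.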
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005739#comment-16005739 ] Robert Muir commented on LUCENE-7705: - Why do the tokenizer ctors take {{Integer}} instead of {{int}}? They check for null and throw an exception in that case, but I think it's better to encode the correct argument type in the method signature. This would also mean you can re-enable TestRandomChains for the newly-added ctors. If basic tokenizers like WhitespaceTokenizer have new ctors that take {{int}} and throw an exception on illegal values (because some aren't permitted), then these new ctors really could use a bit better docs: {quote} @param maxTokenLen max token length the tokenizer will emit {quote} Instead, maybe something like: {quote} @param maxTokenLen maximum token length the tokenizer will emit. Must not be negative. @throws IllegalArgumentException if maxTokenLen is invalid. {quote} Also I am concerned about the allowable value ranges and the disabling of tests as far as correctness. If {{int}} is used instead of {{Integer}}, this test can be re-enabled, which may find/prevent crazy bugs. E.g. I worry a bit about what these tokenizers will do if maxTokenLen is 0 (which is allowed): will they loop forever or crash? The maximum value of the range is also suspicious. What happens with any {{CharTokenizer}}-based tokenizer if I supply a value > IO_BUFFER_SIZE? Maybe it behaves correctly, maybe not. I think {{Integer.MAX_VALUE}} is a trap for a number of reasons: enormous values will likely only blow up anyway (tokens are limited to 32766 in the index); if they don't, they may create hellacious threadlocal memory usage in buffers; bad/malicious input could take out JVMs; etc. Maybe follow the example of StandardTokenizer here: {code} /** Absolute maximum sized token */ public static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024; ... * @throws IllegalArgumentException if the given length is outside of the * range [1, {@value #MAX_TOKEN_LENGTH_LIMIT}]. ... if (length < 1) { throw new IllegalArgumentException("maxTokenLength must be greater than zero"); } else if (length > MAX_TOKEN_LENGTH_LIMIT) { throw new IllegalArgumentException("maxTokenLength may not exceed " + MAX_TOKEN_LENGTH_LIMIT); } {code}
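The StandardTokenizer-style bounds check Robert quotes can be exercised in isolation. This is a self-contained sketch of that validation pattern, not the actual Lucene source:

```java
public class MaxTokenLenCheck {
    /** Absolute maximum token length, following StandardTokenizer's limit. */
    public static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024;

    // Validate a maxTokenLength value the way StandardTokenizer does:
    // reject anything outside [1, MAX_TOKEN_LENGTH_LIMIT].
    static int checkMaxTokenLength(int length) {
        if (length < 1) {
            throw new IllegalArgumentException("maxTokenLength must be greater than zero");
        } else if (length > MAX_TOKEN_LENGTH_LIMIT) {
            throw new IllegalArgumentException("maxTokenLength may not exceed " + MAX_TOKEN_LENGTH_LIMIT);
        }
        return length;
    }

    public static void main(String[] args) {
        System.out.println(checkMaxTokenLength(256));  // fine: the old hard-coded default
        for (int bad : new int[] {0, MAX_TOKEN_LENGTH_LIMIT + 1}) {
            try {
                checkMaxTokenLength(bad);
                throw new AssertionError("expected rejection of " + bad);
            } catch (IllegalArgumentException expected) {
                System.out.println(bad + " rejected: " + expected.getMessage());
            }
        }
    }
}
```

Taking a primitive {{int}} means the null branch disappears from the signature entirely, and the only remaining failure modes are the two range checks above.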
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005672#comment-16005672 ] Erick Erickson commented on LUCENE-7705: Amrit: I'm using LUCENE-7705 (note no .patch extension). Is that the right one?
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004169#comment-16004169 ] Amrit Sarkar commented on LUCENE-7705: -- Erick, we got everything right this time.
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002994#comment-16002994 ] Erick Erickson commented on LUCENE-7705: Sorry 'bout that, rushing out the door this morning and didn't do the git add bits. We'll get it all right yet.
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951168#comment-15951168 ] Erick Erickson commented on LUCENE-7705: Thanks!
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951158#comment-15951158 ] Robert Muir commented on LUCENE-7705: - I think the LUCENE-7762 discussion is unrelated: it has to do with exposing such things again at the Analyzer level, vs. going the CustomAnalyzer route.
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951049#comment-15951049 ] Erick Erickson commented on LUCENE-7705: [~rcmuir][~mikemccand] Wanted to check whether the discussion at LUCENE-7762 applies here or not.
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15907737#comment-15907737 ] Erick Erickson commented on LUCENE-7705: May or may not depend on SOLR-10229, but we want to approach them both together.
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897762#comment-15897762 ] Erick Erickson commented on LUCENE-7705: Patch looks good. I'm going to hang back on committing this until we figure out SOLR-10229 (control schema proliferation). The additional schema you put in here is about the only way currently to test Solr schemas, so that's perfectly appropriate. I'd just like to use this as a test case for what it would take to move constructing schemas inside the tests, rather than having each new case like this require another schema that we then have to maintain. But if SOLR-10229 takes very long, I'll just commit this one and we can work out the rest later. 
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885144#comment-15885144 ] Amrit Sarkar commented on LUCENE-7705: -- I did some adjustments, including removing maxTokenLen from the LowerCaseFilterFactory init, and included hard test cases for the tokenizers at the Solr level in TestTokenizer.java: {noformat} modified: lucene/analysis/common/src/java/org/apache/lucene/analysis/core/KeywordTokenizerFactory.java modified: lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LetterTokenizer.java modified: lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LetterTokenizerFactory.java modified: lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseTokenizer.java modified: lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseTokenizerFactory.java modified: lucene/analysis/common/src/java/org/apache/lucene/analysis/core/UnicodeWhitespaceTokenizer.java modified: lucene/analysis/common/src/java/org/apache/lucene/analysis/core/WhitespaceTokenizer.java modified: lucene/analysis/common/src/java/org/apache/lucene/analysis/core/WhitespaceTokenizerFactory.java modified: lucene/analysis/common/src/java/org/apache/lucene/analysis/util/CharTokenizer.java new file: lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestKeywordTokenizer.java modified: lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestRandomChains.java modified: lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestUnicodeWhitespaceTokenizer.java modified: lucene/analysis/common/src/test/org/apache/lucene/analysis/util/TestCharTokenizers.java new file: solr/core/src/test-files/solr/collection1/conf/schema-tokenizer-test.xml new file: solr/core/src/test/org/apache/solr/util/TestTokenizer.java {noformat} I think we have covered everything; all tests pass. 
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884983#comment-15884983 ] Erick Erickson commented on LUCENE-7705: Ah, good catch. So yeah, we'll have to remove the maxTokenLen from the original args and pass the modified map through.
[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884831#comment-15884831 ] Amrit Sarkar commented on LUCENE-7705: -- Erick, For every tokenizer init, two parameters are already included in their arguments as below: {noformat} {class=solr.LowerCaseTokenizerFactory, luceneMatchVersion=7.0.0} {noformat} which is consumed by AbstractAnalysisFactory while it instantiate: {noformat} originalArgs = Collections.unmodifiableMap(new HashMap<>(args)); System.out.println("orgs:: "+originalArgs); String version = get(args, LUCENE_MATCH_VERSION_PARAM); if (version == null) { luceneMatchVersion = Version.LATEST; } else { try { luceneMatchVersion = Version.parseLeniently(version); } catch (ParseException pe) { throw new IllegalArgumentException(pe); } } args.remove(CLASS_NAME); // consume the class arg } {noformat} _class_ parameter is useless, we don't have to worry about it, while it do look up for _luceneMatchVersion_ which is kind of sanity check for the versions, not sure anything important takes place at Version::parseLeniently(version) function. If we can confirm that, we can pass empty map there. > Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the > max token length > - > > Key: LUCENE-7705 > URL: https://issues.apache.org/jira/browse/LUCENE-7705 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Amrit Sarkar >Assignee: Erick Erickson >Priority: Minor > Attachments: LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch > > > SOLR-10186 > [~erickerickson]: Is there a good reason that we hard-code a 256 character > limit for the CharTokenizer? In order to change this limit it requires that > people copy/paste the incrementToken into some new class since incrementToken > is final. > KeywordTokenizer can easily change the default (which is also 256 bytes), but > to do so requires code rather than being able to configure it in the schema. 
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes > (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) > (Factories) it would take adding a c'tor to the base class in Lucene and > using it in the factory. > Any objections? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
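The consume-the-args pattern quoted above can be sketched in isolation. `FactorySketch` and its string keys are illustrative stand-ins for AbstractAnalysisFactory, not Lucene's actual code:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the pattern in AbstractAnalysisFactory:
// snapshot the map, pull out the known keys, and fail if anything
// unrecognized is left over.
public class FactorySketch {
    final Map<String, String> originalArgs;
    final String matchVersion;

    public FactorySketch(Map<String, String> args) {
        // Keep an immutable snapshot of everything we were given.
        originalArgs = Collections.unmodifiableMap(new HashMap<>(args));
        // Consume the keys this factory understands.
        matchVersion = args.remove("luceneMatchVersion");
        args.remove("class");
        // Anything still in the map is an unknown parameter.
        if (!args.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
    }

    public static void main(String[] argv) {
        Map<String, String> args = new HashMap<>();
        args.put("class", "solr.LowerCaseTokenizerFactory");
        args.put("luceneMatchVersion", "7.0.0");
        FactorySketch f = new FactorySketch(args);
        System.out.println(f.matchVersion);        // 7.0.0
        System.out.println(f.originalArgs.size()); // 2 (snapshot is untouched)
    }
}
```

Note that the snapshot is taken before any keys are removed, which is why `originalArgs` still carries everything the schema supplied.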
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884805#comment-15884805 ] Erick Erickson commented on LUCENE-7705:
Removing the extra arguments in getMultiTermComponent() is certainly better than doing it in any of the superclasses; there you'd risk interfering with someone else's filter that _did_ coincidentally have a maxTokenLen parameter that should legitimately be passed through. So I think removing it in getMultiTermComponent() is OK. At least having the place that receives the maxTokenLength argument (i.e. the factory) be responsible for removing it before passing the args on to the filter keeps things from sprawling.

The other possibility is to just pass an empty map rather than munge the original one; LowerCaseFilterFactory's first act is to check whether the map is empty, after all. Something like:

{noformat}
return new LowerCaseFilterFactory(Collections.EMPTY_MAP);
{noformat}

(untested). I see no justification for passing the original args in this particular case anyway; I'd guess it was just convenient. The more I think about it, the more I lean toward EMPTY_MAP, but neither option is really all that superior IMO. EMPTY_MAP will be slightly more efficient, but I doubt it's really measurable.
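The two options under discussion, an empty map versus a copy with the tokenizer-only key stripped, can be sketched like this (class and method names are illustrative, not Lucene's actual API):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of both approaches for building the args handed
// to a wrapped filter factory from getMultiTermComponent().
public class MultiTermArgs {
    // Option 1: the wrapped filter gets no arguments at all.
    static Map<String, String> emptyArgs() {
        return Collections.emptyMap();
    }

    // Option 2: pass everything through except keys the filter must
    // not see (here, the tokenizer-only maxTokenLen). Copy first so
    // the caller's map is not mutated.
    static Map<String, String> strippedArgs(Map<String, String> original) {
        Map<String, String> copy = new HashMap<>(original);
        copy.remove("maxTokenLen");
        return copy;
    }

    public static void main(String[] argv) {
        Map<String, String> original = new HashMap<>();
        original.put("luceneMatchVersion", "7.0.0");
        original.put("maxTokenLen", "512");
        System.out.println(emptyArgs().isEmpty());                            // true
        System.out.println(strippedArgs(original).containsKey("maxTokenLen")); // false
        System.out.println(original.containsKey("maxTokenLen"));              // true
    }
}
```

Copying before removing matters in option 2: mutating the original map would silently change what the tokenizer factory recorded as its own arguments.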
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884739#comment-15884739 ] Amrit Sarkar commented on LUCENE-7705:
--
Erick,

I wrote the test cases and confirmed the problem. Removing "maxTokenLen" from the original arguments that initialize LowerCaseFilterFactory makes sense, and it is not a hack: the argument has to be removed somewhere before the FilterFactory is initialized, and it is better to do so where we make the call. I am not inclined to remove it inside the FilterFactory constructor or in the AbstractAnalysisFactory call. So we are left with two options: either we don't offer a maxTokenLen option for LowerCaseTokenizer, or we remove the extra argument as you have done in getMultiTermComponent(). Let me know your thoughts.
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884209#comment-15884209 ] Amrit Sarkar commented on LUCENE-7705:
--
Roger that! I will incorporate test cases for the stated tokenizers and try to fix the issue without removing _maxTokenLength_ from the method.
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884014#comment-15884014 ] Erick Erickson commented on LUCENE-7705:
Actually, I didn't encounter the error in LowerCaseFilterFactory until I tried it out from a fully-compiled Solr instance with maxTokenLen in the managed_schema file. I was thinking that it might make sense to add maxTokenLen to a couple of the schemas used by some of the test cases, leaving it at the default value of 256, just to get some test coverage. I think this is really the difference between a test case at the Lucene level and one based on the schema at the Solr level.
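A schema entry of the kind described might look like the following; the field type name is illustrative, and maxTokenLen is the attribute this patch adds to the tokenizer factories:

```xml
<!-- Hypothetical managed_schema fragment: a field type whose tokenizer
     keeps the default 256-character limit, but configured explicitly
     so the test schema exercises the new attribute. -->
<fieldType name="text_maxlen" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" maxTokenLen="256"/>
  </analyzer>
</fieldType>
```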
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883979#comment-15883979 ] Amrit Sarkar commented on LUCENE-7705:
--
Erick,

All tests now pass with the current patch uploaded, after minor corrections and rectifications in existing test classes:

{noformat}
modified:  lucene/analysis/common/src/java/org/apache/lucene/analysis/core/KeywordTokenizerFactory.java
modified:  lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LetterTokenizer.java
modified:  lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LetterTokenizerFactory.java
modified:  lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseTokenizer.java
modified:  lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseTokenizerFactory.java
modified:  lucene/analysis/common/src/java/org/apache/lucene/analysis/core/UnicodeWhitespaceTokenizer.java
modified:  lucene/analysis/common/src/java/org/apache/lucene/analysis/core/WhitespaceTokenizer.java
modified:  lucene/analysis/common/src/java/org/apache/lucene/analysis/core/WhitespaceTokenizerFactory.java
modified:  lucene/analysis/common/src/java/org/apache/lucene/analysis/util/CharTokenizer.java
new file:  lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestKeywordTokenizer.java
modified:  lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestRandomChains.java
modified:  lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestUnicodeWhitespaceTokenizer.java
modified:  lucene/analysis/common/src/test/org/apache/lucene/analysis/util/TestCharTokenizers.java
{noformat}

Test failure fixes:

1. org.apache.lucene.analysis.core.TestRandomChains (suite): added the four failing tokenizer constructors to the brokenConstructors map to bypass them without delay. This class checks which arguments are legal for the constructors and builds certain maps beforehand to check against later. It does not take boxing/unboxing of primitive types into account: when a constructor takes its parameter as _"java.lang.Integer"_, the map-building code unboxes it to _"int"_, and the check then fails because _"int.class"_ and _"java.lang.Integer.class"_ don't match, which doesn't make sense. Either we fix how the maps are created, or we skip these constructors for now.

2. The getMultiTermComponent method constructed a LowerCaseFilterFactory with the original arguments, including maxTokenLen, which then threw an error. I am not sure what corrected that, but I see no suite failing now, not even TestFactories, which I believe was the one throwing the error (incompatible constructors / no method found, etc.). Kindly verify whether we are still facing the issue or whether we need to harden the test cases for it.
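The mismatch described in point 1 can be reproduced in isolation: a constructor declared with a primitive `int` reports `int.class` through reflection, which is a different `Class` object from `Integer.class` even though autoboxing converts between the two at call sites. The `Tok` class below is a hypothetical stand-in for one of the affected tokenizers:

```java
import java.lang.reflect.Constructor;

// Demonstrates why a reflective check keyed on Integer.class misses a
// constructor declared with a primitive int parameter.
public class BoxingMismatch {
    static class Tok {
        Tok(int maxTokenLen) {} // primitive parameter, like the new tokenizer c'tors
    }

    public static void main(String[] argv) throws Exception {
        Constructor<?> c = Tok.class.getDeclaredConstructor(int.class);
        Class<?> declared = c.getParameterTypes()[0];
        System.out.println(declared == int.class);     // true
        System.out.println(declared == Integer.class); // false
        // So a legality map built with Integer.class never matches:
        System.out.println(Integer.class.equals(declared)); // false
    }
}
```

This is why either the map construction must normalize primitive and wrapper types, or the constructors must be listed in brokenConstructors to skip the check.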
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881610#comment-15881610 ] Erick Erickson commented on LUCENE-7705:
These two tests fail:

org.apache.lucene.analysis.core.TestUnicodeWhitespaceTokenizer.testParamsFactory
org.apache.lucene.analysis.core.TestRandomChains (suite)

TestUnicodeWhitespaceTokenizer fails because I added "to" to one of the exception messages, a trivial fix. No idea what's happening with TestRandomChains though. And my nocommit ensures that 'ant precommit' will fail too; that's rather the point.
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881459#comment-15881459 ] Erick Erickson commented on LUCENE-7705:
I think the patch I uploaded is the result of applying your most recent patch for SOLR-10186, but can you verify? We should probably consolidate the two; I suggest we close the Solr one as a duplicate and continue iterating here.

Erick
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881239#comment-15881239 ] Amrit Sarkar commented on LUCENE-7705:
--
I have cooked up a patch in SOLR-10186 and introduced a new constructor in CharTokenizer and the related tokenizer factories, which takes _maxCharLen_ and _factory_ as additional parameters. Kindly provide your feedback and any comments on introducing new constructors in these classes. Thanks.
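A minimal sketch of the constructor shape being proposed, assuming a configurable limit with a 256 default; `CharTokenizerSketch` is an illustrative stand-in, not Lucene's actual CharTokenizer:

```java
// Hypothetical sketch: a tokenizer base class whose buffer limit is
// configurable via a new constructor instead of hard-coded to 256.
public class CharTokenizerSketch {
    public static final int DEFAULT_MAX_TOKEN_LEN = 256;
    private final int maxTokenLen;

    // Existing behavior: callers who don't care keep the 256 default.
    public CharTokenizerSketch() {
        this(DEFAULT_MAX_TOKEN_LEN);
    }

    // New constructor: the factory passes the schema-configured limit.
    public CharTokenizerSketch(int maxTokenLen) {
        if (maxTokenLen <= 0) {
            throw new IllegalArgumentException(
                "maxTokenLen must be greater than 0, got " + maxTokenLen);
        }
        this.maxTokenLen = maxTokenLen;
    }

    public int getMaxTokenLen() {
        return maxTokenLen;
    }

    public static void main(String[] argv) {
        System.out.println(new CharTokenizerSketch().getMaxTokenLen());    // 256
        System.out.println(new CharTokenizerSketch(512).getMaxTokenLen()); // 512
    }
}
```

Chaining the no-arg constructor through the new one keeps the two paths from diverging: the default is defined in exactly one place.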