[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

Steve Rowe (JIRA) Tue, 30 May 2017 11:22:45 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029872#comment-16029872 ]


Steve Rowe commented on LUCENE-7705:
------------------------------------

Here's another reproducing failure [https://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Linux/3612]:

{noformat}
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestMaxTokenLenTokenizer -Dtests.method=testSingleFieldSameAnalyzers -Dtests.seed=FE4BE1CA39C9E0DA -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=vi-VN -Dtests.timezone=Australia/NSW -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.07s J1 | TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers <<<
   [junit4]    > Throwable #1: java.lang.RuntimeException: Exception during query
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([FE4BE1CA39C9E0DA:9499DEA5612A3015]:0)
   [junit4]    >        at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:895)
   [junit4]    >        at org.apache.solr.util.TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers(TestMaxTokenLenTokenizer.java:104)
   [junit4]    >        at java.lang.Thread.run(Thread.java:748)
   [junit4]    > Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//result[@numFound=1]
   [junit4]    >        xml response was: <?xml version="1.0" encoding="UTF-8"?>
   [junit4]    > <response>
   [junit4]    > <lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int></lst><result name="response" numFound="0" start="0"></result>
   [junit4]    > </response>
   [junit4]    >        request was:q=letter0:lett&wt=xml
   [junit4]    >        at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:888)
[...]
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene62): {lowerCase0=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128))), whiteSpace=Lucene50(blocksize=128), letter=BlockTreeOrds(blocksize=128), lowerCase=Lucene50(blocksize=128), unicodeWhiteSpace=BlockTreeOrds(blocksize=128), letter0=FST50, unicodeWhiteSpace0=FST50, keyword0=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128))), id=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128))), keyword=Lucene50(blocksize=128), whiteSpace0=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128)))}, docValues:{}, maxPointsInLeafNode=579, maxMBSortInHeap=5.430197160407458, sim=RandomSimilarity(queryNorm=false,coord=crazy): {}, locale=vi-VN, timezone=Australia/NSW
   [junit4]   2> NOTE: Linux 4.10.0-21-generic amd64/Oracle Corporation 1.8.0_131 (64-bit)/cpus=8,threads=1,free=234332328,total=536870912
{noformat}

> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the 
> max token length
> ---------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7705
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7705
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Amrit Sarkar
>            Assignee: Erick Erickson
>            Priority: Minor
>             Fix For: master (7.0), 6.7
>
>         Attachments: LUCENE-7705, LUCENE-7705.patch, LUCENE-7705.patch, 
> LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, 
> LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character 
> limit for the CharTokenizer? Changing this limit currently requires 
> copying incrementToken into a new class, since incrementToken is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but 
> doing so requires code; it cannot be configured in the schema.
> For KeywordTokenizer, this is a Solr-only change. For the CharTokenizer 
> classes (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) 
> and their factories, it would take adding a c'tor to the base class in Lucene 
> and using it in the factories.
> Any objections?
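
To make the proposal concrete, here is a minimal, self-contained sketch of a letter tokenizer whose maximum token length is passed in through the constructor instead of being hard-coded. It does not use Lucene and is not the attached patch: the class name MaxLenLetterTokenizer and its tokenize method are hypothetical, and the sketch assumes that a run of letters longer than the limit is split at the limit, which is my understanding of how CharTokenizer treats over-long tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Lucene's CharTokenizer or its API.
final class MaxLenLetterTokenizer {
    private final int maxTokenLen;

    // The limit is a constructor argument rather than a hard-coded constant.
    MaxLenLetterTokenizer(int maxTokenLen) {
        if (maxTokenLen <= 0) {
            throw new IllegalArgumentException("maxTokenLen must be greater than 0");
        }
        this.maxTokenLen = maxTokenLen;
    }

    // Splits text into runs of letters. Assumption: a run that reaches
    // maxTokenLen (counted in UTF-16 chars) is emitted and the remaining
    // letters start a new token.
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);
                if (current.length() >= maxTokenLen) {
                    // Limit reached: emit the token and start a new one.
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Under that assumption, a limit of 3 turns "letter box" into the tokens let, ter and box, while a limit of 256 leaves letter and box intact. This is why the index-time and query-time limits on a field have to be kept in mind when deciding which queries should match.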



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
