[ 
https://issues.apache.org/jira/browse/LUCENE-8092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385188#comment-16385188
 ] 

Robert Muir commented on LUCENE-8092:
-------------------------------------

CJKBigramFilter isn't really prepared to handle an arbitrary input graph (or 
maybe even synonyms): it's looking for a flat stream of tokens that may include 
some CJK. 

It already has a ridiculously complex job: it's like a ShingleFilter but with 
crazy custom logic. But it does manage to support the use-case across 
different tokenizer variants (StandardTokenizer, UAXURLTokenizer, 
ICUTokenizer). 

Maybe it should throw a clear exception if it encounters posinc=0 or 
poslen > 1? It would at least make it totally clear that it won't work, rather 
than the user getting a more vague exception from IndexWriter. Ideally this 
would be detected earlier though (in construction of the chain). Unfortunately 
it's not so easy to simply require that its input is a tokenizer: see the 
CJKAnalyzer use-case where the width-filter comes before, because that impacts 
bigramming.
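
To make the proposed check concrete, here is a minimal sketch in plain Java. 
It is not Lucene code: Tok and FlatStreamCheck are hypothetical names invented 
for illustration, and the exception type and messages are assumptions. In a 
real filter the same two conditions would presumably be tested inside 
incrementToken(), reading the values from PositionIncrementAttribute and 
PositionLengthAttribute.

```java
import java.util.List;

/** Hypothetical stand-in for a token's position attributes; not a Lucene class. */
final class Tok {
  final String term;
  final int posInc; // position increment relative to the previous token
  final int posLen; // number of positions this token spans

  Tok(String term, int posInc, int posLen) {
    this.term = term;
    this.posInc = posInc;
    this.posLen = posLen;
  }
}

/** Hypothetical helper: rejects any token stream that is not flat. */
final class FlatStreamCheck {
  private FlatStreamCheck() {}

  /**
   * Throws a descriptive exception on the first stacked token (posInc == 0)
   * or multi-position token (posLen > 1); returns normally otherwise.
   */
  static void requireFlat(List<Tok> tokens) {
    for (Tok t : tokens) {
      if (t.posInc == 0) {
        throw new IllegalArgumentException(
            "flat token stream required, but token '" + t.term + "' has posinc=0");
      }
      if (t.posLen > 1) {
        throw new IllegalArgumentException(
            "flat token stream required, but token '" + t.term
                + "' has poslen=" + t.posLen);
      }
    }
  }
}
```

A check like this only fails once a bad token is actually consumed, so it 
gives a clearer message but does not address detecting the problem at chain 
construction time.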

> TestRandomChains failure
> ------------------------
>
>                 Key: LUCENE-8092
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8092
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Alan Woodward
>            Priority: Major
>
> https://builds.apache.org/job/Lucene-Solr-NightlyTests-7.2/1/
> ant test  -Dtestcase=TestRandomChains -Dtests.method=testRandomChains 
> -Dtests.seed=C006DAD2E1FC77AF -Dtests.multiplier=2 -Dtests.nightly=true 
> -Dtests.slow=true 
> -Dtests.linedocsfile=/Users/romseygeek/projects/lucene-test-data/enwiki.random.lines.txt
>  -Dtests.locale=tr -Dtests.timezone=Europe/Simferopol -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
> Reproduces locally on 7.2


