[ https://issues.apache.org/jira/browse/LUCENE-8092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385188#comment-16385188 ]
Robert Muir commented on LUCENE-8092:
-------------------------------------

CJKBigramFilter isn't really prepared to handle an arbitrary input graph (or maybe even synonyms): it's looking for a flat stream of tokens that may include some CJK. It already has a ridiculously complex job. It's like a ShingleFilter but with crazy custom logic, yet it does manage to support the use case across different tokenizer variants (StandardTokenizer, UAXURLTokenizer, ICUTokenizer).

Maybe it should throw a clear exception if it encounters posinc=0 or poslen > 1? That would at least make it totally clear that it won't work, rather than the user getting a vaguer exception from IndexWriter. Ideally this would be detected earlier though (in construction of the chain). Unfortunately it's not so easy to simply require that its input is a tokenizer: see the CJKAnalyzer use case, where the width filter comes before it because that impacts bigramming.

> TestRandomChains failure
> ------------------------
>
> Key: LUCENE-8092
> URL: https://issues.apache.org/jira/browse/LUCENE-8092
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Alan Woodward
> Priority: Major
>
> https://builds.apache.org/job/Lucene-Solr-NightlyTests-7.2/1/
> ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChains
> -Dtests.seed=C006DAD2E1FC77AF -Dtests.multiplier=2 -Dtests.nightly=true
> -Dtests.slow=true
> -Dtests.linedocsfile=/Users/romseygeek/projects/lucene-test-data/enwiki.random.lines.txt
> -Dtests.locale=tr -Dtests.timezone=Europe/Simferopol -Dtests.asserts=true
> -Dtests.file.encoding=UTF-8
> Reproduces locally on 7.2
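
For illustration only, the check suggested in the comment (fail fast on posinc=0 or poslen > 1) could look roughly like the sketch below. It is a standalone model, not Lucene code and not a proposed patch: `FlatStreamCheck`, `Token`, and `requireFlatStream` are hypothetical names invented for this example. In a real TokenFilter the two values would presumably be read from PositionIncrementAttribute and PositionLengthAttribute inside incrementToken().

```java
import java.util.Arrays;
import java.util.List;

public class FlatStreamCheck {

    /** Hypothetical stand-in for one token's position attributes (not a Lucene class). */
    public static final class Token {
        public final String term;
        public final int posInc; // position increment; 0 means stacked on the previous token
        public final int posLen; // position length; > 1 means the token spans several positions

        public Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /**
     * Throws a descriptive exception as soon as the stream stops being flat,
     * i.e. on the first token with posInc == 0 or posLen > 1.
     */
    public static void requireFlatStream(List<Token> tokens) {
        for (Token t : tokens) {
            if (t.posInc == 0) {
                throw new IllegalArgumentException(
                    "flat token stream required, but token '" + t.term + "' has posinc=0");
            }
            if (t.posLen > 1) {
                throw new IllegalArgumentException(
                    "flat token stream required, but token '" + t.term
                        + "' has poslen=" + t.posLen);
            }
        }
    }

    public static void main(String[] args) {
        // A flat stream passes silently.
        requireFlatStream(Arrays.asList(new Token("a", 1, 1), new Token("b", 1, 1)));

        // A stacked synonym (posinc=0) is rejected with a clear message.
        try {
            requireFlatStream(Arrays.asList(new Token("a", 1, 1), new Token("syn", 0, 1)));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is only that the failure names the offending token and attribute at the filter itself, instead of surfacing later as a less specific error during indexing. It does not address the earlier detection at chain-construction time that the comment calls ideal.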