[
https://issues.apache.org/jira/browse/LUCENE-8092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385188#comment-16385188
]
Robert Muir commented on LUCENE-8092:
-------------------------------------
CJKBigramFilter isn't really prepared to handle an arbitrary input graph (or
maybe even synonyms): it's looking for a flat stream of tokens that may include
some CJK.
It already has a ridiculously complex job. It's like a ShingleFilter but with
crazy custom logic, and yet it does manage to support the use-case across
different tokenizer variants (StandardTokenizer, UAXURLTokenizer,
ICUTokenizer).
Maybe it should throw a clear exception if it encounters posinc=0 or poslen > 1?
It would at least make it totally clear that it won't work, rather than the
user getting a vaguer exception from IndexWriter. Ideally this would be
detected earlier though (in construction of the chain). Unfortunately it's not
so easy to simply require that its input is a tokenizer: see the CJKAnalyzer
use-case where the width filter comes before, because that impacts bigramming.
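To make the proposed check concrete, here is a rough, self-contained sketch of
what "throw a clear exception on posinc=0 or poslen > 1" could look like. It is
not the actual CJKBigramFilter code and does not use Lucene classes: the Token
class and the FlatStreamCheck name are hypothetical stand-ins. In a real filter
the two values would presumably come from PositionIncrementAttribute and
PositionLengthAttribute inside incrementToken().

```java
import java.util.List;

/** Hypothetical model of the "flat stream" check; not Lucene code. */
final class FlatStreamCheck {

    /** Stand-in for one token's term text and position attributes. */
    static final class Token {
        final String term;
        final int posInc; // position increment; 0 means stacked on the previous token (e.g. a synonym)
        final int posLen; // position length; > 1 means the token spans several positions (a graph arc)

        Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Throws a descriptive exception at the first token that makes the stream non-flat. */
    static void requireFlat(List<Token> tokens) {
        for (Token t : tokens) {
            if (t.posInc == 0) {
                throw new IllegalArgumentException(
                    "flat token stream required, but got posinc=0 for term '" + t.term + "'");
            }
            if (t.posLen > 1) {
                throw new IllegalArgumentException(
                    "flat token stream required, but got poslen=" + t.posLen
                        + " for term '" + t.term + "'");
            }
        }
    }

    /** Convenience wrapper: true if requireFlat would not throw. */
    static boolean isFlat(List<Token> tokens) {
        try {
            requireFlat(tokens);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

A check like this only fires while tokens are being consumed, so it does not
address the "detected earlier, in construction of the chain" part; it just
replaces the vague downstream failure with a message that names the offending
token.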
> TestRandomChains failure
> ------------------------
>
> Key: LUCENE-8092
> URL: https://issues.apache.org/jira/browse/LUCENE-8092
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Alan Woodward
> Priority: Major
>
> https://builds.apache.org/job/Lucene-Solr-NightlyTests-7.2/1/
> ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChains
> -Dtests.seed=C006DAD2E1FC77AF -Dtests.multiplier=2 -Dtests.nightly=true
> -Dtests.slow=true
> -Dtests.linedocsfile=/Users/romseygeek/projects/lucene-test-data/enwiki.random.lines.txt
> -Dtests.locale=tr -Dtests.timezone=Europe/Simferopol -Dtests.asserts=true
> -Dtests.file.encoding=UTF-8
> Reproduces locally on 7.2