[ https://issues.apache.org/jira/browse/LUCENE-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409361#comment-16409361 ]
Alan Woodward commented on LUCENE-8202:
---------------------------------------

TestRandomChains has found two issues:

* positionLength should be 1, rather than the shingle length. We don't have any intermediary tokens, only shingles, so we're not building graphs. TRC found this by feeding the output into FlattenGraphFilter, which then complained.
* we need to somehow limit either the length of the shingle or the number of stacked positions we iterate through, as we can otherwise get a combinatorial explosion of terms. TRC found this by feeding long strings into a decompounding filter, and then building shingles of length 11. The decompounding filter was producing up to 50 tokens in the same position, which led to 50^11 shingles being generated, resulting in OOM.

I'm not sure of the best way of dealing with the second issue. We could just limit shingle length to a maximum of 3 or 4, but that seems like too harsh a restriction. The other possibility would be to have a (configurable) maximum number of shingles emitted at a single position, and throw IllegalStateException if this is hit.

> Add a FixedShingleFilter
> ------------------------
>
>                 Key: LUCENE-8202
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8202
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: 7.4
>
>         Attachments: LUCENE-8202.patch, LUCENE-8202.patch, LUCENE-8202.patch
>
>
> In LUCENE-3475 I tried to make a ShingleGraphFilter that could accept and
> emit arbitrary graphs, while duplicating all the functionality of the
> existing ShingleFilter. This ends up being extremely hairy, and doesn't play
> well with query parsers.
> I'd like to step back and try and create a simpler shingle filter that can be
> used for index-time phrase tokenization only. It will have a single fixed
> shingle size, can deal with single-token synonyms, and won't emit unigrams.
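
The combinatorial growth described in the comment, and the second proposed remedy (a cap on shingles per position that throws IllegalStateException), can be sketched roughly as follows. This is a standalone illustration in plain Java, not the code in the attached patch and not a Lucene API: the class ShingleCapSketch, its methods, and the maxShingles parameter are all hypothetical names chosen for the example, and token stacks are modelled as simple lists of strings.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: models stacked token positions as lists of strings.
final class ShingleCapSketch {

    // Number of shingles that start at one position: the product of the
    // number of stacked tokens at each position the shingle spans.
    static BigInteger shingleCount(int[] stackSizes) {
        BigInteger total = BigInteger.ONE;
        for (int size : stackSizes) {
            total = total.multiply(BigInteger.valueOf(size));
        }
        return total;
    }

    // Builds every shingle across the given stacked positions, and throws
    // IllegalStateException as soon as more than maxShingles (an assumed,
    // configurable limit) would be emitted.
    static List<String> shingles(List<List<String>> positions, int maxShingles) {
        List<String> out = new ArrayList<>();
        build(positions, 0, "", out, maxShingles);
        return out;
    }

    private static void build(List<List<String>> positions, int depth,
                              String prefix, List<String> out, int maxShingles) {
        if (depth == positions.size()) {
            if (out.size() >= maxShingles) {
                throw new IllegalStateException(
                    "More than " + maxShingles + " shingles at a single position");
            }
            out.add(prefix);
            return;
        }
        // Each stacked token at this position extends every partial shingle.
        for (String term : positions.get(depth)) {
            String next = depth == 0 ? term : prefix + " " + term;
            build(positions, depth + 1, next, out, maxShingles);
        }
    }
}
```

With 11 positions of 50 stacked tokens each, shingleCount returns 50^11, which is the figure quoted above. Failing fast at the cap keeps memory bounded without restricting the shingle length itself, at the cost of rejecting pathological input outright.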