[ https://issues.apache.org/jira/browse/LUCENE-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409361#comment-16409361 ]

Alan Woodward commented on LUCENE-8202:
---------------------------------------

TestRandomChains has found two issues:
 * positionLength should be 1, rather than the shingle length.  We don't have 
any intermediary tokens, only shingles, so we're not building graphs.  TRC 
found this by feeding the output into FlattenGraphFilter, which then complained.
 * we need to somehow limit either the length of the shingle or the number of 
stacked positions we iterate through, as we can otherwise get a combinatorial 
explosion of terms.  TRC found this by feeding long strings into a 
decompounding filter, and then building shingles of length 11.  The 
decompounding filter was producing up to 50 tokens in the same position, which 
led to as many as 50^11 shingles being generated, resulting in an OOM.  I'm not 
sure of the best way of dealing with this one, though.  We could just limit 
shingle length to a maximum of 3 or 4, but that seems like too harsh a 
restriction.  The other possibility would be to have a (configurable) maximum 
number of shingles emitted at a single position, and throw 
IllegalStateException if this is hit.
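
To make the second option concrete, here is a minimal, self-contained sketch of 
a per-position cap.  It is not code from the patch and does not use any Lucene 
classes: ShingleCapSketch, countShingles, shingles and the maxShingles 
parameter are all hypothetical names chosen for illustration.  It only shows 
how the number of shingles grows as the product of the stack sizes, and how a 
limit could fail fast instead of running out of memory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of LUCENE-8202.patch.
final class ShingleCapSketch {

  /** Shingles starting at one position: the product of the token stack sizes. */
  static long countShingles(int[] stackSizes) {
    long total = 1;
    for (int size : stackSizes) {
      // multiplyExact throws ArithmeticException instead of silently overflowing
      total = Math.multiplyExact(total, (long) size);
    }
    return total;
  }

  /**
   * Builds every shingle over the given stacked positions, throwing
   * IllegalStateException as soon as more than maxShingles would be emitted.
   */
  static List<String> shingles(List<List<String>> positions, int maxShingles) {
    List<String> out = new ArrayList<>();
    build(positions, 0, new StringBuilder(), out, maxShingles);
    return out;
  }

  private static void build(List<List<String>> positions, int depth,
                            StringBuilder prefix, List<String> out, int maxShingles) {
    if (depth == positions.size()) {
      if (out.size() >= maxShingles) {
        throw new IllegalStateException(
            "Too many shingles at a single position; limit is " + maxShingles);
      }
      out.add(prefix.toString());
      return;
    }
    int mark = prefix.length();
    for (String token : positions.get(depth)) {
      if (depth > 0) {
        prefix.append(' ');
      }
      prefix.append(token);
      build(positions, depth + 1, prefix, out, maxShingles);
      prefix.setLength(mark); // backtrack before trying the next stacked token
    }
  }
}
```

With eleven positions of 50 stacked tokens each, countShingles returns 50^11, 
which is about 4.9 x 10^18, so any realistic cap would trip long before the 
heap is exhausted.  Whether hitting the cap should throw or silently truncate 
is a separate design question that the sketch does not try to settle.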

> Add a FixedShingleFilter
> ------------------------
>
>                 Key: LUCENE-8202
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8202
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: 7.4
>
>         Attachments: LUCENE-8202.patch, LUCENE-8202.patch, LUCENE-8202.patch
>
>
> In LUCENE-3475 I tried to make a ShingleGraphFilter that could accept and 
> emit arbitrary graphs, while duplicating all the functionality of the 
> existing ShingleFilter.  This ends up being extremely hairy, and doesn't play 
> well with query parsers.
> I'd like to step back and try to create a simpler shingle filter that can be 
> used for index-time phrase tokenization only.  It will have a single fixed 
> shingle size, can deal with single-token synonyms, and won't emit unigrams.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
