[ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060471#comment-13060471
 ] 

Michael McCandless commented on LUCENE-3233:
--------------------------------------------

bq. java.lang.IllegalStateException: max arc size is too large (445)

Ahh -- to fix this we have to call Builder.setAllowArrayArcs(false), ie, 
disable the array arcs in the FST (and thus the binary search lookup for 
finding arcs!).  I had to do this also for MemoryCodec, since postings encoded 
as output per arc can be more than 255 bytes, in general.
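
To illustrate the trade-off, here is a minimal, self-contained sketch of the 
two ways a node's arcs could be laid out.  It is a toy model, not Lucene's 
actual byte format: the class name ArcLookupSketch, the [label][outputLen][output] 
layout, and all method names are assumptions made up for this example.  Packed 
arcs have varying lengths and must be scanned in order; fixed-width "array" 
arcs can be indexed directly, which is what makes binary search possible.

```java
/** Toy model of one FST node's arcs; NOT Lucene's real encoding. */
final class ArcLookupSketch {

    /** Packed layout: [label][outputLen][output bytes...] per arc, back to back. */
    static byte[] packArcs(byte[] labels, byte[][] outputs) {
        int total = 0;
        for (byte[] out : outputs) {
            total += 2 + out.length;
        }
        byte[] buf = new byte[total];
        int pos = 0;
        for (int i = 0; i < labels.length; i++) {
            buf[pos++] = labels[i];
            buf[pos++] = (byte) outputs[i].length;
            System.arraycopy(outputs[i], 0, buf, pos, outputs[i].length);
            pos += outputs[i].length;
        }
        return buf;
    }

    /** Array layout: every arc occupies exactly bytesPerArc bytes (zero padded). */
    static byte[] arrayArcs(byte[] labels, byte[][] outputs, int bytesPerArc) {
        byte[] buf = new byte[labels.length * bytesPerArc];
        for (int i = 0; i < labels.length; i++) {
            int pos = i * bytesPerArc;
            buf[pos] = labels[i];
            buf[pos + 1] = (byte) outputs[i].length;
            System.arraycopy(outputs[i], 0, buf, pos + 2, outputs[i].length);
        }
        return buf;
    }

    /** Linear scan over packed arcs; returns the arc's offset, or -1. */
    static int findPacked(byte[] buf, byte label) {
        int pos = 0;
        while (pos < buf.length) {
            if (buf[pos] == label) {
                return pos;
            }
            pos += 2 + (buf[pos + 1] & 0xFF); // skip label, length byte, output
        }
        return -1;
    }

    /** Binary search over fixed-width arcs (labels sorted ascending, unsigned). */
    static int findArray(byte[] buf, int bytesPerArc, byte label) {
        int lo = 0;
        int hi = buf.length / bytesPerArc - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = (buf[mid * bytesPerArc] & 0xFF) - (label & 0xFF);
            if (cmp == 0) {
                return mid * bytesPerArc;
            }
            if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return -1;
    }
}
```

With array arcs disabled, lookups fall back to something like findPacked, 
which is linear in the number of arcs leaving the node.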

This will hurt perf, ie, the arc lookup cannot use a binary search.  It's 
because of a silly limitation in the FST representation: we use a single byte 
to hold the max size of all arcs, so if any arc is > 255 bytes we are unable 
to encode the arcs as an array.  We could fix this (eg, use a vInt); however, 
arcs with such widely varying sizes (due to widely varying outputs on each arc) 
would be very wasteful in space, because every arc uses up the same fixed 
number of bytes (the size of the largest arc) when represented as an array.
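
A rough back-of-the-envelope sketch of that space cost, in plain Java.  The 
class and method names are made up for illustration, and the arc sizes 3 and 5 
are hypothetical; only the 445 comes from the exception message above.  
vIntLength assumes the usual 7-bits-per-byte variable-length int encoding.

```java
/** Toy arithmetic for array-arc space cost; names are illustrative only. */
final class ArcSizeSketch {

    /** A single unsigned byte can record a max arc size of at most 255. */
    static boolean fitsInOneByte(int maxArcBytes) {
        return maxArcBytes >= 0 && maxArcBytes <= 255;
    }

    /** Bytes used when arcs are written back to back at their own sizes. */
    static int packedBytes(int[] arcBytes) {
        int sum = 0;
        for (int size : arcBytes) {
            sum += size;
        }
        return sum;
    }

    /** Bytes used when every arc is padded to the size of the largest arc. */
    static int arrayBytes(int[] arcBytes) {
        int max = 0;
        for (int size : arcBytes) {
            max = Math.max(max, size);
        }
        return max * arcBytes.length;
    }

    /** Number of bytes a non-negative value needs as a vInt (7 bits per byte). */
    static int vIntLength(int value) {
        int n = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            n++;
        }
        return n;
    }
}
```

For hypothetical arc sizes {3, 5, 445}, packedBytes gives 453 while arrayBytes 
gives 1335, so a vInt header (2 bytes for 445) would lift the limit but the 
padding would roughly triple the node's size.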

For now I think we should just call the above method, and then test the 
resulting perf.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
> LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
> LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of RAM and CPU, especially at build 
> time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and 
> outputs.
> And we should be more efficient with the TokenStream API, e.g. using 
> save/restoreState instead of cloneAttributes()



