[ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-3233:
--------------------------------

    Attachment: LUCENE-3233.patch

Here's a rough start at building a data structure that I think makes good tradeoffs between RAM and processing.

No matter what, the processing on the filter side will be hairy because of the 'interleaving' with the TokenStream.

This one is just an FST<CharsRef,Int[]>(BYTE4), where each Int is an ord into a BytesRefHash containing the output bytes for each term. This way, at input time we can walk the FST with codePointAt().

On both sides, the chars/bytes are actually phrases, using \u0000 as a word separator.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3233.patch
>
>
> The current synonyms filter uses a lot of RAM and CPU, especially at build
> time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and
> outputs.
> And we should be more efficient with the TokenStream API, e.g. using
> save/restoreState instead of cloneAttributes()
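
For readers who want a concrete picture of the layout described in the comment, here is a minimal sketch. It is not the attached patch and does not use Lucene: a plain HashMap-based trie stands in for the FST, and a list plus a map stands in for BytesRefHash. The class name SynonymMapSketch and its methods (phrase, add, lookup, distinctOutputs) are hypothetical and exist only for illustration. What it does keep from the description is the walk over input code points with codePointAt(), the per-input list of integer ords, the shared output phrases stored once as bytes, and \u0000 as the word separator inside phrases.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative stand-in only (assumption): a trie replaces the FST and a
 * list + map replaces BytesRefHash. None of these names come from Lucene.
 */
final class SynonymMapSketch {
    /** Word separator inside a phrase, as described in the comment. */
    static final char WORD_SEP = '\u0000';

    /** One trie node: transitions keyed by code point, plus output ords. */
    private static final class Node {
        final Map<Integer, Node> next = new HashMap<>();
        final List<Integer> ords = new ArrayList<>();
    }

    private final Node root = new Node();
    // Output phrases are stored once and referenced by ord, so inputs share them.
    private final Map<String, Integer> ordByPhrase = new HashMap<>();
    private final List<byte[]> outputs = new ArrayList<>();

    /** Joins words into a single phrase using the separator. */
    static String phrase(String... words) {
        return String.join(String.valueOf(WORD_SEP), words);
    }

    /** Maps an input phrase to one more output phrase. */
    void add(String input, String output) {
        Integer ord = ordByPhrase.get(output);
        if (ord == null) {
            ord = outputs.size();
            outputs.add(output.getBytes(StandardCharsets.UTF_8));
            ordByPhrase.put(output, ord);
        }
        Node node = root;
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            node = node.next.computeIfAbsent(cp, k -> new Node());
            i += Character.charCount(cp);
        }
        if (!node.ords.contains(ord)) {
            node.ords.add(ord);
        }
    }

    /** Walks the trie one code point at a time and decodes the outputs. */
    List<String> lookup(String input) {
        Node node = root;
        for (int i = 0; i < input.length() && node != null; ) {
            int cp = input.codePointAt(i);
            node = node.next.get(cp);
            i += Character.charCount(cp);
        }
        List<String> result = new ArrayList<>();
        if (node != null) {
            for (int ord : node.ords) {
                result.add(new String(outputs.get(ord), StandardCharsets.UTF_8));
            }
        }
        return result;
    }

    /** Number of distinct output phrases stored. */
    int distinctOutputs() {
        return outputs.size();
    }
}
```

A trie shares only input prefixes, whereas an FST would also share suffixes and pack the arcs far more compactly, so this sketch shows the shape of the mapping rather than the RAM savings the issue is after.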