[ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060705#comment-13060705 ]
Robert Muir commented on LUCENE-3233:
-------------------------------------

{quote}
The difference in build time is surprising to me. Any theory why SynonymFilterFactory takes so much more time to build?
{quote}

Yes, it's the n^2 portion, where you have a synonym entry like this:

a, b, c, d

In reality, this creates entries like this:

a -> a
a -> b
a -> c
a -> d
b -> a
b -> b
...

In the current impl, this is done using some inefficient data structures (like nested CharArrayMaps with Token), as well as calling merge().

In the FST impl, we don't use any nested structures (instead, input and output entries are just phrases), and we explicitly deduplicate both inputs and outputs during construction. The FST output is basically just a List<Integer> pointing to ords in the deduplicated BytesRefHash. So during construction, when you add(), it's just a hashmap lookup on the input phrase, a BytesRefHash get/put on the UTF16toUTF8WithHash to get the output ord, and an append to an ArrayList.

This code isn't really optimized right now, and we can definitely speed it up even more in the future. But the main thing right now is to ensure the filter performance is good.

> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
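For readers following the comment above, here is a minimal sketch of the n^2 expansion and the ord-based deduplication it describes. It assumes plain java.util collections standing in for the FST and the BytesRefHash; the class SynonymSketch and all of its method names are hypothetical and are not taken from the patch or from Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for the construction step described in the comment; NOT Lucene's API.
final class SynonymSketch {
    // Deduplicated output phrases: phrase -> ord (plays the role of the BytesRefHash).
    private final Map<String, Integer> outputOrds = new HashMap<>();
    // Input phrase -> list of output ords (plays the role of the per-input FST output).
    private final Map<String, List<Integer>> entries = new LinkedHashMap<>();

    // Returns the ord for an output phrase, assigning the next free ord on first sight.
    int ordFor(String output) {
        Integer ord = outputOrds.get(output);
        if (ord == null) {
            ord = outputOrds.size();
            outputOrds.put(output, ord);
        }
        return ord;
    }

    // One add(): a hash lookup on the input, a get/put for the output ord, a list append.
    void add(String input, String output) {
        entries.computeIfAbsent(input, k -> new ArrayList<>()).add(ordFor(output));
    }

    // Expands a group such as "a, b, c, d" into every input -> output pair, including a -> a.
    void addGroup(List<String> group) {
        for (String input : group) {
            for (String output : group) {
                add(input, output);
            }
        }
    }

    // Number of distinct output phrases stored, however many pairs refer to them.
    int distinctOutputs() {
        return outputOrds.size();
    }

    // Total number of input -> output pairs recorded.
    int pairCount() {
        int n = 0;
        for (List<Integer> ords : entries.values()) {
            n += ords.size();
        }
        return n;
    }

    // Output ords recorded for one input phrase, or null if the input is unknown.
    List<Integer> ordsFor(String input) {
        return entries.get(input);
    }

    public static void main(String[] args) {
        SynonymSketch sketch = new SynonymSketch();
        sketch.addGroup(Arrays.asList("a", "b", "c", "d"));
        System.out.println(sketch.pairCount() + " pairs, "
                + sketch.distinctOutputs() + " distinct outputs");
    }
}
```

With a four-term group, the sketch records 16 pairs but stores only 4 distinct output phrases; each input just holds a short list of integer ords. That sharing is the point of the deduplication, though the real implementation stores UTF-8 bytes and builds an FST rather than a map.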