[jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™

Robert Muir (JIRA) Wed, 06 Jul 2011 10:29:42 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060705#comment-13060705
 ]


Robert Muir commented on LUCENE-3233:
-------------------------------------

{quote}
The difference in build time is surprising to me. Any theory why 
SynonymFilterFactory takes so much more time to build?
{quote}

Yes, it's the n^2 portion. When you have a synonym entry like this: a, b, c, d
in reality it creates entries like this:
a -> a
a -> b
a -> c
a -> d
b -> a
b -> b
...
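
To make the quadratic growth concrete, here is a minimal sketch of that expansion. The class and method names (SynonymExpansion, expand) are hypothetical and are not taken from the patch; it only illustrates that a group of n equivalent terms produces n * n mappings, self-mappings included.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not code from the LUCENE-3233 patch.
final class SynonymExpansion {

    /** Expands one equivalence group into every (input, output) pair, self-mappings included. */
    static List<String[]> expand(List<String> group) {
        List<String[]> pairs = new ArrayList<>(group.size() * group.size());
        for (String input : group) {
            for (String output : group) {
                pairs.add(new String[] {input, output});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Prints each mapping for the group "a, b, c, d" as "input -> output".
        for (String[] pair : expand(Arrays.asList("a", "b", "c", "d"))) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

A 4-term group yields 16 mappings, so a file with many large groups multiplies quickly.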

In the current impl, this is done using some inefficient data structures (like 
nested CharArrayMaps with Token),
as well as calling merge().

In the FST impl, we don't use any nested structures (instead, input and output 
entries are just phrases), and we explicitly 
deduplicate both inputs and outputs during construction. The FST output is 
basically just a List<Integer> pointing to ords in the deduplicated BytesRefHash.

So during construction, when you add(), it's just a hashmap lookup on the input 
phrase, a BytesRefHash get/put on the UTF16toUTF8WithHash result
to get the output ord, and an append to an ArrayList.
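
A rough sketch of that add() path, under loud assumptions: plain java.util collections stand in for both the BytesRefHash and the per-input ord list, Strings stand in for the UTF-8 bytes, and the class name DedupSynonymBuilder and its methods are hypothetical. This is not the patch's code and does not use Lucene's API; it only shows the shape of "lookup input, get-or-assign output ord, append".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in; the real implementation uses Lucene's BytesRefHash and an FST.
final class DedupSynonymBuilder {
    // Stand-in for the deduplicated BytesRefHash: output phrase -> ord.
    private final Map<String, Integer> outputOrds = new HashMap<>();
    // Reverse table: ord -> output phrase.
    private final List<String> outputsByOrd = new ArrayList<>();
    // Input phrase -> list of output ords (what would become the FST output).
    private final Map<String, List<Integer>> inputs = new HashMap<>();

    /** Records one input -> output mapping, storing each distinct output phrase only once. */
    void add(String input, String output) {
        Integer ord = outputOrds.get(output);      // "get" on the dedup table
        if (ord == null) {                         // unseen output: "put" and assign the next ord
            ord = outputsByOrd.size();
            outputOrds.put(output, ord);
            outputsByOrd.add(output);
        }
        // Hashmap lookup on the input phrase, then an append to its ord list.
        inputs.computeIfAbsent(input, k -> new ArrayList<>()).add(ord);
    }

    int uniqueOutputs() {
        return outputsByOrd.size();
    }

    List<Integer> ordsFor(String input) {
        return inputs.getOrDefault(input, Collections.<Integer>emptyList());
    }

    String phrase(int ord) {
        return outputsByOrd.get(ord);
    }
}
```

Feeding all 16 mappings from the "a, b, c, d" group through add() stores only 4 distinct output phrases; each input then holds a short list of integers rather than nested maps of tokens.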

This code isn't really optimized right now, and we can definitely speed it up 
even more in the future. But the main thing
right now is to ensure the filter performance is good.


> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
> LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
> LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build 
> time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and 
> outputs.
> And we should be more efficient with the tokenStream api, e.g. using 
> save/restoreState instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
