[jira] [Commented] (SOLR-2968) Hunspell very high memory use when loading dictionary

Robert Muir (Commented) (JIRA) Wed, 14 Dec 2011 05:14:03 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169314#comment-13169314
 ]


Robert Muir commented on SOLR-2968:
-----------------------------------

{quote}
By comparison Stempel using the same dictionary file works just fine with 1/8 
of that (and possibly lower values as well).
{quote}

I imagine Stempel's Trie is good, but have you also compared Morfologik 
(http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/morfologik/)?
Its precompiled FST might be the most space-efficient option for Polish.

But really I think Hunspell's dictionary structure should be more efficient. 
We could build the FST on the fly if case-insensitive mode is off, but when 
it is on, entries must be merged.
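
To illustrate the merging step, here is a minimal sketch in plain Java. It is 
not Lucene or Solr code: the class name CaseFoldingMerge and its merge method 
are hypothetical, and it assumes .dic lines of the form word/FLAGS with 
single-character flags, plus an FST builder that wants unique keys in sorted 
order. In case-insensitive mode, words are lowercased and the flags of 
colliding entries are unioned.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of Lucene or Solr.
final class CaseFoldingMerge {

    /**
     * Turns raw dictionary lines ("word/FLAGS" or just "word") into a sorted
     * map of unique keys to merged flag strings.
     */
    static SortedMap<String, String> merge(List<String> dicLines, boolean ignoreCase) {
        SortedMap<String, TreeSet<Character>> merged = new TreeMap<>();
        for (String line : dicLines) {
            int slash = line.indexOf('/');
            String word = slash < 0 ? line : line.substring(0, slash);
            String flags = slash < 0 ? "" : line.substring(slash + 1);
            // Assumption: case folding is a simple locale-independent lowercase.
            String key = ignoreCase ? word.toLowerCase(Locale.ROOT) : word;
            TreeSet<Character> set = merged.computeIfAbsent(key, k -> new TreeSet<>());
            for (char c : flags.toCharArray()) {
                set.add(c); // union the flags of entries that share a key
            }
        }
        // Flatten each flag set back into a string, keeping keys sorted.
        SortedMap<String, String> out = new TreeMap<>();
        for (Map.Entry<String, TreeSet<Character>> e : merged.entrySet()) {
            StringBuilder sb = new StringBuilder();
            for (char c : e.getValue()) {
                sb.append(c);
            }
            out.put(e.getKey(), sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Kot/AB", "kot/BC", "dom/A");
        // prints {dom=A, kot=ABC}
        System.out.println(merge(lines, true));
    }
}
```

With ignoreCase set to false, no two entries collide, so each line could be 
handed to the builder as soon as it is read, provided the input is already 
sorted.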

Instead, it might be better for the Hunspell code to support loading FSTs 
(where we would do any case-sensitivity tweaking and merging of entries, then 
build the FST).
It might be possible to re-use some of the same code from SOLR-2888 that does a 
similar thing to build a suggester FST.

In my opinion it's worth it to build the FST not just for the words, but also 
for the affixes (in some files these are humongous too!).

For Lucene I think we would just allow HunspellDictionary to also be 
instantiated from these FST input streams. The Solr factory / configuration 
would need to be tweaked to make this easy and intuitive.

                
> Hunspell very high memory use when loading dictionary
> -----------------------------------------------------
>
>                 Key: SOLR-2968
>                 URL: https://issues.apache.org/jira/browse/SOLR-2968
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 3.5
>            Reporter: Maciej Lisiewski
>            Priority: Minor
>
> The Hunspell stemmer requires gigantic (for the task) amounts of memory to 
> load dictionary/rules files. 
> For example, loading a 4.5 MB Polish dictionary (with an empty index!) will 
> cause the whole core to crash with various out-of-memory errors unless you 
> set the max heap size close to 2GB or more.
> By comparison Stempel using the same dictionary file works just fine with 1/8 
> of that (and possibly lower values as well).
> Sample error log entries:
> http://pastebin.com/fSrdd5W1
> http://pastebin.com/Lmi0re7Z

