[ 
https://issues.apache.org/jira/browse/LUCENE-3699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187824#comment-13187824
 ] 

Robert Muir commented on LUCENE-3699:
-------------------------------------

Uwe improved the memory usage a lot too (e.g. parallel arrays)... thanks for 
this!

Our uncompressed size is 9MB now, which I think is good for this dataset.

My motivation for improving the size is not that kuromoji was ever really 
wasteful; I think size simply comes with the territory for CJK (see your 
JVM/ICU if you don't believe me).

On the server, it doesn't really matter, but a smaller size can make kuromoji 
more attractive for other use cases like integration into desktop. The goal 
is also to make retrieval of attributes efficient for the analysis chain 
(e.g. getting the part of speech just means reading a short).
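
To illustrate the "reading a short" point, here is a minimal sketch, not the 
actual kuromoji code. It assumes a hypothetical fixed-width record of two 
shorts per word (a POS id and a word cost) packed into a ByteBuffer; the class 
name, record layout, and POS names are all made up for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: NOT kuromoji's real dictionary format.
final class PosLookupSketch {
    // Illustrative POS table; a real dictionary would load this from its data files.
    static final String[] POS_NAMES = {"noun", "verb", "particle"};

    // Assumed layout per word: [short posId][short wordCost] = 4 bytes.
    static final int RECORD_BYTES = 4;

    private final ByteBuffer records;

    PosLookupSketch(short[] posIds, short[] wordCosts) {
        records = ByteBuffer.allocate(posIds.length * RECORD_BYTES);
        for (int i = 0; i < posIds.length; i++) {
            records.putShort(posIds[i]);    // relative put, advances position by 2
            records.putShort(wordCosts[i]);
        }
    }

    // Part of speech is a single absolute short read plus a table lookup.
    String partOfSpeech(int wordId) {
        short posId = records.getShort(wordId * RECORD_BYTES);
        return POS_NAMES[posId];
    }

    // Word cost is the second short of the same record.
    short wordCost(int wordId) {
        return records.getShort(wordId * RECORD_BYTES + 2);
    }

    public static void main(String[] args) {
        PosLookupSketch dict = new PosLookupSketch(
                new short[] {0, 2}, new short[] {100, -50});
        System.out.println(dict.partOfSpeech(1) + " " + dict.wordCost(1));
    }
}
```

Because every record has the same width, a lookup needs no parsing and no 
per-entry String objects, which is the kind of saving parallel arrays give too.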

Finally, languages aren't static, and we can only expect dictionaries to 
grow in size in the future and maybe gain even more attributes 
(e.g. naist-jdic is 25% larger).
                
> kuromoji dictionary could be more compact
> -----------------------------------------
>
>                 Key: LUCENE-3699
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3699
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3699.patch, LUCENE-3699_more.patch
>
>
> Reading through the ipadic documentation, I realized we are storing a lot of 
> redundant information. For example, the connection costs for bigram weights 
> are based on POS+inflection data, so it's redundant to also separately 
> encode POS and inflection data for each entry.
> With the patch, dictionary access is also faster and simpler, and 
> TokenInfoDictionary is 1.5MB smaller.

