[jira] [Commented] (LUCENE-3699) kuromoji dictionary could be more compact

Dawid Weiss (Commented) (JIRA) Tue, 17 Jan 2012 00:07:29 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-3699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187522#comment-13187522
 ]


Dawid Weiss commented on LUCENE-3699:
-------------------------------------

If this is something that is statically compiled (in batch mode), then one 
could reorder states (nodes) to minimize the variable-length-encoded size of 
arc pointers globally. I did this for fst5 automata and it worked very well: 
the distribution of node in-degrees is exponential-like, so moving a few nodes 
with many in-links decreases the global automaton size significantly.
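
To make the size argument concrete, here is a minimal sketch. It assumes arc 
pointers are absolute offsets stored with 7 payload bits per byte (similar in 
spirit to a VInt); the actual fst5 encoding may differ, and `vint_len` and 
`pointer_bytes` are hypothetical helper names, not part of any library.

```python
def vint_len(value: int) -> int:
    """Bytes needed for a non-negative integer at 7 payload bits per byte."""
    if value < 0:
        raise ValueError("value must be non-negative")
    length = 1
    while value >= 0x80:
        value >>= 7
        length += 1
    return length


def pointer_bytes(arcs, address):
    """Total bytes spent on arc target pointers.

    arcs:    iterable of (source, target) state ids
    address: dict mapping state id -> offset in the serialized automaton
             (assumed layout, for illustration only)
    """
    return sum(vint_len(address[target]) for _, target in arcs)


# A "hub" state referenced by 1000 arcs: its offset decides the pointer cost.
hub_arcs = [(source, "hub") for source in range(1000)]
far = pointer_bytes(hub_arcs, {"hub": 100_000})  # 3 bytes per pointer
near = pointer_bytes(hub_arcs, {"hub": 5})       # 1 byte per pointer
```

Under these assumptions, moving the single hub state near the front shrinks 
its incoming pointers from three bytes to one each.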

I don't think there is any fast algorithm to do this. I used a simple 
heuristic: calculate the in-link degree of each state, sort in descending 
order, then reorder the top N nodes so that they are at the front of the 
serialized automaton. Pick N using any heuristic you like (a constant, an 
in-link cutoff, and so on); I used a sort of simulated annealing approach and 
probed around.
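
The heuristic above can be sketched roughly as follows. This is an 
illustration under simplifying assumptions, not the fst5 code: a state's 
"address" is just its position in the serialization order, pointer cost is 7 
payload bits per byte, and N is picked by a plain scan over candidate values 
instead of simulated annealing. All function names are hypothetical.

```python
from collections import Counter


def vint_len(value: int) -> int:
    """Bytes needed for a non-negative integer at 7 payload bits per byte."""
    return max(1, (value.bit_length() + 6) // 7)


def reorder_by_in_degree(states, arcs, n):
    """Return a new order with the n most-referenced states moved to the front.

    states: list of state ids in their current serialization order
    arcs:   list of (source, target) state id pairs
    """
    in_degree = Counter(target for _, target in arcs)
    # sorted() is stable, so states with equal in-degree keep their order.
    ranked = sorted(states, key=lambda s: in_degree[s], reverse=True)
    top = ranked[:n]
    moved = set(top)
    return top + [s for s in states if s not in moved]


def total_pointer_bytes(order, arcs):
    """Pointer cost when each state's address is its index in `order`."""
    position = {state: index for index, state in enumerate(order)}
    return sum(vint_len(position[target]) for _, target in arcs)


def best_prefix(states, arcs, candidates):
    """Probe candidate values of N and keep the one with the smallest cost."""
    return min(
        candidates,
        key=lambda n: total_pointer_bytes(
            reorder_by_in_degree(states, arcs, n), arcs
        ),
    )


# Toy automaton: 300 states, 50 arcs into state 299, one arc into state 0.
states = list(range(300))
arcs = [(source, 299) for source in range(50)] + [(299, 0)]
before = total_pointer_bytes(states, arcs)
after = total_pointer_bytes(reorder_by_in_degree(states, arcs, 1), arcs)
```

In a real serialized automaton the offsets depend on the pointer sizes 
themselves, so the cost function would have to re-serialize (or estimate) the 
layout for each candidate N.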

The presentation about the paper in question is here:
http://ciaa-fsmnlp-2011.univ-tours.fr/ciaa/upload/files/Weiss-Daciuk.pdf

I can't publish the PDF of the paper publicly (Springer, link below), but I 
can send a copy if somebody is interested. The concept should be clear without 
the paper anyway :)
http://www.springerlink.com/content/60r47952k610l822/
                
> kuromoji dictionary could be more compact
> -----------------------------------------
>
>                 Key: LUCENE-3699
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3699
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3699.patch, LUCENE-3699_more.patch
>
>
> Reading through the ipadic documentation, I realized we are storing a lot of 
> redundant information. For example, the connection costs for bigram weights 
> are based on POS+inflection data, so it's redundant to also separately 
> encode POS and inflection data for each entry.
> With the patch, dictionary access is also faster and simpler, and 
> TokenInfoDictionary is 1.5MB smaller.

