[ 
https://issues.apache.org/jira/browse/LUCENE-3699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187532#comment-13187532
 ] 

Robert Muir commented on LUCENE-3699:
-------------------------------------

Dawid, currently the FST is not really the biggest culprit:

{noformat}
-rw-r--r--   1 rmuir  staff    65568 Jan 16 16:35 CharacterDefinition.dat
-rw-r--r--   1 rmuir  staff  2624540 Jan 16 16:35 ConnectionCosts.dat
-rw-r--r--   1 rmuir  staff  4337216 Jan 17 03:22 TokenInfoDictionary$buffer.dat
-rw-r--r--   1 rmuir  staff  1954846 Jan 16 16:35 TokenInfoDictionary$fst.dat
-rw-r--r--   1 rmuir  staff    54870 Jan 16 16:35 TokenInfoDictionary$posDict.dat
-rw-r--r--   1 rmuir  staff   392165 Jan 17 03:22 TokenInfoDictionary$targetMap.dat
-rw-r--r--   1 rmuir  staff      311 Jan 17 03:22 UnknownDictionary$buffer.dat
-rw-r--r--   1 rmuir  staff     4111 Jan 16 16:35 UnknownDictionary$posDict.dat
-rw-r--r--   1 rmuir  staff       69 Jan 16 16:35 UnknownDictionary$targetMap.dat
{noformat}

As far as the FST goes, our output is just an increasing ord (according to
term sort order), so I think it should be pretty good? Is there something
more efficient than this?

Basically there are about 330k headwords and 390k words, so some surface
forms have several entries that differ in part of speech, reading, etc.

The $fst.dat is currently an FST<int>, where the int is just an ord into
$targetMap.dat, which is really an int[][] (it maps the output ord from the
FST to an int[] containing the offsets of all word entries for that surface
form).
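
To make the two-step lookup concrete, here is a minimal sketch of the chain
surface form -> ord -> int[] of offsets -> entry in the buffer. It is not the
actual kuromoji code: a TreeMap stands in for the FST, the class and method
names (DictionaryLookupSketch, offsetsFor, wordCostAt) are hypothetical, and
the entry layout (a 2-byte cost at the start of each entry) is assumed purely
for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.TreeMap;

public class DictionaryLookupSketch {
    // Stand-in for $fst.dat: surface form -> ord, ords assigned in term sort order.
    // (A real FST is far more compact; a TreeMap just shows the mapping.)
    private final Map<String, Integer> surfaceToOrd = new TreeMap<>();

    // Stand-in for $targetMap.dat: ord -> offsets of all entries for that surface form.
    private final int[][] targetMap;

    // Stand-in for $buffer.dat: packed per-entry metadata.
    private final ByteBuffer buffer;

    public DictionaryLookupSketch(String[] sortedSurfaces, int[][] targetMap, ByteBuffer buffer) {
        // The input is assumed to be sorted already, so the array index is the ord.
        for (int ord = 0; ord < sortedSurfaces.length; ord++) {
            surfaceToOrd.put(sortedSurfaces[ord], ord);
        }
        this.targetMap = targetMap;
        this.buffer = buffer;
    }

    /** Returns the buffer offsets of every entry sharing this surface form (empty if unknown). */
    public int[] offsetsFor(String surface) {
        Integer ord = surfaceToOrd.get(surface);
        return ord == null ? new int[0] : targetMap[ord];
    }

    /** Assumed layout: each entry starts with its word cost as a 2-byte short. */
    public short wordCostAt(int offset) {
        return buffer.getShort(offset);
    }
}
```

A surface form with several parts of speech or readings simply gets more than
one offset in its int[], which is how 330k headwords can cover 390k words.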

But the 'meat' describing the entries is in $buffer.dat. For each word this
is its cost, part of speech, base form (stem), reading, pronunciation, etc.
As you can see, we are down to about 11 bytes per lemma on average, but this
'metadata' is still the biggest piece; that's what I was working on shrinking
in this issue.
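
The per-entry averages can be checked against the listing above. The sketch
below divides the file sizes from the listing by the entry counts; note that
those counts are only the rounded figures quoted in this comment (about 330k
headwords, about 390k words), so the results are estimates. The class and
method names are made up for the example.

```java
public class BytesPerEntry {
    /** Average number of bytes one entry occupies in a file of the given size. */
    static double bytesPerEntry(long fileBytes, long entries) {
        return (double) fileBytes / entries;
    }

    public static void main(String[] args) {
        long words = 390_000L;      // approximate, from the comment
        long headwords = 330_000L;  // approximate, from the comment

        // $buffer.dat holds one metadata record per word.
        System.out.printf("buffer:    %.2f bytes/word%n",
                bytesPerEntry(4_337_216L, words));
        // $fst.dat and $targetMap.dat are keyed by surface form (headword).
        System.out.printf("fst:       %.2f bytes/headword%n",
                bytesPerEntry(1_954_846L, headwords));
        System.out.printf("targetMap: %.2f bytes/headword%n",
                bytesPerEntry(392_165L, headwords));
    }
}
```

With these figures the buffer comes out at roughly 11 bytes per word, which
matches the number above and is more than the FST and target map combined.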

                
> kuromoji dictionary could be more compact
> -----------------------------------------
>
>                 Key: LUCENE-3699
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3699
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3699.patch, LUCENE-3699_more.patch
>
>
> Reading through the ipadic documentation, I realized we are storing a lot
> of redundant information. For example, the connection costs for bigram
> weights are based on POS+inflection data, so it's redundant to also
> separately encode POS and inflection data for each entry.
> With the patch the dictionary access is also faster and simpler, and
> TokenInfoDictionary is 1.5MB smaller.

