[ 
https://issues.apache.org/jira/browse/JOSHUA-315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15664649#comment-15664649
 ] 

Matt Post commented on JOSHUA-315:
----------------------------------

This has been addressed in commit 885389d513b5d0f3f68b59c3b17a776584b3a208. If 
you add the word "count" to the list of thrax features in the thrax config 
file, a sixth field will be extracted with the rule count, e.g.,

    [X] ||| de ||| of ||| 0.72572 0.29124 1 0 0.39357 0.17023 ||| 0-0 ||| 2565758
    [X] ||| de ||| to ||| 2.89509 2.10811 1 0 2.87285 2.08282 ||| 0-0 ||| 215020
    [X] ||| de ||| in ||| 3.11663 2.17583 1 0 2.91081 2.34837 ||| 0-0 ||| 207011
    ...

This is then used by the filter-rules.pl script (with the flag -t 100) to 
remove all rules except the 100 most frequent for each source side. This has 
been added to the pipeline. The grammars seem to be about 5% smaller, which 
should have only a positive effect on running time.
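
To make the filtering step concrete, here is a minimal Python sketch of keeping 
the k most frequent rules per source side. It is not the filter-rules.pl 
implementation. The function names (parse_rule, keep_top_k) are hypothetical, 
and the sketch assumes the count is the sixth |||-delimited field, as in the 
example above, and that the grammar fits in memory.

```python
import heapq
from collections import defaultdict


def parse_rule(line):
    """Return (source side, rule count) for one grammar line.

    Assumes six '|||'-delimited fields: LHS, source, target,
    feature values, alignment, count (hypothetical helper).
    """
    fields = [f.strip() for f in line.split("|||")]
    return fields[1], int(fields[5])


def keep_top_k(lines, k=100):
    """Keep the k highest-count rules for each source side."""
    by_source = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        source, count = parse_rule(line)
        by_source[source].append((count, line))

    kept = []
    for rules in by_source.values():
        # nlargest returns the rules sorted by descending count
        top = heapq.nlargest(k, rules, key=lambda r: r[0])
        kept.extend(rule for _, rule in top)
    return kept


rules = [
    "[X] ||| de ||| of ||| 0.72572 0.29124 1 0 0.39357 0.17023 ||| 0-0 ||| 2565758",
    "[X] ||| de ||| to ||| 2.89509 2.10811 1 0 2.87285 2.08282 ||| 0-0 ||| 215020",
    "[X] ||| de ||| in ||| 3.11663 2.17583 1 0 2.91081 2.34837 ||| 0-0 ||| 207011",
]

for rule in keep_top_k(rules, k=2):
    print(rule)
```

With k=2, only the "of" and "to" rules for source side "de" survive; the less 
frequent "in" rule is dropped.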

> Thrax keeps all rules
> ---------------------
>
>                 Key: JOSHUA-315
>                 URL: https://issues.apache.org/jira/browse/JOSHUA-315
>             Project: Joshua
>          Issue Type: Bug
>            Reporter: Matt Post
>             Fix For: 6.2
>
>
> When extracting rules, Thrax keeps *all* target-side options for each source 
> side. For large bitexts and common source sides (e.g., "de" for 
> Spanish–English), there can be tens of thousands of translations, due to 
> alignment errors and phenomena like garbage collection. The decoder throws 
> out all but the top 
> num_translation_options of these (default 20), but before doing so, it has to 
> score all the target-side options with all feature functions, including the 
> language model. This slows down "warming up" of the model and means that the 
> first sentences to use these items are very slow to translate.
> I have updated scripts/training/filter-rules.pl to filter using Thrax's 
> rarity penalty field, but it would be much better if Thrax were to keep only 
> the 100 most frequent translation options for each source side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
