[ https://issues.apache.org/jira/browse/JOSHUA-315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15673452#comment-15673452 ]
Matt Post commented on JOSHUA-315:
----------------------------------

Yeah, I had expected bigger savings, too. I should quantify it in terms of runtime, as well.

> Thrax keeps all rules
> ---------------------
>
>                 Key: JOSHUA-315
>                 URL: https://issues.apache.org/jira/browse/JOSHUA-315
>             Project: Joshua
>          Issue Type: Bug
>            Reporter: Matt Post
>             Fix For: 6.2
>
>
> When extracting rules, Thrax keeps *all* target-side options for each source
> side. For large bitexts and common source sides (e.g., "de" for
> Spanish–English), there can be tens of thousands of translations, due to
> errors in the alignments and phenomena like garbage collection. The decoder
> throws out all but the top num_translation_options of these (default 20), but
> before doing so, it has to score all the target-side options with all feature
> functions, including the language model. This slows down "warming up" of the
> model and means that the first sentences to use these items are very slow to
> translate.
> I have updated scripts/training/filter-rules.pl to filter rules using Thrax's
> rarity penalty field, but it would be much better if Thrax were to keep only
> the 100 most frequent translation options for each source side.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
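
For illustration only, the per-source-side cap proposed in the issue could be sketched roughly as below. This is not Thrax's or filter-rules.pl's actual code: the function name `keep_top_translations` and the `(source, target, count)` tuple layout are assumptions made for the example, and a real implementation would have to read counts (or a proxy such as the rarity penalty field) from the extracted grammar.

```python
import heapq
from collections import defaultdict


def keep_top_translations(rules, limit=100):
    """Keep at most `limit` of the most frequent target sides per source side.

    `rules` is an iterable of (source, target, count) tuples. This layout is
    hypothetical, chosen for the sketch; it is not Thrax's record format.
    """
    # Group every (count, target) pair under its source side.
    by_source = defaultdict(list)
    for source, target, count in rules:
        by_source[source].append((count, target))

    kept = {}
    for source, options in by_source.items():
        # nlargest returns the entries in descending order of count;
        # ties are broken by comparing the target strings.
        top = heapq.nlargest(limit, options)
        kept[source] = [(target, count) for count, target in top]
    return kept


if __name__ == "__main__":
    # Made-up counts, only to show the shape of the input and output.
    toy_rules = [
        ("de", "of", 900),
        ("de", "from", 400),
        ("de", "the", 3),
        ("de", ",", 1),
        ("casa", "house", 50),
    ]
    for source, options in keep_top_translations(toy_rules, limit=2).items():
        print(source, options)
```

This sketch holds all rules in memory, which would probably not scale to a large bitext; there the same cap would more plausibly be applied per source side inside the extraction job, so the decoder never has to score the discarded options.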