Hi folks, I thought I'd let you know about a problem I discovered with Thrax. Can you spot it?

$ ls -lh grammar.gz
-rw-r--r-- 1 mpost staff 2.2G Oct 6 13:55 grammar.gz
$ gzip -cd 9/grammar.gz | cut -d\| -f4 | uniq -c | sort -n | tail
 8448 las
 8643 a
 9440 que
 9595 se
 9696 ,
10617 los
10885 el
11687 en
11932 de
12738 la

As you can see, for lots of source sides, there are tons of target options. The first time any rule is used, all the target sides are scored with estimateRule() in order to sort them (including a call to the LM), and then all but the top 20 (configurable with -num_translation_options) are discarded.

This is a big waste: the useless rules are stored on disk, and while the compute-time waste is constant-time, it does make a difference in "warming up" the decoder and, of course, memory usage.

The problem is that Thrax takes all target sides it finds during training. It would be good to add an option to Thrax that only keeps the top X translation options for each source side (where X is maybe 100).

matt
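
For illustration only, here is a rough Python sketch of the kind of per-source pruning proposed above. It is not Thrax code. It assumes the grammar is plain text with " ||| "-separated fields (left-hand side, source, target, features) and that rules for the same source side are adjacent, as the uniq -c counts above suggest. The scoring function first_feature is hypothetical: it treats the first feature value as a "higher is better" score, which may not match how a real grammar should be ranked.

```python
import heapq
from itertools import groupby

DELIM = " ||| "  # assumed field separator


def source_side(rule):
    """Return the source side (second field) of a rule line."""
    return rule.split(DELIM)[1]


def first_feature(rule):
    """Hypothetical score: the first feature value, higher is better."""
    return float(rule.split(DELIM)[3].split()[0])


def prune(rules, k, score=first_feature):
    """Yield at most k best-scoring rules for each run of equal source sides.

    Assumes rules with the same source side are adjacent in the input,
    so only one group is held in memory at a time.
    """
    for _, group in groupby(rules, key=source_side):
        # nlargest returns the k highest-scoring rules, best first
        yield from heapq.nlargest(k, group, key=score)
```

Since prune() takes any iterable of lines and yields lines, it could be fed from a decompressed grammar stream and its output recompressed. Any real ranking would need to swap first_feature for something that reflects the actual feature set.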