On second thought, this isn't a bug. The penalty only penalizes low-count pairs, as designed.
The problem is that I need rule counts, but I think the solution is to follow the Moses route and add those counts as a subsequent field.

matt

> On Oct 14, 2016, at 2:27 PM, Felix Hieber <felix.hie...@gmail.com> wrote:
>
> Hi Matt,
>
> Good catch! If you go for 1 + log(count) [any reason for the '1 +'?] it
> probably shouldn't be called RarityPenalty anymore :)
>
> Cheers,
> Felix
>
> On Fri, 14 Oct 2016 at 18:34, Matt Post <p...@cs.jhu.edu> wrote:
>
> > And by "very highly attested word pairs", I mean "any word pair with a
> > count ≥ 15" (!).
> >
> > I am changing this to return
> >
> >     1 + Math.log(annotation.count())
> >
> > and will commit this after testing.
> >
> > matt
> >
> >> On Oct 14, 2016, at 12:25 PM, Matt Post <p...@cs.jhu.edu> wrote:
> >>
> >> Hi folks,
> >>
> >> There is a bug in Thrax related to floating point underflow and the
> >> computation of the rarity penalty. I'm training large models over
> >> Europarl and other datasets for the Spanish–English language pack, and
> >> in an attempt to filter the models down to the hundred most frequent
> >> candidates, am finding that often the rarity penalty is 0. For example:
> >>
> >> [X] ||| australia . ||| australia . ||| Lex(e|f)=0.49798 Lex(f|e)=0.45459 PhrasePenalty=1 RarityPenalty=0 p(e|f)=0.05919 p(f|e)=0.09309 ||| 0-0 1-1
> >>
> >> "australia" occurs many times in the training corpus, so there is no
> >> reason that RarityPenalty should be 0.
> >>
> >> Note that the rarity penalty is not a raw count, but is computed as
> >>
> >>     @Override
> >>     public Writable score(RuleWritable r, Annotation annotation) {
> >>       return new FloatWritable((float) Math.exp(1 - annotation.count()));
> >>     }
> >>
> >> https://github.com/joshua-decoder/thrax/blob/master/src/edu/jhu/thrax/hadoop/features/annotation/RarityPenaltyFeature.java
> >>
> >> So the problem seems to be that, for very highly-attested word pairs,
> >> the counts are so high that the argument to Math.exp() here is very
> >> negative, and the result gets truncated to 0 when only five decimal
> >> places are requested.
> >>
> >> I wonder, why the Math.exp(1-x) dance on this value? Why not just have
> >> the rarity penalty return the log count?
> >>
> >> matt
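
For reference, the arithmetic discussed in this thread can be reproduced with a small standalone sketch. It is not Thrax code: the class name RarityPenaltyDemo and its helper methods are hypothetical, and the five-decimal formatting via String.format is an assumption standing in for however the grammar is actually written out. It compares exp(1 - count) with the proposed 1 + log(count) for a few counts.

```java
import java.util.Locale;

public class RarityPenaltyDemo {

    // Formula quoted in the thread: exp(1 - count), narrowed to a float.
    public static float rarityPenalty(int count) {
        return (float) Math.exp(1 - count);
    }

    // Alternative proposed in the thread: 1 + log(count).
    public static float logCount(int count) {
        return (float) (1 + Math.log(count));
    }

    // Assumption: feature values are printed with five decimal places.
    public static String fiveDecimals(float value) {
        return String.format(Locale.ROOT, "%.5f", value);
    }

    public static void main(String[] args) {
        int[] counts = {1, 2, 5, 10, 15, 100};
        for (int count : counts) {
            System.out.printf(Locale.ROOT,
                "count=%d  exp(1-count)=%s  1+log(count)=%s%n",
                count,
                fiveDecimals(rarityPenalty(count)),
                fiveDecimals(logCount(count)));
        }
    }
}
```

Under these assumptions, a count of 1 gives a penalty of 1.00000, and the penalty shrinks toward 0 as the count grows. At a count of 15 the float itself is still positive, but it prints as 0.00000, which fits the reading that the feature penalizes only low-count pairs.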