Hi Matt,
Good catch! If you go for 1 + log(count) [any reason for the '1 +'?] it
probably shouldn't be called RarityPenalty anymore :)

Cheers,
Felix

On Fri, 14 Oct 2016 at 18:34, Matt Post <p...@cs.jhu.edu> wrote:

And by "very highly attested word pairs", I mean "any word pair with a
count ≥ 15" (!).

I am changing this to return

        1 + Math.log(annotation.count())

and will commit this after testing.
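For reference, a throwaway sketch (hypothetical class name, not part of Thrax) comparing the old exp(1 - count) score with the proposed 1 + log(count) for a few counts:

```java
// Hypothetical demo class, not part of Thrax: compares the old
// RarityPenalty formula with the proposed log-count replacement.
public class RarityComparison {
    public static void main(String[] args) {
        for (int count : new int[] {1, 5, 15, 100}) {
            double oldScore = Math.exp(1 - count);   // current: shrinks toward 0 fast
            double newScore = 1 + Math.log(count);   // proposed: grows slowly with count
            // At count = 15 the old score already prints as 0.00000.
            System.out.printf("count=%-4d old=%.5f new=%.5f%n",
                              count, oldScore, newScore);
        }
    }
}
```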

matt


> On Oct 14, 2016, at 12:25 PM, Matt Post <p...@cs.jhu.edu> wrote:
>
> Hi folks,
>
> There is a bug in Thrax related to floating point underflow and the
> computation of the rarity penalty. I'm training large models over Europarl
> and other datasets for the Spanish–English language pack, and in an attempt
> to filter the models down to the hundred most frequent candidates, I am
> finding that often the rarity penalty is 0. For example:
>
> [X] ||| australia . ||| australia . ||| Lex(e|f)=0.49798 Lex(f|e)=0.45459
> PhrasePenalty=1 RarityPenalty=0 p(e|f)=0.05919 p(f|e)=0.09309 ||| 0-0 1-1
>
> "australia" occurs many times in the training corpus, so there is no
reason that RarityPenalty should be 0.
>
> Note that the rarity penalty is not a raw count, but is computed as
>
>  @Override
>  public Writable score(RuleWritable r, Annotation annotation) {
>    return new FloatWritable((float) Math.exp(1 - annotation.count()));
>  }
>
>
> https://github.com/joshua-decoder/thrax/blob/master/src/edu/jhu/thrax/hadoop/features/annotation/RarityPenaltyFeature.java
>
> So the problem seems to be that, for very highly-attested word pairs, the
> counts are so high that the argument to Math.exp() is a large negative
> number, so the result underflows toward zero and gets truncated to 0 when
> only five decimal places are requested.
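To see the truncation concretely, a minimal standalone sketch (plain Java, not Thrax code, using the same five-decimal formatting as the grammar output):

```java
// Standalone illustration: exp(1 - count) is still positive at
// count = 15, but rounds to 0.00000 when printed at five decimals.
public class UnderflowDemo {
    public static void main(String[] args) {
        double rarity = Math.exp(1 - 15);  // ≈ 8.3e-7, nonzero
        System.out.printf("RarityPenalty=%.5f (actual value: %e)%n",
                          rarity, rarity);
    }
}
```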
>
> I wonder, why the Math.exp(1-x) dance on this value? Why not just have
> the rarity penalty return the log count?
>
> matt
