Hi Marco,
I happen to be up late tonight debugging this very same problem.  What
are the odds?  Here's what I know so far:

1) Once you hit this problem, you're never going to recover, so it's
good to put in an exit(1) in GIZA when you've detected it.

2) I think this has to do with numerical underflow caused by sentences
that are too long.  Generally, GIZA++ does not support sentences that
are longer than 100 words (101 including NULL) and will truncate them
if they exceed the internally specified maximum.  I attempted to
increase this constant, but when I tried to align corpora with longer
sentences, I started seeing this error, and I'm fairly confident that
sentence length is the issue.  Looking at the implementation of the
HMM alignment model (GIZA, by the way, makes Moses look like a work of
art), it seems that there is no normalization being used in the
forward/backward trellises, which can most definitely lead to
underflow errors (see Fred Jelinek's book on speech recognition, maybe
Section 2.10 if I recall correctly, for a discussion of this problem).
The usual fix is to rescale the trellis at each position; there's a
quick sketch of what I mean below.
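
To make that concrete, here is a minimal sketch of forward/backward
with per-position rescaling, the standard trick from the speech
literature.  It runs against a toy HMM in NumPy rather than GIZA's
actual data structures, so the parameter layout and names are mine,
not GIZA's:

import numpy as np

def forward_backward_scaled(init, trans, emis, observations):
    """Forward/backward with per-position rescaling so the trellis
    never underflows, even for very long observation sequences.

    init[i]     : P(state_0 = i)
    trans[i, j] : P(state_{t+1} = j | state_t = i)
    emis[i, o]  : P(obs = o | state = i)
    """
    n_states = len(init)
    T = len(observations)

    alpha = np.zeros((T, n_states))
    beta = np.zeros((T, n_states))
    scale = np.zeros(T)  # c_t = 1 / sum_i alpha_t(i), applied at each t

    # Forward pass: rescale each row to sum to 1 and remember the factor.
    alpha[0] = init * emis[:, observations[0]]
    scale[0] = 1.0 / alpha[0].sum()
    alpha[0] *= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emis[:, observations[t]]
        scale[t] = 1.0 / alpha[t].sum()
        alpha[t] *= scale[t]

    # Backward pass reuses the same scale factors.
    beta[T - 1] = scale[T - 1]
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emis[:, observations[t + 1]] * beta[t + 1])
        beta[t] *= scale[t]

    # The likelihood is recovered from the scale factors in log space,
    # so it never under- or overflows: log P(obs) = -sum_t log c_t.
    log_likelihood = -np.sum(np.log(scale))
    return alpha, beta, log_likelihood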

Anyway, my recommendation is to try to get around this by filtering
your corpus based on sentence length/sentence length ratio (please,
let me know if this solution works for you!).  Once I confirm this is
the problem, I'll look into adding trellis normalization to GIZA.
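
If it helps, here is roughly what I mean by filtering.  If I remember
right, the clean-corpus-n.perl script under scripts/training in the
Moses distribution already does this, but below is a small stand-alone
sketch; the file names, the 100-word cap, and the 9:1 length ratio are
just illustrative defaults, not anything GIZA mandates:

# Drop sentence pairs that are empty, too long, or badly mismatched in
# length.  MAX_LEN and MAX_RATIO below are illustrative defaults, not
# values required by GIZA.
MAX_LEN = 100      # GIZA++ truncates above ~100 tokens, so filter first
MAX_RATIO = 9.0    # drop pairs whose lengths differ by more than this

def clean_parallel_corpus(src_in, tgt_in, src_out, tgt_out,
                          max_len=MAX_LEN, max_ratio=MAX_RATIO):
    kept = dropped = 0
    with open(src_in) as fs, open(tgt_in) as ft, \
         open(src_out, "w") as out_s, open(tgt_out, "w") as out_t:
        for src_line, tgt_line in zip(fs, ft):
            ls, lt = len(src_line.split()), len(tgt_line.split())
            keep = (0 < ls <= max_len and 0 < lt <= max_len
                    and float(max(ls, lt)) / min(ls, lt) <= max_ratio)
            if keep:
                out_s.write(src_line)
                out_t.write(tgt_line)
                kept += 1
            else:
                dropped += 1
    print("kept %d pairs, dropped %d" % (kept, dropped))

# e.g. clean_parallel_corpus("corpus.fr", "corpus.en",
#                            "corpus.clean.fr", "corpus.clean.en")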

--Chris

On Dec 16, 2007 1:43 AM, marco turchi <[EMAIL PROTECTED]> wrote:
> Dear experts,
> I have run a full Moses process: training, optimization and testing. I
> have sent all the output of these processes to a file. At the end, I
> saw that this file is huge (65 GB), and the BLEU score is completely
> different from other experiments with the same number of sentences...
> I have investigated by looking inside the output file, and I have seen
> that GIZA has reported this error:
> -----------
> Hmm: Iteration 4
> Reading more sentence pairs into memory ...
> ERROR2: nan nan nanN:
>
> and after this error, I get a lot of lines full of numbers and then
> ERROR: nan nan nan 52 38
> ERROR: nan nan nan 52 38
> ERROR: nan nan nan 52 38
> ERROR: nan nan nan 52 38
> ERROR: nan nan nan 52 38
> ERROR: nan nan nan 52 38
> ERROR: nan nan nan 52 38
>
>
> and so on...
> the training phase gives me 63872760 output lines...
> do you know why this happens?
>
> if I run the same experiment again, will I get the same strange
> behaviour, or have I just been unlucky?
>
> Thanks a lot
>  Marco
>
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
