Hi Marco, I happen to be up late tonight debugging this very same problem. What are the odds? Here's what I know so far:
1) Once you hit this problem, the run never recovers, so it's a good idea to put an exit(1) in GIZA at the point where the error is detected.

2) I think this has to do with a numerical underflow caused by overly long sentences. Generally, GIZA++ does not support sentences longer than 100 words (101 including NULL) and will truncate them if they exceed the internally specified maximum. I attempted to increase this constant, but when I tried to align corpora with longer sentences, I started seeing this error, and I'm fairly confident that sentence length is the issue. Looking at the implementation of the HMM alignment model (GIZA, by the way, makes Moses look like a work of art), it seems that no normalization is used in the forward/backward trellises, which can most definitely lead to underflow errors (see Fred Jelinek's book on speech recognition, maybe Section 2.10 if I recall correctly, for a discussion of this problem).

Anyway, my recommendation is to try to get around this by filtering your corpus based on sentence length and sentence length ratio (please let me know if this solution works for you!). Once I confirm this is the problem, I'll look into adding trellis normalization to GIZA.

--Chris

On Dec 16, 2007 1:43 AM, marco turchi <[EMAIL PROTECTED]> wrote:
> Dear experts,
> I have run a full Moses process: training, optimization and testing... I
> have sent all the output of these processes into a file. At the end, I have
> seen that this file is huge (65 GB), and the BLEU score is completely
> different from other experiments with the same number of sentences...
> I have investigated by looking inside the output file, and I have seen that
> Giza has found this error:
> -----------
> Hmm: Iteration 4
> Reading more sentence pairs into memory ...
> ERROR2: nan nan nanN:
>
> and after this error, I get a lot of lines full of numbers and then
> ERROR: nan nan nan 52 38
> ERROR: nan nan nan 52 38
> ERROR: nan nan nan 52 38
> ERROR: nan nan nan 52 38
> ERROR: nan nan nan 52 38
> ERROR: nan nan nan 52 38
> ERROR: nan nan nan 52 38
>
> and so on...
> the training phase gives me 63872760 output lines....
> do you know what is happening?
>
> if I run the same experiment again, will I get the same strange behaviour?
> or have I just been unlucky?
>
> Thanks a lot
> Marco
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
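
For anyone following the thread: the trellis normalization discussed in the reply can be sketched on a toy HMM. This is only an illustration in Python, not GIZA++ code. The function names (forward_unscaled, forward_scaled) and the two-state model with its made-up probabilities are assumptions chosen for the example. The idea is to rescale the forward column at every step so it sums to 1, and to accumulate the log of the scale factors instead of multiplying raw probabilities.

```python
import math

def forward_unscaled(init, trans, emit, obs):
    """Plain forward pass. Returns P(obs) as a raw probability,
    which shrinks geometrically with the length of obs."""
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [emit[s][o] * sum(alpha[r] * trans[r][s] for r in range(n))
                 for s in range(n)]
    return sum(alpha)

def forward_scaled(init, trans, emit, obs):
    """Forward pass with per-step normalization. Returns log P(obs)."""
    n = len(init)
    log_prob = 0.0
    alpha = None
    for t, o in enumerate(obs):
        if t == 0:
            alpha = [init[s] * emit[s][o] for s in range(n)]
        else:
            alpha = [emit[s][o] * sum(alpha[r] * trans[r][s] for r in range(n))
                     for s in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            raise ValueError("observation has zero probability under the model")
        # Rescale the column so it sums to 1; keep the scale in log space.
        alpha = [a / scale for a in alpha]
        log_prob += math.log(scale)
    return log_prob

# Toy two-state model over a four-symbol vocabulary (hypothetical numbers).
INIT = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
EMIT = [[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]]

SHORT_OBS = [0, 1, 2, 3, 0, 1]
LONG_OBS = [t % 4 for t in range(2000)]

if __name__ == "__main__":
    print("short, unscaled (log):", math.log(forward_unscaled(INIT, TRANS, EMIT, SHORT_OBS)))
    print("short, scaled   (log):", forward_scaled(INIT, TRANS, EMIT, SHORT_OBS))
    print("long,  unscaled      :", forward_unscaled(INIT, TRANS, EMIT, LONG_OBS))
    print("long,  scaled   (log):", forward_scaled(INIT, TRANS, EMIT, LONG_OBS))
```

On the short input the two functions agree up to rounding. On the 2000-symbol input the unscaled probability underflows, while the scaled version still returns a finite log-probability. In C or C++, dividing by such an underflowed total (0/0) yields NaN, which is one plausible route to the "nan nan nan" lines in the log above, though that has not been confirmed against the GIZA++ source.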