Hi Chris,

Thanks a lot. I stay up late too, trying to understand whether I have made some mistake... Anyway, I used the Moses script (Clean...) to filter out the long sentences. The maximum length that I accept is 40 words. I'm sorry, but I guess this is not the reason for my problem. :-(

Thanks a lot
Marco
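[For reference, the cleaning step Marco describes can be sketched in a few lines of Python. This is an illustrative stand-in for the Moses cleaning script, not its actual code; `keep_pair`, `max_len`, and the ratio threshold are invented names and assumptions for this sketch.]

```python
# Illustrative stand-in for the Moses cleaning step: drop sentence pairs
# that are empty, longer than max_len words, or whose length ratio is too
# skewed. keep_pair, max_len, and max_ratio are invented names, not part
# of the real Moses scripts.
def keep_pair(src, tgt, max_len=40, max_ratio=9.0):
    s, t = src.split(), tgt.split()
    if not s or not t:
        return False                       # empty side: drop
    if len(s) > max_len or len(t) > max_len:
        return False                       # too long: drop
    # overly skewed length ratios usually indicate misaligned pairs
    return max(len(s), len(t)) / min(len(s), len(t)) <= max_ratio

pairs = [
    ("a short source sentence", "une phrase source courte"),
    ("word " * 50, "mot " * 50),  # 50 words per side: over max_len, dropped
    ("one", "this target side is far far far far far far far far too long"),
]
kept = [(s, t) for s, t in pairs if keep_pair(s, t)]
```

Only the first pair survives: the second exceeds the length cap and the third has a 14:1 length ratio.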
On Dec 16, 2007 7:41 AM, Chris Dyer <[EMAIL PROTECTED]> wrote:
> Hi Marco,
> I happen to be up late tonight debugging this very same problem. What are the odds? Here's what I know so far:
>
> 1) Once you hit this problem, you're never going to recover, so it's good to put an exit(1) in GIZA once you've detected it.
>
> 2) I think this has to do with a numerical underflow caused by overly long sentences. Generally, GIZA++ does not support sentences longer than 100 words (101 including NULL) and will truncate them if they exceed the internally specified maximum. I attempted to increase this constant, but when I tried to align corpora with longer sentences I started seeing this error, and I'm fairly confident that sentence length is the issue. Looking at the implementation of the HMM alignment model (GIZA, by the way, makes Moses look like a work of art), it seems that no normalization is used in the forward/backward trellises, which can most definitely lead to underflow errors (see Fred Jelinek's book on speech recognition, maybe Section 2.10 if I recall correctly, for a discussion of this problem).
>
> Anyway, my recommendation is to try to get around this by filtering your corpus based on sentence length / sentence-length ratio (please let me know if this solution works for you!). Once I confirm this is the problem, I'll look into adding trellis normalization to GIZA.
>
> --Chris
>
> On Dec 16, 2007 1:43 AM, marco turchi <[EMAIL PROTECTED]> wrote:
> > Dear experts,
> > I have run a full Moses process: training, optimization, and testing. I sent all the output of these processes to a file. At the end, I saw that this file is huge (65 GB), and the BLEU score was completely different from other experiments with the same number of sentences...
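[The trellis normalization Chris refers to is the standard scaling trick for the HMM forward pass: renormalize the alpha column at every time step and accumulate the logs of the scale factors, so per-state probabilities never shrink to zero. A minimal sketch under toy assumptions — this is illustrative Python, not GIZA++'s C++ code, and `forward_log_likelihood` and its arguments are invented names for this sketch.]

```python
import math

def forward_log_likelihood(pi, A, B, obs):
    """Scaled HMM forward pass: returns log P(obs) without underflow.

    pi[s]   - initial probability of state s
    A[s][t] - transition probability from state s to state t
    B[s][o] - emission probability of symbol o in state s
    """
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    log_lik = 0.0
    for t in range(1, len(obs) + 1):
        # Rescale the trellis column so it sums to 1. Without this step,
        # the raw alphas shrink geometrically with sentence length and
        # eventually underflow to 0.0, after which divisions yield NaN --
        # exactly the "ERROR: nan nan nan" symptom in this thread.
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
        if t < len(obs):
            alpha = [
                sum(alpha[sp] * A[sp][s] for sp in range(n)) * B[s][obs[t]]
                for s in range(n)
            ]
    return log_lik

# A 2000-step toy sequence: the unscaled product 0.5**2000 underflows a
# double to 0.0, but the scaled log-likelihood comes out exact.
ll = forward_log_likelihood([0.5, 0.5],
                            [[0.5, 0.5], [0.5, 0.5]],
                            [[0.5, 0.5], [0.5, 0.5]],
                            [0] * 2000)
```

The log-likelihood is recovered as the sum of the per-step log scale factors, so sequence length no longer matters for numerical stability.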
> > I investigated the output file and saw that GIZA reports this error:
> > -----------
> > Hmm: Iteration 4
> > Reading more sentence pairs into memory ...
> > ERROR2: nan nan nanN:
> >
> > After this error I get a lot of lines full of numbers, and then:
> > ERROR: nan nan nan 52 38
> > ERROR: nan nan nan 52 38
> > ERROR: nan nan nan 52 38
> > ERROR: nan nan nan 52 38
> > ERROR: nan nan nan 52 38
> > ERROR: nan nan nan 52 38
> > ERROR: nan nan nan 52 38
> > and so on... The training phase gives me 63872760 output lines. Do you know what is happening?
> >
> > If I run the same experiment again, will I get the same strange behaviour, or have I just been unlucky?
> >
> > Thanks a lot
> > Marco
> >
> > _______________________________________________
> > Moses-support mailing list
> > Moses-support@mit.edu
> > http://mailman.mit.edu/mailman/listinfo/moses-support