I haven't looked into what is causing the problem on this particular corpus. Another known problem with the GIZA++ HMM model, though, is that its forward-backward training omits a fairly standard normalization step. That causes underflow on some sentences (especially quite long ones), which can also lead to these NaN errors.
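To make the normalization point concrete, here is a minimal sketch of a forward pass that rescales at every position and accumulates the log of the scale factors. This is not GIZA++ code: the ToyHmm struct and both function names are hypothetical, and a dense toy HMM is assumed only for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical toy HMM for illustration only; not GIZA++'s data structures.
struct ToyHmm {
    std::vector<double> initial;                  // initial[i]       = P(state i at t = 0)
    std::vector<std::vector<double>> transition;  // transition[i][j] = P(state j | state i)
    std::vector<std::vector<double>> emission;    // emission[i][o]   = P(symbol o | state i)
};

// One forward step: fills `next` with the unnormalized alphas at position t.
static void forwardStep(const ToyHmm& hmm, const std::vector<double>& alpha,
                        int symbol, std::vector<double>& next) {
    const std::size_t n = hmm.initial.size();
    for (std::size_t j = 0; j < n; ++j) {
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) sum += alpha[i] * hmm.transition[i][j];
        next[j] = sum * hmm.emission[j][static_cast<std::size_t>(symbol)];
    }
}

// Unscaled forward pass: returns P(obs) directly.
// On long inputs the alphas shrink geometrically and underflow to 0.
double forwardUnscaled(const ToyHmm& hmm, const std::vector<int>& obs) {
    assert(!obs.empty());
    const std::size_t n = hmm.initial.size();
    std::vector<double> alpha(n), next(n);
    for (std::size_t i = 0; i < n; ++i)
        alpha[i] = hmm.initial[i] * hmm.emission[i][static_cast<std::size_t>(obs[0])];
    for (std::size_t t = 1; t < obs.size(); ++t) {
        forwardStep(hmm, alpha, obs[t], next);
        alpha.swap(next);
    }
    double total = 0.0;
    for (double a : alpha) total += a;
    return total;
}

// Scaled forward pass: after each position, divide the alphas by their sum
// and add log(sum) to the running total. Returns log P(obs).
double forwardScaledLog(const ToyHmm& hmm, const std::vector<int>& obs) {
    assert(!obs.empty());
    const std::size_t n = hmm.initial.size();
    std::vector<double> alpha(n), next(n);
    double logProb = 0.0;
    for (std::size_t t = 0; t < obs.size(); ++t) {
        if (t == 0) {
            for (std::size_t i = 0; i < n; ++i)
                next[i] = hmm.initial[i] * hmm.emission[i][static_cast<std::size_t>(obs[0])];
        } else {
            forwardStep(hmm, alpha, obs[t], next);
        }
        double scale = 0.0;
        for (double a : next) scale += a;
        // A zero scale means the observation is impossible under the model.
        if (scale == 0.0) return -std::numeric_limits<double>::infinity();
        for (std::size_t i = 0; i < n; ++i) alpha[i] = next[i] / scale;
        logProb += std::log(scale);
    }
    return logProb;
}
```

With two uniform states and an emission probability of 0.01 per position, 400 observations have probability 0.01^400, far below the range of a double: forwardUnscaled returns 0, while forwardScaledLog returns 400 * log(0.01), about -1842.07. In forward-backward training the same per-position scale factors would also be applied to the backward pass; that part is left out here.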
It seems that different systems handle very small floating-point numbers differently, so this is a bigger or smaller problem depending on the build. It may also interact with the fix that Qin is reporting.

Qin, have you been able to determine whether your fix corrects the problem with the German-English alignment?

Chris

On Thu, Feb 28, 2008 at 12:50 PM, Qin Gao <[EMAIL PROTECTED]> wrote:
> Hi, Wilson,
>
> As I mentioned, GIZA++ may have a bug on HMM training stage, it will add
> some random number to count table, and maybe it is the reason. You may
> check the archive of the mailing list for the description of the bug,
> also, you can simply comment out the lines marked with //*******// in
> Array2.h to fix it.
>
> inline T*begin(){
> #ifdef __STL_DEBUG //*******//
>   if( h1==0||h2==0)return 0;
> #endif //*******//
>   return &(p[0]);
> }
> inline T*end(){
> #ifdef __STL_DEBUG //*******//
>   if( h1==0||h2==0)return 0;
> #endif //*******//
>   return &(p[0])+p.size();
> }
>
> You may also be interested in trying a new version of Multi-threaded
> GIZA++ with the bug fixed, and a much faster speed here
>
> http://www.cs.cmu.edu/~qing/
>
> Best,
> Qin
>
> Wilson, Kevin wrote:
> >
> > Hello all,
> >
> > I'm currently trying to train Moses on aligned subtitles obtained from
> > the opus corpus website. The files have been cleaned and formatted in
> > a similar way to the standard Europarl files.
> >
> > There are a series of NAN errors after Giza begins the HMM stage of
> > training. The corpus has been cleaned using the appropriate script and
> > the sentence length has been limited to 40, although many sentences
> > are much less than this.
> >
> > I'm guessing there's some strange characters messing things up or
> > something like that, but wondered if others had encountered this issue
> > and could possibly provide advice.
> >
> > Many thanks,
> >
> > Kevin.
> >
> > *Kevin A. Wilson, MS*
> > Research Computing Division
> > RTI International
> > 3040 Cornwallis Road
> > P.O. Box 12194
> > Research Triangle Park
> > NC 27709-2194
> > (919) 485-5521
> >
> > www.rti.org <http://www.rti.org/>
> >
> > ------------------------------------------------------------------------
> >
> > _______________________________________________
> > Moses-support mailing list
> > Moses-support@mit.edu
> > http://mailman.mit.edu/mailman/listinfo/moses-support