I haven't looked into what's causing the particular problem on this
corpus, but another known problem with the GIZA HMM model is that it
doesn't do a fairly standard kind of normalization in the
forward-backward training. That causes underflow errors in some
sentences (especially quite long ones), which can also lead to the NaN
errors you are seeing.
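To make the normalization point concrete, here is a minimal sketch of a
forward pass with and without per-position rescaling. This is not GIZA++
code: the function name forward_log_likelihood, the Matrix alias, the
uniform start distribution and the toy layout of trans/emit are all my
own assumptions for illustration. Without rescaling, the forward mass
underflows to exactly 0 on long inputs, and any later division by that
mass (as when normalizing expected counts) is 0/0, i.e. NaN.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical toy types and names, not taken from GIZA++.
using Matrix = std::vector<std::vector<double>>;

// Forward pass over a toy HMM, returning the log-likelihood of the sequence.
//   trans[i][j] : P(state j | state i)
//   emit[t][j]  : P(observation at position t | state j)
// With scale == true, alpha is renormalized to sum to 1 at every position
// and the logs of the normalizers are accumulated instead.
double forward_log_likelihood(const Matrix& trans, const Matrix& emit, bool scale) {
    const std::size_t n = trans.size();
    std::vector<double> alpha(n, 1.0 / static_cast<double>(n));  // assumed uniform start
    std::vector<double> next(n);
    double log_scale = 0.0;  // sum of log normalizers (scaled variant only)
    double mass = 1.0;       // total alpha mass at the current position

    for (std::size_t t = 0; t < emit.size(); ++t) {
        mass = 0.0;
        for (std::size_t j = 0; j < n; ++j) {
            double s = alpha[j];  // position 0: start probability only
            if (t > 0) {
                s = 0.0;
                for (std::size_t i = 0; i < n; ++i) s += alpha[i] * trans[i][j];
            }
            next[j] = s * emit[t][j];
            mass += next[j];
        }
        if (scale && mass > 0.0) {
            // The normalization step: keep alpha in a safe numeric range.
            for (double& a : next) a /= mass;
            log_scale += std::log(mass);
            mass = 1.0;
        }
        alpha.swap(next);
    }
    // Unscaled: log of the raw mass, which is log(0) = -inf after underflow.
    // Scaled: mass is 1 here, so this is just the accumulated log normalizers.
    return log_scale + std::log(mass);
}
```

With two states, uniform transitions and 100 positions whose emission
probabilities are all 1e-5, the unscaled call returns -inf (the true
likelihood, 1e-500, is not representable as a double), while the scaled
call returns a finite value near -1151.3.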

It seems that different systems handle very small floating point
numbers differently, so this is a bigger or smaller problem depending
on the build. It may also interact with the fix that Qin is
reporting.  Qin, have you been able to determine whether your fix
corrects the problem with the German-English alignment?

Chris

On Thu, Feb 28, 2008 at 12:50 PM, Qin Gao <[EMAIL PROTECTED]> wrote:
> Hi, Wilson,
>
>  As I mentioned, GIZA++ may have a bug in the HMM training stage: it adds
>  some random numbers to the count table, and that may be the reason. You
>  can check the mailing list archive for a description of the bug. You can
>  also simply comment out the lines marked with //*******// in Array2.h
>  to fix it.
>
>  inline T*begin(){
>  #ifdef __STL_DEBUG //*******//
>  if( h1==0||h2==0)return 0;
>  #endif //*******//
>  return &(p[0]);
>  }
>  inline T*end(){
>  #ifdef __STL_DEBUG //*******//
>  if( h1==0||h2==0)return 0;
>  #endif //*******//
>  return &(p[0])+p.size();
>  }
>
>  You may also be interested in trying a new multi-threaded version of
>  GIZA++, which has the bug fixed and is much faster, here:
>
>  http://www.cs.cmu.edu/~qing/
>
>  Best,
>  Qin
>
>
>
>  Wilson, Kevin wrote:
>  >
>  > Hello all,
>  >
>  > I'm currently trying to train Moses on aligned subtitles obtained from
>  > the opus corpus website. The files have been cleaned and formatted in
>  > a similar way to the standard Europarl files.
>  >
>  > There is a series of NaN errors after GIZA++ begins the HMM stage of
>  > training. The corpus has been cleaned using the appropriate script and
>  > the sentence length has been limited to 40, although many sentences
>  > are much shorter than this.
>  >
>  > I'm guessing there are some strange characters messing things up or
>  > something like that, but I wondered if others had encountered this
>  > issue and could possibly provide advice.
>  >
>  > Many thanks,
>  >
>  > Kevin.
>  >
>  > *Kevin A. Wilson, MS*
>  >
>  > Research Computing Division
>  >
>  > RTI International
>  >
>  > 3040 Cornwallis Road
>  >
>  > P.O. Box 12194
>  >
>  > Research Triangle Park
>  >
>  > NC 27709-2194
>  >
>  > (919) 485-5521
>  >
>  >
>  > www.rti.org <http://www.rti.org/>
>  >
>  > ------------------------------------------------------------------------
>  >
>  > _______________________________________________
>  > Moses-support mailing list
>  > Moses-support@mit.edu
>  > http://mailman.mit.edu/mailman/listinfo/moses-support
>  >
>