We pick the random set from across the entire collection of documents. The documents are retrieved in the order the file system returns them (not alphabetically sorted). Your comment, "picked in consecutive order," is interesting. I've often wondered whether the order could affect a system's performance. It's easy enough for me to randomize both the collection line order and the test set line order.
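For what it's worth, a minimal sketch of that kind of shuffle, assuming plain one-sentence-per-line source/target files; the file names, seed, and output suffix are illustrative, not anything Moses provides:

import random

def shuffle_bitext(src_path, tgt_path, seed=1234):
    """Shuffle a parallel corpus, keeping source/target lines aligned."""
    with open(src_path, encoding="utf-8") as f:
        src = f.readlines()
    with open(tgt_path, encoding="utf-8") as f:
        tgt = f.readlines()
    assert len(src) == len(tgt), "bitext files must have the same line count"

    pairs = list(zip(src, tgt))
    random.Random(seed).shuffle(pairs)  # fixed seed => reproducible order

    with open(src_path + ".shuf", "w", encoding="utf-8") as f_src, \
         open(tgt_path + ".shuf", "w", encoding="utf-8") as f_tgt:
        for s, t in pairs:
            f_src.write(s)
            f_tgt.write(t)

shuffle_bitext("corpus.en", "corpus.fr")

Shuffling both files in lockstep keeps the sentence pairs aligned while destroying any document-level ordering, and the fixed seed makes the split reproducible across runs.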
The large variance in BLEU would normally be alarming, but this is on a very small sample corpus of only 40,000 lines. We use the sample corpus to validate that the system installs properly. We haven't seen such large variations on multi-million-pair corpora, but they do still vary by 2-4 BLEU points.

Tom

-----Original Message-----
From: Hieu Hoang <hieuho...@gmail.com>
To: moses-support@mit.edu
Subject: Re: [Moses-support] Nondeterminism during decoding: same config, different n-best lists
Date: Thu, 24 Mar 2011 20:43:49 +0000

There may be some systematic differences between the randomly chosen test sets, e.g. the sentences are from the same documents because they were picked in consecutive order from a multi-doc corpus. Otherwise, I'd be worried about such a large BLEU variation.

Also, see here on the evils of MERT:
http://www.mail-archive.com/moses-support@mit.edu/msg00216.html

On 24/03/2011 16:06, Tom Hoar wrote:
> We often run multiple trainings on the exact same bitext corpus but
> pull different random samples for each run. We've observed drastically
> different BLEU scores between runs, with BLEUs ranging from 30 to 45.
> This is from exactly the same training data except for the
> randomly-pulled tuning and evaluation sets. We've assumed this
> difference is due to the random differences in the sets, floating-point
> variations between machines, and not using --predictable-seeds.
>
> Tom
>
> -----Original Message-----
> From: Hieu Hoang <hieuho...@gmail.com>
> Reply-to: h...@hoang.co.uk
> To: John Burger <j...@mitre.org>
> Cc: Moses-support <moses-support@mit.edu>
> Subject: Re: [Moses-support] Nondeterminism during decoding: same
> config, different n-best lists
> Date: Thu, 24 Mar 2011 15:51:48 +0000
>
> There are small differences in floating point between OS and gcc
> versions. One of the regression tests fails because of rounding errors,
> depending on which machine you run it on. Other than truncating the
> scores, there's not a lot we can do.
>
> The MERT Perl scripts also dabble in the scores, and that may be
> another source of divergence.
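To illustrate the truncation idea: rounding model scores to a fixed number of digits before sorting makes hypothesis ordering insensitive to low-order floating-point noise, and a deterministic tie-breaker then fixes the order of genuinely tied hypotheses. A minimal sketch; the precision, scores, and data layout are illustrative, not Moses code:

# Hypothetical n-best entries: (hypothesis, model score).
nbest = [
    ("translation a", 14.302199999997),
    ("translation b", 14.302200000001),  # differs only in low-order bits
    ("translation c", 12.750000000000),
]

PRECISION = 6  # digits kept; anything below is treated as FP noise

def truncated_score(entry):
    return round(entry[1], PRECISION)

# Sort descending on truncated scores, with the hypothesis string as a
# deterministic tie-breaker, so the order is the same on every machine.
nbest.sort(key=lambda e: (-truncated_score(e), e[0]))

for hyp, score in nbest:
    print(f"{score:.6f}\t{hyp}")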
> On 24 March 2011 15:07, John Burger <j...@mitre.org> wrote:
> >
> > Lane Schwartz wrote:
> >
> > > I've examined the n-best lists, and it seems there are at least a
> > > couple of interesting cases. In the simplest case, several
> > > translations of a given sentence produce the exact same score, and
> > > these tied translations appear in a different order during
> > > different runs. This is a bit odd, but [not] terribly worrisome.
> > > The stranger case is when there are two different decoding runs,
> > > and for a given sentence, there are translations that appear only
> > > in run A, and different translations that appear only in run B.
> >
> > Both these cases are relevant to something we've occasionally seen,
> > which is non-determinism during =tuning=. This is not surprising
> > given the above, since tuning of course involves decoding. It's hard
> > to reproduce, but we have sometimes seen very different weights
> > coming out of MERT for exactly the same system configuration. The
> > problem here is that even very small differences in tuning can
> > result in substantial differences in test results, because of how
> > twitchy BLEU is.
> >
> > Like many folks, we typically run MERT on a cluster. This brings up
> > another source of non-determinism we've theorized about. Some of our
> > clusters are heterogeneous, and we've wondered if there might be
> > minor differences in floating-point behavior from machine to
> > machine. The assignment of different chunks of the tuning data to
> > different machines is typically non-deterministic, so this might
> > carry over to the actual weights that come out of MERT.
> >
> > Does anyone know how robust the floating-point usage in the decoder
> > is under these circumstances?
> >
> > Thanks.
> >
> > - John Burger
> > MITRE
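As a rough illustration of how much of this BLEU spread can be pure sampling noise, independent of any decoder non-determinism: hold the system output fixed and re-score many random subsamples of a single test set, then look at the range. A sketch, assuming sacrebleu is installed; file names and sample sizes are illustrative:

# Score random subsamples of one fixed test set to see BLEU's spread.
import random
import sacrebleu

with open("test.hyp", encoding="utf-8") as f:
    hyps = [line.rstrip("\n") for line in f]
with open("test.ref", encoding="utf-8") as f:
    refs = [line.rstrip("\n") for line in f]

scores = []
for trial in range(100):
    idx = random.sample(range(len(hyps)), k=min(1000, len(hyps)))
    sample_hyps = [hyps[i] for i in idx]
    sample_refs = [refs[i] for i in idx]
    scores.append(sacrebleu.corpus_bleu(sample_hyps, [sample_refs]).score)

print(f"min={min(scores):.2f} max={max(scores):.2f} "
      f"mean={sum(scores) / len(scores):.2f}")

If the min-max range from resampling alone is already several BLEU points, differences of that size between runs with different random tuning/evaluation splits are unsurprising.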
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support