We know that there is nondeterminism during optimization, yet virtually all papers report results based on a single MERT run. We know that results can vary widely across language pairs and data sets, but a large majority of papers report results on a single language pair, and often on a single data set.

While these issues are widely known at the informal level, I think that Suzy's point is well taken. I think there would be value in published studies showing just how wide the gap due to nondeterminism can be expected to be. It may be that such studies already exist, and I'm just not aware of them. Does anyone know of any?

Cheers,
Lane
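For what it's worth, the reporting side of what Suzy proposes below is cheap once the runs exist: collect one BLEU score per MERT run and report a mean and standard deviation. A minimal sketch in Python, with made-up numbers standing in for real per-run scores:

    import statistics

    # Hypothetical BLEU scores from five MERT runs of the same system on
    # the same test set; only the optimizer's random behaviour differs.
    bleu_runs = [24.8, 25.3, 24.1, 25.0, 24.6]

    mean = statistics.mean(bleu_runs)
    stdev = statistics.stdev(bleu_runs)  # sample standard deviation (n-1)

    print(f"BLEU {mean:.2f} +/- {stdev:.2f} over {len(bleu_runs)} runs")

The expensive part is the repeated tuning itself, which is exactly the practicality trade-off Suzy mentions.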
On Fri, Mar 25, 2011 at 7:03 AM, Barry Haddow <bhad...@inf.ed.ac.uk> wrote:
> Hi
>
> This is an issue faced not just by SMT, but probably by all research
> fields. Evidence from one paper doesn't generally prove or disprove that a
> technique works; you need to consider lots of evidence, from different
> workers in different labs.
>
> As a young field, SMT has its own problems in building up good experimental
> practices, which are not helped by the tendency to over-sell in research
> papers and to ignore the non-determinism in many parts of the pipeline.
> Non-reproducibility is also a problem, as much of the code used in papers is
> not released, and the complete list of settings required to rerun an
> experiment is rarely given. These problems have been acknowledged, and
> initiatives proposed to address them, but they're far from solved.
>
> best regards - Barry
>
> On Friday 25 March 2011 10:44, Miles Osborne wrote:
> > this is something that I have been concerned about for a long time
> > now, and things are actually worse than this, since often only a
> > single language pair / test set / training set is used. Claims cannot
> > be made on the basis of such shaky evidence.
> >
> > Miles
> >
> > On 25 March 2011 09:42, Suzy Howlett <s...@showlett.id.au> wrote:
> > > I've been thinking about the issue of nondeterminism and am somewhat
> > > concerned because typically MT results/papers give just a single
> > > performance figure for each system. As there is an element of
> > > nondeterministic behaviour, it would seem prudent to run several repeats
> > > of each system and give mean and standard deviation information instead.
> > > Of course, this has a practicality trade-off, so an investigation is
> > > warranted to determine the scale of the problem. Is anyone interested in
> > > collaborating on a paper or CL squib to address the issue, and bring it
> > > to the attention of the MT community (and the CL community at large)?
> > >
> > > Suzy
> > >
> > > On 25/03/11 11:58 AM, Tom Hoar wrote:
> > >> We pick the random set from across the entire collection of documents.
> > >> The documents are retrieved as the file system orders them (not
> > >> alphabetically sorted). Your comment, "picked in consecutive order", is
> > >> interesting. I've often wondered if the order could affect a system's
> > >> performance. It's easy enough for me to randomize both the collection
> > >> line order and the test set line order.
> > >>
> > >> The large variance in BLEU would normally be alarming, but this is on a
> > >> very small sample corpus of only 40,000 lines. We use the sample corpus
> > >> to validate that the system installs properly. We haven't seen such large
> > >> variations in multi-million-pair corpora, but they do range 2-4 BLEU
> > >> points.
> > >>
> > >> Tom
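Randomizing both the collection line order and the test-set line order, as Tom describes, is a one-pass job, and a fixed seed keeps the split itself reproducible across reruns. A minimal sketch; the file names, set sizes, and tab-separated layout are assumptions for illustration:

    import random

    # Hypothetical parallel corpus: one tab-separated sentence pair per line.
    with open("corpus.src-tgt.tsv", encoding="utf-8") as f:
        pairs = f.readlines()

    rng = random.Random(42)  # fixed seed: the same shuffle on every run
    rng.shuffle(pairs)       # breaks up document-consecutive blocks

    # Carve off tuning and evaluation sets; the remainder is training data.
    tune, test, train = pairs[:2000], pairs[2000:4000], pairs[4000:]

    for name, subset in (("tune", tune), ("test", test), ("train", train)):
        with open(f"{name}.tsv", "w", encoding="utf-8") as out:
            out.writelines(subset)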
> > >>
> > >> -----Original Message-----
> > >> *From*: Hieu Hoang <hieuho...@gmail.com>
> > >> *To*: moses-support@mit.edu
> > >> *Subject*: Re: [Moses-support] Nondeterminism during decoding: same
> > >> config, different n-best lists
> > >> *Date*: Thu, 24 Mar 2011 20:43:49 +0000
> > >>
> > >> There may be some systematic differences between the randomly chosen
> > >> test sets, e.g. the sentences are from the same documents 'cos they were
> > >> picked in consecutive order from a multi-doc corpus. Otherwise, I'd be
> > >> worried about such a large BLEU variation.
> > >>
> > >> Also, see here on the evils of MERT:
> > >> http://www.mail-archive.com/moses-support@mit.edu/msg00216.html
> > >>
> > >> On 24/03/2011 16:06, Tom Hoar wrote:
> > >>> We often run multiple trainings on the exact same bitext corpus but
> > >>> pull different random samples for each run. We've observed drastically
> > >>> different BLEU scores between different runs, with BLEUs ranging from
> > >>> 30 to 45. This is from exactly the same training data except for the
> > >>> randomly-pulled tuning and evaluation sets. We've assumed this
> > >>> difference is due to the random differences in the sets, floating
> > >>> point variations between various machines, and not using
> > >>> --predictable-seeds.
> > >>>
> > >>> Tom
> > >>>
> > >>> -----Original Message-----
> > >>> *From*: Hieu Hoang <hieuho...@gmail.com>
> > >>> *Reply-to*: h...@hoang.co.uk
> > >>> *To*: John Burger <j...@mitre.org>
> > >>> *Cc*: Moses-support <moses-support@mit.edu>
> > >>> *Subject*: Re: [Moses-support] Nondeterminism during decoding: same
> > >>> config, different n-best lists
> > >>> *Date*: Thu, 24 Mar 2011 15:51:48 +0000
> > >>>
> > >>> There are small differences in floating point behaviour between OS and
> > >>> gcc versions. One of the regression tests fails because of rounding
> > >>> errors, depending on which machine you run it on. Other than truncating
> > >>> the scores, there's not a lot we can do.
> > >>>
> > >>> The MERT perl scripts also dabble in the scores, and that may be
> > >>> another source of divergence.
> > >>>
> > >>> On 24 March 2011 15:07, John Burger <j...@mitre.org> wrote:
> > >>>
> > >>> Lane Schwartz wrote:
> > >>>
> > >>> > I've examined the n-best lists, and it seems there are at least a
> > >>> > couple of interesting cases. In the simplest case, several
> > >>> > translations of a given sentence produce the exact same score, and
> > >>> > these tied translations appear in different order during different
> > >>> > runs. This is a bit odd, but [not] terribly worrisome. The stranger
> > >>> > case is when there are two different decoding runs, and for a given
> > >>> > sentence, there are translations that appear only in run A, and
> > >>> > different translations that appear only in run B.
> > >>>
> > >>> Both these cases are relevant to something we've occasionally seen,
> > >>> which is non-determinism during =tuning=. This is not surprising
> > >>> given the above, since tuning of course involves decoding. It's
> > >>> hard to reproduce, but we have sometimes seen very different weights
> > >>> coming out of MERT for the exact same system configurations. The
> > >>> problem here is that even very small differences in tuning can result
> > >>> in substantial differences in test results, because of how twitchy
> > >>> BLEU is.
> > >>>
> > >>> Like many folks, we typically run MERT on a cluster. This brings up
> > >>> another source of non-determinism we've theorized about. Some of
> > >>> our clusters are heterogeneous, and we've wondered if there might be
> > >>> minor differences in floating point behavior from machine to machine.
> > >>> The assignment of different chunks of the tuning data to different
> > >>> machines is typically non-deterministic, so this might carry over to
> > >>> the actual weights that come out of MERT.
> > >>>
> > >>> Does anyone know how robust the floating point usage in the decoder
> > >>> is under these circumstances?
> > >>>
> > >>> Thanks.
> > >>>
> > >>> - John Burger
> > >>>   MITRE
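On John's floating point question: even with bit-identical inputs, changing the order in which values are accumulated changes the rounding, so a heterogeneous cluster, or just a nondeterministic assignment of chunks to machines, can legitimately produce slightly different sums. A toy Python demonstration of the order-dependence, not Moses code:

    import random

    # A thousand hypothetical feature scores of very different magnitudes,
    # the kind of values a decoder might accumulate.
    rng = random.Random(0)
    scores = [rng.uniform(-1.0, 1.0) * 10 ** rng.randint(-8, 8)
              for _ in range(1000)]

    shuffled = scores[:]
    rng.shuffle(shuffled)  # same numbers, different accumulation order

    total_a = sum(scores)
    total_b = sum(shuffled)
    print(total_a == total_b)      # typically False
    print(abs(total_a - total_b))  # small but nonzero discrepancy

If the chunk-to-machine assignment varies from run to run, this kind of discrepancy feeds straight into the scores that MERT sees.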
> > > --
> > > Suzy Howlett
> > > http://www.showlett.id.au/
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.

--
When a place gets crowded enough to require ID's, social collapse is not
far away. It is time to go elsewhere. The best thing about space travel
is that it made it possible to go elsewhere.
        -- R.A. Heinlein, "Time Enough For Love"
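Coming back to the simpler of the two cases Lane described, translations with exactly tied model scores emerging in different orders: on the reporting side, a deterministic tie-break makes n-best output stable across runs. A sketch of one such tie-break in Python, not what Moses itself does:

    # Hypothetical n-best entries: (model_score, translation) pairs for
    # one source sentence, as a decoder might emit them.
    nbest = [
        (-104.2, "the house is blue"),
        (-104.2, "the house is blue ."),  # tied score
        (-105.7, "blue is the house"),
    ]

    # Sort by score first (best, i.e. least negative, first), then fall
    # back to the translation string, so tied hypotheses always come out
    # in the same order on every run.
    nbest.sort(key=lambda entry: (-entry[0], entry[1]))

    for score, translation in nbest:
        print(f"{score}\t{translation}")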
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support