While re-running an experiment (exact same configuration, same models, translating the same data), I noticed that I occasionally get a slightly different 1-best list.
Upon further examination, running Moses multiple times with the same config often (but not always) produces slightly different n-best lists. This is worrisome from the perspective of being able to re-run an experiment and reproduce its results. Is this a known issue?

I've examined the n-best lists, and there seem to be at least two interesting cases. In the simplest case, several translations of a given sentence receive exactly the same score, and these tied translations appear in a different order across runs. That is a bit odd, but not terribly worrisome. The stranger case is when, for a given sentence, some translations appear only in decoding run A, while different translations appear only in run B.

I have these n-best lists available for review - I'll post them in a separate email in this thread.

Cheers,
Lane
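P.S. For what it's worth, the tie-reordering case is easy to reproduce in miniature. The sketch below is purely hypothetical (it is not Moses code, and the entries and scores are made up): if hypotheses arrive at the n-best collector in an order that varies run to run (say, due to thread scheduling), a stable sort by score preserves that arrival order among exact ties, so tied translations swap positions between runs.

```python
import random

# Hypothetical toy n-best list: two candidates share the exact same score.
ENTRIES = [
    ("the house is red", -4.2),
    ("the red house", -4.2),   # tied with the entry above
    ("house red", -5.1),
]

def collect_nbest(seed):
    """Simulate one decoding run: hypotheses arrive in a
    nondeterministic order, then are sorted by score before
    being written to the n-best list."""
    rng = random.Random(seed)
    arrived = ENTRIES[:]
    rng.shuffle(arrived)  # stand-in for run-to-run timing differences
    # A stable sort keeps arrival order among exact ties, so the
    # tied pair can come out in a different order on each "run".
    return sorted(arrived, key=lambda e: e[1], reverse=True)
```

Running collect_nbest with different seeds yields n-best lists that contain the same candidates with the same scores, yet order the tied pair differently - exactly the pattern I'm seeing in the simple case.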
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support