Hi again,
As announced, I have implemented duplicate removal. As I had suspected, there
are quite a few duplicates in the later iterations, though not as many as I had
assumed. Still, I found it interesting to see how this varies from sentence to
sentence.
My implementation is quite modular (duplicate removal could easily be made
optional) and, to allow a fair run-time comparison, follows the style of MOSES.
I do indeed get run-time savings in the later iterations - see the logs below.
In summary, the speed-ups are not overwhelming, but I see no reason why this
should not go into the git repository. I will contact (some of) the developers
off-list about this.
It was pointed out to me that the decoder takes up most of the tuning time, but
in my experiments MERT approaches it in the later iterations. Since tuning
takes up a major part of the experimental time, I propose adding "speeding up
the decoder (and maybe MERT)" to the list at
http://www.statmt.org/moses/?n=Moses.GetInvolved. In quite a few places
std::vector is used for numerical data, which can be slow; std::valarray might
be a first option if, for some reason, external classes are not desired.
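As a toy illustration of the kind of rewrite I mean (not actual MOSES code; the
function name is made up), a weighted model score can be written with
std::valarray, which supports element-wise arithmetic without hand-written
index loops:

```cpp
#include <valarray>

// Toy example: a weighted model score as an inner product.
// std::valarray provides element-wise operators and sum(), which can be
// friendlier to the optimizer than explicit index loops over std::vector.
double WeightedScore(const std::valarray<double>& weights,
                     const std::valarray<double>& features) {
    return (weights * features).sum();  // element-wise product, then sum
}
```

Whether this actually beats std::vector in MERT's inner loops would of course
have to be measured on the real workload.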
Best regards,
Thomas
This is how the git-MERT behaves (Europarl It->En, 500K training sentences, 750
dev sentences, phrase-based MOSES):
starting MERT
Data loaded : [1] seconds
Stopping... : [35] seconds
Data loaded : [1] seconds
Stopping... : [65] seconds
Data loaded : [2] seconds
Stopping... : [102] seconds
Data loaded : [3] seconds
Stopping... : [146] seconds
Data loaded : [3] seconds
Stopping... : [229] seconds
Data loaded : [3] seconds
Stopping... : [298] seconds
Data loaded : [5] seconds
Stopping... : [393] seconds
Data loaded : [4] seconds
Stopping... : [387] seconds
Data loaded : [5] seconds
Stopping... : [394] seconds
Data loaded : [5] seconds
Stopping... : [499] seconds
Data loaded : [6] seconds
Stopping... : [486] seconds
Data loaded : [7] seconds
Stopping... : [541] seconds
Data loaded : [7] seconds
Stopping... : [577] seconds
Data loaded : [8] seconds
Stopping... : [630] seconds
Data loaded : [8] seconds
Stopping... : [655] seconds
Data loaded : [9] seconds
Stopping... : [698] seconds
And here's how my implementation behaves (same task, same system, similar
workload):
starting MERT
Data loaded : [1] seconds
Stopping... : [36] seconds
Data loaded : [1] seconds
Stopping... : [65] seconds
Data loaded : [2] seconds
Stopping... : [107] seconds
Data loaded : [3] seconds
Stopping... : [142] seconds
Data loaded : [3] seconds
Stopping... : [222] seconds
Data loaded : [4] seconds
Stopping... : [286] seconds
Data loaded : [4] seconds
Stopping... : [362] seconds
Data loaded : [5] seconds
Stopping... : [363] seconds
Data loaded : [5] seconds
Stopping... : [382] seconds
Data loaded : [6] seconds
Stopping... : [436] seconds
Data loaded : [7] seconds
Stopping... : [413] seconds
Data loaded : [8] seconds
Stopping... : [446] seconds
Data loaded : [8] seconds
Stopping... : [472] seconds
Data loaded : [8] seconds
Stopping... : [496] seconds
Data loaded : [9] seconds
Stopping... : [495] seconds
Data loaded : [9] seconds
Stopping... : [648] seconds
Data loaded : [10] seconds
Stopping... : [485] seconds
________________________________
From: Thomas Schoenemann <thomas_schoenem...@yahoo.de>
To: Barry Haddow <bhad...@staffmail.ed.ac.uk>; "moses-support@mit.edu"
<moses-support@mit.edu>
Sent: Wednesday, 30 November 2011, 19:29
Subject: Re: [Moses-support] Removing duplicates when merging nbest lists for
MERT
Hi!
Well, it doesn't have to be the same target translation, just the exact same
score vector (and the same feature vector, of course). I agree that MERT is
working correctly; my mail was always about efficiency. In my experiments MERT
took the major part of the running time, and I believe others have the same
problem, so I care about making it faster.
If you don't plan on an implementation, I will write my own (i.e. modify the
existing one) and get back to you once I know it is working (and faster).
Concerning PRO mode, I think that behavior could be simulated by storing
weights for each list entry. It would be a little more complicated to
implement, though.
Cheers,
Thomas
________________________________
From: Barry Haddow <bhad...@staffmail.ed.ac.uk>
To: moses-support@mit.edu; Thomas Schoenemann <thomas_schoenem...@yahoo.de>
Sent: Wednesday, 30 November 2011, 12:35
Subject: Re: [Moses-support] Removing duplicates when merging nbest lists for
MERT
Hi Thomas
Yes, you're correct, mert doesn't remove duplicates in the nbest lists. It's
something that we intended to do (and probably mentioned in the mert paper),
but somehow never got around to it.
As Lane pointed out, you have to be careful to do the duplicate removal
correctly. You can only consider hypotheses to be duplicates if they have the
same target text, and the same feature values.
The mert optimisation actually does duplicate removal implicitly during the
optimisation, since duplicate hypotheses contribute the same line to the
envelope. However, removing duplicates in the extractor could potentially be
more efficient.
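A minimal sketch of what extractor-side de-duplication could look like, under
the criterion above (a hypothesis is a duplicate only when both its target
text and its feature values match an entry already kept). The struct, names,
and string-key trick are all hypothetical, not the actual MOSES code:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical n-best entry: target translation plus model feature values.
struct Hyp {
    std::string target;            // target-side translation
    std::vector<double> features;  // model feature values
};

// Build a lookup key from text + features. Encoding floats via
// std::to_string is an assumption here; real code might instead
// compare feature values with a small tolerance.
static std::string MakeKey(const Hyp& h) {
    std::string key = h.target;
    for (double f : h.features) {
        key += '|';
        key += std::to_string(f);
    }
    return key;
}

// Keep only the first occurrence of each (target, features) pair.
std::vector<Hyp> RemoveDuplicates(const std::vector<Hyp>& merged) {
    std::unordered_set<std::string> seen;
    std::vector<Hyp> unique;
    for (const Hyp& h : merged) {
        if (seen.insert(MakeKey(h)).second)  // true iff key was new
            unique.push_back(h);
    }
    return unique;
}
```

Note that a hypothesis with the same target text but different feature values
survives the filter, which matches the criterion above.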
For PRO, however, duplicates could make a difference to the optimisation,
since they affect the sampling. I recently re-implemented the PRO extraction
to make it more efficient, and again intended to do de-duping, but haven't got
around to it yet. It would be interesting to know whether de-duping makes a
difference to the outcome.
cheers - Barry
On Tuesday 29 Nov 2011 20:06:20 Thomas Schoenemann wrote:
> Hi everyone!
>
> We all know that MERT gets slower in the later iterations. This is not
> surprising as the n-best lists of all previous iterations are merged. I
> believe this is quite important for translation performance.
>
> Still, it seems important to me to get the merged lists as small as
> possible. A quick inspection of mert/extractor indicates that duplicates
> are _not_ removed. Can anyone confirm this? And is this really not done
> anywhere else, e.g. in mert/mert ?
>
> Removing duplicates in the extractor should be easy to implement, and I
> don't think it will take more running time than one gains from the smaller
> lists.
>
> Best,
> Thomas (currently University of Pisa)
>
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support