You are not interested in discovering which phrase pairs contribute most to 
increases in BLEU score, so that we could bypass an ineffective search 
algorithm and construct a reliable phrase-pair-based, rule-based system with 
lower computational cost and a higher likelihood of better results?


I would like to see you stare investors in the face and make that claim. And 
manage to keep a straight face.


James


________________________________
From: Lane Schwartz <dowob...@gmail.com>
Sent: Wednesday, June 17, 2015 9:11 PM
To: Read, James C
Cc: Kenneth Heafield; moses-support@mit.edu; Arnold, Doug
Subject: Re: [Moses-support] Major bug found in Moses

James,

The underlying questions that you appear to be posing are these: First, when 
the search space is simplified by decoding without a language model, to what 
extent is the decoder able to identify hypotheses that have the best model 
score? Second, does filtering the phrase table in a particular way change the 
answer to this question? Third, how is the BLEU score (or any other metric) 
affected by the answers to these questions?

These are valid questions.

Unfortunately, as Kenneth, Amittai, and Hieu have pointed out, the experiment 
that you have designed does not provide you with all of what you need to be 
able to answer these questions.

Recall that we don't really deal with probabilities when decoding. Yes, some of 
our features are trained as probability models. But the decoder searches using 
a weighted combination of scores. Lots of them. Even the phrase table alone 
comprises (at least) four distinct scores (phrase translation scores and 
lexical translation scores, in both directions).

Decoding is a search problem. Specifically, it is a search through all possible 
translations to attempt to identify the one with the highest score according to 
this weighted combination of component scores.
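
To make that concrete, here is a minimal sketch of that scoring in plain
Python. The feature names, values, and weights are all invented for
illustration; they are not Moses defaults.

import math

# Log-linear model: the decoder ranks hypotheses by a weighted sum of
# component scores, not by any single probability.
def model_score(features, weights):
    return sum(weights[name] * value for name, value in features.items())

# The phrase table alone contributes (at least) four scores per phrase pair.
hypothesis_features = {
    "phrase_fwd": math.log(0.4),    # phrase translation score, e given f
    "phrase_bwd": math.log(0.3),    # phrase translation score, f given e
    "lex_fwd": math.log(0.2),       # lexical translation score, e given f
    "lex_bwd": math.log(0.25),      # lexical translation score, f given e
    "word_penalty": -4.0,
}
weights = {"phrase_fwd": 0.2, "phrase_bwd": 0.2,
           "lex_fwd": 0.2, "lex_bwd": 0.2, "word_penalty": -1.0}

print(model_score(hypothesis_features, weights))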

There are two problems, then, that we have to deal with:

First is this. Even if all we care about is the ultimate weighted combination 
of component scores, the search space is so vast (decoding is NP-complete) that 
we cannot hope to search it exhaustively in a reasonable amount of time, even 
for sentences of only moderate length. This means that we have to resort to 
pruning.
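
To illustrate what pruning means, here is a toy sketch (this is the general
idea only; Moses actually uses stack decoding with cube pruning, which is
considerably more involved):

# Toy histogram pruning: keep only the beam_size best partial hypotheses
# at each step and extend only those. Everything else is discarded and
# can never be recovered, so the search is not guaranteed to be optimal.
def beam_search(initial, expand, score, beam_size, steps):
    beam = [initial]
    for _ in range(steps):
        candidates = [child for h in beam for child in expand(h)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_size]   # the pruning step
    return max(beam, key=score)

# Toy usage: hypotheses are tuples of numbers, the score is their sum.
best = beam_search((), lambda h: [h + (0.1,), h + (0.9,)], sum, 4, 5)
print(best)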

Second is this. We don't really care about finding solutions that are optimal 
according to the weighted combination of component scores. We care about 
getting translations that are fluent and mean the same thing as the original 
sentence. Since we don't know how to measure adequacy and fluency 
automatically, we resort to imperfect metrics that can be calculated 
automatically, like BLEU. This is fine, but it makes the search problem (which 
was already intractably large) even worse.
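
For reference, here is roughly what BLEU computes, as a bare-bones,
unsmoothed, sentence-level sketch (real evaluations use corpus-level BLEU
with proper tokenization, e.g. via multi-bleu.perl):

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Modified n-gram precisions for n = 1..4, combined with a brevity penalty.
def sentence_bleu(hyp, ref, max_n=4):
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(count, r[gram]) for gram, count in h.items())
        log_prec += math.log(max(overlap, 1e-9) / max(sum(h.values()), 1))
    bp = min(1.0, math.exp(1.0 - len(ref) / len(hyp)))  # brevity penalty
    return bp * math.exp(log_prec / max_n)

print(sentence_bleu("the cat sat on the mat".split(),
                    "the cat is on the mat".split()))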

The decoder only knows how to search by finding solutions that are good 
according to the weighted combination of component scores. If we want 
translations that are good according to some metric (like BLEU), then we need 
to attempt to formulate the weights such that solutions that are good according 
to the weighted combination of component scores are also good according to the 
desired metric (BLEU).

The mechanism by which this is performed is tuning.
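
Schematically, tuning does something like the following. This is a
deliberately naive random-search stand-in just to show the objective; real
tuning in Moses (e.g. MERT via mert-moses.pl) is much smarter about how it
explores the weight space.

import random

# Search for feature weights whose 1-best outputs score well under the
# metric on a development set. dev_nbest is a list of n-best lists; each
# hypothesis is a dict with "text" and "features" fields (names invented).
def tune(dev_nbest, metric, feature_names, iters=1000, seed=0):
    rng = random.Random(seed)
    best_w, best_m = None, float("-inf")
    for _ in range(iters):
        w = {f: rng.uniform(-1.0, 1.0) for f in feature_names}
        one_best = [max(nbest,
                        key=lambda h: sum(w[f] * h["features"][f]
                                          for f in feature_names))
                    for nbest in dev_nbest]
        m = metric([h["text"] for h in one_best])
        if m > best_m:
            best_w, best_m = w, m
    return best_w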

Your decoder, by necessity, is operating with pruning. As such, your decoder 
is only operating in a confined region of the overall search space. The 
question, then, is: what region of the search space would you prefer to have 
your decoder operate in? If you choose not to run tuning, then you are choosing 
to have your decoder operate in an arbitrary region of the search space. If you 
choose to run tuning, then you are choosing to have your decoder operate in a 
region of the search space that you have reason to believe contains good 
translations according to your metric.

Another way to think about this is as follows. If you choose not to run tuning, 
and you obtain translations that are good according to the metric (BLEU), this 
is great, but it doesn't tell you much. If you obtain translations that are bad 
according to the metric, this is to be expected.

What your experiments have shown is this:

The complexity of the search space is greater when you use all available phrase 
pairs than it is when you pre-select only the best phrase pairs. When you 
choose not to tune and not to use an LM, and then decode in the simpler space, 
you get better BLEU scores than when you decode in the more complex space.

This is not a surprising result. It is in fact the expected result.

Why is this the expected result? Two reasons.

First, because search involves pruning. If you simplify the search space (by 
allowing the decoder to search using only the best phrase pairs), then it 
becomes easier for the decoder to find translations that are closer to optimal 
according to the weighted combination of scores, simply because the decoder is 
searching through a much smaller (and higher quality) sub-region of the search 
space.
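
As a sketch of what that pre-selection amounts to (this assumes the
conventional Moses phrase-table layout,
src ||| tgt ||| p(f|e) lex(f|e) p(e|f) lex(e|f) ||| ...
so check the score order against your own table before relying on it):

from collections import defaultdict

# Keep only the top-k target phrases per source phrase, ranked by the
# direct phrase translation probability p(e|f) (the third score field).
def filter_phrase_table(lines, k=20):
    by_source = defaultdict(list)
    for line in lines:
        fields = line.split(" ||| ")
        p_e_given_f = float(fields[2].split()[2])
        by_source[fields[0]].append((p_e_given_f, line))
    for entries in by_source.values():
        entries.sort(key=lambda pair: pair[0], reverse=True)
        for _, line in entries[:k]:
            yield line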

Second, because by choosing not to tune, the weights with which you are 
decoding are arbitrary. Not tuning effectively says: I don't care whether or 
not my decoder scores correspond with my metric scores.

I hope this helps. I know it can be very discouraging when papers get rejected. 
It is certainly possible that there are bugs in Moses. But the experiment that 
you have run does not provide any evidence of that so far. I know it seems 
incredible that people could not care about a very large BLEU point swing. But 
if the baseline with tuning and with an LM is (for example) 35 BLEU, and you 
show that no tuning and no LM gets you 29 BLEU with a filtered TM and 6 BLEU 
with an unfiltered TM, that's not necessarily a surprising or very interesting 
result.

Lane



On Wed, Jun 17, 2015 at 11:24 AM, Read, James C 
<jcr...@essex.ac.uk> wrote:

Which features would you like me to tune? The whole purpose of the exercise was 
to eliminate all variables except the TM, and to keep constant those that could 
not be eliminated, so that I could see which types of phrase pairs contribute 
most to increases in BLEU score in a TM-only setup.

Now you are saying I have to tune, but tuning won't work without an LM. So how 
do you expect a researcher to be able to understand how well the TM component 
of the system is working if you are going to insist that I must include an LM 
for tuning to work?

Clearly the system is broken. It is designed to work well with an LM and poorly 
without, when clearly good results can be obtained with a functional TM and 
well-chosen phrase pairs.

James

________________________________________
From: moses-support-boun...@mit.edu 
<moses-support-boun...@mit.edu> on behalf 
of Kenneth Heafield <mo...@kheafield.com>
Sent: Wednesday, June 17, 2015 7:13 PM
To: moses-support@mit.edu
Subject: Re: [Moses-support] Major bug found in Moses

I'll bite.

The moses.ini files ship with bogus feature weights.  You are required to
tune the system to discover good weights for your setup.  You did not
tune.  The results of an untuned system are meaningless.

So for example if the feature weights are all zeros, then the scores are
all zero.  The system will arbitrarily pick some awful translation from
a large space of translations.

The filter looks at one feature, p(target | source).  So now you've
constrained the awful untuned model to a slightly better region of the
search space.

In other words, all you've done is a poor approximation to manually
setting the weight to 1.0 on p(target | source) and the rest to 0.
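
A toy illustration of that equivalence (the candidates and their scores
below are invented):

import math

# (translation, [log p(f|e), log lex(f|e), log p(e|f), log lex(e|f)])
candidates = [
    ("guten tag",  [math.log(0.5), math.log(0.4), math.log(0.1), math.log(0.3)]),
    ("hallo welt", [math.log(0.2), math.log(0.3), math.log(0.6), math.log(0.5)]),
]

weights = [0.0, 0.0, 1.0, 0.0]   # all weight on p(target | source)

best = max(candidates,
           key=lambda c: sum(w * h for w, h in zip(weights, c[1])))
print(best[0])   # ranked purely by p(target | source), as the filter does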

The problem isn't that you are running without a language model (though
we generally do not care what happens without one).  The problem is that
you did not tune the feature weights.

Moreover, as Marcin is pointing out, I wouldn't necessarily expect
tuning to work without an LM.

On 06/17/15 11:56, Read, James C wrote:
> Actually, the approximation I expect is:
>
> p(e|f)=p(f|e)
>
> Why would you expect this to give poor results if the TM is well trained? 
> Surely the results of my filtering experiments prove otherwise.
>
> James
>
> ________________________________________
> From: moses-support-boun...@mit.edu 
> <moses-support-boun...@mit.edu> on 
> behalf of Rico Sennrich <rico.sennr...@gmx.ch>
> Sent: Wednesday, June 17, 2015 5:32 PM
> To: moses-support@mit.edu
> Subject: Re: [Moses-support] Major bug found in Moses
>
> Read, James C <jcread@...> writes:
>
>> I have been unable to find a logical explanation for this behaviour other
> than to conclude that there must be some kind of bug in Moses which causes a
> TM-only run of Moses to perform poorly in finding the most likely
> translations according to the TM when
>> there are less-likely phrase pairs included in the race.
> I may have overlooked something, but you seem to have removed the language
> model from your config, and used default weights. your default model will
> thus (roughly) implement the following model:
>
> p(e|f) = p(e|f)*p(f|e)
>
> which is obviously wrong, and will give you poor results. This is not a bug
> in the code, but a poor choice of models and weights. Standard steps in SMT
> (like tuning the model weights on a development set, and including a
> language model) will give you the desired results.
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support



--
When a place gets crowded enough to require ID's, social collapse is not
far away.  It is time to go elsewhere.  The best thing about space travel
is that it made it possible to go elsewhere.
                -- R.A. Heinlein, "Time Enough For Love"
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
