Re: [Moses-support] Major bug found in Moses
An improvement of 37 BLEU points over the default behaviour was not enough to show that there are problems with the default?

James

From: Raphael Payen raphael.pa...@gmail.com
Sent: Sunday, June 21, 2015 5:29 PM
To: Read, James C
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] Major bug found in Moses

James, did you try the modifications Philipp suggested (removing the word penalty and lowering p(f|e))? (I doubt it will be enough to get a best paper award, but it would probably improve your BLEU; that's always a good start :) )

On Friday, June 19, 2015, Read, James C jcr...@essex.ac.uk wrote:

So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? You're telling me that redesigning the search algorithm to prefer higher-scoring phrase pairs is all we need to do to get a best paper at ACL?

James

From: Lane Schwartz dowob...@gmail.com
Sent: Friday, June 19, 2015 7:40 PM
To: Read, James C
Cc: Philipp Koehn; Burger, John D.; moses-support@mit.edu
Subject: Re: [Moses-support] Major bug found in Moses

On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.uk wrote:

What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning.

There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. That is the expected behavior. What I have tried to tell you, and what other people have tried to tell you, is that translations with high model scores are not necessarily good translations. We all want our models to be such that high model scores correspond to good translations, and that low model scores correspond to bad translations. But unfortunately, our models do not innately have this characteristic. We all know this.
We also know a good way to deal with this shortcoming, namely tuning. Tuning is the process by which we attempt to ensure that high model scores correspond to high-quality translations, and that low model scores correspond to low-quality translations.

If you can design models that naturally correspond with translation quality without tuning, that's great. If you can do that, you've got a great shot at winning a Best Paper award at ACL. In the meantime, you may want to consider an apology for your rude behavior and unprofessional attitude.

Goodbye.
Lane

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
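Lane's point, that the decoder maximises a model score which need not track quality until the feature weights are tuned, can be illustrated with a small sketch. Everything below (feature names, probabilities, weights) is invented for illustration; this is not Moses code.

```python
import math

# Log-linear model score: a weighted sum of log feature values.  The decoder
# ranks hypotheses by this score, NOT by translation quality directly.
def model_score(features, weights):
    return sum(weights[name] * math.log(features[name]) for name in features)

# Two invented hypotheses: hyp_a has a weak translation-model probability but
# a strong language-model probability; hyp_b is the reverse.
hyp_a = {"tm": 0.02, "lm": 0.5}
hyp_b = {"tm": 0.5, "lm": 0.05}

default_weights = {"tm": 1.0, "lm": 1.0}  # untuned: all features weighted equally
tuned_weights = {"tm": 1.0, "lm": 3.0}    # "tuned": LM feature upweighted

# Untuned weights prefer hyp_b; the reweighted model prefers hyp_a.  The
# hypotheses did not change -- only the weights did, which is all tuning does.
print(model_score(hyp_a, default_weights) > model_score(hyp_b, default_weights))  # False
print(model_score(hyp_a, tuned_weights) > model_score(hyp_b, tuned_weights))      # True
```

If the weights are left at arbitrary defaults, whichever hypothesis the decoder prefers is, as Hieu puts it later in the thread, pot luck.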
Re: [Moses-support] Major bug found in Moses
I think it is you who seems to have missed the point. If the default behaviour is giving BLEU scores considerably lower than the BLEU score obtained from merely selecting the most likely translation of each phrase, then there is evidently something very wrong with the default behaviour. If you can't see something as blindingly simple as that, then at this point I'm thinking this really isn't a field I want anything to do with.

James

From: Matthias Huck mh...@inf.ed.ac.uk
Sent: Friday, June 19, 2015 10:45 PM
To: Read, James C
Cc: Hieu Hoang; moses-support@mit.edu; Arnold, Doug
Subject: Re: [Moses-support] Major bug found in Moses

Hi James,

Well, it's pretty straightforward: the decoder's job is to find the hypothesis with the maximum model score. That's why everybody builds models which assign high model scores to high-quality translations. Unfortunately, you missed this last point in your own work.

Cheers,
Matthias

On Fri, 2015-06-19 at 14:15, Read, James C wrote:

I'm gonna try once more. This is what he said: "the decoder's job is NOT to find the high quality translation". The next time I have a panel of potential investors in front of me I'm gonna pass that line by them and see how it goes down. I stress the words HIGH QUALITY TRANSLATION. Please promise me that the next time you put in a bid for funding you will guarantee your prospective funders that under no circumstances will you attempt to design a system which searches for HIGH QUALITY TRANSLATION.

James

From: Matthias Huck mh...@inf.ed.ac.uk
Sent: Friday, June 19, 2015 5:08 PM
To: Read, James C
Cc: Hieu Hoang; moses-support@mit.edu; Arnold, Doug
Subject: Re: [Moses-support] Major bug found in Moses

Hi James,

Yes, he just said that. The decoder's job is to find the hypothesis with the maximum model score. That's one reason why your work is flawed. You did not care at all whether your model score correlates with BLEU or not.
Cheers,
Matthias

On Fri, 2015-06-19 at 13:24, Read, James C wrote:

I quote: "the decoder's job is NOT to find the high quality translation". Did you REALLY just say that?

James

From: Hieu Hoang hieuho...@gmail.com
Sent: Wednesday, June 17, 2015 9:00 PM
To: Read, James C
Cc: Kenneth Heafield; moses-support@mit.edu; Arnold, Doug
Subject: Re: [Moses-support] Major bug found in Moses

The decoder's job is NOT to find the high quality translation (as measured by BLEU). Its job is to find translations with high model score. You need the tuning to make sure high quality translation correlates with high model score. If you don't tune, it's pot luck what quality you get. You should tune with the features you use.

Hieu Hoang
Researcher
New York University, Abu Dhabi
http://www.hoang.co.uk/hieu

On 17 June 2015 at 21:52, Read, James C jcr...@essex.ac.uk wrote:

The analogy doesn't seem to be helping me understand just how exactly it is a desirable quality of a TM to a) completely break down if no LM is used (thank you for showing that such is not always the case) and b) be dependent on a tuning step to help it find the higher-scoring translations. What you seem to be essentially saying is that the TM cannot find the higher-scoring translations because I didn't pretune the system to do so. And I am supposed to accept that such is a desirable quality of a system whose very job is to find the higher-scoring translations. Further, I am still unclear which features you require a system to be tuned on. At the very least it seems that I have discovered the selection process that tuning seems to be making up for in some unspecified and altogether opaque way.

James

From: Hieu Hoang hieuho...@gmail.com
Sent: Wednesday, June 17, 2015 8:34 PM
To: Read, James C; Kenneth Heafield; moses-support@mit.edu
Cc: Arnold, Doug
Subject: Re: [Moses-support] Major bug found in Moses

4 BLEU is nothing to sniff at :) I was answering Ken's tangential assertion that LMs are needed for tuning.
I have some sympathy for you. You're looking at ways to improve translation by reducing the search space. I've bashed my head against this wall for a while as well, without much success. However, as everyone is telling you, you haven't understood the role of tuning. Without tuning, you're pointing your lab rat to some random part of the search space, instead of away from the furry animal.
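The benchmark James keeps invoking, picking the single most likely translation of each source phrase with no language model, reordering, or tuning, is simple enough to sketch. The phrase table, phrases, and probabilities below are all invented for illustration:

```python
# Invented toy phrase table: source phrase -> {target phrase: p(e|f)}.
phrase_table = {
    "das haus": {"the house": 0.6, "the home": 0.3, "the building": 0.1},
    "ist klein": {"is small": 0.7, "is little": 0.2, "small is": 0.1},
}

def greedy_translate(source_phrases):
    # Take the single highest-probability target for each source phrase,
    # ignoring the language model, reordering, and all other features.
    return " ".join(max(phrase_table[p], key=phrase_table[p].get)
                    for p in source_phrases)

print(greedy_translate(["das haus", "ist klein"]))  # the house is small
```

The thread's dispute is precisely whether an untuned full decoder scoring worse than this trivial baseline indicates a bug (James's view) or just an untuned log-linear model behaving as specified (everyone else's view).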
Re: [Moses-support] Major bug found in Moses
Hi James,

Irrespective of the fact that you need to tune the weights of the log-linear model: let me provide more references in order to shed light on how well established simple pruning techniques are in our field as well as in related fields (namely, automatic speech recognition). This list of references might not be what you are looking for, but maybe other readers can benefit.

V. Steinbiss, B. Tran, H. Ney. Improvements in Beam Search. In Proc. of the Int. Conf. on Spoken Language Processing (ICSLP'94), pages 2143-2146, Yokohama, Japan, Sept. 1994. http://www.steinbiss.de/vst94d.pdf

R. Zens, F. J. Och, H. Ney. Phrase-Based Statistical Machine Translation. In German Conf. on Artificial Intelligence (KI), pages 18-32, Aachen, Germany, Sept. 2002. https://www-i6.informatik.rwth-aachen.de/publications/download/434/Zens-KI-2002.pdf

Philipp Koehn. Pharaoh: A Beam Search Decoder for Phrase-Based Statistical Machine Translation Models. In Proc. of AMTA, pages 115-124, Washington, DC, USA, Sept./Oct. 2004. http://homepages.inf.ed.ac.uk/pkoehn/publications/pharaoh-amta2004.pdf

Robert C. Moore, Chris Quirk. Faster Beam-Search Decoding for Phrasal Statistical Machine Translation. In Proc. of MT Summit XI, European Association for Machine Translation, Sept. 2007. http://research.microsoft.com/pubs/68097/mtsummit2007_beamsearch.pdf

Richard Zens, Hermann Ney. Improvements in Dynamic Programming Beam Search for Phrase-Based Statistical Machine Translation. In Proc. of the International Workshop on Spoken Language Translation (IWSLT), Honolulu, HI, USA, Oct. 2008. http://www.mt-archive.info/05/IWSLT-2008-Zens.pdf

Cheers,
Matthias

On Wed, 2015-06-24 at 13:11, Read, James C wrote:

Thank you for reading very carefully the draft paper I provided a link to and noticing that the Johnson paper is duly cited there.
Given that you had already noticed this, I shall not proceed to explain the blindingly obvious differences between my very simple filter and their filter based on Fisher's exact test. Other than that, it seems painfully clear that the point I meant to make has not been understood entirely. If the default behaviour produces BLEU scores considerably lower than merely selecting the most likely translation of each phrase, then evidently there is something very wrong with the default behaviour. If we cannot agree on something as obvious as that, then I really can't see this discussion making any productive progress.

James

From: moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch
Sent: Friday, June 19, 2015 8:25 PM
To: moses-support@mit.edu
Subject: Re: [Moses-support] Major bug found in Moses

[sorry for the garbled message before]

You are right, the idea is pretty obvious. It roughly corresponds to 'histogram pruning' in this paper:

Zens, R., Stanton, D., Xu, P. (2012). A Systematic Comparison of Phrase Table Pruning Techniques. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 972-983.

The idea has been described in the literature before that (for instance, Johnson et al. (2007) only use the top 30 phrase pairs per source phrase), and may have been used in practice for even longer. If you read the paper above, you will find that histogram pruning does not improve translation quality on a state-of-the-art SMT system, and performs poorly compared to more advanced pruning techniques.

On 19.06.2015 17:49, Read, James C. wrote:

So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious?
--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
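The histogram pruning Rico describes, keeping only the k most probable phrase pairs per source phrase (Johnson et al. (2007) keep the top 30), amounts to a sort and a slice. The toy table below is invented for illustration:

```python
def histogram_prune(phrase_table, k=30):
    # Keep only the k highest-probability target phrases per source phrase.
    return {
        src: dict(sorted(targets.items(), key=lambda kv: kv[1], reverse=True)[:k])
        for src, targets in phrase_table.items()
    }

# Invented example, pruned to k=2 for demonstration.
table = {"maison": {"house": 0.5, "home": 0.3, "mansion": 0.15, "shack": 0.05}}
print(histogram_prune(table, k=2))  # {'maison': {'house': 0.5, 'home': 0.3}}
```

As the Zens et al. (2012) comparison cited above reports, this simple criterion is easy to implement but performs poorly against significance-based and entropy-based pruning on a tuned, state-of-the-art system.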
Re: [Moses-support] Major bug found in Moses
what *i* would do is tune my systems.

~amittai

On 6/24/15 09:15, Read, James C wrote:

Thank you for such an invitation. Let's see. Given the choice of a) reading through thousands of lines of code trying to figure out why the default behaviour performs considerably worse than merely selecting the most likely translation of each phrase, or b) spending much less time implementing a simple system that does just that, which one would you do? For all I know, maybe I've already implemented such a system that does just that, and not only that, improves considerably on such a basic benchmark. But given that on this list we don't seem to be able to accept that there is a problem with the default behaviour of Moses, I can only conclude that nobody would be interested in access to the code of such a system.

James

From: amittai axelrod amit...@umiacs.umd.edu
Sent: Friday, June 19, 2015 7:52 PM
To: Read, James C; Lane Schwartz
Cc: moses-support@mit.edu; Philipp Koehn
Subject: Re: [Moses-support] Major bug found in Moses

if we don't understand the problem, how can we possibly fix it? all the relevant code is open source. go for it!

~amittai
Re: [Moses-support] Major bug found in Moses
As the title of this thread makes clear, the purpose of reporting the bug was not to invite a discussion about conclusions made in my draft paper. Clearly a community that builds its careers around research in SMT is unlikely to agree with those kinds of conclusions. The purpose was to report the flaw in the default behaviour of Moses in the hope that we could all agree that something ought to be done about it. So far you seem to be the only one who has come even close to acknowledging that there is a problem with Moses' default behaviour.

James

From: Lane Schwartz dowob...@gmail.com
Sent: Wednesday, June 24, 2015 4:43 PM
To: Read, James C
Cc: Rico Sennrich; moses-support@mit.edu
Subject: Re: [Moses-support] Major bug found in Moses

On Wed, Jun 24, 2015 at 8:11 AM, Read, James C jcr...@essex.ac.uk wrote:

Other than that it seems painfully clear that the point I meant to make has not been understood entirely. If the default behaviour produces BLEU scores considerably lower than merely selecting the most likely translation of each phrase then evidently there is something very wrong with the default behaviour. If we cannot agree on something as obvious as that then I really can't see this discussion making any productive progress.

James, I understand your point. I think that the others who have responded also understand your point. We simply disagree with your conclusion. I encourage you to consider the possibility that if the many experts in this field who have responded all think that your conclusion is flawed, then there might be something to that. I will agree, though, that this is a good time to conclude this discussion.

Sincerely,
Lane Schwartz
Re: [Moses-support] Major bug found in Moses
So you still think it's fine that the default would perform 37 BLEU points worse than just selecting the most likely translation of each phrase? You know, I think I would have to try really hard to design a system that performed so poorly.

James

From: amittai axelrod amit...@umiacs.umd.edu
Sent: Wednesday, June 24, 2015 5:36 PM
To: Read, James C; Lane Schwartz
Cc: moses-support@mit.edu; Philipp Koehn
Subject: Re: [Moses-support] Major bug found in Moses

what *i* would do is tune my systems.

~amittai
Re: [Moses-support] Major bug found in Moses
Thank you for such an invitation. Let's see. Given the choice of a) reading through thousands of lines of code trying to figure out why the default behaviour performs considerably worse than merely selecting the most likely translation of each phrase, or b) spending much less time implementing a simple system that does just that, which one would you do? For all I know, maybe I've already implemented such a system that does just that, and not only that, improves considerably on such a basic benchmark. But given that on this list we don't seem to be able to accept that there is a problem with the default behaviour of Moses, I can only conclude that nobody would be interested in access to the code of such a system.

James

From: amittai axelrod amit...@umiacs.umd.edu
Sent: Friday, June 19, 2015 7:52 PM
To: Read, James C; Lane Schwartz
Cc: moses-support@mit.edu; Philipp Koehn
Subject: Re: [Moses-support] Major bug found in Moses

if we don't understand the problem, how can we possibly fix it? all the relevant code is open source. go for it!

~amittai
Re: [Moses-support] Major bug found in Moses
On Wed, Jun 24, 2015 at 8:11 AM, Read, James C jcr...@essex.ac.uk wrote:

Other than that it seems painfully clear that the point I meant to make has not been understood entirely. If the default behaviour produces BLEU scores considerably lower than merely selecting the most likely translation of each phrase then evidently there is something very wrong with the default behaviour. If we cannot agree on something as obvious as that then I really can't see this discussion making any productive progress.

James, I understand your point. I think that the others who have responded also understand your point. We simply disagree with your conclusion. I encourage you to consider the possibility that if the many experts in this field who have responded all think that your conclusion is flawed, then there might be something to that. I will agree, though, that this is a good time to conclude this discussion.

Sincerely,
Lane Schwartz
Re: [Moses-support] Major bug found in Moses
It would be really wonderful if Moses had an out-of-the-box example that ran without further tuning. Would you be willing to create that for us? We would greatly appreciate it.

The open-source community exists on a somewhat different model than the commercial software community. In the open-source community, if a feature doesn't exist, and if you believe it should exist, then the correct response is "may I contribute this feature to the codebase, please?" The fact that no such feature currently exists in Moses means that none of its current users have ever had a need for it. That probably means that all of its current users are machine translation experts, who have no need for an out-of-the-box example that runs without tuning. You are quite correct that it would be nice to expand the user base, so that it includes people who are not machine translation experts, but just want a tool that runs reasonably well out of the box. Since nobody is paid to maintain Moses, however, nobody has ever yet had sufficient incentive to create such an example. If you believe that you have sufficient incentive to create such an example, then please do; we would appreciate it. Thanks.

-----Original Message-----
From: moses-support-boun...@mit.edu [mailto:moses-support-boun...@mit.edu] On Behalf Of Read, James C
Sent: Wednesday, June 24, 2015 10:29 AM
To: John D. Burger
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] Major bug found in Moses

Please allow me to give a synthesis of my understanding of your response: a) we understand that out of the box Moses performs notably less well than merely selecting the most likely translation for each phrase; b) we don't see this as a problem because for years we've been applying a different type of fix; c) we have no intention of rectifying the problem or even acknowledging that there is a problem; d) we would rather continue performing this gratuitous step and insisting that our users perform it also.

Please explain to me: why even bother running the training process if you have already decided that the default setup should not be designed to maximise on the probabilities learned during that step?

James

From: John D. Burger j...@mitre.org
Sent: Wednesday, June 24, 2015 6:03 PM
To: Read, James C
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] Major bug found in Moses

On Jun 24, 2015, at 10:47, Read, James C jcr...@essex.ac.uk wrote:

So you still think it's fine that the default would perform at 37 BLEU points less than just selecting the most likely translation of each phrase?

Yes, I'm pretty sure we all think that's fine, because one of the steps of building a system is tuning. Is this really the essence of your complaint? That the behavior without tuning is not very good? (Please try to reply without your usual snarkiness.)

- John Burger
  MITRE
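The tuning step John refers to searches for feature weights that maximise a quality metric (BLEU in practice) on a held-out development set. Below is a deliberately crude random-search sketch of that idea; real tuners use MERT, MIRA, or similar, and every candidate, feature value, and metric here is invented for illustration:

```python
import random

# Toy "decoder": each source has candidate translations with two invented
# feature values; the decoder picks the candidate with the best weighted score.
CANDIDATES = {
    "src1": [("good translation", (0.2, 0.9)), ("bad translation", (0.9, 0.1))],
    "src2": [("also good", (0.1, 0.8)), ("also bad", (0.8, 0.2))],
}

def decode(src, weights):
    return max(CANDIDATES[src],
               key=lambda cand: sum(w * f for w, f in zip(weights, cand[1])))[0]

def accuracy(outputs, references):  # crude stand-in for BLEU
    return sum(o == r for o, r in zip(outputs, references)) / len(references)

def tune(dev_set, trials=200, seed=0):
    # Random search over weight vectors, keeping whichever weights score best
    # on the dev set.  Real systems use MERT/MIRA instead of random search.
    rng = random.Random(seed)
    best_weights, best_score = None, -1.0
    for _ in range(trials):
        weights = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        outputs = [decode(src, weights) for src, _ in dev_set]
        score = accuracy(outputs, [ref for _, ref in dev_set])
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score

dev = [("src1", "good translation"), ("src2", "also good")]
weights, score = tune(dev)
print(score)  # the best weights found reach accuracy 1.0 on this toy dev set
```

With default (equal) weights the decoder would pick the "bad" candidates here; the search finds weights under which model score and the quality metric agree, which is the whole argument the list is making to James.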
Re: [Moses-support] Major bug found in Moses
On Jun 24, 2015, at 10:47 , Read, James C jcr...@essex.ac.uk wrote: So you still think it's fine that the default would perform at 37 BLEU points less than just selecting the most likely translation of each phrase? Yes, I'm pretty sure we all think that's fine, because one of the steps of building a system is tuning. Is this really the essence of your complaint? That the behavior without tuning is not very good? (Please try to reply without your usual snarkiness.) - John Burger MITRE You know I think I would have to try really hard to design a system that performed so poorly. James From: amittai axelrod amit...@umiacs.umd.edu Sent: Wednesday, June 24, 2015 5:36 PM To: Read, James C; Lane Schwartz Cc: moses-support@mit.edu; Philipp Koehn Subject: Re: [Moses-support] Major bug found in Moses what *i* would do is tune my systems. ~amittai On 6/24/15 09:15, Read, James C wrote: Thank you for such an invitation. Let's see. Given the choice of a) reading through thousands of lines of code trying to figure out why the default behaviour performs considerably worse than merely selecting the most likely translation of each phrase or b) spending much less time implementing a simple system that does just that which one would you do? For all know maybe I've already implemented such a system that does just that and not only that improves considerably on such a basic benchmark. But given that on this list we don't seem to be able to accept that there is a problem with the default behaviour of Moses I can only conclude that nobody would be interested in access to the code of such a system. James From: amittai axelrod amit...@umiacs.umd.edu Sent: Friday, June 19, 2015 7:52 PM To: Read, James C; Lane Schwartz Cc: moses-support@mit.edu; Philipp Koehn Subject: Re: [Moses-support] Major bug found in Moses if we don't understand the problem, how can we possibly fix it? all the relevant code is open source. go for it! 
~amittai On 6/19/15 12:49, Read, James C wrote: So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? Your telling me that redesigning the search algorithm to prefer higher scoring phrase pairs is all we need to do to get a best paper at ACL? James *From:* Lane Schwartz dowob...@gmail.com *Sent:* Friday, June 19, 2015 7:40 PM *To:* Read, James C *Cc:* Philipp Koehn; Burger, John D.; moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.uk mailto:jcr...@essex.ac.uk wrote: What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. That is the expected behavior. What I have tried to tell you, and what other people have tried to tell you, is that translations with high model scores are not necessarily good translations. We all want our models to be such that high model scores correspond to good translations, and that low model scores correspond with bad translations. But unfortunately, our models do not innately have this characteristic. We all know this. We also know a good way to deal with this shortcoming, namely tuning. Tuning is the process by which we attempt to ensure that high model scores correspond to high quality translations, and that low model scores correspond to low quality translations. If you can design models that naturally correspond with translation quality without tuning, that's great. If you can do that, you've got a great shot at winning a Best Paper award at ACL. In the meantime, you may want to consider an apology for your rude behavior and unprofessional attitude. Goodbye. 
Lane ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Thank you for reading very carefully the draft paper I provided a link to and noticing that the Johnson paper is duly cited there. Given that you had already noticed this I shall not proceed to explain the blindingly obvious differences between my very simple filter and their filter based on Fisher's exact test. Other than that it seems painfully clear that the point I meant to make has not been understood entirely. If the default behaviour produces BLEU scores considerably lower than merely selecting the most likely translation of each phrase then evidently there is something very wrong with the default behaviour. If we cannot agree on something as obvious as that then I really can't see this discussion making any productive progress. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Friday, June 19, 2015 8:25 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses [sorry for the garbled message before] you are right. The idea is pretty obvious. It roughly corresponds to 'Histogram pruning' in this paper: Zens, R., Stanton, D., Xu, P. (2012). A Systematic Comparison of Phrase Table Pruning Techniques. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 972-983. The idea has been described in the literature before that (for instance, Johnson et al. (2007) only use the top 30 phrase pairs per source phrase), and may have been used in practice for even longer. If you read the paper above, you will find that histogram pruning does not improve translation quality on a state-of-the-art SMT system, and performs poorly compared to more advanced pruning techniques. On 19.06.2015 17:49, Read, James C. wrote: So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious?
___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
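The histogram pruning Rico describes (keep only the top-k highest-scoring phrase pairs per source phrase; Johnson et al. keep the top 30) can be sketched in a few lines. This is an illustration only: the toy table carries a single score per pair, whereas a real Moses phrase table carries four translation scores.

```python
from collections import defaultdict

def histogram_prune(phrase_table, k):
    """Keep the k highest-scoring target phrases per source phrase.

    phrase_table: list of (src, tgt, score) tuples. The single score
    stands in for whichever column you prune on (e.g. p(e|f)).
    """
    by_source = defaultdict(list)
    for src, tgt, score in phrase_table:
        by_source[src].append((tgt, score))
    pruned = []
    for src, pairs in by_source.items():
        pairs.sort(key=lambda p: p[1], reverse=True)
        pruned.extend((src, tgt, s) for tgt, s in pairs[:k])
    return pruned

# invented example entries
table = [
    ("the house", "das Haus", 0.7),
    ("the house", "das Gebaeude", 0.2),
    ("the house", "Haus", 0.1),
    ("green", "gruen", 0.9),
]
print(len(histogram_prune(table, 2)))  # 3: top 2 for "the house", 1 for "green"
```

As the Zens et al. comparison shows, this is simple to implement but is outperformed by more advanced pruning criteria on strong systems.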
Re: [Moses-support] Major bug found in Moses
On Wed, Jun 24, 2015 at 9:05 AM, Read, James C jcr...@essex.ac.uk wrote: As the title of this thread makes clear the purpose of reporting the bug was not to invite a discussion about conclusions made in my draft paper. Clearly a community that builds its career around research in SMT is unlikely to agree with those kinds of conclusions. The purpose was to report the flaw in the default behaviour of Moses in the hope that we could all agree that something ought to be done about it. So far you seem to be the only one who has come even close to acknowledging that there is a problem with Moses default behaviour. James, I wasn't talking about the conclusion in your paper. I was talking about the conclusion in your email: If the default behaviour produces BLEU scores considerably lower than merely selecting the most likely translation of each phrase then evidently there is something very wrong with the default behaviour. Your conclusion, quoted above, is seriously flawed. There is not something very wrong with the default behavior of Moses. You have not exposed a bug in Moses. What you have exposed is your own lack of understanding of modern statistical machine translation, and your unwillingness to listen when others take the time to explain how and why you are mistaken. I am happy to help explain things to people who are willing to listen. However, you have shown yourself to be not only rude but obstinate and willfully ignorant. I hope that others who find this thread may find it informative. You appear to have learned nothing from it. Until you become willing to listen to others, and until you take a statistical machine translation class and are willing to pay attention to what you learn there, I don't see any point in taking the time to explain things further. As far as I am concerned, this discussion is over. Sincerely, Lane Schwartz ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Please allow me to give a synthesis of my understanding of your response: a) we understand that out of the box Moses performs notably less well than merely selecting the most likely translation for each phrase b) we don't see this as a problem because for years we've been applying a different type of fix c) we have no intention of rectifying the problem or even acknowledging that there is a problem d) we would rather continue performing this gratuitous step and insisting that our users perform it also Please explain to me. Why even bother running the training process if you have already decided that the default setup should not be designed to maximise on the probabilities learned during that step? James From: John D. Burger j...@mitre.org Sent: Wednesday, June 24, 2015 6:03 PM To: Read, James C Cc: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On Jun 24, 2015, at 10:47 , Read, James C jcr...@essex.ac.uk wrote: So you still think it's fine that the default would perform at 37 BLEU points less than just selecting the most likely translation of each phrase? Yes, I'm pretty sure we all think that's fine, because one of the steps of building a system is tuning. Is this really the essence of your complaint? That the behavior without tuning is not very good? (Please try to reply without your usual snarkiness.) - John Burger MITRE You know I think I would have to try really hard to design a system that performed so poorly. James From: amittai axelrod amit...@umiacs.umd.edu Sent: Wednesday, June 24, 2015 5:36 PM To: Read, James C; Lane Schwartz Cc: moses-support@mit.edu; Philipp Koehn Subject: Re: [Moses-support] Major bug found in Moses what *i* would do is tune my systems. ~amittai On 6/24/15 09:15, Read, James C wrote: Thank you for such an invitation. Let's see. 
Given the choice of a) reading through thousands of lines of code trying to figure out why the default behaviour performs considerably worse than merely selecting the most likely translation of each phrase or b) spending much less time implementing a simple system that does just that which one would you do? For all know maybe I've already implemented such a system that does just that and not only that improves considerably on such a basic benchmark. But given that on this list we don't seem to be able to accept that there is a problem with the default behaviour of Moses I can only conclude that nobody would be interested in access to the code of such a system. James From: amittai axelrod amit...@umiacs.umd.edu Sent: Friday, June 19, 2015 7:52 PM To: Read, James C; Lane Schwartz Cc: moses-support@mit.edu; Philipp Koehn Subject: Re: [Moses-support] Major bug found in Moses if we don't understand the problem, how can we possibly fix it? all the relevant code is open source. go for it! ~amittai On 6/19/15 12:49, Read, James C wrote: So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? Your telling me that redesigning the search algorithm to prefer higher scoring phrase pairs is all we need to do to get a best paper at ACL? James *From:* Lane Schwartz dowob...@gmail.com *Sent:* Friday, June 19, 2015 7:40 PM *To:* Read, James C *Cc:* Philipp Koehn; Burger, John D.; moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.uk mailto:jcr...@essex.ac.uk wrote: What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. 
That is the expected behavior. What I have tried to tell you, and what other people have tried to tell you, is that translations with high model scores are not necessarily good translations. We all want our models to be such that high model scores correspond to good translations, and that low model scores correspond with bad translations. But unfortunately, our models do not innately have this characteristic. We all know this. We also know a good way to deal with this shortcoming, namely tuning. Tuning is the process by which we attempt to ensure that high model scores correspond to high quality translations, and that low model scores correspond to low quality translations. If you can design models that naturally correspond with translation quality without tuning, that's great. If you can do that, you've got a great shot at winning a Best Paper award at ACL
Re: [Moses-support] Major bug found in Moses
James, (1) Did you ever look at the model scores? The decoder's job is to find the hypotheses with the highest model score and if your baseline system finds translations with higher model scores than your filtered system then there is no bug in Moses. (2) You should stop talking about BLEU scores as some kind of evidence that there is a bug in the software. We have an unpublished paper in which we show that using BLEU as the objective function to optimize translations in decoding results in terrible translations, too: http://www2.lingfil.uu.se/SLTC2014/abstracts/sltc2014_submission_21.pdf (3) Tuning is part of the training procedure for log-linear models. There is no point in leaving it out (as many others have told you already). (4) Stop driving on the wrong side of the street ... Jörg On Jun 24, 2015, at 5:21 PM, Read, James C wrote: May I humbly suggest that we do some market research and see how many institutions/organisations out there dream about an MT system that out of the box performs at 37 BLEU points less that merely substituting each phrase for its most likely translation? I dare say that most users would expect a system to perform *better* than such a blatantly obvious baseline out of the box. So, please, can we stop trying to play the academic high ground here and just accept that the default behaviour of Moses is much less than desirable? James From: Lane Schwartz dowob...@gmail.com Sent: Wednesday, June 24, 2015 5:56 PM To: Read, James C Cc: Rico Sennrich; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On Wed, Jun 24, 2015 at 9:05 AM, Read, James C jcr...@essex.ac.uk wrote: As the title of this thread makes clear the purpose of reporting the bug was not to invite a discussion about conclusions made in my draft paper. Clearly a community that builds its career around research in SMT is unlikely to agree with those kinds of conclusions. 
The purpose was to report the flaw in the default behaviour of Moses in the hope that we could all agree that something ought to be done about it. So far you seem to be the only one who has come even close to acknowledging that there is a problem with Moses default behaviour. James, I wasn't talking about the conclusion in your paper. I was talking about the conclusion in your email: If the default behaviour produces BLEU scores considerably lower than merely selecting the most likely translation of each phrase then evidently there is something very wrong with the default behaviour. Your conclusion, quoted above, is seriously flawed. There is not something very wrong with the default behavior of Moses. You have not exposed a bug in Moses. What you have exposed is your own lack of understanding of modern statistical machine translation, and your unwillingness to listen when others take the time to explain how and why you are mistaken. I am happy to help explain things to people who are willing to listen. However, you have shown yourself to be not only rude but obstinate and willfully ignorant. I hope that others who find this thread may find it informative. You appear to have learned nothing from it. Until you become willing to listen to others, and until you take a statistical machine translation class and are willing to pay attention to what you learn there, I don't see any point in taking the time to explain things further. As far as I am concerned, this discussion is over. Sincerely, Lane Schwartz ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
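Jörg's first point above (look at the model scores, not BLEU) can be checked mechanically from the two decoders' n-best output: if the unfiltered baseline finds hypotheses with model scores at least as high as the filtered system's, the search did its job and the complaint is about the model, not a bug. A minimal sketch, assuming the standard Moses n-best format `sent-id ||| hypothesis ||| feature values ||| total score`; the example lines are invented:

```python
def best_model_scores(nbest_lines):
    """Return {sentence_id: best total model score} from Moses n-best lines."""
    best = {}
    for line in nbest_lines:
        fields = [f.strip() for f in line.split("|||")]
        sid, score = int(fields[0]), float(fields[3])
        if sid not in best or score > best[sid]:
            best[sid] = score
    return best

baseline = ["0 ||| ein haus ||| tm: -1.2 ||| -3.5",
            "0 ||| das haus ||| tm: -0.8 ||| -2.9"]
filtered = ["0 ||| das haus ||| tm: -0.8 ||| -3.1"]

# If the baseline's best score is >= the filtered system's, the decoder
# maximized its objective; the low BLEU is a modelling issue, not a search bug.
print(best_model_scores(baseline)[0] >= best_model_scores(filtered)[0])  # True
```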
Re: [Moses-support] Major bug found in Moses
John, to my knowledge, you still have not reported BLEU scores for the following experiment: The moses.ini in your unfiltered translation experiment should assign weights of 0 0 0 1 to the TM features. (requested by Matt on June 17). Would you please run this experiment and report the results? Otherwise you are asking the decoder to select phrases with the highest sum of all scores, but expecting it instead to select the phrase whose fourth score alone is highest; even by primary-school math those are two completely different things. Gregor -Original Message- From: moses-support-boun...@mit.edu on behalf of Read, James C jcr...@essex.ac.uk Date: Wednesday 24 June 2015 17:29 To: John D. Burger j...@mitre.org Cc: moses-support@mit.edu moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Please allow me to give a synthesis of my understanding of your response: a) we understand that out of the box Moses performs notably less well than merely selecting the most likely translation for each phrase b) we don't see this as a problem because for years we've been applying a different type of fix c) we have no intention of rectifying the problem or even acknowledging that there is a problem d) we would rather continue performing this gratuitous step and insisting that our users perform it also Please explain to me. Why even bother running the training process if you have already decided that the default setup should not be designed to maximise on the probabilities learned during that step? James From: John D. Burger j...@mitre.org Sent: Wednesday, June 24, 2015 6:03 PM To: Read, James C Cc: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On Jun 24, 2015, at 10:47 , Read, James C jcr...@essex.ac.uk wrote: So you still think it's fine that the default would perform at 37 BLEU points less than just selecting the most likely translation of each phrase?
Yes, I'm pretty sure we all think that's fine, because one of the steps of building a system is tuning. Is this really the essence of your complaint? That the behavior without tuning is not very good? (Please try to reply without your usual snarkiness.) - John Burger MITRE You know I think I would have to try really hard to design a system that performed so poorly. James From: amittai axelrod amit...@umiacs.umd.edu Sent: Wednesday, June 24, 2015 5:36 PM To: Read, James C; Lane Schwartz Cc: moses-support@mit.edu; Philipp Koehn Subject: Re: [Moses-support] Major bug found in Moses what *i* would do is tune my systems. ~amittai On 6/24/15 09:15, Read, James C wrote: Thank you for such an invitation. Let's see. Given the choice of a) reading through thousands of lines of code trying to figure out why the default behaviour performs considerably worse than merely selecting the most likely translation of each phrase or b) spending much less time implementing a simple system that does just that which one would you do? For all know maybe I've already implemented such a system that does just that and not only that improves considerably on such a basic benchmark. But given that on this list we don't seem to be able to accept that there is a problem with the default behaviour of Moses I can only conclude that nobody would be interested in access to the code of such a system. James From: amittai axelrod amit...@umiacs.umd.edu Sent: Friday, June 19, 2015 7:52 PM To: Read, James C; Lane Schwartz Cc: moses-support@mit.edu; Philipp Koehn Subject: Re: [Moses-support] Major bug found in Moses if we don't understand the problem, how can we possibly fix it? all the relevant code is open source. go for it! ~amittai On 6/19/15 12:49, Read, James C wrote: So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? 
Your telling me that redesigning the search algorithm to prefer higher scoring phrase pairs is all we need to do to get a best paper at ACL? James --- - *From:* Lane Schwartz dowob...@gmail.com *Sent:* Friday, June 19, 2015 7:40 PM *To:* Read, James C *Cc:* Philipp Koehn; Burger, John D.; moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.uk mailto:jcr...@essex.ac.uk wrote: What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. That is the expected behavior. What I have tried to tell you
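For reference, Gregor's requested experiment corresponds to a weight block along these lines in moses.ini. This is a sketch: the feature names are the defaults emitted by train-model.perl and may differ in a given setup, and the non-TM weights are zeroed here as well (Gregor's message only specifies the TM weights) so that the fourth translation score is the only signal the decoder sees:

```ini
[weight]
# zero out everything except the fourth translation-model score,
# so the decoder ranks hypotheses by that score alone
WordPenalty0= 0
PhrasePenalty0= 0
TranslationModel0= 0 0 0 1
Distortion0= 0
```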
Re: [Moses-support] Major bug found in Moses
That would make very cool student projects. Also that video is acing it, even the voice-over is synthetic :) On 23.06.2015 00:27, Ondrej Bojar wrote: ...and I wouldn't be surprised to find Moses also behind this Java-to-C# automatic translation: https://www.youtube.com/watch?v=CHDDNnRm-g8 O. - Original Message - From: Marcin Junczys-Dowmunt junc...@amu.edu.pl To: moses-support@mit.edu Sent: Friday, 19 June, 2015 19:21:45 Subject: Re: [Moses-support] Major bug found in Moses On that interesting idea that moses should be naturally good at translating things, just for general considerations. Since some said this thread has educational value I would like to share something that might not be obvious due to the SMT-biased posts here. Moses is also the _leading_ tool for automatic grammatical error correction (GEC) right now. The first and third system of the CoNLL shared task 2014 were based on Moses. By now I have results that surpass the CoNLL results by far by adding some specialized features to Moses (which thanks to Hieu is very easy). It even gets good results for GEC when you do crazy things like inverting the TM (so it should actually make the input worse) provided you tune on the correct metric and for the correct task. The interaction of all the other features after tuning makes that possible. So, if anything, Moses is just a very flexible text-rewriting tool. Tuning (and data) turns into a translator, GEC tool, POS-tagger, Chunker, Semantic Tagger etc. On 19.06.2015 18:40, Lane Schwartz wrote: On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.uk mailto:jcr...@essex.ac.uk wrote: What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. That is the expected behavior. 
What I have tried to tell you, and what other people have tried to tell you, is that translations with high model scores are not necessarily good translations. We all want our models to be such that high model scores correspond to good translations, and that low model scores correspond with bad translations. But unfortunately, our models do not innately have this characteristic. We all know this. We also know a good way to deal with this shortcoming, namely tuning. Tuning is the process by which we attempt to ensure that high model scores correspond to high quality translations, and that low model scores correspond to low quality translations. If you can design models that naturally correspond with translation quality without tuning, that's great. If you can do that, you've got a great shot at winning a Best Paper award at ACL. In the meantime, you may want to consider an apology for your rude behavior and unprofessional attitude. Goodbye. Lane ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
James, did you try the modifications Philipp suggested (removing the word penalty and lowering p(f|e))? (I doubt it will be enough to get a best paper award, but it would probably improve your bleu, that's always a good start :) ) On Friday, June 19, 2015, Read, James C jcr...@essex.ac.uk wrote: So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? You're telling me that redesigning the search algorithm to prefer higher scoring phrase pairs is all we need to do to get a best paper at ACL? James -- *From:* Lane Schwartz dowob...@gmail.com *Sent:* Friday, June 19, 2015 7:40 PM *To:* Read, James C *Cc:* Philipp Koehn; Burger, John D.; moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.uk wrote: What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. That is the expected behavior. What I have tried to tell you, and what other people have tried to tell you, is that translations with high model scores are not necessarily good translations. We all want our models to be such that high model scores correspond to good translations, and that low model scores correspond with bad translations. But unfortunately, our models do not innately have this characteristic. We all know this. We also know a good way to deal with this shortcoming, namely tuning. 
Tuning is the process by which we attempt to ensure that high model scores correspond to high quality translations, and that low model scores correspond to low quality translations. If you can design models that naturally correspond with translation quality without tuning, that's great. If you can do that, you've got a great shot at winning a Best Paper award at ACL. In the meantime, you may want to consider an apology for your rude behavior and unprofessional attitude. Goodbye. Lane ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
On 19/06/15 19:21, Marcin Junczys-Dowmunt wrote: So, if anything, Moses is just a very flexible text-rewriting tool. Tuning (and data) turns into a translator, GEC tool, POS-tagger, Chunker, Semantic Tagger etc. that's a good point, and the basis of some criticism that can be levelled at the Moses community: because Moses is so flexible, the responsibility is on the user to find the right configuration for a task. I think it is getting harder to find out about all of the settings/models necessary to reproduce a state-of-the-art system, especially outside of an established SMT research group. The result is a high barrier to entry, and frustration on all sides when somebody performs experiments with default settings. To stay with the example of phrase table pruning: this is widely used, and I used count-based pruning, threshold pruning based on p(e|f), and histogram pruning based on the model score in my WMT submission. Can and should we make a wider effort to facilitate the reproduction of systems by disseminating settings or configuration files? This dissemination is partially done by system description papers, but they cannot cover all settings [this would make for a very boring paper]. I put some effort into documenting my WMT submission by releasing EMS configuration files ( https://github.com/rsennrich/wmt2014-scripts/tree/master/example ), and I would be happy to see this done more often. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
I like the idea very much. I would need to discuss this with my colleagues, but I guess we can publish recipes for the MT engines we use in production at WIPO and the UN. They are modelled after some of your WMT systems, but tuned for speed and small size. On 20.06.2015 15:42, Adam Lopez wrote: Can and should we make a wider effort to facilitate the reproduction of systems by disseminating settings or configuration files? This dissemination is partially done by system description papers, but they cannot cover all settings [this would make for a very boring paper]. I put some effort into documenting my WMT submission by releasing EMS configuration files ( https://github.com/rsennrich/wmt2014-scripts/tree/master/example ), and I would be happy to see this done more often. Compare with speech recognition, where the major open source toolkit is Kaldi. One of its stated goals is to collect a set of recipes for reproducing state-of-the-art results. http://kaldi.sourceforge.net/about.html I don't know how well they've succeeded at this. But it's an admirable goal. The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Can and should we make a wider effort to facilitate the reproduction of systems by disseminating settings or configuration files? This dissemination is partially done by system description papers, but they cannot cover all settings [this would make for a very boring paper]. I put some effort into documenting my WMT submission by releasing EMS configuration files ( https://github.com/rsennrich/wmt2014-scripts/tree/master/example ), and I would be happy to see this done more often. Compare with speech recognition, where the major open source toolkit is Kaldi. One of its stated goals is to collect a set of recipes for reproducing state-of-the-art results. http://kaldi.sourceforge.net/about.html I don't know how well they've succeeded at this. But it's an admirable goal. The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
According to your book which I have on my desk the job of the TM is to model the most likely translations and the job of the decoder is to intelligently search the space of translations to find the most likely one/s (I'm paraphrasing of course). Would you like to retract that position and republish a next edition of your book which openly states that Moses when used with no LM or tuning or pruning can and should be expected to perform very poorly and select only the least likely translations? Don't you in the slightest find it worrying that like at least 90% of your code base could be thrown out of the window and high scoring results can be obtained with a simple phrase pair based rule based system? Which would you prefer? Would you prefer to consume computational resources calculating probabilities or get straight to the answer with simple logic and low computational requirements? BE HONEST! James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Philipp Koehn p...@jhu.edu Sent: Thursday, June 18, 2015 9:39 PM To: Burger, John D. Cc: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Hi, I am a great fan of open source software, but there is a danger to view its inner workings as a black box - which leads to strange theories of what is going on, instead of real understanding. But we can try to understand it. In the reported experiment, the language model was removed, while the rest of the system was left unchanged. The default untuned weights that train-model.perl assigns to a model are the following:

WordPenalty0= -1
PhrasePenalty0= 0.2
TranslationModel0= 0.2 0.2 0.2 0.2
Distortion0= 0.3

Since no language model is used, a positive distortion cost will lead the decoder to not use any reordering at all. That's a good thing in this case. The word penalty is used to counteract the language model's preference for short translations. Unchecked, there is now a bias towards too long translations.
Then there is the translation model with its equal weights for p(e|f) and p(f|e). The p(e|f) weight and scores are fine and well. However, p(f|e) only makes sense if you have the Bayes theorem in your mind and a language model in your back. But in the reported setup, there is now a bias to translate into rare English phrases, since these will have high p(f|e) scores. My best guess is that the reported setup translates common function words (such as prepositions) into very long rare English phrases - word penalty likes it, p(f|e) likes it, p(e|f) does not mind enough - which produces a lot of rubbish. By filtering for p(e|f) those junky phrases are removed from the phrase table, restricting the decoder to more reasonable choices. I contend that this is not a bug in the software, but a bug in usage. -phi On Thu, Jun 18, 2015 at 11:32 AM, Burger, John D. j...@mitre.org wrote: On Jun 17, 2015, at 11:54, Read, James C jcr...@essex.ac.uk wrote: The question remains why isn't the system capable of finding the most likely translations without the LM? Even if it weren't ill-posed, I don't find this to be an interesting question at all. This is like trying to improve automobile transmissions by disabling the steering. These are the parts we have, and they all work together. It's not as if human translators don't use their own internal language models. - John Burger MITRE Evidently, if you filter the phrase table then the LM is not as important as you might feel. The question remains why isn't the system capable of finding the most likely translations without the LM? Why do I need to filter to help the system find them? This is undesirable behaviour. Clearly a bug. I include the code I used for filtering. As you can see, the 4th score only was used as the filtering criterion.
#!/usr/bin/perl -w
#
# Program filters phrase table to leave only phrase pairs
# with probability above a threshold
#
use strict;
use warnings;
use Getopt::Long;

my $min;
my $phrase_table;
my $filtered_table;

GetOptions(
    'min=f' => \$min,
    'out=s' => \$filtered_table,
    'in=s'  => \$phrase_table);

die "ERROR: must give threshold and phrase table input file and output file\n"
    unless ($min && $phrase_table && $filtered_table);
die "ERROR: file $phrase_table does not exist\n" unless (-e $phrase_table);

open(PHRASETABLE, '<', $phrase_table)
    or die "FATAL: Could not open phrase table $phrase_table\n";
open(FILTEREDTABLE, '>', $filtered_table)
    or die "FATAL: Could not open phrase table $filtered_table\n";

while (my $line = <PHRASETABLE>) {
    chomp $line;
    my @columns = split('\|\|\|', $line);
    # check that file is a well formatted phrase table
    if (scalar @columns < 4) {
        die "ERROR: input file is not a well formatted phrase table. A phrase table line must have at least 4 fields\n";
    }
    # (the message is truncated in the archive at this point; the rest is
    # reconstructed from the description above: keep only pairs whose
    # 4th score clears the threshold)
    my @scores = split(' ', $columns[2]);
    print FILTEREDTABLE "$line\n" if ($scores[3] >= $min);
}
close(PHRASETABLE);
close(FILTEREDTABLE);
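Philipp's diagnosis quoted above can be made concrete with a toy score computation under the default untuned weights he lists. This is a deliberate simplification, not Moses' actual feature computation: lexical weights, the phrase penalty and distortion are ignored, and all probabilities are invented for illustration.

```python
import math

# Default untuned weights quoted above (train-model.perl):
#   WordPenalty0= -1, TranslationModel0= 0.2 0.2 0.2 0.2
W_WORD, W_TM = -1.0, 0.2

def untuned_score(n_words, p_e_given_f, p_f_given_e):
    # Moses' word-penalty feature fires -1 per output word, so with a
    # weight of -1 every extra word *raises* the score once no language
    # model pushes back -- the length bias Philipp describes.
    return (W_WORD * -n_words
            + W_TM * math.log(p_e_given_f)
            + W_TM * math.log(p_f_given_e))

# a common function word translated two ways (invented numbers):
common = untuned_score(1, p_e_given_f=0.4, p_f_given_e=0.3)   # short, common phrase
rare   = untuned_score(5, p_e_given_f=0.01, p_f_given_e=0.9)  # long, rare phrase
print(rare > common)  # True: length reward plus high p(f|e) wins
```

The long, rare translation wins by several points, matching the observed behaviour: without an LM, the default weights reward exactly the junk that filtering on p(e|f) happens to remove.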
Re: [Moses-support] Major bug found in Moses
James, You may see the techniques that exist as outdated, wrong-headed, and inefficient. You have the right to hold that opinion. It may even be that history proves you right. Progress in science is made by people posing questions - often questions that challenge the status quo - and then doing experiments to answer those questions. However, it is incumbent upon you, the proponent of a new idea, to design good experiments to attempt to prove or disprove your new hypothesis. Dispassionately showing the relative merits and shortcomings of your technique against the existing state of the art is part of that process. I, along with numerous other people on this list, have attempted in good faith to answer your questions, and to provide you with our perspective based on our collective understanding of the problem. You, in turn, have responded belligerently. I suggest that you have a frank conversation with your academic advisor or other appropriate mentor regarding your future. If you intend to pursue a successful career in science, academia, government, or industry, you would do well to reconsider the manner in which you interact with other people, especially people with whom you disagree. In the meantime, I would respectfully request that, until you learn how to interact respectfully with other adults, you refrain from posting to this mailing list. Sincerely, Lane Schwartz On Fri, Jun 19, 2015 at 8:45 AM, Read, James C jcr...@essex.ac.uk wrote: According to your book, which I have on my desk, the job of the TM is to model the most likely translations and the job of the decoder is to intelligently search the space of translations to find the most likely one/s (I'm paraphrasing of course). Would you like to retract that position and republish a next edition of your book which openly states that Moses, when used with no LM or tuning or pruning, can and should be expected to perform very poorly and select only the least likely translations?
Don't you in the slightest find it worrying that like at least 90% of your code base could be thrown out of the window and high scoring results can be obtained with a simple phrase pair based rule based system? Which would you prefer? Would you prefer to consume computational resources calculating probabilities or get straight to the answer with simple logic and low computational requirements? BE HONEST! James -- *From:* moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Philipp Koehn p...@jhu.edu *Sent:* Thursday, June 18, 2015 9:39 PM *To:* Burger, John D. *Cc:* moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses Hi, I am a great fan of open source software, but there is a danger in viewing its inner workings as a black box - which leads to strange theories of what is going on, instead of real understanding. But we can try to understand it. In the reported experiment, the language model was removed, while the rest of the system was left unchanged. The default untuned weights that train-model.perl assigns to a model are the following:

WordPenalty0= -1
PhrasePenalty0= 0.2
TranslationModel0= 0.2 0.2 0.2 0.2
Distortion0= 0.3

Since no language model is used, a positive distortion cost will lead the decoder to not use any reordering at all. That's a good thing in this case. The word penalty is used to counteract the language model's preference for short translations. Unchecked, there is now a bias towards too long translations. Then there is the translation model with its equal weights for p(e|f) and p(f|e). The p(e|f) weight and scores are fine and well. However, p(f|e) only makes sense if you have Bayes' theorem in mind and a language model at your back. But in the reported setup, there is now a bias to translate into rare English phrases, since these will have high p(f|e) scores.
My best guess is that the reported setup translates common function words (such as prepositions) into very long rare English phrases - word penalty likes it, p(f|e) likes it, p(e|f) does not mind enough - which produces a lot of rubbish. By filtering for p(e|f), those junky phrases are removed from the phrase table, restricting the decoder to more reasonable choices. I contend that this is not a bug in the software, but a bug in usage. -phi On Thu, Jun 18, 2015 at 11:32 AM, Burger, John D. j...@mitre.org wrote: On Jun 17, 2015, at 11:54, Read, James C jcr...@essex.ac.uk wrote: The question remains why isn't the system capable of finding the most likely translations without the LM? Even if it weren't ill-posed, I don't find this to be an interesting question at all. This is like trying to improve automobile transmissions by disabling the steering. These are the parts we have, and they all work together. It's not as if human translators don't use their own internal language models. - John Burger
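Philipp's arithmetic can be made concrete with a toy calculation. The Python sketch below is illustrative only: the probabilities are made up, only two of the four translation-model features are used, and the word penalty feature is modelled, as in Moses, as minus the target length, so the untuned weight of -1 turns it into a bonus for long output:

```python
import math

# Untuned defaults from train-model.perl, per Philipp's message
W_WORD_PENALTY = -1.0  # word penalty feature = -(target length)
W_TM = 0.2             # weight on each translation-model feature

def model_score(target_len, p_e_given_f, p_f_given_e):
    # Simplified model score: word penalty plus two TM log-probabilities
    word_penalty_feature = -target_len
    return (W_WORD_PENALTY * word_penalty_feature
            + W_TM * math.log(p_e_given_f)
            + W_TM * math.log(p_f_given_e))

# Hypothetical phrase pairs for a common function word:
common = model_score(1, p_e_given_f=0.4, p_f_given_e=0.3)   # e.g. "of"
junk = model_score(5, p_e_given_f=0.001, p_f_given_e=0.9)   # long, rare English phrase

print(round(common, 3), round(junk, 3))  # 0.576 3.597
print(junk > common)                     # True: the junk phrase wins
```

With no language model to intervene, the long rare phrase scores far higher - which is exactly the bias Philipp describes, and what filtering on p(e|f) works around.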
Re: [Moses-support] Major bug found in Moses
If you want to use an automobile analogy then the TM is the engine which powers the vehicle. You as an investor have a few choices before you. Your objective is to make the car run faster. Would you invest your money in: a) the guy that says it is a desirable feature to keep an inefficient fuel guzzling motor that breaks down constantly such that you need to get out and push it (tuning) so it would be much more preferable to optimise the aerodynamics of the vehicle and install a rear window heater to keep your hands warm while you're pushing it b) the guy that says: Well, here's a stroke of genius. Why don't we build a more powerful engine that uses less fuel and doesn't break down with no need to get out and push (tuning or pruning) Honest replies only requested please. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Burger, John D. j...@mitre.org Sent: Thursday, June 18, 2015 6:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On Jun 17, 2015, at 11:54, Read, James C jcr...@essex.ac.uk wrote: The question remains why isn't the system capable of finding the most likely translations without the LM? Even if it weren't ill-posed, I don't find this to be an interesting question at all. This is like trying to improve automobile transmissions by disabling the steering. These are the parts we have, and they all work together. It's not as if human translators don't use their own internal language models. - John Burger MITRE Evidently, if you filter the phrase table then the LM is not as important as you might feel. The question remains: why isn't the system capable of finding the most likely translations without the LM? Why do I need to filter to help the system find them? This is undesirable behaviour. Clearly a bug. I include the code I used for filtering. As you can see, only the 4th score was used as the filtering criterion.
#!/usr/bin/perl -w
#
# Program filters phrase table to leave only phrase pairs
# with probability above a threshold
#
# Usage: perl <this script> --in <phrase table> --out <filtered table> --min <threshold>
#
use strict;
use warnings;
use Getopt::Long;

my $min;
my $phrase_table;
my $filtered_table;

GetOptions(
    'min=f' => \$min,
    'out=s' => \$filtered_table,
    'in=s'  => \$phrase_table);

die "ERROR: must give threshold, phrase table input file and output file\n"
    unless ($min && $phrase_table && $filtered_table);
die "ERROR: file $phrase_table does not exist\n" unless (-e $phrase_table);

open(PHRASETABLE, '<', $phrase_table)
    or die "FATAL: Could not open phrase table $phrase_table\n";
open(FILTEREDTABLE, '>', $filtered_table)
    or die "FATAL: Could not open output file $filtered_table\n";

while (my $line = <PHRASETABLE>) {
    chomp $line;
    my @columns = split(/\|\|\|/, $line);
    # check that file is a well formatted phrase table
    if (scalar @columns < 4) {
        die "ERROR: input file is not a well formatted phrase table. "
          . "A phrase table must have at least four columns, each column separated by |||\n";
    }
    # keep the phrase pair only if its score is at or above the threshold
    my @scores = split /\s+/, $columns[2];
    if ($scores[3] >= $min) {
        print FILTEREDTABLE $line . "\n";
    }
}
close(PHRASETABLE);
close(FILTEREDTABLE);

From: Matt Post p...@cs.jhu.edu Sent: Wednesday, June 17, 2015 5:25 PM To: Read, James C Cc: Marcin Junczys-Dowmunt; moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses I think you are misunderstanding how decoding works. The highest-weighted translation of each source phrase is not necessarily the one with the best BLEU score. This is why the decoder retains many options, so that it can search among them (together with their reorderings). The LM is an important component in making these selections. Also, how did you weight the many probabilities attached to each phrase (to determine which was the most probable)? The tuning phase of decoding selects weights designed to optimize BLEU score. If you weighted them evenly, that is going to exacerbate this experiment.
matt On Jun 17, 2015, at 10:22 AM, Read, James C jcr...@essex.ac.uk wrote: All I did was break the link to the language model and then perform filtering. How is that a methodological mistake? How else would one test the efficacy of the TM in isolation? I remain convinced that this is undesirable behaviour and therefore a bug. James From: Marcin Junczys-Dowmunt junc...@amu.edu.pl Sent: Wednesday, June 17, 2015 5:12 PM To: Read, James C Cc: Arnold, Doug; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Hi James No, not at all. I would say that is expected behaviour. It's how search spaces and optimization work. If anything these are methodological mistakes on your side, sorry. You are doing weird things
Re: [Moses-support] Major bug found in Moses
I'm gonna try once more. This is what he said: the decoder's job is NOT to find the high quality translation The next time I have a panel of potential investors in front of me I'm gonna pass that line by them and see how it goes down. I stress the words HIGH QUALITY TRANSLATION. Please promise me that the next time you put in a bid for funding you will guarantee your prospective funders that under no circumstances will you attempt to design a system which searches for HIGH QUALITY TRANSLATION. James From: Matthias Huck mh...@inf.ed.ac.uk Sent: Friday, June 19, 2015 5:08 PM To: Read, James C Cc: Hieu Hoang; moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses Hi James, Yes, he just said that. The decoder's job is to find the hypothesis with the maximum model score. That's one reason why your work is flawed. You did not care at all whether your model score correlates with BLEU or not. Cheers, Matthias On Fri, 2015-06-19 at 13:24 +, Read, James C wrote: I quote: the decoder's job is NOT to find the high quality translation Did you REALLY just say that? James From: Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 9:00 PM To: Read, James C Cc: Kenneth Heafield; moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses the decoder's job is NOT to find the high quality translation (as measured by BLEU). Its job is to find translations with high model score. You need the tuning to make sure high quality translation correlates with high model score. If you don't tune, it's pot luck what quality you get.
You should tune with the features you use Hieu Hoang Researcher New York University, Abu Dhabi http://www.hoang.co.uk/hieu On 17 June 2015 at 21:52, Read, James C jcr...@essex.ac.uk wrote: The analogy doesn't seem to be helping me understand just how exactly it is a desirable quality of a TM to a) completely break down if no LM is used (thank you for showing that such is not always the case) b) be dependent on a tuning step to help it find the higher scoring translations What you seem to be essentially saying is that the TM cannot find the higher scoring translations because I didn't pretune the system to do so. And I am supposed to accept that such is a desirable quality of a system whose very job is to find the higher scoring translations. Further, I am still unclear on which features you require a system to be tuned. At the very least it seems that I have discovered the selection process that tuning seems to be making up for in some unspecified and altogether opaque way. James From: Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 8:34 PM To: Read, James C; Kenneth Heafield; moses-support@mit.edu Cc: Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses 4 BLEU is nothing to sniff at :) I was answering Ken's tangential assertion that LMs are needed for tuning. I have some sympathy for you. You're looking at ways to improve translation by reducing the search space. I've bashed my head against this wall for a while as well without much success. However, as everyone is telling you, you haven't understood the role of tuning. Without tuning, you're pointing your lab rat to some random part of the search space, instead of away from the furry animal with whiskers and towards the yellow cheesy thing On 17/06/2015 20:45, Read, James C wrote: Doesn't look like the LM is contributing all that much then, does it?
James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 7:35 PM To: Kenneth Heafield; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On 17/06/2015 20:13, Kenneth Heafield wrote: I'll bite. The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large
Re: [Moses-support] Major bug found in Moses
You are not interested in discovering which phrase pairs contributed most to increases in BLEU scores so that we can bypass an ineffective search algorithm and construct a reliable phrase pair based rule based system with lower computational cost and higher likelihood of better results? I would like to see you stare investors in the face and make that claim. And manage to keep a straight face. James From: Lane Schwartz dowob...@gmail.com Sent: Wednesday, June 17, 2015 9:11 PM To: Read, James C Cc: Kenneth Heafield; moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses James, The underlying questions that you appear to be posing are these: When the search space is simplified by decoding without a language model, to what extent is the decoder able to identify hypotheses that have the best model score? Second, does filtering the phrase table in a particular way change the answer to this question? Third, how is the BLEU score (or any other metric) affected by these questions? These are valid questions. Unfortunately, as Kenneth, Amittai, and Hieu have pointed out, the experiment that you have designed does not provide you with all of what you need to be able to answer these questions. Recall that we don't really deal with probabilities when decoding. Yes, some of our features are trained as probability models. But the decoder searches using a weighted combination of scores. Lots of them. Even the phrase table is comprised of (at least) four distinct scores (phrase translation score and lexical translation score, in both directions). Decoding is a search problem. Specifically, it is a search through all possible translations to attempt to identify the one with the highest score according to this weighted combination of component scores. There are two problems then, that we have to deal with: First is this. 
Even if all we care about is the ultimate weighted combination of component scores, the search space is so vast (it's NP-complete) that we cannot hope to exhaustively search through it in a reasonable amount of time, even for sentences that are only of moderate length. This means that we have to resort to pruning. Second is this. We don't really care about finding solutions that are optimal according to the weighted combination of component scores. We care about getting translations that are fluent and mean the same thing as the original sentence. Since we don't know how to measure adequacy and fluency automatically, we resort to imperfect metrics that can be calculated automatically, like BLEU. This is fine, but it makes the search problem (which was already intractably large) even worse. The decoder only knows how to search by finding solutions that are good according to the weighted combination of component scores. If we want translations that are good according to some metric (like BLEU), then we need to attempt to formulate the weights such that solutions that are good according to the weighted combination of component scores are also good according to the desired metric (BLEU). The mechanism by which this is performed is tuning. Your decoder, by necessity, is operating using pruning. As such, your decoder is only operating in a confined region of the overall search space. The question then is, what region of the search space would you prefer to have your decoder operate in? If you choose not to run tuning, then you are choosing to have your decoder operate in an arbitrary region of the search space. If you choose to run tuning, then you are choosing to have your decoder operate in a region of the search space that you have reason to believe contains good translations according to your metric. Another way to think about this is as follows.
If you choose not to run tuning, and you obtain translations that are good according to the metric (BLEU), this is great, but it doesn't tell you much. If you obtain translations that are bad according to the metric, this is to be expected. What your experiments have shown is this: The complexity of the search space is greater when you use all available phrase pairs than it is when you pre-select only the best phrase pairs. When you choose to not tune and not use an LM, and then decode in the simpler space, you get better BLEU scores than when you decode in the more complex space. This is not a surprising result. It is in fact the expected result. Why is this the expected result? Two reasons. First, because search involves pruning. If you simplify the search space (by allowing the decoder to search using only the best phrase pairs), then it becomes easier for the decoder to find translations that are closer to optimal according to the weighted combination of scores, simply because the decoder is searching through a much smaller (and higher quality) sub-region of the search space. Second, because by choosing not to tune
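Lane's description of decoding as search over a weighted combination of component scores, narrowed by pruning, can be sketched in a few lines of Python. All numbers below are invented; the weights echo the untuned defaults quoted earlier in the thread, and a "beam" of size 2 stands in for the decoder's pruning:

```python
# Model score = weighted combination (dot product) of component scores.
def combined_score(weights, features):
    return sum(w * f for w, f in zip(weights, features))

weights = [0.2, 0.2, 0.2, 0.2, -1.0]  # four TM features + word penalty
hypotheses = {
    "hyp_a": [-0.9, -1.2, -0.7, -1.0, -4],  # short, probable
    "hyp_b": [-6.9, -0.1, -5.3, -0.2, -9],  # long, improbable
    "hyp_c": [-2.3, -2.3, -2.1, -1.9, -5],
}

# Pruning: keep only the top-2 hypotheses by model score.
beam = sorted(hypotheses,
              key=lambda h: combined_score(weights, hypotheses[h]),
              reverse=True)[:2]
print(beam)  # ['hyp_b', 'hyp_c']
```

Note that under these untuned weights the long, improbable hypothesis tops the beam; tuning exists precisely to move the weights so that high model score tracks translation quality instead.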
Re: [Moses-support] Major bug found in Moses
So we've gone from 1) Acknowledging that the search algorithm performs poorly with no LM, tuning or pruning despite the fact that the search space clearly contains high quality translations 2) to a public display of en-masse reluctance to acknowledge that such is an undesirable quality of the system 3) to resorting to censorship not only in the literature but also on a public mailing list rather than acknowledge point 2. And your conclusion is that after being a witness to such behaviour I would still have a desire to contribute to this field?!? Why YES. I would love to keep banging my head against a brick wall. I have no other preferred pastimes. James From: Lane Schwartz dowob...@gmail.com Sent: Friday, June 19, 2015 5:04 PM To: Read, James C Cc: Philipp Koehn; Burger, John D.; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses James, You may see the techniques that exist as outdated, wrong-headed, and inefficient. You have the right to hold that opinion. It may even be that history proves you right. Progress in science is made by people posing questions - often questions that challenge the status quo - and then doing experiments to answer those questions. However, it is incumbent upon you, the proponent of a new idea, to design good experiments to attempt to prove or disprove your new hypothesis. Dispassionately showing the relative merits and shortcomings of your technique against the existing state of the art is part of that process. I, along with numerous other people on this list, have attempted in good faith to answer your questions, and to provide you with our perspective based on our collective understanding of the problem. You, in turn, have responded belligerently. I suggest that you have a frank conversation with your academic advisor or other appropriate mentor regarding your future.
If you intend to pursue a successful career in science, academia, government, or industry, you would do well to reconsider the manner in which you interact with other people, especially people with whom you disagree. In the meantime, I would respectfully request that, until you learn how to interact respectfully with other adults, you refrain from posting to this mailing list. Sincerely, Lane Schwartz
Re: [Moses-support] Major bug found in Moses
German joke: Ein Autofahrer hört im Radio die Durchsage: Achtung! Achtung! Auf der N9 kommt Ihnen ein Geisterfahrer entgegen. Fahren Sie bitte ganz rechts und überholen Sie nicht! Der Autofahrer: Was heißt hier einer? Dutzende! Dutzende! (In English: A driver hears an announcement on the radio: Attention! Attention! A wrong-way driver is coming towards you on the N9. Please keep far right and do not overtake! The driver: What do you mean, one? Dozens! Dozens!) On 2015-06-19 16:12, Read, James C wrote: So we've gone from 1) Acknowledging that the search algorithm performs poorly with no LM, tuning or pruning despite the fact the search space clearly contains high quality translations 2) to a public display of en-masse reluctance to acknowledge that such is an undesirable quality of the system 3) to resorting to censorship not only in the literature but also on a public mailing list rather than acknowledge point 2. And your conclusion is that after being a witness to such behaviour I would still have a desire to contribute to this field?!? Why YES. I would love to keep banging my head against a brick wall. I have no other preferred pastimes. James
Re: [Moses-support] Major bug found in Moses
speaking of cobbling together a good translation from imperfect parts: google: A motorist heard on the radio the announcement: Caution Caution On the N9 you will encounter a ghost driver Please drive far right and do not overtake!.! The driver: What do you mean a dozens dozens?! microsoft: A motorist hears the announcement on the radio: 'warning! Caution! On the N9, a (s) satisfies you. Go quite right and not overtake! The car driver: what do you mean one? Dozens! Dozens! :) ~amittai On 6/19/15 10:19, Marcin Junczys-Dowmunt wrote: German joke: Ein Autofahrer hört im Radio die Durchsage: Achtung! Achtung! Auf der N9 kommt Ihnen ein Geisterfahrer entgegen. Fahren Sie bitte ganz rechts und überholen Sie nicht! Der Autofahrer: Was heißt hier einer? Dutzende! Dutzende! Wdniu 2015-06-19 16:12, Read, James C napisał(a): So we've gone from 1) Acknowledging that the search algorithm performs poorly with no LM, tuning or pruning despite the fact the search space clearly contains high quality translations 2) to a public display of en-masse reluctance to acknowledge that such is an undesirable quality of the system 3) to resorting to censorship not only in the literature but also on a public mailing list rather than acknowledge point 2. And your conclusion is that after being a witness to such behaviour I would still have a desire to contribute to this field?!? Why YES. I would love to keep banging my head against a brick wall. I have no other preferred past times. James *From:* Lane Schwartz dowob...@gmail.com *Sent:* Friday, June 19, 2015 5:04 PM *To:* Read, James C *Cc:* Philipp Koehn; Burger, John D.; moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses James, You may see the techniques that exist as outdated, wrong-headed, and inefficient. You have the right to hold that opinion. It may even be that history proves you right. 
Progress in science is made by people posing questions - often questions that challenge the status quo - and then doing experiments to answer those questions. However, it is incumbent upon you, the proponent of a new idea, to design good experiments to attempt to prove or disprove your new hypothesis. Dispassionately showing the relative merits and shortcomings of your technique with the existing state of the art is part of that process. I, along with numerous other people on this list, have attempted in good faith to answer your questions, and to provide you with our perspective based on our collective understanding of the problem. You, in turn, have responded belligerently. I suggest that you have a frank conversation with your academic advisor or other appropriate mentor regarding your future. If you intend to pursue a successful career in science, academia, government, or industry, you would do well to reconsider the manner in which you interact with other people, especially people with whom you disagree. In the meantime, I would respectfully request that until you learn how to respectfully interact with other adults that you refrain from posting to this mailing list. Sincerely, Lane Schwartz On Fri, Jun 19, 2015 at 8:45 AM, Read, James C jcr...@essex.ac.uk mailto:jcr...@essex.ac.uk wrote: According to your book which I have on my desk the job of the TM is to model the most likely translations and the job of the decoder is to intelligently search the space of translations to find the most likely one/s (I'm paraphrasing of course). Would you like to retract that position and republish a next edition of your book which openly states that Moses when used with no LM or tuning or pruning can and should be expected to perform very poorly and select only the least likely translations? 
Don't you in the slightest find it worrying that at least 90% of your code base could be thrown out of the window and high scoring results can be obtained with a simple phrase-pair-based, rule-based system? Which would you prefer? Would you prefer to consume computational resources calculating probabilities or get straight to the answer with simple logic and low computational requirements? BE HONEST! James *From:* moses-support-boun...@mit.edu mailto:moses-support-boun...@mit.edu moses-support-boun...@mit.edu mailto:moses-support-boun...@mit.edu on behalf of Philipp Koehn p...@jhu.edu mailto:p...@jhu.edu *Sent:* Thursday, June 18, 2015 9:39 PM *To:* Burger, John D. *Cc:* moses-support@mit.edu mailto:moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses Hi, I am a great fan of open source software, but there is a danger to view its inner workings as a black box - which leads to the strange
Re: [Moses-support] Major bug found in Moses
* [i'm] the guy that says. Well here's a stroke of genius. * a public display of en-masse reluctance to acknowledge that such is an undesirable quality of the system ? * resorting to censorship not only in the literature but also on a public mailing list rather than acknowledge point 2 ? heh -- i was right the first time: On 6/17/15 13:20, amittai axelrod wrote: also, your argument could be easily mis-interpreted as this behavior is unexpected to me, ergo this is unexpected behavior, and that will unfortunately bias the listener against you, as that is the preferred argument structure of conspiracy theorists. see also: https://en.wikipedia.org/wiki/Crank_(person)#Common_characteristics_of_cranks if you're ever at a conference, say hi. until then, well, you do you. ~amittai On 6/19/15 10:12, Read, James C wrote: So we've gone from 1) Acknowledging that the search algorithm performs poorly with no LM, tuning or pruning despite the fact the search space clearly contains high quality translations 2) to a public display of en-masse reluctance to acknowledge that such is an undesirable quality of the system 3) to resorting to censorship not only in the literature but also on a public mailing list rather than acknowledge point 2. And your conclusion is that after being a witness to such behaviour I would still have a desire to contribute to this field?!? Why YES. I would love to keep banging my head against a brick wall. I have no other preferred past times. James *From:* Lane Schwartz dowob...@gmail.com *Sent:* Friday, June 19, 2015 5:04 PM *To:* Read, James C *Cc:* Philipp Koehn; Burger, John D.; moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses James, You may see the techniques that exist as outdated, wrong-headed, and inefficient. You have the right to hold that opinion. It may even be that history proves you right. 
Progress in science is made by people posing questions - often questions that challenge the status quo - and then doing experiments to answer those questions. However, it is incumbent upon you, the proponent of a new idea, to design good experiments to attempt to prove or disprove your new hypothesis. Dispassionately showing the relative merits and shortcomings of your technique with the existing state of the art is part of that process. I, along with numerous other people on this list, have attempted in good faith to answer your questions, and to provide you with our perspective based on our collective understanding of the problem. You, in turn, have responded belligerently. I suggest that you have a frank conversation with your academic advisor or other appropriate mentor regarding your future. If you intend to pursue a successful career in science, academia, government, or industry, you would do well to reconsider the manner in which you interact with other people, especially people with whom you disagree. In the meantime, I would respectfully request that until you learn how to respectfully interact with other adults that you refrain from posting to this mailing list. Sincerely, Lane Schwartz On Fri, Jun 19, 2015 at 8:45 AM, Read, James C jcr...@essex.ac.uk mailto:jcr...@essex.ac.uk wrote: According to your book which I have on my desk the job of the TM is to model the most likely translations and the job of the decoder is to intelligently search the space of translations to find the most likely one/s (I'm paraphrasing of course). Would you like to retract that position and republish a next edition of your book which openly states that Moses when used with no LM or tuning or pruning can and should be expected to perform very poorly and select only the least likely translations? 
Don't you in the slightest find it worrying that at least 90% of your code base could be thrown out of the window and high scoring results can be obtained with a simple phrase-pair-based, rule-based system? Which would you prefer? Would you prefer to consume computational resources calculating probabilities or get straight to the answer with simple logic and low computational requirements? BE HONEST! James *From:* moses-support-boun...@mit.edu mailto:moses-support-boun...@mit.edu moses-support-boun...@mit.edu mailto:moses-support-boun...@mit.edu on behalf of Philipp Koehn p...@jhu.edu mailto:p...@jhu.edu *Sent:* Thursday, June 18, 2015 9:39 PM *To:* Burger, John D. *Cc:* moses-support@mit.edu mailto:moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses Hi, I am a great fan of open source software, but there is a danger to view its inner workings as a black box - which leads
Re: [Moses-support] Major bug found in Moses
James, 1) Acknowledging that the search algorithm performs poorly with no LM, tuning or pruning despite the fact the search space clearly contains high quality translations Yes. We all acknowledge this. If you have a better technique, that's great. Show that it's better. Your paper does not do so. 2) to a public display of en-masse reluctance to acknowledge that such is an undesirable quality of the system Yes, this is undesirable. If you have a better technique, that's great. Show that it's better. Your paper does not do so. 3) to resorting to censorship not only in the literature but also on a public mailing list rather than acknowledge point 2. No one is trying to censor you in the literature. You wrote a paper that got rejected. Lots of papers get rejected. Lots of GOOD papers get rejected. The fact that yours got rejected does not mean that you're being censored. No one is trying to censor you on this list. We are simply requesting that you conduct yourself like a well-mannered adult engaged in scientific research. By the way, your frequent mentions of investors are very much a non sequitur. You may be looking for investors, and that's fine if you are. You may want to keep in mind that not everyone is. Many of us are interested in this as a field of scientific enquiry. Lane ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
That's actually presentation-worthy material :) I have to note that down somewhere. On 2015-06-19 16:23, amittai axelrod wrote: speaking of cobbling together a good translation from imperfect parts: google: A motorist heard on the radio the announcement: Caution Caution On the N9 you will encounter a ghost driver Please drive far right and do not overtake!.! The driver: What do you mean a dozens dozens?! microsoft: A motorist hears the announcement on the radio: 'warning! Caution! On the N9, a (s) satisfies you. Go quite right and not overtake! The car driver: what do you mean one? Dozens! Dozens! :) ~amittai On 6/19/15 10:19, Marcin Junczys-Dowmunt wrote: German joke: Ein Autofahrer hört im Radio die Durchsage: Achtung! Achtung! Auf der N9 kommt Ihnen ein Geisterfahrer entgegen. Fahren Sie bitte ganz rechts und überholen Sie nicht! Der Autofahrer: Was heißt hier einer? Dutzende! Dutzende! On 2015-06-19 16:12, Read, James C wrote: ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
If I'm ever at a conference I'll come and introduce myself right after you present to all those present that: 1) A well designed search algorithm should select low quality translations despite the fact that the search space contains much higher quality translations. I can't deal with this level of denial. James From: amittai axelrod amit...@umiacs.umd.edu Sent: Friday, June 19, 2015 5:33 PM To: Read, James C; Lane Schwartz Cc: moses-support@mit.edu; Philipp Koehn Subject: Re: [Moses-support] Major bug found in Moses * [i'm] the guy that says. Well here's a stroke of genius. * a public display of en-masse reluctance to acknowledge that such is an undesirable quality of the system ? * resorting to censorship not only in the literature but also on a public mailing list rather than acknowledge point 2 ? heh -- i was right the first time: On 6/17/15 13:20, amittai axelrod wrote: also, your argument could be easily mis-interpreted as this behavior is unexpected to me, ergo this is unexpected behavior, and that will unfortunately bias the listener against you, as that is the preferred argument structure of conspiracy theorists. see also: https://en.wikipedia.org/wiki/Crank_(person)#Common_characteristics_of_cranks if you're ever at a conference, say hi. until then, well, you do you. ~amittai On 6/19/15 10:12, Read, James C wrote: So we've gone from 1) Acknowledging that the search algorithm performs poorly with no LM, tuning or pruning despite the fact the search space clearly contains high quality translations 2) to a public display of en-masse reluctance to acknowledge that such is an undesirable quality of the system 3) to resorting to censorship not only in the literature but also on a public mailing list rather than acknowledge point 2. And your conclusion is that after being a witness to such behaviour I would still have a desire to contribute to this field?!? Why YES. I would love to keep banging my head against a brick wall. I have no other preferred past times. 
James *From:* Lane Schwartz dowob...@gmail.com *Sent:* Friday, June 19, 2015 5:04 PM *To:* Read, James C *Cc:* Philipp Koehn; Burger, John D.; moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses James, You may see the techniques that exist as outdated, wrong-headed, and inefficient. You have the right to hold that opinion. It may even be that history proves you right. Progress in science is made by people posing questions - often questions that challenge the status quo - and then doing experiments to answer those questions. However, it is incumbent upon you, the proponent of a new idea, to design good experiments to attempt to prove or disprove your new hypothesis. Dispassionately showing the relative merits and shortcomings of your technique with the existing state of the art is part of that process. I, along with numerous other people on this list, have attempted in good faith to answer your questions, and to provide you with our perspective based on our collective understanding of the problem. You, in turn, have responded belligerently. I suggest that you have a frank conversation with your academic advisor or other appropriate mentor regarding your future. If you intend to pursue a successful career in science, academia, government, or industry, you would do well to reconsider the manner in which you interact with other people, especially people with whom you disagree. In the meantime, I would respectfully request that until you learn how to respectfully interact with other adults that you refrain from posting to this mailing list. Sincerely, Lane Schwartz On Fri, Jun 19, 2015 at 8:45 AM, Read, James C jcr...@essex.ac.uk mailto:jcr...@essex.ac.uk wrote: According to your book which I have on my desk the job of the TM is to model the most likely translations and the job of the decoder is to intelligently search the space of translations to find the most likely one/s (I'm paraphrasing of course). 
Would you like to retract that position and republish a next edition of your book which openly states that Moses when used with no LM or tuning or pruning can and should be expected to perform very poorly and select only the least likely translations? Don't you in the slightest find it worrying that at least 90% of your code base could be thrown out of the window and high scoring results can be obtained with a simple phrase-pair-based, rule-based system? Which would you prefer? Would you prefer to consume computational resources calculating probabilities or get straight to the answer with simple logic and low computational requirements? BE HONEST! James
Re: [Moses-support] Major bug found in Moses
On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.uk wrote: What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. That is the expected behavior. What I have tried to tell you, and what other people have tried to tell you, is that translations with high model scores are not necessarily good translations. We all want our models to be such that high model scores correspond to good translations, and that low model scores correspond with bad translations. But unfortunately, our models do not innately have this characteristic. We all know this. We also know a good way to deal with this shortcoming, namely tuning. Tuning is the process by which we attempt to ensure that high model scores correspond to high quality translations, and that low model scores correspond to low quality translations. If you can design models that naturally correspond with translation quality without tuning, that's great. If you can do that, you've got a great shot at winning a Best Paper award at ACL. In the meantime, you may want to consider an apology for your rude behavior and unprofessional attitude. Goodbye. Lane ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
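Lane's description of tuning can be made concrete with a toy linear model. The sketch below is purely illustrative (the candidate strings, feature values, and weights are invented, not Moses internals): the decoder returns the hypothesis with the highest weighted feature sum, so the same two candidates can swap ranks depending on whether the weights have been tuned toward translation quality.

```python
def model_score(features, weights):
    """Linear model score: dot product of feature values and weights."""
    return sum(w * f for w, f in zip(weights, features))

# Two candidate translations with hypothetical log-feature values:
# [log p(f|e), log p(e|f), word penalty]
candidates = {
    "good translation": [-2.0, -1.0, -5.0],
    "bad translation":  [-0.5, -4.0, -2.0],
}

untuned = [1.0, 1.0, 1.0]   # default: every feature weighted equally
tuned   = [0.2, 1.0, 0.1]   # e.g. weights found by tuning against BLEU

# The decoder's argmax is the same in both cases; only the objective changes.
best_untuned = max(candidates, key=lambda c: model_score(candidates[c], untuned))
best_tuned   = max(candidates, key=lambda c: model_score(candidates[c], tuned))
```

With the untuned weights the "bad" candidate scores higher (-6.5 vs -8.0), while the tuned weights rank the "good" one first (-1.9 vs -4.3). That is the whole point being made in this thread: an untuned system can confidently return a poor translation because the search works correctly on an objective that is not yet aligned with quality.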
Re: [Moses-support] Major bug found in Moses
I did not claim that the paper does so. The weakness has been exposed. And the way it was exposed suggests that certain classes of phrase pairs contribute more to BLEU scores than others. We now have an empirical basis for exploring new avenues that exploit this observation. I have no problem with papers being rejected. Clearly only a certain number can be published in any particular setting. What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. I am happy that you seem to be the first person to acknowledge that this is undesirable behaviour. I feel that we are finally making some progress. Now if more people could acknowledge that there is a problem perhaps we could set about improving the situation. James From: Lane Schwartz dowob...@gmail.com Sent: Friday, June 19, 2015 6:10 PM To: Read, James C Cc: Philipp Koehn; Burger, John D.; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses James, 1) Acknowledging that the search algorithm performs poorly with no LM, tuning or pruning despite the fact the search space clearly contains high quality translations Yes. We all acknowledge this. If you have a better technique, that's great. Show that it's better. Your paper does not do so. 2) to a public display of en-masse reluctance to acknowledge that such is an undesirable quality of the system Yes, this is undesirable. If you have a better technique, that's great. Show that it's better. Your paper does not do so. 3) to resorting to censorship not only in the literature but also on a public mailing list rather than acknowledge point 2. No one is trying to censor you in the literature. You wrote a paper that got rejected. Lots of papers get rejected. Lots of GOOD papers get rejected. The fact that yours got rejected does not mean that you're being censored. No one is trying to censor you on this list. 
We are simply requesting that you conduct yourself like a well-mannered adult engaged in scientific research. By the way, your frequent mentions of investors are very much a non sequitur. You may be looking for investors, and that's fine if you are. You may want to keep in mind that not everyone is. Many of us are interested in this as a field of scientific enquiry. Lane ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? You're telling me that redesigning the search algorithm to prefer higher scoring phrase pairs is all we need to do to get a best paper at ACL? James From: Lane Schwartz dowob...@gmail.com Sent: Friday, June 19, 2015 7:40 PM To: Read, James C Cc: Philipp Koehn; Burger, John D.; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.ukmailto:jcr...@essex.ac.uk wrote: What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. That is the expected behavior. What I have tried to tell you, and what other people have tried to tell you, is that translations with high model scores are not necessarily good translations. We all want our models to be such that high model scores correspond to good translations, and that low model scores correspond with bad translations. But unfortunately, our models do not innately have this characteristic. We all know this. We also know a good way to deal with this shortcoming, namely tuning. Tuning is the process by which we attempt to ensure that high model scores correspond to high quality translations, and that low model scores correspond to low quality translations. If you can design models that naturally correspond with translation quality without tuning, that's great. If you can do that, you've got a great shot at winning a Best Paper award at ACL. In the meantime, you may want to consider an apology for your rude behavior and unprofessional attitude. Goodbye. 
Lane ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
if we don't understand the problem, how can we possibly fix it? all the relevant code is open source. go for it! ~amittai On 6/19/15 12:49, Read, James C wrote: So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? You're telling me that redesigning the search algorithm to prefer higher scoring phrase pairs is all we need to do to get a best paper at ACL? James *From:* Lane Schwartz dowob...@gmail.com *Sent:* Friday, June 19, 2015 7:40 PM *To:* Read, James C *Cc:* Philipp Koehn; Burger, John D.; moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.uk mailto:jcr...@essex.ac.uk wrote: What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. That is the expected behavior. What I have tried to tell you, and what other people have tried to tell you, is that translations with high model scores are not necessarily good translations. We all want our models to be such that high model scores correspond to good translations, and that low model scores correspond with bad translations. But unfortunately, our models do not innately have this characteristic. We all know this. We also know a good way to deal with this shortcoming, namely tuning. Tuning is the process by which we attempt to ensure that high model scores correspond to high quality translations, and that low model scores correspond to low quality translations. If you can design models that naturally correspond with translation quality without tuning, that's great. If you can do that, you've got a great shot at winning a Best Paper award at ACL. 
In the meantime, you may want to consider an apology for your rude behavior and unprofessional attitude. Goodbye. Lane ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
P.S. Have a good weekend everybody. Be back in action in a couple of days. James From: Read, James C Sent: Friday, June 19, 2015 7:49 PM To: Lane Schwartz Cc: Philipp Koehn; Burger, John D.; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? You're telling me that redesigning the search algorithm to prefer higher scoring phrase pairs is all we need to do to get a best paper at ACL? James From: Lane Schwartz dowob...@gmail.com Sent: Friday, June 19, 2015 7:40 PM To: Read, James C Cc: Philipp Koehn; Burger, John D.; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.ukmailto:jcr...@essex.ac.uk wrote: What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. That is the expected behavior. What I have tried to tell you, and what other people have tried to tell you, is that translations with high model scores are not necessarily good translations. We all want our models to be such that high model scores correspond to good translations, and that low model scores correspond with bad translations. But unfortunately, our models do not innately have this characteristic. We all know this. We also know a good way to deal with this shortcoming, namely tuning. Tuning is the process by which we attempt to ensure that high model scores correspond to high quality translations, and that low model scores correspond to low quality translations. If you can design models that naturally correspond with translation quality without tuning, that's great. 
If you can do that, you've got a great shot at winning a Best Paper award at ACL. In the meantime, you may want to consider an apology for your rude behavior and unprofessional attitude. Goodbye. Lane ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Hi Rico, since you are at it, some pointers to the more advanced pruning techniques that do perform better, please :) On 19.06.2015 19:25, Rico Sennrich wrote: [sorry for the garbled message before] you are right. The idea is pretty obvious. It roughly corresponds to 'Histogram pruning' in this paper: Zens, R., Stanton, D., Xu, P. (2012). A Systematic Comparison of Phrase Table Pruning Techniques. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 972-983. The idea has been described in the literature before that (for instance, Johnson et al. (2007) only use the top 30 phrase pairs per source phrase), and may have been used in practice for even longer. If you read the paper above, you will find that histogram pruning does not improve translation quality on a state-of-the-art SMT system, and performs poorly compared to more advanced pruning techniques. On 19.06.2015 17:49, Read, James C. wrote: So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Marcin Junczys-Dowmunt junczys@... writes: Hi Rico, since you are at it, some pointers to the more advanced pruning techniques that do perform better, please :) On 19.06.2015 19:25, Rico Sennrich wrote: [sorry for the garbled message before] you are right. The idea is pretty obvious. It roughly corresponds to 'Histogram pruning' in this paper: Zens, R., Stanton, D., Xu, P. (2012). A Systematic Comparison of Phrase Table Pruning Techniques. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 972-983. The idea has been described in the literature before that (for instance, Johnson et al. (2007) only use the top 30 phrase pairs per source phrase), and may have been used in practice for even longer. If you read the paper above, you will find that histogram pruning does not improve translation quality on a state-of-the-art SMT system, and performs poorly compared to more advanced pruning techniques. the Zens et al. (2012) paper has a nice overview. significance pruning and relative entropy pruning are both effective - you are not guaranteed improvements over the unpruned system (although Johnson (2007) does report improvements), but both allow you to reduce the size of your models substantially with little loss in quality. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
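The histogram pruning being discussed (keep only the top-k phrase pairs per source phrase, e.g. Johnson et al.'s top-30 variant) can be sketched as a standalone filter. The snippet below is an illustration, not the actual Moses or Zens et al. tooling: it assumes a Moses-style phrase table line format `src ||| tgt ||| s1 s2 s3 s4` and ranks candidates by one chosen score column (here index 2, conventionally the direct phrase probability p(e|f); adjust for your table layout).

```python
from collections import defaultdict

def histogram_prune(lines, k=30, score_field=2):
    """Keep only the top-k target phrases for each source phrase.

    lines: Moses-style phrase table entries 'src ||| tgt ||| s1 s2 s3 s4'.
    score_field: index into the score column used for ranking.
    """
    table = defaultdict(list)
    for line in lines:
        src, tgt, scores = [f.strip() for f in line.split("|||")[:3]]
        score = float(scores.split()[score_field])
        table[src].append((score, tgt, line))
    pruned = []
    for src, pairs in table.items():
        # Sort candidates for this source phrase by score, best first.
        pairs.sort(key=lambda p: p[0], reverse=True)
        pruned.extend(line for _, _, line in pairs[:k])
    return pruned

# Toy table: three candidates for one source phrase (invented scores).
table = [
    "das Haus ||| the house ||| 0.1 0.1 0.7 0.6",
    "das Haus ||| house ||| 0.2 0.2 0.2 0.3",
    "das Haus ||| the building ||| 0.3 0.3 0.1 0.1",
]
kept = histogram_prune(table, k=2)  # drops the lowest-scoring pair
```

As Rico notes, this kind of pruning mainly buys a smaller model; on a tuned state-of-the-art system it does not by itself improve quality, which is why significance or relative entropy pruning are preferred.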
Re: [Moses-support] Major bug found in Moses
On that interesting idea that Moses should be naturally good at translating things, just for general considerations. Since some said this thread has educational value I would like to share something that might not be obvious due to the SMT-biased posts here. Moses is also the _leading_ tool for automatic grammatical error correction (GEC) right now. The first- and third-placed systems of the CoNLL 2014 shared task were based on Moses. By now I have results that surpass the CoNLL results by far by adding some specialized features to Moses (which thanks to Hieu is very easy). It even gets good results for GEC when you do crazy things like inverting the TM (so it should actually make the input worse) provided you tune on the correct metric and for the correct task. The interaction of all the other features after tuning makes that possible. So, if anything, Moses is just a very flexible text-rewriting tool. Tuning (and data) turns it into a translator, GEC tool, POS-tagger, Chunker, Semantic Tagger etc. On 19.06.2015 18:40, Lane Schwartz wrote: On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.uk mailto:jcr...@essex.ac.uk wrote: What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. That is the expected behavior. What I have tried to tell you, and what other people have tried to tell you, is that translations with high model scores are not necessarily good translations. We all want our models to be such that high model scores correspond to good translations, and that low model scores correspond with bad translations. But unfortunately, our models do not innately have this characteristic. We all know this. We also know a good way to deal with this shortcoming, namely tuning. 
Tuning is the process by which we attempt to ensure that high model scores correspond to high quality translations, and that low model scores correspond to low quality translations. If you can design models that naturally correspond with translation quality without tuning, that's great. If you can do that, you've got a great shot at winning a Best Paper award at ACL. In the meantime, you may want to consider an apology for your rude behavior and unprofessional attitude. Goodbye. Lane ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Read, James C jcread@... writes: So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? you are right. The idea is pretty obvious. It roughly corresponds to 'Histogram pruning' in this paper: Zens, R., Stanton, D., Xu, P. (2012). A Systematic Comparison of Phrase Table Pruning Techniques. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 972-983. The idea has been described in the literature before that (for instance, Johnson et al. (2007) only use the top 30 phrase pairs per source phrase), and may have been used in practice for even longer. If you read the paper above, you will find that histogram pruning does not improve translation quality on a state-of-the-art SMT system, and performs poorly compared to more advanced pruning techniques. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
[sorry for the garbled message before] you are right. The idea is pretty obvious. It roughly corresponds to 'Histogram pruning' in this paper: Zens, R., Stanton, D., Xu, P. (2012). A Systematic Comparison of Phrase Table Pruning Techniques. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 972-983. The idea has been described in the literature before that (for instance, Johnson et al. (2007) only use the top 30 phrase pairs per source phrase), and may have been used in practice for even longer. If you read the paper above, you will find that histogram pruning does not improve translation quality on a state-of-the-art SMT system, and performs poorly compared to more advanced pruning techniques. On 19.06.2015 17:49, Read, James C. wrote: So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Ah OK, I misunderstood, I thought you were talking about more advanced pruning techniques compared to the significance method from Johnson et al. while you only referred to the 30-best variant. Cheers, Marcin On 19.06.2015 19:35, Rico Sennrich wrote: Marcin Junczys-Dowmunt junczys@... writes: Hi Rico, since you are at it, some pointers to the more advanced pruning techniques that do perform better, please :) On 19.06.2015 19:25, Rico Sennrich wrote: [sorry for the garbled message before] you are right. The idea is pretty obvious. It roughly corresponds to 'Histogram pruning' in this paper: Zens, R., Stanton, D., Xu, P. (2012). A Systematic Comparison of Phrase Table Pruning Technique. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 972-983. The idea has been described in the literature before that (for instance, Johnson et al. (2007) only use the top 30 phrase pairs per source phrase), and may have been used in practice for even longer. If you read the paper above, you will find that histogram pruning does not improve translation quality on a state-of-the-art SMT system, and performs poorly compared to more advanced pruning techniques. the Zens et al. (2012) paper has a nice overview. significance pruning and relative entropy pruning are both effective - you are not guaranteed improvements over the unpruned system (although Johnson (2007) does report improvements), but both allow you to reduce the size of your models substantially with little loss in quality. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Hi James, Well, it's pretty straightforward: The decoder's job is to find the hypothesis with the maximum model score. That's why everybody builds models which assign high model score to high-quality translations. Unfortunately, you missed this last point in your own work. Cheers, Matthias On Fri, 2015-06-19 at 14:15 +, Read, James C wrote: I'm gonna try once more. This is what he said: the decoder's job is NOT to find the high quality translation The next time I have a panel of potential investors in front of me I'm gonna pass that line by them and see how it goes down. I stress the words HIGH QUALITY TRANSLATION. Please promise me that the next time you put in a bid for funding you will guarantee your prospective funders that under no circumstances will you attempt to design a system which searches for HIGH QUALITY TRANSLATION. James From: Matthias Huck mh...@inf.ed.ac.uk Sent: Friday, June 19, 2015 5:08 PM To: Read, James C Cc: Hieu Hoang; moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses Hi James, Yes, he just said that. The decoder's job is to find the hypothesis with the maximum model score. That's one reason why your work is flawed. You did not care at all whether your model score correlates with BLEU or not. Cheers, Matthias On Fri, 2015-06-19 at 13:24 +, Read, James C wrote: I quote: the decoder's job is NOT to find the high quality translation Did you REALLY just say that? James __ From: Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 9:00 PM To: Read, James C Cc: Kenneth Heafield; moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses the decoder's job is NOT to find the high quality translation (as measured by bleu). Its job is to find translations with high model score. you need the tuning to make sure high quality translation correlates with high model score. If you don't tune, it's pot luck what quality you get.
You should tune with the features you use Hieu Hoang Researcher New York University, Abu Dhabi http://www.hoang.co.uk/hieu On 17 June 2015 at 21:52, Read, James C jcr...@essex.ac.uk wrote: The analogy doesn't seem to be helping me understand just how exactly it is a desirable quality of a TM to a) completely break down if no LM is used (thank you for showing that such is not always the case) b) be dependent on a tuning step to help it find the higher scoring translations What you seem to be essentially saying is that the TM cannot find the higher scoring translations because I didn't pretune the system to do so. And I am supposed to accept that such is a desirable quality of a system whose very job is to find the higher scoring translations. Further, I am still unclear which features you require a system to be tuned on. At the very least it seems that I have discovered the selection process that tuning seems to be making up for in some unspecified and altogether opaque way. James From: Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 8:34 PM To: Read, James C; Kenneth Heafield; moses-support@mit.edu Cc: Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses 4 BLEU is nothing to sniff at :) I was answering Ken's tangent aspersion that LMs are needed for tuning. I have some sympathy for you. You're looking at ways to improve translation by reducing the search space. I've bashed my head against this wall for a while as well without much success. However, as everyone is telling you, you haven't understood the role of tuning. Without tuning, you're pointing your lab rat to some random part of the search space, instead of away from the furry animal with whiskers and towards the yellow cheesy thing On 17/06/2015 20:45, Read, James C wrote: Doesn't look like the LM is contributing all that much then does it?
James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 7:35 PM To: Kenneth Heafield; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On 17/06/2015 20:13, Kenneth Heafield wrote
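The point Matthias and Hieu keep returning to is the log-linear model: the decoder maximizes a weighted sum of log feature values, not translation quality, so with untuned weights its ranking need not correlate with BLEU. A toy illustration with invented feature names and numbers (not Moses's actual API):

```python
import math

def model_score(features, weights):
    """Log-linear model score: sum_i w_i * log f_i(hypothesis)."""
    return sum(weights[name] * math.log(features[name]) for name in features)

# Two hypothetical candidate translations (all values made up).
fluent  = {"p_e_given_f": 0.3, "p_f_given_e": 0.2, "p_lm": 0.10}
garbled = {"p_e_given_f": 0.7, "p_f_given_e": 0.7, "p_lm": 0.02}

# Untuned, uniform default weights vs. hypothetical tuned weights
# that put more mass on the language model feature.
default_weights = {"p_e_given_f": 1.0, "p_f_given_e": 1.0, "p_lm": 1.0}
tuned_weights   = {"p_e_given_f": 0.4, "p_f_given_e": 0.1, "p_lm": 2.0}

# Under uniform weights the candidate with fat TM scores wins despite
# its terrible LM score; the LM-heavy weights flip the ranking. Tuning
# is the search for weights that make such rankings track quality.
```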
Re: [Moses-support] Major bug found in Moses
The analogy doesn't seem to be helping me understand just how exactly it is a desirable quality of a TM to a) completely break down if no LM is used (thank you for showing that such is not always the case) b) be dependent on a tuning step to help it find the higher scoring translations What you seem to be essentially saying is that the TM cannot find the higher scoring translations because I didn't pretune the system to do so. And I am supposed to accept that such is a desirable quality of a system whose very job is to find the higher scoring translations. Further, I am still unclear which features you require a system to be tuned on. At the very least it seems that I have discovered the selection process that tuning seems to be making up for in some unspecified and altogether opaque way. James From: Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 8:34 PM To: Read, James C; Kenneth Heafield; moses-support@mit.edu Cc: Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses 4 BLEU is nothing to sniff at :) I was answering Ken's tangent aspersion that LMs are needed for tuning. I have some sympathy for you. You're looking at ways to improve translation by reducing the search space. I've bashed my head against this wall for a while as well without much success. However, as everyone is telling you, you haven't understood the role of tuning. Without tuning, you're pointing your lab rat to some random part of the search space, instead of away from the furry animal with whiskers and towards the yellow cheesy thing On 17/06/2015 20:45, Read, James C wrote: Doesn't look like the LM is contributing all that much then does it? James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 7:35 PM To: Kenneth Heafield; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On 17/06/2015 20:13, Kenneth Heafield wrote: I'll bite.
The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large space of translations. The filter looks at one feature p(target | source). So now you've constrained the awful untuned model to a slightly better region of the search space. In other words, all you've done is a poor approximation to manually setting the weight to 1.0 on p(target | source) and the rest to 0. The problem isn't that you are running without a language model (though we generally do not care what happens without one). The problem is that you did not tune the feature weights. Moreover, as Marcin is pointing out, I wouldn't necessarily expect tuning to work without an LM. Tuning does work without a LM. The results aren't half bad. fr-en europarl (pb): with LM: 22.84 retuned without LM: 18.33 On 06/17/15 11:56, Read, James C wrote: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 5:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race.
I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. your default model will thus (roughly) implement the following model: p(e|f) = p(e|f)*p(f|e) which is obviously wrong, and will give you poor results. This is not a bug in the code, but a poor choice of models and weights. Standard steps in SMT (like tuning the model weights on a development set, and including a language model) will give you the desired results.
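Rico's point can be made concrete with two invented candidates: with equal default weights on both translation directions, the decoder effectively ranks candidates by the product p(e|f)*p(f|e), which can prefer a different candidate than p(e|f) alone. A sketch with hypothetical numbers:

```python
# Hypothetical candidates for one source phrase (values invented).
candidates = [
    {"target": "good", "p_e_f": 0.6, "p_f_e": 0.1},
    {"target": "odd",  "p_e_f": 0.3, "p_f_e": 0.9},
]

# Ranking by p(e|f) alone vs. the product both directions induce
# under equal default weights.
by_direct = max(candidates, key=lambda c: c["p_e_f"])
by_product = max(candidates, key=lambda c: c["p_e_f"] * c["p_f_e"])

# by_direct picks "good" (0.6), but the product picks "odd"
# (0.3 * 0.9 = 0.27 vs. 0.6 * 0.1 = 0.06).
```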
Re: [Moses-support] Major bug found in Moses
1) So if I've understood you correctly you are saying we have a system that is purposefully designed to perform poorly with a disabled LM and this is the proof that the LM is the most fundamental part. Any attempt to prove otherwise by, e.g. filtering the phrase table to help the dysfunctional search algorithm, does not constitute proof that the TM is the most fundamental component of the system and if designed correctly can perform just fine on its own but rather only evidence that the researcher is not using the system as intended (the intention being to break the TM to support the idea that the LM is the most fundamental part). 2) If you still feel that the LM is the most fundamental component I challenge you to disable the TM and perform LM only translations and see what kind of BLEU scores you get. In conclusion, I do hope that you don't feel that potential investors in MT systems lack the intelligence to see through these logical fallacies. Can we now just admit that the system is broken and get around to fixing it? James From: Marcin Junczys-Dowmunt junc...@amu.edu.pl Sent: Wednesday, June 17, 2015 5:29 PM To: Read, James C Cc: Arnold, Doug; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses To paint you a picture: Imagine you have a rat in a labyrinth (the labyrinth is the TM and the search space). That rat is quite good at finding the center of that labyrinth. Now you somehow disable that rat's sense of smell, sense of direction, and long-term and short-term memory (that's the LM). Can you expect the rat to find the center? Or will it just tumble around, bumping into walls and not find anything? That's what you did to the decoder when disabling the LM. Now you prune the TM. In the labyrinth that's like closing all the doors that would lead the rat away from the center. There are still a few corridors left, but they all point into the general direction of the point where the rat is supposed to go. Although it may never quite reach it.
Now you put that same handicapped rat into the labyrinth where all ways lead more or less to the center. Are you really surprised that the clueless rat finds the center nearly every time now? That's what happened. It's not a bug. The LM is probably the strongest feature in an MT system. If you take that away you see what happens. On 2015-06-17 16:22, Read, James C wrote: All I did was break the link to the language model and then perform filtering. How is that a methodological mistake? How else would one test the efficacy of the TM in isolation? I remain convinced that this is undesirable behaviour and therefore a bug. James From: Marcin Junczys-Dowmunt junc...@amu.edu.pl Sent: Wednesday, June 17, 2015 5:12 PM To: Read, James C Cc: Arnold, Doug; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Hi James No, not at all. I would say that is expected behaviour. It's how search spaces and optimization work. If anything these are methodological mistakes on your side, sorry. You are doing weird things to the decoder and then you are surprised to get weird results from it. On 2015-06-17 16:07, Read, James C wrote: So, do we agree that this is undesirable behaviour and therefore a bug? James From: Marcin Junczys-Dowmunt junc...@amu.edu.pl Sent: Wednesday, June 17, 2015 5:01 PM To: Read, James C Subject: Re: [Moses-support] Major bug found in Moses As I said. With an unpruned phrase table and a decoder that just optimizes some unreasonable set of weights all bets are off, so if you get very low BLEU scores there, it's not surprising. It's probably jumping around in a very weird search space. With a pruned phrase table you restrict the search space VERY strongly. Nearly everything that will be produced is a half-decent translation. So yes, I can imagine that would happen. Marcin On 2015-06-17 15:56, Read, James C wrote: You would expect an improvement of 37 BLEU points?
James From: Marcin Junczys-Dowmunt junc...@amu.edu.pl Sent: Wednesday, June 17, 2015 4:32 PM To: Read, James C Cc: Moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses Hi James, there are many more factors involved than just probability, for instance word penalties, phrase penalties etc. To be able to validate your own claim you would need to set weights for all those non-probabilities to zero. Otherwise there is no hope that Moses will produce anything similar to the most probable translation. And based on that there is no surprise that there may be different translations. A pruned phrase table will naturally produce less noise, so I would say the behaviour you describe is exactly what I would expect to happen. Best, Marcin On 2015-06-17 15:26, Read, James C wrote: Hi all
Re: [Moses-support] Major bug found in Moses
interesting result. Lane On Wed, Jun 17, 2015 at 11:24 AM, Read, James C jcr...@essex.ac.uk wrote: Which features would you like me to tune? The whole purpose of the exercise was to eliminate all variables except the TM and to keep constant those that could not be eliminated so that I could see which types of phrase pairs contribute most to increases in BLEU score in a TM only setup. Now you are saying I have to tune but tuning won't work without a LM. So how do you expect a researcher to be able to understand how well the TM component of the system is working if you are going to insist that I must include a LM for tuning to work. Clearly the system is broken. It is designed to work well with a LM and poorly without. When clearly good results can be obtained with a functional TM and well chosen phrase pairs. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Kenneth Heafield mo...@kheafield.com Sent: Wednesday, June 17, 2015 7:13 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses I'll bite. The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large space of translations. The filter looks at one feature p(target | source). So now you've constrained the awful untuned model to a slightly better region of the search space. In other words, all you've done is a poor approximation to manually setting the weight to 1.0 on p(target | source) and the rest to 0. The problem isn't that you are running without a language model (though we generally do not care what happens without one). The problem is that you did not tune the feature weights. 
Moreover, as Marcin is pointing out, I wouldn't necessarily expect tuning to work without an LM. On 06/17/15 11:56, Read, James C wrote: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 5:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. your default model will thus (roughly) implement the following model: p(e|f) = p(e|f)*p(f|e) which is obviously wrong, and will give you poor results. This is not a bug in the code, but a poor choice of models and weights. Standard steps in SMT (like tuning the model weights on a development set, and including a language model) will give you the desired results. -- When a place gets crowded enough to require ID's, social collapse is not far away. It is time to go elsewhere.
The best thing about space travel is that it made it possible to go elsewhere. -- R.A. Heinlein, Time Enough For Love
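The role of tuning that Kenneth and Lane describe can be caricatured as a search over weight vectors on a development set: pick the weights under which the decoder's argmax-by-model-score choice also does best on an external quality metric. Real tuning uses MERT or MIRA over n-best lists; this toy random search only illustrates the objective, and every name in it is hypothetical:

```python
import random

def tune(dev_set, score_fn, metric, dims, trials=200, seed=0):
    """Toy tuning loop: find weights under which the best-scoring
    hypothesis per sentence is also best by the external metric.

    dev_set: list of (hypotheses, reference) pairs, where each
    hypothesis is a feature vector; score_fn(h, w) is the model score.
    """
    rng = random.Random(seed)
    best_w, best_m = None, float("-inf")
    for _ in range(trials):
        w = [rng.uniform(0.0, 1.0) for _ in range(dims)]
        total = 0.0
        for hyps, ref in dev_set:
            # The "decoder": pick the hypothesis the model likes best.
            chosen = max(hyps, key=lambda h: score_fn(h, w))
            # Credit the weights by how good that pick really is.
            total += metric(chosen, ref)
        if total > best_m:
            best_w, best_m = w, total
    return best_w
```

Without this step, as the thread keeps repeating, the model-score ranking and the quality ranking are only accidentally related.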
Re: [Moses-support] Major bug found in Moses
Please note that in order for the baseline to be meaningful it has to also use no LM. So, naturally the scores are lower than those of baselines you are referring to. Regarding expectations. Are you seriously suggesting that we would expect the translation model to be incapable of finding higher scoring translations when not filtering out less likely phrase pairs? How high exactly would that rank on your desirable qualities of a TM list? James From: amittai axelrod amit...@umiacs.umd.edu Sent: Wednesday, June 17, 2015 8:20 PM To: Read, James C; Hieu Hoang; Kenneth Heafield; moses-support@mit.edu Cc: Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses hi -- you might not be aware, but your emails sound almost belligerently confrontational. i can see how you would be frustrated, but starting a conversation with i have found a major bug and then repeatedly saying that clearly everything is broken -- that may not be the best way to convince the few hundred people on the mailing list of the soundness of your approach. also, your argument could be easily mis-interpreted as this behavior is unexpected to me, ergo this is unexpected behavior, and that will unfortunately bias the listener against you, as that is the preferred argument structure of conspiracy theorists. at any rate, the system is designed to take a large number of phrase pairs and model scores and cobble them together into a translation. it does do that. it appears that you have identified a different way of doing that cobbling-together, one that uses far fewer models -- so far so good! however, from reading your paper, it seems that your baseline is completely unoptimized, so performance gains against it may not show up in the real world. as specific examples, Table 1 in your paper shows that your baseline French-English system score is 11.36, Spanish-English is 7.16, and German-English is 6.70 BLEU.
if you compare those baselines against published results in those languages from the previous few years, you will see that those scores are well off the mark. your position will be helped by showing results against a stronger, yet still basic, baseline. what happens if you compare your approach against a vanilla use of the Moses pipeline [this includes tuning]? cheers, ~amittai On 6/17/15 12:45, Read, James C wrote: Doesn't look like the LM is contributing all that much then does it? James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 7:35 PM To: Kenneth Heafield; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On 17/06/2015 20:13, Kenneth Heafield wrote: I'll bite. The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large space of translations. The filter looks at one feature p(target | source). So now you've constrained the awful untuned model to a slightly better region of the search space. In other words, all you've done is a poor approximation to manually setting the weight to 1.0 on p(target | source) and the rest to 0. The problem isn't that you are running without a language model (though we generally do not care what happens without one). The problem is that you did not tune the feature weights. Moreover, as Marcin is pointing out, I wouldn't necessarily expect tuning to work without an LM. Tuning does work without a LM. The results aren't half bad. 
fr-en europarl (pb): with LM: 22.84 retuned without LM: 18.33 On 06/17/15 11:56, Read, James C wrote: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 5:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. your default model will thus (roughly) implement the following
Re: [Moses-support] Major bug found in Moses
the decoder's job is NOT to find the high quality translation (as measured by bleu). Its job is to find translations with high model score. you need the tuning to make sure high quality translation correlates with high model score. If you don't tune, it's pot luck what quality you get. You should tune with the features you use Hieu Hoang Researcher New York University, Abu Dhabi http://www.hoang.co.uk/hieu On 17 June 2015 at 21:52, Read, James C jcr...@essex.ac.uk wrote: The analogy doesn't seem to be helping me understand just how exactly it is a desirable quality of a TM to a) completely break down if no LM is used (thank you for showing that such is not always the case) b) be dependent on a tuning step to help it find the higher scoring translations What you seem to be essentially saying is that the TM cannot find the higher scoring translations because I didn't pretune the system to do so. And I am supposed to accept that such is a desirable quality of a system whose very job is to find the higher scoring translations. Further, I am still unclear which features you require a system to be tuned on. At the very least it seems that I have discovered the selection process that tuning seems to be making up for in some unspecified and altogether opaque way. James From: Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 8:34 PM To: Read, James C; Kenneth Heafield; moses-support@mit.edu Cc: Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses 4 BLEU is nothing to sniff at :) I was answering Ken's tangent aspersion that LMs are needed for tuning. I have some sympathy for you. You're looking at ways to improve translation by reducing the search space. I've bashed my head against this wall for a while as well without much success. However, as everyone is telling you, you haven't understood the role of tuning.
Without tuning, you're pointing your lab rat to some random part of the search space, instead of away from the furry animal with whiskers and towards the yellow cheesy thing On 17/06/2015 20:45, Read, James C wrote: Doesn't look like the LM is contributing all that much then does it? James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 7:35 PM To: Kenneth Heafield; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On 17/06/2015 20:13, Kenneth Heafield wrote: I'll bite. The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large space of translations. The filter looks at one feature p(target | source). So now you've constrained the awful untuned model to a slightly better region of the search space. In other words, all you've done is a poor approximation to manually setting the weight to 1.0 on p(target | source) and the rest to 0. The problem isn't that you are running without a language model (though we generally do not care what happens without one). The problem is that you did not tune the feature weights. Moreover, as Marcin is pointing out, I wouldn't necessarily expect tuning to work without an LM. Tuning does work without a LM. The results aren't half bad. fr-en europarl (pb): with LM: 22.84 retuned without LM: 18.33 On 06/17/15 11:56, Read, James C wrote: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise.
James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 5:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. your default model will thus (roughly) implement the following model: p(e|f) = p(e|f)*p(f|e) which is obviously wrong, and will give you poor results. This is not a bug in the code, but a poor choice
Re: [Moses-support] Major bug found in Moses
When you filter the TM, you reported that you used the fourth weight. When you translate with the full TM, what weights did you assign to the TM? If you used the default, I believe it would equally weight all the phrasal features (i.e., 1 1 1 1). This would explain why decoding with the full TM does not give the same result as filtering first. The moses.ini in your unfiltered translation experiment should assign weights of 0 0 0 1 to the TM features. On Jun 17, 2015, at 1:52 PM, Read, James C jcr...@essex.ac.uk wrote: The analogy doesn't seem to be helping me understand just how exactly it is a desirable quality of a TM to a) completely break down if no LM is used (thank you for showing that such is not always the case) b) be dependent on a tuning step to help it find the higher scoring translations What you seem to be essentially saying is that the TM cannot find the higher scoring translations because I didn't pretune the system to do so. And I am supposed to accept that such is a desirable quality of a system whose very job is to find the higher scoring translations. Further, I am still unclear which features you require a system to be tuned on. At the very least it seems that I have discovered the selection process that tuning seems to be making up for in some unspecified and altogether opaque way. James From: Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 8:34 PM To: Read, James C; Kenneth Heafield; moses-support@mit.edu Cc: Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses 4 BLEU is nothing to sniff at :) I was answering Ken's tangent aspersion that LMs are needed for tuning. I have some sympathy for you. You're looking at ways to improve translation by reducing the search space. I've bashed my head against this wall for a while as well without much success. However, as everyone is telling you, you haven't understood the role of tuning.
Without tuning, you're pointing your lab rat to some random part of the search space, instead of away from the furry animal with whiskers and towards the yellow cheesy thing On 17/06/2015 20:45, Read, James C wrote: Doesn't look like the LM is contributing all that much then does it? James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 7:35 PM To: Kenneth Heafield; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On 17/06/2015 20:13, Kenneth Heafield wrote: I'll bite. The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large space of translations. The filter looks at one feature p(target | source). So now you've constrained the awful untuned model to a slightly better region of the search space. In other words, all you've done is a poor approximation to manually setting the weight to 1.0 on p(target | source) and the rest to 0. The problem isn't that you are running without a language model (though we generally do not care what happens without one). The problem is that you did not tune the feature weights. Moreover, as Marcin is pointing out, I wouldn't necessarily expect tuning to work without an LM. Tuning does work without a LM. The results aren't half bad. fr-en europarl (pb): with LM: 22.84 retuned without LM: 18.33 On 06/17/15 11:56, Read, James C wrote: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise.
James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 5:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. your default model will thus (roughly) implement the following model: p(e|f) = p(e|f)*p(f|e) which is obviously wrong, and will give you poor results. This is not a bug in the code
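The suggestion above amounts to editing the weight section of moses.ini so that only the feature used for filtering carries weight. A hypothetical fragment is sketched below; the exact section and feature names vary across Moses versions, so treat this as an illustration of the 0 0 0 1 idea rather than copy-paste configuration:

```ini
# Hypothetical moses.ini excerpt (newer weight syntax assumed):
# zero out all translation-model features except the fourth, the
# one the filtering experiment ranked by, so the unfiltered decoder
# scores candidates by that feature alone.
[weight]
TranslationModel0= 0 0 0 1
WordPenalty0= 0
```

With weights like these, decoding the full table and decoding the filtered table should agree far more closely, which is the comparison the message asks for.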
Re: [Moses-support] Major bug found in Moses
hi -- you might not be aware, but your emails sound almost belligerently confrontational. i can see how you would be frustrated, but starting a conversation with "i have found a major bug" and then repeatedly saying that clearly everything is broken -- that may not be the best way to convince the few hundred people on the mailing list of the soundness of your approach. also, your argument could easily be mis-interpreted as "this behavior is unexpected to me, ergo this is unexpected behavior", and that will unfortunately bias the listener against you, as that is the preferred argument structure of conspiracy theorists. at any rate, the system is designed to take a large number of phrase pairs and model scores and cobble them together into a translation. it does do that. it appears that you have identified a different way of doing that cobbling-together, one that uses far fewer models -- so far so good! however, from reading your paper, it seems that your baseline is completely unoptimized, so performance gains against it may not show up in the real world. as specific examples, Table 1 in your paper shows that your baseline French-English system score is 11.36, Spanish-English is 7.16, and German-English is 6.70 BLEU. if you compare those baselines against published results in those languages from the previous few years, you will see that those scores are well off the mark. your position will be helped by showing results against a stronger, yet still basic, baseline. what happens if you compare your approach against a vanilla use of the Moses pipeline [this includes tuning]? cheers, ~amittai On 6/17/15 12:45, Read, James C wrote: Doesn't look like the LM is contributing all that much then does it?
Re: [Moses-support] Major bug found in Moses
Read here for a table of results for 40 language pairs: http://privatewww.essex.ac.uk/~jcread/paper.pdf Would you honestly expect such huge differences in BLEU score? Honestly!? James From: Read, James C Sent: Wednesday, June 17, 2015 4:56 PM To: Marcin Junczys-Dowmunt Cc: Moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses You would expect an improvement of 37 BLEU points? James
Re: [Moses-support] Major bug found in Moses
Hi James No, not at all. I would say that is expected behaviour. It's how search spaces and optimization work. If anything these are methodological mistakes on your side, sorry. You are doing weird things to the decoder and then you are surprised to get weird results from it. On 2015-06-17 16:07, Read, James C wrote: So, do we agree that this is undesirable behaviour and therefore a bug? James - FROM: Marcin Junczys-Dowmunt junc...@amu.edu.pl SENT: Wednesday, June 17, 2015 5:01 PM TO: Read, James C SUBJECT: Re: [Moses-support] Major bug found in Moses As I said. With an unpruned phrase table and a decoder that just optimizes some unreasonable set of weights, all bets are off, so if you get very low BLEU scores there, it's not surprising. It's probably jumping around in a very weird search space. With a pruned phrase table you restrict the search space VERY strongly. Nearly everything that will be produced is a half-decent translation. So yes, I can imagine that would happen. Marcin
Re: [Moses-support] Major bug found in Moses
All I did was break the link to the language model and then perform filtering. How is that a methodological mistake? How else would one test the efficacy of the TM in isolation? I remain convinced that this is undesirable behaviour and therefore a bug. James
Re: [Moses-support] Major bug found in Moses
Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. your default model will thus (roughly) implement the following model: p(e|f) = p(e|f)*p(f|e) which is obviously wrong, and will give you poor results. This is not a bug in the code, but a poor choice of models and weights. Standard steps in SMT (like tuning the model weights on a development set, and including a language model) will give you the desired results. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
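Rico's point can be sketched numerically. Moses scores hypotheses with a weighted sum of log features; with the LM removed and untuned, uniform weights on both phrase-probability features, the ranking follows the product p(e|f)*p(f|e) rather than p(e|f) alone. A toy illustration (the feature names and numbers below are made up for the sketch, not Moses internals):

```python
import math

# Toy log-feature values for one phrase pair (numbers are illustrative).
features = {"p_e_given_f": math.log(0.4), "p_f_given_e": math.log(0.1)}

# Untuned, uniform weights and no LM: both directions count equally.
weights = {"p_e_given_f": 1.0, "p_f_given_e": 1.0}

# Log-linear model score: sum of weight * log-feature.
score = sum(weights[k] * features[k] for k in features)

# The decoder therefore ranks by p(e|f) * p(f|e), not by p(e|f) alone.
assert abs(math.exp(score) - 0.4 * 0.1) < 1e-9
```

Tuning replaces the uniform weights with ones chosen so that model score correlates with translation quality, which is the step the rest of the thread keeps pointing to.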
Re: [Moses-support] Major bug found in Moses
Hi James, there are many more factors involved than just probability, for instance word penalties, phrase penalties etc. To be able to validate your own claim you would need to set weights for all those non-probabilities to zero. Otherwise there is no hope that moses will produce anything similar to the most probable translation. And based on that there is no surprise that there may be different translations. A pruned phrase table will naturally produce less noise, so I would say the behaviour you describe is quite exactly what I would expect to happen. Best, Marcin On 2015-06-17 15:26, Read, James C wrote: Hi all, I tried unsuccessfully to publish experiments showing this bug in Moses behaviour. As a result I have lost interest in attempting to have my work published. Nonetheless I think you all should be aware of an anomaly in Moses' behaviour which I have thoroughly exposed and which should be easy enough for you to reproduce. As I understand it the TM logic of Moses should select the most likely translations according to the TM. I would therefore expect a run of Moses with no LM to find sentences which are the most likely, or at least close to the most likely, according to the TM. To test this behaviour I performed two runs of Moses: one with an unfiltered phrase table, the other with a filtered phrase table which left only the most likely phrase pair for each source language phrase. The results were truly startling. I observed huge differences in BLEU score. The filtered phrase tables produced much higher BLEU scores. The beam size used was the default width of 100. I would not have been surprised if the differences in BLEU scores were minimal but they were quite high. I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM-only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I hope this information will be useful to the Moses community and that the cause of the behaviour can be found and rectified. James ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
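The filtering James describes (keeping only the single most likely phrase pair per source phrase) is easy to reproduce outside the decoder. A sketch over the usual `src ||| tgt ||| scores` text phrase table; the assumption that p(tgt|src) is the third score column follows the common Moses score ordering, but check your own table:

```python
# Keep only the most probable target phrase for each source phrase.
# Assumes the standard Moses text phrase table: "src ||| tgt ||| s1 s2 s3 s4"
# and that the direct probability p(tgt|src) is the third score column
# (an assumption; verify the score order of your own table).

def filter_phrase_table(lines, p_col=2):
    best = {}  # source phrase -> (p(tgt|src), full line)
    for line in lines:
        src, tgt, scores = [f.strip() for f in line.split("|||")[:3]]
        p = float(scores.split()[p_col])
        if src not in best or p > best[src][0]:
            best[src] = (p, line)
    return [line for _, line in best.values()]

table = [
    "le chat ||| the cat ||| 0.5 0.4 0.7 0.6",
    "le chat ||| cat the ||| 0.2 0.1 0.1 0.1",
]
filtered = filter_phrase_table(table)  # keeps only the "the cat" entry
```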
Re: [Moses-support] Major bug found in Moses
Hi, BLEU scores don't mean much, unless you know what the translations look like. Marcin's explanation sounds very plausible. How did you set weights in your experiment? And were they fixed for the two contrastive runs? Cheers, O. On June 17, 2015 4:01:26 PM CEST, Read, James C jcr...@essex.ac.uk wrote: Read here for a table of results for 40 language pairs: http://privatewww.essex.ac.uk/~jcread/paper.pdf Would you honestly expect such huge differences in BLEU score? Honestly!? James -- Ondrej Bojar (mailto:o...@cuni.cz / bo...@ufal.mff.cuni.cz) http://www.cuni.cz/~obo
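For context on Ondrej's question: the weights under discussion live in the moses.ini configuration, in recent Moses versions under a `[weight]` section. A sketch of what zeroing the non-probability features (as Marcin suggests) might look like; the feature names here follow common Moses defaults and are illustrative, so they may differ in your setup:

```ini
[weight]
; Illustrative only -- feature names depend on your moses.ini.
UnknownWordPenalty0= 1
WordPenalty0= 0
PhrasePenalty0= 0
Distortion0= 0
TranslationModel0= 0.25 0.25 0.25 0.25
```

These are exactly the values MERT or another tuner would overwrite; running untuned means decoding with whatever defaults the file shipped with.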
Re: [Moses-support] Major bug found in Moses
To paint you a picture: Imagine you have a rat in a labyrinth (the labyrinth is the TM and the search space). That rat is quite good at finding the center of that labyrinth. Now you somehow disable that rat's sense of smell, sense of direction, and long-term and short-term memory (that's the LM). Can you expect the rat to find the center? Or will it just tumble around, bumping into walls and not find anything? That's what you did to the decoder when disabling the LM. Now you prune the TM. In the labyrinth that's like closing all the doors that would lead the rat away from the center. There are still a few corridors left, but they all point in the general direction of the point where the rat is supposed to go, although it may never quite reach it. Now you put that same handicapped rat into the labyrinth where all ways lead more or less to the center. Are you really surprised that the clueless rat finds the center nearly every time now? That's what happened. It's not a bug. The LM is probably the strongest feature in an MT system. If you take that away you see what happens. On 2015-06-17 16:22, Read, James C wrote: All I did was break the link to the language model and then perform filtering. How is that a methodological mistake? How else would one test the efficacy of the TM in isolation? I remain convinced that this is undesirable behaviour and therefore a bug. James
Re: [Moses-support] Major bug found in Moses
Doesn't look like the LM is contributing all that much then does it? James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 7:35 PM To: Kenneth Heafield; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On 17/06/2015 20:13, Kenneth Heafield wrote: I'll bite. The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large space of translations. The filter looks at one feature, p(target | source). So now you've constrained the awful untuned model to a slightly better region of the search space. In other words, all you've done is a poor approximation to manually setting the weight to 1.0 on p(target | source) and the rest to 0. The problem isn't that you are running without a language model (though we generally do not care what happens without one). The problem is that you did not tune the feature weights. Moreover, as Marcin is pointing out, I wouldn't necessarily expect tuning to work without an LM. Tuning does work without a LM. The results aren't half bad. fr-en europarl (pb): with LM: 22.84 retuned without LM: 18.33 On 06/17/15 11:56, Read, James C wrote: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise. James -- Hieu Hoang Researcher New York University, Abu Dhabi http://www.hoang.co.uk/hieu
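Kenneth's equivalence claim can be checked on a toy example: for a single source phrase, decoding with weight 1.0 on p(target|source) and 0 on everything else selects the same entry that the max-p(target|source) filter would have kept. (All values below are illustrative.)

```python
import math

# Toy phrase-table entries for one source phrase:
# (target, p(tgt|src), p(src|tgt)) -- all numbers illustrative.
entries = [
    ("the cat", 0.7, 0.5),
    ("cat the", 0.1, 0.2),
    ("a cat",   0.2, 0.9),
]

def model_score(entry, w_direct=1.0, w_inverse=0.0):
    """Log-linear score with weight 1.0 on p(tgt|src) and 0 elsewhere."""
    return w_direct * math.log(entry[1]) + w_inverse * math.log(entry[2])

picked_by_weights = max(entries, key=model_score)  # weighted, untuned decode
kept_by_filter = max(entries, key=lambda e: e[1])  # the filtering experiment
assert picked_by_weights == kept_by_filter         # same entry either way
```

Over whole sentences the two are not identical (filtering also shrinks the space the beam search explores), which is why Kenneth calls it only a poor approximation.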
Re: [Moses-support] Major bug found in Moses
James, Did you run any optimizer? MERT, MIRA, PRO, etc? Lane On Wed, Jun 17, 2015 at 11:45 AM, Read, James C jcr...@essex.ac.uk wrote: Doesn't look like the LM is contributing all that much then does it? James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 7:35 PM To: Kenneth Heafield; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On 17/06/2015 20:13, Kenneth Heafield wrote: I'll bite. The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large space of translations. The filter looks at one feature p(target | source). So now you've constrained the awful untuned model to a slightly better region of the search space. In other words, all you've done is a poor approximation to manually setting the weight to 1.0 on p(target | source) and the rest to 0. The problem isn't that you are running without a language model (though we generally do not care what happens without one). The problem is that you did not tune the feature weights. Moreover, as Marcin is pointing out, I wouldn't necessarily expect tuning to work without an LM. Tuning does work without a LM. The results aren't half bad. fr-en europarl (pb): with LM: 22.84 retuned without LM: 18.33 On 06/17/15 11:56, Read, James C wrote: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments provve otherwise. 
James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 5:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. your default model will thus (roughly) implement the following model: p(e|f) = p(e|f)*p(f|e) which is obviously wrong, and will give you poor results. This is not a bug in the code, but a poor choice of models and weights. Standard steps in SMT (like tuning the model weights on a development set, and including a language model) will give you the desired results. -- Hieu Hoang Researcher New York University, Abu Dhabi http://www.hoang.co.uk/hieu -- When a place gets crowded enough to require ID's, social collapse is not far away. It is time to go elsewhere. The best thing about space travel is that it made it possible to go elsewhere. -- R.A. 
Heinlein, Time Enough For Love
Re: [Moses-support] Major bug found in Moses
No. James From: Lane Schwartz dowob...@gmail.com Sent: Wednesday, June 17, 2015 7:58 PM To: Read, James C Cc: Hieu Hoang; Kenneth Heafield; moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses James, Did you run any optimizer? MERT, MIRA, PRO, etc? Lane On Wed, Jun 17, 2015 at 11:45 AM, Read, James C jcr...@essex.ac.uk wrote: Doesn't look like the LM is contributing all that much then does it? James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 7:35 PM To: Kenneth Heafield; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On 17/06/2015 20:13, Kenneth Heafield wrote: I'll bite. The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large space of translations. The filter looks at one feature p(target | source). So now you've constrained the awful untuned model to a slightly better region of the search space. In other words, all you've done is a poor approximation to manually setting the weight to 1.0 on p(target | source) and the rest to 0. The problem isn't that you are running without a language model (though we generally do not care what happens without one). The problem is that you did not tune the feature weights. Moreover, as Marcin is pointing out, I wouldn't necessarily expect tuning to work without an LM. Tuning does work without a LM. The results aren't half bad. 
fr-en europarl (pb): with LM: 22.84 retuned without LM: 18.33 On 06/17/15 11:56, Read, James C wrote: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 5:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. your default model will thus (roughly) implement the following model: p(e|f) = p(e|f)*p(f|e) which is obviously wrong, and will give you poor results. This is not a bug in the code, but a poor choice of models and weights. Standard steps in SMT (like tuning the model weights on a development set, and including a language model) will give you the desired results. 
-- Hieu Hoang Researcher New York University, Abu Dhabi http://www.hoang.co.uk/hieu -- When a place gets crowded enough to require ID's, social collapse is not far away. It is time to go elsewhere. The best thing about space travel is that it made it possible to go elsewhere. -- R.A. Heinlein, Time Enough For Love ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Which features would you like me to tune? The whole purpose of the exercise was to eliminate all variables except the TM and to keep constant those that could not be eliminated, so that I could see which types of phrase pairs contribute most to increases in BLEU score in a TM-only setup. Now you are saying I have to tune, but tuning won't work without a LM. So how do you expect a researcher to be able to understand how well the TM component of the system is working if you are going to insist that I must include a LM for tuning to work? Clearly the system is broken. It is designed to work well with a LM and poorly without, when clearly good results can be obtained with a functional TM and well chosen phrase pairs. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Kenneth Heafield mo...@kheafield.com Sent: Wednesday, June 17, 2015 7:13 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses I'll bite. The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large space of translations. The filter looks at one feature p(target | source). So now you've constrained the awful untuned model to a slightly better region of the search space. In other words, all you've done is a poor approximation to manually setting the weight to 1.0 on p(target | source) and the rest to 0. The problem isn't that you are running without a language model (though we generally do not care what happens without one). The problem is that you did not tune the feature weights. Moreover, as Marcin is pointing out, I wouldn't necessarily expect tuning to work without an LM. 
On 06/17/15 11:56, Read, James C wrote: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 5:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. your default model will thus (roughly) implement the following model: p(e|f) = p(e|f)*p(f|e) which is obviously wrong, and will give you poor results. This is not a bug in the code, but a poor choice of models and weights. Standard steps in SMT (like tuning the model weights on a development set, and including a language model) will give you the desired results. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Read, James C jcread@... writes: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise. James I recommend you read the following: https://en.wikipedia.org/wiki/Confusion_of_the_inverse You don't explain which score you use for filtering (do you take one of the scores, their sum, their product, or something else?), but I expect you (mostly) keep the phrase pairs with a high p(e|f), which is the best thing to do when you don't have a language model. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
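[Editor's note] Rico's question about which score was used can be made concrete. The sketch below is illustrative Python, not Moses code; it assumes the conventional Moses phrase-table layout of `src ||| tgt ||| scores` with the scores ordered p(f|e), lex(f|e), p(e|f), lex(e|f) (a given table may differ):

```python
# Illustrative sketch: reading individual score columns from a
# Moses-style phrase table line. Score order assumed conventional:
# p(f|e), lex(f|e), p(e|f), lex(e|f).

def scores_of(line):
    # Fields are separated by "|||": source, target, scores, ...
    fields = [f.strip() for f in line.split("|||")]
    return [float(s) for s in fields[2].split()]

line = "le chat ||| the cat ||| 0.8 0.6 0.7 0.5"
s = scores_of(line)
inverse = s[0]  # p(f|e), the inverse phrase probability
direct = s[2]   # p(e|f), the direct phrase probability
print(inverse, direct)  # 0.8 0.7
```

Filtering on the direct probability p(e|f) is, as Rico notes, the closest a hard filter can come to what a tuned weight on that single feature would do.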
Re: [Moses-support] Major bug found in Moses
I already answered this question in another post. Apologies for double posting. Here is the code I used for filtering. I filtered based on the fourth score only.

#!/usr/bin/perl -w
#
# Program filters phrase table to leave only phrase pairs
# with probability above a threshold
#
use strict;
use warnings;
use Getopt::Long;

my $min;
my $phrase_table;
my $filtered_table;

GetOptions(
    'min=f' => \$min,
    'out=s' => \$filtered_table,
    'in=s'  => \$phrase_table);

die "ERROR: must give threshold and phrase table input file and output file\n"
    unless ($min && $phrase_table && $filtered_table);
die "ERROR: file $phrase_table does not exist\n" unless (-e $phrase_table);

open (PHRASETABLE, '<', $phrase_table)
    or die "FATAL: Could not open phrase table $phrase_table\n";
open (FILTEREDTABLE, '>', $filtered_table)
    or die "FATAL: Could not open output file $filtered_table\n";

while (my $line = <PHRASETABLE>) {
    chomp $line;
    my @columns = split ('\|\|\|', $line);
    # check that file is a well formatted phrase table
    if (scalar @columns < 4) {
        die "ERROR: input file is not a well formatted phrase table. A phrase table must have at least four columns, each column separated by |||\n";
    }
    # get the probability and keep the pair if it is above the threshold
    my @scores = split /\s+/, $columns[2];
    if ($scores[3] > $min) {
        print FILTEREDTABLE $line . "\n";
    }
}

From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 7:17 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise. 
James I recommend you read the following: https://en.wikipedia.org/wiki/Confusion_of_the_inverse You don't explain which score you use for filtering (do you take one of the scores, their sum, their product, or something else?), but I expect you (mostly) keep the phrase pairs with a high p(e|f), which is the best thing to do when you don't have a language model. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Below I include a typical moses.ini file. Of course they were kept the same for both runs. The only difference was the phrase table filtering. I did everything in my power to make this the only variable. James From: Ondrej Bojar bo...@ufal.mff.cuni.cz Sent: Wednesday, June 17, 2015 5:23 PM To: Read, James C; Marcin Junczys-Dowmunt Cc: Moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses Hi, BLEU scores don't mean much, unless you know what the translations look like. Marcin's explanation sounds very plausible. How did you set weights in your experiment? And were they fixed for the two contrastive runs? Cheers, O. On June 17, 2015 4:01:26 PM CEST, Read, James C jcr...@essex.ac.uk wrote: Read here for a table of results for 40 language pairs: http://privatewww.essex.ac.uk/~jcread/paper.pdf Would you honestly expect such huge differences in BLEU score? Honestly!? James From: Read, James C Sent: Wednesday, June 17, 2015 4:56 PM To: Marcin Junczys-Dowmunt Cc: Moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses You would expect an improvement of 37 BLEU points? James From: Marcin Junczys-Dowmunt junc...@amu.edu.pl Sent: Wednesday, June 17, 2015 4:32 PM To: Read, James C Cc: Moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses Hi James, there are many more factors involved than just probability, for instance word penalties, phrase penalties, etc. To be able to validate your own claim you would need to set weights for all those non-probabilities to zero. Otherwise there is no hope that moses will produce anything similar to the most probable translation. And based on that there is no surprise that there may be different translations. A pruned phrase table will naturally produce less noise, so I would say the behaviour you describe is quite exactly what I would expect to happen. 
Best, Marcin On 2015-06-17 15:26, Read, James C wrote: Hi all, I tried unsuccessfully to publish experiments showing this bug in Moses behaviour. As a result I have lost interest in attempting to have my work published. Nonetheless I think you all should be aware of an anomaly in Moses' behaviour which I have thoroughly exposed and which should be easy enough for you to reproduce. As I understand it, the TM logic of Moses should select the most likely translations according to the TM. I would therefore expect a run of Moses with no LM to find sentences which are the most likely, or at least close to the most likely, according to the TM. To test this behaviour I performed two runs of Moses: one with an unfiltered phrase table, the other with a filtered phrase table which left only the most likely phrase pair for each source language phrase. The results were truly startling. I observed huge differences in BLEU score. The filtered phrase tables produced much higher BLEU scores. The beam size used was the default width of 100. I would not have been surprised if the differences in BLEU scores were minimal, but they were quite high. I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I hope this information will be useful to the Moses community and that the cause of the behaviour can be found and rectified. 
James -- Ondrej Bojar (mailto:o...@cuni.cz / bo...@ufal.mff.cuni.cz) http://www.cuni.cz/~obo ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
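[Editor's note] The filtering experiment James describes above (keeping only the most likely phrase pair for each source phrase) can be sketched in a few lines. This is illustrative Python under the assumption that each row carries one chosen score; it is not the script actually used in the experiments:

```python
# Sketch of per-source filtering: keep only the highest-scoring
# target phrase for each source phrase. Rows are (source, target,
# score) tuples here; a real phrase table would be parsed from
# "|||"-separated text.

def keep_best(rows):
    best = {}
    for src, tgt, score in rows:
        if src not in best or score > best[src][1]:
            best[src] = (tgt, score)
    return {src: tgt for src, (tgt, _) in best.items()}

rows = [
    ("chat", "cat", 0.7),
    ("chat", "chat", 0.2),   # noisy pair left in by the extractor
    ("chien", "dog", 0.9),
]
print(keep_best(rows))  # {'chat': 'cat', 'chien': 'dog'}
```

As Marcin argues in the thread, such pruning restricts the decoder's search space so strongly that nearly every surviving hypothesis is a reasonable translation, which by itself can move BLEU substantially.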
Re: [Moses-support] Major bug found in Moses
Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 5:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. your default model will thus (roughly) implement the following model: p(e|f) = p(e|f)*p(f|e) which is obviously wrong, and will give you poor results. This is not a bug in the code, but a poor choice of models and weights. Standard steps in SMT (like tuning the model weights on a development set, and including a language model) will give you the desired results. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Evidently, if you filter the phrase table then the LM is not as important as you might feel. The question remains: why isn't the system capable of finding the most likely translations without the LM? Why do I need to filter to help the system find them? This is undesirable behaviour. Clearly a bug. I include the code I used for filtering. As you can see, only the 4th score was used as the filtering criterion.

#!/usr/bin/perl -w
#
# Program filters phrase table to leave only phrase pairs
# with probability above a threshold
#
use strict;
use warnings;
use Getopt::Long;

my $min;
my $phrase_table;
my $filtered_table;

GetOptions(
    'min=f' => \$min,
    'out=s' => \$filtered_table,
    'in=s'  => \$phrase_table);

die "ERROR: must give threshold and phrase table input file and output file\n"
    unless ($min && $phrase_table && $filtered_table);
die "ERROR: file $phrase_table does not exist\n" unless (-e $phrase_table);

open (PHRASETABLE, '<', $phrase_table)
    or die "FATAL: Could not open phrase table $phrase_table\n";
open (FILTEREDTABLE, '>', $filtered_table)
    or die "FATAL: Could not open output file $filtered_table\n";

while (my $line = <PHRASETABLE>) {
    chomp $line;
    my @columns = split ('\|\|\|', $line);
    # check that file is a well formatted phrase table
    if (scalar @columns < 4) {
        die "ERROR: input file is not a well formatted phrase table. A phrase table must have at least four columns, each column separated by |||\n";
    }
    # get the probability and keep the pair if it is above the threshold
    my @scores = split /\s+/, $columns[2];
    if ($scores[3] > $min) {
        print FILTEREDTABLE $line . "\n";
    }
}

From: Matt Post p...@cs.jhu.edu Sent: Wednesday, June 17, 2015 5:25 PM To: Read, James C Cc: Marcin Junczys-Dowmunt; moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses I think you are misunderstanding how decoding works. The highest-weighted translation of each source phrase is not necessarily the one with the best BLEU score. 
This is why the decoder retains many options, so that it can search among them (together with their reorderings). The LM is an important component in making these selections. Also, how did you weight the many probabilities attached to each phrase (to determine which was the most probable)? The tuning phase of decoding selects weights designed to optimize BLEU score. If you weighted them evenly, that is going to exacerbate this experiment. matt On Jun 17, 2015, at 10:22 AM, Read, James C jcr...@essex.ac.uk wrote: All I did was break the link to the language model and then perform filtering. How is that a methodological mistake? How else would one test the efficacy of the TM in isolation? I remain convinced that this is undesirable behaviour and therefore a bug. James From: Marcin Junczys-Dowmunt junc...@amu.edu.pl Sent: Wednesday, June 17, 2015 5:12 PM To: Read, James C Cc: Arnold, Doug; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Hi James No, not at all. I would say that is expected behaviour. It's how search spaces and optimization work. If anything these are methodological mistakes on your side, sorry. You are doing weird things to the decoder and then you are surprised to get weird results from it. On 2015-06-17 16:07, Read, James C wrote: So, do we agree that this is undesirable behaviour and therefore a bug? James From: Marcin Junczys-Dowmunt junc...@amu.edu.pl Sent: Wednesday, June 17, 2015 5:01 PM To: Read, James C Subject: Re: [Moses-support] Major bug found in Moses As I said. With an unpruned phrase table and a decoder that just optimizes some unreasonable set of weights all bets are off, so if you get very low BLEU points there, it's not surprising. It's probably jumping around in a very weird search space. With a pruned phrase table you restrict the search space VERY strongly. 
Nearly everything that will be produced is a half-decent translation. So yes, I can imagine that would happen. Marcin On 2015-06-17 15:56, Read, James C wrote: You would expect an improvement of 37 BLEU points? James From: Marcin Junczys-Dowmunt junc...@amu.edu.pl Sent: Wednesday, June 17, 2015 4:32 PM To: Read, James C Cc: Moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses Hi James, there are many more factors involved than just probability, for instance word penalties, phrase penalties, etc. To be able to validate your own claim you would need to set weights for all those non-probabilities to zero.
Re: [Moses-support] Major bug found in Moses
I'll bite. The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large space of translations. The filter looks at one feature p(target | source). So now you've constrained the awful untuned model to a slightly better region of the search space. In other words, all you've done is a poor approximation to manually setting the weight to 1.0 on p(target | source) and the rest to 0. The problem isn't that you are running without a language model (though we generally do not care what happens without one). The problem is that you did not tune the feature weights. Moreover, as Marcin is pointing out, I wouldn't necessarily expect tuning to work without an LM. On 06/17/15 11:56, Read, James C wrote: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 5:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. 
Your default model will thus (roughly) implement the following model: p(e|f) = p(e|f)*p(f|e), which is obviously wrong, and will give you poor results. This is not a bug in the code, but a poor choice of models and weights. Standard steps in SMT (like tuning the model weights on a development set, and including a language model) will give you the desired results. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
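[Editor's note] Kenneth's point about untuned weights can be illustrated numerically. The sketch below is hypothetical Python with invented feature values, not output from any real system: it shows how a log-linear score (the weighted sum of log feature values that Moses ranks hypotheses by) can place a bad hypothesis above a good one under uniform default-style weights, while putting all weight on p(e|f), which is roughly what the hard filter approximates, reverses the ranking.

```python
import math

# Hypothetical feature values for two competing hypotheses:
# (p(f|e), p(e|f), some third feature). Invented for illustration.

def model_score(features, weights):
    # Log-linear model score: sum of weighted log feature values.
    return sum(w * math.log(h) for w, h in zip(weights, features))

hyp_good = (0.3, 0.8, 0.5)  # high direct probability p(e|f)
hyp_bad  = (0.9, 0.2, 0.9)  # high inverse probability, low p(e|f)

uniform = (1.0, 1.0, 1.0)      # untuned, default-style weights
direct_only = (0.0, 1.0, 0.0)  # all weight on p(e|f)

# Under uniform weights the "bad" hypothesis outscores the good one;
# with all weight on p(e|f) the good one wins.
print(model_score(hyp_bad, uniform) > model_score(hyp_good, uniform))          # True
print(model_score(hyp_good, direct_only) > model_score(hyp_bad, direct_only))  # True
```

Tuning (MERT, MIRA, PRO) searches for the weight vector that makes the model's ranking agree with translation quality on a development set, which is why both Kenneth and Lane insist on it before drawing conclusions from model scores.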