Re: [Moses-support] Major bug found in Moses
An improvement of 37 BLEU points over the default behaviour was not enough to show that there are problems with the default?

James

From: Raphael Payen raphael.pa...@gmail.com
Sent: Sunday, June 21, 2015 5:29 PM
To: Read, James C
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] Major bug found in Moses

James, did you try the modifications Philipp suggested (removing the word penalty and lowering p(f|e))? (I doubt it will be enough to get a best paper award, but it would probably improve your BLEU; that's always a good start :) )

On Friday, June 19, 2015, Read, James C jcr...@essex.ac.uk wrote:

So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? You're telling me that redesigning the search algorithm to prefer higher-scoring phrase pairs is all we need to do to get a best paper at ACL?

James

From: Lane Schwartz dowob...@gmail.com
Sent: Friday, June 19, 2015 7:40 PM
To: Read, James C
Cc: Philipp Koehn; Burger, John D.; moses-support@mit.edu
Subject: Re: [Moses-support] Major bug found in Moses

On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.uk wrote:

What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning.

There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. That is the expected behavior. What I have tried to tell you, and what other people have tried to tell you, is that translations with high model scores are not necessarily good translations. We all want our models to be such that high model scores correspond to good translations, and that low model scores correspond to bad translations. But unfortunately, our models do not innately have this characteristic. We all know this.
We also know a good way to deal with this shortcoming, namely tuning. Tuning is the process by which we attempt to ensure that high model scores correspond to high-quality translations, and that low model scores correspond to low-quality translations.

If you can design models that naturally correspond with translation quality without tuning, that's great. If you can do that, you've got a great shot at winning a Best Paper award at ACL. In the meantime, you may want to consider an apology for your rude behavior and unprofessional attitude.

Goodbye.
Lane

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
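Lane's point, that the decoder maximises a model score which need not track quality until the feature weights are tuned, can be illustrated with a small sketch. Everything below (feature names, probabilities, weights) is invented for illustration; this is not Moses code.

```python
import math

# Log-linear model score: a weighted sum of log feature values.  The decoder
# ranks hypotheses by this score, NOT by translation quality directly.
def model_score(features, weights):
    return sum(weights[name] * math.log(features[name]) for name in features)

# Two invented hypotheses: hyp_a has a weak translation-model probability but
# a strong language-model probability; hyp_b is the reverse.
hyp_a = {"tm": 0.02, "lm": 0.5}
hyp_b = {"tm": 0.5, "lm": 0.05}

default_weights = {"tm": 1.0, "lm": 1.0}  # untuned: all features weighted equally
tuned_weights = {"tm": 1.0, "lm": 3.0}    # "tuned": LM feature upweighted

# Untuned weights prefer hyp_b; the reweighted model prefers hyp_a.  The
# hypotheses did not change -- only the weights did, which is all tuning does.
print(model_score(hyp_a, default_weights) > model_score(hyp_b, default_weights))  # False
print(model_score(hyp_a, tuned_weights) > model_score(hyp_b, tuned_weights))      # True
```

If the weights are left at arbitrary defaults, whichever hypothesis the decoder prefers is, as Hieu puts it later in the thread, pot luck.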
Re: [Moses-support] Major bug found in Moses
I think it is you who seems to have missed the point. If the default behaviour is giving BLEU scores considerably lower than the BLEU score obtained from merely selecting the most likely translation of each phrase, then there is evidently something very wrong with the default behaviour. If you can't see something as blindingly simple as that, then at this point I'm thinking this really isn't a field I want anything to do with.

James

From: Matthias Huck mh...@inf.ed.ac.uk
Sent: Friday, June 19, 2015 10:45 PM
To: Read, James C
Cc: Hieu Hoang; moses-support@mit.edu; Arnold, Doug
Subject: Re: [Moses-support] Major bug found in Moses

Hi James,

Well, it's pretty straightforward: the decoder's job is to find the hypothesis with the maximum model score. That's why everybody builds models which assign high model scores to high-quality translations. Unfortunately, you missed this last point in your own work.

Cheers,
Matthias

On Fri, 2015-06-19 at 14:15, Read, James C wrote:

I'm gonna try once more. This is what he said: "the decoder's job is NOT to find the high quality translation". The next time I have a panel of potential investors in front of me I'm gonna pass that line by them and see how it goes down. I stress the words HIGH QUALITY TRANSLATION. Please promise me that the next time you put in a bid for funding you will guarantee your prospective funders that under no circumstances will you attempt to design a system which searches for HIGH QUALITY TRANSLATION.

James

From: Matthias Huck mh...@inf.ed.ac.uk
Sent: Friday, June 19, 2015 5:08 PM
To: Read, James C
Cc: Hieu Hoang; moses-support@mit.edu; Arnold, Doug
Subject: Re: [Moses-support] Major bug found in Moses

Hi James,

Yes, he just said that. The decoder's job is to find the hypothesis with the maximum model score. That's one reason why your work is flawed. You did not care at all whether your model score correlates with BLEU or not.
Cheers,
Matthias

On Fri, 2015-06-19 at 13:24, Read, James C wrote:

I quote: "the decoder's job is NOT to find the high quality translation". Did you REALLY just say that?

James

From: Hieu Hoang hieuho...@gmail.com
Sent: Wednesday, June 17, 2015 9:00 PM
To: Read, James C
Cc: Kenneth Heafield; moses-support@mit.edu; Arnold, Doug
Subject: Re: [Moses-support] Major bug found in Moses

The decoder's job is NOT to find the high quality translation (as measured by BLEU). Its job is to find translations with high model score. You need the tuning to make sure high quality translation correlates with high model score. If you don't tune, it's pot luck what quality you get. You should tune with the features you use.

Hieu Hoang
Researcher
New York University, Abu Dhabi
http://www.hoang.co.uk/hieu

On 17 June 2015 at 21:52, Read, James C jcr...@essex.ac.uk wrote:

The analogy doesn't seem to be helping me understand just how exactly it is a desirable quality of a TM to a) completely break down if no LM is used (thank you for showing that such is not always the case) and b) be dependent on a tuning step to help it find the higher-scoring translations. What you seem to be essentially saying is that the TM cannot find the higher-scoring translations because I didn't pretune the system to do so. And I am supposed to accept that such is a desirable quality of a system whose very job is to find the higher-scoring translations. Further, I am still unclear which features you require a system to be tuned on. At the very least it seems that I have discovered the selection process that tuning seems to be making up for in some unspecified and altogether opaque way.

James

From: Hieu Hoang hieuho...@gmail.com
Sent: Wednesday, June 17, 2015 8:34 PM
To: Read, James C; Kenneth Heafield; moses-support@mit.edu
Cc: Arnold, Doug
Subject: Re: [Moses-support] Major bug found in Moses

4 BLEU is nothing to sniff at :) I was answering Ken's tangential assertion that LMs are needed for tuning.
I have some sympathy for you. You're looking at ways to improve translation by reducing the search space. I've bashed my head against this wall for a while as well, without much success. However, as everyone is telling you, you haven't understood the role of tuning. Without tuning, you're pointing your lab rat to some random part of the search space, instead of away from the furry animal.
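The benchmark James keeps invoking, picking the single most likely translation of each source phrase with no language model, reordering, or tuning, is simple enough to sketch. The phrase table, phrases, and probabilities below are all invented for illustration:

```python
# Invented toy phrase table: source phrase -> {target phrase: p(e|f)}.
phrase_table = {
    "das haus": {"the house": 0.6, "the home": 0.3, "the building": 0.1},
    "ist klein": {"is small": 0.7, "is little": 0.2, "small is": 0.1},
}

def greedy_translate(source_phrases):
    # Take the single highest-probability target for each source phrase,
    # ignoring the language model, reordering, and all other features.
    return " ".join(max(phrase_table[p], key=phrase_table[p].get)
                    for p in source_phrases)

print(greedy_translate(["das haus", "ist klein"]))  # the house is small
```

The thread's dispute is precisely whether an untuned full decoder scoring worse than this trivial baseline indicates a bug (James's view) or just an untuned log-linear model behaving as specified (everyone else's view).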
Re: [Moses-support] Major bug found in Moses
Hi James,

Irrespective of the fact that you need to tune the weights of the log-linear model: let me provide more references in order to shed light on how well established simple pruning techniques are in our field as well as in related fields (namely, automatic speech recognition). This list of references might not be what you are looking for, but maybe other readers can benefit.

V. Steinbiss, B. Tran, H. Ney. Improvements in Beam Search. In Proc. of the Int. Conf. on Spoken Language Processing (ICSLP'94), pages 2143-2146, Yokohama, Japan, Sept. 1994. http://www.steinbiss.de/vst94d.pdf

R. Zens, F. J. Och, H. Ney. Phrase-Based Statistical Machine Translation. In German Conf. on Artificial Intelligence (KI), pages 18-32, Aachen, Germany, Sept. 2002. https://www-i6.informatik.rwth-aachen.de/publications/download/434/Zens-KI-2002.pdf

Philipp Koehn. Pharaoh: A Beam Search Decoder for Phrase-Based Statistical Machine Translation Models. In Proc. of AMTA, pages 115-124, Washington, DC, USA, Sept./Oct. 2004. http://homepages.inf.ed.ac.uk/pkoehn/publications/pharaoh-amta2004.pdf

Robert C. Moore, Chris Quirk. Faster Beam-Search Decoding for Phrasal Statistical Machine Translation. In Proc. of MT Summit XI, European Association for Machine Translation, Sept. 2007. http://research.microsoft.com/pubs/68097/mtsummit2007_beamsearch.pdf

Richard Zens, Hermann Ney. Improvements in Dynamic Programming Beam Search for Phrase-Based Statistical Machine Translation. In Proc. of the International Workshop on Spoken Language Translation (IWSLT), Honolulu, HI, USA, Oct. 2008. http://www.mt-archive.info/05/IWSLT-2008-Zens.pdf

Cheers,
Matthias

On Wed, 2015-06-24 at 13:11, Read, James C wrote:

Thank you for reading very carefully the draft paper I provided a link to and noticing that the Johnson paper is duly cited there.
Given that you had already noticed this, I shall not proceed to explain the blindingly obvious differences between my very simple filter and their filter based on Fisher's exact test. Other than that, it seems painfully clear that the point I meant to make has not been understood entirely. If the default behaviour produces BLEU scores considerably lower than merely selecting the most likely translation of each phrase, then evidently there is something very wrong with the default behaviour. If we cannot agree on something as obvious as that, then I really can't see this discussion making any productive progress.

James

From: moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch
Sent: Friday, June 19, 2015 8:25 PM
To: moses-support@mit.edu
Subject: Re: [Moses-support] Major bug found in Moses

[sorry for the garbled message before]

You are right, the idea is pretty obvious. It roughly corresponds to 'histogram pruning' in this paper:

Zens, R., Stanton, D., Xu, P. (2012). A Systematic Comparison of Phrase Table Pruning Techniques. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 972-983.

The idea has been described in the literature before that (for instance, Johnson et al. (2007) only use the top 30 phrase pairs per source phrase), and may have been used in practice for even longer. If you read the paper above, you will find that histogram pruning does not improve translation quality on a state-of-the-art SMT system, and performs poorly compared to more advanced pruning techniques.

On 19.06.2015 17:49, Read, James C. wrote:

So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious?
--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
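The histogram pruning Rico describes, keeping only the k most probable phrase pairs per source phrase (Johnson et al. (2007) keep the top 30), amounts to a sort and a slice. The toy table below is invented for illustration:

```python
def histogram_prune(phrase_table, k=30):
    # Keep only the k highest-probability target phrases per source phrase.
    return {
        src: dict(sorted(targets.items(), key=lambda kv: kv[1], reverse=True)[:k])
        for src, targets in phrase_table.items()
    }

# Invented example, pruned to k=2 for demonstration.
table = {"maison": {"house": 0.5, "home": 0.3, "mansion": 0.15, "shack": 0.05}}
print(histogram_prune(table, k=2))  # {'maison': {'house': 0.5, 'home': 0.3}}
```

As the Zens et al. (2012) comparison cited above reports, this simple criterion is easy to implement but performs poorly against significance-based and entropy-based pruning on a tuned, state-of-the-art system.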
Re: [Moses-support] Major bug found in Moses
what *i* would do is tune my systems.

~amittai

On 6/24/15 09:15, Read, James C wrote:

Thank you for such an invitation. Let's see. Given the choice of a) reading through thousands of lines of code trying to figure out why the default behaviour performs considerably worse than merely selecting the most likely translation of each phrase, or b) spending much less time implementing a simple system that does just that, which one would you do? For all I know, maybe I've already implemented such a system that does just that, and not only that, improves considerably on such a basic benchmark. But given that on this list we don't seem to be able to accept that there is a problem with the default behaviour of Moses, I can only conclude that nobody would be interested in access to the code of such a system.

James

From: amittai axelrod amit...@umiacs.umd.edu
Sent: Friday, June 19, 2015 7:52 PM
To: Read, James C; Lane Schwartz
Cc: moses-support@mit.edu; Philipp Koehn
Subject: Re: [Moses-support] Major bug found in Moses

if we don't understand the problem, how can we possibly fix it? all the relevant code is open source. go for it!

~amittai
Re: [Moses-support] Major bug found in Moses
As the title of this thread makes clear, the purpose of reporting the bug was not to invite a discussion about conclusions made in my draft paper. Clearly a community that builds its careers around research in SMT is unlikely to agree with those kinds of conclusions. The purpose was to report the flaw in the default behaviour of Moses in the hope that we could all agree that something ought to be done about it. So far you seem to be the only one who has come even close to acknowledging that there is a problem with Moses' default behaviour.

James

From: Lane Schwartz dowob...@gmail.com
Sent: Wednesday, June 24, 2015 4:43 PM
To: Read, James C
Cc: Rico Sennrich; moses-support@mit.edu
Subject: Re: [Moses-support] Major bug found in Moses

On Wed, Jun 24, 2015 at 8:11 AM, Read, James C jcr...@essex.ac.uk wrote:

Other than that it seems painfully clear that the point I meant to make has not been understood entirely. If the default behaviour produces BLEU scores considerably lower than merely selecting the most likely translation of each phrase then evidently there is something very wrong with the default behaviour. If we cannot agree on something as obvious as that then I really can't see this discussion making any productive progress.

James, I understand your point. I think that the others who have responded also understand your point. We simply disagree with your conclusion. I encourage you to consider the possibility that if the many experts in this field who have responded all think that your conclusion is flawed, then there might be something to that. I will agree, though, that this is a good time to conclude this discussion.

Sincerely,
Lane Schwartz
Re: [Moses-support] Major bug found in Moses
So you still think it's fine that the default would perform 37 BLEU points worse than just selecting the most likely translation of each phrase? You know, I think I would have to try really hard to design a system that performed so poorly.

James

From: amittai axelrod amit...@umiacs.umd.edu
Sent: Wednesday, June 24, 2015 5:36 PM
To: Read, James C; Lane Schwartz
Cc: moses-support@mit.edu; Philipp Koehn
Subject: Re: [Moses-support] Major bug found in Moses

what *i* would do is tune my systems.

~amittai
Re: [Moses-support] Major bug found in Moses
Thank you for such an invitation. Let's see. Given the choice of a) reading through thousands of lines of code trying to figure out why the default behaviour performs considerably worse than merely selecting the most likely translation of each phrase, or b) spending much less time implementing a simple system that does just that, which one would you do? For all I know, maybe I've already implemented such a system that does just that, and not only that, improves considerably on such a basic benchmark. But given that on this list we don't seem to be able to accept that there is a problem with the default behaviour of Moses, I can only conclude that nobody would be interested in access to the code of such a system.

James

From: amittai axelrod amit...@umiacs.umd.edu
Sent: Friday, June 19, 2015 7:52 PM
To: Read, James C; Lane Schwartz
Cc: moses-support@mit.edu; Philipp Koehn
Subject: Re: [Moses-support] Major bug found in Moses

if we don't understand the problem, how can we possibly fix it? all the relevant code is open source. go for it!

~amittai
Re: [Moses-support] Major bug found in Moses
On Wed, Jun 24, 2015 at 8:11 AM, Read, James C jcr...@essex.ac.uk wrote:

Other than that it seems painfully clear that the point I meant to make has not been understood entirely. If the default behaviour produces BLEU scores considerably lower than merely selecting the most likely translation of each phrase then evidently there is something very wrong with the default behaviour. If we cannot agree on something as obvious as that then I really can't see this discussion making any productive progress.

James, I understand your point. I think that the others who have responded also understand your point. We simply disagree with your conclusion. I encourage you to consider the possibility that if the many experts in this field who have responded all think that your conclusion is flawed, then there might be something to that. I will agree, though, that this is a good time to conclude this discussion.

Sincerely,
Lane Schwartz
Re: [Moses-support] Major bug found in Moses
It would be really wonderful if Moses had an out-of-the-box example that ran without further tuning. Would you be willing to create that for us? We would greatly appreciate it.

The open-source community exists on a somewhat different model than the commercial software community. In the open-source community, if a feature doesn't exist, and if you believe it should exist, then the correct response is "may I contribute this feature to the codebase, please?" The fact that no such feature currently exists in Moses means that none of its current users have ever had a need for it. That probably means that all of its current users are machine translation experts, who have no need for an out-of-the-box example that runs without tuning. You are quite correct that it would be nice to expand the user base, so that it includes people who are not machine translation experts, but just want a tool that runs reasonably well out of the box. Since nobody is paid to maintain Moses, however, nobody has ever yet had sufficient incentive to create such an example. If you believe that you have sufficient incentive to create such an example, then please do; we would appreciate it. Thanks.

-----Original Message-----
From: moses-support-boun...@mit.edu [mailto:moses-support-boun...@mit.edu] On Behalf Of Read, James C
Sent: Wednesday, June 24, 2015 10:29 AM
To: John D. Burger
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] Major bug found in Moses

Please allow me to give a synthesis of my understanding of your response: a) we understand that out of the box Moses performs notably less well than merely selecting the most likely translation for each phrase; b) we don't see this as a problem because for years we've been applying a different type of fix; c) we have no intention of rectifying the problem or even acknowledging that there is a problem; d) we would rather continue performing this gratuitous step and insisting that our users perform it also.

Please explain to me: why even bother running the training process if you have already decided that the default setup should not be designed to maximise on the probabilities learned during that step?

James

From: John D. Burger j...@mitre.org
Sent: Wednesday, June 24, 2015 6:03 PM
To: Read, James C
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] Major bug found in Moses

On Jun 24, 2015, at 10:47, Read, James C jcr...@essex.ac.uk wrote:

So you still think it's fine that the default would perform at 37 BLEU points less than just selecting the most likely translation of each phrase?

Yes, I'm pretty sure we all think that's fine, because one of the steps of building a system is tuning. Is this really the essence of your complaint? That the behavior without tuning is not very good? (Please try to reply without your usual snarkiness.)

- John Burger
  MITRE
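The tuning step John refers to searches for feature weights that maximise a quality metric (BLEU in practice) on a held-out development set. Below is a deliberately crude random-search sketch of that idea; real tuners use MERT, MIRA, or similar, and every candidate, feature value, and metric here is invented for illustration:

```python
import random

# Toy "decoder": each source has candidate translations with two invented
# feature values; the decoder picks the candidate with the best weighted score.
CANDIDATES = {
    "src1": [("good translation", (0.2, 0.9)), ("bad translation", (0.9, 0.1))],
    "src2": [("also good", (0.1, 0.8)), ("also bad", (0.8, 0.2))],
}

def decode(src, weights):
    return max(CANDIDATES[src],
               key=lambda cand: sum(w * f for w, f in zip(weights, cand[1])))[0]

def accuracy(outputs, references):  # crude stand-in for BLEU
    return sum(o == r for o, r in zip(outputs, references)) / len(references)

def tune(dev_set, trials=200, seed=0):
    # Random search over weight vectors, keeping whichever weights score best
    # on the dev set.  Real systems use MERT/MIRA instead of random search.
    rng = random.Random(seed)
    best_weights, best_score = None, -1.0
    for _ in range(trials):
        weights = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        outputs = [decode(src, weights) for src, _ in dev_set]
        score = accuracy(outputs, [ref for _, ref in dev_set])
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score

dev = [("src1", "good translation"), ("src2", "also good")]
weights, score = tune(dev)
print(score)  # the best weights found reach accuracy 1.0 on this toy dev set
```

With default (equal) weights the decoder would pick the "bad" candidates here; the search finds weights under which model score and the quality metric agree, which is the whole argument the list is making to James.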
Re: [Moses-support] Major bug found in Moses
On Jun 24, 2015, at 10:47 , Read, James C jcr...@essex.ac.uk wrote: So you still think it's fine that the default would perform at 37 BLEU points less than just selecting the most likely translation of each phrase? Yes, I'm pretty sure we all think that's fine, because one of the steps of building a system is tuning. Is this really the essence of your complaint? That the behavior without tuning is not very good? (Please try to reply without your usual snarkiness.) - John Burger MITRE You know I think I would have to try really hard to design a system that performed so poorly. James From: amittai axelrod amit...@umiacs.umd.edu Sent: Wednesday, June 24, 2015 5:36 PM To: Read, James C; Lane Schwartz Cc: moses-support@mit.edu; Philipp Koehn Subject: Re: [Moses-support] Major bug found in Moses what *i* would do is tune my systems. ~amittai On 6/24/15 09:15, Read, James C wrote: Thank you for such an invitation. Let's see. Given the choice of a) reading through thousands of lines of code trying to figure out why the default behaviour performs considerably worse than merely selecting the most likely translation of each phrase or b) spending much less time implementing a simple system that does just that which one would you do? For all know maybe I've already implemented such a system that does just that and not only that improves considerably on such a basic benchmark. But given that on this list we don't seem to be able to accept that there is a problem with the default behaviour of Moses I can only conclude that nobody would be interested in access to the code of such a system. James From: amittai axelrod amit...@umiacs.umd.edu Sent: Friday, June 19, 2015 7:52 PM To: Read, James C; Lane Schwartz Cc: moses-support@mit.edu; Philipp Koehn Subject: Re: [Moses-support] Major bug found in Moses if we don't understand the problem, how can we possibly fix it? all the relevant code is open source. go for it! 
~amittai On 6/19/15 12:49, Read, James C wrote: So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? Your telling me that redesigning the search algorithm to prefer higher scoring phrase pairs is all we need to do to get a best paper at ACL? James *From:* Lane Schwartz dowob...@gmail.com *Sent:* Friday, June 19, 2015 7:40 PM *To:* Read, James C *Cc:* Philipp Koehn; Burger, John D.; moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.uk mailto:jcr...@essex.ac.uk wrote: What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. That is the expected behavior. What I have tried to tell you, and what other people have tried to tell you, is that translations with high model scores are not necessarily good translations. We all want our models to be such that high model scores correspond to good translations, and that low model scores correspond with bad translations. But unfortunately, our models do not innately have this characteristic. We all know this. We also know a good way to deal with this shortcoming, namely tuning. Tuning is the process by which we attempt to ensure that high model scores correspond to high quality translations, and that low model scores correspond to low quality translations. If you can design models that naturally correspond with translation quality without tuning, that's great. If you can do that, you've got a great shot at winning a Best Paper award at ACL. In the meantime, you may want to consider an apology for your rude behavior and unprofessional attitude. Goodbye. 
Lane ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Thank you for reading very carefully the draft paper I provided a link to and noticing that the Johnson paper is duly cited there. Given that you had already noticed this I shall not proceed to explain the blindingly obvious differences between my very simple filter and their filter based on Fisher's exact test. Other than that it seems painfully clear that the point I meant to make has not been understood entirely. If the default behaviour produces BLEU scores considerably lower than merely selecting the most likely translation of each phrase then evidently there is something very wrong with the default behaviour. If we cannot agree on something as obvious as that then I really can't see this discussion making any productive progress. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Friday, June 19, 2015 8:25 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses [sorry for the garbled message before] you are right. The idea is pretty obvious. It roughly corresponds to 'Histogram pruning' in this paper: Zens, R., Stanton, D., Xu, P. (2012). A Systematic Comparison of Phrase Table Pruning Techniques. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 972-983. The idea has been described in the literature before that (for instance, Johnson et al. (2007) only use the top 30 phrase pairs per source phrase), and may have been used in practice for even longer. If you read the paper above, you will find that histogram pruning does not improve translation quality on a state-of-the-art SMT system, and performs poorly compared to more advanced pruning techniques. On 19.06.2015 17:49, Read, James C. wrote: So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious?
___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
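The histogram pruning Rico describes (keep only the top-k highest-scoring phrase pairs per source phrase; Johnson et al. keep the top 30) can be sketched in a few lines. This is an illustration only: the toy table carries a single score per pair, whereas a real Moses phrase table carries four translation scores.

```python
from collections import defaultdict

def histogram_prune(phrase_table, k):
    """Keep the k highest-scoring target phrases per source phrase.

    phrase_table: list of (src, tgt, score) tuples. The single score
    stands in for whichever column you prune on (e.g. p(e|f)).
    """
    by_source = defaultdict(list)
    for src, tgt, score in phrase_table:
        by_source[src].append((tgt, score))
    pruned = []
    for src, pairs in by_source.items():
        pairs.sort(key=lambda p: p[1], reverse=True)
        pruned.extend((src, tgt, s) for tgt, s in pairs[:k])
    return pruned

# invented example entries
table = [
    ("the house", "das Haus", 0.7),
    ("the house", "das Gebaeude", 0.2),
    ("the house", "Haus", 0.1),
    ("green", "gruen", 0.9),
]
print(len(histogram_prune(table, 2)))  # 3: top 2 for "the house", 1 for "green"
```

As the Zens et al. comparison shows, this is simple to implement but is outperformed by more advanced pruning criteria on strong systems.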
Re: [Moses-support] Major bug found in Moses
On Wed, Jun 24, 2015 at 9:05 AM, Read, James C jcr...@essex.ac.uk wrote: As the title of this thread makes clear the purpose of reporting the bug was not to invite a discussion about conclusions made in my draft paper. Clearly a community that builds its career around research in SMT is unlikely to agree with those kinds of conclusions. The purpose was to report the flaw in the default behaviour of Moses in the hope that we could all agree that something ought to be done about it. So far you seem to be the only one who has come even close to acknowledging that there is a problem with Moses default behaviour. James, I wasn't talking about the conclusion in your paper. I was talking about the conclusion in your email: If the default behaviour produces BLEU scores considerably lower than merely selecting the most likely translation of each phrase then evidently there is something very wrong with the default behaviour. Your conclusion, quoted above, is seriously flawed. There is not something very wrong with the default behavior of Moses. You have not exposed a bug in Moses. What you have exposed is your own lack of understanding of modern statistical machine translation, and your unwillingness to listen when others take the time to explain how and why you are mistaken. I am happy to help explain things to people who are willing to listen. However, you have shown yourself to be not only rude but obstinate and willfully ignorant. I hope that others who find this thread may find it informative. You appear to have learned nothing from it. Until you become willing to listen to others, and until you take a statistical machine translation class and are willing to pay attention to what you learn there, I don't see any point in taking the time to explain things further. As far as I am concerned, this discussion is over. Sincerely, Lane Schwartz ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Please allow me to give a synthesis of my understanding of your response: a) we understand that out of the box Moses performs notably less well than merely selecting the most likely translation for each phrase b) we don't see this as a problem because for years we've been applying a different type of fix c) we have no intention of rectifying the problem or even acknowledging that there is a problem d) we would rather continue performing this gratuitous step and insisting that our users perform it also Please explain to me. Why even bother running the training process if you have already decided that the default setup should not be designed to maximise on the probabilities learned during that step? James From: John D. Burger j...@mitre.org Sent: Wednesday, June 24, 2015 6:03 PM To: Read, James C Cc: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On Jun 24, 2015, at 10:47 , Read, James C jcr...@essex.ac.uk wrote: So you still think it's fine that the default would perform at 37 BLEU points less than just selecting the most likely translation of each phrase? Yes, I'm pretty sure we all think that's fine, because one of the steps of building a system is tuning. Is this really the essence of your complaint? That the behavior without tuning is not very good? (Please try to reply without your usual snarkiness.) - John Burger MITRE You know I think I would have to try really hard to design a system that performed so poorly. James From: amittai axelrod amit...@umiacs.umd.edu Sent: Wednesday, June 24, 2015 5:36 PM To: Read, James C; Lane Schwartz Cc: moses-support@mit.edu; Philipp Koehn Subject: Re: [Moses-support] Major bug found in Moses what *i* would do is tune my systems. ~amittai On 6/24/15 09:15, Read, James C wrote: Thank you for such an invitation. Let's see. 
Given the choice of a) reading through thousands of lines of code trying to figure out why the default behaviour performs considerably worse than merely selecting the most likely translation of each phrase or b) spending much less time implementing a simple system that does just that which one would you do? For all know maybe I've already implemented such a system that does just that and not only that improves considerably on such a basic benchmark. But given that on this list we don't seem to be able to accept that there is a problem with the default behaviour of Moses I can only conclude that nobody would be interested in access to the code of such a system. James From: amittai axelrod amit...@umiacs.umd.edu Sent: Friday, June 19, 2015 7:52 PM To: Read, James C; Lane Schwartz Cc: moses-support@mit.edu; Philipp Koehn Subject: Re: [Moses-support] Major bug found in Moses if we don't understand the problem, how can we possibly fix it? all the relevant code is open source. go for it! ~amittai On 6/19/15 12:49, Read, James C wrote: So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? Your telling me that redesigning the search algorithm to prefer higher scoring phrase pairs is all we need to do to get a best paper at ACL? James *From:* Lane Schwartz dowob...@gmail.com *Sent:* Friday, June 19, 2015 7:40 PM *To:* Read, James C *Cc:* Philipp Koehn; Burger, John D.; moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.uk mailto:jcr...@essex.ac.uk wrote: What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. 
That is the expected behavior. What I have tried to tell you, and what other people have tried to tell you, is that translations with high model scores are not necessarily good translations. We all want our models to be such that high model scores correspond to good translations, and that low model scores correspond with bad translations. But unfortunately, our models do not innately have this characteristic. We all know this. We also know a good way to deal with this shortcoming, namely tuning. Tuning is the process by which we attempt to ensure that high model scores correspond to high quality translations, and that low model scores correspond to low quality translations. If you can design models that naturally correspond with translation quality without tuning, that's great. If you can do that, you've got a great shot at winning a Best Paper award at ACL
Re: [Moses-support] Major bug found in Moses
James, (1) Did you ever look at the model scores? The decoder's job is to find the hypotheses with the highest model score and if your baseline system finds translations with higher model scores than your filtered system then there is no bug in Moses. (2) You should stop talking about BLEU scores as some kind of evidence that there is a bug in the software. We have an unpublished paper in which we show that using BLEU as the objective function to optimize translations in decoding results in terrible translations, too: http://www2.lingfil.uu.se/SLTC2014/abstracts/sltc2014_submission_21.pdf (3) Tuning is part of the training procedure for log-linear models. There is no point in leaving it out (as many others have told you already). (4) Stop driving on the wrong side of the street ... Jörg On Jun 24, 2015, at 5:21 PM, Read, James C wrote: May I humbly suggest that we do some market research and see how many institutions/organisations out there dream about an MT system that out of the box performs at 37 BLEU points less that merely substituting each phrase for its most likely translation? I dare say that most users would expect a system to perform *better* than such a blatantly obvious baseline out of the box. So, please, can we stop trying to play the academic high ground here and just accept that the default behaviour of Moses is much less than desirable? James From: Lane Schwartz dowob...@gmail.com Sent: Wednesday, June 24, 2015 5:56 PM To: Read, James C Cc: Rico Sennrich; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On Wed, Jun 24, 2015 at 9:05 AM, Read, James C jcr...@essex.ac.uk wrote: As the title of this thread makes clear the purpose of reporting the bug was not to invite a discussion about conclusions made in my draft paper. Clearly a community that builds its career around research in SMT is unlikely to agree with those kinds of conclusions. 
The purpose was to report the flaw in the default behaviour of Moses in the hope that we could all agree that something ought to be done about it. So far you seem to be the only one who has come even close to acknowledging that there is a problem with Moses default behaviour. James, I wasn't talking about the conclusion in your paper. I was talking about the conclusion in your email: If the default behaviour produces BLEU scores considerably lower than merely selecting the most likely translation of each phrase then evidently there is something very wrong with the default behaviour. Your conclusion, quoted above, is seriously flawed. There is not something very wrong with the default behavior of Moses. You have not exposed a bug in Moses. What you have exposed is your own lack of understanding of modern statistical machine translation, and your unwillingness to listen when others take the time to explain how and why you are mistaken. I am happy to help explain things to people who are willing to listen. However, you have shown yourself to be not only rude but obstinate and willfully ignorant. I hope that others who find this thread may find it informative. You appear to have learned nothing from it. Until you become willing to listen to others, and until you take a statistical machine translation class and are willing to pay attention to what you learn there, I don't see any point in taking the time to explain things further. As far as I am concerned, this discussion is over. Sincerely, Lane Schwartz ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
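Jörg's first point above (look at the model scores, not BLEU) can be checked mechanically from the two decoders' n-best output: if the unfiltered baseline finds hypotheses with model scores at least as high as the filtered system's, the search did its job and the complaint is about the model, not a bug. A minimal sketch, assuming the standard Moses n-best format `sent-id ||| hypothesis ||| feature values ||| total score`; the example lines are invented:

```python
def best_model_scores(nbest_lines):
    """Return {sentence_id: best total model score} from Moses n-best lines."""
    best = {}
    for line in nbest_lines:
        fields = [f.strip() for f in line.split("|||")]
        sid, score = int(fields[0]), float(fields[3])
        if sid not in best or score > best[sid]:
            best[sid] = score
    return best

baseline = ["0 ||| ein haus ||| tm: -1.2 ||| -3.5",
            "0 ||| das haus ||| tm: -0.8 ||| -2.9"]
filtered = ["0 ||| das haus ||| tm: -0.8 ||| -3.1"]

# If the baseline's best score is >= the filtered system's, the decoder
# maximized its objective; the low BLEU is a modelling issue, not a search bug.
print(best_model_scores(baseline)[0] >= best_model_scores(filtered)[0])  # True
```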
Re: [Moses-support] Major bug found in Moses
John, to my knowledge, you still have not reported BLEU scores for the following experiment: The moses.ini in your unfiltered translation experiment should assign weights of 0 0 0 1 to the TM features. (requested by Matt on June 17). Would you please run this experiment and report the results? Otherwise you are asking the decoder to select phrases with the highest sum of all scores, but expecting it instead to select the phrase whose fourth score alone is highest; even by primary-school math those are two completely different things. Gregor -Original Message- From: moses-support-boun...@mit.edu on behalf of Read, James C jcr...@essex.ac.uk Date: Wednesday 24 June 2015 17:29 To: John D. Burger j...@mitre.org Cc: moses-support@mit.edu moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Please allow me to give a synthesis of my understanding of your response: a) we understand that out of the box Moses performs notably less well than merely selecting the most likely translation for each phrase b) we don't see this as a problem because for years we've been applying a different type of fix c) we have no intention of rectifying the problem or even acknowledging that there is a problem d) we would rather continue performing this gratuitous step and insisting that our users perform it also Please explain to me. Why even bother running the training process if you have already decided that the default setup should not be designed to maximise on the probabilities learned during that step? James From: John D. Burger j...@mitre.org Sent: Wednesday, June 24, 2015 6:03 PM To: Read, James C Cc: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On Jun 24, 2015, at 10:47 , Read, James C jcr...@essex.ac.uk wrote: So you still think it's fine that the default would perform at 37 BLEU points less than just selecting the most likely translation of each phrase?
Yes, I'm pretty sure we all think that's fine, because one of the steps of building a system is tuning. Is this really the essence of your complaint? That the behavior without tuning is not very good? (Please try to reply without your usual snarkiness.) - John Burger MITRE You know I think I would have to try really hard to design a system that performed so poorly. James From: amittai axelrod amit...@umiacs.umd.edu Sent: Wednesday, June 24, 2015 5:36 PM To: Read, James C; Lane Schwartz Cc: moses-support@mit.edu; Philipp Koehn Subject: Re: [Moses-support] Major bug found in Moses what *i* would do is tune my systems. ~amittai On 6/24/15 09:15, Read, James C wrote: Thank you for such an invitation. Let's see. Given the choice of a) reading through thousands of lines of code trying to figure out why the default behaviour performs considerably worse than merely selecting the most likely translation of each phrase or b) spending much less time implementing a simple system that does just that which one would you do? For all know maybe I've already implemented such a system that does just that and not only that improves considerably on such a basic benchmark. But given that on this list we don't seem to be able to accept that there is a problem with the default behaviour of Moses I can only conclude that nobody would be interested in access to the code of such a system. James From: amittai axelrod amit...@umiacs.umd.edu Sent: Friday, June 19, 2015 7:52 PM To: Read, James C; Lane Schwartz Cc: moses-support@mit.edu; Philipp Koehn Subject: Re: [Moses-support] Major bug found in Moses if we don't understand the problem, how can we possibly fix it? all the relevant code is open source. go for it! ~amittai On 6/19/15 12:49, Read, James C wrote: So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? 
Your telling me that redesigning the search algorithm to prefer higher scoring phrase pairs is all we need to do to get a best paper at ACL? James --- - *From:* Lane Schwartz dowob...@gmail.com *Sent:* Friday, June 19, 2015 7:40 PM *To:* Read, James C *Cc:* Philipp Koehn; Burger, John D.; moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.uk mailto:jcr...@essex.ac.uk wrote: What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. That is the expected behavior. What I have tried to tell you
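For reference, Gregor's requested experiment corresponds to a weight block along these lines in moses.ini. This is a sketch: the feature names are the defaults emitted by train-model.perl and may differ in a given setup, and the non-TM weights are zeroed here as well (Gregor's message only specifies the TM weights) so that the fourth translation score is the only signal the decoder sees:

```ini
[weight]
# zero out everything except the fourth translation-model score,
# so the decoder ranks hypotheses by that score alone
WordPenalty0= 0
PhrasePenalty0= 0
TranslationModel0= 0 0 0 1
Distortion0= 0
```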
Re: [Moses-support] Major bug found in Moses
That would make very cool student projects. Also that video is acing it, even the voice-over is synthetic :) On 23.06.2015 00:27, Ondrej Bojar wrote: ...and I wouldn't be surprised to find Moses also behind this Java-to-C# automatic translation: https://www.youtube.com/watch?v=CHDDNnRm-g8 O. - Original Message - From: Marcin Junczys-Dowmunt junc...@amu.edu.pl To: moses-support@mit.edu Sent: Friday, 19 June, 2015 19:21:45 Subject: Re: [Moses-support] Major bug found in Moses On that interesting idea that moses should be naturally good at translating things, just for general considerations. Since some said this thread has educational value I would like to share something that might not be obvious due to the SMT-biased posts here. Moses is also the _leading_ tool for automatic grammatical error correction (GEC) right now. The first and third system of the CoNLL shared task 2014 were based on Moses. By now I have results that surpass the CoNLL results by far by adding some specialized features to Moses (which thanks to Hieu is very easy). It even gets good results for GEC when you do crazy things like inverting the TM (so it should actually make the input worse) provided you tune on the correct metric and for the correct task. The interaction of all the other features after tuning makes that possible. So, if anything, Moses is just a very flexible text-rewriting tool. Tuning (and data) turns into a translator, GEC tool, POS-tagger, Chunker, Semantic Tagger etc. On 19.06.2015 18:40, Lane Schwartz wrote: On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.uk mailto:jcr...@essex.ac.uk wrote: What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. That is the expected behavior. 
What I have tried to tell you, and what other people have tried to tell you, is that translations with high model scores are not necessarily good translations. We all want our models to be such that high model scores correspond to good translations, and that low model scores correspond with bad translations. But unfortunately, our models do not innately have this characteristic. We all know this. We also know a good way to deal with this shortcoming, namely tuning. Tuning is the process by which we attempt to ensure that high model scores correspond to high quality translations, and that low model scores correspond to low quality translations. If you can design models that naturally correspond with translation quality without tuning, that's great. If you can do that, you've got a great shot at winning a Best Paper award at ACL. In the meantime, you may want to consider an apology for your rude behavior and unprofessional attitude. Goodbye. Lane ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
James, did you try the modifications Philipp suggested (removing the word penalty and lowering p(f|e))? (I doubt it will be enough to get a best paper award, but it would probably improve your bleu, that's always a good start :) ) On Friday, June 19, 2015, Read, James C jcr...@essex.ac.uk wrote: So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? You're telling me that redesigning the search algorithm to prefer higher scoring phrase pairs is all we need to do to get a best paper at ACL? James -- *From:* Lane Schwartz dowob...@gmail.com *Sent:* Friday, June 19, 2015 7:40 PM *To:* Read, James C *Cc:* Philipp Koehn; Burger, John D.; moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.uk wrote: What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. That is the expected behavior. What I have tried to tell you, and what other people have tried to tell you, is that translations with high model scores are not necessarily good translations. We all want our models to be such that high model scores correspond to good translations, and that low model scores correspond with bad translations. But unfortunately, our models do not innately have this characteristic. We all know this. We also know a good way to deal with this shortcoming, namely tuning. 
Tuning is the process by which we attempt to ensure that high model scores correspond to high quality translations, and that low model scores correspond to low quality translations. If you can design models that naturally correspond with translation quality without tuning, that's great. If you can do that, you've got a great shot at winning a Best Paper award at ACL. In the meantime, you may want to consider an apology for your rude behavior and unprofessional attitude. Goodbye. Lane ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
On 19/06/15 19:21, Marcin Junczys-Dowmunt wrote: So, if anything, Moses is just a very flexible text-rewriting tool. Tuning (and data) turns into a translator, GEC tool, POS-tagger, Chunker, Semantic Tagger etc. that's a good point, and the basis of some criticism that can be levelled at the Moses community: because Moses is so flexible, the responsibility is on the user to find the right configuration for a task. I think it is getting harder to find out about all of the settings/models necessary to reproduce a state-of-the-art system, especially outside of an established SMT research group. The result is a high barrier to entry, and frustration on all sides when somebody performs experiments with default settings. To stay with the example of phrase table pruning: this is widely used, and I used count-based pruning, threshold pruning based on p(e|f), and histogram pruning based on the model score in my WMT submission. Can and should we make a wider effort to facilitate the reproduction of systems by disseminating settings or configuration files? This dissemination is partially done by system description papers, but they cannot cover all settings [this would make for a very boring paper]. I put some effort into documenting my WMT submission by releasing EMS configuration files ( https://github.com/rsennrich/wmt2014-scripts/tree/master/example ), and I would be happy to see this done more often. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
I like the idea very much. I would need to discuss this with my colleagues, but I guess we can publish recipes for the MT engines we use in production at WIPO and the UN. They are modelled after some of your WMT systems, but tuned for speed and small size. On 20.06.2015 15:42, Adam Lopez wrote: Can and should we make a wider effort to facilitate the reproduction of systems by disseminating settings or configuration files? This dissemination is partially done by system description papers, but they cannot cover all settings [this would make for a very boring paper]. I put some effort into documenting my WMT submission by releasing EMS configuration files ( https://github.com/rsennrich/wmt2014-scripts/tree/master/example ), and I would be happy to see this done more often. Compare with speech recognition, where the major open source toolkit is Kaldi. One of its stated goals is to collect a set of recipes for reproducing state-of-the-art results. http://kaldi.sourceforge.net/about.html I don't know how well they've succeeded at this. But it's an admirable goal. The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Can and should we make a wider effort to facilitate the reproduction of systems by disseminating settings or configuration files? This dissemination is partially done by system description papers, but they cannot cover all settings [this would make for a very boring paper]. I put some effort into documenting my WMT submission by releasing EMS configuration files ( https://github.com/rsennrich/wmt2014-scripts/tree/master/example ), and I would be happy to see this done more often. Compare with speech recognition, where the major open source toolkit is Kaldi. One of its stated goals is to collect a set of recipes for reproducing state-of-the-art results. http://kaldi.sourceforge.net/about.html I don't know how well they've succeeded at this. But it's an admirable goal. The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
According to your book which I have on my desk the job of the TM is to model the most likely translations and the job of the decoder is to intelligently search the space of translations to find the most likely one/s (I'm paraphrasing of course). Would you like to retract that position and republish a next edition of your book which openly states that Moses when used with no LM or tuning or pruning can and should be expected to perform very poorly and select only the least likely translations? Don't you in the slightest find it worrying that like at least 90% of your code base could be thrown out of the window and high scoring results can be obtained with a simple phrase pair based rule based system? Which would you prefer? Would you prefer to consume computational resources calculating probabilities or get straight to the answer with simple logic and low computational requirements? BE HONEST! James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Philipp Koehn p...@jhu.edu Sent: Thursday, June 18, 2015 9:39 PM To: Burger, John D. Cc: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Hi, I am a great fan of open source software, but there is a danger to view its inner workings as a black box - which leads to strange theories of what is going on, instead of real understanding. But we can try to understand it. In the reported experiment, the language model was removed, while the rest of the system was left unchanged. The default untuned weights that train-model.perl assigns to a model are the following:

WordPenalty0= -1
PhrasePenalty0= 0.2
TranslationModel0= 0.2 0.2 0.2 0.2
Distortion0= 0.3

Since no language model is used, a positive distortion cost will lead the decoder to not use any reordering at all. That's a good thing in this case. The word penalty is used to counteract the language model's preference for short translations. Unchecked, there is now a bias towards too long translations.
Then there is the translation model with its equal weights for p(e|f) and p(f|e). The p(e|f) weight and scores are fine and well. However, p(f|e) only makes sense if you have the Bayes theorem in your mind and a language model in your back. But in the reported setup, there is now a bias to translate into rare English phrases, since these will have high p(f|e) scores. My best guess is that the reported setup translates common function words (such as prepositions) into very long rare English phrases - word penalty likes it, p(f|e) likes it, p(e|f) does not mind enough - which produces a lot of rubbish. By filtering for p(e|f) those junky phrases are removed from the phrase table, restricting the decoder to more reasonable choices. I contend that this is not a bug in the software, but a bug in usage. -phi On Thu, Jun 18, 2015 at 11:32 AM, Burger, John D. j...@mitre.org wrote: On Jun 17, 2015, at 11:54, Read, James C jcr...@essex.ac.uk wrote: The question remains why isn't the system capable of finding the most likely translations without the LM? Even if it weren't ill-posed, I don't find this to be an interesting question at all. This is like trying to improve automobile transmissions by disabling the steering. These are the parts we have, and they all work together. It's not as if human translators don't use their own internal language models. - John Burger MITRE Evidently, if you filter the phrase table then the LM is not as important as you might feel. The question remains why isn't the system capable of finding the most likely translations without the LM? Why do I need to filter to help the system find them? This is undesirable behaviour. Clearly a bug. I include the code I used for filtering. As you can see, the 4th score only was used as the filtering criterion.
#!/usr/bin/perl -w
#
# Program filters phrase table to leave only phrase pairs
# with probability above a threshold
#
use strict;
use warnings;
use Getopt::Long;

my $min;
my $phrase_table;
my $filtered_table;

GetOptions(
    'min=f' => \$min,
    'out=s' => \$filtered_table,
    'in=s'  => \$phrase_table);

die "ERROR: must give threshold and phrase table input file and output file\n"
    unless ($min && $phrase_table && $filtered_table);
die "ERROR: file $phrase_table does not exist\n" unless (-e $phrase_table);

open(PHRASETABLE, '<', $phrase_table)
    or die "FATAL: Could not open phrase table $phrase_table\n";
open(FILTEREDTABLE, '>', $filtered_table)
    or die "FATAL: Could not open phrase table $filtered_table\n";

while (my $line = <PHRASETABLE>) {
    chomp $line;
    my @columns = split('\|\|\|', $line);
    # check that file is a well formatted phrase table
    if (scalar @columns < 4) {
        die "ERROR: input file is not a well formatted phrase table. A phrase table line must have at least 4 fields\n";
    }
    # (the message is truncated in the archive at this point; the rest is
    # reconstructed from the description above: keep only pairs whose
    # 4th score clears the threshold)
    my @scores = split(' ', $columns[2]);
    print FILTEREDTABLE "$line\n" if ($scores[3] >= $min);
}
close(PHRASETABLE);
close(FILTEREDTABLE);
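Philipp's diagnosis quoted above can be made concrete with a toy score computation under the default untuned weights he lists. This is a deliberate simplification, not Moses' actual feature computation: lexical weights, the phrase penalty and distortion are ignored, and all probabilities are invented for illustration.

```python
import math

# Default untuned weights quoted above (train-model.perl):
#   WordPenalty0= -1, TranslationModel0= 0.2 0.2 0.2 0.2
W_WORD, W_TM = -1.0, 0.2

def untuned_score(n_words, p_e_given_f, p_f_given_e):
    # Moses' word-penalty feature fires -1 per output word, so with a
    # weight of -1 every extra word *raises* the score once no language
    # model pushes back -- the length bias Philipp describes.
    return (W_WORD * -n_words
            + W_TM * math.log(p_e_given_f)
            + W_TM * math.log(p_f_given_e))

# a common function word translated two ways (invented numbers):
common = untuned_score(1, p_e_given_f=0.4, p_f_given_e=0.3)   # short, common phrase
rare   = untuned_score(5, p_e_given_f=0.01, p_f_given_e=0.9)  # long, rare phrase
print(rare > common)  # True: length reward plus high p(f|e) wins
```

The long, rare translation wins by several points, matching the observed behaviour: without an LM, the default weights reward exactly the junk that filtering on p(e|f) happens to remove.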
Re: [Moses-support] Major bug found in Moses
James, You may see the techniques that exist as outdated, wrong-headed, and inefficient. You have the right to hold that opinion. It may even be that history proves you right. Progress in science is made by people posing questions - often questions that challenge the status quo - and then doing experiments to answer those questions. However, it is incumbent upon you, the proponent of a new idea, to design good experiments to attempt to prove or disprove your new hypothesis. Dispassionately showing the relative merits and shortcomings of your technique against the existing state of the art is part of that process. I, along with numerous other people on this list, have attempted in good faith to answer your questions, and to provide you with our perspective based on our collective understanding of the problem. You, in turn, have responded belligerently. I suggest that you have a frank conversation with your academic advisor or other appropriate mentor regarding your future. If you intend to pursue a successful career in science, academia, government, or industry, you would do well to reconsider the manner in which you interact with other people, especially people with whom you disagree. In the meantime, I would respectfully request that, until you learn how to interact respectfully with other adults, you refrain from posting to this mailing list. Sincerely, Lane Schwartz On Fri, Jun 19, 2015 at 8:45 AM, Read, James C jcr...@essex.ac.uk wrote: According to your book, which I have on my desk, the job of the TM is to model the most likely translations and the job of the decoder is to intelligently search the space of translations to find the most likely one/s (I'm paraphrasing of course). Would you like to retract that position and republish a next edition of your book which openly states that Moses, when used with no LM or tuning or pruning, can and should be expected to perform very poorly and select only the least likely translations?
Don't you in the slightest find it worrying that like at least 90% of your code base could be thrown out of the window and high scoring results can be obtained with a simple phrase pair based rule based system? Which would you prefer? Would you prefer to consume computational resources calculating probabilities or get straight to the answer with simple logic and low computational requirements? BE HONEST! James -- *From:* moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Philipp Koehn p...@jhu.edu *Sent:* Thursday, June 18, 2015 9:39 PM *To:* Burger, John D. *Cc:* moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses Hi, I am a great fan of open source software, but there is a danger in viewing its inner workings as a black box - which leads to strange theories of what is going on, instead of real understanding. But we can try to understand it. In the reported experiment, the language model was removed, while the rest of the system was left unchanged. The default untuned weights that train-model.perl assigns to a model are the following:

WordPenalty0= -1
PhrasePenalty0= 0.2
TranslationModel0= 0.2 0.2 0.2 0.2
Distortion0= 0.3

Since no language model is used, a positive distortion cost will lead the decoder to not use any reordering at all. That's a good thing in this case. The word penalty is used to counteract the language model's preference for short translations. Unchecked, there is now a bias towards too long translations. Then there is the translation model with its equal weights for p(e|f) and p(f|e). The p(e|f) weight and scores are fine and well. However, p(f|e) only makes sense if you have Bayes' theorem in mind and a language model at your back. But in the reported setup, there is now a bias to translate into rare English phrases, since these will have high p(f|e) scores.
My best guess is that the reported setup translates common function words (such as prepositions) into very long rare English phrases - word penalty likes it, p(f|e) likes it, p(e|f) does not mind enough - which produces a lot of rubbish. By filtering for p(e|f), those junky phrases are removed from the phrase table, restricting the decoder to more reasonable choices. I contend that this is not a bug in the software, but a bug in usage. -phi On Thu, Jun 18, 2015 at 11:32 AM, Burger, John D. j...@mitre.org wrote: On Jun 17, 2015, at 11:54, Read, James C jcr...@essex.ac.uk wrote: The question remains why isn't the system capable of finding the most likely translations without the LM? Even if it weren't ill-posed, I don't find this to be an interesting question at all. This is like trying to improve automobile transmissions by disabling the steering. These are the parts we have, and they all work together. It's not as if human translators don't use their own internal language models. - John Burger
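Philipp's arithmetic can be made concrete with a toy calculation. The Python sketch below is illustrative only: the probabilities are made up, only two of the four translation-model features are used, and the word penalty feature is modelled, as in Moses, as minus the target length, so the untuned weight of -1 turns it into a bonus for long output:

```python
import math

# Untuned defaults from train-model.perl, per Philipp's message
W_WORD_PENALTY = -1.0  # word penalty feature = -(target length)
W_TM = 0.2             # weight on each translation-model feature

def model_score(target_len, p_e_given_f, p_f_given_e):
    # Simplified model score: word penalty plus two TM log-probabilities
    word_penalty_feature = -target_len
    return (W_WORD_PENALTY * word_penalty_feature
            + W_TM * math.log(p_e_given_f)
            + W_TM * math.log(p_f_given_e))

# Hypothetical phrase pairs for a common function word:
common = model_score(1, p_e_given_f=0.4, p_f_given_e=0.3)   # e.g. "of"
junk = model_score(5, p_e_given_f=0.001, p_f_given_e=0.9)   # long, rare English phrase

print(round(common, 3), round(junk, 3))  # 0.576 3.597
print(junk > common)                     # True: the junk phrase wins
```

With no language model to intervene, the long rare phrase scores far higher - which is exactly the bias Philipp describes, and what filtering on p(e|f) works around.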
Re: [Moses-support] Major bug found in Moses
If you want to use an automobile analogy then the TM is the engine which powers the vehicle. You as an investor have a few choices before you. Your objective is to make the car run faster. Would you invest your money in: a) the guy that says it is a desirable feature to keep an inefficient fuel guzzling motor that breaks down constantly such that you need to get out and push it (tuning) so it would be much more preferable to optimise the aerodynamics of the vehicle and install a rear window heater to keep your hands warm while you're pushing it b) the guy that says: Well, here's a stroke of genius. Why don't we build a more powerful engine that uses less fuel and doesn't break down with no need to get out and push (tuning or pruning) Honest replies only requested please. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Burger, John D. j...@mitre.org Sent: Thursday, June 18, 2015 6:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On Jun 17, 2015, at 11:54, Read, James C jcr...@essex.ac.uk wrote: The question remains why isn't the system capable of finding the most likely translations without the LM? Even if it weren't ill-posed, I don't find this to be an interesting question at all. This is like trying to improve automobile transmissions by disabling the steering. These are the parts we have, and they all work together. It's not as if human translators don't use their own internal language models. - John Burger MITRE Evidently, if you filter the phrase table then the LM is not as important as you might feel. The question remains: why isn't the system capable of finding the most likely translations without the LM? Why do I need to filter to help the system find them? This is undesirable behaviour. Clearly a bug. I include the code I used for filtering. As you can see, only the 4th score was used as the filtering criterion.
#!/usr/bin/perl -w
#
# Program filters phrase table to leave only phrase pairs
# with probability above a threshold
#
# Usage: perl <this script> --in <phrase table> --out <filtered table> --min <threshold>
#
use strict;
use warnings;
use Getopt::Long;

my $min;
my $phrase_table;
my $filtered_table;

GetOptions(
    'min=f' => \$min,
    'out=s' => \$filtered_table,
    'in=s'  => \$phrase_table);

die "ERROR: must give threshold, phrase table input file and output file\n"
    unless ($min && $phrase_table && $filtered_table);
die "ERROR: file $phrase_table does not exist\n" unless (-e $phrase_table);

open(PHRASETABLE, '<', $phrase_table)
    or die "FATAL: Could not open phrase table $phrase_table\n";
open(FILTEREDTABLE, '>', $filtered_table)
    or die "FATAL: Could not open output file $filtered_table\n";

while (my $line = <PHRASETABLE>) {
    chomp $line;
    my @columns = split(/\|\|\|/, $line);
    # check that file is a well formatted phrase table
    if (scalar @columns < 4) {
        die "ERROR: input file is not a well formatted phrase table. "
          . "A phrase table must have at least four columns, each column separated by |||\n";
    }
    # keep the phrase pair only if its score is at or above the threshold
    my @scores = split /\s+/, $columns[2];
    if ($scores[3] >= $min) {
        print FILTEREDTABLE $line . "\n";
    }
}
close(PHRASETABLE);
close(FILTEREDTABLE);

From: Matt Post p...@cs.jhu.edu Sent: Wednesday, June 17, 2015 5:25 PM To: Read, James C Cc: Marcin Junczys-Dowmunt; moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses I think you are misunderstanding how decoding works. The highest-weighted translation of each source phrase is not necessarily the one with the best BLEU score. This is why the decoder retains many options, so that it can search among them (together with their reorderings). The LM is an important component in making these selections. Also, how did you weight the many probabilities attached to each phrase (to determine which was the most probable)? The tuning phase of decoding selects weights designed to optimize BLEU score. If you weighted them evenly, that is going to exacerbate this experiment.
matt On Jun 17, 2015, at 10:22 AM, Read, James C jcr...@essex.ac.uk wrote: All I did was break the link to the language model and then perform filtering. How is that a methodological mistake? How else would one test the efficacy of the TM in isolation? I remain convinced that this is undesirable behaviour and therefore a bug. James From: Marcin Junczys-Dowmunt junc...@amu.edu.pl Sent: Wednesday, June 17, 2015 5:12 PM To: Read, James C Cc: Arnold, Doug; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Hi James No, not at all. I would say that is expected behaviour. It's how search spaces and optimization work. If anything these are methodological mistakes on your side, sorry. You are doing weird things
Re: [Moses-support] Major bug found in Moses
I'm gonna try once more. This is what he said: the decoder's job is NOT to find the high quality translation The next time I have a panel of potential investors in front of me I'm gonna pass that line by them and see how it goes down. I stress the words HIGH QUALITY TRANSLATION. Please promise me that the next time you put in a bid for funding you will guarantee your prospective funders that under no circumstances will you attempt to design a system which searches for HIGH QUALITY TRANSLATION. James From: Matthias Huck mh...@inf.ed.ac.uk Sent: Friday, June 19, 2015 5:08 PM To: Read, James C Cc: Hieu Hoang; moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses Hi James, Yes, he just said that. The decoder's job is to find the hypothesis with the maximum model score. That's one reason why your work is flawed. You did not care at all whether your model score correlates with BLEU or not. Cheers, Matthias On Fri, 2015-06-19 at 13:24 +, Read, James C wrote: I quote: the decoder's job is NOT to find the high quality translation Did you REALLY just say that? James From: Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 9:00 PM To: Read, James C Cc: Kenneth Heafield; moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses the decoder's job is NOT to find the high quality translation (as measured by BLEU). Its job is to find translations with high model score. You need the tuning to make sure high quality translation correlates with high model score. If you don't tune, it's pot luck what quality you get.
You should tune with the features you use Hieu Hoang Researcher New York University, Abu Dhabi http://www.hoang.co.uk/hieu On 17 June 2015 at 21:52, Read, James C jcr...@essex.ac.uk wrote: The analogy doesn't seem to be helping me understand just how exactly it is a desirable quality of a TM to a) completely break down if no LM is used (thank you for showing that such is not always the case) b) be dependent on a tuning step to help it find the higher scoring translations What you seem to be essentially saying is that the TM cannot find the higher scoring translations because I didn't pretune the system to do so. And I am supposed to accept that such is a desirable quality of a system whose very job is to find the higher scoring translations. Further, I am still unclear on which features you require a system to be tuned. At the very least it seems that I have discovered the selection process that tuning seems to be making up for in some unspecified and altogether opaque way. James From: Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 8:34 PM To: Read, James C; Kenneth Heafield; moses-support@mit.edu Cc: Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses 4 BLEU is nothing to sniff at :) I was answering Ken's tangential assertion that LMs are needed for tuning. I have some sympathy for you. You're looking at ways to improve translation by reducing the search space. I've bashed my head against this wall for a while as well without much success. However, as everyone is telling you, you haven't understood the role of tuning. Without tuning, you're pointing your lab rat to some random part of the search space, instead of away from the furry animal with whiskers and towards the yellow cheesy thing On 17/06/2015 20:45, Read, James C wrote: Doesn't look like the LM is contributing all that much then, does it?
James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 7:35 PM To: Kenneth Heafield; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On 17/06/2015 20:13, Kenneth Heafield wrote: I'll bite. The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large
Re: [Moses-support] Major bug found in Moses
You are not interested in discovering which phrase pairs contributed most to increases in BLEU scores so that we can bypass an ineffective search algorithm and construct a reliable phrase pair based rule based system with lower computational cost and higher likelihood of better results? I would like to see you stare investors in the face and make that claim. And manage to keep a straight face. James From: Lane Schwartz dowob...@gmail.com Sent: Wednesday, June 17, 2015 9:11 PM To: Read, James C Cc: Kenneth Heafield; moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses James, The underlying questions that you appear to be posing are these: When the search space is simplified by decoding without a language model, to what extent is the decoder able to identify hypotheses that have the best model score? Second, does filtering the phrase table in a particular way change the answer to this question? Third, how is the BLEU score (or any other metric) affected by these questions? These are valid questions. Unfortunately, as Kenneth, Amittai, and Hieu have pointed out, the experiment that you have designed does not provide you with all of what you need to be able to answer these questions. Recall that we don't really deal with probabilities when decoding. Yes, some of our features are trained as probability models. But the decoder searches using a weighted combination of scores. Lots of them. Even the phrase table is comprised of (at least) four distinct scores (phrase translation score and lexical translation score, in both directions). Decoding is a search problem. Specifically, it is a search through all possible translations to attempt to identify the one with the highest score according to this weighted combination of component scores. There are two problems then, that we have to deal with: First is this. 
Even if all we care about is the ultimate weighted combination of component scores, the search space is so vast (it's NP-complete) that we cannot hope to exhaustively search through it in a reasonable amount of time, even for sentences that are only of moderate length. This means that we have to resort to pruning. Second is this. We don't really care about finding solutions that are optimal according to the weighted combination of component scores. We care about getting translations that are fluent and mean the same thing as the original sentence. Since we don't know how to measure adequacy and fluency automatically, we resort to imperfect metrics that can be calculated automatically, like BLEU. This is fine, but it makes the search problem (which was already intractably large) even worse. The decoder only knows how to search by finding solutions that are good according to the weighted combination of component scores. If we want translations that are good according to some metric (like BLEU), then we need to attempt to formulate the weights such that solutions that are good according to the weighted combination of component scores are also good according to the desired metric (BLEU). The mechanism by which this is performed is tuning. Your decoder, by necessity, is operating using pruning. As such, your decoder is only operating in a confined region of the overall search space. The question then is, what region of the search space would you prefer to have your decoder operate in? If you choose not to run tuning, then you are choosing to have your decoder operate in an arbitrary region of the search space. If you choose to run tuning, then you are choosing to have your decoder operate in a region of the search space that you have reason to believe contains good translations according to your metric. Another way to think about this is as follows.
If you choose not to run tuning, and you obtain translations that are good according to the metric (BLEU), this is great, but it doesn't tell you much. If you obtain translations that are bad according to the metric, this is to be expected. What your experiments have shown is this: The complexity of the search space is greater when you use all available phrase pairs than it is when you pre-select only the best phrase pairs. When you choose to not tune and not use an LM, and then decode in the simpler space, you get better BLEU scores than when you decode in the more complex space. This is not a surprising result. It is in fact the expected result. Why is this the expected result? Two reasons. First, because search involves pruning. If you simplify the search space (by allowing the decoder to search using only the best phrase pairs), then it becomes easier for the decoder to find translations that are closer to optimal according to the weighted combination of scores, simply because the decoder is searching through a much smaller (and higher quality) sub-region of the search space. Second, because by choosing not to tune
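Lane's description of decoding as search over a weighted combination of component scores, narrowed by pruning, can be sketched in a few lines of Python. All numbers below are invented; the weights echo the untuned defaults quoted earlier in the thread, and a "beam" of size 2 stands in for the decoder's pruning:

```python
# Model score = weighted combination (dot product) of component scores.
def combined_score(weights, features):
    return sum(w * f for w, f in zip(weights, features))

weights = [0.2, 0.2, 0.2, 0.2, -1.0]  # four TM features + word penalty
hypotheses = {
    "hyp_a": [-0.9, -1.2, -0.7, -1.0, -4],  # short, probable
    "hyp_b": [-6.9, -0.1, -5.3, -0.2, -9],  # long, improbable
    "hyp_c": [-2.3, -2.3, -2.1, -1.9, -5],
}

# Pruning: keep only the top-2 hypotheses by model score.
beam = sorted(hypotheses,
              key=lambda h: combined_score(weights, hypotheses[h]),
              reverse=True)[:2]
print(beam)  # ['hyp_b', 'hyp_c']
```

Note that under these untuned weights the long, improbable hypothesis tops the beam; tuning exists precisely to move the weights so that high model score tracks translation quality instead.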
Re: [Moses-support] Major bug found in Moses
So we've gone from 1) Acknowledging that the search algorithm performs poorly with no LM, tuning or pruning despite the fact that the search space clearly contains high quality translations 2) to a public display of en-masse reluctance to acknowledge that such is an undesirable quality of the system 3) to resorting to censorship not only in the literature but also on a public mailing list rather than acknowledge point 2. And your conclusion is that after being a witness to such behaviour I would still have a desire to contribute to this field?!? Why YES. I would love to keep banging my head against a brick wall. I have no other preferred pastimes. James From: Lane Schwartz dowob...@gmail.com Sent: Friday, June 19, 2015 5:04 PM To: Read, James C Cc: Philipp Koehn; Burger, John D.; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses James, You may see the techniques that exist as outdated, wrong-headed, and inefficient. You have the right to hold that opinion. It may even be that history proves you right. Progress in science is made by people posing questions - often questions that challenge the status quo - and then doing experiments to answer those questions. However, it is incumbent upon you, the proponent of a new idea, to design good experiments to attempt to prove or disprove your new hypothesis. Dispassionately showing the relative merits and shortcomings of your technique against the existing state of the art is part of that process. I, along with numerous other people on this list, have attempted in good faith to answer your questions, and to provide you with our perspective based on our collective understanding of the problem. You, in turn, have responded belligerently. I suggest that you have a frank conversation with your academic advisor or other appropriate mentor regarding your future.
If you intend to pursue a successful career in science, academia, government, or industry, you would do well to reconsider the manner in which you interact with other people, especially people with whom you disagree. In the meantime, I would respectfully request that, until you learn how to interact respectfully with other adults, you refrain from posting to this mailing list. Sincerely, Lane Schwartz
Re: [Moses-support] Major bug found in Moses
German joke: Ein Autofahrer hört im Radio die Durchsage: Achtung! Achtung! Auf der N9 kommt Ihnen ein Geisterfahrer entgegen. Fahren Sie bitte ganz rechts und überholen Sie nicht! Der Autofahrer: Was heißt hier einer? Dutzende! Dutzende! (In English: A driver hears an announcement on the radio: Attention! Attention! A wrong-way driver is coming towards you on the N9. Please keep far right and do not overtake! The driver: What do you mean, one? Dozens! Dozens!) On 2015-06-19 16:12, Read, James C wrote: So we've gone from 1) Acknowledging that the search algorithm performs poorly with no LM, tuning or pruning despite the fact the search space clearly contains high quality translations 2) to a public display of en-masse reluctance to acknowledge that such is an undesirable quality of the system 3) to resorting to censorship not only in the literature but also on a public mailing list rather than acknowledge point 2. And your conclusion is that after being a witness to such behaviour I would still have a desire to contribute to this field?!? Why YES. I would love to keep banging my head against a brick wall. I have no other preferred pastimes. James
Re: [Moses-support] Major bug found in Moses
speaking of cobbling together a good translation from imperfect parts: google: A motorist heard on the radio the announcement: Caution Caution On the N9 you will encounter a ghost driver Please drive far right and do not overtake!.! The driver: What do you mean a dozens dozens?! microsoft: A motorist hears the announcement on the radio: 'warning! Caution! On the N9, a (s) satisfies you. Go quite right and not overtake! The car driver: what do you mean one? Dozens! Dozens! :) ~amittai On 6/19/15 10:19, Marcin Junczys-Dowmunt wrote: German joke: Ein Autofahrer hört im Radio die Durchsage: Achtung! Achtung! Auf der N9 kommt Ihnen ein Geisterfahrer entgegen. Fahren Sie bitte ganz rechts und überholen Sie nicht! Der Autofahrer: Was heißt hier einer? Dutzende! Dutzende! Wdniu 2015-06-19 16:12, Read, James C napisał(a): So we've gone from 1) Acknowledging that the search algorithm performs poorly with no LM, tuning or pruning despite the fact the search space clearly contains high quality translations 2) to a public display of en-masse reluctance to acknowledge that such is an undesirable quality of the system 3) to resorting to censorship not only in the literature but also on a public mailing list rather than acknowledge point 2. And your conclusion is that after being a witness to such behaviour I would still have a desire to contribute to this field?!? Why YES. I would love to keep banging my head against a brick wall. I have no other preferred past times. James *From:* Lane Schwartz dowob...@gmail.com *Sent:* Friday, June 19, 2015 5:04 PM *To:* Read, James C *Cc:* Philipp Koehn; Burger, John D.; moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses James, You may see the techniques that exist as outdated, wrong-headed, and inefficient. You have the right to hold that opinion. It may even be that history proves you right. 
Progress in science is made by people posing questions - often questions that challenge the status quo - and then doing experiments to answer those questions. However, it is incumbent upon you, the proponent of a new idea, to design good experiments to attempt to prove or disprove your new hypothesis. Dispassionately showing the relative merits and shortcomings of your technique with the existing state of the art is part of that process. I, along with numerous other people on this list, have attempted in good faith to answer your questions, and to provide you with our perspective based on our collective understanding of the problem. You, in turn, have responded belligerently. I suggest that you have a frank conversation with your academic advisor or other appropriate mentor regarding your future. If you intend to pursue a successful career in science, academia, government, or industry, you would do well to reconsider the manner in which you interact with other people, especially people with whom you disagree. In the meantime, I would respectfully request that until you learn how to respectfully interact with other adults that you refrain from posting to this mailing list. Sincerely, Lane Schwartz On Fri, Jun 19, 2015 at 8:45 AM, Read, James C jcr...@essex.ac.uk mailto:jcr...@essex.ac.uk wrote: According to your book which I have on my desk the job of the TM is to model the most likely translations and the job of the decoder is to intelligently search the space of translations to find the most likely one/s (I'm paraphrasing of course). Would you like to retract that position and republish a next edition of your book which openly states that Moses when used with no LM or tuning or pruning can and should be expected to perform very poorly and select only the least likely translations? 
Don't you in the slightest find it worrying that at least 90% of your code base could be thrown out of the window and high scoring results can be obtained with a simple phrase-pair-based, rule-based system? Which would you prefer? Would you prefer to consume computational resources calculating probabilities or get straight to the answer with simple logic and low computational requirements? BE HONEST! James *From:* moses-support-boun...@mit.edu mailto:moses-support-boun...@mit.edu moses-support-boun...@mit.edu mailto:moses-support-boun...@mit.edu on behalf of Philipp Koehn p...@jhu.edu mailto:p...@jhu.edu *Sent:* Thursday, June 18, 2015 9:39 PM *To:* Burger, John D. *Cc:* moses-support@mit.edu mailto:moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses Hi, I am a great fan of open source software, but there is a danger to view its inner workings as a black box - which leads to the strange
Re: [Moses-support] Major bug found in Moses
* [i'm] the guy that says. Well here's a stroke of genius. * a public display of en-masse reluctance to acknowledge that such is an undesirable quality of the system ? * resorting to censorship not only in the literature but also on a public mailing list rather than acknowledge point 2 ? heh -- i was right the first time: On 6/17/15 13:20, amittai axelrod wrote: also, your argument could be easily mis-interpreted as this behavior is unexpected to me, ergo this is unexpected behavior, and that will unfortunately bias the listener against you, as that is the preferred argument structure of conspiracy theorists. see also: https://en.wikipedia.org/wiki/Crank_(person)#Common_characteristics_of_cranks if you're ever at a conference, say hi. until then, well, you do you. ~amittai On 6/19/15 10:12, Read, James C wrote: So we've gone from 1) Acknowledging that the search algorithm performs poorly with no LM, tuning or pruning despite the fact the search space clearly contains high quality translations 2) to a public display of en-masse reluctance to acknowledge that such is an undesirable quality of the system 3) to resorting to censorship not only in the literature but also on a public mailing list rather than acknowledge point 2. And your conclusion is that after being a witness to such behaviour I would still have a desire to contribute to this field?!? Why YES. I would love to keep banging my head against a brick wall. I have no other preferred past times. James *From:* Lane Schwartz dowob...@gmail.com *Sent:* Friday, June 19, 2015 5:04 PM *To:* Read, James C *Cc:* Philipp Koehn; Burger, John D.; moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses James, You may see the techniques that exist as outdated, wrong-headed, and inefficient. You have the right to hold that opinion. It may even be that history proves you right. 
Progress in science is made by people posing questions - often questions that challenge the status quo - and then doing experiments to answer those questions. However, it is incumbent upon you, the proponent of a new idea, to design good experiments to attempt to prove or disprove your new hypothesis. Dispassionately showing the relative merits and shortcomings of your technique with the existing state of the art is part of that process. I, along with numerous other people on this list, have attempted in good faith to answer your questions, and to provide you with our perspective based on our collective understanding of the problem. You, in turn, have responded belligerently. I suggest that you have a frank conversation with your academic advisor or other appropriate mentor regarding your future. If you intend to pursue a successful career in science, academia, government, or industry, you would do well to reconsider the manner in which you interact with other people, especially people with whom you disagree. In the meantime, I would respectfully request that until you learn how to respectfully interact with other adults that you refrain from posting to this mailing list. Sincerely, Lane Schwartz On Fri, Jun 19, 2015 at 8:45 AM, Read, James C jcr...@essex.ac.uk mailto:jcr...@essex.ac.uk wrote: According to your book which I have on my desk the job of the TM is to model the most likely translations and the job of the decoder is to intelligently search the space of translations to find the most likely one/s (I'm paraphrasing of course). Would you like to retract that position and republish a next edition of your book which openly states that Moses when used with no LM or tuning or pruning can and should be expected to perform very poorly and select only the least likely translations? 
Don't you in the slightest find it worrying that at least 90% of your code base could be thrown out of the window and high scoring results can be obtained with a simple phrase-pair-based, rule-based system? Which would you prefer? Would you prefer to consume computational resources calculating probabilities or get straight to the answer with simple logic and low computational requirements? BE HONEST! James *From:* moses-support-boun...@mit.edu mailto:moses-support-boun...@mit.edu moses-support-boun...@mit.edu mailto:moses-support-boun...@mit.edu on behalf of Philipp Koehn p...@jhu.edu mailto:p...@jhu.edu *Sent:* Thursday, June 18, 2015 9:39 PM *To:* Burger, John D. *Cc:* moses-support@mit.edu mailto:moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses Hi, I am a great fan of open source software, but there is a danger to view its inner workings as a black box - which leads
Re: [Moses-support] Major bug found in Moses
James, 1) Acknowledging that the search algorithm performs poorly with no LM, tuning or pruning despite the fact the search space clearly contains high quality translations Yes. We all acknowledge this. If you have a better technique, that's great. Show that it's better. Your paper does not do so. 2) to a public display of en-masse reluctance to acknowledge that such is an undesirable quality of the system Yes, this is undesirable. If you have a better technique, that's great. Show that it's better. Your paper does not do so. 3) to resorting to censorship not only in the literature but also on a public mailing list rather than acknowledge point 2. No one is trying to censor you in the literature. You wrote a paper that got rejected. Lots of papers get rejected. Lots of GOOD papers get rejected. The fact that yours got rejected does not mean that you're being censored. No one is trying to censor you on this list. We are simply requesting that you conduct yourself like a well-mannered adult engaged in scientific research. By the way, your frequent mentions of investors are very much a non sequitur. You may be looking for investors, and that's fine if you are. You may want to keep in mind that not everyone is. Many of us are interested in this as a field of scientific enquiry. Lane ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
That's actually presentation-worthy material :) I have to note that down somewhere. On 2015-06-19 16:23, amittai axelrod wrote: speaking of cobbling together a good translation from imperfect parts: google: A motorist heard on the radio the announcement: Caution Caution On the N9 you will encounter a ghost driver Please drive far right and do not overtake!.! The driver: What do you mean a dozens dozens?! microsoft: A motorist hears the announcement on the radio: 'warning! Caution! On the N9, a (s) satisfies you. Go quite right and not overtake! The car driver: what do you mean one? Dozens! Dozens! :) ~amittai On 6/19/15 10:19, Marcin Junczys-Dowmunt wrote: German joke: Ein Autofahrer hört im Radio die Durchsage: Achtung! Achtung! Auf der N9 kommt Ihnen ein Geisterfahrer entgegen. Fahren Sie bitte ganz rechts und überholen Sie nicht! Der Autofahrer: Was heißt hier einer? Dutzende! Dutzende! On 2015-06-19 16:12, Read, James C wrote: ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
If I'm ever at a conference I'll come and introduce myself right after you present to all those present that: 1) A well designed search algorithm should select low quality translations despite the fact that the search space contains much higher quality translations. I can't deal with this level of denial. James From: amittai axelrod amit...@umiacs.umd.edu Sent: Friday, June 19, 2015 5:33 PM To: Read, James C; Lane Schwartz Cc: moses-support@mit.edu; Philipp Koehn Subject: Re: [Moses-support] Major bug found in Moses * [i'm] the guy that says. Well here's a stroke of genius. * a public display of en-masse reluctance to acknowledge that such is an undesirable quality of the system ? * resorting to censorship not only in the literature but also on a public mailing list rather than acknowledge point 2 ? heh -- i was right the first time: On 6/17/15 13:20, amittai axelrod wrote: also, your argument could be easily mis-interpreted as this behavior is unexpected to me, ergo this is unexpected behavior, and that will unfortunately bias the listener against you, as that is the preferred argument structure of conspiracy theorists. see also: https://en.wikipedia.org/wiki/Crank_(person)#Common_characteristics_of_cranks if you're ever at a conference, say hi. until then, well, you do you. ~amittai On 6/19/15 10:12, Read, James C wrote: So we've gone from 1) Acknowledging that the search algorithm performs poorly with no LM, tuning or pruning despite the fact the search space clearly contains high quality translations 2) to a public display of en-masse reluctance to acknowledge that such is an undesirable quality of the system 3) to resorting to censorship not only in the literature but also on a public mailing list rather than acknowledge point 2. And your conclusion is that after being a witness to such behaviour I would still have a desire to contribute to this field?!? Why YES. I would love to keep banging my head against a brick wall. I have no other preferred past times. 
James *From:* Lane Schwartz dowob...@gmail.com *Sent:* Friday, June 19, 2015 5:04 PM *To:* Read, James C *Cc:* Philipp Koehn; Burger, John D.; moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses James, You may see the techniques that exist as outdated, wrong-headed, and inefficient. You have the right to hold that opinion. It may even be that history proves you right. Progress in science is made by people posing questions - often questions that challenge the status quo - and then doing experiments to answer those questions. However, it is incumbent upon you, the proponent of a new idea, to design good experiments to attempt to prove or disprove your new hypothesis. Dispassionately showing the relative merits and shortcomings of your technique with the existing state of the art is part of that process. I, along with numerous other people on this list, have attempted in good faith to answer your questions, and to provide you with our perspective based on our collective understanding of the problem. You, in turn, have responded belligerently. I suggest that you have a frank conversation with your academic advisor or other appropriate mentor regarding your future. If you intend to pursue a successful career in science, academia, government, or industry, you would do well to reconsider the manner in which you interact with other people, especially people with whom you disagree. In the meantime, I would respectfully request that until you learn how to respectfully interact with other adults that you refrain from posting to this mailing list. Sincerely, Lane Schwartz On Fri, Jun 19, 2015 at 8:45 AM, Read, James C jcr...@essex.ac.uk mailto:jcr...@essex.ac.uk wrote: According to your book which I have on my desk the job of the TM is to model the most likely translations and the job of the decoder is to intelligently search the space of translations to find the most likely one/s (I'm paraphrasing of course). 
Would you like to retract that position and republish a next edition of your book which openly states that Moses when used with no LM or tuning or pruning can and should be expected to perform very poorly and select only the least likely translations? Don't you in the slightest find it worrying that at least 90% of your code base could be thrown out of the window and high scoring results can be obtained with a simple phrase-pair-based, rule-based system? Which would you prefer? Would you prefer to consume computational resources calculating probabilities or get straight to the answer with simple logic and low computational requirements? BE HONEST! James
Re: [Moses-support] Major bug found in Moses
On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.uk wrote: What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. That is the expected behavior. What I have tried to tell you, and what other people have tried to tell you, is that translations with high model scores are not necessarily good translations. We all want our models to be such that high model scores correspond to good translations, and that low model scores correspond with bad translations. But unfortunately, our models do not innately have this characteristic. We all know this. We also know a good way to deal with this shortcoming, namely tuning. Tuning is the process by which we attempt to ensure that high model scores correspond to high quality translations, and that low model scores correspond to low quality translations. If you can design models that naturally correspond with translation quality without tuning, that's great. If you can do that, you've got a great shot at winning a Best Paper award at ACL. In the meantime, you may want to consider an apology for your rude behavior and unprofessional attitude. Goodbye. Lane ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
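Lane's description of tuning can be made concrete with a toy linear model. The sketch below is purely illustrative (the candidate strings, feature values, and weights are invented, not Moses internals): the decoder returns the hypothesis with the highest weighted feature sum, so the same two candidates can swap ranks depending on whether the weights have been tuned toward translation quality.

```python
def model_score(features, weights):
    """Linear model score: dot product of feature values and weights."""
    return sum(w * f for w, f in zip(weights, features))

# Two candidate translations with hypothetical log-feature values:
# [log p(f|e), log p(e|f), word penalty]
candidates = {
    "good translation": [-2.0, -1.0, -5.0],
    "bad translation":  [-0.5, -4.0, -2.0],
}

untuned = [1.0, 1.0, 1.0]   # default: every feature weighted equally
tuned   = [0.2, 1.0, 0.1]   # e.g. weights found by tuning against BLEU

# The decoder's argmax is the same in both cases; only the objective changes.
best_untuned = max(candidates, key=lambda c: model_score(candidates[c], untuned))
best_tuned   = max(candidates, key=lambda c: model_score(candidates[c], tuned))
```

With the untuned weights the "bad" candidate scores higher (-6.5 vs -8.0), while the tuned weights rank the "good" one first (-1.9 vs -4.3). That is the whole point being made in this thread: an untuned system can confidently return a poor translation because the search works correctly on an objective that is not yet aligned with quality.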
Re: [Moses-support] Major bug found in Moses
I did not claim that the paper does so. The weakness has been exposed. And the way it was exposed suggests that certain classes of phrase pairs contribute more to BLEU scores than others. We now have an empirical basis for exploring new avenues that exploit this observation. I have no problem with papers being rejected. Clearly only a certain number can be published in any particular setting. What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. I am happy that you seem to be the first person to acknowledge that this is undesirable behaviour. I feel that we are finally making some progress. Now if more people could acknowledge that there is a problem perhaps we could set about improving the situation. James From: Lane Schwartz dowob...@gmail.com Sent: Friday, June 19, 2015 6:10 PM To: Read, James C Cc: Philipp Koehn; Burger, John D.; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses James, 1) Acknowledging that the search algorithm performs poorly with no LM, tuning or pruning despite the fact the search space clearly contains high quality translations Yes. We all acknowledge this. If you have a better technique, that's great. Show that it's better. Your paper does not do so. 2) to a public display of en-masse reluctance to acknowledge that such is an undesirable quality of the system Yes, this is undesirable. If you have a better technique, that's great. Show that it's better. Your paper does not do so. 3) to resorting to censorship not only in the literature but also on a public mailing list rather than acknowledge point 2. No one is trying to censor you in the literature. You wrote a paper that got rejected. Lots of papers get rejected. Lots of GOOD papers get rejected. The fact that yours got rejected does not mean that you're being censored. No one is trying to censor you on this list. 
We are simply requesting that you conduct yourself like a well-mannered adult engaged in scientific research. By the way, your frequent mentions of investors are very much a non sequitur. You may be looking for investors, and that's fine if you are. You may want to keep in mind that not everyone is. Many of us are interested in this as a field of scientific enquiry. Lane ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? You're telling me that redesigning the search algorithm to prefer higher scoring phrase pairs is all we need to do to get a best paper at ACL? James From: Lane Schwartz dowob...@gmail.com Sent: Friday, June 19, 2015 7:40 PM To: Read, James C Cc: Philipp Koehn; Burger, John D.; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.ukmailto:jcr...@essex.ac.uk wrote: What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. That is the expected behavior. What I have tried to tell you, and what other people have tried to tell you, is that translations with high model scores are not necessarily good translations. We all want our models to be such that high model scores correspond to good translations, and that low model scores correspond with bad translations. But unfortunately, our models do not innately have this characteristic. We all know this. We also know a good way to deal with this shortcoming, namely tuning. Tuning is the process by which we attempt to ensure that high model scores correspond to high quality translations, and that low model scores correspond to low quality translations. If you can design models that naturally correspond with translation quality without tuning, that's great. If you can do that, you've got a great shot at winning a Best Paper award at ACL. In the meantime, you may want to consider an apology for your rude behavior and unprofessional attitude. Goodbye. 
Lane ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
if we don't understand the problem, how can we possibly fix it? all the relevant code is open source. go for it! ~amittai On 6/19/15 12:49, Read, James C wrote: So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? You're telling me that redesigning the search algorithm to prefer higher scoring phrase pairs is all we need to do to get a best paper at ACL? James *From:* Lane Schwartz dowob...@gmail.com *Sent:* Friday, June 19, 2015 7:40 PM *To:* Read, James C *Cc:* Philipp Koehn; Burger, John D.; moses-support@mit.edu *Subject:* Re: [Moses-support] Major bug found in Moses On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.uk mailto:jcr...@essex.ac.uk wrote: What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. That is the expected behavior. What I have tried to tell you, and what other people have tried to tell you, is that translations with high model scores are not necessarily good translations. We all want our models to be such that high model scores correspond to good translations, and that low model scores correspond with bad translations. But unfortunately, our models do not innately have this characteristic. We all know this. We also know a good way to deal with this shortcoming, namely tuning. Tuning is the process by which we attempt to ensure that high model scores correspond to high quality translations, and that low model scores correspond to low quality translations. If you can design models that naturally correspond with translation quality without tuning, that's great. If you can do that, you've got a great shot at winning a Best Paper award at ACL. 
In the meantime, you may want to consider an apology for your rude behavior and unprofessional attitude. Goodbye. Lane ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
P.S. Have a good weekend everybody. Be back in action in a couple of days. James From: Read, James C Sent: Friday, June 19, 2015 7:49 PM To: Lane Schwartz Cc: Philipp Koehn; Burger, John D.; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? You're telling me that redesigning the search algorithm to prefer higher scoring phrase pairs is all we need to do to get a best paper at ACL? James From: Lane Schwartz dowob...@gmail.com Sent: Friday, June 19, 2015 7:40 PM To: Read, James C Cc: Philipp Koehn; Burger, John D.; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.ukmailto:jcr...@essex.ac.uk wrote: What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. That is the expected behavior. What I have tried to tell you, and what other people have tried to tell you, is that translations with high model scores are not necessarily good translations. We all want our models to be such that high model scores correspond to good translations, and that low model scores correspond with bad translations. But unfortunately, our models do not innately have this characteristic. We all know this. We also know a good way to deal with this shortcoming, namely tuning. Tuning is the process by which we attempt to ensure that high model scores correspond to high quality translations, and that low model scores correspond to low quality translations. If you can design models that naturally correspond with translation quality without tuning, that's great. 
If you can do that, you've got a great shot at winning a Best Paper award at ACL. In the meantime, you may want to consider an apology for your rude behavior and unprofessional attitude. Goodbye. Lane ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Hi Rico, since you are at it, some pointers to the more advanced pruning techniques that do perform better, please :) On 19.06.2015 19:25, Rico Sennrich wrote: [sorry for the garbled message before] you are right. The idea is pretty obvious. It roughly corresponds to 'Histogram pruning' in this paper: Zens, R., Stanton, D., Xu, P. (2012). A Systematic Comparison of Phrase Table Pruning Techniques. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 972-983. The idea has been described in the literature before that (for instance, Johnson et al. (2007) only use the top 30 phrase pairs per source phrase), and may have been used in practice for even longer. If you read the paper above, you will find that histogram pruning does not improve translation quality on a state-of-the-art SMT system, and performs poorly compared to more advanced pruning techniques. On 19.06.2015 17:49, Read, James C. wrote: So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Marcin Junczys-Dowmunt junczys@... writes: Hi Rico, since you are at it, some pointers to the more advanced pruning techniques that do perform better, please :) On 19.06.2015 19:25, Rico Sennrich wrote: [sorry for the garbled message before] you are right. The idea is pretty obvious. It roughly corresponds to 'Histogram pruning' in this paper: Zens, R., Stanton, D., Xu, P. (2012). A Systematic Comparison of Phrase Table Pruning Techniques. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 972-983. The idea has been described in the literature before that (for instance, Johnson et al. (2007) only use the top 30 phrase pairs per source phrase), and may have been used in practice for even longer. If you read the paper above, you will find that histogram pruning does not improve translation quality on a state-of-the-art SMT system, and performs poorly compared to more advanced pruning techniques. the Zens et al. (2012) paper has a nice overview. significance pruning and relative entropy pruning are both effective - you are not guaranteed improvements over the unpruned system (although Johnson (2007) does report improvements), but both allow you to reduce the size of your models substantially with little loss in quality. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
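The histogram pruning being discussed (keep only the top-k phrase pairs per source phrase, e.g. Johnson et al.'s top-30 variant) can be sketched as a standalone filter. The snippet below is an illustration, not the actual Moses or Zens et al. tooling: it assumes a Moses-style phrase table line format `src ||| tgt ||| s1 s2 s3 s4` and ranks candidates by one chosen score column (here index 2, conventionally the direct phrase probability p(e|f); adjust for your table layout).

```python
from collections import defaultdict

def histogram_prune(lines, k=30, score_field=2):
    """Keep only the top-k target phrases for each source phrase.

    lines: Moses-style phrase table entries 'src ||| tgt ||| s1 s2 s3 s4'.
    score_field: index into the score column used for ranking.
    """
    table = defaultdict(list)
    for line in lines:
        src, tgt, scores = [f.strip() for f in line.split("|||")[:3]]
        score = float(scores.split()[score_field])
        table[src].append((score, tgt, line))
    pruned = []
    for src, pairs in table.items():
        # Sort candidates for this source phrase by score, best first.
        pairs.sort(key=lambda p: p[0], reverse=True)
        pruned.extend(line for _, _, line in pairs[:k])
    return pruned

# Toy table: three candidates for one source phrase (invented scores).
table = [
    "das Haus ||| the house ||| 0.1 0.1 0.7 0.6",
    "das Haus ||| house ||| 0.2 0.2 0.2 0.3",
    "das Haus ||| the building ||| 0.3 0.3 0.1 0.1",
]
kept = histogram_prune(table, k=2)  # drops the lowest-scoring pair
```

As Rico notes, this kind of pruning mainly buys a smaller model; on a tuned state-of-the-art system it does not by itself improve quality, which is why significance or relative entropy pruning are preferred.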
Re: [Moses-support] Major bug found in Moses
On that interesting idea that Moses should be naturally good at translating things, just for general considerations. Since some said this thread has educational value I would like to share something that might not be obvious due to the SMT-biased posts here. Moses is also the _leading_ tool for automatic grammatical error correction (GEC) right now. The first- and third-placed systems of the CoNLL 2014 shared task were based on Moses. By now I have results that surpass the CoNLL results by far by adding some specialized features to Moses (which thanks to Hieu is very easy). It even gets good results for GEC when you do crazy things like inverting the TM (so it should actually make the input worse) provided you tune on the correct metric and for the correct task. The interaction of all the other features after tuning makes that possible. So, if anything, Moses is just a very flexible text-rewriting tool. Tuning (and data) turns it into a translator, GEC tool, POS-tagger, Chunker, Semantic Tagger etc. On 19.06.2015 18:40, Lane Schwartz wrote: On Fri, Jun 19, 2015 at 11:28 AM, Read, James C jcr...@essex.ac.uk mailto:jcr...@essex.ac.uk wrote: What I take issue with is the en-masse denial that there is a problem with the system if it behaves in such a way with no LM + no pruning and/or tuning. There is no mass denial taking place. Regardless of whether or not you tune, the decoder will do its best to find translations with the highest model score. That is the expected behavior. What I have tried to tell you, and what other people have tried to tell you, is that translations with high model scores are not necessarily good translations. We all want our models to be such that high model scores correspond to good translations, and that low model scores correspond with bad translations. But unfortunately, our models do not innately have this characteristic. We all know this. We also know a good way to deal with this shortcoming, namely tuning. 
Tuning is the process by which we attempt to ensure that high model scores correspond to high quality translations, and that low model scores correspond to low quality translations. If you can design models that naturally correspond with translation quality without tuning, that's great. If you can do that, you've got a great shot at winning a Best Paper award at ACL. In the meantime, you may want to consider an apology for your rude behavior and unprofessional attitude. Goodbye. Lane ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Read, James C jcread@... writes: So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? you are right. The idea is pretty obvious. It roughly corresponds to 'Histogram pruning' in this paper: Zens, R., Stanton, D., Xu, P. (2012). A Systematic Comparison of Phrase Table Pruning Techniques. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 972-983. The idea has been described in the literature before that (for instance, Johnson et al. (2007) only use the top 30 phrase pairs per source phrase), and may have been used in practice for even longer. If you read the paper above, you will find that histogram pruning does not improve translation quality on a state-of-the-art SMT system, and performs poorly compared to more advanced pruning techniques. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
[sorry for the garbled message before] you are right. The idea is pretty obvious. It roughly corresponds to 'Histogram pruning' in this paper: Zens, R., Stanton, D., Xu, P. (2012). A Systematic Comparison of Phrase Table Pruning Techniques. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 972-983. The idea has been described in the literature before that (for instance, Johnson et al. (2007) only use the top 30 phrase pairs per source phrase), and may have been used in practice for even longer. If you read the paper above, you will find that histogram pruning does not improve translation quality on a state-of-the-art SMT system, and performs poorly compared to more advanced pruning techniques. On 19.06.2015 17:49, Read, James C. wrote: So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious? ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Ah OK, I misunderstood, I thought you were talking about more advanced pruning techniques compared to the significance method from Johnson et al. while you only referred to the 30-best variant. Cheers, Marcin On 19.06.2015 19:35, Rico Sennrich wrote: Marcin Junczys-Dowmunt junczys@... writes: Hi Rico, since you are at it, some pointers to the more advanced pruning techniques that do perform better, please :) On 19.06.2015 19:25, Rico Sennrich wrote: [sorry for the garbled message before] you are right. The idea is pretty obvious. It roughly corresponds to 'Histogram pruning' in this paper: Zens, R., Stanton, D., Xu, P. (2012). A Systematic Comparison of Phrase Table Pruning Technique. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 972-983. The idea has been described in the literature before that (for instance, Johnson et al. (2007) only use the top 30 phrase pairs per source phrase), and may have been used in practice for even longer. If you read the paper above, you will find that histogram pruning does not improve translation quality on a state-of-the-art SMT system, and performs poorly compared to more advanced pruning techniques. the Zens et al. (2012) paper has a nice overview. significance pruning and relative entropy pruning are both effective - you are not guaranteed improvements over the unpruned system (although Johnson (2007) does report improvements), but both allow you to reduce the size of your models substantially with little loss in quality. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Hi James, Well, it's pretty straightforward: The decoder's job is to find the hypothesis with the maximum model score. That's why everybody builds models which assign high model score to high-quality translations. Unfortunately, you missed this last point in your own work. Cheers, Matthias On Fri, 2015-06-19 at 14:15 +, Read, James C wrote: I'm gonna try once more. This is what he said: the decoder's job is NOT to find the high quality translation The next time I have a panel of potential investors in front of me I'm gonna pass that line by them and see how it goes down. I stress the words HIGH QUALITY TRANSLATION. Please promise me that the next time you put in a bid for funding you will guarantee your prospective funders that under no circumstances will you attempt to design a system which searches for HIGH QUALITY TRANSLATION. James From: Matthias Huck mh...@inf.ed.ac.uk Sent: Friday, June 19, 2015 5:08 PM To: Read, James C Cc: Hieu Hoang; moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses Hi James, Yes, he just said that. The decoder's job is to find the hypothesis with the maximum model score. That's one reason why your work is flawed. You did not care at all whether your model score correlates with BLEU or not. Cheers, Matthias On Fri, 2015-06-19 at 13:24 +, Read, James C wrote: I quote: the decoder's job is NOT to find the high quality translation Did you REALLY just say that? James __ From: Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 9:00 PM To: Read, James C Cc: Kenneth Heafield; moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses the decoder's job is NOT to find the high quality translation (as measured by bleu). Its job is to find translations with high model score. you need the tuning to make sure high quality translation correlates with high model score. If you don't tune, it's pot luck what quality you get.
You should tune with the features you use Hieu Hoang Researcher New York University, Abu Dhabi http://www.hoang.co.uk/hieu On 17 June 2015 at 21:52, Read, James C jcr...@essex.ac.uk wrote: The analogy doesn't seem to be helping me understand just how exactly it is a desirable quality of a TM to a) completely break down if no LM is used (thank you for showing that such is not always the case) b) be dependent on a tuning step to help it find the higher scoring translations What you seem to be essentially saying is that the TM cannot find the higher scoring translations because I didn't pretune the system to do so. And I am supposed to accept that such is a desirable quality of a system whose very job is to find the higher scoring translations. Further, I am still unclear which features you require a system to be tuned on. At the very least it seems that I have discovered the selection process that tuning seems to be making up for in some unspecified and altogether opaque way. James From: Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 8:34 PM To: Read, James C; Kenneth Heafield; moses-support@mit.edu Cc: Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses 4 BLEU is nothing to sniff at :) I was answering Ken's tangent aspersion that LMs are needed for tuning. I have some sympathy for you. You're looking at ways to improve translation by reducing the search space. I've bashed my head against this wall for a while as well without much success. However, as everyone is telling you, you haven't understood the role of tuning. Without tuning, you're pointing your lab rat to some random part of the search space, instead of away from the furry animal with whiskers and towards the yellow cheesy thing On 17/06/2015 20:45, Read, James C wrote: Doesn't look like the LM is contributing all that much then does it?
James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 7:35 PM To: Kenneth Heafield; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On 17/06/2015 20:13, Kenneth Heafield wrote
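The point Matthias and Hieu keep returning to is the log-linear model: the decoder maximizes a weighted sum of log feature values, not translation quality, so with untuned weights its ranking need not correlate with BLEU. A toy illustration with invented feature names and numbers (not Moses's actual API):

```python
import math

def model_score(features, weights):
    """Log-linear model score: sum_i w_i * log f_i(hypothesis)."""
    return sum(weights[name] * math.log(features[name]) for name in features)

# Two hypothetical candidate translations (all values made up).
fluent  = {"p_e_given_f": 0.3, "p_f_given_e": 0.2, "p_lm": 0.10}
garbled = {"p_e_given_f": 0.7, "p_f_given_e": 0.7, "p_lm": 0.02}

# Untuned, uniform default weights vs. hypothetical tuned weights
# that put more mass on the language model feature.
default_weights = {"p_e_given_f": 1.0, "p_f_given_e": 1.0, "p_lm": 1.0}
tuned_weights   = {"p_e_given_f": 0.4, "p_f_given_e": 0.1, "p_lm": 2.0}

# Under uniform weights the candidate with fat TM scores wins despite
# its terrible LM score; the LM-heavy weights flip the ranking. Tuning
# is the search for weights that make such rankings track quality.
```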
Re: [Moses-support] Major bug found in Moses
The analogy doesn't seem to be helping me understand just how exactly it is a desirable quality of a TM to a) completely break down if no LM is used (thank you for showing that such is not always the case) b) be dependent on a tuning step to help it find the higher scoring translations What you seem to be essentially saying is that the TM cannot find the higher scoring translations because I didn't pretune the system to do so. And I am supposed to accept that such is a desirable quality of a system whose very job is to find the higher scoring translations. Further, I am still unclear which features you require a system to be tuned on. At the very least it seems that I have discovered the selection process that tuning seems to be making up for in some unspecified and altogether opaque way. James From: Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 8:34 PM To: Read, James C; Kenneth Heafield; moses-support@mit.edu Cc: Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses 4 BLEU is nothing to sniff at :) I was answering Ken's tangent aspersion that LMs are needed for tuning. I have some sympathy for you. You're looking at ways to improve translation by reducing the search space. I've bashed my head against this wall for a while as well without much success. However, as everyone is telling you, you haven't understood the role of tuning. Without tuning, you're pointing your lab rat to some random part of the search space, instead of away from the furry animal with whiskers and towards the yellow cheesy thing On 17/06/2015 20:45, Read, James C wrote: Doesn't look like the LM is contributing all that much then does it? James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 7:35 PM To: Kenneth Heafield; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On 17/06/2015 20:13, Kenneth Heafield wrote: I'll bite.
The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large space of translations. The filter looks at one feature p(target | source). So now you've constrained the awful untuned model to a slightly better region of the search space. In other words, all you've done is a poor approximation to manually setting the weight to 1.0 on p(target | source) and the rest to 0. The problem isn't that you are running without a language model (though we generally do not care what happens without one). The problem is that you did not tune the feature weights. Moreover, as Marcin is pointing out, I wouldn't necessarily expect tuning to work without an LM. Tuning does work without a LM. The results aren't half bad. fr-en europarl (pb): with LM: 22.84 retuned without LM: 18.33 On 06/17/15 11:56, Read, James C wrote: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 5:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race.
I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. your default model will thus (roughly) implement the following model: p(e|f) = p(e|f)*p(f|e) which is obviously wrong, and will give you poor results. This is not a bug in the code, but a poor choice of models and weights. Standard steps in SMT (like tuning the model weights on a development set, and including a language model) will give you the desired results.
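Rico's point can be made concrete with two invented candidates: with equal default weights on both translation directions, the decoder effectively ranks candidates by the product p(e|f)*p(f|e), which can prefer a different candidate than p(e|f) alone. A sketch with hypothetical numbers:

```python
# Hypothetical candidates for one source phrase (values invented).
candidates = [
    {"target": "good", "p_e_f": 0.6, "p_f_e": 0.1},
    {"target": "odd",  "p_e_f": 0.3, "p_f_e": 0.9},
]

# Ranking by p(e|f) alone vs. the product both directions induce
# under equal default weights.
by_direct = max(candidates, key=lambda c: c["p_e_f"])
by_product = max(candidates, key=lambda c: c["p_e_f"] * c["p_f_e"])

# by_direct picks "good" (0.6), but the product picks "odd"
# (0.3 * 0.9 = 0.27 vs. 0.6 * 0.1 = 0.06).
```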
Re: [Moses-support] Major bug found in Moses
1) So if I've understood you correctly you are saying we have a system that is purposefully designed to perform poorly with a disabled LM and this is the proof that the LM is the most fundamental part. Any attempt to prove otherwise by, e.g. filtering the phrase table to help the dysfunctional search algorithm, does not constitute proof that the TM is the most fundamental component of the system and if designed correctly can perform just fine on its own but rather only evidence that the researcher is not using the system as intended (the intention being to break the TM to support the idea that the LM is the most fundamental part). 2) If you still feel that the LM is the most fundamental component I challenge you to disable the TM and perform LM only translations and see what kind of BLEU scores you get. In conclusion, I do hope that you don't feel that potential investors in MT systems lack the intelligence to see through these logical fallacies. Can we now just admit that the system is broken and get around to fixing it? James From: Marcin Junczys-Dowmunt junc...@amu.edu.pl Sent: Wednesday, June 17, 2015 5:29 PM To: Read, James C Cc: Arnold, Doug; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses To paint you a picture: Imagine you have a rat in a labyrinth (the labyrinth is the TM and the search space). That rat is quite good at finding the center of that labyrinth. Now you somehow disable that rat's sense of smell, sense of direction, and long-term and short-term memory (that's the LM). Can you expect the rat to find the center? Or will it just tumble around, bumping into walls and not find anything? That's what you did to the decoder when disabling the LM. Now you prune the TM. In the labyrinth that's like closing all the doors that would lead the rat away from the center. There are still a few corridors left, but they all point into the general direction of the point where the rat is supposed to go. Although it may never quite reach it.
Now you put that same handicapped rat into the labyrinth where all ways lead more or less to the center. Are you really surprised that the clueless rat finds the center nearly every time now? That's what happened. It's not a bug. The LM is probably the strongest feature in an MT system. If you take that away you see what happens. On 2015-06-17 16:22, Read, James C wrote: All I did was break the link to the language model and then perform filtering. How is that a methodological mistake? How else would one test the efficacy of the TM in isolation? I remain convinced that this is undesirable behaviour and therefore a bug. James From: Marcin Junczys-Dowmunt junc...@amu.edu.pl Sent: Wednesday, June 17, 2015 5:12 PM To: Read, James C Cc: Arnold, Doug; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Hi James No, not at all. I would say that is expected behaviour. It's how search spaces and optimization work. If anything these are methodological mistakes on your side, sorry. You are doing weird things to the decoder and then you are surprised to get weird results from it. On 2015-06-17 16:07, Read, James C wrote: So, do we agree that this is undesirable behaviour and therefore a bug? James From: Marcin Junczys-Dowmunt junc...@amu.edu.pl Sent: Wednesday, June 17, 2015 5:01 PM To: Read, James C Subject: Re: [Moses-support] Major bug found in Moses As I said. With an unpruned phrase table and a decoder that just optimizes some unreasonable set of weights all bets are off, so if you get very low BLEU scores there, it's not surprising. It's probably jumping around in a very weird search space. With a pruned phrase table you restrict the search space VERY strongly. Nearly everything that will be produced is a half-decent translation. So yes, I can imagine that would happen. Marcin On 2015-06-17 15:56, Read, James C wrote: You would expect an improvement of 37 BLEU points?
James From: Marcin Junczys-Dowmunt junc...@amu.edu.pl Sent: Wednesday, June 17, 2015 4:32 PM To: Read, James C Cc: Moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses Hi James, there are many more factors involved than just probability, for instance word penalties, phrase penalties etc. To be able to validate your own claim you would need to set weights for all those non-probabilities to zero. Otherwise there is no hope that Moses will produce anything similar to the most probable translation. And based on that there is no surprise that there may be different translations. A pruned phrase table will naturally produce less noise, so I would say the behaviour you describe is exactly what I would expect to happen. Best, Marcin On 2015-06-17 15:26, Read, James C wrote: Hi all
Re: [Moses-support] Major bug found in Moses
interesting result. Lane On Wed, Jun 17, 2015 at 11:24 AM, Read, James C jcr...@essex.ac.uk wrote: Which features would you like me to tune? The whole purpose of the exercise was to eliminate all variables except the TM and to keep constant those that could not be eliminated so that I could see which types of phrase pairs contribute most to increases in BLEU score in a TM only setup. Now you are saying I have to tune but tuning won't work without a LM. So how do you expect a researcher to be able to understand how well the TM component of the system is working if you are going to insist that I must include a LM for tuning to work. Clearly the system is broken. It is designed to work well with a LM and poorly without. When clearly good results can be obtained with a functional TM and well chosen phrase pairs. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Kenneth Heafield mo...@kheafield.com Sent: Wednesday, June 17, 2015 7:13 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses I'll bite. The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large space of translations. The filter looks at one feature p(target | source). So now you've constrained the awful untuned model to a slightly better region of the search space. In other words, all you've done is a poor approximation to manually setting the weight to 1.0 on p(target | source) and the rest to 0. The problem isn't that you are running without a language model (though we generally do not care what happens without one). The problem is that you did not tune the feature weights. 
Moreover, as Marcin is pointing out, I wouldn't necessarily expect tuning to work without an LM. On 06/17/15 11:56, Read, James C wrote: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 5:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. your default model will thus (roughly) implement the following model: p(e|f) = p(e|f)*p(f|e) which is obviously wrong, and will give you poor results. This is not a bug in the code, but a poor choice of models and weights. Standard steps in SMT (like tuning the model weights on a development set, and including a language model) will give you the desired results. -- When a place gets crowded enough to require ID's, social collapse is not far away. It is time to go elsewhere.
The best thing about space travel is that it made it possible to go elsewhere. -- R.A. Heinlein, Time Enough For Love
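The role of tuning that Kenneth and Lane describe can be caricatured as a search over weight vectors on a development set: pick the weights under which the decoder's argmax-by-model-score choice also does best on an external quality metric. Real tuning uses MERT or MIRA over n-best lists; this toy random search only illustrates the objective, and every name in it is hypothetical:

```python
import random

def tune(dev_set, score_fn, metric, dims, trials=200, seed=0):
    """Toy tuning loop: find weights under which the best-scoring
    hypothesis per sentence is also best by the external metric.

    dev_set: list of (hypotheses, reference) pairs, where each
    hypothesis is a feature vector; score_fn(h, w) is the model score.
    """
    rng = random.Random(seed)
    best_w, best_m = None, float("-inf")
    for _ in range(trials):
        w = [rng.uniform(0.0, 1.0) for _ in range(dims)]
        total = 0.0
        for hyps, ref in dev_set:
            # The "decoder": pick the hypothesis the model likes best.
            chosen = max(hyps, key=lambda h: score_fn(h, w))
            # Credit the weights by how good that pick really is.
            total += metric(chosen, ref)
        if total > best_m:
            best_w, best_m = w, total
    return best_w
```

Without this step, as the thread keeps repeating, the model-score ranking and the quality ranking are only accidentally related.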
Re: [Moses-support] Major bug found in Moses
Please note that in order for the baseline to be meaningful it has to also use no LM. So, naturally the scores are lower than those of baselines you are referring to. Regarding expectations. Are you seriously suggesting that we would expect the translation model to be incapable of finding higher scoring translations when not filtering out less likely phrase pairs? How high exactly would that rank on your desirable qualities of a TM list? James From: amittai axelrod amit...@umiacs.umd.edu Sent: Wednesday, June 17, 2015 8:20 PM To: Read, James C; Hieu Hoang; Kenneth Heafield; moses-support@mit.edu Cc: Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses hi -- you might not be aware, but your emails sound almost belligerently confrontational. i can see how you would be frustrated, but starting a conversation with i have found a major bug and then repeatedly saying that clearly everything is broken -- that may not be the best way to convince the few hundred people on the mailing list of the soundness of your approach. also, your argument could be easily mis-interpreted as this behavior is unexpected to me, ergo this is unexpected behavior, and that will unfortunately bias the listener against you, as that is the preferred argument structure of conspiracy theorists. at any rate, the system is designed to take a large number of phrase pairs and model scores and cobble them together into a translation. it does do that. it appears that you have identified a different way of doing that cobbling-together, one that uses far fewer models -- so far so good! however, from reading your paper, it seems that your baseline is completely unoptimized, so performance gains against it may not show up in the real world. as specific examples, Table 1 in your paper shows that your baseline French-English system score is 11.36, Spanish-English is 7.16, and German-English is 6.70 BLEU.
if you compare those baselines against published results in those languages from the previous few years, you will see that those scores are well off the mark. your position will be helped by showing results against a stronger, yet still basic, baseline. what happens if you compare your approach against a vanilla use of the Moses pipeline [this includes tuning]? cheers, ~amittai On 6/17/15 12:45, Read, James C wrote: Doesn't look like the LM is contributing all that much then does it? James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 7:35 PM To: Kenneth Heafield; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On 17/06/2015 20:13, Kenneth Heafield wrote: I'll bite. The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large space of translations. The filter looks at one feature p(target | source). So now you've constrained the awful untuned model to a slightly better region of the search space. In other words, all you've done is a poor approximation to manually setting the weight to 1.0 on p(target | source) and the rest to 0. The problem isn't that you are running without a language model (though we generally do not care what happens without one). The problem is that you did not tune the feature weights. Moreover, as Marcin is pointing out, I wouldn't necessarily expect tuning to work without an LM. Tuning does work without a LM. The results aren't half bad. 
fr-en europarl (pb): with LM: 22.84 retuned without LM: 18.33 On 06/17/15 11:56, Read, James C wrote: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 5:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. your default model will thus (roughly) implement the following
Re: [Moses-support] Major bug found in Moses
the decoder's job is NOT to find the high quality translation (as measured by bleu). Its job is to find translations with high model score. you need the tuning to make sure high quality translation correlates with high model score. If you don't tune, it's pot luck what quality you get. You should tune with the features you use Hieu Hoang Researcher New York University, Abu Dhabi http://www.hoang.co.uk/hieu On 17 June 2015 at 21:52, Read, James C jcr...@essex.ac.uk wrote: The analogy doesn't seem to be helping me understand just how exactly it is a desirable quality of a TM to a) completely break down if no LM is used (thank you for showing that such is not always the case) b) be dependent on a tuning step to help it find the higher scoring translations What you seem to be essentially saying is that the TM cannot find the higher scoring translations because I didn't pretune the system to do so. And I am supposed to accept that such is a desirable quality of a system whose very job is to find the higher scoring translations. Further, I am still unclear which features you require a system to be tuned on. At the very least it seems that I have discovered the selection process that tuning seems to be making up for in some unspecified and altogether opaque way. James From: Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 8:34 PM To: Read, James C; Kenneth Heafield; moses-support@mit.edu Cc: Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses 4 BLEU is nothing to sniff at :) I was answering Ken's tangent aspersion that LMs are needed for tuning. I have some sympathy for you. You're looking at ways to improve translation by reducing the search space. I've bashed my head against this wall for a while as well without much success. However, as everyone is telling you, you haven't understood the role of tuning.
Without tuning, you're pointing your lab rat to some random part of the search space, instead of away from the furry animal with whiskers and towards the yellow cheesy thing On 17/06/2015 20:45, Read, James C wrote: Doesn't look like the LM is contributing all that much then does it? James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 7:35 PM To: Kenneth Heafield; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On 17/06/2015 20:13, Kenneth Heafield wrote: I'll bite. The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large space of translations. The filter looks at one feature p(target | source). So now you've constrained the awful untuned model to a slightly better region of the search space. In other words, all you've done is a poor approximation to manually setting the weight to 1.0 on p(target | source) and the rest to 0. The problem isn't that you are running without a language model (though we generally do not care what happens without one). The problem is that you did not tune the feature weights. Moreover, as Marcin is pointing out, I wouldn't necessarily expect tuning to work without an LM. Tuning does work without a LM. The results aren't half bad. fr-en europarl (pb): with LM: 22.84 retuned without LM: 18.33 On 06/17/15 11:56, Read, James C wrote: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise.
James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 5:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. your default model will thus (roughly) implement the following model: p(e|f) = p(e|f)*p(f|e) which is obviously wrong, and will give you poor results. This is not a bug in the code, but a poor choice
Re: [Moses-support] Major bug found in Moses
When you filter the TM, you reported that you used the fourth weight. When you translate with the full TM, what weights did you assign to the TM? If you used the default, I believe it would equally weight all the phrasal features (i.e., 1 1 1 1). This would explain why decoding with the full TM does not give the same result as filtering first. The moses.ini in your unfiltered translation experiment should assign weights of 0 0 0 1 to the TM features. On Jun 17, 2015, at 1:52 PM, Read, James C jcr...@essex.ac.uk wrote: The analogy doesn't seem to be helping me understand just how exactly it is a desirable quality of a TM to a) completely break down if no LM is used (thank you for showing that such is not always the case) b) be dependent on a tuning step to help it find the higher scoring translations What you seem to be essentially saying is that the TM cannot find the higher scoring translations because I didn't pretune the system to do so. And I am supposed to accept that such is a desirable quality of a system whose very job is to find the higher scoring translations. Further, I am still unclear which features you require a system to be tuned on. At the very least it seems that I have discovered the selection process that tuning seems to be making up for in some unspecified and altogether opaque way. James From: Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 8:34 PM To: Read, James C; Kenneth Heafield; moses-support@mit.edu Cc: Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses 4 BLEU is nothing to sniff at :) I was answering Ken's tangent aspersion that LMs are needed for tuning. I have some sympathy for you. You're looking at ways to improve translation by reducing the search space. I've bashed my head against this wall for a while as well without much success. However, as everyone is telling you, you haven't understood the role of tuning.
Without tuning, you're pointing your lab rat to some random part of the search space, instead of away from the furry animal with whiskers and towards the yellow cheesy thing On 17/06/2015 20:45, Read, James C wrote: Doesn't look like the LM is contributing all that much then does it? James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 7:35 PM To: Kenneth Heafield; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On 17/06/2015 20:13, Kenneth Heafield wrote: I'll bite. The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large space of translations. The filter looks at one feature p(target | source). So now you've constrained the awful untuned model to a slightly better region of the search space. In other words, all you've done is a poor approximation to manually setting the weight to 1.0 on p(target | source) and the rest to 0. The problem isn't that you are running without a language model (though we generally do not care what happens without one). The problem is that you did not tune the feature weights. Moreover, as Marcin is pointing out, I wouldn't necessarily expect tuning to work without an LM. Tuning does work without a LM. The results aren't half bad. fr-en europarl (pb): with LM: 22.84 retuned without LM: 18.33 On 06/17/15 11:56, Read, James C wrote: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise.
James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 5:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. your default model will thus (roughly) implement the following model: p(e|f) = p(e|f)*p(f|e) which is obviously wrong, and will give you poor results. This is not a bug in the code
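The suggestion above amounts to editing the weight section of moses.ini so that only the feature used for filtering carries weight. A hypothetical fragment is sketched below; the exact section and feature names vary across Moses versions, so treat this as an illustration of the 0 0 0 1 idea rather than copy-paste configuration:

```ini
# Hypothetical moses.ini excerpt (newer weight syntax assumed):
# zero out all translation-model features except the fourth, the
# one the filtering experiment ranked by, so the unfiltered decoder
# scores candidates by that feature alone.
[weight]
TranslationModel0= 0 0 0 1
WordPenalty0= 0
```

With weights like these, decoding the full table and decoding the filtered table should agree far more closely, which is the comparison the message asks for.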
Re: [Moses-support] Major bug found in Moses
hi -- you might not be aware, but your emails sound almost belligerently confrontational. i can see how you would be frustrated, but starting a conversation with "i have found a major bug" and then repeatedly saying that clearly everything is broken -- that may not be the best way to convince the few hundred people on the mailing list of the soundness of your approach. also, your argument could easily be mis-interpreted as "this behavior is unexpected to me, ergo this is unexpected behavior", and that will unfortunately bias the listener against you, as that is the preferred argument structure of conspiracy theorists. at any rate, the system is designed to take a large number of phrase pairs and model scores and cobble them together into a translation. it does do that. it appears that you have identified a different way of doing that cobbling-together, one that uses far fewer models -- so far so good! however, from reading your paper, it seems that your baseline is completely unoptimized, so performance gains against it may not show up in the real world. as specific examples, Table 1 in your paper shows that your baseline French-English system score is 11.36, Spanish-English is 7.16, and German-English is 6.70 BLEU. if you compare those baselines against published results in those languages from the previous few years, you will see that those scores are well off the mark. your position will be helped by showing results against a stronger, yet still basic, baseline. what happens if you compare your approach against a vanilla use of the Moses pipeline [this includes tuning]? cheers, ~amittai On 6/17/15 12:45, Read, James C wrote: Doesn't look like the LM is contributing all that much then does it?
Re: [Moses-support] Major bug found in Moses
Read here for a table of results for 40 language pairs: http://privatewww.essex.ac.uk/~jcread/paper.pdf Would you honestly expect such huge differences in BLEU score? Honestly!? James From: Read, James C Sent: Wednesday, June 17, 2015 4:56 PM To: Marcin Junczys-Dowmunt Cc: Moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses You would expect an improvement of 37 BLEU points? James
Re: [Moses-support] Major bug found in Moses
Hi James No, not at all. I would say that is expected behaviour. It's how search spaces and optimization work. If anything these are methodological mistakes on your side, sorry. You are doing weird things to the decoder and then you are surprised to get weird results from it. On 2015-06-17 16:07, Read, James C wrote: So, do we agree that this is undesirable behaviour and therefore a bug? James - FROM: Marcin Junczys-Dowmunt junc...@amu.edu.pl SENT: Wednesday, June 17, 2015 5:01 PM TO: Read, James C SUBJECT: Re: [Moses-support] Major bug found in Moses As I said. With an unpruned phrase table and a decoder that just optimizes some unreasonable set of weights, all bets are off, so if you get very low BLEU scores there, it's not surprising. It's probably jumping around in a very weird search space. With a pruned phrase table you restrict the search space VERY strongly. Nearly everything that will be produced is a half-decent translation. So yes, I can imagine that would happen. Marcin
Re: [Moses-support] Major bug found in Moses
All I did was break the link to the language model and then perform filtering. How is that a methodological mistake? How else would one test the efficacy of the TM in isolation? I remain convinced that this is undesirable behaviour and therefore a bug. James
Re: [Moses-support] Major bug found in Moses
Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. your default model will thus (roughly) implement the following model: p(e|f) = p(e|f)*p(f|e) which is obviously wrong, and will give you poor results. This is not a bug in the code, but a poor choice of models and weights. Standard steps in SMT (like tuning the model weights on a development set, and including a language model) will give you the desired results. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
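Rico's point can be sketched numerically. Moses scores hypotheses with a weighted sum of log features; with the LM removed and untuned, uniform weights on both phrase-probability features, the ranking follows the product p(e|f)*p(f|e) rather than p(e|f) alone. A toy illustration (the feature names and numbers below are made up for the sketch, not Moses internals):

```python
import math

# Toy log-feature values for one phrase pair (numbers are illustrative).
features = {"p_e_given_f": math.log(0.4), "p_f_given_e": math.log(0.1)}

# Untuned, uniform weights and no LM: both directions count equally.
weights = {"p_e_given_f": 1.0, "p_f_given_e": 1.0}

# Log-linear model score: sum of weight * log-feature.
score = sum(weights[k] * features[k] for k in features)

# The decoder therefore ranks by p(e|f) * p(f|e), not by p(e|f) alone.
assert abs(math.exp(score) - 0.4 * 0.1) < 1e-9
```

Tuning replaces the uniform weights with ones chosen so that model score correlates with translation quality, which is the step the rest of the thread keeps pointing to.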
Re: [Moses-support] Major bug found in Moses
Hi James, there are many more factors involved than just probability, for instance word penalties, phrase penalties etc. To be able to validate your own claim you would need to set weights for all those non-probabilities to zero. Otherwise there is no hope that moses will produce anything similar to the most probable translation. And based on that there is no surprise that there may be different translations. A pruned phrase table will naturally produce less noise, so I would say the behaviour you describe is quite exactly what I would expect to happen. Best, Marcin On 2015-06-17 15:26, Read, James C wrote: Hi all, I tried unsuccessfully to publish experiments showing this bug in Moses behaviour. As a result I have lost interest in attempting to have my work published. Nonetheless I think you all should be aware of an anomaly in Moses' behaviour which I have thoroughly exposed and which should be easy enough for you to reproduce. As I understand it the TM logic of Moses should select the most likely translations according to the TM. I would therefore expect a run of Moses with no LM to find sentences which are the most likely, or at least close to the most likely, according to the TM. To test this behaviour I performed two runs of Moses: one with an unfiltered phrase table, the other with a filtered phrase table which left only the most likely phrase pair for each source language phrase. The results were truly startling. I observed huge differences in BLEU score. The filtered phrase tables produced much higher BLEU scores. The beam size used was the default width of 100. I would not have been surprised if the differences in BLEU scores were minimal but they were quite high. I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM-only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I hope this information will be useful to the Moses community and that the cause of the behaviour can be found and rectified. James ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
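The filtering James describes (keeping only the single most likely phrase pair per source phrase) is easy to reproduce outside the decoder. A sketch over the usual `src ||| tgt ||| scores` text phrase table; the assumption that p(tgt|src) is the third score column follows the common Moses score ordering, but check your own table:

```python
# Keep only the most probable target phrase for each source phrase.
# Assumes the standard Moses text phrase table: "src ||| tgt ||| s1 s2 s3 s4"
# and that the direct probability p(tgt|src) is the third score column
# (an assumption; verify the score order of your own table).

def filter_phrase_table(lines, p_col=2):
    best = {}  # source phrase -> (p(tgt|src), full line)
    for line in lines:
        src, tgt, scores = [f.strip() for f in line.split("|||")[:3]]
        p = float(scores.split()[p_col])
        if src not in best or p > best[src][0]:
            best[src] = (p, line)
    return [line for _, line in best.values()]

table = [
    "le chat ||| the cat ||| 0.5 0.4 0.7 0.6",
    "le chat ||| cat the ||| 0.2 0.1 0.1 0.1",
]
filtered = filter_phrase_table(table)  # keeps only the "the cat" entry
```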
Re: [Moses-support] Major bug found in Moses
Hi, BLEU scores don't mean much, unless you know what the translations look like. Marcin's explanation sounds very plausible. How did you set weights in your experiment? And were they fixed for the two contrastive runs? Cheers, O. On June 17, 2015 4:01:26 PM CEST, Read, James C jcr...@essex.ac.uk wrote: Read here for a table of results for 40 language pairs: http://privatewww.essex.ac.uk/~jcread/paper.pdf Would you honestly expect such huge differences in BLEU score? Honestly!? James -- Ondrej Bojar (mailto:o...@cuni.cz / bo...@ufal.mff.cuni.cz) http://www.cuni.cz/~obo
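For context on Ondrej's question: the weights under discussion live in the moses.ini configuration, in recent Moses versions under a `[weight]` section. A sketch of what zeroing the non-probability features (as Marcin suggests) might look like; the feature names here follow common Moses defaults and are illustrative, so they may differ in your setup:

```ini
[weight]
; Illustrative only -- feature names depend on your moses.ini.
UnknownWordPenalty0= 1
WordPenalty0= 0
PhrasePenalty0= 0
Distortion0= 0
TranslationModel0= 0.25 0.25 0.25 0.25
```

These are exactly the values MERT or another tuner would overwrite; running untuned means decoding with whatever defaults the file shipped with.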
Re: [Moses-support] Major bug found in Moses
To paint you a picture: Imagine you have a rat in a labyrinth (the labyrinth is the TM and the search space). That rat is quite good at finding the center of that labyrinth. Now you somehow disable that rat's sense of smell, sense of direction, and long-term and short-term memory (that's the LM). Can you expect the rat to find the center? Or will it just tumble around, bumping into walls and not find anything? That's what you did to the decoder when disabling the LM. Now you prune the TM. In the labyrinth that's like closing all the doors that would lead the rat away from the center. There are still a few corridors left, but they all point in the general direction of the point where the rat is supposed to go, although it may never quite reach it. Now you put that same handicapped rat into the labyrinth where all ways lead more or less to the center. Are you really surprised that the clueless rat finds the center nearly every time now? That's what happened. It's not a bug. The LM is probably the strongest feature in an MT system. If you take that away you see what happens. On 2015-06-17 16:22, Read, James C wrote: All I did was break the link to the language model and then perform filtering. How is that a methodological mistake? How else would one test the efficacy of the TM in isolation? I remain convinced that this is undesirable behaviour and therefore a bug. James
Re: [Moses-support] Major bug found in Moses
Doesn't look like the LM is contributing all that much then does it? James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 7:35 PM To: Kenneth Heafield; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On 17/06/2015 20:13, Kenneth Heafield wrote: I'll bite. The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large space of translations. The filter looks at one feature, p(target | source). So now you've constrained the awful untuned model to a slightly better region of the search space. In other words, all you've done is a poor approximation to manually setting the weight to 1.0 on p(target | source) and the rest to 0. The problem isn't that you are running without a language model (though we generally do not care what happens without one). The problem is that you did not tune the feature weights. Moreover, as Marcin is pointing out, I wouldn't necessarily expect tuning to work without an LM. Tuning does work without a LM. The results aren't half bad. fr-en europarl (pb): with LM: 22.84 retuned without LM: 18.33 On 06/17/15 11:56, Read, James C wrote: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise. James -- Hieu Hoang Researcher New York University, Abu Dhabi http://www.hoang.co.uk/hieu
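Kenneth's equivalence claim can be checked on a toy example: for a single source phrase, decoding with weight 1.0 on p(target|source) and 0 on everything else selects the same entry that the max-p(target|source) filter would have kept. (All values below are illustrative.)

```python
import math

# Toy phrase-table entries for one source phrase:
# (target, p(tgt|src), p(src|tgt)) -- all numbers illustrative.
entries = [
    ("the cat", 0.7, 0.5),
    ("cat the", 0.1, 0.2),
    ("a cat",   0.2, 0.9),
]

def model_score(entry, w_direct=1.0, w_inverse=0.0):
    """Log-linear score with weight 1.0 on p(tgt|src) and 0 elsewhere."""
    return w_direct * math.log(entry[1]) + w_inverse * math.log(entry[2])

picked_by_weights = max(entries, key=model_score)  # weighted, untuned decode
kept_by_filter = max(entries, key=lambda e: e[1])  # the filtering experiment
assert picked_by_weights == kept_by_filter         # same entry either way
```

Over whole sentences the two are not identical (filtering also shrinks the space the beam search explores), which is why Kenneth calls it only a poor approximation.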
Re: [Moses-support] Major bug found in Moses
James, Did you run any optimizer? MERT, MIRA, PRO, etc? Lane On Wed, Jun 17, 2015 at 11:45 AM, Read, James C jcr...@essex.ac.uk wrote: Doesn't look like the LM is contributing all that much then does it? James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 7:35 PM To: Kenneth Heafield; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On 17/06/2015 20:13, Kenneth Heafield wrote: I'll bite. The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large space of translations. The filter looks at one feature p(target | source). So now you've constrained the awful untuned model to a slightly better region of the search space. In other words, all you've done is a poor approximation to manually setting the weight to 1.0 on p(target | source) and the rest to 0. The problem isn't that you are running without a language model (though we generally do not care what happens without one). The problem is that you did not tune the feature weights. Moreover, as Marcin is pointing out, I wouldn't necessarily expect tuning to work without an LM. Tuning does work without a LM. The results aren't half bad. fr-en europarl (pb): with LM: 22.84 retuned without LM: 18.33 On 06/17/15 11:56, Read, James C wrote: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments provve otherwise. 
James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 5:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. your default model will thus (roughly) implement the following model: p(e|f) = p(e|f)*p(f|e) which is obviously wrong, and will give you poor results. This is not a bug in the code, but a poor choice of models and weights. Standard steps in SMT (like tuning the model weights on a development set, and including a language model) will give you the desired results. -- Hieu Hoang Researcher New York University, Abu Dhabi http://www.hoang.co.uk/hieu -- When a place gets crowded enough to require ID's, social collapse is not far away. It is time to go elsewhere. The best thing about space travel is that it made it possible to go elsewhere. -- R.A. 
Heinlein, Time Enough For Love
Re: [Moses-support] Major bug found in Moses
No. James From: Lane Schwartz dowob...@gmail.com Sent: Wednesday, June 17, 2015 7:58 PM To: Read, James C Cc: Hieu Hoang; Kenneth Heafield; moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses James, Did you run any optimizer? MERT, MIRA, PRO, etc? Lane On Wed, Jun 17, 2015 at 11:45 AM, Read, James C jcr...@essex.ac.uk wrote: Doesn't look like the LM is contributing all that much then does it? James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Hieu Hoang hieuho...@gmail.com Sent: Wednesday, June 17, 2015 7:35 PM To: Kenneth Heafield; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses On 17/06/2015 20:13, Kenneth Heafield wrote: I'll bite. The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large space of translations. The filter looks at one feature p(target | source). So now you've constrained the awful untuned model to a slightly better region of the search space. In other words, all you've done is a poor approximation to manually setting the weight to 1.0 on p(target | source) and the rest to 0. The problem isn't that you are running without a language model (though we generally do not care what happens without one). The problem is that you did not tune the feature weights. Moreover, as Marcin is pointing out, I wouldn't necessarily expect tuning to work without an LM. Tuning does work without a LM. The results aren't half bad. 
fr-en europarl (pb): with LM: 22.84 retuned without LM: 18.33 On 06/17/15 11:56, Read, James C wrote: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 5:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. your default model will thus (roughly) implement the following model: p(e|f) = p(e|f)*p(f|e) which is obviously wrong, and will give you poor results. This is not a bug in the code, but a poor choice of models and weights. Standard steps in SMT (like tuning the model weights on a development set, and including a language model) will give you the desired results. 
-- Hieu Hoang Researcher New York University, Abu Dhabi http://www.hoang.co.uk/hieu -- When a place gets crowded enough to require ID's, social collapse is not far away. It is time to go elsewhere. The best thing about space travel is that it made it possible to go elsewhere. -- R.A. Heinlein, Time Enough For Love ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Which features would you like me to tune? The whole purpose of the exercise was to eliminate all variables except the TM and to keep constant those that could not be eliminated, so that I could see which types of phrase pairs contribute most to increases in BLEU score in a TM-only setup. Now you are saying I have to tune, but tuning won't work without a LM. So how do you expect a researcher to be able to understand how well the TM component of the system is working if you are going to insist that I must include a LM for tuning to work? Clearly the system is broken. It is designed to work well with a LM and poorly without, when clearly good results can be obtained with a functional TM and well chosen phrase pairs. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Kenneth Heafield mo...@kheafield.com Sent: Wednesday, June 17, 2015 7:13 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses I'll bite. The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large space of translations. The filter looks at one feature p(target | source). So now you've constrained the awful untuned model to a slightly better region of the search space. In other words, all you've done is a poor approximation to manually setting the weight to 1.0 on p(target | source) and the rest to 0. The problem isn't that you are running without a language model (though we generally do not care what happens without one). The problem is that you did not tune the feature weights. Moreover, as Marcin is pointing out, I wouldn't necessarily expect tuning to work without an LM. 
On 06/17/15 11:56, Read, James C wrote: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 5:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. your default model will thus (roughly) implement the following model: p(e|f) = p(e|f)*p(f|e) which is obviously wrong, and will give you poor results. This is not a bug in the code, but a poor choice of models and weights. Standard steps in SMT (like tuning the model weights on a development set, and including a language model) will give you the desired results. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Read, James C jcread@... writes: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise. James I recommend you read the following: https://en.wikipedia.org/wiki/Confusion_of_the_inverse You don't explain which score you use for filtering (do you take one of the scores, their sum, their product, or something else?), but I expect you (mostly) keep the phrase pairs with a high p(e|f), which is the best thing to do when you don't have a language model. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
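[Editor's note] Rico's question about which score was used can be made concrete. The sketch below is illustrative Python, not Moses code; it assumes the conventional Moses phrase-table layout of `src ||| tgt ||| scores` with the scores ordered p(f|e), lex(f|e), p(e|f), lex(e|f) (a given table may differ):

```python
# Illustrative sketch: reading individual score columns from a
# Moses-style phrase table line. Score order assumed conventional:
# p(f|e), lex(f|e), p(e|f), lex(e|f).

def scores_of(line):
    # Fields are separated by "|||": source, target, scores, ...
    fields = [f.strip() for f in line.split("|||")]
    return [float(s) for s in fields[2].split()]

line = "le chat ||| the cat ||| 0.8 0.6 0.7 0.5"
s = scores_of(line)
inverse = s[0]  # p(f|e), the inverse phrase probability
direct = s[2]   # p(e|f), the direct phrase probability
print(inverse, direct)  # 0.8 0.7
```

Filtering on the direct probability p(e|f) is, as Rico notes, the closest a hard filter can come to what a tuned weight on that single feature would do.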
Re: [Moses-support] Major bug found in Moses
I already answered this question in another post. Apologies for double posting. Here is the code I used for filtering. I filtered based on the fourth score only.

#!/usr/bin/perl -w
#
# Program filters phrase table to leave only phrase pairs
# with probability above a threshold
#
use strict;
use warnings;
use Getopt::Long;

my $min;
my $phrase_table;
my $filtered_table;

GetOptions(
    'min=f' => \$min,
    'out=s' => \$filtered_table,
    'in=s'  => \$phrase_table);

die "ERROR: must give threshold and phrase table input file and output file\n"
    unless ($min && $phrase_table && $filtered_table);
die "ERROR: file $phrase_table does not exist\n" unless (-e $phrase_table);

open (PHRASETABLE, '<', $phrase_table)
    or die "FATAL: Could not open phrase table $phrase_table\n";
open (FILTEREDTABLE, '>', $filtered_table)
    or die "FATAL: Could not open output file $filtered_table\n";

while (my $line = <PHRASETABLE>) {
    chomp $line;
    my @columns = split ('\|\|\|', $line);
    # check that file is a well formatted phrase table
    if (scalar @columns < 4) {
        die "ERROR: input file is not a well formatted phrase table. A phrase table must have at least four columns, each column separated by |||\n";
    }
    # get the probability and keep the pair if it is above the threshold
    my @scores = split /\s+/, $columns[2];
    if ($scores[3] > $min) {
        print FILTEREDTABLE $line . "\n";
    }
}

From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 7:17 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise. 
James I recommend you read the following: https://en.wikipedia.org/wiki/Confusion_of_the_inverse You don't explain which score you use for filtering (do you take one of the scores, their sum, their product, or something else?), but I expect you (mostly) keep the phrase pairs with a high p(e|f), which is the best thing to do when you don't have a language model. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Below I include a typical moses.ini file. Of course they were kept the same for both runs. The only difference was the phrase table filtering. I did everything in my power to make this the only variable. James From: Ondrej Bojar bo...@ufal.mff.cuni.cz Sent: Wednesday, June 17, 2015 5:23 PM To: Read, James C; Marcin Junczys-Dowmunt Cc: Moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses Hi, BLEU scores don't mean much, unless you know what the translations look like. Marcin's explanation sounds very plausible. How did you set weights in your experiment? And were they fixed for the two contrastive runs? Cheers, O. On June 17, 2015 4:01:26 PM CEST, Read, James C jcr...@essex.ac.uk wrote: Read here for a table of results for 40 language pairs: http://privatewww.essex.ac.uk/~jcread/paper.pdf Would you honestly expect such huge differences in BLEU score? Honestly!? James From: Read, James C Sent: Wednesday, June 17, 2015 4:56 PM To: Marcin Junczys-Dowmunt Cc: Moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses You would expect an improvement of 37 BLEU points? James From: Marcin Junczys-Dowmunt junc...@amu.edu.pl Sent: Wednesday, June 17, 2015 4:32 PM To: Read, James C Cc: Moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses Hi James, there are many more factors involved than just probability, for instance word penalties, phrase penalties, etc. To be able to validate your own claim you would need to set weights for all those non-probabilities to zero. Otherwise there is no hope that moses will produce anything similar to the most probable translation. And based on that there is no surprise that there may be different translations. A pruned phrase table will naturally produce less noise, so I would say the behaviour you describe is quite exactly what I would expect to happen. 
Best, Marcin On 2015-06-17 15:26, Read, James C wrote: Hi all, I tried unsuccessfully to publish experiments showing this bug in Moses behaviour. As a result I have lost interest in attempting to have my work published. Nonetheless I think you all should be aware of an anomaly in Moses' behaviour which I have thoroughly exposed and which should be easy enough for you to reproduce. As I understand it, the TM logic of Moses should select the most likely translations according to the TM. I would therefore expect a run of Moses with no LM to find sentences which are the most likely, or at least close to the most likely, according to the TM. To test this behaviour I performed two runs of Moses: one with an unfiltered phrase table, the other with a filtered phrase table which left only the most likely phrase pair for each source language phrase. The results were truly startling. I observed huge differences in BLEU score. The filtered phrase tables produced much higher BLEU scores. The beam size used was the default width of 100. I would not have been surprised if the differences in BLEU scores were minimal, but they were quite high. I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I hope this information will be useful to the Moses community and that the cause of the behaviour can be found and rectified. 
James -- Ondrej Bojar (mailto:o...@cuni.cz / bo...@ufal.mff.cuni.cz) http://www.cuni.cz/~obo ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
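[Editor's note] The filtering experiment James describes above (keeping only the most likely phrase pair for each source phrase) can be sketched in a few lines. This is illustrative Python under the assumption that each row carries one chosen score; it is not the script actually used in the experiments:

```python
# Sketch of per-source filtering: keep only the highest-scoring
# target phrase for each source phrase. Rows are (source, target,
# score) tuples here; a real phrase table would be parsed from
# "|||"-separated text.

def keep_best(rows):
    best = {}
    for src, tgt, score in rows:
        if src not in best or score > best[src][1]:
            best[src] = (tgt, score)
    return {src: tgt for src, (tgt, _) in best.items()}

rows = [
    ("chat", "cat", 0.7),
    ("chat", "chat", 0.2),   # noisy pair left in by the extractor
    ("chien", "dog", 0.9),
]
print(keep_best(rows))  # {'chat': 'cat', 'chien': 'dog'}
```

As Marcin argues in the thread, such pruning restricts the decoder's search space so strongly that nearly every surviving hypothesis is a reasonable translation, which by itself can move BLEU substantially.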
Re: [Moses-support] Major bug found in Moses
Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 5:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. your default model will thus (roughly) implement the following model: p(e|f) = p(e|f)*p(f|e) which is obviously wrong, and will give you poor results. This is not a bug in the code, but a poor choice of models and weights. Standard steps in SMT (like tuning the model weights on a development set, and including a language model) will give you the desired results. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Major bug found in Moses
Evidently, if you filter the phrase table then the LM is not as important as you might feel. The question remains: why isn't the system capable of finding the most likely translations without the LM? Why do I need to filter to help the system find them? This is undesirable behaviour. Clearly a bug. I include the code I used for filtering. As you can see, only the 4th score was used as the filtering criterion.

#!/usr/bin/perl -w
#
# Program filters phrase table to leave only phrase pairs
# with probability above a threshold
#
use strict;
use warnings;
use Getopt::Long;

my $min;
my $phrase_table;
my $filtered_table;

GetOptions(
    'min=f' => \$min,
    'out=s' => \$filtered_table,
    'in=s'  => \$phrase_table);

die "ERROR: must give threshold and phrase table input file and output file\n"
    unless ($min && $phrase_table && $filtered_table);
die "ERROR: file $phrase_table does not exist\n" unless (-e $phrase_table);

open (PHRASETABLE, '<', $phrase_table)
    or die "FATAL: Could not open phrase table $phrase_table\n";
open (FILTEREDTABLE, '>', $filtered_table)
    or die "FATAL: Could not open output file $filtered_table\n";

while (my $line = <PHRASETABLE>) {
    chomp $line;
    my @columns = split ('\|\|\|', $line);
    # check that file is a well formatted phrase table
    if (scalar @columns < 4) {
        die "ERROR: input file is not a well formatted phrase table. A phrase table must have at least four columns, each column separated by |||\n";
    }
    # get the probability and keep the pair if it is above the threshold
    my @scores = split /\s+/, $columns[2];
    if ($scores[3] > $min) {
        print FILTEREDTABLE $line . "\n";
    }
}

From: Matt Post p...@cs.jhu.edu Sent: Wednesday, June 17, 2015 5:25 PM To: Read, James C Cc: Marcin Junczys-Dowmunt; moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses I think you are misunderstanding how decoding works. The highest-weighted translation of each source phrase is not necessarily the one with the best BLEU score. 
This is why the decoder retains many options, so that it can search among them (together with their reorderings). The LM is an important component in making these selections. Also, how did you weight the many probabilities attached to each phrase (to determine which was the most probable)? The tuning phase of decoding selects weights designed to optimize BLEU score. If you weighted them evenly, that is going to exacerbate this experiment. matt On Jun 17, 2015, at 10:22 AM, Read, James C jcr...@essex.ac.uk wrote: All I did was break the link to the language model and then perform filtering. How is that a methodological mistake? How else would one test the efficacy of the TM in isolation? I remain convinced that this is undesirable behaviour and therefore a bug. James From: Marcin Junczys-Dowmunt junc...@amu.edu.pl Sent: Wednesday, June 17, 2015 5:12 PM To: Read, James C Cc: Arnold, Doug; moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Hi James No, not at all. I would say that is expected behaviour. It's how search spaces and optimization work. If anything these are methodological mistakes on your side, sorry. You are doing weird things to the decoder and then you are surprised to get weird results from it. On 2015-06-17 16:07, Read, James C wrote: So, do we agree that this is undesirable behaviour and therefore a bug? James From: Marcin Junczys-Dowmunt junc...@amu.edu.pl Sent: Wednesday, June 17, 2015 5:01 PM To: Read, James C Subject: Re: [Moses-support] Major bug found in Moses As I said. With an unpruned phrase table and a decoder that just optimizes some unreasonable set of weights all bets are off, so if you get very low BLEU points there, it's not surprising. It's probably jumping around in a very weird search space. With a pruned phrase table you restrict the search space VERY strongly. 
Nearly everything that will be produced is a half-decent translation. So yes, I can imagine that would happen. Marcin On 2015-06-17 15:56, Read, James C wrote: You would expect an improvement of 37 BLEU points? James From: Marcin Junczys-Dowmunt junc...@amu.edu.pl Sent: Wednesday, June 17, 2015 4:32 PM To: Read, James C Cc: Moses-support@mit.edu; Arnold, Doug Subject: Re: [Moses-support] Major bug found in Moses Hi James, there are many more factors involved than just probability, for instance word penalties, phrase penalties, etc. To be able to validate your own claim you would need to set weights for all those non-probabilities to zero.
Re: [Moses-support] Major bug found in Moses
I'll bite. The moses.ini files ship with bogus feature weights. One is required to tune the system to discover good weights for their system. You did not tune. The results of an untuned system are meaningless. So for example if the feature weights are all zeros, then the scores are all zero. The system will arbitrarily pick some awful translation from a large space of translations. The filter looks at one feature p(target | source). So now you've constrained the awful untuned model to a slightly better region of the search space. In other words, all you've done is a poor approximation to manually setting the weight to 1.0 on p(target | source) and the rest to 0. The problem isn't that you are running without a language model (though we generally do not care what happens without one). The problem is that you did not tune the feature weights. Moreover, as Marcin is pointing out, I wouldn't necessarily expect tuning to work without an LM. On 06/17/15 11:56, Read, James C wrote: Actually the approximation I expect to be: p(e|f)=p(f|e) Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise. James From: moses-support-boun...@mit.edu moses-support-boun...@mit.edu on behalf of Rico Sennrich rico.sennr...@gmx.ch Sent: Wednesday, June 17, 2015 5:32 PM To: moses-support@mit.edu Subject: Re: [Moses-support] Major bug found in Moses Read, James C jcread@... writes: I have been unable to find a logical explanation for this behaviour other than to conclude that there must be some kind of bug in Moses which causes a TM only run of Moses to perform poorly in finding the most likely translations according to the TM when there are less likely phrase pairs included in the race. I may have overlooked something, but you seem to have removed the language model from your config, and used default weights. 
Your default model will thus (roughly) implement the following model: p(e|f) = p(e|f)*p(f|e), which is obviously wrong, and will give you poor results. This is not a bug in the code, but a poor choice of models and weights. Standard steps in SMT (like tuning the model weights on a development set, and including a language model) will give you the desired results. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
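[Editor's note] Kenneth's point about untuned weights can be illustrated numerically. The sketch below is hypothetical Python with invented feature values, not output from any real system: it shows how a log-linear score (the weighted sum of log feature values that Moses ranks hypotheses by) can place a bad hypothesis above a good one under uniform default-style weights, while putting all weight on p(e|f), which is roughly what the hard filter approximates, reverses the ranking.

```python
import math

# Hypothetical feature values for two competing hypotheses:
# (p(f|e), p(e|f), some third feature). Invented for illustration.

def model_score(features, weights):
    # Log-linear model score: sum of weighted log feature values.
    return sum(w * math.log(h) for w, h in zip(weights, features))

hyp_good = (0.3, 0.8, 0.5)  # high direct probability p(e|f)
hyp_bad  = (0.9, 0.2, 0.9)  # high inverse probability, low p(e|f)

uniform = (1.0, 1.0, 1.0)      # untuned, default-style weights
direct_only = (0.0, 1.0, 0.0)  # all weight on p(e|f)

# Under uniform weights the "bad" hypothesis outscores the good one;
# with all weight on p(e|f) the good one wins.
print(model_score(hyp_bad, uniform) > model_score(hyp_good, uniform))          # True
print(model_score(hyp_good, direct_only) > model_score(hyp_bad, direct_only))  # True
```

Tuning (MERT, MIRA, PRO) searches for the weight vector that makes the model's ranking agree with translation quality on a development set, which is why both Kenneth and Lane insist on it before drawing conclusions from model scores.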