Re: [Moses-support] RandLM make Error
LDHT is not really supported, but looking at your error message it seems that you need to install Google Sparse Hash.

On Wed Nov 19 2014 at 12:47:27 PM Hieu Hoang hieu.ho...@ed.ac.uk wrote:

There is a script within the randlm project that compiles just the library needed to integrate the library into Moses:
https://sourceforge.net/p/randlm/code/HEAD/tree/trunk/manual-compile/compile.sh

It's been a while since people have asked about RandLM; I'm not sure who's still using it or who has the time and experience to take care of it.

On 19 November 2014 11:50, Achchuthan Yogarajah achch1...@gmail.com wrote:

Hi everyone, when I build RandLM with the command make, I get the following error:

Making all in RandLM
make[1]: Entering directory `/home/achchuthan/randlm-0.2.5/src/RandLM'
make[1]: Nothing to be done for `all'.
make[1]: Leaving directory `/home/achchuthan/randlm-0.2.5/src/RandLM'
Making all in LDHT
make[1]: Entering directory `/home/achchuthan/randlm-0.2.5/src/LDHT'
/bin/bash ../../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I../.. -I./ -fPIC -Wno-deprecated -Wall -ggdb -DTIXML_USE_TICPP -g -O2 -MT libLDHT_la-Client.lo -MD -MP -MF .deps/libLDHT_la-Client.Tpo -c -o libLDHT_la-Client.lo `test -f 'Client.cpp' || echo './'`Client.cpp
libtool: compile: g++ -DHAVE_CONFIG_H -I. -I../.. -I./ -fPIC -Wno-deprecated -Wall -ggdb -DTIXML_USE_TICPP -g -O2 -MT libLDHT_la-Client.lo -MD -MP -MF .deps/libLDHT_la-Client.Tpo -c Client.cpp -fPIC -DPIC -o .libs/libLDHT_la-Client.o
In file included from Client.cpp:6:0:
Client.h:8:34: fatal error: google/sparse_hash_map: No such file or directory
#include <google/sparse_hash_map>
^
compilation terminated.
make[1]: *** [libLDHT_la-Client.lo] Error 1
make[1]: Leaving directory `/home/achchuthan/randlm-0.2.5/src/LDHT'
make: *** [all-recursive] Error

--
Thanks and regards,
Yogarajah Achchuthan
[ LinkedIn http://lk.linkedin.com/in/achchuthany/ | Twitter https://twitter.com/achchuthany | Facebook https://www.facebook.com/achchuthany ]

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
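The fix the first reply points at can be applied before re-running make. The package name below is an assumption (libsparsehash-dev is the usual Debian/Ubuntu name for Google Sparse Hash; other distributions differ):

```shell
# Install Google Sparse Hash so that <google/sparse_hash_map> resolves.
# libsparsehash-dev is the Debian/Ubuntu package name (an assumption;
# check your distribution's repositories).
sudo apt-get install libsparsehash-dev

# Confirm the header is now on the default include path:
ls /usr/include/google/sparse_hash_map
```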
Re: [Moses-support] embeddings
I would model them as feature functions over phrases. You might imagine that you can exploit vector similarity to do smoothing.

Good luck,
Miles
Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52
this perl snippet:

$line =~ tr/\040-\176/ /c;

(the /c complement flag makes it replace every character outside the printable ASCII range \040-\176 with a space)

On 30 May 2014 12:17, moses-support-requ...@mit.edu wrote:

Today's Topics: 1. removing non-printing character (Hieu Hoang)

Message: 1
Date: Fri, 30 May 2014 16:24:30 +0100
From: Hieu Hoang hieu.ho...@ed.ac.uk
Subject: [Moses-support] removing non-printing character

does anyone have a script/program that can remove all non-printing characters? I don't care if it's fast or slow, as long as it ABSOLUTELY removes all non-printing chars

--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu
Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52
it is trivial to change it to, say, a ? mark, but I'm not sure what you want as output now. The original request was for removing non-printable characters, which the Perl does.

Miles

On 30 May 2014 12:43, Hieu Hoang hieu.ho...@ed.ac.uk wrote:

forgot to say. The input is utf8. The snippet turns gonzález into gonz lez

On 30 May 2014 17:22, Miles Osborne mi...@inf.ed.ac.uk wrote:

this perl snippet: $line =~ tr/\040-\176/ /c;
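The mangling Hieu observes is a byte-level effect: applied without a UTF-8-aware layer, tr/\040-\176/ /c sees individual UTF-8 bytes, and both bytes of a character like á fall outside the printable-ASCII range. A minimal Python sketch of the same behaviour:

```python
# Replicate a byte-wise printable-ASCII filter (like Perl's
# tr/\040-\176/ /c on raw bytes): any byte outside the octal
# \040-\176 (0x20-0x7E) range becomes a space.
def ascii_range_filter(raw: bytes) -> bytes:
    return bytes(b if 0x20 <= b <= 0x7E else 0x20 for b in raw)

# 'á' is two bytes in UTF-8 (0xC3 0xA1), so it turns into two spaces:
mangled = ascii_range_filter("gonzález".encode("utf-8"))
# mangled == b"gonz  lez"
```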
Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52
for those specific characters:

perl -C -pe 's/\x{200B}//g' tmp/baa

but as Lane mentions, you probably need to somehow specify the set of naughty characters you need to deal with.

Miles

On 30 May 2014 13:23, Lane Schwartz dowob...@gmail.com wrote:

We also used charlint. It might do what you want.

On Fri, May 30, 2014 at 1:21 PM, Lane Schwartz dowob...@gmail.com wrote:

As far as I know, no such general-purpose tool exists. We wrote a custom in-house script that removes many, but not all, possible non-printing Unicode characters as part of our WMT submission. I am interested in writing one, though. I think the right way to do this would be to parse the Unicode character database for all characters of certain classes, and build the tool from that data.

Lane

On Fri, May 30, 2014 at 1:01 PM, Hieu Hoang hieu.ho...@ed.ac.uk wrote:

in the attached file, there are 2 or more non-printing chars on the 1st line, between the words 'place' and 'binding'. They should be removed/replaced with a space. Those chars are deleted by parsers, making the word alignments incorrect and crashing extract. The 2nd line is perfectly good utf8. It shouldn't be touched.

just another friday nlp malaise

On 30 May 2014 17:51, Miles Osborne mi...@inf.ed.ac.uk wrote:

it is trivial to change it to, say, a ? mark, but I'm not sure what you want as output now. The original request was for removing non-printable characters, which the Perl does.

On 30 May 2014 12:43, Hieu Hoang hieu.ho...@ed.ac.uk wrote:

forgot to say. The input is utf8.
The snippet turns gonzález into gonz lez
--
When a place gets crowded enough to require ID's, social collapse is not far away. It is time to go elsewhere. The best thing about space travel is that it made it possible to go elsewhere.
-- R.A. Heinlein, Time Enough For Love
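Lane's "right way" -- driving the filter from the Unicode character database -- can be sketched in a few lines of Python (a sketch, not the in-house script mentioned above): replace characters whose general category marks them as control, format, surrogate, private-use, unassigned, or line/paragraph separators, and keep everything else, so the zero-width space goes but gonzález survives.

```python
import unicodedata

# Unicode general categories covering non-printing characters:
# Cc (control), Cf (format, e.g. U+200B zero-width space), Cs (surrogate),
# Co (private use), Cn (unassigned), Zl/Zp (line/paragraph separators).
NON_PRINTING = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp"}

def strip_non_printing(line: str) -> str:
    """Replace non-printing characters with a space; keep real text."""
    return "".join(
        " " if unicodedata.category(ch) in NON_PRINTING else ch
        for ch in line
    )
```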
Re: [Moses-support] Perplexity KenLM
you can get kenlm to report perplexity as follows:

bin/query foo.arpa < text | tail -n 1

note that you need to be careful with OOVs if you are comparing models that do not all use the same vocabulary. (SRILM is broken in this respect, in that an OOV will give you a probability of one.)

Miles
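The summary that query prints is in terms of total log10 probability; converting that to perplexity is just 10 ** (-logprob / N), with N the number of scored tokens (including end-of-sentence events). A sketch of the arithmetic -- how you extract the two numbers from the tool's output depends on your KenLM version:

```python
def perplexity(total_log10_prob: float, num_tokens: int) -> float:
    # Perplexity is the inverse geometric mean of the token probabilities:
    # ppl = 10 ** (-(sum of log10 probabilities) / number of tokens)
    return 10.0 ** (-total_log10_prob / num_tokens)

# A total log10 probability of -6.0 spread over 3 tokens gives
# perplexity 10 ** 2 = 100.
```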
Re: [Moses-support] about testing on part of training dataset
SMT systems such as Moses do not guarantee that they can reproduce the training set. For example, phrases might be pruned because their frequencies are too low, not all words might be aligned, the decoder might discard the true translation during decoding, etc. This doesn't really have much to do with Indian languages per se; it is the way such systems are built in general.

Miles

Can anyone please tell me why we got a low BLEU score on a test set taken from the training set, for sparse-resourced languages like Indian languages?
Re: [Moses-support] incremental training
Incremental training in Moses is based upon work we did a few years back: http://homepages.inf.ed.ac.uk/miles/papers/naacl10b.pdf

Table 3 shows that there is essentially no quality difference between incremental training and standard GIZA++ training. Incremental (re)training is a lot faster.

Miles
Re: [Moses-support] compile error with LDHT in randlm
If I recall, the decoder was modified to allow batching of LM requests.

Miles

On 25 September 2013 10:22, Hieu Hoang hieuho...@gmail.com wrote:

I'm not sure how to compile LDHT, but when I compiled randlm from svn, I had to change 2 minor things to get it to compile on my mac:
1. src/RandLM/Makefile.am: boost_thread -> boost_thread-mt
2. autogen.sh: libtoolize -> glibtoolize

Also, the distributed LM was supported in Moses v1. However, it has been deleted from the current Moses in the git repository. I will try and re-add it if a multi-pass, asynchronous decoding framework can be created. If you're interested in doing this, I would be very glad to help you.

On 24/09/2013 11:51, Hoai-Thu Vuong wrote:

Hello, I build LDHT in randlm and get some errors that look like:

MurmurHash3.cpp:81:23: warning: always_inline function might not be inlinable [-Wattributes]
MurmurHash3.cpp:68:23: warning: always_inline function might not be inlinable [-Wattributes]
MurmurHash3.cpp:60:23: warning: always_inline function might not be inlinable [-Wattributes]
MurmurHash3.cpp:55:23: warning: always_inline function might not be inlinable [-Wattributes]
MurmurHash3.cpp: In function 'void MurmurHash3_x86_32(const void*, int, uint32_t, void*)':
MurmurHash3.cpp:55:23: error: inlining failed in call to always_inline 'uint32_t getblock(const uint32_t*, int)': function body can be overwritten at link time

I attach the full error log here. My compiler is g++ version 4.7, the OS is Ubuntu server 64-bit 13.04; I did a clean install and then installed the required packages such as git, build-essential, libtool, autoconf, google sparse hash and boost thread. With the same source code I compiled successfully with g++ version 4.6 on Ubuntu 64-bit 12.04.
I googled for a fix, and one suggestion was to change this line (in MurmurHash3.cpp):

#define FORCE_INLINE __attribute__((always_inline))

to

#define FORCE_INLINE inline __attribute__((always_inline))

Doing this, I get past this error; however, I then receive another error: ::close(m_sd) not found in the destructor ~TransportTCP().

--
Thu.
Re: [Moses-support] compile error with LDHT in randlm
have a look at: SearchNormalBatch.h in the source

Miles

On 25 September 2013 10:34, Lane Schwartz dowob...@gmail.com wrote:

Miles, I heard that rumor as well. If anyone could point me to any documentation that describes how to do this, I would be interested in trying out this functionality. Cheers, Lane

On Wed, Sep 25, 2013 at 10:24 AM, Miles Osborne mi...@inf.ed.ac.uk wrote:

If I recall, the decoder was modified to allow batching of LM requests. Miles
Re: [Moses-support] mosese decoder android and ios porting
For a long time now I've wanted to see Moses on a small device. Apart from removing all of the extra functionality that isn't needed, one would also need to work on shrinking the phrase table and perhaps also the search graph. KenLM / RandLM already deal with making the language model smaller.

An interesting research question would be as follows: can we frame decoding on a small device in terms of a budget and optimise that budget? We normally don't bother thinking this way and instead focus entirely on quality. But it might be possible to make a better connection between the amount of space / search used and quality than we have already. I'm not sure if this is just a matter of fiddling with the beam size etc. Evidence seems to suggest that this doesn't always give the expected behaviour (i.e. the relationship between BLEU and beam size isn't linear).

Miles
Re: [Moses-support] Including new features in moses decoding
this is a fairly typical result for MERT. I notice you are using MIRA, which is claimed to be more reliable; see http://www.aclweb.org/anthology/N/N09/N09-1025.pdf

Note that getting MIRA to work takes a lot of tweaking, so read the fine print carefully.

Miles

On 25 July 2012 17:24, Cristina cristi...@lsi.upc.edu wrote:

Dear all,

We are doing some experiments by adding new features at phrase level in the translation table. We have done a first experiment to see the effects and they are quite weird:

* We build a translation table with 9 features and a similar translation table with 18 features (the same 9 features + 9 new features)
* We run MERT (or MIRA) on a dev set using the first translation table (9 features)
* We translate a test set with 2 configurations:
- MERT on 9 features using the translation table with 9 features
- MERT on 9 features using the translation table with 18 features (9 + 9), where the weight for the 9 extra features is set to 0

We lose more than 3 points of BLEU with the second configuration with respect to the first one. (Using MERT on the 18 features gives similar results to the second configuration.)

Does anyone know if there is some penalty when adding more features? Or has anyone encountered the same problem?

Thanks in advance!
Best,
Cristina
Re: [Moses-support] Including new features in moses decoding
if you have non-zero feature values at training time, but they become zero at test time, then you may have a problem. The reason for this is that all weights are optimised together. You can think of this as the system trying to work out how best to translate, using everything. If some are zero, then you are forcing the rest to do work that they were not optimised for.

Miles

On 25 July 2012 17:51, Cristina cristi...@lsi.upc.edu wrote:

Thanks for the quick answer! I think that the problem here cannot be in the development step; it must be more related to decoding. Regardless of the way weights are estimated, the translation changes when I add new features with zero weight (not in development but in test). They shouldn't contribute to the score of the final translation, right?

Cristina
Re: [Moses-support] Including new features in moses decoding
then something is wrong

Miles

On 25 July 2012 19:42, Cristina cristi...@lsi.upc.edu wrote:

mmm... but the others were optimised altogether, without the new ones I'm giving a weight of zero...

On Wed, 25 Jul 2012, Miles Osborne wrote:

if you have non-zero feature values at training time, but they become zero at test time, then you may have a problem. The reason for this is that all weights are optimised together.
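The reason zero-weighted features "shouldn't contribute" is that the model score is a plain dot product, so appending features that carry weight 0 provably leaves every hypothesis score, and hence the ranking, unchanged; if the translations change anyway, the extra features are not really entering the score with weight zero. A toy sketch of the invariance (all numbers hypothetical):

```python
def model_score(features, weights):
    # Linear model: the hypothesis score is the dot product of its
    # feature values and the tuned weights.
    return sum(f * w for f, w in zip(features, weights))

base_feats = [0.5, -1.2, 0.3]    # hypothetical feature values
base_weights = [1.0, 0.7, -0.4]  # hypothetical tuned weights

# Append nine extra features, all carrying weight zero:
ext_feats = base_feats + [2.0] * 9
ext_weights = base_weights + [0.0] * 9

# The score, and therefore the ranking of hypotheses, must not move.
assert model_score(ext_feats, ext_weights) == model_score(base_feats, base_weights)
```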
Re: [Moses-support] Fwd: a question about moses
The standard way to do this is to pretend that each word pair in a dictionary is a little sentence. Append these to the usual parallel corpus and train with Giza.

Miles

On May 1, 2012 5:53 PM, Abby Levenberg leven...@gmail.com wrote:

Hi, I assume the answer is no but wanted to be sure. Thanks, Abby

-- Forwarded message --
From: Niraj Aswani nirajasw...@gmail.com
Date: Tue, May 1, 2012 at 4:25 PM
Subject: a question about moses
To: Abby Levenberg leven...@gmail.com

hi Abby, I hope you are fine. I am running a moses experiment on my system and wanted to know how I can supply an external dictionary to support the translation model? Is there a way to do it?

Regards,
Niraj
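Miles' recipe -- one dictionary entry per pseudo-sentence pair -- can be sketched as follows (the function and the toy entries are illustrative, not part of Moses):

```python
# Turn a bilingual dictionary into parallel "sentences", one entry per
# line on each side, ready to be appended to the training corpus that
# GIZA++ aligns.
def dictionary_to_corpus_lines(entries):
    src_lines = [src for src, _ in entries]
    tgt_lines = [tgt for _, tgt in entries]
    return src_lines, tgt_lines

entries = [("house", "maison"), ("dog", "chien")]  # toy entries
src, tgt = dictionary_to_corpus_lines(entries)
# append "\n".join(src) / "\n".join(tgt) to the source/target corpus files
```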
Re: [Moses-support] Higher BLEU/METEOR score than usual for EN-DE
Very short sentences will give you high scores. Also, multiple references will boost them.

Miles

On Apr 26, 2012 8:13 PM, John D Burger j...@mitre.org wrote:

I =think= I recall that pairwise BLEU scores for human translators are usually around 0.50, so anything much better than that is indeed suspect. - JB

On Apr 26, 2012, at 14:18, Daniel Schaut wrote:

Hi all,

I’m running some experiments for my thesis and I’ve been told by a more experienced user that the achieved BLEU/METEOR scores of my MT engine were too good to be true. Since this is the very first MT engine I’ve ever built and I am not experienced with interpreting scores, I really don’t know how to read them. The first test set achieves a BLEU score of 0.6508 (v13). METEOR’s final score is 0.7055 (v1.3, exact, stem, paraphrase). A second test set gave a slightly lower BLEU score of 0.6267 and a METEOR score of 0.6748.

Here are some basic facts about my system:
Decoding direction: EN-DE
Training corpus: 1.8 mil sentences
Tuning runs: 5
Test sets: a) 2,000 sentences, b) 1,000 sentences (both in-domain)
LM type: trigram
TM type: unfactored

I’m now trying to figure out whether these scores are realistic at all, as different papers report far lower BLEU scores, e.g. Koehn and Hoang 2011. Any comments regarding the mentioned decoding direction and related scores will be much appreciated.

Best,
Daniel
Re: [Moses-support] Evaluation
no, it works, as I just verified.

On 20 April 2012 11:29, sara hamza sarahamz...@gmail.com wrote:

Good morning everyone. Can anyone please tell me where I can get the mteval-v11b.pl script used in evaluation? I found this URL in some documentation: ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl but access failed.

Thank you in advance.
Re: [Moses-support] Incremental training
incremental training for Giza is distinct from incremental training for the language model. We have worked on both -- see Abby Levenberg's PhD: http://homepages.inf.ed.ac.uk/miles/phd-projects/levenberg.pdf

The short answer is yes, but I don't think the incremental LM code has migrated from Abby's thesis work into the Moses distribution.

Miles

On 20 February 2012 20:23, marco turchi marco.tur...@gmail.com wrote:

Dear all, I'm starting to use the incremental training and I was wondering if it updates the language model as well. If the answer is no, is it possible to update the language model without restarting Moses?

Thanks a lot,
Marco
Re: [Moses-support] Remote LM protocol?
Oliver is in the process of finishing it.

Miles

On Feb 14, 2012 3:45 PM, Lane Schwartz dowob...@gmail.com wrote:

Miles, just ran across this email and thought I'd follow up. How is this coming along? :) Cheers, Lane

On Thu, Nov 17, 2011 at 11:31 AM, Miles Osborne mi...@inf.ed.ac.uk wrote:

what we have is something that is very similar to the Google bloomier filter setup -- ie a randomised LM, with the actual LM sharded across multiple machines. We have been working on making it faster and have some results here. With any luck we will release this sometime early next year.

Miles

On 17 November 2011 16:25, Christian Federmann cfederm...@dfki.de wrote:

Hi Peter, Hieu, all,

my thesis stuff is rather outdated and likely not working with current Moses code. As Hieu pointed out, the whole thing is problematic as networked requests take much longer than in-memory n-gram lookups. At the Dublin MT Marathon, Mark Fishel and I worked on optimal batching of LMServer requests and got pretty far; the combination of Miles' RandLM and such a batched, remote LM interface could be a nice thing...

Cheers,
Christian

On Nov 17, 2011, at 2:53 PM, Hieu Hoang wrote:

hi peter

i think christian federmann worked on the remote LM: https://www.google.com/search?hl=enq=federmann+Very+large+language+models+for+machine+translation

however, IMO, the decoder is lacking the infrastructure to do remote LM. To do it well, the decoder has to batch the LM calls to avoid sending too many queries. Also, it has to make the calls asynchronously rather than wait for the LM query to complete. I'm not sure how far christian got, but i suspect this is a major undertaking.

ps. your email to the mailing list went through fine. Why did you think it didn't? http://news.gmane.org/gmane.comp.nlp.moses.user

On 17/11/2011 14:54, P.J. Berck wrote:

Hi, I was looking at the possibility of using a remote LM in moses, but I can't find any documentation.
I know about the 6 0 3 host:port specification in moses.ini, but a naive test just gives errors like Your data containss in a position other than the first word. Is there some kind of protocol I need to implement? What kind of results does moses expect? Thanks for pointers, -peter ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support -- When a place gets crowded enough to require ID's, social collapse is not far away. It is time to go elsewhere. The best thing about space travel is that it made it possible to go elsewhere. -- R.A. Heinlein, Time Enough For Love ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Remote LM protocol?
integration with the search process needs doing but the backend and batching of requests is done. Miles

On Feb 14, 2012 4:37 PM, Lane Schwartz dowob...@gmail.com wrote:

Cool. :) I'm definitely looking forward to giving it a try when it is released. Cheers, Lane

On Tue, Feb 14, 2012 at 10:33 AM, Miles Osborne mi...@inf.ed.ac.uk wrote:

Oliver is in the process of finishing it. Miles

[...]
[Moses-support] New multi-parallel corpus available (Indic Languages)
The Indic multi-parallel corpus consists of approximately 2000 Wikipedia sentences translated into the following Indic languages: Bengali, Hindi, Malayalam, Tamil, Telugu and Urdu. The data was translated by non-expert translators hired over Mechanical Turk, so it is of mixed quality. Every source segment was translated redundantly by four different Turkers. Note that we have translated paragraphs, so the data should be of interest to researchers looking at discourse as well as machine translation.

http://homepages.inf.ed.ac.uk/miles/babel.html

Miles Osborne (Edinburgh)
Chris Callison-Burch (JHU)
Re: [Moses-support] Filtering LMs
this can be done, but it tends not to save much space. also it does not help deal with OOVs, which the language model can still score even though they are not in the parallel set. if you are worried about saving space then you should look at either KenLM or RandLM. Miles

On 24 November 2011 12:58, Thomas Schoenemann thomas_schoenem...@yahoo.de wrote:

Dear all, I hope that this is not too stupid a question, and that it hasn't been asked recently. In the Moses EMS, when running experiments the phrase table is automatically reduced to only those phrases that actually occur in the respective dev/test set. Obviously this saves a lot of memory without changing the resulting translations. Now, I was wondering if something similar can be done/is done with the language model. That is, can one reduce the ARPA file to only those words that occur on the target side in the (filtered) phrase table? The objective would of course be to maintain the translation result. Would the LM software renormalize internally if some of the original entries are removed? Then the results would differ. This may even depend on which toolkit you use to load (rather than train) the ARPA file. I am using SRILM in my own translation programs, but would also be interested in other toolkits in case they behave more suitably. Can anyone point me to anything? Many thanks! Thomas Schoenemann (currently University of Pisa)
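Purely as an illustration of the filtering Thomas asks about, here is a rough Python sketch (not part of Moses or SRILM) that drops every ARPA entry containing a word outside a given target-side vocabulary. Note that it only deletes entries: nothing renormalises the surviving probabilities or backoff weights, which is exactly the caveat raised above.

```python
# Sketch: filter an ARPA language model to a target-side vocabulary.
# Assumes a plain-text ARPA file whose entries look like
# "logprob\tw1 w2 ...\t[backoff]".  Function name is hypothetical.

def filter_arpa(arpa_in, arpa_out, vocab):
    keep = set(vocab) | {"<s>", "</s>", "<unk>"}   # always keep the markers
    sections = {}        # n-gram order -> surviving entry lines
    order = None
    with open(arpa_in) as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("\\") and "-grams:" in line:
                order = int(line[1:line.index("-")])
                sections[order] = []
            elif order is not None and line and not line.startswith("\\"):
                fields = line.split("\t")
                if all(w in keep for w in fields[1].split()):
                    sections[order].append(line)
            elif line == "\\end\\":
                order = None
    with open(arpa_out, "w") as out:
        out.write("\\data\\\n")
        for n in sorted(sections):
            out.write("ngram %d=%d\n" % (n, len(sections[n])))
        for n in sorted(sections):
            out.write("\n\\%d-grams:\n" % n)
            out.write("\n".join(sections[n]) + "\n")
        out.write("\n\\end\\\n")
```

Whether a given toolkit will load such a truncated model unchanged (as opposed to renormalising) is, as the thread says, toolkit-dependent.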
Re: [Moses-support] Randomisation by MGIZA and tuning result is worse than no tuning
--in general, Machine Translation training is non-convex. this means that there are multiple solutions and each time you run a full training job, you will get different results. in particular, you will see different results when running Giza++ (any flavour) and MERT.

Is there no way to stop the variation in Giza++? I looked at the code but have no idea where it occurs.

no, this is a property of the task, not the method. put it another way: there is nothing which tells the model how words are translated. Giza++ makes a guess based upon how well it `explains' the training data (log-likelihood / cross entropy). there are many ways to achieve the same log-likelihood and each guess amounts to a different translation model. on average these alternative models will all be similar to each other (words are translated in similar ways), but in general you will find differences.

--the best way to deal with this (and the most expensive) would be to run the full pipeline, from scratch, multiple times. this will give you a feel for the variance --differences in results. in general, variance arising from Giza++ is less damaging than variance from MERT.

How many runs are enough for this? As you say, it would be very expensive to do so.

how long is a piece of string?

--to reduce variance it is best to use as much data as possible at each stage. (100 sentences for tuning is far too low; you should be using at least 1000 sentences). it is possible to reduce this variability by using better machine learning, but in general it will always be there.

What do you mean by better machine learning? Isn't the 500,000-word corpus enough? For the 1,000 sentences for tuning, can I use the same sentences as used in training, or should they be separate sets of sentences?

lattice MERT is an example, or the Berkeley Aligner. you cannot use the same sentences for training and tuning, as has been explained earlier on the list.

--another strategy I know about is to fix everything once you have a set of good weights and never rerun MERT. should you need to change, say, the language model, you then manually alter the associated weight. this will mean stability, but at the obvious cost of generality. it is also ugly.

Could you elaborate a bit on the "fix everything and never rerun MERT" part? Do you mean that after running n times, we find the best setting of the variables (there are so many of them) and don't run MERT, which I understand is for tuning?

if you have some problem that is fairly stable (uses the same training set, language models etc) then after running MERT many times and evaluating on a disjoint test set, you pick the weights that produce good results. afterwards you do not re-run MERT even if you have changed the model. as i mentioned, this is ugly and something you do not want to do unless you are forced to.

Miles

Thanks and sorry to answer it with more questions. Cheers, Jelita
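The "rerun the pipeline several times and look at the spread" advice boils down to simple summary statistics over the per-run scores. A minimal sketch (the BLEU numbers below are invented placeholders, one per independent tuning run):

```python
# Sketch: quantify optimiser variance by summarising the test-set BLEU of
# several independent MERT reruns.  The scores are hypothetical.
import statistics

bleu_by_run = [24.1, 23.6, 24.5, 23.9, 24.2]   # one value per full rerun

mean = statistics.mean(bleu_by_run)
stdev = statistics.stdev(bleu_by_run)          # sample standard deviation
print(f"mean BLEU {mean:.2f} +/- {stdev:.2f} over {len(bleu_by_run)} runs")
```

If the standard deviation is comparable to the improvement you are trying to measure, the comparison is not telling you anything, which is the point Miles makes about small tuning sets.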
Re: [Moses-support] Various questions about training and tuning
re: not tuning on training data, in principle this shouldn't matter (especially if the tuning set is large and/or representative of the task). in reality, Moses will assign far too much weight to these examples, to the detriment of the others (it will drastically overfit). this is why the tuning and training sets are typically disjoint. this is a standard tactic in NLP and not just Moses.

re: assigning more weight to certain translations, you have two options here. the first would be to assign more weight to these pairs when you run Giza++ (you can assign per-sentence-pair weights at this stage). this is really just a hint and won't guarantee anything. the second option would be to force translations (using the XML markup).

Miles

On 18 November 2011 08:42, Jehan Pages je...@mygengo.com wrote:

Hi, On Fri, Nov 18, 2011 at 2:59 PM, Tom Hoar tah...@precisiontranslationtools.com wrote: Jehan, here are my strategies, others may vary. Thanks.

1/ the 100-word (token) limit is a dependency of GIZA++ and MGIZA++, not just a convenience for speed. If you make the effort to use the BerkeleyAligner, this limit disappears.

Ok, I didn't know about this alternative to GIZA++. I see there is some explanation on the website for switching to this aligner. I may give it a try someday then. :-)

2/ From a statistics and survey methodology point of view, your training data is a subset of individual samples selected from a whole population (linguistic domain) so as to estimate the characteristics of the whole population. So, duplicates can exist and they play an important role in determining statistical significance and calculating probabilities. Some data sources, however, repeat information with little relevance to the linguistic balance of the whole domain. One example is a web site with repetitive menus on every page. Therefore, for our use, we keep duplicates where we believe they represent a balanced sampling and the results we want to achieve. We remove them when they do not. Not everyone, however, agrees with this approach.

I see. And that confirms my thoughts. I don't know for sure what my strategy will be, but most probably it will be to keep them all. Making conditional removal like you do is interesting, but that would prove hard to do on our platform as we don't store context for translations.

3/ Yes, none of the data pairs in the tuning set should be present in your training data. To do so skews the tuning weights to give excellent BLEU scores on the tuning results, but horrible scores on real-world translations.

I am not sure I understand what you say. How does that happen? And why would that give horrible scores on real-world translations? Isn't the point exactly that the tuning data should represent the real-world translations that we want to get close to?

4/ Also, something else I just remembered, so this will be a fourth question! Suppose that in our system we have some translations we know for sure are very good (all are good, but some are supposed to be of certified quality). Is there any way in Moses to give more weight to some translations in order to bias the system towards quality data (while still keeping all the data)?

Thanks again! Jehan

Tom

On Fri, 18 Nov 2011 14:31:44 +0900, Jehan Pages je...@mygengo.com wrote:

Hi all, I have a few questions about the quality of training and tuning. If anyone has any clarifications, that would be nice! :-)

1/ According to the documentation: « sentences longer than 100 words (and their corresponding translations) have to be eliminated (note that a shorter sentence length limit will speed up training) ». So is it only for the sake of training speed, or can overly long sentences end up being a liability for MT quality? In other words, when I finally train for real usage, should I really remove long sentences?

2/ My data is taken from real crowd-sourced translated data. As a consequence, we end up with some duplicates (same original text and same translation). I wonder whether, for training, that doesn't matter, or whether we should remove duplicates, or whether it is actually better to keep duplicates. I would imagine the latter (keeping duplicates) is best, as this is statistical machine learning and, after all, these represent real-life duplicates (text we often encounter and apparently usually translate the same way), so it would be good to emphasise these translations during training. Am I right?

3/ Do training and tuning data necessarily have to be different? I guess for it to be meaningful it should, and various examples on the website seem to go that way, but I could not find anything clearly stating this.

Thanks. Jehan
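The "tuning and training must be disjoint" point interacts with the duplicates question above: if a sentence pair occurs twice in the corpus, a naive random split can put one copy in each half. A minimal sketch of a split that guards against this (function and file handling are hypothetical; one sentence per line, source and target kept in step):

```python
# Sketch: carve a disjoint tuning set out of a parallel corpus, dropping
# any tuning pair that (because of corpus duplicates) also survives in
# the training half.
import random

def split_corpus(src_lines, tgt_lines, tune_size, seed=7):
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)          # deterministic shuffle
    tune, train = pairs[:tune_size], pairs[tune_size:]
    train_set = set(train)
    # guard: no tuning pair may also appear verbatim in training
    tune = [p for p in tune if p not in train_set]
    return train, tune
```

Dropping the overlapping pairs from the tuning side (rather than from training) keeps the training distribution intact while preserving the disjointness that tuning requires.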
Re: [Moses-support] Remote LM protocol?
we have been working on making distributed LMs efficient. stay tuned. Miles

On 17 November 2011 13:53, Hieu Hoang hieuho...@gmail.com wrote:

hi peter, i think christian federmann worked on the remote LM: https://www.google.com/search?hl=enq=federmann+Very+large+language+models+for+machine+translation however, IMO, the decoder is lacking the infrastructure to do remote LM. to do it well, the decoder has to batch the LM calls to minimise the number of queries. Also, it has to make the calls asynchronously rather than wait for each LM query to complete. I'm not sure how far christian got but i suspect this is a major undertaking. ps. your email to the mailing list went through fine. Why did you think it didn't? http://news.gmane.org/gmane.comp.nlp.moses.user

On 17/11/2011 14:54, P.J. Berck wrote:

Hi, I was looking at the possibility to use a remote LM in moses, but I can't find any documentation. I know about the "6 0 3 host:port" specification in moses.ini, but a naive test just gives errors like "Your data contains <s> in a position other than the first word". Is there some kind of protocol I need to implement? What kind of results does moses expect? Thanks for pointers, -peter
Re: [Moses-support] Multi-run mert to average non-deterministic results
Question: do you think it's better to run mert-moses.pl more times with smaller sets, or fewer times with larger sets?

you should run tuning with larger sets, multiple times. no amount of rerunning tuning on a small set will tell you anything. Miles

On 7 November 2011 13:45, Tom Hoar tah...@precisiontranslationtools.com wrote:

A recent list thread recommended running mert several times and averaging the various non-deterministic results. If we adopt multiple mert tests, I want to optimize the sizes of the tuning/test sets without taking too many segments from the total population. Currently, we extract a statistically significant number of randomly selected segments (pairs) for one tuning set and one test set. We calculate the sample size with a basic population sampling formula that uses the population size, a user-selected confidence level and a confidence interval (e.g. 97% ±2%). We always assume an equal probabilistic proportion (50/50), which I understand results in the largest sample. Of course, higher confidence levels with tighter intervals result in larger tuning/testing sample sizes. Reducing the confidence level, for example to 90% with an interval of ±5%, gives significantly smaller random sample sets. Smaller random sample sets are less representative of the overall population, but mert-moses.pl runs faster, allowing us to evaluate more sets. Question: do you think it's better to run mert-moses.pl more times with smaller sets, or fewer times with larger sets?
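The sampling formula Tom describes is presumably Cochran's sample-size formula with a finite-population correction; a sketch under that assumption, using the 50/50 proportion he mentions (z-values are the usual two-tailed normal quantiles):

```python
# Sketch of the sample-size calculation described above: Cochran's formula
# n0 = z^2 * p(1-p) / e^2, then a finite-population correction.
import math

Z = {0.90: 1.645, 0.95: 1.960, 0.97: 2.170}    # confidence level -> z

def sample_size(population, confidence, interval, p=0.5):
    z = Z[confidence]
    n0 = z * z * p * (1 - p) / (interval * interval)    # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / population))  # finite-population correction

print(sample_size(1_000_000, 0.90, 0.05))  # -> 271   (small tuning set)
print(sample_size(1_000_000, 0.97, 0.02))  # -> 2935  (larger, tighter set)
```

The two examples show the trade-off in the post: relaxing from 97% ±2% to 90% ±5% shrinks the set by roughly a factor of ten, which is exactly why the smaller sets are tempting and, per Miles's reply, why they tell you so little.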
Re: [Moses-support] Multi-threading / Boost lib / compile error for threaded Moses
that doesn't work, as all of the locking code etc would still be invoked. you really want something like --threads 0 which should bypass everything and truly run in single-threaded mode. Miles

On 22 September 2011 10:26, Kenneth Heafield mo...@kheafield.com wrote: -threads 1 ?

On 09/22/11 10:06, Tom Hoar wrote: Re: the survey. I suggest that if multi-threading is always enabled, there should be a command-line option that allows users to disable multi-threading for debugging. Tom

On Thu, 22 Sep 2011 09:56:57 +0100, Kenneth Heafield mo...@kheafield.com wrote: My fault. Sorry. Fixed.

On 09/22/11 09:41, Hieu Hoang wrote:

hiya, there's currently a compile error in trunk when multi-threading is enabled. However, I think the root cause of the problem is that there are currently too many compile flags, so developers can't test the different combinations. Specifically, the boost library and multi-threading options. I've made a little poll to see if people want to make the Boost library a prerequisite, with threading always turned on: http://www.doodle.com/g7tgw778m9mp7dvw The poll also asks if you're willing to chip in and help out whichever way you vote. Having Boost only as an option makes it difficult to develop in Moses and makes it error prone, as we see with the compile error. Mandating Boost may mean some people have to install the correct Boost version on their machine. There may be Boost questions on this mailing list as a result. Hieu

ps. the compile error is:

/bin/sh ../../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I../.. -W -Wall -ffor-scope -D_FILE_OFFSET_BITS=64 -D_LARGE_FILES -pthread -DTRACE_ENABLE=1 -DWITH_THREADS -I/home/s0565741/workspace/srilm/include -I/home/s0565741/workspace/sourceforge/trunk/kenlm -g -O2 -MT AlignmentInfo.lo -MD -MP -MF .deps/AlignmentInfo.Tpo -c -o AlignmentInfo.lo AlignmentInfo.cpp
libtool: compile: g++ -DHAVE_CONFIG_H -I. -I../.. -W -Wall -ffor-scope -D_FILE_OFFSET_BITS=64 -D_LARGE_FILES -pthread -DTRACE_ENABLE=1 -DWITH_THREADS -I/home/s0565741/workspace/srilm/include -I/home/s0565741/workspace/sourceforge/trunk/kenlm -g -O2 -MT AlignmentInfo.lo -MD -MP -MF .deps/AlignmentInfo.Tpo -c AlignmentInfo.cpp -o AlignmentInfo.o
In file included from StaticData.h:41:0, from AlignmentInfo.cpp:23:
FactorCollection.h: In member function ‘bool Moses::FactorCollection::EqualsFactor::operator()(const Moses::Factor, const Moses::FactorFriend) const’:
FactorCollection.h:80:19: error: ‘const class Moses::Factor’ has no member named ‘in’
make[3]: *** [AlignmentInfo.lo] Error 1
make[3]: Leaving directory `/disk1/hieu/workspace/sourceforge/trunk/moses/src'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/disk1/hieu/workspace/sourceforge/trunk/moses/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/disk1/hieu/workspace/sourceforge/trunk'
make: *** [all] Error 2
Re: [Moses-support] Multi-threading / Boost lib / compile error for threaded Moses
this has nothing to do with speed, more about actual debugging. running just a single thread is a special case of running multiple threads, so all the code to ensure things being safe is the same in all situations. should someone want to debug with no threading, then there would need to be a mess of ifdefs removing all support for threading. i agree, this will be a pain to deal with, but this is what debugging with no threads means. Miles

On 22 September 2011 10:43, Kenneth Heafield mo...@kheafield.com wrote:

It works for debugging. Perhaps your argument is that the single-threaded version will be slower due to unnecessary locking. My response is that, if you care about performance, then you shouldn't be running single-threaded. Wrapping every lock in an if statement is arguably worse than wrapping them in ifdefs, especially due to the RAII nature of boost locks. So compile-time does a better job of meeting a goal that I don't buy into.

On 09/22/11 10:31, Miles Osborne wrote:

that doesn't work, as all of the locking code etc would still be invoked. you really want something like --threads 0 which should bypass everything and truly run in single-threaded mode. Miles

[...]

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
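To make the "if at every lock" versus "ifdef the locks away" debate concrete, here is a toy illustration in Python rather than the actual Moses C++ (which guards boost mutexes behind #ifdef WITH_THREADS): select a real lock or a do-nothing lock once at startup, so a single-threaded run pays no locking cost without a branch at every acquisition.

```python
# Sketch: pick the lock implementation once, instead of branching on every
# acquisition.  NullLock and make_lock are hypothetical names.
import threading

class NullLock:
    """Context manager that does nothing -- stands in for 'no locking at all'."""
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False

def make_lock(threaded):
    # chosen once at startup, analogous to a hypothetical --threads 0 mode
    return threading.Lock() if threaded else NullLock()

counter = 0
lock = make_lock(threaded=False)
with lock:                  # no real lock acquired in single-threaded mode
    counter += 1
```

This sidesteps Ken's RAII objection (the call sites are identical either way) while still giving Miles a build in which no threading primitive is ever touched, at the cost of one extra indirection.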
Re: [Moses-support] Multi-threading / Boost lib / compile error for threaded Moses
this is the last thing i will post here on this subject: debugging with a single thread running still invokes the threading code. ***if you suspect that this is somehow broken, then you need to debug without it***. it is that simple. running gdb in single-thread mode still uses threading. Miles

On 22 September 2011 11:28, Kenneth Heafield mo...@kheafield.com wrote:

But I don't see a use case for it. I can run gdb just fine on a multithreaded program that happens to be running one thread. And the stderr output will be in order.

On 09/22/11 11:21, Miles Osborne wrote:

should someone want to debug with no threading, then there would need to be a mess of ifdefs removing all support for threading. i agree, this will be a pain to deal with, but this is what debugging with no threads means.
Re: [Moses-support] Phrase probabilities
exactly, the only correct way to get real probabilities out would be to compute the normalising constant and renormalise the dot products for each phrase pair. remember that this is best thought of as a set of scores, weighted such that the relative proportions of each model are balanced. Miles

On 20 September 2011 16:07, Burger, John D. j...@mitre.org wrote:

Taylor Rose wrote: I am looking at pruning phrase tables for the experiment I'm working on. I'm not sure if it would be a good idea to include the 'penalty' metric when calculating probability. It is my understanding that multiplying 4 or 5 of the metrics from the phrase table would result in a probability of the phrase being correct. Is this a good understanding or am I missing something?

I don't think this is correct. At runtime all the features from the phrase table and a number of other features, some only available during decoding, are combined in an inner product with a weight vector to score partial translations. I believe it's fair to say that at no point is there an explicit modeling of a probability of the phrase being correct, at least not in isolation from the partially translated sentence. This is not to say you couldn't model this yourself, of course. - John Burger, MITRE
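The renormalisation Miles describes amounts to a softmax over the candidate translations of one source phrase: exponentiate each dot-product score and divide by their sum. A sketch with invented scores and phrase pairs:

```python
# Sketch: turn unnormalised log-linear scores for the candidate translations
# of a single source phrase into real probabilities.  Candidates and scores
# are made up for illustration.
import math

def renormalise(scores):
    """scores: candidate translation -> dot product of weights and
    log feature values.  Returns a proper probability distribution."""
    z = sum(math.exp(s) for s in scores.values())   # the normalising constant
    return {cand: math.exp(s) / z for cand, s in scores.items()}

probs = renormalise({"the house": -1.2, "the home": -2.0, "house": -3.5})
```

Note that this only normalises within one source phrase's candidate set, which is the sense in which the raw feature values are "a set of scores" rather than probabilities.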
Re: [Moses-support] Phrase probabilities
some terminology: these are feature values, not metrics. feature values have a number of roles to play, eg P(e | f) indicates the chance that phrase e should be the translation of phrase f. these values are designed to be used together, and weighted, to produce an overall score for a translation choice. this is the basis of a log-linear model.

if you take them all and multiply them together then I guess that is equivalent to assuming each is equally weighted, and you have something like the geometric mean of them (a product of logs, without the divisor). you may well be able to use the scores in the way you suggest, but whether you get `good' or `bad' results will be down to chance.

if you want to prune the phrase table then a starting point is here: http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc16

Miles

On 20 September 2011 16:47, Taylor Rose tr...@languageintelligence.com wrote:

So what exactly can I infer from the metrics in the phrase table? I want to be able to compare phrases to each other. From my experience, multiplying them and sorting by that number has given me more accurate phrases... Obviously calling that metric probability is wrong. My question is: what is that metric best indicative of?

-- Taylor Rose, Machine Translation Intern, Language Intelligence. IRC handle: trose, server: freenode

On Tue, 2011-09-20 at 16:14 +0100, Miles Osborne wrote:

exactly, the only correct way to get real probabilities out would be to compute the normalising constant and renormalise the dot products for each phrase pair. remember that this is best thought of as a set of scores, weighted such that the relative proportions of each model are balanced. Miles

[...]
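Miles's point that the plain product is an equal-weight log-linear score can be seen directly in a few lines (the feature values and weights below are invented for illustration):

```python
# Sketch: unweighted product of phrase-table feature values versus a
# weighted log-linear score.  Numbers are hypothetical.
import math

features = [0.20, 0.35, 0.10, 0.50]   # e.g. p(f|e), lex(f|e), p(e|f), lex(e|f)
weights  = [0.30, 0.20, 0.30, 0.20]   # hypothetical tuned weights

# what the poster computed: a plain product of the feature values
unweighted_product = math.prod(features)

# what Moses actually scores with: a weighted sum of log feature values
log_linear = sum(w * math.log(v) for w, v in zip(weights, features))

# the plain product is a monotone function of the equal-weight score,
# i.e. it ranks candidates like an (unnormalised) geometric mean
equal_weight = sum(0.25 * math.log(v) for v in features)
```

Since exp(4 * equal_weight) equals the plain product, sorting by the product is exactly sorting by the equal-weight score, so any agreement with the tuned weighting is, as Miles says, down to chance.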
Re: [Moses-support] build 5 gram with SRILM and moses
yes

On 6 September 2011 17:28, Cyrine NASRI cyrine.na...@gmail.com wrote: Hi all, Is it possible to use a 5-gram language model built by SRILM with Moses? Thanks, Best, Cyrine -- Cyrine, Ph.D. Student in Computer Science
Re: [Moses-support] KenLM build-binary trie problems (SRILM ARPA file)
for SRILM, you use the -unk flag; RandLM does this by default if I recall. Miles

On 16 August 2011 06:28, Tom Hoar tah...@precisiontranslationtools.com wrote: Ken, Does the online Moses documentation describe how to ensure the language model has <unk> in the vocabulary? I've never seen it. What's the best way to ensure an LM has the <unk> token in the vocabulary? Is it as simple as appending one line consisting of one <unk> token to the language model corpus? Or is there a command line switch for ngram-count, build-lm.sh, or buildlm? Or should we just edit the raw text language model and add it to the vocabulary manually? Thanks, Tom

On Mon, 15 Aug 2011 22:12:36 +0100, Kenneth Heafield mo...@kheafield.com wrote: Ok, I have reproduced the problem. It only happens when the ARPA file is missing <unk> and is probably an off-by-one on vocabulary size. I'll have a fix soon. Kenneth

On 08/15/11 19:20, Kenneth Heafield wrote: Hi, Back from vacation and sorry but I'm having trouble reproducing this locally. - Latest Moses (revision 4143); I haven't made any changes that should impact language modeling since 4096. - svn status says the relevant source code is unmodified. - Tried an SRI model, including rebuilding with the build_binary that ships with Moses. - Ran threaded and not threaded. Can you send me your very small SRILM model? Does it have <unk>? Kenneth

On 08/04/11 11:42, Kenneth Heafield wrote: Sorry I am slow to respond. This is my first thing to look at, but I am traveling a lot through the 14th. Alex Fraser alexfra...@gmail.com wrote: Hi Kenneth -- Latest revision, 4096. Single threaded also crashes. Cheers, Alex

On Fri, Jul 29, 2011 at 6:00 PM, Kenneth Heafield mo...@kheafield.com wrote: Hi, There was a problem with this; I thought it was fixed but maybe it came back. Which revision are you running? Does it still happen if you run single-threaded?
Kenneth

On 07/29/11 09:39, Alex Fraser wrote: Hi Folks, Tom Hoar previously mentioned that he had a problem with KenLMs built from SRILM crashing Moses. Fabienne Cap and I also have had a problem with this. It seems to be restricted to using the trie option with build-binary. Ken, if you have any problems reproducing this, please let me know. I can send you a very small SRILM-trained language model that crashes Moses when converted to binary with the trie option, but works fine as a probing binary and using the original ARPA. (BTW, this is running the decoder multi-threaded, and the crash comes at some point during decoding the first sentence, not during loading files.) Cheers, Alex
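To Tom's question: SRILM's -unk flag (which trains an open-vocabulary model) is the supported route. As a quick sanity check on an existing model, a short sketch like the following (an illustration, not part of any toolkit) can scan the ARPA file's unigram section for an <unk> entry:

```python
def arpa_has_unk(lines):
    """Return True if the \\1-grams: section of an ARPA-format language
    model contains an <unk> entry. `lines` is any iterable of lines,
    e.g. an open file handle."""
    in_unigrams = False
    for raw in lines:
        line = raw.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if in_unigrams:
            if line.startswith("\\"):  # start of the next section
                break
            fields = line.split("\t")
            # unigram lines are: logprob <tab> word [<tab> backoff]
            if len(fields) >= 2 and fields[1] == "<unk>":
                return True
    return False

# usage: arpa_has_unk(open("model.arpa"))
```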
Re: [Moses-support] Improvements to MERT
good to see the variance reduction. why not repeat this with more features? you should see a greater effect this way. an easy way to do this is to just add more language models. Miles

On 11 August 2011 19:53, Philipp Koehn pko...@inf.ed.ac.uk wrote: Hi, I added a number of improvements to MERT that have been recently proposed in the literature, with the aim of supporting more features and greater stability. The improvements are: (1) Optimization in random directions [Cer et al., 2008] (2) Re-use of the best weight settings from the last n iterations as starting points [Foster and Kuhn, 2009] (3) Pairwise-Ranked Optimization [Hopkins and May, 2011]

To give some more details: (1) Traditional MERT optimizes each parameter in isolation, finding the best gain for any parameter, applying it, and repeating this process until convergence. With the switch -number-of-random-directions NUM, in addition to these directions of exploring the multi-dimensional weight space, a specified number of random directions are also explored. (2) In each iteration of running the decoder to produce n-best lists and optimizing weights, the first starting point is the last best weight setting found; 20 additional starting points are randomly generated. With the switch -historic-best, the best weights found in each prior iteration are used as starting points in addition to the random starting points. (3) A recent paper proposed an alternative to MERT that trains a classifier to predict which of two candidates in the n-best list is better. Candidates are randomly sampled (with a bias towards candidates with large metric score differences) and passed to a standard linear classifier (maximum entropy, support vector machines, etc.). The current Moses implementation uses MegaM by Hal Daume (check for license terms). This alternative to traditional MERT can be used with the switch -pairwise-ranked.
Notes:
* The indicated switches are specified either when calling mert-moses.pl or in the parameter tuning-settings in EMS.
* Option (3) is incompatible with (1) and (2), but the latter two can be used together.
* For -number-of-random-directions I used 50 random directions, which slows down MERT quite a bit.
* Option (3) does not converge under the current Moses stopping criteria, so it runs for 25 iterations; you may want to reduce this to 10 with the additional switch -max-iterations 10.

Some results:

Urdu-English, SAMT Model
MERT setting                 iterations       tuning set        test set
baseline                     11.6 (std 4.8)   22.73 (std 0.07)  21.54 (std 0.38)
50 random directions         9.4 (std 2.3)    22.82 (std 0.14)  *21.58* (std 0.38)
rand.dir. + historic best    9.2 (std 5.9)    22.79 (std 0.23)  21.40 (std 0.37)
pairwise-ranked max-iter 10  10               -                 21.33 *(std 0.13)*

Urdu-English, Hierarchical Model
MERT setting                 iterations       tuning set        test set
baseline                     8.8 (std 2.2)    23.91 (std 0.18)  *23.02* (std 0.42)
50 random directions         8.4 (std 3.3)    23.85 (std 0.35)  22.80 (std 0.70)
rand.dir. + historic best    12.0 (std 3.5)   24.03 (std 0.23)  22.89 *(std 0.18)*
pairwise-ranked max-iter 10  10               -                 21.93 (std 0.36)

German-English, Phrase-based
MERT setting                 iterations       tuning set        test set
baseline                     7.2 (std 14.3)   24.82 (std 0.04)  *21.29* (std 0.05)
rand.dir. + historic best    6.6 (std 1.8)    24.88 (std 0.07)  21.28 (std 0.16)
pairwise-ranked max-iter 10  10               -                 *21.29 (std 0.02)*

German-English, Factored Backoff
MERT setting                 iterations       tuning set        test set
baseline                     12.0 (std 15.2)  24.89 (std 0.25)  21.35 (std 0.15)
rand.dir. + historic best    11.4 (std 7.6)   25.01 (std 0.12)  21.45 (std 0.12)
pairwise-ranked              25               -                 *21.58 (std 0.11)*
pairwise-ranked max-iter 10  10               -                 21.54 (std 0.10)

Results are reported over 5 runs of each optimization method, in terms of average and standard deviation. What we are looking for is high test set scores and low variance.
The Urdu-English systems use a smaller tuning set of fewer than 1,000 sentences (with 4 references), so I would put less faith in those numbers. The test set for German-English is WMT 2011. Your mileage may vary, but it is worth a try. -phi
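The sampling step of pairwise-ranked optimization (3) can be sketched roughly as follows. This is a simplification under stated assumptions: Hopkins and May sample with a probabilistic bias toward large metric differences, while this sketch simply thresholds on the difference, and the classifier step (MegaM in the Moses implementation) is omitted.

```python
import random

def sample_pro_pairs(nbest, n_samples=50, min_diff=0.05, seed=0):
    """Sample training examples for a pairwise ranking classifier.

    `nbest` is a list of (feature_vector, metric_score) pairs for one
    sentence. Returns feature-difference vectors (better minus worse),
    each a positive example for a linear classifier."""
    rng = random.Random(seed)
    examples = []
    attempts = 0
    while len(examples) < n_samples and attempts < 100 * n_samples:
        attempts += 1
        a, b = rng.sample(nbest, 2)
        if abs(a[1] - b[1]) < min_diff:  # crude stand-in for biased sampling
            continue
        better, worse = (a, b) if a[1] > b[1] else (b, a)
        examples.append([x - y for x, y in zip(better[0], worse[0])])
    return examples
```

A classifier trained on these difference vectors (against their negations as negative examples) yields the weight vector directly.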
Re: [Moses-support] Running Giza++ on subsets of data
that isn't the expected answer here. i think the OP wants some kind of incremental (re)training. firstly: it is not really possible to guarantee that performance is not degraded when running from subsets up to the full set (compared with just running on the full set). secondly, you may wish to investigate a version of GIZA which supports incremental retraining. this would allow you to train on a subset and then add more and more data, without retraining from scratch at each point. the current version has minimal documentation, but this is hopefully being fixed right now. if you are feeling brave, look here: http://code.google.com/p/inc-giza-pp/ Miles

On 15 June 2011 18:50, Kenneth Heafield mo...@kheafield.com wrote: Try using MGIZA: http://geek.kyloo.net/software/doku.php/mgiza:overview

On 06/15/11 04:51, Prasanth K wrote: Hello All, I am conducting a series of experiments to build translation systems using Moses in which the corpus of the current experiment is a subset of the corpora used in the previous experiment. I have started with the Europarl corpora and am likely to repeat this process about 20 times. Unless I am mistaken, this is going to take me nearly a month, and I am looking for ways to speed up the whole process. Is there any optimal way to run GIZA++ on these different subsets of data without having to run it again and again? I do not want to use the alignments obtained from running GIZA++ on the entire Europarl corpora for the other experiments (by selecting the alignment information from aligned.grow-final-and-diag for the sentences in the subsets). The order of the experiments does not matter, so the experiments can be done on the smallest dataset followed by supersets of the previous dataset, provided there is a way to modify the translation probabilities from GIZA++ using just the additional data alone and this does not affect the performance of GIZA++ in comparison to when GIZA++ is run on the corpus stand-alone.
Kindly let me know if there is some way to do this and I am missing it. - regards, Prasanth -- Theories have four stages of acceptance. i) this is worthless nonsense; ii) this is an interesting, but perverse, point of view; iii) this is true, but quite unimportant; iv) I always said so. --- J.B.S. Haldane
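The "stored sufficient statistics" idea behind the incremental retraining Miles mentions can be illustrated with a toy sketch (this is not inc-giza-pp's actual code; real incremental GIZA must also update the alignment model parameters): if raw counts are kept around, a new batch of data updates the estimates without re-reading the old data.

```python
from collections import defaultdict

class IncrementalCounts:
    """Toy incremental re-estimation of P(e|f) from aligned word pairs."""

    def __init__(self):
        self.pair_count = defaultdict(int)
        self.source_count = defaultdict(int)

    def add_batch(self, aligned_pairs):
        """Fold a new batch of (source, target) word pairs into the counts."""
        for f, e in aligned_pairs:
            self.pair_count[(f, e)] += 1
            self.source_count[f] += 1

    def p_e_given_f(self, f, e):
        if self.source_count[f] == 0:
            return 0.0
        return self.pair_count[(f, e)] / self.source_count[f]
```

Each superset experiment then only pays for its additional data, which is exactly what rerunning GIZA++ from scratch cannot do.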
Re: [Moses-support] Running Giza++ on subsets of data
it is this: Abby Levenberg, Chris Callison-Burch and Miles Osborne. Stream-based Translation Models for Statistical Machine Translation. NAACL, Los Angeles, USA, 2010. http://homepages.inf.ed.ac.uk/miles/papers/naacl10b.pdf Miles

On 15 June 2011 19:28, Qin Gao q...@cs.cmu.edu wrote: Yes, MGIZA isn't really incremental training; it only initializes the model parameters with those trained previously, since it does not store sufficient statistics of the previous training. It will give bad performance if 1) you train only model 1, or 2) the incremental data or subset is really small. It is more suitable for the following scenario: you train a model on corpus A, have new data B, and want to train several iterations of model 4 on A+B. For incremental-training GIZA, do you know whether it uses online EM (as in Liang and Klein 2009) or just stores the sufficient statistics of previous training? --Q
Re: [Moses-support] How to change phrase representation
the simplest approach would be to use another character to join words together. the tokeniser thinks you have hyphenated words, which is probably not what you want. Miles

On 13 June 2011 18:39, Anna c annac...@hotmail.com wrote: Hi, I've tried what you suggested, but I'm not sure if I'm doing it right... I've replaced all the occurrences in the input files as you said, adding a '~' between the words (as in the~man), but when I look at the file training.tok.en or training.tok.es (resulting from the first steps in the guide), the words have been separated and it appears as the ~ man. Should I change tokenizer.perl to ignore the '~', or should I skip those steps? Or is it correct that way? Thank you very much! Best regards, Anna

Date: Fri, 10 Jun 2011 10:48:07 +0100 Subject: Re: [Moses-support] How to change phrase representation From: pko...@inf.ed.ac.uk To: annac...@hotmail.com CC: moses-support@mit.edu Hi, I am not entirely sure if I fully understand your question, but let me try to answer. the phrase-based model implementation considers tokens separated by white space as words. It does also learn translation entries for sequences of words (phrases). If you want to group words into larger tokens, then you have to replace the white spaces. For instance, if you want to force the training setup and decoder to treat "the man" as a unit, then you should replace all occurrences (in training data and decoder input) with the~man. -phi

On Fri, Jun 10, 2011 at 10:38 AM, Anna c annac...@hotmail.com wrote: Hi! I'm doing a master's degree and I need some help with one of my subjects. I've already installed GIZA++ and Moses correctly, and followed the step-by-step guide on the web, checking that everything was ok. But I'm a newbie in this and I'm a bit lost. What I have to do is to change the representation so the basic unit won't be the word, but pairs or triplets of words, and compare it with the normal representation. How do I do that?
Do I have to change the preparation step in the training? Thank you very much! Best regards, Anna
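Combining Philipp's and Miles's advice, the grouping step can be sketched like this (an illustration, not part of the Moses scripts; the unit list is invented): run the tokenizer first, then join each multi-word unit with a character the tokenizer will not split, so "the man" becomes the single token "the~man".

```python
import re

def join_units(tokenized_text, units, sep="~"):
    """Rejoin multi-word units into single tokens after tokenisation.

    `units` is a list of space-separated word sequences to treat as one
    token, e.g. ["the man"]. Longer units are applied first so they are
    not broken up by their own sub-units."""
    for unit in sorted(units, key=len, reverse=True):
        pattern = r"\b" + re.escape(unit) + r"\b"
        tokenized_text = re.sub(pattern, unit.replace(" ", sep), tokenized_text)
    return tokenized_text
```

As Philipp notes, the same replacement must be applied to both the training data and the decoder input.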
Re: [Moses-support] experiment.perl with IRSTLM only (no SRILM installed)
is this after running with SRILM? if so, then look for the script which creates the LM and delete it. that should force it to be re-created, using IRSTLM. Miles

On 27 May 2011 09:16, Greg Wilson gre...@gmail.com wrote: Hi, first let me thank the people who are making Moses available, your work is very appreciated! I am trying to run experiment.perl on an installation with only IRSTLM (SRILM is not installed). This works perfectly fine if I do the experiments manually. I followed this instruction for configuring the experiment to only use IRSTLM: http://www.statmt.org/moses/?n=FactoredTraining.EMS#ntoc13 The instruction is quite clear: uncomment lm-binarizer and lm-quantizer, which I did. The problem is that it seems like experiment.perl still tries to use SRILM:

perl $m/scripts-20110520-1542/ems/experiment.perl -config config.toy
Use of implicit split to @_ is deprecated at /usr/local/bin/scripts-20110520-1542/ems/experiment.perl line 2145.
STARTING UP AS PROCESS 14381 ON liveserver0 AT Fri May 27 06:45:00 UTC 2011
LOAD CONFIG...
find: `/usr/local/srilm/bin/i686/ngram-count*': No such file or directory
LM:lm-training: file /usr/local/srilm/bin/i686/ngram-count does not exist!
find: `/usr/local/srilm/bin/i686*': No such file or directory
GENERAL:srilm-dir: file /usr/local/srilm/bin/i686 does not exist!
Died at /usr/local/bin/scripts-20110520-1542/ems/experiment.perl line 360.

Is it possible to do what I want, i.e. to configure an experiment.perl experiment to only use IRSTLM, or are there hardwired calls to SRILM somewhere in there? Thankful for any advice, /Greg
Re: [Moses-support] Can't compile latest Moses with irstlm and srilm
It looks like you are using 64 bit versions eg srilm. Make sure everything is 32 bit Miles On 21 May 2011 13:45, Bartosz Grabski bartosz.grab...@gmail.com wrote: Hello, I'm using quite fresh Ubuntu 11.04 (on a 32bit machine). I downloaded and compiled latest srilm and irstlm (not without some troubles), then downloaded latest Moses from sourceforge. I ran regenerate-makefiles.sh, and configure (with srilm and irstlm). Then after make I get following errors. Do you have any suggestions? Thanks in advance. make all-recursive make[1]: Entering directory `/home/bar/moses' Making all in kenlm make[2]: Entering directory `/home/bar/moses/kenlm' /bin/sh ../libtool --tag=CXX --mode=link g++ -g -O2 -L/home/bar/lm/srilm/lib/i686 -L/home/bar/lm/srilm/flm/obj/i686 -L/home/bar/lm/irstlm/lib -o build_binary build_binary.o libkenlm.la -loolm -loolm -ldstruct -lmisc -lflm -lirstlm -lz libtool: link: g++ -g -O2 -o build_binary build_binary.o -L/home/bar/lm/srilm/lib/i686 -L/home/bar/lm/srilm/flm/obj/i686 -L/home/bar/lm/irstlm/lib ./.libs/libkenlm.a -loolm -ldstruct -lmisc -lflm -lirstlm -lz build_binary.o: In function `lm::ngram::(anonymous namespace)::ParseFloat(char const*)': /home/bar/moses/kenlm/lm/build_binary.cc:44: undefined reference to `util::ParseNumberException::ParseNumberException(StringPiece)' build_binary.o: In function `main': /home/bar/moses/kenlm/lm/build_binary.cc:77: undefined reference to `lm::ngram::Config::Config()' /home/bar/moses/kenlm/lm/build_binary.cc:115: undefined reference to `lm::ngram::detail::GenericModellm::ngram::trie::TrieSearch, lm::ngram::SortedVocabulary::GenericModel(char const*, lm::ngram::Config const)' build_binary.o: In function `~SortedVocabulary': /home/bar/moses/kenlm/./lm/vocab.hh:46: undefined reference to `lm::base::Vocabulary::~Vocabulary()' build_binary.o: In function `util::scoped_memory::reset()': /home/bar/moses/kenlm/./util/mmap.hh:64: undefined reference to `util::scoped_memory::reset(void*, unsigned int, 
util::scoped_memory::Alloc)' /home/bar/moses/kenlm/./util/mmap.hh:64: undefined reference to `util::scoped_memory::reset(void*, unsigned int, util::scoped_memory::Alloc)' build_binary.o: In function `~Backing': /home/bar/moses/kenlm/./lm/binary_format.hh:42: undefined reference to `util::scoped_fd::~scoped_fd()' build_binary.o: In function `~ModelFacade': /home/bar/moses/kenlm/./lm/facade.hh:45: undefined reference to `lm::base::Model::~Model()' build_binary.o: In function `ShowSizes': /home/bar/moses/kenlm/lm/build_binary.cc:56: undefined reference to `util::FilePiece::FilePiece(char const*, std::basic_ostreamchar, std::char_traitschar *, long long)' /home/bar/moses/kenlm/lm/build_binary.cc:57: undefined reference to `lm::ReadARPACounts(util::FilePiece, std::vectorunsigned long long, std::allocatorunsigned long long )' /home/bar/moses/kenlm/lm/build_binary.cc:58: undefined reference to `lm::ngram::detail::GenericModellm::ngram::detail::ProbingHashedSearch, lm::ngram::ProbingVocabulary::Size(std::vectorunsigned long long, std::allocatorunsigned long long const, lm::ngram::Config const)' /home/bar/moses/kenlm/lm/build_binary.cc:66: undefined reference to `lm::ngram::detail::GenericModellm::ngram::trie::TrieSearch, lm::ngram::SortedVocabulary::Size(std::vectorunsigned long long, std::allocatorunsigned long long const, lm::ngram::Config const)' /home/bar/moses/kenlm/lm/build_binary.cc:56: undefined reference to `util::FilePiece::~FilePiece()' build_binary.o: In function `main': /home/bar/moses/kenlm/lm/build_binary.cc:107: undefined reference to `lm::ngram::detail::GenericModellm::ngram::detail::ProbingHashedSearch, lm::ngram::ProbingVocabulary::GenericModel(char const*, lm::ngram::Config const)' build_binary.o: In function `~ProbingVocabulary': /home/bar/moses/kenlm/./lm/vocab.hh:97: undefined reference to `lm::base::Vocabulary::~Vocabulary()' build_binary.o: In function `util::scoped_memory::reset()': /home/bar/moses/kenlm/./util/mmap.hh:64: undefined reference to 
`util::scoped_memory::reset(void*, unsigned int, util::scoped_memory::Alloc)' /home/bar/moses/kenlm/./util/mmap.hh:64: undefined reference to `util::scoped_memory::reset(void*, unsigned int, util::scoped_memory::Alloc)' build_binary.o: In function `~Backing': /home/bar/moses/kenlm/./lm/binary_format.hh:42: undefined reference to `util::scoped_fd::~scoped_fd()' build_binary.o: In function `~ModelFacade': /home/bar/moses/kenlm/./lm/facade.hh:45: undefined reference to `lm::base::Model::~Model()' build_binary.o: In function `main': /home/bar/moses/kenlm/lm/build_binary.cc:113: undefined reference to `lm::ngram::detail::GenericModellm::ngram::detail::ProbingHashedSearch, lm::ngram::ProbingVocabulary::GenericModel(char const*, lm::ngram::Config const)' build_binary.o: In function `~ProbingVocabulary': /home/bar/moses/kenlm/./lm/vocab.hh:97: undefined reference to `lm::base::Vocabulary::~Vocabulary()' build_binary.o: In function `util::scoped_memory::reset()':
Re: [Moses-support] How much Ram for Europarl?
naturally, the parallel data could be down-sampled (eg use 1/2 of it). you probably won't see a significant degradation in translation quality, and the whole training process will use less RAM and will be quicker. Miles

On 18 April 2011 15:05, Tom Hoar tah...@precisiontranslationtools.com wrote: Your report of 100% physical usage, growing swap usage and low CPU load is normal when working with limited-RAM machines. With only 4 GB RAM and the new (larger) Europarl v6 corpus, you could train for 3 or 4 days depending on how you set up your swap partition. Even then, it's possible you will run out of RAM before it's finished. Upgrading to 8 GB RAM is a move in the right direction. Once it's finished training, you'll want to use the binarized tables and language model, which MMM's train-1.11 script creates. Tom

On Mon, 18 Apr 2011 14:52:10 +0100, Philipp Koehn pko...@inf.ed.ac.uk wrote: Hi, I am not familiar with the MMM setup, but one of the causes of memory use may be the translation table. You should use the on-disk translation table. -phi

On Mon, Apr 18, 2011 at 2:47 PM, David Wilkinson davidzw...@hotmail.com wrote: I have set up an Ubuntu 10.04 system with the moses-for-mere-mortals scripts. The default corpus trained in about 6-7 hours on my system (Athlon X3 3.2 GHz, 4 GB RAM). I am now trying to train the system with the Europarl German-English parallel corpus (about 45m words in each language), again using the default moses-for-mere-mortals settings. The system has been running for 24 hrs and is currently using all the physical memory and about 1.2 GB of swap. None of the cores are being used more than 10%, so like this it will take a very long time to finish. If I double the RAM to 8 GB, will this be sufficient?
Many Thanks David
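Miles's down-sampling suggestion only requires that the source and target files be cut identically so the corpus stays sentence-aligned. A minimal sketch (the file names in the usage comment are hypothetical):

```python
def downsample_parallel(src_lines, tgt_lines, keep_every=2):
    """Keep every `keep_every`-th sentence pair; dropping lines from both
    sides together preserves the sentence alignment."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("corpus sides are not parallel")
    kept = [(s, t) for i, (s, t) in enumerate(zip(src_lines, tgt_lines))
            if i % keep_every == 0]
    return [s for s, _ in kept], [t for _, t in kept]

# usage (hypothetical file names):
# src, tgt = downsample_parallel(open("corpus.de").readlines(),
#                                open("corpus.en").readlines())
```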
Re: [Moses-support] Nondeterminism during decoding: same config, different n-best lists
There is work published on making MERT more stable (on the train so can't easily dig it up) Miles sent using Android

On 25 Mar 2011 12:49, Lane Schwartz dowob...@gmail.com wrote: We know that there is nondeterminism during optimization, yet virtually all papers report results based on a single MERT run. We know that results can vary widely based on language pair and data sets, but a large majority of papers report results on a single language pair, and often for a single data set. While these issues are widely known at the informal level, I think that Suzy's point is well taken. I think there would be value in published studies showing just how wide the gap due to nondeterminism can be expected to be. It may be that such studies already exist, and I'm just not aware of them. Does anyone know of any? Cheers, Lane

On Fri, Mar 25, 2011 at 7:03 AM, Barry Haddow bhad...@inf.ed.ac.uk wrote: Hi This is an is... -- When a place gets crowded enough to require ID's, social collapse is not far away. It is time to go elsewhere. The best thing about space travel is that it made it possible to go elsewhere. -- R.A. Heinlein, Time Enough For Love
Re: [Moses-support] Nondeterminism during decoding: same config, different n-best lists
Yes, that is the one Miles sent using Android

On 25 Mar 2011 13:08, Barry Haddow bhad...@inf.ed.ac.uk wrote: This might be what Miles is referring to: http://www.statmt.org/wmt09/pdf/WMT-0939.pdf There was some progress towards getting this into Moses: http://lium3.univ-lemans.fr/mtmarathon2010/ProjectFinalPresentation/MERT/StabilizingMert.pdf On Friday 25 March 2011 13:02, Miles Osborne wrote: There is work published on making mert more s...
Re: [Moses-support] running moses on a cluster with sge
to add to Barry's excellent answer, we are currently working on a client-server language model. this will mean that a cluster of machines can be used, with a shared resource. it should also work with multicore, but in the short term you are probably better off with multicore. Miles

On 2 February 2011 06:06, Noubours, Sandra sandra.noubo...@fkie.fraunhofer.de wrote: Hello Barry, hello Tom, thank you for your answers. I think I have a better idea about different approaches to Moses efficiency issues now. Best regards, Sandra

-----Original Message----- From: Barry Haddow [mailto:bhad...@inf.ed.ac.uk] Sent: Monday, 31 January 2011 10:52 To: moses-support@mit.edu Cc: Noubours, Sandra; Tom Hoar Subject: Re: [Moses-support] running moses on a cluster with sge

Hi Sandra, The short answer is that it really depends how big your models are. Running on a cluster helps speed up tuning because most of the time in tuning is spent decoding, which can be easily parallelised by splitting up the file into chunks. So each of the individual machines should be capable of loading your models and running a decoder. The problem with using a cluster (as opposed to multicore) is that each machine has to have its own RAM, and if you want to load large models then you need a lot of RAM, whereas with multicore, each thread can access the same model. Sure, binarising saves a lot on RAM usage, but it slows you down and puts a lot of load on the filesystem, which can cause problems on clusters. Our group's machines are a mixture of 8 and 16 core Xeon 2.67GHz, with 36-72G RAM, no SGE. We also have access to the university cluster, but since the most RAM you can get is 16G and SGE hold jobs don't work at the moment, we don't really use it for Moses any more. hope that helps - regards - Barry

On Monday 31 January 2011 07:42, Noubours, Sandra wrote: Hello, thanks for the tips! When talking about using a Sun Grid Engine I was referring to tuning.
Making use of a cluster is supposed to speed up the tuning process (see http://www.statmt.org/moses/?n=Moses.FAQ#ntoc10). In this context I wondered what hardware exactly is needed for such a cluster. Sandra

From: Tom Hoar [mailto:tah...@precisiontranslationtools.com] Sent: Friday, 28 January 2011 09:01 To: Noubours, Sandra Cc: moses-support@mit.edu Subject: Re: [Moses-support] running moses on a cluster with sge

Sandra, What kind of capacity do you need to support? I just finished translating 21,000 pages, over 1/2 million phrases, in 22 hours on an old Intel Core2Quad, 2.4 GHz with 4 GB RAM and a 4-disk RAID-0. Moses was configured with binarized phrase/reordering tables and a KenLM binarized language model. The advances in Moses supporting efficient binarized tables/models are great! We're planning tests for a 2-socket host with two Intel Xeon 5680 6-core 3.33 GHz CPUs, 48 GB RAM and 4 1-TB disks as RAID-0. With 12 cores (totaling 24 simultaneous threads according to Intel specs), we're expecting to boost capacity to well over 15 million phrases per day on one host. What's the advantage of running Moses on a grid or cluster? Tom

On Fri, 28 Jan 2011 08:40:22 +0100, Noubours, Sandra sandra.noubo...@fkie.fraunhofer.de wrote: Hello, I would like to run Moses on a cluster. I am as yet inexperienced in using Sun Grid Engine as well as clusters in general. Could you give me any instructions or tips for setting up a Linux cluster with Sun Grid Engine for running Moses? a) What kind of cluster would you recommend, i.e. how many machines, how many CPUs, what memory, etc.? b) When tuning is performed with the multicore option it does not use more than one CPU. Does the tuning step use more than one CPU when run on a cluster? c) Can Sun Grid Engine implement a cluster virtually on one computer, so that jobs are spread locally to different CPUs of one computer? Thank you and best regards!
Sandra

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
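Barry's point about parallelising tuning can be made concrete: the dev set is split into chunks, each chunk is decoded on a separate machine, and the n-best lists are merged afterwards. A minimal sketch of the splitting step in Python (the real work in Moses is done by moses-parallel.pl; the function below is only illustrative):

```python
# Sketch of the "split the input file into chunks" step that lets a
# cluster decode a tuning set in parallel. Illustrative only.

def split_into_chunks(lines, n_chunks):
    """Split `lines` into n_chunks contiguous chunks of near-equal size."""
    size, rem = divmod(len(lines), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < rem else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks

if __name__ == "__main__":
    sentences = [f"sentence {i}" for i in range(10)]
    for i, chunk in enumerate(split_into_chunks(sentences, 3)):
        print("chunk", i, "has", len(chunk), "sentences")
```

Each chunk would then be decoded independently; because decoding one sentence does not depend on another, the merge is a simple concatenation in the original order.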
Re: [Moses-support] skip tuning in ems
Supply a weights file, e.g.

weight-config = /home/miles/nist09/run9.moses.ini

and add this to the TUNING section. Miles

On 31 January 2011 21:22, John Morgan johnjosephmor...@gmail.com wrote:

Hello, I'd like to run an experiment with the EMS without tuning. Is it enough to write IGNORE on the [TUNING] line in the configuration file? This doesn't seem to be working for me, so I've been changing experiment.meta. Under the decode section I write in TRAINING:config instead of TUNING:weight-config. What is the right way to do this? Thanks, John

-- Regards, John J Morgan
Re: [Moses-support] skip tuning in ems
No, just create a dummy one (with uniform weights) if you want to skip tuning and don't have the weights handy. Miles

On 31 January 2011 22:31, John Morgan johnjosephmor...@gmail.com wrote:

Miles, I don't think this does what I need. I think your example assumes that the weight-config file already exists when experiment.perl is run. I tried setting weight-config = $working-dir/model/moses.ini.* and weight-config = $working-dir/model/moses.ini. In both cases I get a "file does not exist" error. I can skip the [RECASING] module, so why can't I skip the [TUNING] module? Is there a way to use pass-unless, ignore-unless, or template-if for this? Thanks, John

On 1/31/11, Miles Osborne mi...@inf.ed.ac.uk wrote: supply a weights file, e.g. weight-config = /home/miles/nist09/run9.moses.ini and add this to the TUNING section. Miles

-- Regards, John J Morgan
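As a config fragment, Miles' dummy-weights workaround looks like the following. The path is only an example; the key point is that the file must already exist when experiment.perl plans the run, which is presumably why John's $working-dir/model/moses.ini attempt failed with "file does not exist":

```ini
[TUNING]
# point EMS at a pre-existing ini containing (uniform) dummy weights,
# instead of running MERT; the path below is illustrative only
weight-config = /home/user/dummy-uniform.moses.ini
```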
Re: [Moses-support] Train moses incrementally
Not yet. Miles (sent using Android)

On 15 Jan 2011 10:00, Sébastien Druon s.dr...@ml-technologies.com wrote: Thanks! Do you approximately know in what time frame? Regards, Sebastien

On Wed, 2011-01-12 at 09:44 +0000, Miles Osborne wrote: sorry, the code is not publicly available...
Re: [Moses-support] Train moses incrementally
Sorry, the code is not publicly available yet. We will probably release it in the near future. Miles

On 12 January 2011 09:36, Sébastien Druon s.dr...@ml-technologies.com wrote: Thanks for this answer... Is there some code available? When will it be integrated into Moses? Thanks again, Sebastien

On 12 Jan 2011 09:21, Miles Osborne mi...@inf.ed.ac.uk wrote: Yes, we have done this for both GIZA++ and for the language model:

Stream-based Translation Models for Statistical Machine Translation, Abby Levenberg, Chris Callison-Burch and Miles Osborne, NAACL 2010
Stream-based Randomised Language Models for SMT, Abby Levenberg and Miles Osborne, EMNLP 2009

This isn't integrated into Moses (yet). Miles

On 12 January 2011 08:10, Sébastien Druon s.dr...@ml-technologies.com wrote: Hello, Is it p...
Re: [Moses-support] SRILM problem
In general you should send SRILM requests to their mailing list and not to this one. But I can tell you straight away that the ngram server is behaving correctly: it waits for requests ... Miles

On 26 November 2010 11:28, Korzec, Sanne sanne.kor...@wur.nl wrote:

Hi, I have compiled SRILM on a machine of type ppc64. The make world seems to have finished OK. These files are in place: libdstruct.a libflm.a liblattice.a libmisc.a liboolm.a. The make test seems to perform great; however, it hangs (more than an hour) on this line:

*** Running test ngram-server ***

I have no idea what might cause this. Can anyone help me solve the problem? I have tried to ignore this and compile Moses anyway, but that generates an error during make moses. Thanks in advance. Sanne
Re: [Moses-support] Proposal to replace vertical bar as factor delimiter
I second this, but can I make another suggestion: make the default be *non*-factored input. I reckon that most people using Moses don't actually use factors (hands up if you do). That means plain input, with absolutely no meta characters in it. And if you are going to use meta characters, why not just have a flag such as --factorDelimiter=| etc. Miles

On 15 November 2010 21:30, Hieu Hoang hieuho...@gmail.com wrote: That's a good idea. In the decoder, there are 4 places that have to be changed because it's hardcoded: ConfusionNet, GenerationDictionary, LanguageModelJoint, Word::createFromString. However, train-model.perl is more difficult to change. Hieu (Sent from my flying horse)

On 15 Nov 2010, at 09:00 PM, Lane Schwartz dowob...@gmail.com wrote:

I'd like to propose changing the current factor delimiter to something other than the single vertical bar |. Looking through the mailing archives, it seems that the failure to properly purge your corpus of vertical bars is a frequent source of headaches for users. I know I've encountered this problem before, but even knowing that I should do this, just today I had to track down another vertical bar-related problem.

I don't really care what the replacement character(s) ends up being, just so that any corpus munging related to this delimiter gets handled internally by Moses rather than being the user's responsibility. If Moses could easily be modified to take a multi-character delimiter, that would probably be best. My suggestion for a single-character delimiter would be something with the following characteristics:

* Character should be printable (i.e. not a control character)
* Character should be one that's implemented in most commonly used fonts
* Character should be highly obscure, and extremely unlikely to appear in a corpus
* Character should not be confusable with any commonly used character

Many characters in the Dingbats section of Unicode (block 2700) would fit these desiderata.
I suggest Unicode character 2759, MEDIUM VERTICAL BAR. This is a highly obscure printable character that looks like a thick vertical bar. It's obviously a vertical bar, but just as obviously not the same thing as the regular vertical bar |. Cheers, Lane
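Until a change like Lane's is adopted, the delimiter clash has to be handled by the user before training. A small sketch of that pre-processing step (the replacement character is an arbitrary choice, not anything Moses prescribes):

```python
# Purge the factor delimiter '|' from raw corpus text before training.
# The replacement string is an arbitrary placeholder, not a Moses convention.

def purge_bars(line, replacement="-"):
    """Return `line` with every vertical bar replaced."""
    return line.replace("|", replacement)

def count_bar_lines(lines):
    """How many lines contain the character that trips up factor parsing?"""
    return sum(1 for line in lines if "|" in line)
```

Running count_bar_lines over a corpus before training is a cheap way to catch the problem early instead of deep inside tuning.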
Re: [Moses-support] bag of words language model
I implemented this years ago (the idea then was to see if, for free-word-order languages, phrases could be generalised). At the time it didn't seem that there was a more efficient way to do it than just generate permutations and score them. And if you think about it, this is essentially the reordering problem. Miles

On 25 October 2010 12:59, Philipp Koehn pko...@inf.ed.ac.uk wrote: Hi, I am not familiar with that, but somewhat related is Arne Mauser's global lexical model, which also exists as a secret feature in Moses (secret because no efficient training exists). Citation: A. Mauser, S. Hasan, and H. Ney. Extending Statistical Machine Translation with Discriminative and Trigger-Based Lexicon Models. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore, August 2009. http://www-i6.informatik.rwth-aachen.de/publications/download/628/MauserArneHasanSav%7Bs%7DaNeyHermann--ExtendingStatisticalMachineTranslationwithDiscriminativeTrigger-BasedLexiconModels--2009.pdf -phi

On Fri, Oct 22, 2010 at 7:02 PM, Francis Tyers fty...@prompsit.com wrote: Hello all, I have a rather strange request. Does anyone know of any papers (or implementations) on bag-of-words language models? That is, a language model which does not take into account the order in which the words appear in an n-gram, so if you have the string 'police chief of' in your model, you will get a result for both 'chief of police' and 'police chief of'. I have thought of using IRSTLM or some generic model and scoring all the permutations, but wondered if there was a more efficient implementation already in existence. I have searched without much luck in Google, but perhaps I am searching with the wrong words.
Best regards, Fran
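For the lookup side, one alternative to scoring every permutation is to canonicalise each n-gram by sorting its words, so that all orderings share a single key. This is only a toy count model built on that assumption, not a smoothed language model:

```python
from collections import Counter

def bow_ngrams(tokens, n):
    """Yield order-insensitive n-gram keys: the sorted words of each window."""
    for i in range(len(tokens) - n + 1):
        yield tuple(sorted(tokens[i:i + n]))

def train(corpus_sentences, n=3):
    """Count bag-of-words n-grams over a list of whitespace-tokenised sentences."""
    counts = Counter()
    for sent in corpus_sentences:
        counts.update(bow_ngrams(sent.split(), n))
    return counts

def score(counts, phrase, n=3):
    """A phrase matches if its sorted words were seen, in any order."""
    return counts[tuple(sorted(phrase.split()))]

model = train(["the police chief of staff resigned"], n=3)
# 'chief of police' and 'police chief of' share the key ('chief', 'of', 'police')
assert score(model, "chief of police") == score(model, "police chief of") == 1
```

The sorted-tuple key makes lookup O(n log n) per n-gram instead of enumerating n! permutations; whether this extends to a properly smoothed model is exactly the open question in the thread.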
Re: [Moses-support] train-truecaser.perl proposed tweak
This sounds risky to me. It would be better to allow the user to specify the behaviour; for your suggestion, you would add an extra flag which would enable this. The default would be for truecasing to operate as it used to. Miles

On 25 October 2010 17:37, Ben Gottesman gottesman@gmail.com wrote:

Hi, are truecase models still widely in use? I have a proposal for a tweak to the train-truecaser.perl script. Currently, we don't take the first token of a sentence as evidence for the true casing of that type, on the basis that the first word of a sentence is always capitalized. The first token of a segment is always assumed to be the first word of a sentence, and thus is never taken as casing evidence. However, if a given segment is only one token long, then the segment is probably not a sentence, and the token is quite possibly in its natural case. So my proposal is to take the sole token of one-token segments as evidence for true casing. I attach the code change. Any objections? If not, I'll check it in. Ben
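Ben's rule is easy to state in code. A sketch of which tokens would count as casing evidence (the actual patch is to train-truecaser.perl; this only illustrates the logic):

```python
def casing_evidence(segment_tokens):
    """Tokens usable as truecasing evidence under the proposed rule:
    skip the (always-capitalised) first token of multi-token segments,
    but keep the sole token of a one-token segment."""
    if len(segment_tokens) == 1:
        return segment_tokens      # probably not a sentence: keep its casing
    return segment_tokens[1:]      # drop the sentence-initial token
```

Miles' counter-proposal would simply gate the len == 1 branch behind an opt-in flag, leaving the old behaviour (always drop the first token) as the default.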
Re: [Moses-support] about Morph tagging
Ah, my apologies - I didn't realise you also wanted morphological information. In that case, you will need something like Fran's suggestion. Miles

On 20 October 2010 11:12, Francis Tyers fty...@prompsit.com wrote: You could use the morphological analysers from the Apertium project:

http://wiki.apertium.org/wiki/Using_an_lttoolbox_dictionary
http://wiki.apertium.org/wiki/Lttoolbox
http://wiki.apertium.org/wiki/HFST

Fran

On Wednesday 20/10/2010 at 17:58 +0800, JiaHongwei wrote: Hi, I need to train a model with POS tags and morphological information for Moses, involving languages such as German, Spanish, French and Italian. By using TreeTagger, I can get POS tags in the format 'form pos lemma', but I want it further processed to be like 'form pos lemma morph'. So the job is taking 'form pos lemma' as input and producing output in the format 'form pos lemma morph'. Could you recommend a way or a tool to do this job automatically or in a pipeline? Thanks in advance! Best Regards, Henry
Re: [Moses-support] mteval-v11b
Note also that NIST changed to IBM BLEU recently, which has a different treatment of multiple references (mteval-v13 uses IBM BLEU, if I recall). Generally the BLEU scores will be a little lower than before, but MERT performance should be more robust. Miles

On 17 October 2010 09:57, liu chang liuchangj...@gmail.com wrote:

On Sun, Oct 17, 2010 at 3:41 PM, Somayeh Bakhshaei s.bakhsh...@yahoo.com wrote: Hello, I have some questions about mteval-v11b.pl. 1) It cannot use multiple references; what is an equivalent tool for this aim? 2) I tried multi-bleu.perl, but the scores were reduced, while we expect them to increase when adding more reference sets! How can that be? 3) I tested mteval-v11b.pl and multi-bleu.perl in equivalent situations; they do not always agree - sometimes mteval and sometimes the other gives better scores. Is there any problem? 4) And finally, isn't there any better tool with multi-reference support?

Hi Somayeh, BLEU has defined treatment for multiple references from the very beginning (see the original Papineni et al. 2002 paper for details). Any implementation of BLEU that does not support multiple references should be considered defective. Personally I've always used mteval-v13a from http://www.itl.nist.gov/iad/mig/tests/mt/2009/ which has no problem dealing with multiple references at all. All you need to do is to provide the multiple references as multiple doc sections in your reference set:

<doc docid="document" sysid="r1">
<seg> ... </seg>
...
</doc>
<doc docid="document" sysid="r2">
...

Disclaimer: The above definitely works for v13a, but I'm not specifically familiar with v11b. Cheers, Liu Chang, National University of Singapore
Re: [Moses-support] max-phrase-length vs. number of scores
The phrase length refers to the number of words in a phrase, and the number of scores to the number of feature functions per phrase. They have nothing to do with each other.

On 6 October 2010 11:31, supp...@precisiontranslationtools.com wrote:

I found the message below, which mentions the topic but leaves my question unanswered. The train-model.perl script has an option called max-phrase-length; documentation shows its default is 7. The processPhraseTable binarizer has an option called -nscores that refers to number of scores. The moses binary's fourth numeric option in moses.ini's [ttable-file] section is also number of scores. Documentation and the message below define a default of 5. Are the max-phrase-length and number of scores values the same? If not, is there a connection, and if so, what is it? If there's no connection, what criteria should one use when setting number of scores, and what is the consequence of changing it from the default of 5? Thanks, Tom

On Fri, 25 Jun 2010 18:14:07 +0100, Philipp Koehn pko...@inf.ed.ac.uk wrote: Hi, something has gone awry in your use of the binarizer. A typical way to call the binarizer is:

LC_ALL=C sort phrase-table | ~/bin/processPhraseTable -ttable 0 0 - -nscores 5 -out phrase-table

-nscores refers to the number of scores in the phrase translation table, which is by default 5.
-phi

On Fri, Jun 25, 2010 at 5:45 PM, Cyrine NASRI cyrine.na...@gmail.com wrote: Good morning everybody. I don't understand the meaning of -nscores 5. When I run the command which binarizes the phrase tables, a message appears:

processing ptree for 5
Can't read 5

Thank you very much. PS: I'm not English, so please excuse my English. Cyrine
Re: [Moses-support] giza++ best alignment
Clearly, changing the configuration will change the alignment results. I suggest that before mailing the list again, you read this article: A Systematic Comparison of Various Statistical Alignment Models, Franz Josef Och and Hermann Ney, http://acl.ldc.upenn.edu/J/J03/J03-1002.pdf Miles

2010/10/3 musa ghurab mossaghu...@hotmail.com: Thanks Venkataramani, but on the GIZA++ website http://fjoch.com/GIZA++.html they say "Alignment models depending on word classes", and on the mkcls website http://www-i6.informatik.rwth-aachen.de/Colleagues/och/software/mkcls.html they say "-n number of optimization runs (Default: 1); larger number = better results". I changed this number to -n10 where it was -n2 in train-model.perl, and that gave a different alignment file. Any explanation? Thanks

On Sun, 3 Oct 2010 02:23:06 -0400, Eknath Venkataramani eknath.i...@gmail.com wrote: That purely depends on your corpus. There is no such thing as the best configuration.

2010/10/2 musa ghurab mossaghu...@hotmail.com: Hi, please can someone tell me: what is the best configuration for GIZA++ to get the best alignment file, if time and size are ignored (or not important)? With best regards, musa
Re: [Moses-support] Problem with tuning
Looking at your output:

[ERROR] Malformed input at Expected input to have words composed of 1 factor(s) (form FAC1|FAC2|...) but instead received input with 0 factor(s).
sh: line 1: 5114 Aborted

Make sure you have no bar (|) characters in the data. Miles

On 27 September 2010 14:45, Souhir Gahbiche s.gahbi...@gmail.com wrote:

Hi all, I'm trying to tune my system, but the tuning stops at the first iteration. Here is my mert.log file:

After default: -l mem_free=0.5G -hard
Using SCRIPTS_ROOTDIR: /vol/mt2/tools/nadi/moses-scripts/scripts-20090923-1833/
SYNC distortion
checking weight-count for ttable-file
checking weight-count for lmodel-file
checking weight-count for distortion-file
Executing: mkdir -p /working/tuningdev2009/mert
Executing: /vol/mt2/tools/nadi/moses-scripts/scripts-20090923-1833//training/filter-model-given-input.pl ./filtered /tmp/souhir/model/mosesdev2009.ini /working/tuningdev2009/ar.project-syndicate.2009-07.v1.dev.bw.mada.tok
filtering the phrase tables... Fri Aug 27 16:43:10 CEST 2010
The filtered model was ready in /working/tuningdev2009/mert/filtered, not doing anything.
run 1 start at Fri Aug 27 16:43:10 CEST 2010
Parsing --decoder-flags: |-v 0|
Saving new config to: ./run1.moses.ini
Saved: ./run1.moses.ini
Normalizing lambdas: 0 1 1 1 1 1 1 1 1 0.3 0.2 0.3 0.2 0
DECODER_CFG = -w %.6f -lm %.6f -d %.6f %.6f %.6f %.6f %.6f %.6f %.6f -tm %.6f %.6f %.6f %.6f %.6f
values = 0 0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.0333 0.0222 0.0333 0.0222 0
Executing: /vol/mt2/tools/nadi/moses/moses-cmd/src/moses -v 0 -config filtered/moses.ini -inputtype 0 -w 0.00 -lm 0.11 -d 0.11 0.11 0.11 0.11 0.11 0.11 0.11 -tm 0.03 0.02 0.03 0.02 0.00 -n-best-list run1.best100.out 100 -i /working/tuningdev2009/ar.project-syndicate.2009-07.v1.dev.bw.mada.tok > run1.out
(1) run decoder to produce n-best lists
params = -v 0
decoder_config = -w 0.00 -lm 0.11 -d 0.11 0.11 0.11 0.11 0.11 0.11 0.11 -tm 0.03 0.02 0.03 0.02 0.00
Loading lexical distortion models...
have 1 models
Creating lexical reordering...
weights: 0.111 0.111 0.111 0.111 0.111 0.111
Loading table into memory...done.
Created lexical orientation reordering
[ERROR] Malformed input at Expected input to have words composed of 1 factor(s) (form FAC1|FAC2|...) but instead received input with 0 factor(s).
sh: line 1: 5114 Aborted /vol/mt2/tools/nadi/moses/moses-cmd/src/moses -v 0 -config filtered/moses.ini -inputtype 0 -w 0.00 -lm 0.11 -d 0.11 0.11 0.11 0.11 0.11 0.11 0.11 -tm 0.03 0.02 0.03 0.02 0.00 -n-best-list run1.best100.out 100 -i /working/tuningdev2009/ar.project-syndicate.2009-07.v1.dev.bw.mada.tok > run1.out
Exit code: 134
The decoder died. CONFIG WAS -w 0.00 -lm 0.11 -d 0.11 0.11 0.11 0.11 0.11 0.11 0.11 -tm 0.03 0.02 0.03 0.02 0.00

The file run1.out is empty. I tried many times, but every time it stops at the same point. I checked the moses.ini; it works perfectly when I use two phrase tables. Here's my moses.ini:

#
### MOSES CONFIG FILE ###
#

# input factors
[input-factors]
0

# mapping steps
[mapping]
0 T 0

# translation tables: source-factors, target-factors, number of scores, file
[ttable-file]
0 0 5 /working/model/phrase

# no generation models, no generation-file section

# language models: type(srilm/irstlm), factors, order, file
[lmodel-file]
0 0 4 /working/lmm/newsLM+news-train08.fr.4gki.arpa.gz

# limit on how many phrase translations e for each phrase f are loaded
# 0 = all elements loaded
[ttable-limit]
20

# distortion (reordering) files
[distortion-file]
0-0 msd-bidirectional-fe 6 /working/model/reordering-table.gz

# distortion (reordering) weight
[weight-d]
0.3 0.3 0.3 0.3 0.3 0.3 0.3

# language model weights
[weight-l]
0.5000

# translation model weights
[weight-t]
0.2 0.2 0.2 0.2 0.2

# no generation models, no weight-generation section

# word penalty
[weight-w]
-1

[distortion-limit]
6

Any ideas?
Thanks, SG
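Errors like the one in Souhir's log ("expected ... 1 factor(s) ... received input with 0 factor(s)") usually come from tokens containing stray or empty factor delimiters. A quick diagnostic sketch, assuming the default '|' delimiter:

```python
# Flag tokens whose factor count does not match what the decoder expects.
# A bare '|' splits into two empty factors, which is exactly the kind of
# token that triggers "received input with 0 factor(s)".

def bad_factor_tokens(line, expected_factors=1, delimiter="|"):
    """Return tokens in `line` whose factor count differs from `expected_factors`."""
    return [tok for tok in line.split()
            if len(tok.split(delimiter)) != expected_factors]
```

Running this over the tuning input before calling mert-moses.pl pinpoints the offending lines instead of letting the decoder abort mid-run.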
Re: [Moses-support] wrong alignment
It is probably more helpful to give the number of sentences you used for language model training (and other details, e.g. n-gram order). But at first glance that looks like a tiny amount of language model data; I would expect to see something closer to 2 GB or so, depending upon representation. Miles

2010/9/24 musa ghurab mossaghu...@hotmail.com: Thanks Burger, here is some information: Language model: 45 MB; Phrase table: 26 MB; Reordering model: 36 MB. But I'm still waiting for tuning to finish.

From: j...@mitre.org To: moses-support@mit.edu Date: Fri, 24 Sep 2010 13:40:40 -0400 Subject: Re: [Moses-support] wrong alignment

musa ghurab wrote: I trained a system for the Chinese-Arabic language pair, but many alignments are wrong. The same goes for the lexical model, where many words are wrongly aligned. Here is an example of the lexical model (lex.e2f):

The point of Moses is not to get good alignments, but to get good translation output. The target language model will help the decoder to pick good translations, even if the translation probabilities that come out of the alignment do not appear to be ideal. A great deal of research effort has been wasted (in my opinion) on getting better alignments, without actually achieving better translation. Have you run the resulting models on a test set? What was the score? How big is your language model? More LM data is probably the easiest way to make up for what might appear to be poor alignments. - John D. Burger, MITRE
Re: [Moses-support] qsub and EMS again
yes, not doing the checking during the planning stage seems sensible. (you could just change the delay at this point to speed things up). here in Edinburgh we use experiment.perl mainly in a multicore / single machine setting and that is why support for slow STDERR creation is not really there yet. but, there are plans to port this to Hadoop, which should solve synchronisation problems like this. this is the next major piece of development I'll be involved with. (the current one involves more language modelling) Miles On 3 September 2010 01:18, Suzy Howlett s...@showlett.id.au wrote: Thanks for the responses. I think I will go with the loop. I was a bit confused about this at first - it considers the step to have crashed if the STDERR file does not exist, but since the STDERR file is the output of the script that creates the DONE file, I would have thought that the DONE file could not be created without the STDERR file ultimately following. However presumably if the STDERR file didn't appear for some reason, that is a problem, and so should be considered a crash. The unfortunate thing about putting a loop like this in check_if_crashed is that it also has to go through this when it's planning what steps to do, which could lead to a long delay in planning if a step has actually crashed through not creating a STDERR file. I think the problem is ultimately with our cluster. I noticed sometimes some jobs would be sitting on the queue with status exiting for several minutes - so the DONE file had been created but the STDERR file would not appear until after the job had been finally removed from the queue. Having given it some more thought, I think the issue may be with writing to disk. I'm pretty sure that the slave nodes do not have their own hard disks, only the master, and I think jobs may have been stalled while they waited for a chance to write results to disk - the master node was very very busy at the time. I don't know if that accounts for it! 
I'm not sure how there being no hard disks in the slaves interacts with Hieu's point - I don't really understand how the setup works. Thanks again, Suzy On 2/09/10 8:26 PM, Miles Osborne wrote: a better setup would be to have a loop which did the following: --for a given version number and step, check for STDERR, STDOUT and DONE --if they are all found, exit --otherwise sleep and recheck (and put some limit overall to prevent an endless loop) Miles On 2 September 2010 11:16, Hieu Hoanghieuho...@gmail.com wrote: sounds like a bad case of a network file system. you prob need to harass your sysadmin and try a few of these too http://fixunix.com/nfs/61890-forcing-nfs-sync.html On 02/09/2010 04:09, Suzy Howlett wrote: Hi everyone, I'm running Moses through its experiment management system across a cluster and I'm finding that sometimes jobs will finish successfully but the .STDERR and .STDOUT files will be slow in appearing relative to the .DONE file, meaning that the EMS concludes that the step crashed. I can run the system again and it successfully reuses the results of the step (it doesn't have to rerun the step) but this is becoming frustrating as I have to restart the system frequently. I tried adding a call to sleep() in the check_if_crashed() method in experiment.perl but this is not helping in general - I think sometimes the delay is as much as a couple of minutes. Has anyone else faced this problem, or have a better idea for how to get around it? Cheers, Suzy ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support -- Suzy Howlett http://www.showlett.id.au/ -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] mert-moses.pl working-dir tmp
This is after a crash, I presume? If so, then you should delete the step which creates the first config file. This will force it to be recreated, using the current version. Below is a small Perl script I use (for an older version of experiment.perl, but it should work for you too). This was intended for experiments which use new language models; it forces tuning and removes older versions of filtered phrase tables.

# test out a new LM, making sure experiment.perl uses it
$config = $ARGV[0];
system "rm -fr /disk1/miles/work4/steps/TUNING*";
system "rm /disk1/miles/work4/steps/TRAINING_create-config*";
system "rm -fr /disk1/miles/work4/tuning/tmp.*/filtered/";
system "nohup perl experiment.perl -config $config -exec -no-graph";

Miles

On 1 September 2010 22:17, John Morgan johnjosephmor...@gmail.com wrote:

Hello, I'm running the basic demo for the EMS and the experiment is crashing at the tuning step. There's a problem transitioning from the step where the moses.ini config file is created to the step where tuning is started. The moses.ini file is created in the model directory, but the tuning step looks for it under the tuning directory. Then experiment.perl puts the moses.ini file under tuning/tmp.$VERSION, which doesn't exist. What am I missing? Thanks, John
Re: [Moses-support] problem with tokenizer.perl
See here: http://jeremy.zawodny.com/blog/archives/010546.html for a discussion of utf8 vs UTF-8 ... now off to see England triumphant against Germany. Miles

On 27 June 2010 13:23, Miles Osborne mi...@inf.ed.ac.uk wrote: On the subject of UTF-8, I think the Moses tokeniser may be using the version that is too strict. I've just changed it to this:

binmode(STDIN, ":encoding(UTF-8)");
binmode(STDOUT, ":encoding(UTF-8)");

and later on in the same file:

open(PREFIX, "<:encoding(UTF-8)", $prefixfile);

See if this helps. Miles

On 27 June 2010 13:15, Ingrid Falk ingrid.f...@loria.fr wrote: Hi Cyrine, I think this is because tokenizer.perl expects UTF-8 input (on STDIN). This is because of the

binmode(STDIN, ':utf8');

line in the tokenizer script. Your input is maybe not UTF-8? Ingrid

On 06/27/2010 01:08 PM, Cyrine NASRI wrote: Hello everyone, I tried to run the tokenizer.perl script on my two development files. I'm having a problem when running it, but I do not understand why. A message appears:

/home/Bureau/moses/moses/scripts/tokenizer$ ./tokenizer.perl -l fr < /home/Bureau/work/test-fr.fr > /home/Bureau/work/input.tok
Tokenizer Version 1.0
Language: fr
WARNING: No known abbreviations for language 'fr', attempting fall-back to English version...
utf8 \xE9 does not map to Unicode at ./tokenizer.perl line 47, <STDIN> line 1.
Malformed UTF-8 character (fatal) at ./tokenizer.perl line 67, <STDIN> line 1.

Thank you very much. Sincerely, Cyrine
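The failing byte in Cyrine's log, \xE9, is 'é' in Latin-1 and is not a valid UTF-8 sequence on its own, which is why strict UTF-8 decoding aborts. The same behaviour, illustrated in Python rather than the tokeniser's actual Perl:

```python
raw = b"caf\xe9"  # Latin-1 bytes; \xe9 is 'é' in Latin-1, invalid as UTF-8

# Strict UTF-8 decoding rejects the byte, much like the tokeniser's fatal error:
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Decoding with the encoding the file actually uses succeeds:
text = raw.decode("latin-1")
print(decoded_ok, text)
```

So the practical fix on the user's side is to convert the corpus to UTF-8 (e.g. with iconv) before tokenising, rather than loosening the tokeniser's input checks.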
Re: [Moses-support] moses may 10
On 11 May 2010 17:33, Christian Hardmeier c...@rax.ch wrote:

For my purposes, even a hard-coded assumption of 1, along with a more transparent error message if the model isn't found, would do. Does anybody actually decode with in-memory phrase tables in real life? (well, I suppose some people do...)

Google, and anyone who actually wants to do more than optimise against a fixed dev/test set. You can't afford to filter the phrase table when dealing with any old translation request.

Miles

/Christian

On Tue, 11 May 2010, Barry Haddow wrote:

Maybe a more transparent error message would help?

On Tuesday 11 May 2010 17:20:26 Hieu Hoang wrote:

i thought about making it back-compatible but the code gets messy and error prone. There are now 3 more phrase tables - the text SCFG, binary SCFG, and the suffix array. So i thought it better to take the punch now and feel a short, sharp pain rather than let it linger. However, anyone who wants to put back the old code to make it backwards compatible is welcome to, as long as you look after it.

On 11/05/2010 17:04, Christian Hardmeier wrote:

Hi,

The first error that you give is because the format of the moses.ini file has changed. You need to add an extra digit at the beginning of the line that specifies the ttable-file. Add 0 for a memory-based ttable, and 1 for a binarised ttable.

Is there a reason why we can't have backwards compatibility here? I'm a bit concerned about moving to the latest decoder version, since it will require me to update the configuration file of each and every system I've ever trained, and then they won't work with the old decoders any more. Couldn't the decoder figure out on its own whether it should be 0 or 1 if the indication is missing, as it used to do?
Cheers, Christian
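The format change described above can be sketched in a moses.ini fragment (the field layout and path are illustrative, not taken from a real configuration):

```ini
[ttable-file]
# old format:
#   0 0 5 /path/to/phrase-table.gz
# new format: an extra leading digit selects the implementation,
# 0 = in-memory text table, 1 = binarised table:
0 0 0 5 /path/to/phrase-table.gz
```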
Re: [Moses-support] A few MOSES questions (Arabic, missing scripts, Moses error)
MADA can create tokens that are bare bar characters (ie | ). You need to rename them to something like BAR. Moses treats these as factor delimiters, hence the message you are seeing. (i've been using MADA+TOKAN for Arabic, using the D2 setting)

Miles

On 7 May 2010 23:26, David Edelstein dedelst...@ucdavis.edu wrote:

Hello,

I'm using Moses to do some SMT on Arabic, experimenting with diacritized vs. undiacritized Arabic training corpora. (I am using MADA+TOKAN to perform automatic diacritization.) So, if anyone happens to be specifically interested in Arabic, has some tips on using Moses for Arabic (right now I am just trying to get a baseline system running, so I haven't even begun exploring which parameters I need to tweak from the defaults), or can give me any other insights, I'd be very pleased to talk to you about it off-list; please email me.

Now, I have a specific question and a specific problem, to which I have not found a solution by searching the archives.

1. There are two scripts referenced in scripts/released-files (read by the scripts Makefile):

    training/train-factored-phrase-model.perl
    training/filter-and-binarize-model-given-input.pl

These scripts do not exist in the most recent SVN release, so 'make release' reports an error since obviously it cannot install them. The tutorials alternately reference train-factored-phrase-model.perl and train-model.perl; reading the latter, it seems to do factored training. Is this just an error (and something that should be updated in the online docs and released-files), and I should only be using train-model.perl? Or is there a difference between the two scripts? And is the same true of training/filter-and-binarize-model-given-input.pl vs. filter-model-given-input.pl?

2. I went through the entire tutorial using the French-English Europarl data sets, and got it working. Now I'm going through the same process with my Arabic-English parallel corpora. I've gotten as far as tuning.

I've been trying to use train-model.perl, and it gets to this part:

    my-moses-dir/moses-cmd/src/moses -v 0 -config my-model-dir/moses.ini -inputtype 0 -w 0.00 -lm 0.33 -d 0.33 -tm 0.10 0.07 0.10 0.07 0.00 -n-best-list run1.best100.out 100 -i my-arabic-input-file > run1.out

It generates run1.best100.out and run1.out, but then chokes with this error message:

    Translation took 0.060 seconds
    Finished translating
    [ERROR] Malformed input at
    Expected input to have words composed of 1 factor(s) (form FAC1|FAC2|...) but instead received input with 2 factor(s).
    Aborted

So I gather somewhere I have a setting wrong, but I cannot figure out where it is. I basically followed the exact same steps with my Arabic-English corpora as in the tutorial, just substituting my own training data. I'm not trying to do factored training at this time. Any advice appreciated. Thanks!
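Miles' fix (renaming bare bar-character tokens) could be sketched as follows. This is a hypothetical helper, not a Moses or MADA script; the sample line is made up:

```python
def debar(line: str, replacement: str = "BAR") -> str:
    """Replace tokens that consist only of '|' characters, which Moses
    would otherwise parse as factor delimiters with empty factors."""
    return " ".join(
        replacement if tok.strip("|") == "" else tok
        for tok in line.split()
    )

print(debar("w+ qAl | >n ||"))   # w+ qAl BAR >n BAR
```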
Re: [Moses-support] different tune set diferent tuned parameters !
there is a large amount of randomness involved with parameter tuning. each time you run it (using the same language resources) you might get different parameters.

also, the parameters are not scaled. this means that one run might give you these values:

    10 20 30

and the next run might give you these ones:

    0.1 0.2 0.3

Miles

On 2 May 2010 09:34, Somayeh Bakhshaei s.bakhsh...@yahoo.com wrote:

Hi All,

A problem: Isn't it true that parameter tuning should capture the structure of the language, so that I would get the same set of tuned parameters with different tuning sets? Why do I get different values for the parameters when I change the tuning set?

Another awful result: I changed my test set, and the BLEU result changed from 19 to 3! How can that be, when there is no overlap between any of the test sets and the training set?!

--
Best Regards,
S.Bakhshaei
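Miles' point that tuned weights are only defined up to scale can be checked by normalising the two example weight vectors (a sketch, not MERT code):

```python
import math

def normalise(weights):
    """Scale a weight vector so its absolute values sum to 1 (L1),
    which makes equivalent tuning solutions comparable."""
    total = sum(abs(w) for w in weights)
    return [w / total for w in weights]

run1 = normalise([10, 20, 30])
run2 = normalise([0.1, 0.2, 0.3])
same = all(math.isclose(a, b) for a, b in zip(run1, run2))
print(same)   # True: both runs encode the same model
```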
Re: [Moses-support] IRSTLM error: converting iARPA to ARPA format
this means you have run out of memory. you can either:

--get more memory
--use less data
--use a lower-order LM
--use RandLM, which can easily handle this amount of data (i am currently building LMs using more than 30 billion words with it, for example)

Miles

On 21 April 2010 09:57, Zahurul Islam zai...@gmail.com wrote:

Hi,

I am trying to build a language model from a large amount of text (13GB). In the step of converting iARPA format to ARPA format I met the following error:

    /tools/irstlm-5.22.01/bin/compile-lm wiki.it.truecase.ilm.gz --text yes wiki.it.lm
    inpfile: wiki.it.truecase.ilm.gz
    dub: 1000
    Reading wiki.it.truecase.ilm.gz... iARPA loadtxt()
    terminate called after throwing an instance of 'std::bad_alloc'
      what(): std::bad_alloc
    /tools/irstlm-5.22.01/bin/compile-lm: line 9: 20328 Aborted $dir/$name $@

Any help to identify or solve this problem would be appreciated. Thank you very much.

Regards,
Zahurul
Re: [Moses-support] Moses-support Digest, Vol 41, Issue 36
a quick question: will this break compatibility with existing training runs?

also, adding new features --even if they are not used-- can impact upon MERT and may slow things down / make things worse. have you verified (using multiple runs) that this new feature doesn't make things worse than before?

Miles

On 28 March 2010 19:46, Lane Schwartz dowob...@gmail.com wrote:

On 28 Mar 2010, at 11:02 AM, moses-support-requ...@mit.edu wrote:

Hiya Mosers and Mosettes,

It's been a year since the last release; there have been lots of changes, by lots of people, that we thought you should know about. A new release tar ball and zip file are on sourceforge, or svn update as usual: https://sourceforge.net/projects/mosesdecoder/

Also, there are likely to be big changes in the next month as we merge the hierarchical/syntax branch into trunk. Please avoid svn up after today, and double check with someone else before committing large chunks of code to the trunk.

Hieu,

I've got a handful of changes from last week that I was planning to merge from my new branch back into trunk tomorrow. The changes pretty much involve adding one new feature, and should not affect anyone not using the new feature. I'll wait for your go-ahead before I do this merge. If there are plans for lots of updates to trunk tomorrow, I could probably do my merge later today (Sunday) instead, if that would help.

Lane
Re: [Moses-support] Dictonary use during training
re: adding dictionary entries, this is certainly a hack, but the standard trick is to pretend that the dictionary actually consists of tiny parallel sentences. you therefore just append each word-entry as a new sentence pair. don't bother with that -d option.

Miles

On 23 February 2010 18:34, maria sol ferrer mariasol.fer...@gmail.com wrote:

Hi all,

I'm wondering if you would know where I can find an English-to-Spanish parallel, word-to-word dictionary to complement my training corpus.

Also, from what I have searched I understand you can either add the dictionary words at the end of the corpus or use the GIZA++ option. I would like to try both, but for the GIZA++ option -d I see that the file format uses the words' ids; where do the real words (from the parallel dictionary) go? In the corpus as well? Or in a separate file?

Any other suggestions for using a dictionary are welcome. Thank you.
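The trick Miles describes, appending each dictionary entry as a one-word "sentence pair", can be sketched like this (the data and function name are made up for illustration):

```python
def append_dictionary(dictionary, src_lines, tgt_lines):
    """Append each (source word, target word) dictionary entry to the
    parallel corpus as a tiny one-word sentence pair."""
    for src_word, tgt_word in dictionary:
        src_lines.append(src_word)
        tgt_lines.append(tgt_word)
    return src_lines, tgt_lines

src, tgt = append_dictionary(
    [("perro", "dog"), ("gato", "cat")],
    ["el perro ladra"],
    ["the dog barks"],
)
print(src)   # ['el perro ladra', 'perro', 'gato']
print(tgt)   # ['the dog barks', 'dog', 'cat']
```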
Re: [Moses-support] skipping incompatible liboolm.a
this is a standard error. you need to build SRILM with 64-bit support (i686-m64).

Miles

On 22 February 2010 11:40, Marce van Velden marcevanvelde...@gmail.com wrote:

Hi,

I get the following error when trying to compile Moses on an Intel 64 PC. What could cause liboolm.a to be incompatible?

    /usr/bin/ld: skipping incompatible /home/marce/srilm64/lib/i686/liboolm.a when searching for -loolm

    ma...@moses:~/moses/trunk$ sudo make
    make all-recursive
    make[1]: Entering directory `/home/marce/moses/trunk'
    Making all in moses/src
    make[2]: Entering directory `/home/marce/moses/trunk/moses/src'
    make all-am
    make[3]: Entering directory `/home/marce/moses/trunk/moses/src'
    make[3]: Nothing to be done for `all-am'.
    make[3]: Leaving directory `/home/marce/moses/trunk/moses/src'
    make[2]: Leaving directory `/home/marce/moses/trunk/moses/src'
    Making all in moses-cmd/src
    make[2]: Entering directory `/home/marce/moses/trunk/moses-cmd/src'
    g++ -g -O2 -L/home/marce/srilm64/lib/i686 -o moses Main.o mbr.o IOWrapper.o TranslationAnalysis.o LatticeMBR.o -L../../moses/src -lmoses -L/usr/include/boost/lib -lboost_thread-mt -loolm -ldstruct -lmisc -lz
    /usr/bin/ld: skipping incompatible /home/marce/srilm64/lib/i686/liboolm.a when searching for -loolm
    /usr/bin/ld: cannot find -loolm
    collect2: ld returned 1 exit status
    make[2]: *** [moses] Error 1
    make[2]: Leaving directory `/home/marce/moses/trunk/moses-cmd/src'
    make[1]: *** [all-recursive] Error 1
    make[1]: Leaving directory `/home/marce/moses/trunk'
    make: *** [all] Error 2

Thanks,
Marce
Re: [Moses-support] Build Moses for translating English to Chinese.
How words are tokenised / segmented etc. is crucial when using small amounts of data. For the vast number of people using Moses (people not training on millions of sentence pairs), this is the kind of thing that needs to be done correctly. It would be a service to extend the Moses tokeniser to deal with languages other than just the ones you mentioned.

Miles

On 11 February 2010 17:51, Christof Pintaske christof.pinta...@sun.com wrote:

Hi,

you may want to have a closer look at tokenizer.perl, which is used for word-breaking. It seems there is some special logic to handle English, French, and Italian, but not much else. I'm not sure if you can or plan to reveal your findings here on the list, but at any rate I'd be very interested to learn how Chinese worked for you.

best regards
Christof

nati g wrote:

Hello,

Do we need any special scripts to build Moses for translating English to Chinese? thanks in advance.
Re: [Moses-support] moses for haitian relief
it looks to me like you have not correctly compiled / installed SRILM.

Miles

2010/1/27 christopher taylor christopher.paul.tay...@gmail.com:

hello everyone!

i'm currently trying to build an instance of moses to support crisiscommons.org's machine translation project (i'm currently the PM). i really want to give moses a spin *but* i'm having issues building it. my build trouble is related to liboolm.a - here's output from my compilation:

    Making all in moses-cmd/src
    make[2]: Entering directory `../mt/moses/moses-cmd/src'
    g++ -g -O2 -L..//mt/srilm/lib/i686 -L..//mt/irstlm//lib/x86_64 -o moses Main.o mbr.o IOWrapper.o TranslationAnalysis.o -L../../moses/src -lmoses -loolm -ldstruct -lmisc -lirstlm -lz
    /usr/bin/ld: skipping incompatible ../mt/srilm/lib/i686/liboolm.a when searching for -loolm
    /usr/bin/ld: cannot find -loolm
    collect2: ld returned 1 exit status
    make[2]: *** [moses] Error 1
    make[2]: Leaving directory `..//mt/moses/moses-cmd/src'
    make[1]: *** [all-recursive] Error 1
    make[1]: Leaving directory `..//mt/moses'
    make: *** [all] Error 2

thanks so much for your help!

chris taylor
Re: [Moses-support] Moses on the iPhone
you should also look at RandLM, as it will enable you to run a language model in a small space.

that aside, i would look hard at pruning the various tables (eg phrase tables, reordering, language models) so you keep just the core that you need. this will make for faster loading etc. note also that you probably shouldn't prune the phrase table for a test set (as is commonly done).

Miles

2010/1/12 Hieu Hoang hieuho...@gmail.com:

hi andrew

some of us have been working on putting moses onto the OLPC http://wiki.laptop.org/go/Projects/Automatic_translation_software which has roughly the same resources as an iphone. We've got it working for reasonable size models.

my advice would be:

1. moses-cmd shows you how to interact with the moses library. For normal decoding, it's quite simple. To make it even more simple for the gui developers, I would create a static library as a replacement for moses-cmd. Call the static library functions from your gui, rather than the moses functions directly.

2. from what i know of ARM development, there are compiler switches to enable fast floating point operations. Make sure these are enabled.

3. the moses library assumes lots of memory, so it caches certain objects. Look through this mailing list to see how to turn caching off.

4. iPhone apps can't run in the background, so it would be best to have instant loading. This is not the case with any of our models, which can take some time to initialize, specifically the phrase table and language models. You may have to write new implementations for them.

5. There may be little-endian/big-endian issues with the binary phrase tables and language models, i.e. you may not be able to create a binary phrase table/LM on your desktop and expect it to work on the iphone.

i think it's definitely doable, but don't expect just to be able to compile and go. sounds like a fun project, let us know how it goes.

On 11/01/2010 17:57, Andrew W. Haddad wrote:

Hello,

My name is Andrew Haddad. I am a Graduate Research Assistant at Purdue University. I have been given the task of getting moses working on the iPhone. The moses package, which we have successfully installed and have running in simulation on the iPhone, will of course not work due to some limitations put forth by Apple. I am going to be forced to cross-compile the moses static library, used in moses-cmd, for the arm and i386 architectures, and then rewrite the functionality of moses-cmd to be used in our application. Do you know of anyone who has attempted something similar, who might be able to explain the process?

--
Sláinte
Andrew W. Haddad
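The table pruning Miles suggests can be sketched generically as keeping only the k best translations per source phrase (a hypothetical helper, not Moses' filter scripts; the toy table is made up):

```python
from collections import defaultdict

def prune_top_k(phrase_pairs, k=20):
    """Keep only the k highest-scoring translations per source phrase,
    shrinking the table for memory-constrained devices."""
    by_source = defaultdict(list)
    for src, tgt, score in phrase_pairs:
        by_source[src].append((score, tgt))
    pruned = []
    for src, cands in by_source.items():
        for score, tgt in sorted(cands, reverse=True)[:k]:
            pruned.append((src, tgt, score))
    return pruned

table = [("the", "le", 0.5), ("the", "la", 0.3), ("the", "les", 0.1)]
print(prune_top_k(table, k=2))   # keeps "le" and "la", drops "les"
```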
Re: [Moses-support] different servers + different time - differentresult?
yes, you can easily get a 1 BLEU-point drop between multiple runs. if you want to do experiments and report BLEU scores then people really need to do multiple runs and report on averages, along with variances.

i think from now on i'm going to start penalising papers i get to review if people don't do something about this (and i do a lot of reviewing ...)

Miles

2010/1/11 李贤华 08lixian...@gmail.com:

hi,

Thanks for your quick response. But will this cause a drop in BLEU of, say, 0.5 points? I think that's too much... I have run my baseline experiments three times, and got three different results. The results for the test set are: 0.2798, 0.2741, 0.2790. The first was run on server1 previously; the second and the third were run recently, the second on server2 and the third on server1. Now I don't know what my baseline is.

Regards,
Lee Xianhua
2010-01-11

From: Miles Osborne
Sent: 2010-01-11 16:12:38
To: 李贤华
Cc: moses-support
Subject: Re: [Moses-support] different servers + different time - different result?

Giza++ and MERT can both produce different results, even when using the same code, corpora etc. This is because multiple solutions exist, and each time you run Moses you find one of these (different) optima.

Miles

2010/1/11 李贤华 08lixian...@gmail.com:

Hi all,

I ran some experiments with moses about half a year ago, and recently I ran them a second time. When I got the results, I was confused, because they're so different from those I got previously. The software I used was not changed - the same version. The corpus is of course the same; I just copied it. And I used the same script to run the experiments, just changed some directories. So I ran the same experiments on two different servers at different times, and got different results. I checked the alignment results (aligned.grow-diag-and-final) and there are a lot of differences. I also checked moses.ini, and the parameters are greatly different. Has anybody ever come across this situation? I'm really confused...

Regards,
Lee Xianhua
2010-01-11
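Reporting averages and variances over multiple runs, as Miles suggests, is a one-liner with the standard library (using the three BLEU scores quoted in the mail):

```python
from statistics import mean, pstdev

bleu_runs = [0.2798, 0.2741, 0.2790]   # the three runs from the mail

print(f"mean BLEU = {mean(bleu_runs):.4f}")   # 0.2776
print(f"std dev   = {pstdev(bleu_runs):.4f}") # 0.0025
```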
Re: [Moses-support] The results of your email commands
randlm is already in a binary format, so there is no extra conversion. loading randomised models faster is not something that we have really looked at.

Miles

2009/12/23 Arda Tezcan arda...@yahoo.com:

Hi,

I would really appreciate it if you could help me with the following question: I was wondering if a LM created with RandLM can be converted into a binary format? Or is there maybe another way of loading the model faster? I know it is possible with IRSTLM and SRILM, but I couldn't find anything about RandLM.

Thank you in advance for your support.

Best regards,
Arda Tezcan
Re: [Moses-support] moses threads compilation problem (with RandLM)
Making RandLM thread-safe is something I've been thinking about. There are a number of bug fixes which need dealing with too, so perhaps at some point I'll push out a new release.

Miles

2009/12/17 Alexander Fraser fra...@ims.uni-stuttgart.de:

Hi Barry and Philipp,

Philipp is correct: multi-threaded moses is unlikely to work with randlm, since the latter uses a (presumably) non-thread-safe cache.

Darn, RandLM would be really useful together with threading because our cluster has low-memory machines with 8 processors. David or Miles, any chance I could convince you to fix this soon?

As regards the compile error, this is due to a recent change in the mbr code, and the fact that we don't have a regression test to pick it up. I should be able to fix the compile error fairly quickly, but at the moment I'm not sure what to do about the regression test. Ideally I'd like to get rid of the separate main for mosesmt, although we'd still have to have compiler switches, which would leave us open to these issues. If you roll back to 2636 then mosesmt should compile fine.

It compiles for me without RandLM fine. BTW, it would be cool to mention -threads in the moses usage message when compiled with threads.

You mentioned some weird use of -DWITH_THREADS - do you mean the failure of (eg) the test of Ngram.h? I think this is due to a different problem with configure.ac, which would explain why I keep seeing errors like "WARNING: Ngram.h: accepted by the compiler, rejected by the preprocessor!"

Ngram.h always fails for me (regardless of whether using threads or not); there is something that causes it to try to invoke a null command:

    configure:5997: checking Ngram.h presence
    configure:6012: -I/home/users6/fraser/statmt/srilm-1.5.7/include conftest.cpp
    ./configure: line 6014: -I/home/users6/fraser/statmt/srilm-1.5.7/include: No such file or directory

The -DWITH_THREADS thing causes other things to fail (only when building with threads), such as getopt.h. These failures make no difference, since all of the things that fail get something like:

    configure:6537: WARNING: getopt.h: accepted by the compiler, rejected by the preprocessor!
    configure:6539: WARNING: getopt.h: proceeding with the compiler's result

See the config.log I posted in my previous message (or let me know if I should send you a copy directly) for more examples.

Cheers, Alex

cheers
Barry

On Wednesday 16 December 2009 16:28, Alexander Fraser wrote:

Hi Barry and other folks,

I'm also having trouble compiling Moses with threads and RandLM; there seems to be a bug in MainMT.cpp? Here is what I am doing:

Get a fresh copy of Moses (I did this on Monday night).

    ./regenerate-makefiles.sh
    ./configure --enable-threads --with-srilm=/home/users6/fraser/statmt/srilm-1.5.7 --with-randlm=/home/users6/fraser/statmt/randlm-v0.11 --with-boost=/home/users6/fraser --with-boost-thread=boost_thread
    make

(The last argument --with-boost-thread is necessary to stop it from picking up the globally installed boost thread library.)

I attach config.log, which makes it through fine (though I think there is some weird use of -DWITH_THREADS in there which might be interesting). I also attach make.log (which only contains the compilation error; I typed make twice). Let me know if I can provide any more info. Thanks a lot for your help!

Cheers, Alex
Re: [Moses-support] Looking for non-CLI tool for aligning parallel text
or even see our own ACL paper from this year, which applies MC techniques correctly: http://aclweb.org/anthology-new/P/P09/P09-1088.pdf

(a problem with the paper you mentioned is that they only ran the sampler for 100 rounds -- that is barely enough to move from the initial distribution)

Miles

2009/10/28 Adam Lopez alo...@inf.ed.ac.uk:

See this paper (which I believe is the current state of the art for direct alignment of phrases) and the references therein: http://aclweb.org/anthology-new/D/D08/D08-1033.pdf

This strand of research goes back at least as far as this paper: http://aclweb.org/anthology-new/W/W02/W02-1018.pdf

On Tue, Oct 27, 2009 at 10:51 PM, Catalin Braescu cata...@braescu.com wrote:

Then I wonder: how can aligning be done automatically for phrases? And what's the accuracy of such a process?

Catalin Braescu

On Wed, Oct 28, 2009 at 12:36 AM, Miles Osborne mi...@inf.ed.ac.uk wrote:

well, alignment is a task that is really done en masse and not sentence-by-sentence. apart from, say, teaching, there isn't really a need for a GUI to do it. (convince me that you are ready to use this to align 8 million sentence pairs and i'd be impressed)

Miles

2009/10/27 Catalin Braescu cata...@braescu.com:

Big thanks for the links! But I have to say I cannot believe my eyes... most of these programs are jar files launched with parameters from the command line... and the way they work could be a textbook for user unfriendliness :-( How can people stand such primitive and bizarre apps? I am not bashing their authors, I am only surprised there weren't any authors of better programs...

Catalin Braescu

On Tue, Oct 27, 2009 at 9:57 PM, Adam Lopez alo...@inf.ed.ac.uk wrote:

There are several of these around. Note that I have not used any of them:

    http://www.cs.utah.edu/~hal/HandAlign/
    http://www.umiacs.umd.edu/~nmadnani/alignment/forclip.htm
    http://www.d.umn.edu/~tpederse/parallel.html
    http://www.let.rug.nl/~tiedeman/Uplug/

Ulrich Germann also demonstrated such an editor at last year's ACL, although it does not seem to be online; perhaps email him.

Adam

On Tue, Oct 27, 2009 at 6:25 PM, Catalin Braescu cata...@braescu.com wrote:

Ok, so what I'm looking for is a non-CLI alignment editor. Any ideas?

Catalin Braescu
Omlulu.com

On Tue, Oct 27, 2009 at 1:41 PM, Catalin Braescu cata...@braescu.com wrote:

I am asking in advance for your forgiveness if my question is trivial (or, rather, the answer). I am looking for a non-CLI tool that a not-very-technical person can use to align 2 documents in different languages. When I say non-CLI I mean anything that has a window and a visual way of handling things: anything from a dual-pane Notepad, a PHP-backed web form, a Java applet, whatever - as in, not a command-line thing; our newly hired PC operators won't be able to handle it. Any suggestions?

Catalin Braescu
Omlulu.com
Re: [Moses-support] Looking for non-CLI tool for aligning parallel text
phrases are not usually directly aligned. instead, words are (this is what Giza++ does for example). phrases are usually extracted using heuristics. the accuracy of word alignment is a function of the number of sentence pairs and also the actual language pair. for example, you need a lot more data to do well at Chinese-English than with Spanish-English. Miles 2009/10/27 Catalin Braescu cata...@braescu.com: Then I wonder how can aligning be done automatically for phrases? And what's the accuracy of such process? Catalin Braescu On Wed, Oct 28, 2009 at 12:36 AM, Miles Osborne mi...@inf.ed.ac.uk wrote: well, alignment is a task that is really done en mass and not sentence-by-sentence. apart from say teaching, there isn't really a need for a GUI to do it. (convince me that you are ready to use this to align 8 million sentence pairs and i'd be impressed) Miles 2009/10/27 Catalin Braescu cata...@braescu.com: Big thanks for the links! But I have to say I cannot believe my eyes... most of these programs are jar files launcged with parameters from the command line... and the way they work could be a textbook for user unfriendliness :-( How can people stand such primitive and bizarre apps? I am not bashing their authors, I am only surprised there weren't any authors of better programs... Catalin Braescu On Tue, Oct 27, 2009 at 9:57 PM, Adam Lopez alo...@inf.ed.ac.uk wrote: There are several of these around. Note that I have not used any of them. http://www.cs.utah.edu/~hal/HandAlign/ http://www.umiacs.umd.edu/~nmadnani/alignment/forclip.htm http://www.d.umn.edu/~tpederse/parallel.html http://www.let.rug.nl/~tiedeman/Uplug/ Ulrich Germann also demonstrated such an editor at last year's ACL, although it does not seem to be online; perhaps email him. Adam On Tue, Oct 27, 2009 at 6:25 PM, Catalin Braescu cata...@braescu.com wrote: Ok, so what I'm looking for is a non-CLI alignment editor. Any ideas? 
Catalin Braescu Omlulu.com

On Tue, Oct 27, 2009 at 1:41 PM, Catalin Braescu cata...@braescu.com wrote: I am asking in advance for your forgiveness if my question is trivial (or, rather, the answer). I am looking for a non-CLI tool that a not-very-technical person can use to align 2 documents in different languages. When I say non-CLI I mean anything that has a window and a visual way of handling things: anything between a dual-pane Notepad, a PHP-backed web form, a Java applet, whatever. as in, not a command line thing - our newly hired PC operators won't be able to handle it. Any suggestions? Catalin Braescu Omlulu.com

___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
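Miles's point in the thread above — that phrases are extracted heuristically from a word alignment rather than aligned directly — can be sketched with the standard consistency criterion: a phrase pair is kept if no alignment link crosses its boundary. A minimal Python sketch (illustrative of the textbook algorithm, not the actual Moses extractor):

```python
def extract_phrases(alignment, src_len, tgt_len, max_len=7):
    """Extract phrase pairs consistent with a word alignment.

    alignment: set of (src_idx, tgt_idx) links (0-based).
    Returns a list of ((s1, s2), (t1, t2)) inclusive span pairs."""
    pairs = []
    for s1 in range(src_len):
        for s2 in range(s1, min(src_len, s1 + max_len)):
            # target positions linked to anything inside [s1, s2]
            tgts = [t for (s, t) in alignment if s1 <= s <= s2]
            if not tgts:
                continue
            t1, t2 = min(tgts), max(tgts)
            # consistency: no link from inside [t1, t2] back outside [s1, s2]
            if all(s1 <= s <= s2 for (s, t) in alignment if t1 <= t <= t2):
                pairs.append(((s1, s2), (t1, t2)))
    return pairs
```

On a toy two-word sentence pair with a diagonal alignment, this yields the two single-word pairs plus the full-sentence pair.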
Re: [Moses-support] How many and/or which language model(s) to use?
you can't supply language models for both directions: you need to supply one for the target language, not the source. Miles

2009/10/22 Ivan Uemlianin i.uemlia...@bangor.ac.uk: Dear All, I am using Moses with irstlm. The language pair I am developing is English and Welsh. I have built language models, and I am now exploring train-factored-phrase-model.perl. My question is: which language model should I supply to the perl script, or should I supply both (I have built a separate language model for each language), and how? Below is the script I'm using (I've wrapped the perl command in a shell script for readability). This script runs without errors, but I should like to know if I'm supplying the language models correctly:

#! /bin/bash
SCRIPTS_ROOTDIR=/path/to/moses_scripts
ROOT_DIR=/path/to/project
FSTEM=project_name
nohup nice $SCRIPTS_ROOTDIR/training/train-factored-phrase-model.perl \
  -scripts-root-dir $SCRIPTS_ROOTDIR \
  -root-dir $ROOT_DIR \
  -corpus $ROOT_DIR/corpws/$FSTEM.tok \
  -f cy \
  -e en \
  -alignment grow-diag-final-and \
  -reordering msd-bidirectional-fe \
  -lm 0:3:$ROOT_DIR/lm_irst/$FSTEM.cy.irstlm.gz:1 \
  -lm 0:3:$ROOT_DIR/lm_irst/$FSTEM.en.irstlm.gz:1 \
  1> $FSTEM.fphm.out \
  2> $FSTEM.fphm.err

Thanks and best wishes Ivan -- Ivan Uemlianin Canolfan Bedwyr Safle'r Normal Site Prifysgol Bangor University BANGOR Gwynedd LL57 2PZ i.uemlia...@bangor.ac.uk ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
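Following Miles's advice, for a Welsh-to-English system (-f cy -e en) only the target-side (English) LM would be passed. A sketch of the corrected tail of Ivan's script (the elided flags are unchanged from his version above):

```shell
nohup nice $SCRIPTS_ROOTDIR/training/train-factored-phrase-model.perl \
  ... \
  -lm 0:3:$ROOT_DIR/lm_irst/$FSTEM.en.irstlm.gz:1 \
  1> $FSTEM.fphm.out \
  2> $FSTEM.fphm.err
```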
Re: [Moses-support] Looking for text corpora
the only other source of lots of parallel data (that I know about) is the LDC: http://www.ldc.upenn.edu/ but this is not free ... Miles

2009/9/6 Catalin Braescu cata...@braescu.com: Thanks, Miles! From your link I got http://www.statmt.org/europarl/ Any other such goodies? Catalin -- Omlulu.com

On Sun, Sep 6, 2009 at 8:13 PM, Miles Osborne mi...@inf.ed.ac.uk wrote: http://www.statmt.org/wmt08/shared-task.html

2009/9/6 Catalin Braescu cata...@braescu.com: Obviously Moses (like any similar tool) is useless without access to a huge amount of translated documents. While I am sure such corpora already exist and are available for free, I can't find them online, so I kindly ask the list colleagues for useful hints. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: [Moses-support] EM Model 1 question
the good thing about probabilities is that they should sum to one (but you can get numerical errors giving you slightly more / less ...) Miles

2009/7/27 James Read j.rea...@sms.ed.ac.uk: Ok. Thanks. I think I understand this now. I also think I have found the bug in the code which was causing the dodgy output. So, in conclusion, would you say that a good automated check to see if the code is working correctly would be to add up the probabilities at the end of the EM iterations and check that they add up to 1 (or slightly less)? James

Quoting Philipp Koehn pko...@inf.ed.ac.uk: Hi, because the final loop in each iteration is:

// estimate probabilities
for all foreign words f
  for all English words e
    t(e|f) = count(e|f) / total(f)

As I said, there are two normalizations: one on the sentence level, the other on the corpus level. -phi

On Mon, Jul 27, 2009 at 10:30 PM, James Read j.rea...@sms.ed.ac.uk wrote: In that case I really don't see how the code is guaranteed to give results which add up to 1.

Quoting Philipp Koehn pko...@inf.ed.ac.uk: Hi, this is LaTeX {algorithmic} code. count($e|f$) += $\frac{t(e|f)}{\text{s-total}(e)}$ means count(e|f) += t(e|f) / s-total(e). So, you got that right. -phi

On Mon, Jul 27, 2009 at 10:18 PM, James Read j.rea...@sms.ed.ac.uk wrote: Hi, this seems to be pretty much what I implemented. What exactly do you mean by these three lines?:

\STATE count($e|f$) += $\frac{t(e|f)}{\text{s-total}(e)}$
\STATE total($f$) += $\frac{t(e|f)}{\text{s-total}(e)}$
\STATE $t(e|f)$ = $\frac{\text{count}(e|f)}{\text{total}(f)}$

What do you mean by $\frac? The pseudocode I was using shows these lines as a simple division, and this is what my code does, i.e. t(e|f) = count(e|f) / total(f). In C code, something like:

for ( f = 0; f < size_source; f++ ) {
  for ( e = 0; e < size_target; e++ ) {
    t[f][e] = count[f][e] / total[f];
  }
}

Is this the kind of thing you mean?
Thanks James

Quoting Philipp Koehn pko...@inf.ed.ac.uk: Hi, I think there was a flaw in some versions of the pseudo code. The probabilities certainly need to add up to one. There are two normalizations going on in the algorithm: one on the sentence level (so the probabilities of all alignments add up to one) and one on the word level. Here is the most recent version:

\REQUIRE set of sentence pairs $(\text{\bf e},\text{\bf f})$
\ENSURE translation prob. $t(e|f)$
\STATE initialize $t(e|f)$ uniformly
\WHILE{not converged}
  \STATE \COMMENT{initialize}
  \STATE count($e|f$) = 0 {\bf for all} $e,f$
  \STATE total($f$) = 0 {\bf for all} $f$
  \FORALL{sentence pairs ({\bf e},{\bf f})}
    \STATE \COMMENT{compute normalization}
    \FORALL{words $e$ in {\bf e}}
      \STATE s-total($e$) = 0
      \FORALL{words $f$ in {\bf f}}
        \STATE s-total($e$) += $t(e|f)$
      \ENDFOR
    \ENDFOR
    \STATE \COMMENT{collect counts}
    \FORALL{words $e$ in {\bf e}}
      \FORALL{words $f$ in {\bf f}}
        \STATE count($e|f$) += $\frac{t(e|f)}{\text{s-total}(e)}$
        \STATE total($f$) += $\frac{t(e|f)}{\text{s-total}(e)}$
      \ENDFOR
    \ENDFOR
  \ENDFOR
  \STATE \COMMENT{estimate probabilities}
  \FORALL{foreign words $f$}
    \FORALL{English words $e$}
      \STATE $t(e|f)$ = $\frac{\text{count}(e|f)}{\text{total}(f)}$
    \ENDFOR
  \ENDFOR
\ENDWHILE

-phi

On Sun, Jul 26, 2009 at 5:24 PM, James Read j.rea...@sms.ed.ac.uk wrote: Hi, I have implemented the EM Model 1 algorithm as outlined in Koehn's lecture notes. I was surprised to find the raw output of the algorithm gives a translation table where, given any particular source word, the sum of the probabilities of each possible target word is far greater than 1. Is this normal? Thanks James

-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support
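Philipp's pseudocode translates almost line-for-line into Python. A minimal sketch, useful for checking the normalization property James asks about (a toy implementation, not the Moses/GIZA++ code):

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    """IBM Model 1 EM, following the pseudocode above.

    corpus: list of (english_words, foreign_words) sentence pairs."""
    e_vocab = {e for es, _ in corpus for e in es}
    # initialize t(e|f) uniformly
    t = defaultdict(lambda: 1.0 / len(e_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # count(e|f)
        total = defaultdict(float)   # total(f)
        for es, fs in corpus:
            # sentence-level normalization: s-total(e) = sum_f t(e|f)
            s_total = {e: sum(t[(e, f)] for f in fs) for e in es}
            # collect fractional counts
            for e in es:
                for f in fs:
                    c = t[(e, f)] / s_total[e]
                    count[(e, f)] += c
                    total[f] += c
        # corpus-level normalization: t(e|f) = count(e|f) / total(f)
        for (e, f) in count:
            t[(e, f)] = count[(e, f)] / total[f]
    return t
```

After training, summing t(e|f) over all observed e for a fixed f gives 1 (up to floating-point error), which is exactly the automated check proposed in the thread.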
Re: [Moses-support] How to create Two-way translator and accelerate.
and don't forget to look at RandLM - this can save you a lot of memory for your language model (a lot more than IRSTLM). plug over! Miles

2009/5/5 Marcin Miłkowski milek...@o2.pl: Jan Helak writes: I have one last question. The final version will be built with approx. 50 MB of Polish text and 50 MB of English text. My computer has 3114632k total memory. Is that enough for SRILM, or will I need to use IRSTLM? Heh, 50 MB is not much, but I doubt it could fit in your memory. It all depends on your data, but you should get something like a 50 MB gzipped giza alignment, with something like 300 MB ungzipped, and the phrase table can be several times bigger. For example, for one project I had 200 MB input files, and got a 1.2 GB gzipped text phrase table. IRSTLM should save more memory, especially if you quantize and binarize. BTW, I find using IRSTLM much less cumbersome than SRI. Regards (and nie ma za co - you're welcome) Marcin ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: [Moses-support] How to create Two-way translator and accelerate.
actually, i think Jan wants a speedup, not a space saving. your best bet is to reduce the size of the beam: http://www.statmt.org/moses/?n=Moses.Tutorial#ntoc6 Miles

2009/5/4 Francis Tyers fty...@prompsit.com: On Mon, 04-05-2009 at 14:54 +0200, Jan Helak wrote: Hello everyone :) I am trying to build a two-way translator for the Polish and English languages as a project for one of my subjects. By now, I have created a one-way translator (Polish-English) as a beta version, but several problems have come up: (1) The translator must work two ways. How do I achieve this? Make another directory and train two models. (2) The time to translate phrases is too long (4 min. for one sentence). How do I accelerate this (decreasing translation quality is acceptable)? You can try filtering the phrase table before translating (see PART V - Filtering Test Data), or using a binarised phrase table (see Memory-Map LM and Phrase Table). http://ufallab2.ms.mff.cuni.cz/~bojar/teaching/NPFL087/export/HEAD/lectures/02-phrase-based-Moses-installation-tutorial.html Regards, Fran ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
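Reducing the beam, as Miles suggests, means tightening the decoder's search parameters. A hypothetical invocation (flag names and values are illustrative assumptions based on the Moses tutorial page linked above; check moses -help for the version in use):

```shell
# Smaller hypothesis stacks and fewer translation options per phrase
# trade some quality for speed (values here are only examples).
moses -f moses.ini -s 50 -ttable-limit 10 < input.txt > output.txt
```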
Re: [Moses-support] How to create Two-way translator and accelerate.
the original question was about speed of decoding, not potential quality improvements due to filtering. clearly, if you can identify phrases to prune then you will get a speed-boost. but this is not true for the general case, and my advice was for the general case. Miles

2009/5/4 Marcin Miłkowski milek...@o2.pl: Miles Osborne writes: filtering etc might give you a speed-up (e.g. a constant one -- less stuff to load) but if filtering is safe w.r.t. the source data, then you shouldn't see much here. (pruning the table should make it faster since there will be fewer options to consider, but this is not safe) Actually, this is contrary to what Johnson et al. say in their paper, and my subjective (not measured) experience was definitely in their favor. As long as you have really clean data, you don't want to lose any of it, but if alignments are lousy, translations ambiguous etc., you want to cut it off, and Jan wants to do that (see his post). I was even filtering more and got better results by heuristically discarding improbable phrases from the phrase table (based on Fran's idea about discarding improbable alignments). Again, this is subjective, anecdotal, etc., but before that I was getting complete garbage. Note: my pairs were English-Polish and Polish-English. i guess you might also see fewer page faults and the like with a smaller model, and that will help matters. btw, quantising and binarising language models helps as well Marcin but in general, the beam size is the most direct way to make it faster. Miles

2009/5/4 Francis Tyers fty...@prompsit.com: On Mon, 04-05-2009 at 14:08 +0100, Miles Osborne wrote: actually, i think Jan wants a speedup, not a space saving. Does filtering the phrase table before translation not decrease the total time to make a translation (including the time taken to load the phrase table etc.)? That was my experience, and it appears to be something that he hasn't done, but perhaps my set up is unusual...
Fran

your best bet is to reduce the size of the beam: http://www.statmt.org/moses/?n=Moses.Tutorial#ntoc6 Miles

2009/5/4 Francis Tyers fty...@prompsit.com: On Mon, 04-05-2009 at 14:54 +0200, Jan Helak wrote: Hello everyone :) I am trying to build a two-way translator for the Polish and English languages as a project for one of my subjects. By now, I have created a one-way translator (Polish-English) as a beta version, but several problems have come up: (1) The translator must work two ways. How do I achieve this? Make another directory and train two models. (2) The time to translate phrases is too long (4 min. for one sentence). How do I accelerate this (decreasing translation quality is acceptable)? You can try filtering the phrase table before translating (see PART V - Filtering Test Data), or using a binarised phrase table (see Memory-Map LM and Phrase Table). http://ufallab2.ms.mff.cuni.cz/~bojar/teaching/NPFL087/export/HEAD/lectures/02-phrase-based-Moses-installation-tutorial.html Regards, Fran ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: [Moses-support] How to create Two-way translator and accelerate.
filtering etc might give you a speed-up (e.g. a constant one -- less stuff to load) but if filtering is safe w.r.t. the source data, then you shouldn't see much here. (pruning the table should make it faster since there will be fewer options to consider, but this is not safe) i guess you might also see fewer page faults and the like with a smaller model, and that will help matters. but in general, the beam size is the most direct way to make it faster. Miles

2009/5/4 Francis Tyers fty...@prompsit.com: On Mon, 04-05-2009 at 14:08 +0100, Miles Osborne wrote: actually, i think Jan wants a speedup, not a space saving. Does filtering the phrase table before translation not decrease the total time to make a translation (including the time taken to load the phrase table etc.)? That was my experience, and it appears to be something that he hasn't done, but perhaps my set up is unusual...

Fran

your best bet is to reduce the size of the beam: http://www.statmt.org/moses/?n=Moses.Tutorial#ntoc6 Miles

2009/5/4 Francis Tyers fty...@prompsit.com: On Mon, 04-05-2009 at 14:54 +0200, Jan Helak wrote: Hello everyone :) I am trying to build a two-way translator for the Polish and English languages as a project for one of my subjects. By now, I have created a one-way translator (Polish-English) as a beta version, but several problems have come up: (1) The translator must work two ways. How do I achieve this? Make another directory and train two models. (2) The time to translate phrases is too long (4 min. for one sentence). How do I accelerate this (decreasing translation quality is acceptable)? You can try filtering the phrase table before translating (see PART V - Filtering Test Data), or using a binarised phrase table (see Memory-Map LM and Phrase Table).
http://ufallab2.ms.mff.cuni.cz/~bojar/teaching/NPFL087/export/HEAD/lectures/02-phrase-based-Moses-installation-tutorial.html Regards, Fran ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: [Moses-support] Results quality when using moses with randlm
there are many factors here. firstly, the randomised LM makes errors as a function of the false positive rate and the value (quantisation) level. roughly, the smaller these parameters are, the smaller your LM will be, but there may be a performance drop.

secondly, the default count-based smoothing methods are only good when you use enormous quantities of data -- look at the Google LM paper, where they show that Stupid Backoff approaches K-N smoothing. if you really want the best performance from moderate amounts of data (50 million lines is small: i have used 1 billion sentences) then you can get SRILM to produce an ARPA file as normal (this is the result of ngram-count). RandLM can convert an ARPA file into a randomised format. what this means is that RandLM will use Kneser-Ney smoothing and, assuming reasonable error rates, your translation performance should be near identical to when using SRILM. Miles

2009/4/16 Michael Zuckerman michael90...@gmail.com: Hi, We used moses with randlm - we took a very big corpus of ~50 million lines for the language model and processed it with randlm. Then we compared the results with moses run with srilm used on a much smaller corpus. Surprisingly, srilm gave much better results (better translation quality), although used on a much smaller corpus. Both LMs ran on 5-grams. These results were repeated on different language pairs (German-English, Russian-English, Spanish-English, etc.). Could you please explain these results? Thanks, Michael. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
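The SRILM-to-RandLM route Miles describes could look roughly like this (a sketch: file names are illustrative, and the buildlm flags are assumptions modelled on the corpus-input example later in this archive; check each tool's usage output for the exact options):

```shell
# 1. Build a Kneser-Ney ARPA model with SRILM's ngram-count.
ngram-count -order 5 -interpolate -kndiscount \
            -text corpus.en -lm model.arpa

# 2. Convert the ARPA file to a randomised LM with RandLM's buildlm.
#    The ARPA input-type flag is an assumption here.
buildlm -struct BloomMap -falsepos 8 -values 8 \
        -input-type arpa -output-prefix model model.arpa
```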
Re: [Moses-support] Error when run moses with lattices format as input
in general, when you compile a C or C++ program, you add the switch -g to the options (usually in a Makefile). this will tell the compiler to add stuff to the program so that it works with gdb. you then do:

gdb moses

and you will see a prompt. you then run moses within that prompt, typing run (followed by the usual moses arguments) instead of the moses binary name. when it crashes, you then type where and you will see the various functions that were called prior to the crash. Miles

2009/4/16 Nguyen Manh Hung manhh...@cl.ics.tut.ac.jp: Sorry Chris. I'm a beginner with moses and C (I usually use perl and java), so I don't know how to run moses in gdb (debugger mode???). I searched but haven't found any guide for it. Could you show me how to do this? Thanks very much, Manh Hung

On 2009-04-16 (Thu) at 11:57 -0400, Chris Dyer wrote: I was actually hoping for a stack trace. That is, run moses in gdb, and then when it crashes, use the where command to show where the crash is. Thanks!

2009/4/16 Nguyen Manh Hung manhh...@cl.ics.tut.ac.jp: Dear Chris, I have included the stack trace as a file. Thanks in advance, Manh Hung

On 2009-04-16 (Thu) at 11:34 -0400, Chris Dyer wrote: Can you send me a stack trace for where the SEGV is happening? Once the phrase table has been binarized, there's no need to have any special temporary space.

On Tue, Apr 28, 2009 at 10:46 AM, Nguyen Manh Hung manhh...@cl.ics.tut.ac.jp wrote: Chris Dyer wrote: You need to add a -weight-i flag to the command line which specifies how much weighting to apply to the arc feature. e.g.: moses ... -weight-i 0.5 -Chris

On Thu, Apr 16, 2009 at 9:58 AM, Nguyen Manh Hung manhh...@cl.ics.tut.ac.jp wrote: Hi, I'm using Moses to decode with lattice format as input. I made the lattice file content by hand. When I run moses with the following command

MOSES_HOME/moses-cmd/src/moses -f config_file.ini -inputtype 2 -input-file input.txt > out.put

these errors came -- Creating lexical reordering...
weights: 0.300 0.300 0.300 0.300 0.300 0.300
Loading table into memory...done.
Created lexical orientation reordering
Start loading LanguageModel /home/manhhung/smt/confusion/data/lm/lm_news_jan.srilm : [112.000] seconds
Finished loading LanguageModels : [114.000] seconds
ERROR: You specified 0 input weights (weight-i), but you specified 1 link parameters (link-param-count)!

Could you please explain these errors for me? Thanks, Manh Hung ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support

@Chris: Ohh, it's running OK now, thank you very much. But... another error message has come: Segmentation fault. I got this error message when I made the binary file type of the phrase table. It seems that the size of /tmp is small, so I added the -T option (--temporary-directory) to resolve it. But among the options of the moses command, I didn't find any such option. What do you think about this error? Thanks in advance, Manh Hung ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
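Miles's gdb recipe condenses to a short session like the following (a sketch; the moses arguments are whatever you normally pass on the command line):

```
$ gdb --args moses -f config_file.ini -inputtype 2 -input-file input.txt
(gdb) run
...
Program received signal SIGSEGV, Segmentation fault.
(gdb) where
```

The output of where (equivalently, backtrace) is the stack trace Chris is asking for.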
Re: [Moses-support] Fetching older versions of moses
assuming the current version hasn't been fixed to deal with the LM problem affecting older versions of gcc:

-- check out the code using SVN as usual, i.e.

svn co https://svn.sourceforge.net/svnroot/mosesdecoder/trunk mosesdecoder

then look at the SVN logs:

svn log | less

find some version which is okay for you (probably the one prior to Chris's changes) and then do:

svn up -r VERSION

where VERSION is the SVN revision number. Miles

2009/3/12 Michael Zuckerman michael90...@gmail.com: Hi, How can I find out what moses version is in use now, and how can I fetch older versions of moses from the repository (what is the command to do this)? Thanks, Michael. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: [Moses-support] How is the final LM score obtained?
a couple of points:

-- you are asking ngram for perplexity scores, but Moses uses log probs
-- Moses will append <s> and </s> pseudo-words to the start and end of a sentence; this will change the probabilities

Miles

2009/3/5 Carlos Henriquez carlo...@gps.tsc.upc.es: Hi all. I'm making some tests extracting the n-best list from moses (the -n-best-list option) with all models' weights set to 1, and I don't understand how you get the final LM score. I'm using srilm. For instance, my best translation from Chinese to English on sentence 9 was

9 ||| after three hours . ||| d: 0 lm: -17.0614 tm: -7.41812 -0.944461 -4.79107 -2.87243 w: -4 ||| -37.0874

but if I run ngram alone with the same output sentence

echo after three hours . | ngram -order 5 -lm ../marie/lm/train.tok.en.lm -ppl -

the result is very different:

file -: 1 sentences, 4 words, 0 OOVs 0 zeroprobs, logprob= -7.40966 ppl= 30.3341 ppl1= 71.1892

I tried some other values from my n-best list and I always found a big difference between the two scores. If my initial weight is 1, why are the scores so different? I suppose I am misunderstanding something. The moses command to obtain the n-best list was

moses -f moses.ini -i ../../corpus/dev.zh -d 1 -tm 1 1 1 1 -lm 1 -w 1 -n-best-list devout.moses.nbest 10 -include-alignment-in-n-best true > devout.moses 2> /dev/null

(yep, I'm not using the last tm weight) and the moses.ini file does not have any weights. -- Carlos A. Henríquez Q. carlo...@gps.tsc.upc.es ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
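The size of the gap here is largely a change of base (an observation not spelled out in the thread): SRILM's ngram reports base-10 log probabilities, while the lm: score in the Moses n-best list is a natural logarithm. Converting Carlos's own numbers makes the two scores line up:

```python
import math

srilm_log10 = -7.40966   # logprob from ngram -ppl (base 10)
moses_lm = -17.0614      # lm: score from the n-best list (natural log)

# converting the base-10 log prob to natural log reproduces the Moses score
converted = srilm_log10 * math.log(10)
print(converted)         # close to -17.0614
assert abs(converted - moses_lm) < 1e-3
```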
Re: [Moses-support] word alignment symmetrisation heuristics
one thing to remember is that the link between AER and BLEU is not obvious; in my view at least, AER-like scores should be treated with skepticism, and the real merit of an alignment approach should be the corresponding translation performance (BLEU etc). can you provide associated BLEU scores for those AER numbers? Miles

2009/3/4 J.Tiedemann j.tiedem...@rug.nl: hi, I'm just wondering if Och's refined heuristic is also implemented in Moses. The grow-diag is not exactly the same, as far as I understand. The reason why I'm asking is that I found that in all of my experiments with europarl data the intersection always produces the best results in terms of AER (for example using the wpt03 data), whereas I see better performance reported for refined compared with intersection in various papers (also for the wpt03 data). However, I cannot believe that the grow heuristics would perform so much worse than the original refined approach. My AER scores with standard GIZA settings and moses heuristics for wpt03 data are the following:

moses.intersect            AER = 0.0613
moses.grow-diag            AER = 0.0843
moses.grow-diag-final-and  AER = 0.0926
moses.grow-diag-final      AER = 0.1312
moses.srctotgt             AER = 0.1039
moses.tgttosrc             AER = 0.1162
moses.union                AER = 0.1444

does this sound reasonable? Jorg ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
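For reference, the AER these numbers report is the Och & Ney measure over sure (S) and possible (P) gold links. A minimal sketch (illustrative, not the evaluation script Jorg used):

```python
def aer(sure, possible, hypothesis):
    """Alignment Error Rate (Och & Ney).

    sure, possible, hypothesis: sets of (src_index, tgt_index) links,
    with sure being a subset of possible."""
    a_s = len(hypothesis & sure)      # |A ∩ S|
    a_p = len(hypothesis & possible)  # |A ∩ P|
    return 1.0 - (a_s + a_p) / (len(hypothesis) + len(sure))

# toy example: a hypothesis identical to the gold alignment has AER 0.0
gold = {(0, 0), (1, 1)}
print(aer(gold, gold, gold))
```

The denominator rewards precision more than recall, which is consistent with Jorg's later observation that AER strongly prefers the high-precision intersection heuristic.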
Re: [Moses-support] word alignment symmetrisation heuristics
yep, that sounds reasonable. in that case it is good to remember that those heuristics are all designed for eventual translation and not for doing well at AER. i can easily imagine some other set of heuristics which would do well at word-alignment-like tasks and not necessarily pan out into good BLEU scores etc. Miles

2009/3/4 J.Tiedemann j.tiedem...@rug.nl: it depends on what you want to do. I was interested in the word alignment in particular, not necessarily for running MT with moses. for SMT I usually just use the default grow-diag-final-and, which probably gives the best input anyway. this is, I guess, because it's better on recall. AER seems to strongly prefer precision. jorg

On Wed, 4 Mar 2009 13:46:36 + Miles Osborne mi...@inf.ed.ac.uk wrote: one thing to remember is that the link between AER and BLEU is not obvious; in my view at least, AER-like scores should be treated with skepticism, and the real merit of an alignment approach should be the corresponding translation performance (BLEU etc). can you provide associated BLEU scores for those AER numbers? Miles

2009/3/4 J.Tiedemann j.tiedem...@rug.nl: hi, I'm just wondering if Och's refined heuristic is also implemented in Moses. The grow-diag is not exactly the same, as far as I understand. The reason why I'm asking is that I found that in all of my experiments with europarl data the intersection always produces the best results in terms of AER (for example using the wpt03 data), whereas I see better performance reported for refined compared with intersection in various papers (also for the wpt03 data). However, I cannot believe that the grow heuristics would perform so much worse than the original refined approach.
My AER scores with standard GIZA settings and moses heuristics for wpt03 data are the following:

moses.intersect            AER = 0.0613
moses.grow-diag            AER = 0.0843
moses.grow-diag-final-and  AER = 0.0926
moses.grow-diag-final      AER = 0.1312
moses.srctotgt             AER = 0.1039
moses.tgttosrc             AER = 0.1162
moses.union                AER = 0.1444

does this sound reasonable? Jorg ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: [Moses-support] Moses on a mac
there is a related bug with randlm which i'm looking at now. whilst i'm doing this, can you verify that it is some mac-specific problem and not, say, something due to the gcc version you are using? Miles

2009/3/4 Kemal Oflazer k...@cs.cmu.edu: Dear All, I just installed moses on a large Mac system and wanted to test out an earlier setup. Training went just fine, but moses dies with

Start loading LanguageModel /Users/oflazer/smt/models/lm/smorph-lm-n5.lm : [0.000] seconds
pure virtual method called
terminate called without an active exception
Abort trap

this seems to be perhaps related to srilm (it does not seem to be loading the file), which is properly installed. Is there anything special to the Mac that I need to be careful about? Thanks Kemal - Kemal Oflazer http://people.sabanciuniv.edu/oflazer/ ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: [Moses-support] Error in running moses with randlm
ok, i'll try to work it out. can you:

-- mail me your moses.ini file
-- mail me the commands you ran to create your language model
-- tell me exactly how much language model data you used and what it is; if it is europarl then that should be ok

Miles

2009/2/24 Michael Zuckerman michael90...@gmail.com: Hi, I am running moses on a small example containing two German sentences (in file in): das ist ein kleines haus / das ist ein kleines haus. I am using the attached randlm language model model.BloomMap, and the attached phrase table and moses.ini files. My command line is:

$ ../../../../mosesdecoder/moses-cmd/src/moses -f moses.ini in out

When loading the language model, moses gives an error:

Defined parameters (per moses.ini or switch):
  config: moses.ini
  input-factors: 0
  lmodel-file: 5 0 3 /home/michez/alfabetic/lm/randlm/test/model.BloomMap
  mapping: T 0
  ttable-file: 0 0 1 phrase-table
  ttable-limit: 10
  weight-d: 1
  weight-l: 1
  weight-t: 1
  weight-w: 0
Added ScoreProducer(0 Distortion) index=0-0
Added ScoreProducer(1 WordPenalty) index=1-1
Added ScoreProducer(2 !UnknownWordPenalty) index=2-2
Loading lexical distortion models... have 0 models
Start loading LanguageModel /home/michez/alfabetic/lm/randlm/test/model.BloomMap : [0.000] seconds
pure virtual method called
terminate called without an active exception
Aborted

Do you have a clue how to handle this error? Thanks, Michael. ___ Moses-support mailing list Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: [Moses-support] Error in RandLM
ok. i just did a clean install of RandLM on a 64-bit machine, running Suse. here is my test:

./buildlm -struct BloomMap -falsepos 8 -values 8 -output-prefix model ../src/README

this produces the expected results, so it must be that somehow there is a difference in either your Unix version or else in tools such as cat etc. so, which version of cat do you have? this is what i have here:

cat --version
cat (GNU coreutils) 6.11
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Torbjorn Granlund and Richard M. Stallman.

Miles

2009/2/19 Miles Osborne mi...@inf.ed.ac.uk:
what happens when you run this? Miles

2009/2/19 Michael Zuckerman michael90...@gmail.com:
Hi, I am using sort (GNU coreutils) 6.10. Here is the full STDERR output, with the command I ran:

$ ../bin/buildlm -struct BloomMap -falsepos 8 -values 8 -output-prefix model -input-type corpus ../../europarl.lower.token.en 2> errors

User defined parameter settings:
  falsepos 8
  input-type corpus
  output-prefix model
  struct BloomMap
  values 8
Default values set in PipelineTool:
  order 3
  tmp-dir /tmp
  input-path ___stdin___
  working-mem 100
  output-dir .
  add-bos-eos 1
  seed 0
Default values set in Builder:
  falseneg 0
  misassign 1
  count-cut-off 1
  memory 0
  maxcount 35
  output-type randlm
  smoothing-param 0.4
Derived parameters settings:
  estimator batch
  smoothing StupidBackoff
Pipeline converting data from corpus to counts
output path = ./model.tokens
output path = ./model.tokens.sorted
output path = ./model.counts.sorted
cat ./model.tokens | cat | sort --compress-program=cat -T /tmp -S 100M -k 1 -k 2 -k 3 -k 4 | cat > ./model.tokens.sorted
cat: invalid option -- d
Try `cat --help' for more information.
[the two lines above repeat dozens of times]
rm ./model.tokens
buildlm: RandLMStats.cpp:312: virtual bool randlm::CountStats::observe(const randlm::Word*, randlm::Value, int): Assertion `len > 0' failed.

Thanks, Michael.

On Thu, Feb 19, 2009 at 4:06 PM, Miles Osborne mi...@inf.ed.ac.uk wrote:
can you post the full STDERR output, along with the command you ran. also, which version of sort are you using? (sort --version) Miles

2009/2/19 Michael Zuckerman michael90...@gmail.com:
Hi, We are trying
Re: [Moses-support] Error in RandLM
that might be it. but i seem to have it working here, using a non-gzipped version of Europarl. in any case, Michael: tell us if it works when the corpus is gzipped.

Miles

2009/2/19 Barry Haddow bhad...@inf.ed.ac.uk:
Hi, I've seen this error before. The short answer is that you need to use a gzipped version of the corpus.

The reason is that randlm uses gzip to decompress/compress when you have a gzipped corpus, which is fine because gzip takes a -d argument for decompressing. If presented with a non-gzipped version of the corpus, randlm attempts to fake gzip with cat, which fails because cat doesn't accept -d. This has come up on the mailing list before, as far as I recall.

regards, Barry

On Thursday 19 February 2009 13:53, Michael Zuckerman wrote:
Hi, We are trying to run RandLM on our files. We use the command:

$ ./buildlm -struct BloomMap -falsepos 8 -values 8 -output-prefix model -input-type corpus ../../europarl.lower.token.en

And we get the following errors:

cat: invalid option -- d
Try `cat --help' for more information.
rm ./model.tokens
buildlm: RandLMStats.cpp:312: virtual bool randlm::CountStats::observe(const randlm::Word*, randlm::Value, int): Assertion `len > 0' failed.
Aborted

Are you familiar with these errors? Do you have an idea about how to solve them?

Thanks, Michael.
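Barry's diagnosis can be reproduced in miniature: gzip understands -d, cat does not, so randlm's "fake gzip with cat" trick breaks on plain-text input. A minimal sketch (filenames are illustrative):

```shell
# gzip accepts -d for decompression; cat rejects it.
printf 'das ist ein kleines haus\n' > corpus.txt
gzip -c corpus.txt > corpus.txt.gz

gzip -d -c corpus.txt.gz      # fine: prints the corpus
cat -d corpus.txt             # fails: cat: invalid option -- 'd'

# the workaround from the thread: hand buildlm the gzipped corpus, e.g.
# ./buildlm -struct BloomMap -falsepos 8 -values 8 \
#           -output-prefix model -input-type corpus corpus.txt.gz
```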
Re: [Moses-support] RandLM compressor cat bug.
ah, ok. i think David hit it on the head: randlm is currently in its very first release and to my knowledge hasn't been extensively tested under various setups. we'll gather together these problems and fix them in the next release.

Miles

2008/12/7 Radek Bartoň xbart...@stud.fit.vutbr.cz:
On Wednesday 03 of December 2008 02:08:05 you wrote:
Hi Radek, Thanks for the patch. What's your system? (I've a feeling that part of RandLM is not very portable). David

Gentoo ~amd64 here; cat is part of the coreutils-6.12-r2 package. Sorry, I accidentally posted this message to David instead of the list first.

--
Ing. Radek Bartoň
Faculty of Information Technology, Department of Computer Graphics and Multimedia
Brno University of Technology
E-mail: xbart...@stud.fit.vutbr.cz Web: http://blackhex.no-ip.org Jabber: black...@jabber.cz
Re: [Moses-support] RandLM compressor cat bug.
which version of Unix are you using?

Miles

2008/11/28 Radek Bartoň [EMAIL PROTECTED]:
Hello. Since there is no RandLM mailing list (at least I haven't found one) I'm posting here. When creating a language model with the cat compressor, buildlm fails (on my system) with the error:

cat: invalid option -- 'd'

The attached patch should fix that.

-- Ing. Radek Bartoň, Brno University of Technology
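Radek's patch itself is not reproduced in the archive. A wrapper along these lines (hypothetical, not the actual patch) has the same effect: behave like cat but silently swallow gzip's -d flag, so a caller that expects a gzip-compatible "compressor" can still invoke it for decompression:

```shell
# fakecat: like cat, but ignores gzip's -d flag
cat > fakecat <<'EOF'
#!/bin/sh
# drop any -d argument, pass everything else through to cat
for a in "$@"; do
  shift
  [ "$a" = "-d" ] || set -- "$@" "$a"
done
exec cat "$@"
EOF
chmod +x fakecat

printf 'hello\n' | ./fakecat -d    # prints "hello" instead of "invalid option"
```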
Re: [Moses-support] translation result change from time to time
it could be due to things like the way ties are broken, floating-point errors and the like.

Miles

2008/11/21 Hieu Hoang [EMAIL PROTECTED]:
that would be worrying. are you sure all parameters are the same? loading the models into memory shouldn't affect the results. there may rarely be differences if you're running on different operating systems, due to floating point operations.

2008/11/21 שי מור יוסף [EMAIL PROTECTED]:
Hello, I found that when I try to translate the same sentences at different times (from the moment the model is loaded into memory) I get different results. Why does this happen? A memory problem, maybe? Or is it related to the loading process of the model? I am using 3 binary models on a strong server (16 GB RAM). I would be happy to get information about this problem, or suggestions for tests to find the cause.
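The floating-point explanation is easy to demonstrate: doubles are not associative, so accumulating the same feature scores in a different order can change a total in its last bits, which is enough to flip a near-tie between two hypotheses. A quick awk check:

```shell
# doubles are not associative: the same three numbers summed in a
# different order give (slightly) different totals.
awk 'BEGIN {
  a = (0.1 + 0.2) + 0.3;
  b = 0.1 + (0.2 + 0.3);
  printf "a = %.17g\nb = %.17g\nequal = %d\n", a, b, (a == b);
}'
# prints equal = 0: the two sums differ in the last bits
```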
[Moses-support] Announcement: RandLM
What is it? RandLM (randomised language modelling) is yet another language model for Moses. However, it is designed to be very space-efficient indeed: depending upon settings, it can represent an SRILM language model in about 1/10 of the space.

The code can be used to estimate LMs either from raw text (similar to SRILM's ngram-count) or else to load pre-built ARPA files. The best compression results are obtained when building LMs from raw text.

You can get the code here: http://sourceforge.net/projects/randlm (This is the first public release and there are sure to be bugs.)

Read the file BUILDING_WITH_MOSES.txt for Moses integration, and README for general information on building the release.

Note that Moses can support SRILM and RandLM LMs at the same time -- just use:

./configure --with-randlm=/path/to/randlm --with-srilm=/path/to/srilm

If you want to read more about this, then look at our ACL and EMNLP papers:

David Talbot and Miles Osborne. Smoothed Bloom Filter Language Models: Tera-Scale LMs on the Cheap. EMNLP, Prague, Czech Republic, 2007. http://www.iccs.informatics.ed.ac.uk/~osborne/papers/emnlp07.pdf

David Talbot and Miles Osborne. Randomised Language Modelling for Statistical Machine Translation. ACL, Prague, Czech Republic, 2007. http://www.iccs.informatics.ed.ac.uk/~osborne/papers/acl07.pdf

Miles
Re: [Moses-support] Significance of BLEU using Multi-bleu
firstly, do MERT and make sure that everything has reasonable parameters!

this is how to think about testing. you are trying to estimate the error of your model (which you trained up in the usual way). when estimating this error, the test set plays the role of the `training set': the more of that material you have, the better your confidence in the estimate of that error. in short, the more test material you use, the more reliable your results will be.

results can vary in both directions -- both up (you got lucky) and down (you are unlucky). increasing the test set size reduces the chances of either of these situations happening. when working with a narrow domain you should need fewer sentences; exactly how few will depend upon what you are doing.

Miles

2008/9/18 Vineet Kashyap [EMAIL PROTECTED]:
Hi Miles, thanks for the fast reply. I am very sure that the testing and training data are different. Also, no optimization has been done using MERT, and the training set is about 8948 sentences. But generally speaking, would testing on a small set of sentences increase the BLEU scores, and is it possible to get good scores with a small corpus when working with a narrow domain? I am doing further testing and will look at corpus size vs BLEU. Thanks, Vineet
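Miles's point about test-set size can be made numerically: the standard error of a mean shrinks like 1/sqrt(n), so quadrupling the test set roughly halves the noise in a corpus-level score estimate. A toy illustration with deterministic pseudo-scores (stand-ins for per-sentence quality scores, not real BLEU components):

```shell
# standard error of the mean shrinks as the sample (test set) grows
awk 'BEGIN {
  for (n = 100; n <= 10000; n *= 10) {
    sum = 0; sumsq = 0;
    for (i = 1; i <= n; i++) {
      x = (i * 7919 % 1000) / 1000;   # deterministic pseudo-score in [0,1)
      sum += x; sumsq += x * x;
    }
    mean = sum / n;
    se = sqrt((sumsq / n - mean * mean) / n);
    printf "n=%5d  mean=%.3f  stderr=%.4f\n", n, mean, se;
  }
}'
```

The printed standard error drops by about a factor of sqrt(10) at each step, which is exactly why a larger test set gives a more reliable result.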
[Moses-support] Fwd: Fwd: Moses: Prepare Data, Build Language Model and Train Model
(my message bounced as it was too long ... here is a truncated version) Miles

-- Forwarded message --
From: Miles Osborne [EMAIL PROTECTED]
Date: 2008/8/14
Subject: Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model
To: Llio Humphreys [EMAIL PROTECTED]
Cc: moses-support moses-support@mit.edu

building language models (using, for example, ngram-count) is computationally expensive. from what you tell the list, it seems that you don't have enough physical memory to run it properly. you have a number of options:

-- specify a lower-order model (eg 4 rather than 5, or even 3); depending upon how much monolingual training material you have, this may not produce worse results, and it will certainly run faster and require less space.
-- divide your language model training material into chunks and run ngram-count on each chunk. this is one strategy for building LMs using all of the Gigaword corpus (when you don't have access to a 64-bit machine). here you would create multiple LMs.
-- use a disk-based method of creating them. we have done this, and basically it trades speed for memory.
-- take the radical option and simply don't bother smoothing at all (ie use Google's stupid backoff). this makes training LMs trivial -- just compute the counts of ngrams and work out how to store them. i reckon it should be possible to do this and create an ARPA file suitable for loading into SRILM.
-- buy more machines.

Miles
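The "divide into chunks" strategy above can be mimicked with standard tools. The sketch below uses unigram counts only and a two-line toy corpus to keep it short; real ngram-count does far more (higher orders, discounting, backoff), but the memory-saving shape of split / count per chunk / merge is the same:

```shell
# chunked counting, in miniature
printf 'das ist ein kleines haus\ndas ist ein haus\n' > corpus.txt
split -l 1 corpus.txt chunk.          # one chunk per line, for illustration

for c in chunk.*; do
  # unigram counts for one chunk (order 1 only, to keep the sketch short)
  tr ' ' '\n' < "$c" | sort | uniq -c > "$c.counts"
done

# merge the partial counts: sum the per-chunk counts for each word
cat chunk.*.counts | awk '{ n[$2] += $1 } END { for (w in n) print n[w], w }' | sort -k2
# merged counts:
#   2 das
#   2 ein
#   2 haus
#   2 ist
#   1 kleines
```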
Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model
building language models (using, for example, ngram-count) is computationally expensive. from what you tell the list, it seems that you don't have enough physical memory to run it properly. you have a number of options:

-- specify a lower-order model (eg 4 rather than 5, or even 3); depending upon how much monolingual training material you have, this may not produce worse results, and it will certainly run faster and require less space.
-- divide your language model training material into chunks and run ngram-count on each chunk. this is one strategy for building LMs using all of the Gigaword corpus (when you don't have access to a 64-bit machine). here you would create multiple LMs.
-- use a disk-based method of creating them. we have done this, and basically it trades speed for memory.
-- take the radical option and simply don't bother smoothing at all (ie use Google's stupid backoff). this makes training LMs trivial -- just compute the counts of ngrams and work out how to store them. i reckon it should be possible to do this and create an ARPA file suitable for loading into SRILM.
-- buy more machines.

Miles

2008/8/14 Llio Humphreys [EMAIL PROTECTED]:
Dear Murat, Anung, Hieu, Josh, Eric, Miles, Sara, Amittai, thank you all for your help. It is very, very much appreciated. I decided to try Eric's packages, and it looks like the installation worked. I typed some of the commands in the Baseline instructions without arguments, and the program either output to the screen that I had missed some arguments or gave a description of the program. Thank you Eric!!!

Following the Baseline instructions (http://www.statmt.org/wmt08/baseline.html), I have now got to the following step. Use SRILM to build a language model:

/path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount -text working-dir/lm/europarl.lowercased -lm working-dir/lm/europarl.lm

In my case, I was in the folder home/llio/MOSESMTDATA.
I didn't know the path to ngram-count, but it was possible to invoke it without the path:

ngram-count -order 5 -interpolate -kndiscount -text europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm

I'm concerned about two things:

1) this ngram-count step is taking a very long time. I think I started it off around 6pm yesterday, but it's still going. It's very resource-intensive, and it's difficult to get to other open windows. I went to check up on it around 9pm and couldn't find that particular terminal. I thought I had closed that terminal by mistake, so I stupidly opened another one and entered the same command. I subsequently found that the original terminal was still open, so I closed the second one. I'm not sure if issuing this command a second time, on the same program and files but in a different terminal, would corrupt the original ngram-count step, and whether I should start it off again, or whether starting it off again would make things worse?

I looked up ngram-count (http://www.speech.sri.com/projects/srilm/manpages/ngram-count.1.html) and I don't think it outputs to any file, so I guess you have to be in the same terminal to do the next step? I opened another terminal and typed 'top' to see what processes are running, and I know that ngram-count is doing something, but whether it's doing well or stuck in a loop, I can't say. What I do find strange is that the time for ngram-count is said to be 00:58:20, and it's been going for hours. I searched for this problem in previous Moses group emails, and I understand that if I run this with order 4 instead of 5 it will run quicker with very similar results? So, can I just stop what it's doing and run this command in the same terminal with order 4? Are there any files I need to 'touch' to ensure that it doesn't leave any stone unturned?
2) how to do the next step:

bin/moses-scripts/scripts-MMDD-HHMM/training/train-factored-phrase-model.perl -scripts-root-dir bin/moses-scripts/scripts-MMDD-HHMM -root-dir working-dir -corpus working-dir/corpus/europarl.lowercased -f fr -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:5:working-dir/lm/europarl.lm:0

I assume that, like ngram-count, I can just type in train-factored-phrase-model.perl without the full path... Do I need to set the -scripts-root-dir parameter? Are all the scripts in the same place?

Thank you, Llio

On 8/14/08, Murat ALPEREN [EMAIL PROTECTED] wrote:
Dear Llio, you should finally be okay with installing Moses if you have installed all the dependent packages beforehand. I am not aware of the 'whereis' command, but once you train your model, the moses.ini file which is created by the training script will take care of the paths. However, you should supply the paths carefully while training your model. Before training your model, you should have two separate corpus files which are lowercased, sentence aligned and
Re: [Moses-support] Moses: Prepare Data, Build Language Model and Train Model
an ugly hack is to simply create a soft link to the i686-m64 directory (as i recently did on a new 64-bit machine)

Miles

2008/8/13 Sara Stymne [EMAIL PROTECTED]:
Hi! When we installed SRILM and Moses on our 64-bit Ubuntu machine, we had some trouble getting the machine type right. What solved it in the end was to hack the machine-type script (found in srilm/sbin) so that it gave the correct machine type, changing i686 to i686-m64:

else if (`uname -m` == x86_64) then
  set MACHINE_TYPE = i686-m64
  #set MACHINE_TYPE = i686

After doing that we could compile SRILM without specifying the MACHINE_TYPE. /Sara

Llio Humphreys skrev:
Dear Josh, thanks for the links. I had already found this information, and it helped me compile SRILM on the Mac. Here, the problem was finding the most appropriate Makefile for the Linux/Ubuntu machine I'm working on: AMD Athlon X2 dual core, x86_64. $SRILM/common.Makefile.i686_m64 seemed the most appropriate, and the CC and CXX variables are correct, but I still ended up with a lot of errors, unfortunately. Llio

On Wed, Aug 13, 2008 at 1:46 PM, Josh Schroeder [EMAIL PROTECTED] wrote:
You can also check out the SRILM documentation: http://www.speech.sri.com/projects/srilm/manpages/ FAQ: http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html Or search the SRILM mailing list archives: http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/ -Josh

On 13 Aug 2008, at 13:37, Anung Ariwibowo wrote:
Hi Llio, I can compile SRILM on Linux Ubuntu without problems. Can you post the error message here? Maybe we can help. Cheers, Anung

On Wed, Aug 13, 2008 at 8:29 PM, Llio Humphreys [EMAIL PROTECTED] wrote:
Dear Josh/Hieu, many thanks for your replies. The default shell is bash, and updating the .profile file worked - thanks for that tip. I look forward to hearing more from you about the ./model/extract.0-0.o.part* problem.
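Miles's "ugly hack" is easy to picture: if tools look for SRILM binaries under one machine-type name but they were built under another, a soft link makes both paths resolve to the same directory. A miniature sketch (the directory names mirror the thread, but the tree here is illustrative, not an actual SRILM checkout):

```shell
# make a missing machine-type directory name resolve to the real one
mkdir -p srilm/bin/i686-m64
touch srilm/bin/i686-m64/ngram-count      # stand-in for the real binary

ln -s i686-m64 srilm/bin/i686             # i686 now points at i686-m64
ls srilm/bin/i686/ngram-count             # found through the link
```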
My apologies for my ignorance of Unix matters: I'd like to think of myself as a newbie rather than one who is averse to learning about these things, and the further information you have provided has been useful and interesting. Hieu mentioned that Anung Ariwibowo got Moses to work when he transferred to a Linux machine. A colleague has kindly let me borrow a Linux/Ubuntu machine, but I have already run into problems compiling SRILM! So, I'll see if Eric Nichols's packages will take care of that: http://cl.naist.jp/~eric-n/ubuntu-nlp/dists/feisty/nlp/

Best regards, Llio

On 8/13/08, Josh Schroeder [EMAIL PROTECTED] wrote:
Hi Llio, you may have already received my email on the following problem when building the language model:

Executing: cat ./model/extract.0-0.o.part* > ./model/extract.0-0.o
cat: ./model/extract.0-0.o.part*: No such file or directory
Exit code: 1

That's building the phrase table, not the language model. It seems like several people on the list are having problems with this step, so I'm going to take a look at the training process and post something to the list in the next day or two.

1. You mention that Moses does not use environment variables. However, in order to get SRILM to work, I found it necessary to create environment variables and pass these on to SRILM's make:

make SRILM=$PWD MACHINE_TYPE=macosx PATH=/bin:/sbin:/usr/bin:/usr/sbin:/Users/lliohumphreys/MT/MOSESSUITE/srilm:/Users/lliohumphreys/MT/MOSESSUITE/srilm/bin:/Users/lliohumphreys/MT/MOSESSUITE/srilm/bin/macosx:/sw/bin/gawk MANPATH=/Users/lliohumphreys/MT/MOSESSUITE/srilm/man LC_NUMERIC=C

In addition, I was also required to type in the following command for moses-scripts:

export SCRIPTS_ROOTDIR=/Users/lliohumphreys/MT/MOSESSUITE/bin/moses-scripts/scripts-20080811-1801

Sorry, I should have been more clear.
Moses itself, the decoder that loads a trained phrase table and language model and translates text, is a self-contained command-line program that doesn't require environment variables.

Your first example is compiling SRILM. This is not part of the Moses toolkit: it's a toolkit of its own for language modelling and a ton of other stuff. We use it as one of two possible integrated language models (the other is IRSTLM) with Moses.

Your second example is part of the training regime. Yes, there is some use of SCRIPTS_ROOTDIR in train-factored-phrase-model.perl, but for most training support scripts that come with Moses there is a flag that lets you specify SCRIPTS_ROOTDIR on the command line instead of storing it as an environment variable. In train-factored-phrase-model it's -scripts-root-dir, which I think you've actually used in one of your other emails.

If I open a new terminal and echo these variables, most of them are blank, and PATH just