Re: [Moses-support] RandLM make Error

2014-11-20 Thread Miles Osborne
LDHT is not really supported, but looking at your error message it seems
that you need to install Google Sparse Hash.
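(On Debian/Ubuntu systems, Google Sparse Hash is typically packaged as
libsparsehash-dev, though the package name may vary by release.)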

On Wed Nov 19 2014 at 12:47:27 PM Hieu Hoang hieu.ho...@ed.ac.uk wrote:

 There is a script within the randlm project that compiles just the parts of
 the library needed to integrate it into Moses.

 https://sourceforge.net/p/randlm/code/HEAD/tree/trunk/manual-compile/compile.sh
 It's been a while since people have asked about RandLM; I'm not sure who's
 still using it or who has the time & experience to take care of it.

 On 19 November 2014 11:50, Achchuthan Yogarajah achch1...@gmail.com
 wrote:

 Hi Everyone,

 When I build RandLM with the following command:
 make
 I get the following errors:

 Making all in RandLM
 make[1]: Entering directory `/home/achchuthan/randlm-0.2.5/src/RandLM'
 make[1]: Nothing to be done for `all'.
 make[1]: Leaving directory `/home/achchuthan/randlm-0.2.5/src/RandLM'
 Making all in LDHT
 make[1]: Entering directory `/home/achchuthan/randlm-0.2.5/src/LDHT'
 /bin/bash ../../libtool  --tag=CXX   --mode=compile g++ -DHAVE_CONFIG_H
 -I. -I../..  -I./  -fPIC -Wno-deprecated -Wall -ggdb -DTIXML_USE_TICPP -g
 -O2 -MT libLDHT_la-Client.lo -MD -MP -MF .deps/libLDHT_la-Client.Tpo -c -o
 libLDHT_la-Client.lo `test -f 'Client.cpp' || echo './'`Client.cpp
 libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I../.. -I./ -fPIC
 -Wno-deprecated -Wall -ggdb -DTIXML_USE_TICPP -g -O2 -MT
 libLDHT_la-Client.lo -MD -MP -MF .deps/libLDHT_la-Client.Tpo -c Client.cpp
 -fPIC -DPIC -o .libs/libLDHT_la-Client.o
 In file included from Client.cpp:6:0:
 Client.h:8:34: fatal error: google/sparse_hash_map: No such file or
 directory
  #include <google/sparse_hash_map>
   ^
 compilation terminated.
 make[1]: *** [libLDHT_la-Client.lo] Error 1
 make[1]: Leaving directory `/home/achchuthan/randlm-0.2.5/src/LDHT'
 make: *** [all-recursive] Error


 --


 Thanks & Regards,
 Yogarajah Achchuthan
 [ LinkedIn http://lk.linkedin.com/in/achchuthany/ Twitter
 https://twitter.com/achchuthany Facebook
 https://www.facebook.com/achchuthany ]






 --
 Hieu Hoang
 Research Associate
 University of Edinburgh
 http://www.hoang.co.uk/hieu




Re: [Moses-support] embeddings

2014-07-02 Thread Miles Osborne
I would model them as feature functions over phrases. You might imagine
that you can exploit vector similarity to do smoothing.

Good luck

Miles


Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52

2014-05-30 Thread Miles Osborne
this perl snippet:

$line =~ tr/\040-\176/ /c;
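with the /c (complement) modifier, this replaces every byte outside the
printable ASCII range \040-\176 (space through tilde) with a space. as a
one-liner (hypothetical file names; -l strips and restores the trailing
newline, which would otherwise be replaced too):

perl -lpe 'tr/\040-\176/ /c' < input.txt > output.txt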

On 30 May 2014 12:17,  moses-support-requ...@mit.edu wrote:
 Message: 1
 Date: Fri, 30 May 2014 16:24:30 +0100
 From: Hieu Hoang hieu.ho...@ed.ac.uk
 Subject: [Moses-support] removing non-printing character

 does anyone have a script/program that can remove all non-printing
 characters?

 I don't care if it's fast or slow, as long as it ABSOLUTELY removes all
 non-printing chars.

 --
 Hieu Hoang
 Research Associate
 University of Edinburgh
 http://www.hoang.co.uk/hieu





Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52

2014-05-30 Thread Miles Osborne
it is trivial to change it to use, say, a ? mark instead.

but I'm not sure what you want as output now.  the original request
was for removing non-printable characters, which the Perl does.

Miles

On 30 May 2014 12:43, Hieu Hoang hieu.ho...@ed.ac.uk wrote:
 forgot to say. The input is utf8. The snippet turns
gonzález
 to
gonz lez


 On 30 May 2014 17:22, Miles Osborne mi...@inf.ed.ac.uk wrote:

 this perl snippet:

 $line =~ tr/\040-\176/ /c;



Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52

2014-05-30 Thread Miles Osborne
for those specific characters:

perl -C -pe 's/\x{200B}//g' tmp/baa

but as Lane mentions, you probably need to somehow specify the set of
naughty characters you need to deal with.

Miles
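
following Lane's suggestion below of working from Unicode character
classes, one possible approximation (a sketch, assuming UTF-8 input and
output) is to replace every character in the Unicode Other category
(control, format, private use, unassigned) except tab with a space:

perl -CSD -lpe 's/[^\P{C}\t]/ /g' < in.utf8 > out.utf8

this catches U+200B, which is a format (Cf) character.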

On 30 May 2014 13:23, Lane Schwartz dowob...@gmail.com wrote:
 We also used charlint. It might do what you want.

 On Fri, May 30, 2014 at 1:21 PM, Lane Schwartz dowob...@gmail.com wrote:
 As far as I know, no such general purpose tool exists. We wrote a
 custom in-house script that removes many, but not all, possible
 non-printing Unicode characters as part of our WMT submission.

 I am interested in writing one, though.

 I think the right way to do this would be to parse the Unicode
 character database for all characters of certain classes, and build
 the tool from that data.

 Lane


 On Fri, May 30, 2014 at 1:01 PM, Hieu Hoang hieu.ho...@ed.ac.uk wrote:
 in the attached file, there are 2 or more non-printing chars on the 1st
 line, between the words 'place' and 'binding'. They should be
 removed/replaced with a space. Those chars are deleted by parsers, making
 the word alignments incorrect and crashing the extract step.

 The 2nd line is perfectly good utf8. It shouldn't be touched.

 just another friday nlp malaise




Re: [Moses-support] Perplexity KenLM

2014-05-16 Thread Miles Osborne
you can get kenlm to report perplexity as follows:

bin/query foo.arpa < text | tail -n 1

note that you need to be careful with OOVs if you are comparing models
that do not all use the same vocabulary.

(SRILM is broken in this respect, in that an OOV will give you a
probability of one)

Miles
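
if you only have a total log10 probability and a token count, perplexity
is 10^(-total/N). a quick sanity check with made-up numbers (how tokens
are counted must match the toolkit's convention, e.g. whether
end-of-sentence markers are included):

perl -e 'my ($logprob, $n) = @ARGV; print 10 ** (-$logprob / $n), "\n";' -- -12345.6 5000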



Re: [Moses-support] about testing on part of training dataset

2013-12-21 Thread Miles Osborne
SMT systems such as Moses do not guarantee that they can reproduce
the training set.  For example, phrases might be pruned due to their
frequencies being too low, not all words might be aligned, the
decoder might discard the true translation during search, and so on.

This doesn't really have much to do with Indian languages per se;
instead, it is the way that systems are built in general.

Miles


Can anyone please tell me why we get a low BLEU score on a test set
taken from the training set, for sparse-resourced languages like Indian
languages.



Re: [Moses-support] incremental training

2013-10-30 Thread Miles Osborne
Incremental training in Moses is based upon work we did a few years back:

http://homepages.inf.ed.ac.uk/miles/papers/naacl10b.pdf

Table 3 shows that there is essentially no quality difference between
incremental training and standard GIZA++ training.  Incremental
(re)training is a lot faster.

Miles



Re: [Moses-support] compile error with LDHT in randlm

2013-09-25 Thread Miles Osborne
If I recall correctly, the decoder was modified to allow batching of LM requests.

Miles

On 25 September 2013 10:22, Hieu Hoang hieuho...@gmail.com wrote:
 I'm not sure how to compile LDHT, but when I compiled randlm from svn, I had
 to change 2 minor things to get it to compile on my Mac:
   1. src/RandLM/Makefile.am: boost_thread -> boost_thread-mt
   2. autogen.sh: libtoolize -> glibtoolize

 Also, the distributed LM was supported in Moses v1. However, it has been
 deleted from the current Moses in the git repository. I will try and re-add
 it if a multi-pass, asynchronous decoding framework can be created. If
 you're interested in doing this, I would be very glad to help you

 On 24/09/2013 11:51, Hoai-Thu Vuong wrote:


 Hello

 I built LDHT in randlm and got some errors that look like:

 MurmurHash3.cpp:81:23: warning: always_inline function might not be
 inlinable [-Wattributes]
 MurmurHash3.cpp:68:23: warning: always_inline function might not be
 inlinable [-Wattributes]
 MurmurHash3.cpp:60:23: warning: always_inline function might not be
 inlinable [-Wattributes]
 MurmurHash3.cpp:55:23: warning: always_inline function might not be
 inlinable [-Wattributes]
 MurmurHash3.cpp: In function 'void MurmurHash3_x86_32(const void*, int,
 uint32_t, void*)':
 MurmurHash3.cpp:55:23: error: inlining failed in call to always_inline
 'uint32_t getblock(const uint32_t*, int)': function body can be overwritten
 at link time


 I attach the full error log here. My compiler is g++ 4.7 and the OS is Ubuntu
 Server 64-bit 13.04: a clean install on which I then installed the required
 packages such as git, build-essential, libtool, autoconf, Google Sparse Hash,
 and Boost threads. With the same source code I compile successfully with g++
 4.6 on Ubuntu 64-bit 12.04.

 I googled for a solution, and one suggestion was to change this line (in
 MurmurHash3.cpp):

 #define FORCE_INLINE __attribute__((always_inline))

 to

 #define FORCE_INLINE inline __attribute__((always_inline))

 Doing this, I get past that error; however, I then receive another error:
 ::close(m_sd) not found, in the destructor ~TransportTCP().
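
 (A likely fix, inferred from the error message rather than confirmed in this
 thread: newer g++/glibc headers no longer pull in unistd.h transitively, so
 adding #include <unistd.h>, which declares ::close(), to the file that
 defines ~TransportTCP() may resolve it.)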




 --
 Thu.




Re: [Moses-support] compile error with LDHT in randlm

2013-09-25 Thread Miles Osborne
have a look at:

SearchNormalBatch.h

in the source

Miles

On 25 September 2013 10:34, Lane Schwartz dowob...@gmail.com wrote:
 Miles,

 I heard that rumor as well. If anyone could point me to any
 documentation that describes how to do this, I would be interested in
 trying out this functionality.

 Cheers,
 Lane

 On Wed, Sep 25, 2013 at 10:24 AM, Miles Osborne mi...@inf.ed.ac.uk wrote:
 If I recall correctly, the decoder was modified to allow batching of LM requests.

 Miles



Re: [Moses-support] moses decoder android and ios porting

2012-11-27 Thread Miles Osborne
For a long time now I've wanted to see Moses on a small device. Apart from
all of the extra functionality that isn't needed, one would also need to
work on shrinking the phrase table and perhaps also the search graph.
 KenLM / RandLM already deal with making the language model smaller.

An interesting research question would be as follows:  can we frame
decoding on a small device in terms of a budget and optimise that budget?
 We normally don't bother thinking this way and instead focus entirely on
quality.  But it might be possible to instead establish a better connection
between the amount of space / search used and quality than we have already.
I'm not sure if this is just a matter of fiddling with the beam size etc.
Evidence seems to suggest that this doesn't always give the expected
behaviour (ie the relationship between BLEU and beam size isn't linear).

Miles



Re: [Moses-support] Including new features in moses decoding

2012-07-25 Thread Miles Osborne
this is a fairly typical result for MERT.  i notice you are using
MIRA, which is claimed to be more reliable.  see

http://www.aclweb.org/anthology/N/N09/N09-1025.pdf

note that getting MIRA to work takes a lot of tweaking, so read the
fine print carefully

Miles

On 25 July 2012 17:24, Cristina cristi...@lsi.upc.edu wrote:

 Dear all,

 We are doing some experiments by adding new features at phrase level in
 the translation table. We have done a first experiment to see the effects
 and they are quite weird:

  * We build a translation table with 9 features and a similar translation
 table with 18 features (the same 9 features + 9 new features)

  * We run MERT (or MIRA) on a dev set using the first translation table (9
 features)

  * We translate a test set with 2 configurations:
   - MERT on 9 features using the translation table with 9 features
   - MERT on 9 features using the translation table with 18 features (9 +
 9) where the weight for the 9 extra features is set to 0

 We lose more than 3 points of BLEU with the second configuration with
 respect to the first one. (Using MERT on the 18 features gives similar
 results to the second configuration)

 Does anyone know if there is some penalty when adding more features? Or
 has anyone encountered the same problem?
 Thanks in advance!

 Best,

  Cristina


Re: [Moses-support] Including new features in moses decoding

2012-07-25 Thread Miles Osborne
if you have non-zero feature values at training time, but they become
zero at test time, then you may have a problem.

the reason for this is that all weights are optimised together.  you
can think of this as the system trying to work out how best to
translate, using everything. if some are zeroed afterwards, then you are
forcing the rest to do work that they were not optimised for.

Miles
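
to make the expectation concrete: the model score is a weighted sum over
feature values, and zero-weighted features drop out of it, so in principle
the ranking of hypotheses should not change. a toy sketch with made-up
numbers:

my @w = (0.3, -0.2, 0.5, 0.0, 0.0);   # the last two are new features, zeroed
my @f = (1.2,  0.7, 2.0, 9.9, 3.4);   # made-up feature values for one hypothesis
my $score = 0;
$score += $w[$_] * $f[$_] for 0 .. $#w;
printf "%.2f\n", $score;              # 1.22, same as with the first three alone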

On 25 July 2012 17:51, Cristina cristi...@lsi.upc.edu wrote:

 Thanks for the quick answer!

 I think that the problem here cannot be in the development step, it
 must be more related to decoding.

 Regardless of the way the weights are estimated, the translation changes when
 I add new features with zero weight (not in development but in test). They
 shouldn't contribute to the score of the final translation, right?

 Cristina




Re: [Moses-support] Including new features in moses decoding

2012-07-25 Thread Miles Osborne
then something is wrong

Miles

On 25 July 2012 19:42, Cristina cristi...@lsi.upc.edu wrote:
 mmm... but the others were optimised all together, without the new ones to
 which I'm giving weight zero...



Re: [Moses-support] Fwd: a question about moses

2012-05-01 Thread Miles Osborne
The standard way to do this is to pretend that each word pair in the
dictionary is a little sentence. Append these pairs to the usual parallel
corpus and train with Giza.

Miles
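
A minimal sketch of that recipe (hypothetical file names; assumes a
tab-separated dictionary with one source<TAB>target pair per line):

use strict;
use warnings;

# append each dictionary entry as a one-line "sentence" to both sides
# of the parallel corpus, then rerun Giza on the extended corpus
open my $src, '>>', 'corpus.src' or die $!;
open my $tgt, '>>', 'corpus.tgt' or die $!;
while (<STDIN>) {
    chomp;
    my ($s, $t) = split /\t/, $_, 2;
    next unless defined $t and length $s;   # skip malformed lines
    print $src "$s\n";
    print $tgt "$t\n";
}

Run as: perl append_dict.pl < dict.tsv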
On May 1, 2012 5:53 PM, Abby Levenberg leven...@gmail.com wrote:

 Hi,

 I assume the answer is no but wanted to be sure.

 Thanks,
 Abby

 -- Forwarded message --
 From: Niraj Aswani nirajasw...@gmail.com
 Date: Tue, May 1, 2012 at 4:25 PM
 Subject: a question about moses
 To: Abby Levenberg leven...@gmail.com


 hi Abby,

 I hope you are fine.  I am running a moses experiment on my system and
 wanted to know how can you supply an external dictionary to support the
 translation model?  Is there a way to do it?

 Regards,
 Niraj






Re: [Moses-support] Higher BLEU/METEOR score than usual for EN-DE

2012-04-26 Thread Miles Osborne
Very short sentences will give you high scores.

Also multiple references will boost them

Miles
On Apr 26, 2012 8:13 PM, John D Burger j...@mitre.org wrote:

 I =think= I recall that pairwise BLEU scores for human translators are
 usually around 0.50, so anything much better than that is indeed suspect.

 - JB

 On Apr 26, 2012, at 14:18 , Daniel Schaut wrote:

  Hi all,
 
 
  I’m running some experiments for my thesis and I’ve been told by a more
 experienced user that the achieved scores for BLEU/METEOR of my MT engine
 were too good to be true. Since this is the very first MT engine I’ve ever
 made and I am not experienced with interpreting scores, I really don’t know
 how to reflect them. The first test set achieves a BLEU score of 0.6508
 (v13). METEOR’s final score is 0.7055 (v1.3, exact, stem, paraphrase). A
 second test set indicated a slightly lower BLEU score of 0.6267 and a
 METEOR score of 0.6748.
 
 
  Here are some basic facts about my system:
 
  Decoding direction: EN-DE
 
  Training corpus: 1.8 mil sentences
 
  Tuning runs: 5
 
  Test sets: a) 2,000 sentences, b) 1,000 sentences (both in-domain)
 
  LM type: trigram
 
  TM type: unfactored
 
 
  I’m now trying to figure out if these scores are realistic at all, as
 different papers indicate by far lower BLEU scores, e.g. Koehn and Hoang
 2011. Any comments regarding the mentioned decoding direction and related
 scores will be much appreciated.
 
 
  Best,
 
  Daniel
 


Re: [Moses-support] Evaluation

2012-04-20 Thread Miles Osborne
no, it works, as I just verified.

On 20 April 2012 11:29, sara hamza sarahamz...@gmail.com wrote:
 Good morning everyone,

 Can anyone please tell me where I can get the mteval-v11b.pl script used in
 evaluation? I found this URL in some documentation:
 ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl but access failed.
 Thank you in advance.

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support






Re: [Moses-support] Incremental training

2012-02-21 Thread Miles Osborne
incremental training for Giza is distinct from incremental training
for the language model.

we have worked on both -- see Abby Levenberg's PhD:

http://homepages.inf.ed.ac.uk/miles/phd-projects/levenberg.pdf

the short answer is yes, but I don't think the incremental LM code
has migrated from Abby's thesis work into the Moses distribution

Miles

On 20 February 2012 20:23, marco turchi marco.tur...@gmail.com wrote:
 Dear all,
 I'm starting to use the incremental training and I was wondering if it
 updates the language model as well. If the answer is no, is it possible to
 update the language model without restarting Moses?

 Thanks a lot
 Marco



Re: [Moses-support] Remote LM protocol?

2012-02-14 Thread Miles Osborne
Oliver is in the process of finishing it.

Miles
On Feb 14, 2012 3:45 PM, Lane Schwartz dowob...@gmail.com wrote:

 Miles,

 Just ran across this email and thought I'd follow up. How is this
 coming along? :)

 Cheers,
 Lane


 On Thu, Nov 17, 2011 at 11:31 AM, Miles Osborne mi...@inf.ed.ac.uk
 wrote:
  what we have is something that is very similar to the Google bloomier
  filter setup --ie a randomised LM, with the actual LM sharded across
  multiple machines.  we have been working on making it faster and have
  some results here.
 
  with any luck we will release this sometime early next year
  Miles
 
  On 17 November 2011 16:25, Christian Federmann cfederm...@dfki.de
 wrote:
  Hi Peter, Hieu, all,
 
  my thesis stuff is rather outdated and likely not working with current
 Moses code.
 
  As Hieu pointed out, the whole thing is problematic as networked
 requests take much
  longer than in-memory n-gram lookups.  In the Dublin MT Marathon, Mark
 Fishel and I
  worked on optimal batching of LMServer requests and got pretty far;
  the combination
  of Miles' RandLM and such a batched, remote LM interface could be a
 nice thing...
 
  Cheers,
Christian
 
 
 
  On Nov 17, 2011, at 2:53 PM, Hieu Hoang wrote:
 
  hi peter
 
  i think christian federmann worked on the remote LM :
 
 
 https://www.google.com/search?hl=enq=federmann+Very+large+language+models+for+machine+translation
  however, IMO, the decoder is lacking the infrastructure to do remote
 LM.
 
  to do it well, the decoder has to batch the LM calls to avoid sending
  too many queries. Also, it has to make the calls asynchronously rather
  than wait for each LM query to complete.
 
  I'm not sure how far christian got but i suspect this is a major
  undertaking
 
  ps. your email to the mailing list went through fine. Why did you think
  it didn't?
 http://news.gmane.org/gmane.comp.nlp.moses.user
 
  On 17/11/2011 14:54, P.J. Berck wrote:
  Hi,
 
  I was looking at the possibility to use a remote LM in moses, but I
 can't find any documentation.
 
  I know about the 6 0 3 host:port specification in moses.ini, but a
 naive test just gives errors like Your data contains <s> in a position
 other than the first word.
 
  Is there some kind of protocol I need to implement? What kind of
 results does moses expect?
 
  Thanks for pointers,
  -peter
 
 


Re: [Moses-support] Remote LM protocol?

2012-02-14 Thread Miles Osborne
integration with the search process needs doing, but the backend and
batching of requests is done.
Miles
On Feb 14, 2012 4:37 PM, Lane Schwartz dowob...@gmail.com wrote:

 Cool. :) I'm definitely looking forward to giving it a try when it is
 released.

 Cheers,
 Lane




[Moses-support] New multi-parallel corpus available (Indic Languages)

2012-01-24 Thread Miles Osborne
The Indic multi-parallel corpus consists of approximately 2000
Wikipedia sentences translated into the following Indic languages:

Bengali
Hindi
Malayalam
Tamil
Telugu
Urdu

The data was translated by non-expert translators hired over
Mechanical Turk and so it is of mixed quality. Every source
segment was translated redundantly by four different Turkers.
Note that we have translated paragraphs, so the data should be of
interest to researchers looking at discourse as well as machine
translation.

http://homepages.inf.ed.ac.uk/miles/babel.html

Miles Osborne (Edinburgh)
Chris Callison-Burch (JHU)




Re: [Moses-support] Filtering LMs

2011-11-24 Thread Miles Osborne
this can be done, but it tends to not save much space.  also it does
not help deal with OOVs, which the language model can still score even
though they are not in the parallel set.

if you are worried about saving space then you should either look at
KenLM or RandLM
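
(for what it's worth, KenLM also ships a filter utility that can restrict
an ARPA file to a given vocabulary or to the phrases in a phrase table; see
its documentation for the exact invocation)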

Miles

On 24 November 2011 12:58, Thomas Schoenemann
thomas_schoenem...@yahoo.de wrote:
 Dear all,
  I hope that this is not too stupid a question, and that it hasn't been
 asked recently.
 In the MOSES EMS, when running experiments the phrase table is automatically
 reduced to only those phrases that actually occur in the respective dev/test
 set. Obviously this saves a lot of memory without changing the resulting
 translations.

 Now, I was wondering if something similar can be done/is done with the
 language model. That is, can one reduce the ARPA-file to only those words
 that occur on the target side in the (filtered) phrase table? The objective
 would of course be to maintain the translation result. Would the LM-software
 renormalize internally if some of the original entries are removed? Then the
 results would differ.
 This may even depend on which language model toolkit you use to load (rather
 than train) the ARPA file. I am using SRILM in my own translation programs, but
 would also be interested in other toolkits in case they behave more
 suitably.

 Can anyone point me to anything?
 Many thanks!
   Thomas Schoenemann (currently University of Pisa)



Re: [Moses-support] Randomisation by MGIZA and tuning result is worse than no tuning

2011-11-22 Thread Miles Osborne

 --in general, Machine Translation training is non-convex.  this means
 that there are multiple solutions and each time you run a full
 training job, you will get different results.  in particular, you will
 see different results when running Giza++ (any flavour) and MERT.


 Is there no way to stop the variation in Giza++? I looked at the code but have
 no idea where it occurs.

no, this is a property of the task, not the method.  to put it another
way, there is nothing which tells the model how words are translated.
Giza++ makes a guess based upon how well it `explains' the training
data (log-likelihood / cross entropy).  there are many ways to achieve
the same log-likelihood and each guess amounts to a different
translation model.  on average these alternative models will all be
similar to each other (words are translated in similar ways), but in
general you will find differences.



 --the best way to deal with this (and most expensive) would be to run
 the full pipeline, from scratch and multiple times.  this will give
 you a feel for variance --differences in results.  in general,
 variance arising from Giza++ is less damaging than variance from MERT.

 How many runs are enough for this? As you say, it would be very expensive to
 do so.

how long is a piece of string?



 --to reduce variance it is best to use as much data as possible at
 each stage.  (100 sentences for tuning is far too low;  you should be
 using at least 1000 sentences).  it is possible to reduce this
 variability by using better machine learning, but in general it will
 always be there.

 What do you mean by better machine learning? Isn't the 500,000 words corpus
 enough? For the 1,000 sentences for tuning, can I use the same sentences as
 used in the training, or should they be a separate set of sentences?

lattice MERT is an example, or the Berkeley Aligner.

you cannot use the same sentences for training and tuning, as has been
explained earlier on the list




 --another strategy I know about is to fix everything once you have a
 set of good weights and never rerun MERT.  should you need to change
 say the language model, you will then manually alter the associated
 weight.  this will mean stability, but at the obvious cost of
 generality.  it is also ugly.

 Could you elaborate a bit on the 'fix everything and never rerun MERT'
 part? Do you mean that after running n times, we find the best combination of
 variables (there are so many of them) and don't run MERT again, which I
 understand is for tuning?

if you have some problem that is fairly stable (uses the same training
set, language models etc) then after running MERT many times and
evaluating it on a disjoint test set, you pick the weights that
produce good results.  afterwards you do not re-run MERT even if you
have changed the model.

as i mentioned, this is ugly and something you do not want to do
unless you are forced to do it

Miles

 Thanks, and sorry to answer with more questions.

 Cheers,

 Jelita






Re: [Moses-support] Various questions about training and tuning

2011-11-18 Thread Miles Osborne
re: not tuning on training data, in principle this shouldn't matter
(especially if the tuning set is large and/or representative of the
task).

in reality, Moses will assign far too much weight to these examples,
to the detriment of the others.  (it will drastically overfit).  this
is why the tuning and training sets are typically disjoint.  this is a
standard tactic in NLP and not just Moses.

re:  assigning more weight to certain translations, you have two
options here.  the first would be to assign more weight to these pairs
when you run Giza++.  (you can assign per-sentence pair weights at
this stage).  this is really just a hint and won't guarantee anything.
 the second option would be to force translations (using the XML
markup).

Miles
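
as an illustration of the XML markup option (a sketch; check the Moses
documentation for the exact attribute and flag syntax), an input sentence
can carry a forced translation inline:

das ist ein <np translation="little house">kleines haus</np>

with the decoder started with the -xml-input flag (e.g. exclusive).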

On 18 November 2011 08:42, Jehan Pages je...@mygengo.com wrote:
 Hi,

 On Fri, Nov 18, 2011 at 2:59 PM, Tom Hoar
 tah...@precisiontranslationtools.com wrote:
 Jehan, here are my strategies, others may vary.

 Thanks.

 1/ the 100-word (token) limit is a dependency of GIZA++ and MGIZA++, not
 just a convenience for speed. If you make the effort to use the
 BerkeleyAligner, this limit disappears.

 Ok I didn't know this alternative to GIZA++. I see there are some
 explanation on the website for switching to this aligner. I may give
 it a try someday then. :-)

 2/ From a statistics and survey methodology point of view, your training
 data is a subset of individual samples selected from a whole population
 (linguistic domain) so-as to estimate the characteristics of the whole
 population. So, duplicates can exist and they play an important role in
 determining statistical significance and calculating probabilities. Some
 data sources, however, repeat information with little relevance to the
 linguistic balance of the whole domain. One example is a web sites with
 repetitive menus on every page. Therefore, for our use, we keep duplicates
 where we believe they represent a balanced sampling and results we want to
 achieve. We remove them when they do not. Not everyone, however, agrees with
 this approach.

 I see. And that confirms my thoughts. I don't know for sure what will
 be my strategy, but I think that will be keeping them all then, most
 probably. Making conditional removal like you do is interesting, but
 that would prove hard to do on our platform as we don't have context
 on translations stored.

 3/ Yes, none of the data pairs in the tuning set should be present in your
 training data. To do so skews the tuning weights to give excellent BLEU
 scores on the tuning results, but horrible scores on real world
 translations.

 I am not sure I understand what you say. How does that happen? Also, why
 would we want to give horrible scores to real-world translations? Isn't
 the point exactly that the tuning data should actually represent
 the real-world translations that we want to get close to?


 4/ Also I was wondering something else that I just remember. So that
 will be a fourth question!
 Suppose in our system, we have some translations we know for sure are
 very good (all are good but some are supposed to be more like
 certified quality). Is there no way in Moses to give some more
 weight to some translations in order to influence the system towards
 quality data (still keeping all data though)?

 Thanks again!

 Jehan

 Tom


 On Fri, 18 Nov 2011 14:31:44 +0900, Jehan Pages je...@mygengo.com wrote:

 Hi all,

 I have a few questions about quality of training and tuning. If anyone
 has any clarifications, that would be nice! :-)

 1/ According to the documentation:
 «
 sentences longer than 100 words (and their corresponding translations)
 have to be eliminated
   (note that a shorter sentence length limit will speed up training
 »
 So is it only for the sake of training speed or can too long sentences
 end up being a liability in MT quality? In other words, when I finally
 need to train for real usage, should I really remove long sentences?

 2/ My data is taken from real crowd-sourced translated data. As a
 consequence, we end up with some duplicates (same original text and
 same translation). I wonder if for training, that either doesn't
 matter, or else we should remove duplicates, or finally that's better
 to have duplicates.

 I would imagine the latter (keep duplicates) is the best as this is
 statistical machine learning and after all, these represent real
 life duplicates (text we often encounter and that we apparently
 usually translate the same way) so that would be good to insist on
 these translations during training.
 Am I right?

 3/ Do training and tuning data necessarily have to be different? I
 guess for it to be meaningful, it should, and various examples on the
 website seem to go in that way, but I could not read anything clearly
 stating this.

 Thanks.

 Jehan


Re: [Moses-support] Remote LM protocol?

2011-11-17 Thread Miles Osborne
we have been working on making distributed LMs efficient.  stay tuned

Miles

On 17 November 2011 13:53, Hieu Hoang hieuho...@gmail.com wrote:
 hi peter

 i think christian federmann worked on the remote LM :

 https://www.google.com/search?hl=enq=federmann+Very+large+language+models+for+machine+translation
 however, IMO, the decoder is lacking the infrastructure to do remote LM.

 to do it well, the decoder has to batch the LM calls to avoid sending
 too many queries. Also, it has to make the calls asynchronously rather
 than wait for the LM query to complete.

 I'm not sure how far christian got but i suspect this is a major
 undertaking

 ps. your email to the mailing list went through fine. Why did you think
 it didn't?
    http://news.gmane.org/gmane.comp.nlp.moses.user

 On 17/11/2011 14:54, P.J. Berck wrote:
 Hi,

 I was looking at the possibility to use a remote LM in moses, but I can't 
 find any documentation.

 I know about the 6 0 3 host:port specification in moses.ini, but a naive
 test just gives errors like Your data contains <s> in a position other than
 the first word.

 Is there some kind of protocol I need to implement? What kind of results 
 does moses expect?

 Thanks for pointers,
 -peter








Re: [Moses-support] Multi-run mert to average non-deterministic results

2011-11-07 Thread Miles Osborne

Question: do you think it's better to run mert-moses.pl more times
with smaller sets, or fewer times with larger sets?


you should run tuning with larger sets, multiple times

no amount of rerunning tuning on a small set will tell you anything

Miles
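
for reference, a sketch of the kind of sample-size calculation Tom
describes below (assuming Cochran's formula with a finite population
correction; the z-score, interval and proportion are made-up inputs):

# n0 = z^2 * p * (1-p) / E^2, corrected for a finite population of size N
sub sample_size {
    my ($N, $z, $E, $p) = @_;   # population, z-score, interval, proportion
    my $x = $z**2 * $p * (1 - $p);
    return $x * $N / (($N - 1) * $E**2 + $x);
}
printf "%.0f\n", sample_size(1_000_000, 2.17, 0.02, 0.5);   # ~97% +/- 2%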

On 7 November 2011 13:45, Tom Hoar tah...@precisiontranslationtools.com wrote:
 A recent list thread recommended running mert several times and averaging
 the various non-deterministic results. If we adopt multiple mert tests, I
 want to optimize the sizes of the tuning/test sets, without taking too many
 segments from the total population.

 Currently, we extract a statistically significant number of randomly selected
 segments (pairs) for one tuning set and one test set. We calculate a sample
 size with a basic population sampling formula that uses the population size,
 user-selected confidence level and confidence interval (e.g. 97% ±2%). We
 always assume an equal probabilistic proportion (50/50), which I understand
 results in the largest sample size.

 Of course, higher confidence levels with tighter intervals result in larger
 tuning/testing sample sizes. Reducing the confidence level, for example to
 90%, with an interval of ±5%, gives significantly smaller random sample
 sets. Smaller random sample sets are less representative of the overall
 population, but mert-moses.pl runs faster, allowing us to evaluate more sets.

 Question: do you think it's better to run mert-moses.pl more times with
 smaller sets, or fewer times with larger sets?



Re: [Moses-support] Multi-threading / Boost lib / compile error for threaded Moses

2011-09-22 Thread Miles Osborne
that doesn't work, as all of the locking code etc would still be invoked.

you really want something like

--threads 0

which should bypass everything and truly run in single threaded mode

Miles

On 22 September 2011 10:26, Kenneth Heafield mo...@kheafield.com wrote:
 -threads 1 ?

 On 09/22/11 10:06, Tom Hoar wrote:

 Re: the survey. I suggest if multi-threading is always enabled, there should
 be a command-line option that allows users to disable multi-threading for
 debugging.

 Tom



 On Thu, 22 Sep 2011 09:56:57 +0100, Kenneth Heafield mo...@kheafield.com
 wrote:

 My fault.  Sorry.  Fixed.

 On 09/22/11 09:41, Hieu Hoang wrote:

 hiya

 There's currently a compile error in trunk when multi-threading is enabled.
 However, I think the root cause of the problem is that there's currently too
 many compile flags so developers can't test the different combinations.
 Specifically, the boost library and multi-threading options.

 I've made a little poll to see if people want to make Boost library a
 prerequisite, and threading always turned on:
    http://www.doodle.com/g7tgw778m9mp7dvw

 The poll also asks if you're willing to chip in and help out whichever way
 you vote.

 Having Boost only as an option makes it difficult to develop in Moses and
 makes it error prone, as we see with the compile error.

 Mandating Boost may mean some people have to install the correct Boost
 version on their machine. There may be Boost questions on this mailing list
 as a result.

 Hieu

 ps. the compile error is

 /bin/sh ../../libtool --tag=CXX   --mode=compile g++ -DHAVE_CONFIG_H -I.
 -I../..  -W -Wall -ffor-scope -D_FILE_OFFSET_BITS=64 -D_LARGE_FILES -pthread
 -DTRACE_ENABLE=1 -DWITH_THREADS -I/home/s0565741/workspace/srilm/include
 -I/home/s0565741/workspace/sourceforge/trunk/kenlm  -g -O2 -MT
 AlignmentInfo.lo -MD -MP -MF .deps/AlignmentInfo.Tpo -c -o AlignmentInfo.lo
 AlignmentInfo.cpp
 libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I../.. -W -Wall -ffor-scope
 -D_FILE_OFFSET_BITS=64 -D_LARGE_FILES -pthread -DTRACE_ENABLE=1
 -DWITH_THREADS -I/home/s0565741/workspace/srilm/include
 -I/home/s0565741/workspace/sourceforge/trunk/kenlm -g -O2 -MT
 AlignmentInfo.lo -MD -MP -MF .deps/AlignmentInfo.Tpo -c AlignmentInfo.cpp -o
 AlignmentInfo.o
 In file included from StaticData.h:41:0,
  from AlignmentInfo.cpp:23:
 FactorCollection.h: In member function 'bool
 Moses::FactorCollection::EqualsFactor::operator()(const Moses::Factor&,
 const Moses::FactorFriend&) const':
 FactorCollection.h:80:19: error: 'const class Moses::Factor' has
 no member named 'in'
 make[3]: *** [AlignmentInfo.lo] Error 1
 make[3]: Leaving directory
 `/disk1/hieu/workspace/sourceforge/trunk/moses/src'
 make[2]: *** [all] Error 2
 make[2]: Leaving directory
 `/disk1/hieu/workspace/sourceforge/trunk/moses/src'
 make[1]: *** [all-recursive] Error 1
 make[1]: Leaving directory `/disk1/hieu/workspace/sourceforge/trunk'
 make: *** [all] Error 2

 ___ Moses-support mailing list
 Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support


 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Multi-threading / Boost lib / compile error for threaded Moses

2011-09-22 Thread Miles Osborne
this has nothing to do with speed, more about actual debugging.

running just a single thread is a special case of running multiple
threads, so all the code to ensure things being safe is the same in
all situations.

should someone want to debug with no threading, then there would need
to be a mess of ifdefs removing all support for threading.  i agree,
this will be a pain to deal with, but this is what debugging with no
threads means.

Miles

On 22 September 2011 10:43, Kenneth Heafield mo...@kheafield.com wrote:
 It works for debugging.

 Perhaps your argument is that the single-threaded version will be slower
 due to unnecessary locking.  My response is that, if you care about
 performance, then you shouldn't be running single-threaded.

 Wrapping every lock in an if statement is arguably worse than wrapping
 them in ifdefs, especially due to the RAII nature of boost locks.  So
 compile-time does a better job at meeting a goal that I don't buy into.

 On 09/22/11 10:31, Miles Osborne wrote:
 that doesn't work, as all of the locking code etc would still be invoked.

 you really want something like

 --threads 0

 which should bypass everything and truly run in single threaded mode

 Miles

 On 22 September 2011 10:26, Kenneth Heafield mo...@kheafield.com wrote:
 -threads 1 ?

 On 09/22/11 10:06, Tom Hoar wrote:

 Re: the survey. I suggest if multi-threading is always enabled, there should
 be a command-line option that allows users to disable multi-threading for
 debugging.

 Tom



 On Thu, 22 Sep 2011 09:56:57 +0100, Kenneth Heafield mo...@kheafield.com
 wrote:

 My fault.  Sorry.  Fixed.

 On 09/22/11 09:41, Hieu Hoang wrote:

 hiya

 There's currently a compile error in trunk when multi-threading is enabled.
 However, I think the root cause of the problem is that there's currently too
 many compile flags so developers can't test the different combinations.
 Specifically, the boost library and multi-threading options.

 I've made a little poll to see if people want to make Boost library a
 prerequisite, and threading always turned on:
    http://www.doodle.com/g7tgw778m9mp7dvw

 The poll also asks if you're willing to chip in and help out whichever way
 you vote.

 Having Boost only as an option makes it difficult to develop in Moses and
 makes it error prone, as we see with the compile error.

 Mandating Boost may mean some people have to install the correct Boost
 version on their machine. There may be Boost questions on this mailing list
 as a result.

 Hieu

 ps. the compile error is

 /bin/sh ../../libtool --tag=CXX   --mode=compile g++ -DHAVE_CONFIG_H -I.
 -I../..  -W -Wall -ffor-scope -D_FILE_OFFSET_BITS=64 -D_LARGE_FILES -pthread
 -DTRACE_ENABLE=1 -DWITH_THREADS -I/home/s0565741/workspace/srilm/include
 -I/home/s0565741/workspace/sourceforge/trunk/kenlm  -g -O2 -MT
 AlignmentInfo.lo -MD -MP -MF .deps/AlignmentInfo.Tpo -c -o AlignmentInfo.lo
 AlignmentInfo.cpp
 libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I../.. -W -Wall -ffor-scope
 -D_FILE_OFFSET_BITS=64 -D_LARGE_FILES -pthread -DTRACE_ENABLE=1
 -DWITH_THREADS -I/home/s0565741/workspace/srilm/include
 -I/home/s0565741/workspace/sourceforge/trunk/kenlm -g -O2 -MT
 AlignmentInfo.lo -MD -MP -MF .deps/AlignmentInfo.Tpo -c AlignmentInfo.cpp -o
 AlignmentInfo.o
 In file included from StaticData.h:41:0,
                  from AlignmentInfo.cpp:23:
 FactorCollection.h: In member function 'bool
 Moses::FactorCollection::EqualsFactor::operator()(const Moses::Factor&,
 const Moses::FactorFriend&) const':
 FactorCollection.h:80:19: error: 'const class Moses::Factor' has
 no member named 'in'
 make[3]: *** [AlignmentInfo.lo] Error 1
 make[3]: Leaving directory
 `/disk1/hieu/workspace/sourceforge/trunk/moses/src'
 make[2]: *** [all] Error 2
 make[2]: Leaving directory
 `/disk1/hieu/workspace/sourceforge/trunk/moses/src'
 make[1]: *** [all-recursive] Error 1
 make[1]: Leaving directory `/disk1/hieu/workspace/sourceforge/trunk'
 make: *** [all] Error 2

 ___ Moses-support mailing list
 Moses-support@mit.edu http://mailman.mit.edu/mailman/listinfo/moses-support

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support


 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support









-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Multi-threading / Boost lib / compile error for threaded Moses

2011-09-22 Thread Miles Osborne
this is the last thing i will post here on this subject:

debugging with a single thread running invokes the threading code.
***if you suspect that this is somehow broken, then you need to debug
without it***.  it is that simple.

running gdb in single thread mode still uses threading.

Miles

On 22 September 2011 11:28, Kenneth Heafield mo...@kheafield.com wrote:
 But I don't see a use case for it.  I can run gdb just fine on a
 multithreaded program that happens to be running one thread.  And the
 stderr output will be in order.

 On 09/22/11 11:21, Miles Osborne wrote:
 should someone want to debug with no threading, then there would need
 to be a mess of ifdefs removing all support for threading.  i agree,
 this will be a pain to deal with, but this is what debugging with no
 threads means.





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Phrase probabilities

2011-09-20 Thread Miles Osborne
exactly,  the only correct way to get real probabilities out would be
to compute the normalising constant and renormalise the dot products
for each phrase pair.

remember that this is best thought of as a set of scores, weighted
such that the relative proportions of each model are balanced
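
for concreteness, a minimal sketch of that renormalisation (assuming you
already have the dot products for every candidate translation of one
source phrase; the max-subtraction is only for numerical stability):

sub renormalise {
    my @dots = @_;                            # w . f for each candidate
    my $max = (sort { $b <=> $a } @dots)[0];  # guard against overflow in exp()
    my @e = map { exp($_ - $max) } @dots;
    my $z = 0;                                # the normalising constant
    $z += $_ for @e;
    return map { $_ / $z } @e;                # probabilities summing to 1
}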

Miles

On 20 September 2011 16:07, Burger, John D. j...@mitre.org wrote:
 Taylor Rose wrote:

 I am looking at pruning phrase tables for the experiment I'm working on.
 I'm not sure if it would be a good idea to include the 'penalty' metric
 when calculating probability. It is my understanding that multiplying 4
 or 5 of the metrics from the phrase table would result in a probability
 of the phrase being correct. Is this a good understanding or am I
 missing something?

 I don't think this is correct.  At runtime all the features from the phrase 
 table and a number of other features, some only available during decoding, 
 are combined in an inner product with a weight vector to score partial 
 translations.  I believe it's fair to say that at no point is there an 
 explicit modeling of a probability of the phrase being correct, at least 
 not in isolation from the partially translated sentence.  This is not to say 
 you couldn't model this yourself, of course.

 - John Burger
  MITRE
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Phrase probabilities

2011-09-20 Thread Miles Osborne
some terminology:  these are feature values, not metrics.

feature values have a number of roles to play eg P(e | f) indicates
the chance that phrase e should be the translation of phrase f.  these
values are designed to be used together, and weighted to produce an
overall score for a translation choice.  this is the basis of a
log-linear model.

if you take them all and multiply them together then I guess that is
equivalent to assuming each is equally weighted and that you have
something like the geometric mean of them (exponentiating a sum of
logs, but without the 1/n divisor).  you may well be able to use the
scores in the way you suggest, but whether you have `good' or `bad'
results will be by chance.
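
as a sketch of what that equal-weight combination amounts to (the feature
values in the example are illustrative):

sub geo_mean {                        # exp( (1/n) * sum_i log f_i )
    my @f = @_;
    my $sum = 0;
    $sum += log($_) for @f;
    return exp($sum / scalar @f);     # drop the 1/n and you get the plain product
}
print geo_mean(0.5, 0.25, 0.1), "\n";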

if you want to prune the phrase table then a starting point is here:

http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc16

Miles

On 20 September 2011 16:47, Taylor Rose tr...@languageintelligence.com wrote:
 So what exactly can I infer from the metrics in the phrase table? I want
 to be able to compare phrases to each other. From my experience,
 multiplying them and sorting by that number has given me more accurate
 phrases... Obviously calling that metric probability is wrong. My
 question is: What is that metric best indicative of?
 --
 Taylor Rose
 Machine Translation Intern
 Language Intelligence
 IRC: Handle: trose
     Server: freenode


 On Tue, 2011-09-20 at 16:14 +0100, Miles Osborne wrote:
 exactly,  the only correct way to get real probabilities out would be
 to compute the normalising constant and renormalise the dot products
 for each phrase pair.

 remember that this is best thought of as a set of scores, weighted
 such that the relative proportions of each model are balanced

 Miles

 On 20 September 2011 16:07, Burger, John D. j...@mitre.org wrote:
  Taylor Rose wrote:
 
  I am looking at pruning phrase tables for the experiment I'm working on.
  I'm not sure if it would be a good idea to include the 'penalty' metric
  when calculating probability. It is my understanding that multiplying 4
  or 5 of the metrics from the phrase table would result in a probability
  of the phrase being correct. Is this a good understanding or am I
  missing something?
 
  I don't think this is correct.  At runtime all the features from the 
  phrase table and a number of other features, some only available during 
  decoding, are combined in an inner product with a weight vector to score 
  partial translations.  I believe it's fair to say that at no point is 
  there an explicit modeling of a probability of the phrase being correct, 
  at least not in isolation from the partially translated sentence.  This is 
  not to say you couldn't model this yourself, of course.
 
  - John Burger
   MITRE
  ___
  Moses-support mailing list
  Moses-support@mit.edu
  http://mailman.mit.edu/mailman/listinfo/moses-support
 
 




 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] build 5 gram with SRILM and moses

2011-09-06 Thread Miles Osborne
yes

On 6 September 2011 17:28, Cyrine NASRI cyrine.na...@gmail.com wrote:

 Hi all,
 Is it possible to use a 5-gram language model built by SRILM with Moses?
 Thanks
 Best

 Cyrine

 --
 *Cyrine
 Ph.D. Student in Computer Science*


 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] KenLM build-binary trie problems (SRILM ARPA file)

2011-08-16 Thread Miles Osborne
for the SRILM, you use the -unk flag;  RandLM does this by default if I
recall
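
for example, an SRILM invocation along these lines (a sketch; the
smoothing options and file names are illustrative):

ngram-count -order 3 -unk -interpolate -kndiscount -text corpus.txt -lm model.arpa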

Miles

On 16 August 2011 06:28, Tom Hoar tah...@precisiontranslationtools.com wrote:

 Ken,

 Does the online moses documentation refer to how to ensure the language
 model has <unk> in the vocabulary? I've never seen it.

 What's the best way to ensure a LM has the <unk> token in the vocabulary?
 Is it as simple as appending one line consisting of one <unk> token to the
 language model corpus? Or, is there a command-line switch for ngram-count,
 build-lm.sh, buildlm? Or, should we just edit the raw text language model
 and add it to the vocabulary manually?

 Thanks,
 Tom



 On Mon, 15 Aug 2011 22:12:36 +0100, Kenneth Heafield mo...@kheafield.com
 wrote:

 Ok I have reproduced the problem.  It only happens when the ARPA file is
 missing and is probably an off-by-one on vocabulary size.  I'll have a fix
 soon.

 Kenneth

 On 08/15/11 19:20, Kenneth Heafield wrote:

 Hi,

 Back from vacation and sorry but I'm having trouble reproducing this
 locally.

 - Latest Moses (revision 4143); I haven't made any changes that should
 impact language modeling since 4096.
 - svn status says the relevant source code is unmodified.
 - Tried an SRI model, including rebuilding with build_binary that ships
 with Moses.
 - Ran threaded and not threaded.

 Can you send me your very small SRILM model?  Does it have <unk>?

 Kenneth

 On 08/04/11 11:42, Kenneth Heafield wrote:

 Sorry I am slow to respond. This is my first thing to look at, but I am
 traveling a lot through the 14th.

 Alex Fraser alexfra...@gmail.com wrote:

 Hi Kenneth --

 Latest revision, 4096. Single threaded also crashes.

 Cheers, Alex


 On Fri, Jul 29, 2011 at 6:00 PM, Kenneth Heafield  mo...@kheafield.com 
 wrote:

  Hi,
 
 There was a problem with this; thought it was fixed but maybe it 
  came
  back.  Which revision are you running?  Does it still happen if you run
  single-threaded?
 
  Kenneth

 
  On 07/29/11 09:39, Alex Fraser wrote:
  Hi Folks,
 
  Tom Hoar previously mentioned that he had a problem with KenLMs built
  from SRILM crashing Moses.
 

  Fabienne Cap and I also have had a problem with this. It seems to be
  restricted to using the trie option with build-binary.
 
  Ken, if you have any problems reproducing this, please let me know. I

  can send you a very small SRILM trained language model that crashes
  moses when converted to binary with the trie option, but works fine as
  a probing binary and using the original ARPA. (BTW, this is running

  the decoder multi-threaded and the crash comes at some point during
  decoding the first sentence, not during loading files)
 
  Cheers, Alex
 
 --

  Moses-support mailing list

  Moses-support@mit.edu
  http://mailman.mit.edu/mailman/listinfo/moses-support

 
 
 --

  Moses-support mailing list
  Moses-support@mit.edu
  http://mailman.mit.edu/mailman/listinfo/moses-support

 

   ___ Moses-support mailing
 list Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support


 ___ Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support



 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Improvements to MERT

2011-08-12 Thread Miles Osborne
good to see the variance reduction.

why not repeat this with more features?  you should see a greater effect
this way.  an easy way to do this is to just add more language models.

Miles

On 11 August 2011 19:53, Philipp Koehn pko...@inf.ed.ac.uk wrote:

 Hi,

 I added a number of improvements to MERT that have been recently
 proposed in the literature, with the aim to support more features and
 greater stability.

 The improvements are:
 (1) Optimization in random directions [Cer et al., 2008]
 (2) Re-use of best weight settings from last n iterations as starting
 points [Foster and Kuhn, 2009]
 (3) Pairwise-Ranked Optimization [Hopkins and May, 2011]

 To give some more details:

 (1) Traditional MERT optimizes each parameter in isolation, finding
 the best gain for any parameter, applying it, and repeating this process
 until convergence. With the switch -number-of-random-directions NUM,
 in addition to these directions of exploring the multi-dimensional
 weight space, a specified number of random directions are also explored.

 (2) In each iteration of running the decoder to produce n-best lists
 and optimizing weights, the first starting point is the last best weight
 setting found. 20 additional starting points are randomly generated.
 With the switch -historic-best, the best found weights of each prior
 iterations are used as starting points in addition to the random starting
 points.

 (3) A recent paper proposed an alternative to MERT that trains a classifier
 to predict which of two candidates in the n-best list is better. Candidates
 are randomly sampled (with a bias towards candidates with large metric
 score differences) and passed to a standards linear model classifier
 (maximum entropy, support vector machines, etc.). The current Moses
 implementation uses MegaM by Hal Daume (check for license terms).
 This alternative to traditional MERT can be used with the switch
 -pairwise-ranked.

 Notes:

 * the indicated switches are either specified when calling mert-moses.pl
  or in the parameter tuning-settings in EMS (see the sketch after these notes).

 * option (3) is incompatible with (1) and (2), but the latter can be used
 together.

 * for -number-of-random-directions I used 50 random directions, which
  slows down MERT quite a bit.

 * option (3) does not converge under the current Moses stopping criteria,
  so it runs for 25 iterations, but you may want to reduce this to 10 with
  the additional switch -max-iterations 10
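
 For example, in EMS the equivalent configuration would be along these
 lines (a sketch):

 [TUNING]
 tuning-settings = "-pairwise-ranked -max-iterations 10"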

 Some results:
 Urdu-English, SAMT Model

  MERT setting                 iterations       tuning set         test set
  baseline                     11.6 (std 4.8)   22.73 (std 0.07)   21.54 (std 0.38)
  50 random directions         9.4 (std 2.3)    22.82 (std 0.14)   *21.58* (std 0.38)
  rand.dir. + historic best    9.2 (std 5.9)    22.79 (std 0.23)   21.40 (std 0.37)
  pairwise-ranked max-iter 10  10               -                  21.33 *(std 0.13)*

 Urdu-English, Hierarchical Model

  MERT setting                 iterations       tuning set         test set
  baseline                     8.8 (std 2.2)    23.91 (std 0.18)   *23.02* (std 0.42)
  50 random directions         8.4 (std 3.3)    23.85 (std 0.35)   22.80 (std 0.70)
  rand.dir. + historic best    12.0 (std 3.5)   24.03 (std 0.23)   22.89 *(std 0.18)*
  pairwise-ranked max-iter 10  10               -                  21.93 (std 0.36)

 German-English, Phrase-based

  MERT setting                 iterations       tuning set         test set
  baseline                     7.2 (std 14.3)   24.82 (std 0.04)   *21.29* (std 0.05)
  rand.dir. + historic best    6.6 (std 1.8)    24.88 (std 0.07)   21.28 (std 0.16)
  pairwise-ranked max-iter 10  10               -                  *21.29 (std 0.02)*

 German-English, Factored Backoff

  MERT setting                 iterations       tuning set         test set
  baseline                     12.0 (std 15.2)  24.89 (std 0.25)   21.35 (std 0.15)
  rand.dir. + historic best    11.4 (std 7.6)   25.01 (std 0.12)   21.45 (std 0.12)
  pairwise-ranked              25               -                  *21.58 (std 0.11)*
  pairwise-ranked max-iter 10  10               -                  21.54 (std 0.10)

 Results are reported over 5 runs of each optimization method, in terms of
 average and standard deviation. What we are looking for is high test set
 scores and low variance.

 The Urdu-English systems use a smaller tuning set of less than 1000
 sentences
  (with 4 references), so I would tend to give it less faith. Test set for
 German-English
 is WMT 2011.

 Your milage may vary, but it is worth a tryout.

 -phi



 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support

 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.




-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Running Giza++ on subsets of data

2011-06-15 Thread Miles Osborne
that isn't the expected answer here.  i think the OP wants some kind of
incremental (re) training.

firstly: it is not really possible to guarantee that performance is not
degraded when running from subsets up to the full set (compared with just
running it on the full set).

secondly,  you may wish to investigate a version of Giza which supports
incremental retraining.  this would allow you to train on a subset and then
add more and more data, without retraining at each point from scratch.   the
current version has minimal documentation, but right now this is hopefully
being fixed.  if you are feeling brave, look here:

http://code.google.com/p/inc-giza-pp/

Miles


On 15 June 2011 18:50, Kenneth Heafield mo...@kheafield.com wrote:

 Try using MGIZA: http://geek.kyloo.net/software/doku.php/mgiza:overview

 On 06/15/11 04:51, Prasanth K wrote:
  Hello All,
 
  I am conducting a series of experiments to build translation systems
  using Moses in which the corpus of the current experiment is a subset of
  the corpora used in the previous experiment. I have started with the
  Europarl corpora and am likely to repeat this process about 20 times.
  Unless I am mistaken, this is going to take me nearly a month and I am
  looking for ways to speed up the whole process.
 
  Is there any optimal way to run Giza++ on these different subsets of
  data without having to run it again and again?
  I do not want to use the alignments obtained from running Giza++ on the
  entire Europarl corpora, for the other experiments (by selecting the
  alignment information from aligned.grow-final-and-diag for the sentences
  in the subsets).
 
  The order of the experiments does not matter, so the experiments can be
  done on the smallest dataset followed by supersets of the previous
  dataset, provided there is a way to modify the translation probabilities
  from Giza++ using just the additional data alone and this does not
  affect the performance of Giza++ in comparison to when Giza++ is run on
  the corpus in stand-alone mode.
 
  Kindly let me know if there is some way to do this and I am missing it.
 
  - regards,
  Prasanth
 
 
  --
  Theories have four stages of acceptance. i) this is worthless nonsense;
  ii) this is an interesting, but perverse, point of view, iii) this is
  true, but quite unimportant; iv) I always said so.
 
--- J.B.S. Haldane
 
 
 
  ___
  Moses-support mailing list
  Moses-support@mit.edu
  http://mailman.mit.edu/mailman/listinfo/moses-support
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Running Giza++ on subsets of data

2011-06-15 Thread Miles Osborne
it is this:

Abby Levenberg, Chris Callison-Burch and Miles Osborne. Stream-based
Translation Models for Statistical Machine
Translationhttp://homepages.inf.ed.ac.uk/miles/papers/naacl10b.pdf.
NAACL, Los Angeles, USA, 2010.

http://homepages.inf.ed.ac.uk/miles/papers/naacl10b.pdf

Miles

On 15 June 2011 19:28, Qin Gao q...@cs.cmu.edu wrote:

 Yes, MGIZA isn't really incrementally training, it only initialize the
 model parameters with that trained previously, since it does not store
 sufficient statistics of the previous training. It will give bad performance
 if

 1. You train only model 1 or
 2. The incremental data or sub set is really small

 It is more suitable for the following scenario:

 You train a model on corpus A, and have new data B, you want to train
 several iterations of model 4 on A+B.

 For the incremental-training GIZA, do you know whether it uses online EM (as
 in Liang and Klein 2009) or just stores the sufficient statistics of previous
 training?
 --Q



 On Wed, Jun 15, 2011 at 11:07 AM, Miles Osborne mi...@inf.ed.ac.uk wrote:

 that isn't the expected answer here.  i think the OP wants some kind of
 incremental (re) training.

 firstly: it is not really possible to guarantee that performance is not
 degraded when running from subsets up to the full set (compared with just
 running it on the full set).

 secondly,  you may wish to investigate a version of Giza which supports
 incremental retraining.  this would allow you to train on a subset and then
 add more and more data, without retraining at each point from scratch.   the
 current version has minimal documentation, but right now this is hopefully
 being fixed.  if you are feeling brave, look here:

 http://code.google.com/p/inc-giza-pp/

 Miles


 On 15 June 2011 18:50, Kenneth Heafield mo...@kheafield.com wrote:

 Try using MGIZA: http://geek.kyloo.net/software/doku.php/mgiza:overview

 On 06/15/11 04:51, Prasanth K wrote:
  Hello All,
 
  I am conducting a series of experiments to build translation systems
  using Moses in which the corpus of the current experiment is a subset
 of
  the corpora used in the previous experiment. I have started with the
  Europarl corpora and am likely to repeat this process about 20 times.
  Unless I am mistaken, this is going to take me nearly a month and I am
  looking for ways to speeden up the whole process.
 
  Is there any optimal way to run Giza++ on these different subsets of
  data without having to run it again and again?
  I do not want to use the alignments obtained from running Giza++ on
 the
  entire Europarl corpora, for the other experiments (by selecting the
  alignment information from aligned.grow-final-and-diag for the
 sentences
  in the subsets).
 
  The order of the experiments does not matter, so the experiments can be
  done on the smallest dataset followed by supersets of the previous
  dataset, provided there is a way to modify the translation
 probabilities
  from Giza++ using just the additional data alone and this does not
  affect the performance of Giza++ in comparison to when Giza++ is run on
  the corpus in stand-alone mode.
 
  Kindly let me know if there is some way to do this and I am missing it.
 
  - regards,
  Prasanth
 
 
  --
  Theories have four stages of acceptance. i) this is worthless
 nonsense;
  ii) this is an interesting, but perverse, point of view, iii) this is
  true, but quite unimportant; iv) I always said so.
 
--- J.B.S. Haldane
 
 
 
  ___
  Moses-support mailing list
  Moses-support@mit.edu
  http://mailman.mit.edu/mailman/listinfo/moses-support
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




 --
 The University of Edinburgh is a charitable body, registered in Scotland,
 with registration number SC005336.

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] How to change phrase representation

2011-06-13 Thread Miles Osborne
the simplest approach would be to use another character to join words
together.  the tokeniser thinks you have hyphenated words, which is
probably what you don't want.
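
a sketch of that preprocessing step, assuming '_' as the joining
character and an illustrative unit list (run it before tokenisation so
the tokeniser never sees a space inside a unit):

my %units = ("the man" => "the_man");
while (my $line = <STDIN>) {
    for my $u (keys %units) {
        $line =~ s/\Q$u\E/$units{$u}/g;   # join each multiword unit
    }
    print $line;
}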

Miles

On 13 June 2011 18:39, Anna c annac...@hotmail.com wrote:
 Hi,
 I've tried what you suggested, but I'm not sure if I'm doing it right...
 I've replaced all the occurrences in the input files as you said, adding a
 '~' between the words (as in the~man), but when I see the file
 training.tok.en or training.tok.es (resulting from the first steps in the
 guide), the words have been separated and it appears as the ~ man. Should
 I change the tokenizer.perl to ignore the '~', or should I skip those steps?
 Or is it correct that way?

 Thank you very much!
 Best regards,
 Anna




 Date: Fri, 10 Jun 2011 10:48:07 +0100
 Subject: Re: [Moses-support] How to change phrase representation
 From: pko...@inf.ed.ac.uk
 To: annac...@hotmail.com
 CC: moses-support@mit.edu

 Hi,

 I am not entirely sure if I fully understand your question,
 but let me try to answer.

 the phrase-based model implementation considers tokens
 separated by a white space as a word. It does also learn
 translation entries for sequences of words (phrases).

 If you want to group words into larger tokens, then you
 have to replace the white spaces.

 For instance, if you want to force the training setup and decoder
 to treat the man as a unit, then you should replace all
 occurrences (in training data and decoder input) with the~man.

 -phi

 On Fri, Jun 10, 2011 at 10:38 AM, Anna c annac...@hotmail.com wrote:
  Hi!
  I'm doing a master's degree and I need some help with one of my
  subjects.
  I've already installed GIZA++ and Moses correctly, and made the step by
  step
  guide of the web, checking that everything was ok. But I'm a newbie in
  this
  and I'm a bit lost. What I have to do is to change the representation so
  the
  basic unit won't be the word, but pairs or triplets of words, and
  compare it
  with the normal representation. How do I do that? Do I have to change
  the
  preparation step in the training?
 
  Thank you very much!
  Best regards,
  Anna
 
  ___
  Moses-support mailing list
  Moses-support@mit.edu
  http://mailman.mit.edu/mailman/listinfo/moses-support
 
 

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] experiment.perl with IRSTLM only (no SRILM installed)

2011-05-27 Thread Miles Osborne
is this after running with SRILM?

if so, then look for the script which creates the LM and delete it.
that should force it to be re-created, using IRSLM

Miles

On 27 May 2011 09:16, Greg Wilson gre...@gmail.com wrote:
 Hi, first let me thank the people who are making Moses available, your
 work is very appreciated!

 I am trying to run experiment.perl on an installation with only IRSTLM
 (SRILM is not installed). This works perfectly fine if I do the
 experiments manually.

 I followed this instruction for configuring the experiment to only use IRSTLM:
     http://www.statmt.org/moses/?n=FactoredTraining.EMS#ntoc13
 The instruction is quite clear: uncomment lm-binarizer and
 lm-quantizer, which I did.
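
 The relevant part of my [LM] section now looks like this (a sketch; the
 IRSTLM install path is from my own setup):

 lm-binarizer = /usr/local/irstlm/bin/compile-lm
 lm-quantizer = /usr/local/irstlm/bin/quantize-lm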

 The problem is that it seems like experiment.perl still tries to use SRILM:

 perl $m/scripts-20110520-1542/ems/experiment.perl -config config.toy
 Use of implicit split to @_ is deprecated at
 /usr/local/bin/scripts-20110520-1542/ems/experiment.perl line 2145.
 STARTING UP AS PROCESS 14381 ON liveserver0 AT Fri May 27 06:45:00 UTC 2011
 LOAD CONFIG...
 find: `/usr/local/srilm/bin/i686/ngram-count*': No such file or directory
 LM:lm-training: file /usr/local/srilm/bin/i686/ngram-count does not exist!
 find: `/usr/local/srilm/bin/i686*': No such file or directory
 GENERAL:srilm-dir: file /usr/local/srilm/bin/i686 does not exist!
 Died at /usr/local/bin/scripts-20110520-1542/ems/experiment.perl line 360.

 Is it possible to do what I want; to configure an
 experiment.perl-experiment to only use IRSTLM, or are there hardwired
 calls to SRILM somewhere in there?

 Thankful for any advice,
 /Greg
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Can't compile latest Moses with irstlm and srilm

2011-05-21 Thread Miles Osborne
It looks like you are using 64 bit versions eg srilm. Make sure everything
is 32 bit

Miles

On 21 May 2011 13:45, Bartosz Grabski bartosz.grab...@gmail.com wrote:

Hello,

I'm using quite fresh Ubuntu 11.04 (on a 32bit machine).
I downloaded and compiled latest srilm and irstlm (not without some
troubles), then downloaded latest Moses from sourceforge. I ran
regenerate-makefiles.sh, and configure (with srilm and irstlm). Then
after make I get following errors. Do you have any suggestions? Thanks
in advance.

make  all-recursive
make[1]: Entering directory `/home/bar/moses'
Making all in kenlm
make[2]: Entering directory `/home/bar/moses/kenlm'
/bin/sh ../libtool --tag=CXX   --mode=link g++  -g -O2
-L/home/bar/lm/srilm/lib/i686 -L/home/bar/lm/srilm/flm/obj/i686
-L/home/bar/lm/irstlm/lib -o build_binary build_binary.o libkenlm.la
-loolm  -loolm -ldstruct -lmisc -lflm -lirstlm -lz
libtool: link: g++ -g -O2 -o build_binary build_binary.o
-L/home/bar/lm/srilm/lib/i686 -L/home/bar/lm/srilm/flm/obj/i686
-L/home/bar/lm/irstlm/lib ./.libs/libkenlm.a -loolm -ldstruct -lmisc
-lflm -lirstlm -lz
build_binary.o: In function `lm::ngram::(anonymous
namespace)::ParseFloat(char const*)':
/home/bar/moses/kenlm/lm/build_binary.cc:44: undefined reference to
`util::ParseNumberException::ParseNumberException(StringPiece)'
build_binary.o: In function `main':
/home/bar/moses/kenlm/lm/build_binary.cc:77: undefined reference to
`lm::ngram::Config::Config()'
/home/bar/moses/kenlm/lm/build_binary.cc:115: undefined reference to
`lm::ngram::detail::GenericModel<lm::ngram::trie::TrieSearch,
lm::ngram::SortedVocabulary>::GenericModel(char const*,
lm::ngram::Config const&)'
build_binary.o: In function `~SortedVocabulary':
/home/bar/moses/kenlm/./lm/vocab.hh:46: undefined reference to
`lm::base::Vocabulary::~Vocabulary()'
build_binary.o: In function `util::scoped_memory::reset()':
/home/bar/moses/kenlm/./util/mmap.hh:64: undefined reference to
`util::scoped_memory::reset(void*, unsigned int,
util::scoped_memory::Alloc)'
/home/bar/moses/kenlm/./util/mmap.hh:64: undefined reference to
`util::scoped_memory::reset(void*, unsigned int,
util::scoped_memory::Alloc)'
build_binary.o: In function `~Backing':
/home/bar/moses/kenlm/./lm/binary_format.hh:42: undefined reference to
`util::scoped_fd::~scoped_fd()'
build_binary.o: In function `~ModelFacade':
/home/bar/moses/kenlm/./lm/facade.hh:45: undefined reference to
`lm::base::Model::~Model()'
build_binary.o: In function `ShowSizes':
/home/bar/moses/kenlm/lm/build_binary.cc:56: undefined reference to
`util::FilePiece::FilePiece(char const*, std::basic_ostream<char,
std::char_traits<char> >*, long long)'
/home/bar/moses/kenlm/lm/build_binary.cc:57: undefined reference to
`lm::ReadARPACounts(util::FilePiece&, std::vector<unsigned long long,
std::allocator<unsigned long long> >&)'
/home/bar/moses/kenlm/lm/build_binary.cc:58: undefined reference to
`lm::ngram::detail::GenericModel<lm::ngram::detail::ProbingHashedSearch,
lm::ngram::ProbingVocabulary>::Size(std::vector<unsigned long long,
std::allocator<unsigned long long> > const&, lm::ngram::Config
const&)'
/home/bar/moses/kenlm/lm/build_binary.cc:66: undefined reference to
`lm::ngram::detail::GenericModel<lm::ngram::trie::TrieSearch,
lm::ngram::SortedVocabulary>::Size(std::vector<unsigned long long,
std::allocator<unsigned long long> > const&, lm::ngram::Config
const&)'
/home/bar/moses/kenlm/lm/build_binary.cc:56: undefined reference to
`util::FilePiece::~FilePiece()'
build_binary.o: In function `main':
/home/bar/moses/kenlm/lm/build_binary.cc:107: undefined reference to
`lm::ngram::detail::GenericModel<lm::ngram::detail::ProbingHashedSearch,
lm::ngram::ProbingVocabulary>::GenericModel(char const*,
lm::ngram::Config const&)'
build_binary.o: In function `~ProbingVocabulary':
/home/bar/moses/kenlm/./lm/vocab.hh:97: undefined reference to
`lm::base::Vocabulary::~Vocabulary()'
build_binary.o: In function `util::scoped_memory::reset()':
/home/bar/moses/kenlm/./util/mmap.hh:64: undefined reference to
`util::scoped_memory::reset(void*, unsigned int,
util::scoped_memory::Alloc)'
/home/bar/moses/kenlm/./util/mmap.hh:64: undefined reference to
`util::scoped_memory::reset(void*, unsigned int,
util::scoped_memory::Alloc)'
build_binary.o: In function `~Backing':
/home/bar/moses/kenlm/./lm/binary_format.hh:42: undefined reference to
`util::scoped_fd::~scoped_fd()'
build_binary.o: In function `~ModelFacade':
/home/bar/moses/kenlm/./lm/facade.hh:45: undefined reference to
`lm::base::Model::~Model()'
build_binary.o: In function `main':
/home/bar/moses/kenlm/lm/build_binary.cc:113: undefined reference to
`lm::ngram::detail::GenericModel<lm::ngram::detail::ProbingHashedSearch,
lm::ngram::ProbingVocabulary>::GenericModel(char const*,
lm::ngram::Config const&)'
build_binary.o: In function `~ProbingVocabulary':
/home/bar/moses/kenlm/./lm/vocab.hh:97: undefined reference to
`lm::base::Vocabulary::~Vocabulary()'
build_binary.o: In function `util::scoped_memory::reset()':

Re: [Moses-support] How much Ram for Europarl?

2011-04-18 Thread Miles Osborne
naturally, the parallel data could be down-sampled (eg use 1/2 of it).
you probably won't see a significant degradation in translation
quality and the whole training process will use less RAM and will be
quicker.
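
a minimal sketch of such down-sampling (file names are illustrative; the
point is to drop the same lines from both sides so the corpus stays
sentence-aligned):

srand(1);                                 # fixed seed, reproducible sample
open my $src,  '<', 'corpus.de' or die $!;
open my $tgt,  '<', 'corpus.en' or die $!;
open my $osrc, '>', 'half.de'   or die $!;
open my $otgt, '>', 'half.en'   or die $!;
while (defined(my $s = <$src>)) {
    my $t = <$tgt>;                       # keep the two files in lock-step
    next if rand() >= 0.5;                # sample roughly one line pair in two
    print $osrc $s;
    print $otgt $t;
}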

Miles

On 18 April 2011 15:05, Tom Hoar tah...@precisiontranslationtools.com wrote:
  Your report of 100% physical usage, growing swap usage and low CPU load
  is normal when working with limited RAM machines. With only 4 Gb Ram and
  the new (larger) EuroParl v6 corpus, you could train for 3 or 4 days
  depending on how you setup your swap partition. Even then, it's possible
  you will run out of RAM before it's finished. Upgrading to 8 Gb ram is a
  move in the right direction.

  Once it's finished training, you'll want to use the binarized
  tables and language model, which MMM's train-1.11 script creates.

  Tom


  On Mon, 18 Apr 2011 14:52:10 +0100, Philipp Koehn pko...@inf.ed.ac.uk
  wrote:
 Hi,

 I am not familiar with the MMM setup, but one of the causes
 of memory use may be the translation table. You should use
 the on-disk translation table.

 -phi

 On Mon, Apr 18, 2011 at 2:47 PM, David Wilkinson
 davidzw...@hotmail.com wrote:
 I have set up an Ubuntu 10.04 system with the moses-for-mere-mortals
 scripts. The default corpus trained in about 6-7 hours on my system
 (Athlon
 x3 3.2Ghz, 4Gb Ram). I am now trying to train the system with the
 Europarl
 German-English parallel corpus (about 45m words in each language),
 again
 using the default moses-for-mere-mortals settings. The system has
 been
 running for 24 hrs and is currently using all the physical memory
 and about
 1.2Gb of swap. None of the cores are being used more than 10%, so
 like this
 it will take a very long time to finish. If I double the ram to 8gb,
 will
 this be sufficient?
 Many Thanks
 David
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support


 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Nondeterminism during decoding: same config, different n-best lists

2011-03-25 Thread Miles Osborne
There is work published on making mert more stable (on the train so can't
easily dig it up)

Miles

sent using Android

On 25 Mar 2011 12:49, Lane Schwartz dowob...@gmail.com wrote:

We know that there is nondeterminism during optimization, yet virtually all
papers report results based on a single MERT run. We know that results can
vary widely based on language pair and data sets, but a large majority of
papers report results on a single language pair, and often for a single data
set.

While these issues are widely known at the informal level, I think that
Suzy's point is well taken. I think there would be value in published
studies showing just how wide the gap due to nondeterminism can be expected
to be. It may be that such studies already exist, and I'm just not aware of
them. Does anyone know of any?

Cheers,
Lane

On Fri, Mar 25, 2011 at 7:03 AM, Barry Haddow bhad...@inf.ed.ac.uk wrote:

 Hi

 This is an is...
-- 
When a place gets crowded enough to require ID's, social collapse is not
far away.  It is time to go elsewhere.  The best thing about space travel
is that it made it possible to go elsewhere.
-- R.A. Heinlein, Time Enough For Love
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Nondeterminism during decoding: same config, different n-best lists

2011-03-25 Thread Miles Osborne
Yes, that is the one

Miles

sent using Android

On 25 Mar 2011 13:08, Barry Haddow bhad...@inf.ed.ac.uk wrote:

This might be what Miles is referring too
http://www.statmt.org/wmt09/pdf/WMT-0939.pdf

There was some progress towards getting this into moses
http://lium3.univ-lemans.fr/mtmarathon2010/ProjectFinalPresentation/MERT/StabilizingMert.pdf


On Friday 25 March 2011 13:02, Miles Osborne wrote:
 There is work published on making mert more s...

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number S...
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] running moses on a cluster with sge

2011-02-02 Thread Miles Osborne
to add to Barry's excellent answer, we are currently working on a
client-server language model.  this will mean that a cluster of
machines can be used, with a shared resource.  it should also work
with multicore

but in the short-term, you are probably better off with multicore

Miles

On 2 February 2011 06:06, Noubours, Sandra
sandra.noubo...@fkie.fraunhofer.de wrote:
 Hello Barry, hello Tom,

 thank you for your answers. I think I have a better idea about different 
 approaches to MOSES efficiency issues now.

 Best regards,
 Sandra

 -----Original Message-----
 From: Barry Haddow [mailto:bhad...@inf.ed.ac.uk]
 Sent: Monday, 31 January 2011 10:52
 To: moses-support@mit.edu
 Cc: Noubours, Sandra; Tom Hoar
 Subject: Re: [Moses-support] running moses on a cluster with sge

 Hi Sandra

 The short answer is that it really depends how big your models are. Running on
 a cluster helps speed up tuning because most of the time in tuning is spent
 decoding, which can be easily parallelised by splitting up the file into
 chunks. So each of the individual machines should be capable of loading your
 models and running a decoder.

 The problem with using a cluster (as opposed to multicore) is that each
 machine has to have its own ram, and if you want to load large models then
 you need a lot of ram. Whereas with multicore, each thread can access the
 same model. Sure, binarising saves a lot on ram usage, but it slows you down
 and puts a lot of load on the filesystem which can cause problems on
 clusters.

 Our group's machines are a mixture of 8 and 16 core Xeon 2.67GHz, with 36-72G
 ram, no sge. We also have access to the university cluster, but since the
 most ram you can get is 16G and sge hold jobs don't work at the moment we
 don't really use it for moses any more,

 hope that helps - regards - Barry

 On Monday 31 January 2011 07:42, Noubours, Sandra wrote:
 Hello,



 thanks for the tips! When talking about using a Sun Grid Engine I was
 referring tuning. Making use of a cluster is supposed to speed up the
 tuning process (see http://www.statmt.org/moses/?n=Moses.FAQ#ntoc10). In
 this context I wondered what hardware exactly is needed for such a cluster.



 Sandra







 From: Tom Hoar [mailto:tah...@precisiontranslationtools.com]
 Sent: Friday, 28 January 2011 09:01
 To: Noubours, Sandra
 Cc: moses-support@mit.edu
 Subject: Re: [Moses-support] running moses on a cluster with sge



 Sandra,

 What kind of capacity do you need to support? I just finished translating
 21,000 pages, over 1/2 million phrases, in 22 hours on an old Intel
 Core2Quad, 2.4 Ghz with 4 GB RAM and a 4-disk RAID-0. Moses was configured
 with binarized phrase/reordering tables and kenlm binarized language model.
 The advances in Moses supporting efficient binarized tables/models are
 great!

 We're planning tests for a 2-socket host with two Intel Xeon 5680 6-core
 3.33 Ghz CPU's, 48 GB RAM and 4 1-TB disks as RAID0. With 12 cores
 (totaling 24 simultaneous threads according to Intel specs), we're
 expecting to boost capacity to well over 15 million phrases per day on one
 host.

 What's the advantage of running Moses on a grid or cluster?

 Tom



 On Fri, 28 Jan 2011 08:40:22 +0100, Noubours, Sandra
 sandra.noubo...@fkie.fraunhofer.de wrote:

       Hello,



       I would like to run Moses on a cluster. I am yet inexperienced in using
 Sun Grid as well as clusters in common. Could you give me any instructions
 or tips for implementing a Linux-Cluster with Sun Grid Engine for running
 Moses?

       a)      What kind of cluster would you recommend, i.e. how many 
 machines,
 how many cpus, what memory, etc.?

       b)      When tuning is performed with the multicore option it does not 
 use
 more than one cpu. Does the tuning step use more than one cpu when run on a
 cluster?

       c)       Can Sun Grid implement a cluster virtually on one computer, so
 that jobs are spread locally to different cpus of one computer?



       Thank you and best regards!



       Sandra

 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.


 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] skip tuning in ems

2011-01-31 Thread Miles Osborne
supply a weights file, eg

weight-config = /home/miles/nist09/run9.moses.ini

add this to the TUNING section.

Miles

On 31 January 2011 21:22, John Morgan johnjosephmor...@gmail.com wrote:
 --
 Regards,
 John J Morgan




 Hello,
 I'd like to run an experiment with the ems without tuning.  Is it
 enough to write IGNORE on the [TUNING] line in the configuration
 file?
 This doesn't seem to be working for me, so I've been changing
 experiment.meta.  Under the decode section I write in: TRAINING:config
 instead of in: TUNING:weight-config.
 What is the right way to do this?
 Thanks,
 John
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] skip tuning in ems

2011-01-31 Thread Miles Osborne
no.  just create a dummy one (with uniform weights) if you want to
skip tuning and don't have the weights handy.
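
for a standard phrase-based setup the weight sections of such a dummy
file might look like this (a sketch; the values are illustrative and the
number of entries must match your feature functions):

[weight-t]
0.2
0.2
0.2
0.2
0.2
[weight-l]
0.5
[weight-d]
0.3
[weight-w]
-1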

Miles

On 31 January 2011 22:31, John Morgan johnjosephmor...@gmail.com wrote:
 Miles, I don't think this does what I need.  I think your example
 assumes that the weight-config file already exist when experiment.perl
 is run.
 I tried setting
 weight-config = $working-dir/model/moses.ini.*
 and
 weight-config = $working-dir/model/moses.ini
 In both cases I get a file does not exist error.  I can skip the
 [RECASING] module, so why can't I skip the [TUNING] module?
 Is there a way to use pass-unless, ignore-unless, or template-if for this?
 Thanks,
 John



 On 1/31/11, Miles Osborne mi...@inf.ed.ac.uk wrote:
 supply a weights file, eg

 weight-config = /home/miles/nist09/run9.moses.ini

 add this to the TUNING section.

 Miles

 On 31 January 2011 21:22, John Morgan johnjosephmor...@gmail.com wrote:
 --
 Regards,
 John J Morgan




 Hello,
 I'd like to run an experiment with the ems without tuning.  Is it
 enough to write IGNORE on the [TUNING] line in the configuration
 file?
 This doesn't seem to be working for me, so I've been changing
 experiment.meta.  Under the decode section I write in: TRAINING:config
 instead of in: TUNING:weight-config.
 What is the right way to do this?
 Thanks,
 John
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.



 --
 Regards,
 John J Morgan




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Train moses incrementally

2011-01-16 Thread Miles Osborne
Not yet

Miles

sent using Android

On 15 Jan 2011 10:00, Sébastien Druon s.dr...@ml-technologies.com wrote:

Thanks!

Do you approximately know in what time frame?

Regards,

Sebastien


On Wed, 2011-01-12 at 09:44 +, Miles Osborne wrote:
 sorry, the code is not publically availab...
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Train moses incrementally

2011-01-12 Thread Miles Osborne
sorry, the code is not publicly available yet.  we will probably
release it in the near future

Miles

On 12 January 2011 09:36, Sébastien Druon s.dr...@ml-technologies.com wrote:
 Thanks for this answer...
 Is there some code available?
 When will it be integrated into Moses?
 Thanks again
 Sebastien

 On 12 Jan 2011 09:21, Miles Osborne mi...@inf.ed.ac.uk wrote:

 yes.  we have done this for both Giza++ and for the language model:

 Stream-based Translation Models for Statistical Machine Translation,
 Abby Levenberg, Chris Callison-Burch and Miles Osborne, NAACL 2010

 Stream-based Randomised Language Models for SMT, Abby Levenberg and
 Miles Osborne, EMNLP 2009

 this isn't integrated into Moses (yet)

 Miles

 On 12 January 2011 08:10, Sébastien Druon s.dr...@ml-technologies.com
 wrote:
 Hello,

 Is it p...

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] SRILM problem

2010-11-26 Thread Miles Osborne
in general you should send SRILM requests to their mailing list and
not to this one.

but i can tell you straight away that the ngram server is behaving
correctly.  it waits for requests ...

Miles

On 26 November 2010 11:28, Korzec, Sanne sanne.kor...@wur.nl wrote:
 Hi,

 I have compiled SRILM on a machine type of: ppc64

 The make world seems to have finished ok. These files are in place:

 libdstruct.a
 libflm.a
 liblattice.a
 libmisc.a
 liboolm.a

 The make test seems to perform great. However it hangs (more than an hour)
 on this line:

 *** Running test ngram-server ***

 I have no idea what might cause this. Can anyone help me solve the problem.
 I have tried to ignore this and compile moses anyway, but that generates an
 error during make moses.

 Thanks in advance.

 Sanne




 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Proposal to replace vertical bar as factor delimeter

2010-11-15 Thread Miles Osborne
i second this.

but can I make another suggestion.  make the default be *non* factored
input.  i reckon that most people using Moses don't actually use
factors (hands-up if you do).
this means, plain input, with absolutely no meta chars in them.

and if you are going to use meta-chars, why not just have a flag such as:

--factorDelimiter=|

etc.

Miles

On 15 November 2010 21:30, Hieu Hoang hieuho...@gmail.com wrote:
 That's a good idea. In the decoder, there's 4 places that has to be
 changed cos it's hardcoded
   ConfusionNet
    GenerationDictionary
   LanguageModelJoint
    Word::createFromString

 However, the train-model.perl is more difficult to change

 Hieu
 Sent from my flying horse

 On 15 Nov 2010, at 09:00 PM, Lane Schwartz dowob...@gmail.com wrote:

 I'd like to propose changing the current factor delimiter to something other 
 than the single vertical bar |

 Looking through the mailing archives, it seems that the failure to properly 
 purge your corpus of vertical bars is a frequent source of headaches for 
 users. I know I've encountered this problem before, but even knowing that I 
 should do this, just today I had to track down another vertical bar-related 
 problem.

 I don't really care what the replacement character(s) ends up being, just so 
 that any corpus munging related to this delimiter gets handled internally by 
 moses rather than being the user's responsibility.

 If moses could easily be modified to take a multi-character delimeter, that 
 would probably be best. My suggestion for a single-character delimiter would 
 be something with the following characteristics:

 * Character should be printable (ie not a control character)
 * Character should be one that's implemented in most commonly used fonts
 * Character should be highly obscure, and extremely unlikely to appear in a 
 corpus
 * Character should not be confusable with any commonly used character.

 Many characters in the Dingbats section of Unicode (block 2700) would fit 
 these desiderata.

 I suggest Unicode character 2759, MEDIUM VERTICAL BAR. This is a highly 
 obscure printable character that looks like a thick vertical bar. It's 
 obviously a vertical bar, but just as obviously not the same thing as the 
 regular vertical bar |.

 Cheers,
 Lane
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] bag of words language model

2010-10-25 Thread Miles Osborne
i implemented this years ago (the idea then was to see if for
free-word-order languages, phrases could be generalised).  at the time
it didn't seem that there was a more efficient way to do it than just
generate permutations and score them.

and if you think about it, this is essentially the reordering problem
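
for the record, a toy perl sketch of the permute-and-score idea --
score_ngram() is a stand-in for a real LM lookup, and note the n!
blow-up, which is fine for phrases but hopeless for whole sentences:

use strict;
use warnings;

sub permutations {
    my @words = @_;
    return ([]) unless @words;
    my @perms;
    for my $i (0 .. $#words) {
        my @rest = @words;
        my ($w) = splice(@rest, $i, 1);
        push @perms, [ $w, @$_ ] for permutations(@rest);
    }
    return @perms;
}

sub score_ngram { return -rand(10) }    # placeholder: plug in your LM here

my ($best_score, $best) = (-9e99, undef);
for my $perm (permutations(qw(police chief of))) {
    my $s = score_ngram(@$perm);
    ($best_score, $best) = ($s, $perm) if $s > $best_score;
}
print "best permutation: @$best ($best_score)\n";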

Miles

On 25 October 2010 12:59, Philipp Koehn pko...@inf.ed.ac.uk wrote:
 Hi,

 I am not familiar with that, but somewhat related is
 Arne Mauser's global lexical model, which also exists
 as a secret feature in Moses (secret because no
 efficient training exists):

 Citation:
 A. Mauser, S. Hasan, and H. Ney. Extending Statistical Machine
 Translation with Discriminative and Trigger-Based Lexicon Models. In
 Conference on Empirical Methods in Natural Language Processing
 (EMNLP), Singapore, August 2009.
 http://www-i6.informatik.rwth-aachen.de/publications/download/628/MauserArneHasanSav%7Bs%7DaNeyHermann--ExtendingStatisticalMachineTranslationwithDiscriminativeTrigger-BasedLexiconModels--2009.pdf

 -phi


 On Fri, Oct 22, 2010 at 7:02 PM, Francis Tyers fty...@prompsit.com wrote:
 Hello all,

 I have a rather strange request. Does anyone know of any papers (or
 impementations) on bag-of-words language models ? That is, a language
 model which does not take into account the order in which the words
 appear in an ngram, so if you have the string 'police chief of' in your
 model, you will get a result for both 'chief of police' and 'police
 chief of'. I have thought of using IRSTLM or some generic model and
 scoring all the permutations, but wondered if there was a more efficient
 implementation already in existence. I have searched without much luck
 in Google, but perhaps I am searching with the wrong words.

 Best regards,

 Fran

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support

 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] train-truecaser.perl proposed tweak

2010-10-25 Thread Miles Osborne
this sounds risky to me.  it would be better to allow the user to
specify the behaviour;  for your suggestions, you would add an extra
flag which would enable this.  the default would be for truecasing to
operate as it used to.
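
something along these lines, say (the flag name is invented for
illustration; with the flag off you get exactly the old behaviour):

use strict;
use warnings;
use Getopt::Long;

my $use_singletons = 0;    # off by default: truecasing behaves as before
GetOptions("singleton-evidence" => \$use_singletons);

# for each segment, decide whether token $index of $num_tokens counts
# as casing evidence:
sub counts_as_evidence {
    my ($index, $num_tokens) = @_;
    return 1 if $index > 0;                             # not sentence-initial
    return 1 if $use_singletons && $num_tokens == 1;    # the proposed case
    return 0;
}

# e.g. the sole token of a one-token segment:
print counts_as_evidence(0, 1) ? "counts as evidence\n" : "ignored\n";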

Miles

On 25 October 2010 17:37, Ben Gottesman gottesman@gmail.com wrote:
 Hi,

 Are truecase models still widely in use?

 I have a proposal for a tweak to the train-truecaser.perl script.

 Currently, we don't take the first token of a sentence as evidence for the
 true casing of that type, on the basis that the first word of a sentence is
 always capitalized.  The first token of a segment is always assumed to be
 the first word of a sentence, and thus is never taken as casing evidence.

 However, if a given segment is only one token long, then the segment is
 probably not a sentence, and the token is quite possibly in its natural
 case.  So my proposal is to take the sole token of one-token segments as
 evidence for true casing.

 I attach the code change.

 Any objections?  If not, I'll check it in.

 Ben

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] about Morph tagging

2010-10-20 Thread Miles Osborne
ah, my apologies --I didn't realise you also wanted morphological
information.  in that case, you will need something like Fran's
suggestion

Miles

On 20 October 2010 11:12, Francis Tyers fty...@prompsit.com wrote:
 You could use the morphological analysers from the Apertium project.

 http://wiki.apertium.org/wiki/Using_an_lttoolbox_dictionary
 http://wiki.apertium.org/wiki/Lttoolbox
 http://wiki.apertium.org/wiki/HFST

 Fran

 On Wednesday 20 October 2010 at 17:58 +0800, JiaHongwei wrote:
     Hi,

     I need to train a model with POS tags and morphological
 information for Moses involving languages such as German, Spanish,
 French and Italian.

     By using TreeTagger, I can get POS tags in the format 'form pos
 lemma'.

     But I want it further processed to be like this, such as 'form
 pos lemma morph'.

     So the job is taking 'form pos lemma' as input and producing output
 in the format 'form pos lemma morph'.

     Could you recommend a way or a tool to help me do this job
 automatically or in pipeline?

     Thanks in advance!



     Best Regards

     Henry


 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support



 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] mteval-v11b

2010-10-17 Thread Miles Osborne
note also that NIST changed to IBM BLEU recently which has a different
treatment of multiple references.

(mteval 13 uses IBM BLEU if i recall)

generally the BLEU scores will be a little lower than before, but MERT
performance should be more robust

Miles

On 17 October 2010 09:57, liu chang liuchangj...@gmail.com wrote:
 On Sun, Oct 17, 2010 at 3:41 PM, Somayeh Bakhshaei
 s.bakhsh...@yahoo.com wrote:

 Hello,

 I have some questions about mteval-v11b.pl

 1) I can not use multi-reference with mteval; what is an equivalent tool for
 this aim?
 2) I tried multi-bleu.perl, but the scores reduced! while we expect them to
 increase when adding more reference sets !! How can that be?
 3) I test mteval-v11b.pl and multi-bleu.perl in equivalent situations, they 
 do not always agree ! sometimes mteval and sometimes the other gives better 
 scores. Is there any problem?
 4) and at the end, isn't there any better tool with the property of 
 multi-reference?

 Hi Somayeh,

 BLEU has defined treatment for multiple references from the very
 beginning (see the original Papineni et al 2002 paper for details).
 Any implementation of BLEU that does not support multiple references
 should be considered defective.

 Personally I've always used mteval-v13a from
 http://www.itl.nist.gov/iad/mig/tests/mt/2009/ which has no problem
 dealing with multiple references at all. All you need to do is to
 provide the multiple references as multiple doc sections in your
 reference set:

 <doc docid="document" sysid="r1">
  <seg>...</seg>
  ...
 </doc>
 <doc docid="document" sysid="r2">
 ...

 Disclaimer: The above definitely works for v13a but I'm not
 specifically familiar with v11b.

 Cheers,
 Liu Chang
 National University of Singapore
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] max-phrase-length vs. number of scores

2010-10-06 Thread Miles Osborne
the phrase length refers to the number of words in a phrase and the
number of scores to the number of feature functions, per phrase.

they have nothing to do with each other
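
to make that concrete, a (made-up) phrase table entry with the default
5 scores looks like this:

der Hund ||| the dog ||| 0.6 0.7 0.5 0.8 2.718

the two words per side count towards max-phrase-length; the five
numbers after the final ||| are what -nscores refers to.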

On 6 October 2010 11:31,  supp...@precisiontranslationtools.com wrote:
 I found this message below, which mentions the topic, but leaves my
 question unanswered.

 The train-model.perl script has an option called max-phrase-length.
 Documentation shows its default is 7.

 The processPhraseTable binarizer has an option called -nscores that refers
 to number of scores. The moses binary's fourth numeric option in
 moses.ini's [ttable-file] section is also number of scores. Documentation
 and the message below define a default of 5.

 Are the max-phrase-length and number of scores values the same? If not
 the same, is there a connection and if so, what is it? If there's no
 connection, what criteria should one choose when setting number of scores
 and what is the consequence of changing it from the default of 5?

 Thanks,
 Tom


 On Fri, 25 Jun 2010 18:14:07 +0100, Philipp Koehn pko...@inf.ed.ac.uk
 wrote:
 Hi,

 something has gone awry in your use of the binarizer.

 A typical way to call the binarizer is:

 LC_ALL=C sort phrase-table | ~/bin/processPhraseTable -ttable 0 0 -
 -nscores 5 -out phrase-table 

 -nscores refers to the number of scores in the phrase translation table
 which are by default 5.

 -phi

 On Fri, Jun 25, 2010 at 5:45 PM, Cyrine NASRI cyrine.na...@gmail.com
 wrote:

 Good morning everybody
 I dont understand the meaning of -nscores 5
 When i make the command wich Binaryze  the Phrase Tables, a message
 appears
 to me processing ptree for 5
 Can't read 5

 Thank you very much

 PS : i'm not english so please excuse me for the very bad english which i write
 Cyrine


 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support



 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] giza++ best alignment

2010-10-03 Thread Miles Osborne
clearly changing the configuration will change the alignment results.

i suggest that before mailing the list again, you read this article:

A Systematic Comparison of Various. Statistical Alignment Models.
Franz Josef Och*. Hermann Ney

http://acl.ldc.upenn.edu/J/J03/J03-1002.pdf

Miles

2010/10/3 musa ghurab mossaghu...@hotmail.com:
 Thank Venkataramani,

 But on giza++ website http://fjoch.com/GIZA++.html  they said Alignment
 models depending on word classes.  And on mkcls website
 http://www-i6.informatik.rwth-aachen.de/Colleagues/och/software/mkcls.html
 they said -n number of optimization runs (Default: 1); larger number =
 better results
 I changed this number to -n10 where it was -n2 on train-model.perl, and that
 gave a different alignment file.
 Any explanation?

 Thanks

 
 Date: Sun, 3 Oct 2010 02:23:06 -0400
 Subject: Re: [Moses-support] giza++ best alignment
 From: eknath.i...@gmail.com
 To: mossaghu...@hotmail.com

 That purely depends on your corpus. There is no such thing as the best
 configuration

 2010/10/2 musa ghurab mossaghu...@hotmail.com

 Hi

 Please can someone tell me, what is the best configuration for giza++ to get
 the best alignment file? if time and size are ignored (or not important)


 with best regard
 musa


 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




 --
 Eknath Venkataramani

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Problem with tuning

2010-09-27 Thread Miles Osborne
looking at your output:
[ERROR] Malformed input at
 Expected input to have words composed of 1 factor(s) (form FAC1|FAC2|...)
 but instead received input with 0 factor(s).
sh: line 1:  5114 Aborted


make sure you have no bar (|) characters in the data
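
a quick perl filter for spotting the offending lines (sketch):

while (<>) {
    print "line $.: $_" if /\|/;    # flag any line containing a bar
}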

Miles


On 27 September 2010 14:45, Souhir Gahbiche s.gahbi...@gmail.com wrote:
 Hi all,

 I'm trying to tune my system, but the tuning stops at the first
 iteration. Here is my mert.log file:

 After default: -l mem_free=0.5G -hard
 Using SCRIPTS_ROOTDIR: /vol/mt2/tools/nadi/moses-scripts/scripts-20090923-1833/
 SYNC distortion
 checking weight-count for ttable-file
 checking weight-count for lmodel-file
 checking weight-count for distortion-file
 Executing: mkdir -p /working/tuningdev2009/mert
 Executing: /vol/mt2/tools/nadi/moses-scripts/scripts-20090923-1833//training/filter-model-given-input.pl ./filtered /tmp/souhir/model/mosesdev2009.ini /working/tuningdev2009/ar.project-syndicate.2009-07.v1.dev.bw.mada.tok
 filtering the phrase tables... Fri Aug 27 16:43:10 CEST 2010
 The filtered model was ready in /working/tuningdev2009/mert/filtered, not doing anything.
 run 1 start at Fri Aug 27 16:43:10 CEST 2010
 Parsing --decoder-flags: |-v 0|
 Saving new config to: ./run1.moses.ini
 Saved: ./run1.moses.ini
 Normalizing lambdas: 0 1 1 1 1 1 1 1 1 0.3 0.2 0.3 0.2 0
 DECODER_CFG = -w %.6f -lm %.6f -d %.6f %.6f %.6f %.6f %.6f %.6f %.6f -tm %.6f %.6f %.6f %.6f %.6f
     values = 0 0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.0333 0.0222 0.0333 0.0222 0
 Executing: /vol/mt2/tools/nadi/moses/moses-cmd/src/moses -v 0  -config filtered/moses.ini -inputtype 0 -w 0.00 -lm 0.11 -d 0.11 0.11 0.11 0.11 0.11 0.11 0.11 -tm 0.03 0.02 0.03 0.02 0.00  -n-best-list run1.best100.out 100 -i /working/tuningdev2009/ar.project-syndicate.2009-07.v1.dev.bw.mada.tok  run1.out
 (1) run decoder to produce n-best lists
 params = -v 0
 decoder_config = -w 0.00 -lm 0.11 -d 0.11 0.11 0.11 0.11 0.11 0.11 0.11 -tm 0.03 0.022222 0.03 0.02 0.00
 Loading lexical distortion models...
 have 1 models
 Creating lexical reordering...
 weights: 0.111 0.111 0.111 0.111 0.111 0.111
 Loading table into memory...done.
 Created lexical orientation reordering
 [ERROR] Malformed input at
  Expected input to have words composed of 1 factor(s) (form FAC1|FAC2|...)
  but instead received input with 0 factor(s).
 sh: line 1:  5114 Aborted
 /vol/mt2/tools/nadi/moses/moses-cmd/src/moses -v 0 -config filtered/moses.ini -inputtype 0 -w 0.00 -lm 0.11 -d 0.11 0.11 0.11 0.11 0.11 0.11 0.11 -tm 0.03 0.02 0.03 0.02 0.00 -n-best-list run1.best100.out 100 -i /working/tuningdev2009/ar.project-syndicate.2009-07.v1.dev.bw.mada.tok  run1.out
 Exit code: 134
 The decoder died. CONFIG WAS -w 0.00 -lm 0.11 -d 0.11 0.11 0.11 0.11 0.11 0.11 0.11 0.11 -tm 0.03 0.02 0.03 0.02 0.00
 The file run1.out is empty. I tried many times, but every time it
 stopps at the same level.
 I looked for the moses.ini. It works perfectly when I use two phrase tables.

 Here's my moses.ini used :

 #
 ### MOSES CONFIG FILE ###
 #

 # input factors
 [input-factors]
 0

 # mapping steps
 [mapping]
 0 T 0

 # translation tables: source-factors, target-factors, number of scores, file
 [ttable-file]
 0 0 5 /working/model/phrase

 # no generation models, no generation-file section

 # language models: type(srilm/irstlm), factors, order, file
 [lmodel-file]
 0 0 4 /working/lmm/newsLM+news-train08.fr.4gki.arpa.gz

 # limit on how many phrase translations e for each phrase f are loaded
 # 0 = all elements loaded
 [ttable-limit]
 20

 # distortion (reordering) files
 [distortion-file]
 0-0 msd-bidirectional-fe 6 /working/model/reordering-table.gz

 # distortion (reordering) weight
 [weight-d]
 0.3
 0.3
 0.3
 0.3
 0.3
 0.3
 0.3

 # language model weights
 [weight-l]
 0.5000

 # translation model weights
 [weight-t]
 0.2
 0.2
 0.2
 0.2
 0.2

 # no generation models, no weight-generation section

 # word penalty
 [weight-w]
 -1

 [distortion-limit]
 6

 Any ideas?
 Thanks
 SG
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] wrong alignment

2010-09-24 Thread Miles Osborne
it is probably more helpful to give the number of sentences you used
for language model training (and other details, eg ngram order).

but at first glance that looks like a tiny amount of language model
data --i would expect to see something closer to 2GB or so, depending
upon representation

Miles

2010/9/24 musa ghurab mossaghu...@hotmail.com:

 Thank Burger,


 here are some informations:
 Language model:   45MB
 Phrase Table:  26MB
 Reordering Model: 36MB

 but I'm still waiting for tuning to finish



 From: j...@mitre.org
 To: moses-support@mit.edu
 Date: Fri, 24 Sep 2010 13:40:40 -0400
 Subject: Re: [Moses-support] wrong alignment

 musa ghurab wrote:

  I trained a system of Chinese-Arabic language, but many alignments
  are wrong.
  The same is true of the lexical model, where many words are wrongly
  aligned
  Here is an example of lexical model (lex.e2f):

 The point of Moses is not to get good alignments, but to get good
 translation output. The target language model will help the decoder
 to pick good translations, even if the translation probabilities that
 come out of the alignment do not appear to be ideal. A great deal of
 research effort has been wasted (in my opinion) on getting better
 alignments, without actually achieving better translation.

 Have you run the resulting models on a test set? What was the score?
 How big is your language model? More LM data is probably the easiest
 way to make up for what might appear to be poor alignments.

 - John D. Burger
 MITRE

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] qsub and EMS again

2010-09-03 Thread Miles Osborne
yes, not doing the checking during the planning stage seems sensible.
(you could just change the delay at this point to speed things up).

here in Edinburgh we use experiment.perl mainly in a multicore /
single machine setting and that is why support for slow STDERR
creation is not really there yet.  but, there are plans to port this
to Hadoop, which should solve synchronisation problems like this.
this is the next major piece of development I'll be involved with.
(the current one involves more language modelling)
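
the check/sleep loop from my earlier mail (quoted below) would look
roughly like this in perl -- untested, names illustrative:

# wait for STDERR, STDOUT and DONE to appear for a given step
sub wait_for_step {
    my ($base, $max_tries, $delay) = @_;    # $base: path prefix of the step
    for (1 .. $max_tries) {
        return 1 if -e "$base.STDERR" && -e "$base.STDOUT" && -e "$base.DONE";
        sleep $delay;
    }
    return 0;    # still missing after the limit: treat it as crashed
}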

Miles

On 3 September 2010 01:18, Suzy Howlett s...@showlett.id.au wrote:
 Thanks for the responses. I think I will go with the loop. I was a bit
 confused about this at first - it considers the step to have crashed if the
 STDERR file does not exist, but since the STDERR file is the output of the
 script that creates the DONE file, I would have thought that the DONE file
 could not be created without the STDERR file ultimately following. However
 presumably if the STDERR file didn't appear for some reason, that is a
 problem, and so should be considered a crash.

 The unfortunate thing about putting a loop like this in check_if_crashed is
 that it also has to go through this when it's planning what steps to do,
 which could lead to a long delay in planning if a step has actually crashed
 through not creating a STDERR file.

 I think the problem is ultimately with our cluster. I noticed sometimes some
 jobs would be sitting on the queue with status exiting for several minutes
 - so the DONE file had been created but the STDERR file would not appear
 until after the job had been finally removed from the queue. Having given it
 some more thought, I think the issue may be with writing to disk. I'm pretty
 sure that the slave nodes do not have their own hard disks, only the master,
 and I think jobs may have been stalled while they waited for a chance to
 write results to disk - the master node was very very busy at the time. I
 don't know if that accounts for it! I'm not sure how there being no hard
 disks in the slaves interacts with Hieu's point - I don't really understand
 how the setup works.

 Thanks again,
 Suzy

 On 2/09/10 8:26 PM, Miles Osborne wrote:

 a better setup would be to have a loop which did the following:

 --for a given version number and step, check for STDERR, STDOUT and DONE
 --if they are all found, exit
 --otherwise sleep and recheck

 (and put some limit overall to prevent an endless loop)

 Miles

 On 2 September 2010 11:16, Hieu Hoanghieuho...@gmail.com  wrote:

  sounds like a bad case of a network file system. you prob need to
 harass your sysadmin and try a few of these too
    http://fixunix.com/nfs/61890-forcing-nfs-sync.html

 On 02/09/2010 04:09, Suzy Howlett wrote:

 Hi everyone,

 I'm running Moses through its experiment management system across a
 cluster and I'm finding that sometimes jobs will finish successfully but
 the .STDERR and .STDOUT files will be slow in appearing relative to the
 .DONE file, meaning that the EMS concludes that the step crashed. I can
 run the system again and it successfully reuses the results of the step
 (it doesn't have to rerun the step) but this is becoming frustrating as
 I have to restart the system
 frequently. I tried adding a call to sleep() in the check_if_crashed()
 method in experiment.perl but this is not helping in general - I think
 sometimes the delay is as much as a couple of minutes.

 Has anyone else faced this problem, or have a better idea for how to get
 around it?

 Cheers,
 Suzy

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





 --
 Suzy Howlett
 http://www.showlett.id.au/




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] mert-moses.pl working-dir tmp

2010-09-01 Thread Miles Osborne
this is after a crash I presume?

if so, then you should delete the step which creates the first config
file.  this will force it to be recreated, using the current version.

below is a small perl script I use (for an older version of
experiment.perl, but it should work for you too).  this was intended
for experiments which use new language models. it forces tuning and
removes older versions of filtered phrase tables.

#  test out a new LM, making sure experiment.perl uses it

$config = $ARGV[0];
system "rm -fr /disk1/miles/work4/steps/TUNING*";
system "rm /disk1/miles/work4/steps/TRAINING_create-config*";
system "rm -fr /disk1/miles/work4/tuning/tmp.*/filtered/";
system "nohup perl experiment.perl -config $config -exec -no-graph";


Miles



On 1 September 2010 22:17, John Morgan johnjosephmor...@gmail.com wrote:
 Hello,
 I'm running the basic demo for the ems an the experiment is crashing
 at the tuning step.  There's a problem transitioning from the step
 where the moses.ini config file is created to the step where tuning is
 started.  The moses.ini file is created in the model directory, but
 the tuning step looks for it under the tuning directory.  Then
 experiment.perl puts the moses.ini file under tuning/tmp.$VERSION
 which doesn't exist.

 What am I missing?
 Thanks,
 John
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] problem with tokenizer.perl

2010-06-27 Thread Miles Osborne
see here:

http://jeremy.zawodny.com/blog/archives/010546.html

for a discussion of utf8 v UTF8

... now off to see England triumphant against Germany

Miles

On 27 June 2010 13:23, Miles Osborne mi...@inf.ed.ac.uk wrote:
 on the subject of UTF8, i think the Moses tokeniser may be using the
 version that is too strict.

 i've just changed it to this:

 binmode(STDIN, ":encoding(UTF-8)");
 binmode(STDOUT, ":encoding(UTF-8)");



 and later on in the same file,:

 open(PREFIX, "<:encoding(UTF-8)", $prefixfile);


 see if this helps.

 Miles

 On 27 June 2010 13:15, Ingrid Falk ingrid.f...@loria.fr wrote:
 Hi Cyrine,

 I think this is because tokenizer.perl expects utf-8 input (on STDIN).

 This is because of the binmode(STDIN, ':utf8'); line in the tokenizer
 script.

 Your input is maybe not utf-8?

 Ingrid

 On 06/27/2010 01:08 PM, Cyrine NASRI wrote:

 Hello everyone,
 I try to run the script for my two tokenizer.perl development file.
 I'm having a problem when running, but I do not understand why.
 A message appears:

  /home/Bureau/moses/moses/scripts/tokenizer$ ./tokenizer.perl -l fr < /home/Bureau/work/test-fr.fr > /home/Bureau/work/input.tok
 Tokenizer Version 1.0
 Language: fr
 WARNING: No known abbreviations for language 'fr', attempting fall-back
 to English version...
 utf8 \xE9 does not map to Unicode at ./tokenizer.perl line 47, STDIN
 line 1.
 Malformed UTF-8 character (fatal) at ./tokenizer.perl line 67, STDIN
 line 1.

 Thank you very much.

 Sincerely
 Cyrine



 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] moses may 10

2010-05-11 Thread Miles Osborne
On 11 May 2010 17:33, Christian Hardmeier c...@rax.ch wrote:
 For my purposes, even a hard-coded assumption of 1, along with a more
 transparent error message if the model isn't found, would do. Does
 anybody actually decode with in-memory phrase tables in real life?
 (well, I suppose some people do...)

Google and anyone who actually wants to do more than optimise against
a fixed dev/test set

You can't afford to filter the phrase table when dealing with any old
translation request

Miles


 /Christian

 On Tue, 11 May 2010, Barry Haddow wrote:

 Maybe a more transparent error message would help?

 On Tuesday 11 May 2010 17:20:26 Hieu Hoang wrote:
  i thought about making it back-compatible but the code gets messy and
 error prone. There's now 3 more phrase tables - the text SCFG, binary
  SCFG, and the suffix array.
 
  So i thought it better to take the punch now and feel a short, sharp
  pain rather than let it linger.
 
  however, anyone wants to put back the old code to make it back comp,
  they're welcome to, as long as u look after it
 
  On 11/05/2010 17:04, Christian Hardmeier wrote:
   Hi,
  
   The first error that you give is because the format of the moses.ini
   file has changed. You need to add an extra digit at the beginning of the
   line that specifies the ttable-file. Add 0 for a memory-based ttable,
   and 1 for a binarised ttable.
  
   Is there a reason why we can't have backwards compatibility here? I'm a
   bit concerned about moving to the latest decoder version since it will
   require me to update the configuration file of each and every system
   I've ever trained, and then they won't work with the old decoders any
   more. Couldn't the decoder figure out on its own whether it should be 0
   or 1 if the indication is missing, as it used to do?
  
   Cheers,
   Christian
   ___
   Moses-support mailing list
   Moses-support@mit.edu
   http://mailman.mit.edu/mailman/listinfo/moses-support
 
  ___
  Moses-support mailing list
  Moses-support@mit.edu
  http://mailman.mit.edu/mailman/listinfo/moses-support
 

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] A few MOSES questions (Arabic, missing scripts, Moses error)

2010-05-07 Thread Miles Osborne
MADA can create tokens that are bar characters (ie | )

you need to rename them to something like BAR.  Moses treats these as
factor delimiters, hence the message you are seeing

(i've been using MADA+TOKAN for Arabic, using the D2 setting)
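
the renaming itself is a one-line filter, e.g. (BAR is an arbitrary
placeholder -- anything absent from your data will do):

while (my $line = <STDIN>) {
    $line =~ s/\|/BAR/g;    # map literal bars to the placeholder token
    print $line;
}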

Miles

On 7 May 2010 23:26, David Edelstein dedelst...@ucdavis.edu wrote:
 Hello,

 I'm using Moses to do some SMT on Arabic, experimenting with
 diacritized vs. undiacritized Arabic training corpora. (I am using
 MADA+TOKAN to perform automatic diacritization.) So, if anyone happens
 to be specifically interested in Arabic, has some tips on using Moses
 for Arabic (right now I am just trying to get a baseline system
 running, so I haven't even begun exploring which parameters I need to
 tweak from the defaults), or can give me any other insights, I'd be
 very pleased to talk to you about it off-list; please email me.

 Now, I have a specific question and a specific problem, to which I
 have not found a solution by searching the archives.

 1. There are two scripts referenced in scripts/released-files (read by
 the scripts Makefile):
   training/train-factored-phrase-model.perl
   training/filter-and-binarize-model-given-input.pl

 These scripts do not exist in the most recent SVN release so 'make
 release' reports an error since obviously it cannot install them.

 The tutorials alternately reference train-factored-phrase-model.perl
 and train-model.perl; reading the latter, it seems to do factored
 training. Is this just an error (and something that should be updated
 in the online docs and released-files), and I should only be using
 train-model.perl? Or is there a difference between the two scripts?
 And is the same true of
 training/filter-and-binarize-model-given-input.pl vs.
 filter-model-given-input.pl?

 2. I went through the entire tutorial using the French-English
 Europarl data sets, and got it working. Now I'm going through the same
 process with my Arabic-English parallel corpora. I've gotten as far as
 tuning. I've been trying to use train-model.perl, and it gets to this
 part:

 my-moses-dir/moses-cmd/src/moses -v 0 -config
 my-model-dir/moses.ini -inputtype 0 -w 0.00 -lm 0.33 -d
 0.33 -tm 0.10 0.07 0.10 0.07 0.00
 -n-best-list run1.best100.out 100 -i my-arabic-input-file  run1.out

 It generates run1.best100.out and run1.out, but then chokes with this
 error message:

 Translation took 0.060 seconds
 Finished translating
 [ERROR] Malformed input at
  Expected input to have words composed of 1 factor(s) (form FAC1|FAC2|...)
  but instead received input with 2 factor(s).
 Aborted

 So I gather somewhere I have a setting wrong, but I cannot figure out
 where it is. I basically followed the exact same steps with my
 Arabic-English corpora as in the tutorial, just substituting my own
 training data. I'm not trying to do factored training at this time.

 Any advice appreciated. Thanks!
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] different tune set diferent tuned parameters !

2010-05-02 Thread Miles Osborne
there is a large amount of randomness involved with parameter tuning.  each
time you run it (using the same language resources) you might get different
parameters,

also, the parameters are not scaled.  this means that one run might give you
these values:

10 20 30

and the next run might give you these ones:

0.1 0.2 0.3
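
if you want to compare runs you can normalise them yourself, e.g. so
the weights sum to one -- both runs above then come out identical:

my @w = (10, 20, 30);               # or (0.1, 0.2, 0.3): same result
my $sum = 0;
$sum += $_ for @w;
printf "%.4f ", $_ / $sum for @w;   # prints 0.1667 0.3333 0.5000
print "\n";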

Miles

On 2 May 2010 09:34, Somayeh Bakhshaei s.bakhsh...@yahoo.com wrote:


 Hi All,

 A problem:
 Isn't it true that parameter tuning must capture the structure of the
 language, so i should get the same set of tuned parameters with different
 tune sets?
 So why do i get different parameter values when i change the tuning set?


 another awful result:

 I changed my test set, the Bleu result changed from 19 to 3 !
 How can that be, while there is no overlap between any of the test sets and
 the train set?!!
 --
 Best Regards,
 S.Bakhshaei

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] IRSTLM error: converting iARPA to ARPA format

2010-04-21 Thread Miles Osborne
this means you have run out of memory.

you can either:

--get more memory
--use less data
--use a lower-order LM
--use RandLM, which can easily handle this amount of data (i am
currently building LMs using more than 30 billion words with it for
example)

Miles

On 21 April 2010 09:57, Zahurul Islam zai...@gmail.com wrote:
 Hi,
 I am trying to build a language model from a large amount of text (13GB). In the step
 of converting the iARPA format to ARPA format i met the following error:

 /tools/irstlm-5.22.01/bin/compile-lm wiki.it.truecase.ilm.gz --text yes
 wiki.it.lm
 inpfile: wiki.it.truecase.ilm.gz
 dub: 1000
 Reading wiki.it.truecase.ilm.gz...
 iARPA
 loadtxt()
 terminate called after throwing an instance of 'std::bad_alloc'
   what():  std::bad_alloc
 /tools/irstlm-5.22.01/bin/compile-lm: line 9: 20328 Aborted
 $dir/$name $@
 Any help to identify|solve this problem will be appreciated. Thank you very
 much.
 Regards,
 Zahurul
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Moses-support Digest, Vol 41, Issue 36

2010-03-28 Thread Miles Osborne
a quick question.  will this break compatibility with existing training runs?

also, adding new features --even if they are not used-- can impact
upon MERT and may slow things down / make things worse.  have you
verified (using multiple runs) that this new feature doesnt' make
things worse than before?

Miles

On 28 March 2010 19:46, Lane Schwartz dowob...@gmail.com wrote:
 On 28 Mar 2010, at 11:02 AM, moses-support-requ...@mit.edu wrote:

 Hiya Mosers and Mosettes,

 It's been a year since the last release  there's been lots of changes, by 
 lots of people, that we thought you should know about.

 A new release tar ball and zip file are on sourceforge, or svn update as 
 usual
    https://sourceforge.net/projects/mosesdecoder/

 Also, there is likely to be big changes in the next month as we merge the 
 hierarchical/syntax branch into trunk. Please avoid svn up after today, and 
 double check with someone else before committing large chunks of code to the 
 trunk.

 Hieu,

 I've got a handful of changes from last week that I was planning to merge 
 from my new branch back into trunk tomorrow. The changes pretty much involve 
 adding one new feature, and should not affect anyone not using the new 
 feature.

 I'll wait for your go-ahead before I do this merge. If there are plans for 
 lots of updates to trunk tomorrow, I could probably do my merge later today 
 (Sunday) instead, if that would help.

 Lane


 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Dictonary use during training

2010-02-23 Thread Miles Osborne
re: adding dictionary entries, this is certainly a hack.  but the
standard trick is to pretend that the dictionary actually consists of
tiny parallel sentences.  you therefore just append each word-entry as
a new sentence pair.  don't bother with that -d option.

Miles

On 23 February 2010 18:34, maria sol ferrer mariasol.fer...@gmail.com wrote:
 Hi all, I'm wondering if you would know where I can find an english to
 spanish parallel, word to word dictionary to complement my training corpus.

 Also, from what I have searched I understand you can either add the
 dictionary words at the end of the corpus or use the giza option. I would
 like to try both, but for the giza option -d I see that the file format uses
 the word's ids, then where will the real words (from the parallel
 dictionary) go? in the corpus as well? or in a separate file?

 Any other suggestions for using a dictionary are welcome.

 Thank you.

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] skipping incompatible liboolm.a

2010-02-22 Thread Miles Osborne
this is a standard error.  you need to build SRILM using 64-bit
support  (i686-m64)
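
(with recent SRILM releases that is something like running make
MACHINE_TYPE=i686-m64 World at the top of the SRILM tree -- but check
the build notes for your exact version)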

Miles

On 22 February 2010 11:40, Marce van Velden marcevanvelde...@gmail.com wrote:
 Hi,
 I get the folowing error when trying to compile moses on a intel64 pc. What
 could cause the liboolm.a to be incompatible?
 (/usr/bin/ld: skipping incompatible /home/marce/srilm64/lib/i686/liboolm.a
 when searching for -loolm)
 ma...@moses:~/moses/trunk$ sudo make
 make  all-recursive
 make[1]: Entering directory `/home/marce/moses/trunk'
 Making all in moses/src
 make[2]: Entering directory `/home/marce/moses/trunk/moses/src'
 make  all-am
 make[3]: Entering directory `/home/marce/moses/trunk/moses/src'
 make[3]: Nothing to be done for `all-am'.
 make[3]: Leaving directory `/home/marce/moses/trunk/moses/src'
 make[2]: Leaving directory `/home/marce/moses/trunk/moses/src'
 Making all in moses-cmd/src
 make[2]: Entering directory `/home/marce/moses/trunk/moses-cmd/src'
 g++  -g -O2  -L/home/marce/srilm64/lib/i686 -o moses Main.o mbr.o
 IOWrapper.o TranslationAnalysis.o LatticeMBR.o -L../../moses/src -lmoses
 -L/usr/include/boost/lib -lboost_thread-mt -loolm -ldstruct -lmisc -lz
 /usr/bin/ld: skipping incompatible /home/marce/srilm64/lib/i686/liboolm.a
 when searching for -loolm
 /usr/bin/ld: cannot find -loolm
 collect2: ld returned 1 exit status
 make[2]: *** [moses] Error 1
 make[2]: Leaving directory `/home/marce/moses/trunk/moses-cmd/src'
 make[1]: *** [all-recursive] Error 1
 make[1]: Leaving directory `/home/marce/moses/trunk'
 make: *** [all] Error 2
 Thanks,
 Marce
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Build Moses for translating English to Chinese.

2010-02-11 Thread Miles Osborne
How words are tokenised / segmented etc is crucial when using small
amounts of data.  For the vast numbers of people using Moses (people
not training-up on millions of sentence pairs) this is the kind of
thing that needs to be done correctly.

It would be a service to extend the Moses tokeniser to deal with
languages other than just those ones you mentioned before.

Miles

On 11 February 2010 17:51, Christof Pintaske christof.pinta...@sun.com wrote:
 Hi,

 you may want to have a closer look at tokenizer.perl which is used for
 word-breaking. It seems there is some special logic to handle English,
 French, and Italian but nothing much else.

 I'm not sure if you can or plan to reveal your findings here on the list
 but at any rate I'd be very interested to learn how Chinese worked for you.

 best regards
 Christof

 nati g wrote:
 Hello,
  Do we need any special scripts to build moses for translating english
 to chinese.

 thanks in advance.


 

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] moses for haitian relief

2010-01-27 Thread Miles Osborne
it looks to me like you have not correctly compiled  / installed the srilm.

Miles

2010/1/27 christopher taylor christopher.paul.tay...@gmail.com:
 hello everyone!

 i'm currently trying to build an instance of moses to support
 crisiscommons.org's machine translation project (i'm currently the
 PM).

 i really want to give moses a spin *but* i'm having issues building it.

 my build trouble is related to liboolm.a - here's output from my compilation:

 Making all in moses-cmd/src
 make[2]: Entering directory `../mt/moses/moses-cmd/src'
 g++  -g -O2  -L..//mt/srilm/lib/i686 -L..//mt/irstlm//lib/x86_64 -o
 moses  Main.o mbr.o IOWrapper.o TranslationAnalysis.o
 -L../../moses/src -lmoses   -loolm -ldstruct -lmisc -lirstlm -lz
 /usr/bin/ld: skipping incompatible ../mt/srilm/lib/i686/liboolm.a when
 searching for -loolm
 /usr/bin/ld: cannot find -loolm
 collect2: ld returned 1 exit status
 make[2]: *** [moses] Error 1
 make[2]: Leaving directory `..//mt/moses/moses-cmd/src'
 make[1]: *** [all-recursive] Error 1
 make[1]: Leaving directory `..//mt/moses'
 make: *** [all] Error 2

 thanks so much for your help!

 chris taylor
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Moses on the iPhone

2010-01-12 Thread Miles Osborne
you should also look at RandLM, as it will enable you to run a
language model in small space.

that aside, i would look hard at pruning the various tables (eg phrase
tables, reordering, language models) so you keep just the core that you
need.  this will make for faster loading etc.  note also that you
probably shouldn't prune the phrase table for a test set (as is
commonly done).

Miles

2010/1/12 Hieu Hoang hieuho...@gmail.com:
 hi andrew

 some of us have been working on putting moses onto the OLPC
    http://wiki.laptop.org/go/Projects/Automatic_translation_software
 which has roughly the same resources as an iphone. We've got it working for
 reasonable size models

 my advice would be:
    1. The moses-cmd shows you how to interact with the moses library. For
 normal decoding, it's quite simple. To make it even more simple for the gui
 developers, I would create a static library as a replacement for moses-cmd.
 Call the static library functions from your gui, rather than the moses
 functions directly
    2. from what i know of ARM development, there are compiler switches to
 enable fast floating point operations. Make sure these are enabled.
    3. the moses library assumes lots of memory so caches certain objects.
 Look throught this mailing list to see how to turn caching off.
    4. Iphone apps can't run in the background so it would be best to have
 instant loading. This is not the case with any of our models, which can take
 some time to initialize. Speciically the phrase table and language models.
 You may have to write new implementations for them.
    5. There may be little-endian/big-endian issues with the binary phrase tables
  & language models, i.e. you may not be able to create a binary phrase
 table/LM on your desktop and expect it to work on the iphone.

 i think it's definitely doable, but don't expect just to be able to compile & go

 sounds like a fun project, let us know how it goes.

 On 11/01/2010 17:57, Andrew W. Haddad wrote:

 Hello,

 My name is Andrew Haddad. I am a Graduate Research Assistant at Purdue
 University. I have been given the task of getting moses working on the
 iphone. The moses package, which we have successfully installed and have
 running in simulation on the iphone will of course not work due to some
 limitations put forth by Apple.

 I am going to be forced to cross compile the moses static library, used in
 moses-cmd, for the arm and i386 architecture. And then rewrite the
 functionality of moses-cmd to be used in our application. Do you know of
 anyone who has attempted something similar, that might be able to explain
 the process?

 --
 Sláinte
 Andrew W. Haddad

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support


 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] different servers + different time - different result?

2010-01-11 Thread Miles Osborne
yes, you can easily get a 1BP drop between multiple runs.

if you want to do experiments and report BLEU scores then people
really need to do multiple runs and report on averages, along with
variances.  i think from now on i'm going to start penalising papers i
get to review if people don't do something about this

(and i do a lot of reviewing ...)
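
computing the statistics is trivial, e.g. over the three runs quoted
below:

my @bleu = (0.2798, 0.2741, 0.2790);    # the three runs reported below
my $n    = @bleu;
my $mean = 0;
$mean   += $_ / $n for @bleu;
my $var  = 0;
$var    += ($_ - $mean) ** 2 / ($n - 1) for @bleu;
printf "mean %.4f, sample variance %.2e\n", $mean, $var;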

Miles

2010/1/11 李贤华 08lixian...@gmail.com:
 hi,

 Thanks for your quick response.
 But, will this cause a drop of BLEU, like, 0.5 point?
 I think that's too much...

 I have run my baseline experiments three times, and got three different
 results.
 The results for test set are: 0.2798, 0.2741, 0.2790.
 The first is run on server1 previously,
 the second and the third are run recently,
 while the second is run on server2, and the third is run on server1.

 Now I don't know what is my baseline.




 Regards,

 Lee Xianhua

 
 2010-01-11


 
 From: Miles Osborne
 Sent: 2010-01-11  16:12:38
 To: 李贤华
 Cc: moses-support
 Subject: Re: [Moses-support] different servers + different time - different result?
 Giza++ and MERT both can produce different results, even when using
 the same code, corpora etc.  This is because multiple solutions exist
 and each time you run Moses, you find one of these (different) optima.
 Miles
 2010/1/11 李贤华 08lixian...@gmail.com:
 Hi all,

 I ran some experiments with moses like, half a year ago.
 And recently I ran them for a second time.
 When I got the results, I was confused,
 because they're so different from those I got previously.

 The softwares I used was not changed, the same version.
 The corpus is of course the same. I just copied them.
 And I used the same script the run the experiments, just changed some
 directory.
 It seems I ran the same experiments on two different servers at different
 time, and got different results.

 I checked alignment results, aligned.grow-diag-and-final,
 and there're a lot of differences.
 I also checked moses.ini, and the parameters are greatly different.

 So, has anybody ever come into this situation? I'm really confused...



 Regards,

 Lee Xianhua

 
 2010-01-11
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support


 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] The results of your email commands

2009-12-23 Thread Miles Osborne
randlm is already in a binary format, so there is no extra conversion

loading randomised models faster is not something that we have really
looked at.

Miles

2009/12/23 Arda Tezcan arda...@yahoo.com:
 Hi,
 I would really appreciate it if you could help me with the following
 question I have:
 I was wondering if a LM created with RANDLM can be converted into a binary
 format?
 Or is there maybe another way of loading the model faster?

 I know it is possible with IRSTLM and SRILM but I couldn't find anything
 about RANDLM.
 Thank you in advance for your support

 Best regards,
 Arda Tezcan





 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] moses threads compilation problem (with RandLM)

2009-12-17 Thread Miles Osborne
Making RandLM thread-safe is something I've been thinking about.
There are a number of bug fixes which need dealing with too, so
perhaps at some point I'll push out a new release.

Miles

2009/12/17 Alexander Fraser fra...@ims.uni-stuttgart.de:
 Hi Barry and Philipp,

 Philipp is correct, multi-threaded moses is unlikely to work with randlm 
 since
 the latter uses a (presumably) non-thread-safe cache.

 Darn, RandLM would be really useful together with threading because
 our cluster has low memory machines with 8 processors. David or Miles,
 any chance I could convince you to fix this soon?

 As regards the compile error, this is due to a recent change in the mbr code,
 and the fact that we don't have a regression test to pick it up. I should be
 able to fix the compile error fairly quickly, but at the moment I'm not sure
 what to do about the regression test.  Ideally I'd like to get rid of the
 separate main for mosesmt, although we'd still have to have compiler switches
 which would leave us open to these issues.

 If you roll back to 2636 then mosesmt should compile fine.

 It compiles for me without RandLM fine. BTW, it would be cool to stick
 something about -threads in the moses usage when compiled with
 threads.

 You mentioned some weird use of -DWITH_THREADS - do you mean the failure of
 (eg) the test of Ngram.h ? I think this is due to a different problem with
 configure.ac, which would explain why I keep seeing errors like WARNING:
 Ngram.h: accepted by the compiler, rejected by the preprocessor! !


 Ngram.h always fails for me (regardless of whether using threads or
 not), there is something that causes it to try to invoke a null
 command:

 configure:5997: checking Ngram.h presence
 configure:6012:   -I/home/users6/fraser/statmt/srilm-1.5.7/include 
 conftest.cpp
 ./configure: line 6014:
 -I/home/users6/fraser/statmt/srilm-1.5.7/include: No such file or
 directory


 The -DWITH_THREADS thing causes other things to fail (only when
 building with threads), such as getopt.h.


 These failures make no difference, since all of the things that fail
 get something like:

 configure:6537: WARNING: getopt.h: accepted by the compiler, rejected
 by the preprocessor!
 configure:6539: WARNING: getopt.h: proceeding with the compiler's result


 See the config.log I posted on my previous message (or let me know if
 I should send you a copy directly) for more examples.

 Cheers, Alex



 cheers
 Barry

 On Wednesday 16 December 2009 16:28, Alexander Fraser wrote:
 Hi Barry and other folks,

 I'm also having trouble compiling Moses with threads and RandLM, there
 seems to be a bug in MainMT.cpp ?

 Here is what I am doing:

 Get fresh copy of Moses (I did this on Monday night).

 ./regenerate-makefiles.sh
 ./configure --enable-threads
 --with-srilm=/home/users6/fraser/statmt/srilm-1.5.7
 --with-randlm=/home/users6/fraser/statmt/randlm-v0.11
 --with-boost=/home/users6/fraser --with-boost-thread=boost_thread
 make

 (The last argument --with-boost-thread is necessary to stop it from
 picking up the globally installed boost thread library).

 I attach config.log, which makes it through fine (though I think there
 is some weird use of -DWITH_THREADS in there which might be
 interesting).

 I also attach make.log (which only contains the compilation error, I
 typed make twice).

 Let me know if I can provide any more info.

 Thanks a lot for your help!

 Cheers, Alex

 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.



 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Looking for non-CLI tool for aligning parallel text

2009-10-28 Thread Miles Osborne
or even see our own ACL paper from this year, which applies MC
techniques correctly

http://aclweb.org/anthology-new/P/P09/P09-1088.pdf

(a problem with the paper you mentioned is that they only ran the
sampler for 100 rounds --that is barely enough to move from the
initial distribution)

Miles

2009/10/28 Adam Lopez alo...@inf.ed.ac.uk:
 See this paper (which I believe is current state of the art for direct
 alignment of phrases) and references therein:
 http://aclweb.org/anthology-new/D/D08/D08-1033.pdf

 This strand of research goes back at least as far as this paper:
 http://aclweb.org/anthology-new/W/W02/W02-1018.pdf

 On Tue, Oct 27, 2009 at 10:51 PM, Catalin Braescu cata...@braescu.com wrote:
 Then I wonder how can aligning be done automatically for phrases? And
 what's the accuracy of such process?


 Catalin Braescu



 On Wed, Oct 28, 2009 at 12:36 AM, Miles Osborne mi...@inf.ed.ac.uk wrote:
 well, alignment is a task that is really done en masse and not
 sentence-by-sentence.  apart from say teaching, there isn't really a
 need for a GUI to do it.

 (convince me that you are ready to use this to align 8 million
 sentence pairs and i'd be impressed)

 Miles

 2009/10/27 Catalin Braescu cata...@braescu.com:
  Big thanks for the links!

 But I have to say I cannot believe my eyes... most of these programs
 are jar files launched with parameters from the command line... and
 the way they work could be a textbook for user unfriendliness :-(

 How can people stand such primitive and bizarre apps? I am not bashing
 their authors, I am only surprised there weren't any authors of better
 programs...


 Catalin Braescu

 On Tue, Oct 27, 2009 at 9:57 PM, Adam Lopez alo...@inf.ed.ac.uk wrote:
 There are several of these around.  Note that I have not used any of them.

 http://www.cs.utah.edu/~hal/HandAlign/
 http://www.umiacs.umd.edu/~nmadnani/alignment/forclip.htm
 http://www.d.umn.edu/~tpederse/parallel.html
 http://www.let.rug.nl/~tiedeman/Uplug/

 Ulrich Germann also demonstrated such an editor at last year's ACL,
 although it does not seem to be online; perhaps email him.

 Adam


 On Tue, Oct 27, 2009 at 6:25 PM, Catalin Braescu cata...@braescu.com 
 wrote:
 Ok, so what I'm looking for is a non-CLI alignment editor. Any ideas?


 Catalin Braescu
 Omlulu.com


 On Tue, Oct 27, 2009 at 1:41 PM, Catalin Braescu cata...@braescu.com 
 wrote:
 I am asking in advance for your forgiveness if my question is trivial
 (or, rather, the answer).

 I am looking for a non-CLI tool that a not-very-technical person can
 use to align 2 documents in different languages.

 When I'm saying non-CLI I mean anything that has a window and a
 visual way of handling things: anything between a dual pane Notepad,
 a php-backed web form, a Java Applet, whatever. as in, not a command
 line thing - our newly hired PC operators won't be able to handle
 it.

 Any suggestions?



 Catalin Braescu
 Omlulu.com

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support



 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.


 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support


 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support

 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Looking for non-CLI tool for aligning parallel text

2009-10-27 Thread Miles Osborne
phrases are not usually directly aligned.  instead, words are (this
is what Giza++ does for example).  phrases are usually extracted using
heuristics.

the accuracy of word alignment is a function of the number of sentence
pairs and also the actual language pair.  for example, you need a lot
more data to do well at Chinese-English than with Spanish-English.

Miles

2009/10/27 Catalin Braescu cata...@braescu.com:
 Then I wonder how can aligning be done automatically for phrases? And
 what's the accuracy of such process?


 Catalin Braescu



 On Wed, Oct 28, 2009 at 12:36 AM, Miles Osborne mi...@inf.ed.ac.uk wrote:
 well, alignment is a task that is really done en masse and not
 sentence-by-sentence.  apart from say teaching, there isn't really a
 need for a GUI to do it.

 (convince me that you are ready to use this to align 8 million
 sentence pairs and i'd be impressed)

 Miles

 2009/10/27 Catalin Braescu cata...@braescu.com:
  Big thanks for the links!

 But I have to say I cannot believe my eyes... most of these programs
 are jar files launched with parameters from the command line... and
 the way they work could be a textbook for user unfriendliness :-(

 How can people stand such primitive and bizarre apps? I am not bashing
 their authors, I am only surprised there weren't any authors of better
 programs...


 Catalin Braescu

 On Tue, Oct 27, 2009 at 9:57 PM, Adam Lopez alo...@inf.ed.ac.uk wrote:
 There are several of these around.  Note that I have not used any of them.

 http://www.cs.utah.edu/~hal/HandAlign/
 http://www.umiacs.umd.edu/~nmadnani/alignment/forclip.htm
 http://www.d.umn.edu/~tpederse/parallel.html
 http://www.let.rug.nl/~tiedeman/Uplug/

 Ulrich Germann also demonstrated such an editor at last year's ACL,
 although it does not seem to be online; perhaps email him.

 Adam


 On Tue, Oct 27, 2009 at 6:25 PM, Catalin Braescu cata...@braescu.com 
 wrote:
 Ok, so what I'm looking for is a non-CLI alignment editor. Any ideas?


 Catalin Braescu
 Omlulu.com


 On Tue, Oct 27, 2009 at 1:41 PM, Catalin Braescu cata...@braescu.com 
 wrote:
 I am asking in advance for your forgiveness if my question is trivial
 (or, rather, the answer).

 I am looking for a non-CLI tool that a not-very-technical person can
 use to align 2 documents in different languages.

 When I'm saying non-CLI I mean anything that has a window and a
 visual way of handling things: anything between a dual pane Notepad,
 a php-backed web form, a Java Applet, whatever. as in, not a command
 line thing - our newly hired PC operators won't be able to handle
 it.

 Any suggestions?



 Catalin Braescu
 Omlulu.com

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support



 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.


 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] How many and/or which language model(s) to use?

2009-10-22 Thread Miles Osborne
you can't supply language models for both directions:  you need to
supply them for the target and not the source

Miles
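
Concretely, for the script quoted below, that means dropping the Welsh (cy)
model line and keeping only the English one, e.g.

    -lm  0:3:$ROOT_DIR/lm_irst/$FSTEM.en.irstlm.gz:1  \

(the factor:order:file:type fields stay as they are; only the cy entry goes).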

2009/10/22 Ivan Uemlianin i.uemlia...@bangor.ac.uk:
 Dear All

 I am using Moses with irstlm.  The language pair I am developing is
 English and Welsh.  I have built language models, and I am now exploring
 train-factored-phrase-model.perl.

 My question is: which language model should I supply to the perl script,
 or should I supply both (I have built a separate language model for each
 language), and how?

 Below is the script I'm using (I've wrapped the perl command in a shell
 script for readability).  This script runs without errors, but I should
 like to know if I'm supplying the language models correctly:


    #! /bin/bash

    SCRIPTS_ROOTDIR=/path/to/moses_scripts
    ROOT_DIR=/path/to/project
    FSTEM=project_name

    nohup nice \
    $SCRIPTS_ROOTDIR/training/Train-factored-phrase-model.perl  \
    -scripts-root-dir  $SCRIPTS_ROOTDIR    \
    -root-dir  $ROOT_DIR                   \
    -corpus  $ROOT_DIR/corpws/$FSTEM.tok   \
    -f  cy   \
    -e  en   \
    -alignment  grow-diag-final-and    \
    -reordering  msd-bidirectional-fe  \
    -lm  0:3:$ROOT_DIR/lm_irst/$FSTEM.cy.irstlm.gz:1  \
    -lm  0:3:$ROOT_DIR/lm_irst/$FSTEM.en.irstlm.gz:1  \
    1> $FSTEM.fphm.out \
    2> $FSTEM.fphm.err



 Thanks and best wishes

 Ivan


 --
 
 Ivan Uemlianin

 Canolfan Bedwyr
 Safle'r Normal Site
 Prifysgol Bangor University
 BANGOR
 Gwynedd
 LL57 2PZ

 i.uemlia...@bangor.ac.uk
 

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Looking for text corpora

2009-09-06 Thread Miles Osborne
the only other source of lots of parallel data (I know about) is the LDC:

http://www.ldc.upenn.edu/

but this is not free ...

Miles

2009/9/6 Catalin Braescu cata...@braescu.com:
 Thanks, Miles! From your link I got http://www.statmt.org/europarl/

 Any other such goodies?


 Catalin

 --
 Omlulu.com


 On Sun, Sep 6, 2009 at 8:13 PM, Miles Osborne mi...@inf.ed.ac.uk wrote:
 http://www.statmt.org/wmt08/shared-task.html

 2009/9/6 Catalin Braescu cata...@braescu.com:
 Obviously Moses (like any similar tool) is useless without having
 access to a huge amount of translated documents.

 While I am sure such corpora already exist and are available for free,
 I can't find them online, therefore I kindly ask the list colleagues
 for useful hints.
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] EM Model 1 question

2009-07-27 Thread Miles Osborne
the good thing about probabilities is that they should sum to one

(but you can get numerical errors giving you slightly more / less ...)
Miles
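
A minimal sketch of such a check in C, assuming the t[f][e] array and the
size_source/size_target bounds used in the C fragment quoted later in this
thread:

#include <assert.h>
#include <math.h>

/* After the final re-estimation step, t(e|f) should be a proper
   probability distribution over e for every fixed f. */
void check_normalised(double **t, int size_source, int size_target)
{
    for (int f = 0; f < size_source; f++) {
        double sum = 0.0;
        for (int e = 0; e < size_target; e++)
            sum += t[f][e];
        assert(fabs(sum - 1.0) < 1e-6);  /* tolerance for floating-point error */
    }
}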

2009/7/27 James Read j.rea...@sms.ed.ac.uk

 Ok. Thanks. I think I understand this now. I also think I have found
 the bug in the code which was causing the dodgy output.

 So, in conclusion, would you say that a good automated check to see if
 the code is working correctly would be to add up the probabilities at
 the end of the EM iterations and check that probabilities add up to 1
 (or slightly less)?

 James

 Quoting Philipp Koehn pko...@inf.ed.ac.uk:

  Hi,
 
  because the final loop in each iteration is:
 
  // estimate probabilities
  for all foreign words f
for all English words e
  t(e|f) = count}(e|f) / total(f)
 
  As I said, there are two normalizations: one on the
  sentence level, the other on the corpus level.
 
  -phi
 
  On Mon, Jul 27, 2009 at 10:30 PM, James Readj.rea...@sms.ed.ac.uk
 wrote:
  In that case I really don't see how the code is guaranteed to give
 results
  which add up to 1.
 
  Quoting Philipp Koehn pko...@inf.ed.ac.uk:
 
  Hi,
 
  this is LaTex {algorithmic} code.
 
  count($e|f$) += $\frac{t(e|f)}{\text{s-total}(e)}$
 
  means
 
  count(e|f) += t(e|f) / s-total(e)
 
  So, you got that right.
 
  -phi
 
  On Mon, Jul 27, 2009 at 10:18 PM, James Readj.rea...@sms.ed.ac.uk
 wrote:
 
  Hi,
 
  this seems to be pretty much what I implemented. What exactly do you
 mean
  by
  these three lines?:
 
  \STATE count($e|f$) += $\frac{t(e|f)}{\text{s-total}(e)}$
  \STATE total($f$)   += $\frac{t(e|f)}{\text{s-total}(e)}$
  \STATE $t(e|f)$ = $\frac{\text{count}(e|f)}{\text{total}(f)}$
 
  What do you mean by $\frac? The pseudocode I was using shows these
 lines
  as
  a simple division and this is what my code does. i.e
 
  t(e|f) = count(e|f) / total(f)
 
  In C code something like:
 
  for ( f = 0; f < size_source; f++ )
  {
   for ( e = 0; e < size_target; e++ )
   {
t[f][e] = count[f][e] / total[f];
   }
  }
 
 
  Is this the kind of thing you mean?
 
  Thanks
  James
 
  Quoting Philipp Koehn pko...@inf.ed.ac.uk:
 
  Hi,
 
  I think there was a flaw in some versions of the pseudo code.
  The probabilities certainly need to add up to one. There are
  two normalizations going on in the algorithm: one on the sentence
  level (so the probability of all alignments add up to one) and
  one on the word level.
 
  Here the most recent version:
 
  \REQUIRE set of sentence pairs $(\text{\bf e},\text{\bf f})$
  \ENSURE translation prob. $t(e|f)$
  \STATE initialize $t(e|f)$ uniformly
  \WHILE{not converged}
   \STATE \COMMENT{initialize}
   \STATE count($e|f$) = 0 {\bf for all} $e,f$
   \STATE total($f$) = 0 {\bf for all} $f$
   \FORALL{sentence pairs ({\bf e},{\bf f})}
\STATE \COMMENT{compute normalization}
\FORALL{words $e$ in {\bf e}}
  \STATE s-total($e$) = 0
  \FORALL{words $f$ in {\bf f}}
\STATE s-total($e$) += $t(e|f)$
  \ENDFOR
\ENDFOR
\STATE \COMMENT{collect counts}
\FORALL{words $e$ in {\bf e}}
  \FORALL{words $f$ in {\bf f}}
\STATE count($e|f$) += $\frac{t(e|f)}{\text{s-total}(e)}$
\STATE total($f$)   += $\frac{t(e|f)}{\text{s-total}(e)}$
  \ENDFOR
\ENDFOR
   \ENDFOR
   \STATE \COMMENT{estimate probabilities}
   \FORALL{foreign words $f$}
\FORALL{English words $e$}
  \STATE $t(e|f)$ = $\frac{\text{count}(e|f)}{\text{total}(f)}$
\ENDFOR
   \ENDFOR
  \ENDWHILE
 
  -phi
 
 
 
  On Sun, Jul 26, 2009 at 5:24 PM, James Read j.rea...@sms.ed.ac.uk
  wrote:
 
  Hi,
 
  I have implemented the EM Model 1 algorithm as outlined in Koehn's
  lecture notes. I was surprised to find the raw output of the
 algorithm
  gives a translation table that given any particular source word the
  sum of the probabilities of each possible target word is far greater
  than 1.
 
  Is this normal?
 
  Thanks
  James
 
  --
  The University of Edinburgh is a charitable body, registered in
  Scotland, with registration number SC005336.
 
 
  ___
  Moses-support mailing list
  Moses-support@mit.edu
  http://mailman.mit.edu/mailman/listinfo/moses-support
 
 
 
 
 
  --
  The University of Edinburgh is a charitable body, registered in
  Scotland, with registration number SC005336.
 
 
 
 
 
 
 
 
  --
  The University of Edinburgh is a charitable body, registered in
  Scotland, with registration number SC005336.
 
 
 
 
 



 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.



 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu

Re: [Moses-support] How to create Two-way translator and accelerate.

2009-05-05 Thread Miles Osborne
and don't forget to look at RandLM -- this can save you a lot of memory
for your language model (a lot more than IRSTLM)

plug over!

Miles

2009/5/5 Marcin Miłkowski milek...@o2.pl:
 Jan Helak wrote:
  I have one last question. The final version will be built with approx. 50 MB
 of Polish text and 50 MB of English text. My computer has 3114632k total
 memory. Is that enough for SRILM, or will I need to use IRSTLM?

 Heh, 50 MB is not much but I doubt it could fit in your memory. It all
 depends on your data but you should get something like 50 MB gzipped
 giza alignment, with something like 300 MB ungzipped, and the phrase
 table can be several times bigger. For example, for one project I had
 200 MB input files, and got 1.2 GB gzipped text phrase table.

 IRSTLM should save more memory, especially if you quantize and binarize.
 BTW, I find using IRSTLM is much less cumbersome than SRI.

 Regards (and you're welcome)
 Marcin
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] How to create Two-way translator and accelerate.

2009-05-04 Thread Miles Osborne
actually, i think Jan wants a speedup, not a space saving.

your best bet is to reduce the size of the beam:

http://www.statmt.org/moses/?n=Moses.Tutorial#ntoc6

Miles
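
e.g., something like

 moses -f moses.ini -s 50

(here -s is assumed to be the stack-size switch; double-check the exact name
against moses -help for your version)
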
2009/5/4 Francis Tyers fty...@prompsit.com:
 On Mon, 04-05-2009 at 14:54 +0200, Jan Helak wrote:
 Hello everyone :)

 I try to build a two-way translator for the Polish and English languages as a
 project on one of my subjects. By now, I have created a one-way translator
 (Polish-English) as a beta version, but several problems have come up:

 (1) The translator must work in both directions. How to achieve this?

 Make another directory and train two models.

 (2) The translation time for phrases is too long (> 4 min. for one
 sentence). How to accelerate this (decreasing the quality of translation
 is acceptable)?

 You can try filtering the phrase table before translating (see PART V -
 Filtering Test Data), or using a binarised phrase table (see Memory-Map
 LM and Phrase Table).

 http://ufallab2.ms.mff.cuni.cz/~bojar/teaching/NPFL087/export/HEAD/lectures/02-phrase-based-Moses-installation-tutorial.html

 Regards,

 Fran

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] How to create Two-way translator and accelerate.

2009-05-04 Thread Miles Osborne
the original question was about speed of decoding, not potential
quality improvements due to filtering

clearly, if you can identify phrases to prune then you will get a
speed-boost.  but this is not true for the general case and my advice
was for the general case.

Miles

2009/5/4 Marcin Miłkowski milek...@o2.pl:
 Miles Osborne pisze:

 filtering etc might give you a speed-up (eg  a constant one --less
 stuff to load) but if filtering is safe w.r.t. the source data, then
 you shouldn't see much here.

 (pruning the table should make it faster since there will be fewer
 options to consider, but this is not safe)

 Actually, this is contrary to what Johnson et al. say in their paper, and my
 subjective (not measured) experience was definitely in their favor. As long
 as you have really clean data, you don't want to lose any of it, but if
 alignments are lousy, translations ambiguous etc., you want to cut it off,
 and Jan wants to do that (see his post).

 I was even filtering more and got better results by heuristically discarding
 improbable phrases from the phrase table (based on Fran's idea he had about
 discarding improbable alignments). Again, this is subjective, anecdotal,
 etc., but before that I was getting complete garbage.

 Note: my pair was English-Polish and Polish English.

 i guess you might also see fewer page faults and the like with a
 smaller model and that will help matters.

 btw, quantising and binarising language models helps as well

 Marcin

 but in general, the beam size is the most direct way to make it faster.



 Miles

 2009/5/4 Francis Tyers fty...@prompsit.com:

 On Mon, 04-05-2009 at 14:08 +0100, Miles Osborne wrote:

 actually, i think Jan wants a speedup, not a space saving.

 Does filtering the phrase table before translation not decrease the
 total time to make a translation (including the time taken to load the
 phrase table etc.)?  That was my experience, and it appears to be
 something that he hasn't done, but perhaps my set up is unusual...

 Fran

 your best bet is to reduce the size of the beam:

 http://www.statmt.org/moses/?n=Moses.Tutorial#ntoc6

 Miles
 2009/5/4 Francis Tyers fty...@prompsit.com:

 On Mon, 04-05-2009 at 14:54 +0200, Jan Helak wrote:

 Hello everyone :)

 I try to build a two-way translator for the Polish and English languages as a
 project on one of my subjects. By now, I have created a one-way translator
 (Polish-English) as a beta version, but several problems have come up:

 (1) The translator must work in both directions. How to achieve this?

 Make another directory and train two models.

 (2) The translation time for phrases is too long (> 4 min. for one
 sentence). How to accelerate this (decreasing the quality of translation
 is acceptable)?

 You can try filtering the phrase table before translating (see PART V -
 Filtering Test Data), or using a binarised phrase table (see Memory-Map
 LM and Phrase Table).


 http://ufallab2.ms.mff.cuni.cz/~bojar/teaching/NPFL087/export/HEAD/lectures/02-phrase-based-Moses-installation-tutorial.html

 Regards,

 Fran

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support












-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] How to create Two-way translator and accelerate.

2009-05-04 Thread Miles Osborne
filtering etc might give you a speed-up (eg  a constant one --less
stuff to load) but if filtering is safe w.r.t. the source data, then
you shouldn't see much here.

(pruning the table should make it faster since there will be fewer
options to consider, but this is not safe)

i guess you might also see fewer page faults and the like with a
smaller model and that will help matters.

but in general, the beam size is the most direct way to make it faster.

Miles

2009/5/4 Francis Tyers fty...@prompsit.com:
 On Mon, 04-05-2009 at 14:08 +0100, Miles Osborne wrote:
 actually, i think Jan wants a speedup, not a space saving.

 Does filtering the phrase table before translation not decrease the
 total time to make a translation (including the time taken to load the
 phrase table etc.)?  That was my experience, and it appears to be
 something that he hasn't done, but perhaps my set up is unusual...

 Fran

 your best bet is to reduce the size of the beam:

 http://www.statmt.org/moses/?n=Moses.Tutorial#ntoc6

 Miles
 2009/5/4 Francis Tyers fty...@prompsit.com:
  On Mon, 04-05-2009 at 14:54 +0200, Jan Helak wrote:
  Hello everyone :)
 
  I try to build a two-way translator for the Polish and English languages as a
  project on one of my subjects. By now, I have created a one-way translator
  (Polish-English) as a beta version, but several problems have come up:
 
  (1) The translator must work in both directions. How to achieve this?
 
  Make another directory and train two models.
 
  (2) The translation time for phrases is too long (> 4 min. for one
  sentence). How to accelerate this (decreasing the quality of translation
  is acceptable)?
 
  You can try filtering the phrase table before translating (see PART V -
  Filtering Test Data), or using a binarised phrase table (see Memory-Map
  LM and Phrase Table).
 
  http://ufallab2.ms.mff.cuni.cz/~bojar/teaching/NPFL087/export/HEAD/lectures/02-phrase-based-Moses-installation-tutorial.html
 
  Regards,
 
  Fran
 
  ___
  Moses-support mailing list
  Moses-support@mit.edu
  http://mailman.mit.edu/mailman/listinfo/moses-support
 








-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Results quality when using moses with randlm

2009-04-16 Thread Miles Osborne
there are many factors here.  firstly, the randomised LM makes errors
as a function of the false positive rate and the values (quantisation)
level.  roughly, the smaller these parameters are, the smaller your LM
will be, but there may be a performance drop.

secondly, the default count-based smoothing methods are only good when
you use enormous quantities of data --look at the Google LM paper
where they show that Stupid backoff approaches K-N smoothing.

if you really want the best performance from moderate amounts of data
(50 million lines is small:  i have used 1 billion sentences) then you
can get SRILM to produce an ARPA file as normal.  (this is the result
of ngram-count).  Randlm can convert an arpa file into a randomised
format.  what this means is that RandLM will use Kneser-Ney smoothing
and assuming reasonable error rates, your translation performance
should be near identical to when using the SRILM

Miles
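
i.e., roughly (the ngram-count flags follow the usual recipe from this list;
buildlm's arpa input type is an assumption to verify against its README):

 ngram-count -order 5 -interpolate -kndiscount -text corpus.txt -lm model.arpa
 buildlm -struct BloomMap -falsepos 8 -values 8 -output-prefix model \
     -input-type arpa < model.arpa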

2009/4/16 Michael Zuckerman michael90...@gmail.com:
 Hi,

 We used moses with randlm - we took a very big corpus of ~50 million lines
 for the language model and processed it with randlm. Then we compared the
 results with moses run with srilm used on much smaller corpus. Surprisingly,
 srilm gave much better results (better translation quality), although used
 on much smaller corpus. Both lm's ran on 5-grams.
 These results were repeated in different language pairs (german - english,
 russian - english, spanish - english etc.)
 Could you please explain these results ?

 Thanks,
 Michael.

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Error when run moses with lattices format as input

2009-04-16 Thread Miles Osborne
in general, when you compile a C or C++ program, you add the switch

-g

to the options (usually in a Makefile).  this will tell the compiler
to add debugging information to the program so that it works with gdb.

you then do:

gdb moses

and you will see a prompt.  you then run moses within that prompt, but
using the run command instead of moses:

(gdb) run ...

when it crashes, you then type where and you will see the various
functions that were called prior to the crash.

Miles
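
Putting that together, and borrowing the switches from the command quoted
below, a session looks something like:

 gdb moses
 (gdb) run -f config_file.ini -inputtype 2 -input-file input.txt
 (gdb) where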

2009/4/16 Nguyen Manh Hung manhh...@cl.ics.tut.ac.jp:
 Sorry Chris.
 I'm a beginner with moses and C (I usually use perl and java), so I don't
 know how to run moses in gdb (debugger mode?).  I just searched but
 haven't found any guide for it. Could you show me how to do this?

 Thanks very much,
  Manh Hung

 2009-04-16 (木) の 11:57 -0400 に Chris Dyer さんは書きました:
 I was actually hoping for a stack trace.  That is, run moses in gdb,
 and then when it crashes, use the where command to show where the
 crash is.

 Thanks!

 2009/4/16 Nguyen Manh Hung manhh...@cl.ics.tut.ac.jp:
  Dear Chris
  I have included the stack trace as a file.
  Thanks in advance,
   Manh Hung
 
  2009-04-16 (木) の 11:34 -0400 に Chris Dyer さんは書きました:
  Can you send me a stack trace for where the SEGV is happening?  Once
  the phrase table has been binarized, there's no need to have any
  special temporary space.
 
  On Tue, Apr 28, 2009 at 10:46 AM, Nguyen Manh Hung
  manhh...@cl.ics.tut.ac.jp wrote:
   Chris Dyer さんは書きました:
  
   You need to add a -weight-i flag to the command line which specifies
   how much weighting to apply to the arc feature.
  
   e.g.:
  
   moses ... -weight-i 0.5
  
   -Chris
  
   On Thu, Apr 16, 2009 at 9:58 AM, Nguyen Manh Hung
   manhh...@cl.ics.tut.ac.jp wrote:
  
  
   Hi,
  
   I'm using Moese to decode with lattices format as input. Also I make
   lattices file content by hand. When I run moses with follow command
   MOSES_HOME/moses-cmd/src/moses -f config_file.ini -inputtype 2
    -input-file input.txt > out.put
   These error has come
   --
   Creating lexical reordering...
   weights: 0.300 0.300 0.300 0.300 0.300 0.300
   Loading table into memory...done.
   Created lexical orientation reordering
   Start loading
   LanguageModel /home/manhhung/smt/confusion/data/lm/lm_news_jan.srilm :
   [112.000] seconds
   Finished loading LanguageModels : [114.000] seconds
   ERROR:You specified 0 input weights (weight-i), but you specified 1 
   link
   parameters (link-param-count)!
   -
  
   Could you please explain these errors for me
   Thanks,
Manh Hung
  
  
   ___
   Moses-support mailing list
   Moses-support@mit.edu
   http://mailman.mit.edu/mailman/listinfo/moses-support
  
  
  
  
  
  
    @Chris: Ohh, it's running OK now, thank you very much,
    But... another error message has come: Segmentation fault.
    I got this error message when I made the binary version of the
    phrase table.
    It seems that the size of /tmp is small, so I added the -T
    option (--temporary-directory) to resolve that.
    But among the options of the moses command, I can't find any such
    option. What do you think about this error?
   Thanks in advance,
   Manh Hung
  
 
 


 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Fetching older versions of moses

2009-03-12 Thread Miles Osborne
assuming the current version hasn't been fixed to deal with the LM
problem affecting older versions of gcc:

--check-out the code using SVN as usual, ie

  svn co https://svn.sourceforge.net/svnroot/mosesdecoder/trunk mosesdecoder

then look at the SVN logs:

svn log | less

find some version which is okay for you (probably the one prior to
Chris's changes) and then do:

svn up -r VERSION

where VERSION is the SVN revision number.

Miles
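
(As for finding out which revision you currently have: svn info, run inside
the mosesdecoder directory, prints it on its Revision: line.)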

2009/3/12 Michael Zuckerman michael90...@gmail.com:
 Hi,

 How can I find out what moses version is in use now, and how can I fetch
 older versions of moses from the database (what is the command to do this) ?

 Thanks,
  Michael.

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] How is the final LM score obtained?

2009-03-05 Thread Miles Osborne
a couple of points:

--you are asking ngram for perplexity scores, but Moses uses log probs
--Moses will append <s> and </s> pseudo-words to the start and end of
a sentence;  this will change the probabilities

Miles
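
Note: the lm: score (-17.0614) and ngram's logprob (-7.40966) in the message
quoted below differ by almost exactly a factor of ln(10) ~ 2.3026. SRILM's
ngram reports log10 probabilities, whereas Moses reports natural-log scores,
so the base conversion alone appears to account for the gap in this example.
A minimal check in C:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double srilm_log10 = -7.40966;            /* logprob from ngram -ppl */
    /* rescale log10 to the natural-log scale Moses uses */
    printf("%f\n", srilm_log10 * log(10.0));  /* prints -17.061369 */
    return 0;
}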

2009/3/5 Carlos Henriquez carlo...@gps.tsc.upc.es:

 Hi all.


 I'm making some tests extracting the nbest list from moses (-n-best-list 
 option) with all models' weights set to 1 and I don't understand how you 
 get the final LM score. I'm using srilm.

 For instance, my best translation from Chinese to English on sentence 9 was

 9 ||| after three hours .  ||| d: 0 lm: -17.0614 tm: -7.41812 -0.944461 
 -4.79107 -2.87243 w: -4 ||| -37.0874

 but if I run ngram alone with the same output sentence

 echo after three hours . | ngram -order 5 -lm ../marie/lm/train.tok.en.lm 
 -ppl -

 the result is very different

 file -: 1 sentences, 4 words, 0 OOVs
 0 zeroprobs, logprob= -7.40966 ppl= 30.3341 ppl1= 71.1892

 I tried with some other values from my nbest list and I always found a big 
 difference between the two scores.

 If my initial weight is 1, why are the scores so different? I suppose I am 
 misunderstanding something.

 The moses command to obtain the n-best-list was

 moses -f moses.ini -i ../../corpus/dev.zh -d 1 -tm 1 1 1 1 -lm 1 -w 1 
 -n-best-list devout.moses.nbest 10 -include-alignment-in-n-best true >
 devout.moses 2> /dev/null

 (yep, I'm not using the last tm weight) and the moses.ini file does not have 
 any weights.

 --
 Carlos A. Henríquez Q.
 carlo...@gps.tsc.upc.es




 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] word alignment symmetrisation heuristics

2009-03-04 Thread Miles Osborne
one thing to remember is that the link between AER and BLEU is not
obvious;  in my view at least AER-like scores should be treated with
skepticism and the real merit of an alignment approach should be the
corresponding translation performance (BLEU etc).

can you provide associated BLEU scores for those AER numbers?

Miles

2009/3/4 J.Tiedemann j.tiedem...@rug.nl:

 hi,

 I'm just wondering if Och's refined heuristic is also implemented
 in Moses. The grow-diag is not exactly the same as far as I
 understand.

 The reason why I'm asking is because I found out that in all of my
 experiments with europarl data the intersection always produces  the
 best results in terms of AER (for example using the wpt03 data)
 whereas I see better performances reported for refined compared with
 intersection in various papers (also for the wpt03 data). However, I
 cannot believe that the grow-heuristics would perform so much worse
 than the original refined approach.

 My AER scores with standard GIZA settings and moses heuristics  for
 wpt03 data are the following:

 moses.intersect AER = 0.0613
 moses.grow-diag AER = 0.0843
 moses.grow-diag-final-and   AER = 0.0926
 moses.grow-diag-final   AER = 0.1312
 moses.srctotgt  AER = 0.1039
 moses.tgttosrc  AER = 0.1162
 moses.union AER = 0.1444

 does this sound reasonable?


 Jorg
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] word alignment symmetrisation heuristics

2009-03-04 Thread Miles Osborne
yep, that sounds reasonable.  in that case it is good to remember that
those heuristics are all designed for eventual translation and not for
doing well at AER.  i can easily imagine some other set of heuristics
which will do well at word alignment-like tasks and not necessarily
pan-out into good bleu scores etc.

Miles

2009/3/4 J.Tiedemann j.tiedem...@rug.nl:

 it depends on what you want to do. I was interested in the word alignment in
 particular. not necessarily for running MT with moses.

 for SMT I usually just use the default grow-diag-final-and which probably
 gives the best input anyway. this is, I guess, because it's better on
 recall. AER seems to strongly prefer precision.

 jorg


 On Wed, 4 Mar 2009 13:46:36 +
  Miles Osborne mi...@inf.ed.ac.uk wrote:

 one thing to remember is that the link between AER and BLEU is not
 obvious;  in my view at least AER-like scores should be treated with
 skepticism and the real merit of an alignment approach should be the
 corresponding translation performance (BLEU etc).

 can you provide associated BLEU scores for those AER numbers?

 Miles

 2009/3/4 J.Tiedemann j.tiedem...@rug.nl:

 hi,

 I'm just wondering if Och's refined heuristics is also implemented
 in Moses. The grow-diag is not exactly the same as far as I
 understand.

 The reason why I'm asking is because I found out that in all of my
 experiments with europarl data the intersection always produces the
 best results in terms of AER (for example using the wpt03 data)
 whereas I see better performances reported for refined compared with
 intersection in various papers (also for the wpt03 data). However, I
 cannot believe that the grow-heuristics would perform so much worse
 than the original refined approach.

 My AER scores with standard GIZA settings and moses heuristics  for
 wpt03 data are the following:

 moses.intersect AER = 0.0613
 moses.grow-diag AER = 0.0843
 moses.grow-diag-final-and   AER = 0.0926
 moses.grow-diag-final   AER = 0.1312
 moses.srctotgt  AER = 0.1039
 moses.tgttosrc  AER = 0.1162
 moses.union AER = 0.1444

 does this sound reasonable?


 Jorg
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Moses on a mac

2009-03-03 Thread Miles Osborne
there is a related bug with randlm which i'm looking at now.

whilst i'm doing this, can you verify that it is some mac-specific
problem and not say something due to the gcc version you are using?

Miles
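
(g++ --version will tell you which compiler and version you have.)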

2009/3/4 Kemal Oflazer k...@cs.cmu.edu:
 Dear All

 I just installed moses on a large Mac system and wanted to test out an
 earlier setup. Training went just fine but moses dies with

 Start loading LanguageModel
 /Users/oflazer/smt/models/lm/smorph-lm-n5.lm : [0.000] seconds
 pure virtual method called
 terminate called without an active exception
 Abort trap

  this seems to be perhaps related to srilm (does not seem to be
 loading the file) which is properly installed. Is there anything
 special to mac that I need to be careful about?
 Thanks

 Kemal

 -
 Kemal Oflazer
 http://people.sabanciuniv.edu/oflazer/
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Error in running moses with randlm

2009-02-24 Thread Miles Osborne
ok, i'll try to work it out.

can you:

--mail me your moses.ini file
--mail me the commands you ran to create your language model
--tell me exactly how much language model data you used and what it
is;  if it is europarl then that should be ok

Miles

2009/2/24 Michael Zuckerman michael90...@gmail.com:
 Hi,


 I am running moses on a small example containing two german sentences (in
 file in):
 das ist ein kleines haus
 das ist ein kleines haus
 I am using the attached randlm language model model.BloomMap, and the
 attached phrase table and moses.ini files.
 My command line is:
 $ ../../../../mosesdecoder/moses-cmd/src/moses -f moses.ini < in > out
 When loading the language model, moses gives an error:

 Defined parameters (per moses.ini or switch):
 config: moses.ini
 input-factors: 0
 lmodel-file: 5 0 3
 /home/michez/alfabetic/lm/randlm/test/model.BloomMap
 mapping: T 0
 ttable-file: 0 0 1 phrase-table
 ttable-limit: 10
 weight-d: 1
 weight-l: 1
 weight-t: 1
 weight-w: 0
 Added ScoreProducer(0 Distortion) index=0-0
 Added ScoreProducer(1 WordPenalty) index=1-1
 Added ScoreProducer(2 !UnknownWordPenalty) index=2-2
 Loading lexical distortion models...
 have 0 models
 Start loading LanguageModel
 /home/michez/alfabetic/lm/randlm/test/model.BloomMap : [0.000] seconds
 pure virtual method called
 terminate called without an active exception
 Aborted

 Do you have a clue how to handle this error ?

 Thanks,
 Michael.


 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Error in RandLM

2009-02-19 Thread Miles Osborne
ok.  i just did a clean install of RandLM on a 64 bit machine, running
Suse.  here is my test:


./buildlm -struct BloomMap -falsepos 8 -values 8 -output-prefix model < ../src/README


this produces the expected results, so it must be that somehow there
is a difference in either your Unix version or else in tools such as
cat etc.

so, which version of cat do you have?  this is what i have here:

cat --version
cat (GNU coreutils) 6.11
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Torbjorn Granlund and Richard M. Stallman.

Miles


2009/2/19 Miles Osborne mi...@inf.ed.ac.uk:
 what happens when you run this?

 Miles

 2009/2/19 Michael Zuckerman michael90...@gmail.com:
 Hi,

 I am using sort (GNU coreutils) 6.10.
 Here is the full STDERR output, with the command I ran:
 $ ../bin/buildlm -struct BloomMap -falsepos 8 -values 8 -output-prefix model
 -input-type corpus < ../../europarl.lower.token.en 2> errors
 User defined parameter settings:
 falsepos 8
 input-type corpus
 output-prefix model
 struct BloomMap
 values 8
 Default values set in PipelineTool:
 order 3
 tmp-dir /tmp
 input-path ___stdin___
 working-mem 100
 output-dir .
 add-bos-eos 1
 seed 0
 Default values set in Builder:
 falseneg 0
 misassign 1
 count-cut-off 1
 memory 0
 maxcount 35
 output-type randlm
 smoothing-param 0.4
 Derived parameters settings:
 estimator batch
 smoothing StupidBackoff
 Pipeline converting data from corpus to counts
 output path = ./model.tokens
 output path = ./model.tokens.sorted
 output path = ./model.counts.sorted
 cat ./model.tokens | cat | sort --compress-program=cat -T /tmp -S 100M -k 1
 -k 2 -k 3 -k 4  | cat > ./model.tokens.sorted
 cat: invalid option -- d
 Try `cat --help' for more information.
 [the same two-line error repeated 38 times in total]
 rm ./model.tokens
 buildlm: RandLMStats.cpp:312: virtual bool randlm::CountStats::observe(const
 randlm::Word*, randlm::Value, int): Assertion `len > 0' failed.

 Thanks,
 Michael.

 On Thu, Feb 19, 2009 at 4:06 PM, Miles Osborne mi...@inf.ed.ac.uk wrote:

 can you post the full STDERR output, along with the command you ran.

 also, which version of sort are you using?

 sort --version

 Miles


 2009/2/19 Michael Zuckerman michael90...@gmail.com:
  Hi,
 
  We are trying

Re: [Moses-support] Error in RandLM

2009-02-19 Thread Miles Osborne
that might be it.  but i seem to have it working here, using a
non-gzipped version of Europarl.

in any case, Michael:  tell us if it works when the corpus is gzipped

Miles

2009/2/19 Barry Haddow bhad...@inf.ed.ac.uk:
 Hi

 I've seen this error before. The short answer is that you need to use a
 gzipped version of the corpus.

 The reason is that randlm uses gzip to decompress/compress when you have a
 gzipped corpus, which is fine because gzip takes a -d argument for
 decompressing. If presented with a non-gzipped version of the corpus, randlm
 attempts to fake gzip with cat, which fails because cat doesn't accept -d.

 This has come up on the mailing list before, as far as I recall.

 regards
 Barry
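
i.e., simply

 gzip ../../europarl.lower.token.en

and point buildlm at the resulting europarl.lower.token.en.gz instead.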

 On Thursday 19 February 2009 13:53, Michael Zuckerman wrote:
 Hi,

 We are trying to run RandLM on our files. We use the command:
 $ ./buildlm -struct BloomMap -falsepos 8 -values 8 -output-prefix model
 -input-type corpus < ../../europarl.lower.token.en

 And we get the following errors:
 cat: invalid option -- d
 Try `cat --help' for more information.
 rm ./model.tokens
 buildlm: RandLMStats.cpp:312: virtual bool
 randlm::CountStats::observe(const randlm::Word*, randlm::Value, int):
 Assertion `len > 0' failed.
 Aborted

 Are you familiar with these errors ? Do you have an idea about how to solve
 them ?

 Thanks,
  Michael.

 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] RandLM compressor cat bug.

2008-12-14 Thread Miles Osborne
ah, ok.  i think David hit it on the head:  randlm is currently in the
very first release and to my knowledge hasn't been extensively tested
under various setups.

we'll gather together these problems and add them into the next release.

Miles

2008/12/7 Radek Bartoň xbart...@stud.fit.vutbr.cz:
 On Wednesday 03 of December 2008 02:08:05 you wrote:
 Hi Radek,

 Thanks for the patch. What's your system? (I've a feeling that part of
 RandLM is not very portable).

 David


 Gentoo ~amd64 here, cat is part of coreutils-6.12-r2 package.

 Sorry, I accidentally posted this message to David instead of the list first.

 --
 Ing. Radek Bartoň

 Faculty of Information Technology
 Department of Computer Graphics and Multimedia
 Brno University of Technology

 E-mail: xbart...@stud.fit.vutbr.cz
 Web: http://blackhex.no-ip.org
 Jabber: black...@jabber.cz

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] RandLM compressor cat bug.

2008-12-06 Thread Miles Osborne
which version of unix are you using?

Miles

2008/11/28 Radek Bartoň [EMAIL PROTECTED]:
 Hello.

 Since there is no RandLM mailing list (at least I haven't found one) I'm
 posting here. When creating a language model with the cat compressor, buildlm
 fails (on my system) with this error:

 cat: invalid option -- 'd'

 Attached patch should fix that.

 --
 Ing. Radek Bartoň

 Faculty of Information Technology
 Department of Computer Graphics and Multimedia
 Brno University of Technology

 E-mail: [EMAIL PROTECTED]
 Web: http://blackhex.no-ip.org
 Jabber: [EMAIL PROTECTED]

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] translation result change from time to time

2008-11-21 Thread Miles Osborne
it could be due to things like the way ties are broken, floating-point
errors and the like

Miles

2008/11/21 Hieu Hoang [EMAIL PROTECTED]:
 that would be worrying. are you sure all parameters are the same? loading
 the models and memory shouldn't affect the results.

 there may rarely be differences if you're running on different operating
 systems due to floating point operations


 2008/11/21 שי מור יוסף [EMAIL PROTECTED]

 Hello, I found that when I try to translate the same sentences at different
 times (from the moment the model is loaded into memory) I get different
 results. Why does this happen? A memory problem maybe? Or is it related to
 the loading process of the model?

 I am using 3 binary models on a strong server (16 GB RAM). I would be happy
 to get information regarding this problem

 or suggestions for tests to find out the cause of the problem.

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support



 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] Announcement: RandLM

2008-11-03 Thread Miles Osborne
What is it?

RandLM (randomised language modelling) is yet another language model
for Moses.  However, it is designed to be very space-efficient indeed:
 depending upon settings, it can represent an SRILM language model in
about 1/10 of the space. The code can be used to estimate LMs either
from raw text (similar to SRILM's ngram-count) or else can be used
to load pre-built ARPA files.  Best compression results are obtained
when building LMs from raw text.

You can get the code here:

http://sourceforge.net/projects/randlm

(This is the first public release and there are sure to be bugs)

Read the files:

BUILDING_WITH_MOSES.txt

for Moses integration and:

README

for general information on building the release.

Note that Moses can support SRILM and RandLM LMs at the same time --just use

./configure --with-randlm=/path/to/randlm --with-srilm=/path/to/srilm

If you want to read more about this, then look at our ACL and EMNLP papers:

David Talbot and Miles Osborne.  Smoothed Bloom filter language
models: Tera-Scale LMs on the Cheap. EMNLP, Prague, Czech Republic
2007.
http://www.iccs.informatics.ed.ac.uk/~osborne/papers/emnlp07.pdf

David Talbot and Miles Osborne. Randomised Language Modelling for
Statistical Machine Translation. ACL, Prague, Czech Republic 2007.
http://www.iccs.informatics.ed.ac.uk/~osborne/papers/acl07.pdf

Miles

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Significance of BLEU using Multi-bleu

2008-09-18 Thread Miles Osborne
firstly, do MERT and make sure that everything has reasonable parameters!

this is how to think about testing.   you are trying to estimate the
error of your model (which you trained-up in the usual way).  when
estimating this error, the *training set* is the test set. so, the
more `training' material you have, the better your confidence in
estimating that error.

in short,  the more test material you use, the more reliable your
results will be.  results can vary in both directions --both up (you
got lucky) and down (you are unlucky).  increasing the test set size
reduces the chances of either of these situations happening.

when working with a narrow domain, you should need fewer sentences,
exactly how few will depend upon what you are doing.

Miles

2008/9/18 Vineet Kashyap [EMAIL PROTECTED]:
 Hi Miles

 Thanks for the fast reply.

 I am very sure that both testing and training data are different.

 Also, no optimization has been done using MERT, and the training
 set is about 8948 sentences. But generally speaking, would
 testing on a small set of sentences increase the BLEU
 scores, and is it possible to get good scores with a small corpus
 when working with a narrow domain?

 I am doing further testing and will look at Corpus size vs BLEU

 Thanks

 Vineet





 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] Fwd: Fwd: Moses: Prepare Data, Build Language Model and Train Model

2008-08-14 Thread Miles Osborne
(my message bounced as it was too long ... here is a truncated  version)

Miles

-- Forwarded message --
From: Miles Osborne [EMAIL PROTECTED]
Date: 2008/8/14
Subject: Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model
and Train Model
To: Llio Humphreys [EMAIL PROTECTED]
Cc: moses-support moses-support@mit.edu


building language models (using for example ngram-count) is computationally
expensive.  from what you tell the list, it seems that you don't have enough
physical memory to run it properly.

you have a number of options:

--specify a lower order model (eg 4 rather than 5, or even 3);  depending
upon how much monolingual training material you have, this may not produce
worse results  and it will certainly run faster and will require less space.

--divide your language model training material into chunks and run
ngram-count on each chunk.  this is one strategy for building LMs using all
of the Giga word corpus (when you don't have access to a 64 bit machine).
here you would create multiple LMs.

--use a disk-based method of creating them.  we have done this, and
basically it trades time for space.

--take the radical option and simply don't bother smoothing at all (ie use
Google's stupid backoff).  this makes training LMs trivial --just compute
the counts of ngrams and work-out how to store them.  i reckon it should be
possible to do this and create an ARPA file suitable for loading into the
SRILM.

--buy more machines.

Miles
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model

2008-08-14 Thread Miles Osborne
building language models (using for example ngram-count) is computationally
expensive.  from what you tell the list, it seems that you don't have enough
physical memory to run it properly.

you have a number of options:

--specify a lower order model (eg 4 rather than 5, or even 3);  depending
upon how much monolingual training material you have, this may not produce
worse results  and it will certainly run faster and will require less space.

--divide your language model training material into chunks and run
ngram-count on each chunk.  this is one strategy for building LMs using all
of the Giga word corpus (when you don't have access to a 64 bit machine).
here you would create multiple LMs.

--use a disk-based method of creating them.  we have done this, and
basically it trades time for space.

--take the radical option and simply don't bother smoothing at all (ie use
Google's stupid backoff).  this makes training LMs trivial --just compute
the counts of ngrams and work-out how to store them.  i reckon it should be
possible to do this and create an ARPA file suitable for loading into the
SRILM.

--buy more machines.

Miles
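
For the chunking option, a rough sketch (assuming a plain-text corpus, one
sentence per line; the chunk size is arbitrary):

 split -l 1000000 corpus.lowercased chunk.
 for f in chunk.*; do ngram-count -order 3 -text $f -lm $f.lm; done

Each resulting model can then be listed as a separate LM in moses.ini --
Moses supports loading several at once -- with the weights left to MERT.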

2008/8/14 Llio Humphreys [EMAIL PROTECTED]

 Dear Murat, Anung, Hieu, Josh, Eric, Miles, Sara, Amittai,
 thank you all for your help.  It is very, very much appreciated. I
 decided to try Eric's packages, and it looks like the installation
 worked.  I typed some of the
  commands in the Baseline instructions without arguments, and the
  program either output to the screen that I missed some arguments or
  gave a description of the program.  Thank you Eric!!!

  Following the Baseline instructions
  (http://www.statmt.org/wmt08/baseline.html) I have now got to the
  following step:

  Use SRILM to build language model:
  /path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount
  -text working-dir/lm/europarl.lowercased -lm
  working-dir/lm/europarl.lm

  In my case, I was in folder home/llio/MOSESMTDATA.  I didn't know the
  path to ngram-count, but it was possible to invoke it without the
  path:

  ngram-count -order 5 -interpolate -kndiscount -text
  europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm

  I'm concerned about two things:
  1) this ngram-count step is taking a very long time.  I think I started
  it off around 6pm yesterday, but it's still going.  It's very
  resource-intensive, and it's difficult to get to  other windows open.
  I went to check up on it around 9pm, and couldn't find that particular
  terminal.  I thought I had closed that terminal by mistake, so I stupidly
  opened another one, and entered the same command.  I subsequently
  found that the original terminal was still open, so I closed the
  second one.  I'm not sure if issuing this command a second time on the
  same program and files on a different terminal would corrupt the
  original ngram-count step, and whether I should start it off again, or
  whether starting it off again would make things worse?   I looked up
  ngram-count (
 http://www.speech.sri.com/projects/srilm/manpages/ngram-count.1.html)
  and I don't think it outputs to any file, so I guess you have to be in
  the same terminal to do the next step?  I opened
  another terminal and typed 'top' to see what processes are running,
  and I know that ngram-count is doing something, but whether it's doing
  well or stuck in a loop, I can't say.  What I do find strange is that
 the time for ngram-count is said to be 00:58:20, and it's been going
 for hours.. I searched this problem in previous Moses Group emails and
 I understand that if I run this with order 4 instead of 5 it will run
 quicker with very similar results?  So, can I just stop what it's
 doing, and run this command in the same terminal with order 4?  Are
 there any files I need to 'touch' to ensure that it doesn't leave any
 stone unturned?
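
  If so, I assume the rerun would just be the original command with the
  order changed, i.e. something like this (please correct me if I've
  misunderstood):

  ngram-count -order 4 -interpolate -kndiscount -text
  europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm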

  2) how to do the next step:


  
 bin/moses-scripts/scripts-MMDD-HHMM/training/train-factored-phrase-model.perl
  -scripts-root-dir bin/moses-scripts/scripts-MMDD-HHMM -root-dir
  working-dir -corpus working-dir/corpus/europarl.lowercased -f fr -e en
  -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm
  0:5:working-dir/lm/europarl.lm:0

 I assume that like ngram-count, I can just type in
 train-factored-phrase-model.perl without the full path... Do I need to
 set the -scripts-root-dir parameter?  Are all the scripts in the same
 place?

 Thank you,

 Llio




  On 8/14/08, Murat ALPEREN [EMAIL PROTECTED] wrote:
   Dear Llio,
  
   You should be okay with installing Moses finally if you have installed
   all the dependent packages before. I am not aware of the 'whereis'
   command, but once you train your model, your moses.ini file, which is
   created by the training script, will take care of the paths. However,
   you should carefully supply paths while training your model. Before
   training your model, you should have two separate corpus files which
   are lowercased, sentence aligned and
   

Re: [Moses-support] Moses: Prepare Data, Build Language Model and Train Model

2008-08-13 Thread Miles Osborne
an ugly hack is to simply create a soft link to the i686-m64 directory (as i
recently did on a new 64-bit machine)
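
for example -- a sketch only, since the exact paths depend on where your
SRILM is installed:

cd /path/to/srilm/bin
ln -s i686-m64 i686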

Miles

2008/8/13 Sara Stymne [EMAIL PROTECTED]

 Hi!

 When we installed SRILM and Moses on our 64-bit Ubuntu machine we had
 some troubles with getting the machine type right. What solved it in the
 end was to hack the machine-type script (found in srilm/sbin), so that
 it gave the correct machine type, from i686 to i686-m64:

else if (`uname -m` == x86_64) then
    set MACHINE_TYPE = i686-m64
    #set MACHINE_TYPE = i686

 After doing that we could compile SRILM without specifying the
 MACHINE_TYPE.
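
 An untested alternative, if you'd rather not edit the script, might be to
 pass the machine type explicitly when building, along the lines of:

 make MACHINE_TYPE=i686-m64 World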

 /Sara


 Llio Humphreys wrote:
  Dear Josh,
  thanks for the links.  I had already found this information, and it
  helped me compile SRILM on the Mac.  Here, the problem was finding the
  most appropriate Makefile for the Linux/Ubuntu machine I'm working on:
  AMD Athlon X2 dual core, x86_64.  $SRILM/common.Makefile.i686_m64
  seemed the most appropriate, and the CC and CXX variables are
  correct, but I still ended up with a lot of errors, unfortunately.
  Llio
 
  On Wed, Aug 13, 2008 at 1:46 PM, Josh Schroeder [EMAIL PROTECTED]
 wrote:
  You can also check out the SRILM documentation:
  http://www.speech.sri.com/projects/srilm/manpages/
  FAQ: http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html
 
  Or search the SRILM mailing list archives:
  http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/
 
  -Josh
 
  On 13 Aug 2008, at 13:37, Anung Ariwibowo wrote:
 
  Hi Llio,
 
   I can compile SRILM on Linux/Ubuntu without problems. Can you post the
   error message here? Maybe we can help.
 
  Cheers,
  Anung
 
  On Wed, Aug 13, 2008 at 8:29 PM, Llio Humphreys [EMAIL PROTECTED]
 
  wrote:
  Dear Josh/Hieu,
  many thanks for your replies.  The default shell is bash, and updating
  the .profile file worked - thanks for that tip.  I look forward to
  hearing more from you about the ./model/extract.0-0.o.part* problem.
  My apologies for my ignorance of Unix matters: I'd like to think of
  myself as a newbie rather than one who is averse to learning about
  these things, and the further information you have provided has been
  useful and interesting.  Hieu mentioned that Anung Ariwibowo got Moses
  to work when he transferred to a Linux machine.  A colleague has
  kindly let me borrow a Linux/Ubuntu machine, but I have already run
  into problems compiling SRILM!  So, I'll see if Eric Nichols's
  packages will take care of that:
   http://cl.naist.jp/~eric-n/ubuntu-nlp/dists/feisty/nlp/
  Best regards,
  Llio
 
 
 
  On 8/13/08, Josh Schroeder [EMAIL PROTECTED] wrote:
  Hi Llio,
 
 
  you may have already received my email on the following problem when
  building the language model:
 
   Executing: cat ./model/extract.0-0.o.part* > ./model/extract.0-0.o
  cat: ./model/extract.0-0.o.part*: No such file or directory
  Exit code: 1
 
    That's building the phrase table, not the language model. It seems like
   several people on the list are having problems with this step, so I'm
   going to take a look at the training process and post something to the
   list in the next day or two.
 
 
  1. You mention that Moses does not use environment variables.
   However, in order to get SRILM to work, I found it necessary to create
   environment variables and pass these on to SRILM's make:
 
  make SRILM=$PWD MACHINE_TYPE=macosx
 
 
 PATH=/bin:/sbin:/usr/bin:/usr/sbin:/Users/lliohumphreys/MT/MOSESSUITE/srilm:/Users/lliohumphreys/MT/MOSESSUITE/srilm/bin:/Users/lliohumphreys/MT/MOSESSUITE/srilm/bin/macosx:/sw/bin/gawk
  MANPATH=/Users/lliohumphreys/MT/MOSESSUITE/srilm/man
  LC_NUMERIC=C
  In addition, I was also required to type in the following command for
  moses-scripts:
 
   export SCRIPTS_ROOTDIR=/Users/lliohumphreys/MT/MOSESSUITE/bin/moses-scripts/scripts-20080811-1801
 
    Sorry, I should have been more clear. Moses itself, the decoder that
   loads a trained phrase table and language model and translates text, is
   a self-contained command-line program that doesn't require environment
   variables.
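
   For instance, once you have a trained model, the decoder can be run
   directly -- a sketch, with illustrative file names:

   moses -f working-dir/model/moses.ini < input.txt > output.txt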
 
    Your first example is compiling SRILM. This is not part of the Moses
   toolkit: it's a toolkit of its own for language modeling and a ton of
   other stuff. We use it as one of two possible integrated language
   models (the other is IRSTLM) with Moses.
 
   Your second example is part of the training regime. Yes, there is some
  use of SCRIPTS_ROOTDIR in train-factored-phrase-model.perl, but for most
  training support scripts that come with Moses there is a flag that lets
  you specify SCRIPTS_ROOTDIR at the command line instead of storing it as
  an environment variable. In train-factored-phrase-model it's
  -scripts-root-dir, which I think you've actually used in one of your
  other emails.
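
  For instance, a sketch that reuses the scripts directory from your
  earlier email (substitute your own paths and language pair):

  export SCRIPTS_ROOTDIR=/Users/lliohumphreys/MT/MOSESSUITE/bin/moses-scripts/scripts-20080811-1801
  $SCRIPTS_ROOTDIR/training/train-factored-phrase-model.perl \
      -scripts-root-dir $SCRIPTS_ROOTDIR \
      -root-dir working-dir -corpus working-dir/corpus/europarl.lowercased \
      -f fr -e en -alignment grow-diag-final-and \
      -reordering msd-bidirectional-fe -lm 0:5:working-dir/lm/europarl.lm:0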
 
 
 
  If I open a new terminal and echo these variables, most of them are
  blank, and PATH just 
