[Moses-support] training and tuning for POS or CCG
I am building an Arabic-to-English factored translation model. My question is: do I have to add POS tags to both the source and the target side, or only to the target side that I am translating into (for both training and tuning)? If I have to add them to both sides, how can I add CCG supertags to the Arabic side, given that no tool supports that?

Thanks in advance.

--
رب اعتق رقابنا ورقاب والدينا من النار
Hamdi Ahmed Rajeh
Hunan University - China
PhD Researcher
0086-15211108249

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
[Moses-support] &apos; in tokenization
Dear all,

When I tokenize my files, the tokenizer replaces apostrophes with "&apos;", which makes sense. On the other hand, it seems to break the meaning and the order of the words completely. For example:

Sentence before tokenization:
Src: keep your notification's payload under 5 kb.
Trg: اجعل حمولة الإعلام أقل من 5 كيلوبايت.

Sentence after tokenization:
Src: keep your notification &apos; s payload under 5 kb .
Trg: اجعل حمولة الإعلام أقل من 5 كيلوبايت .

If I translate "keep" without tokenization, Moses generates "اجعل", which is correct. After tokenization, Moses generates "الإعلام", which suggests that the alignment is broken. Am I doing something wrong or missing something, or is this natural behavior when tokenization is used?

Thanks,
Best regards,
Ihab Ramadan | Senior Developer | Saudisoft - Egypt | http://www.saudisoft.com/
Tel +2 02 330 320 37 Ext- 0 | Mob +201007570826 | Fax +20233032036
[Moses-support] Moses tokenizer treats combining diaeresis inconsistently
Dear Moses,

The attached file, taken from line 2345157 of http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz , tokenizes differently on different machines. I'm running tokenizer.perl from head (481a07dc) with this perl:

    This is perl 5, version 18, subversion 2 (v5.18.2) built for x86_64-linux-thread-multi (with 25 registered patches, see perl -V for more detail)

perl -V is attached from newer machines.

The input is Jürgen with a specific encoding:

    uconv -f utf-8 -x any-name jur
    \N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}\N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN SMALL LETTER E}\N{LATIN SMALL LETTER N}\N{control-000A}

So the umlaut is encoded as a normal u character followed by a combining diaeresis marker. This encoding is legal, but it differs from the single-character canonical encoding of \N{LATIN SMALL LETTER U WITH DIAERESIS}.

Older machines treat \N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS} as a single character and recognize it as part of the IsAlnum class. Tokenizing on these machines outputs

    Jürgen

Newer machines treat the two code points separately and recognize \N{COMBINING DIAERESIS} as a separate character that is not part of IsAlnum. The Moses tokenizer then treats it as something to split off, yielding this tokenization:

    Ju ̈ rgen

I thought it might be locale-related, but IsAlnum is supposed to be locale-agnostic. I couldn't come up with environment variables that made the new machines tokenize it as a single word. Maybe this is a perl bug, but the result is that two different machines running the same perl script produce different tokenizations :-(.

This is also a reason to turn Unicode normalization on. If the tokenizer did NFKC at the beginning, then the problem would go away.
Kenneth

Attachment: jur.gz (application/gzip)

Summary of my perl5 (revision 5 version 18 subversion 2) configuration:

  Platform:
    osname=linux, osvers=3.16.1, archname=x86_64-linux-thread-multi
    uname='linux lister 3.16.1 #2 smp sun aug 31 21:04:00 edt 2014 x86_64 intel(r) core(tm) i5-2430m cpu @ 2.40ghz genuineintel gnulinux '
    config_args='-des -Duseshrplib -Darchname=x86_64-linux-thread -Dcc=x86_64-pc-linux-gnu-gcc -Doptimize=-O3 -march=native -pipe -Dldflags=-Wl,-O1 -Wl,--as-needed -Dprefix=/usr -Dinstallprefix=/usr -Dsiteprefix=/usr/local -Dvendorprefix=/usr -Dscriptdir=/usr/bin -Dprivlib=/usr/lib64/perl5/5.18.2 -Darchlib=/usr/lib64/perl5/5.18.2/x86_64-linux-thread-multi -Dsitelib=/usr/local/lib64/perl5/5.18.2 -Dsitearch=/usr/local/lib64/perl5/5.18.2/x86_64-linux-thread-multi -Dvendorlib=/usr/lib64/perl5/vendor_perl/5.18.2 -Dvendorarch=/usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dvendorman1dir=/usr/share/man/man1 -Dvendorman3dir=/usr/share/man/man3 -Dman1ext=1 -Dman3ext=3pm -Dlibperl=libperl.so.5.18.2 -Dlocincpth=/usr/include -Dglibpth=/lib64 /usr/lib64 -Duselargefiles -Dd_semctl_semun -Dcf_by=Gentoo -Dmyhostname=localhost -Dperladmin=root@localhost -Dinstallusrbinperl=n -Ud_csh -Uusenm -Di_ndbm -Di_gdbm -Di_db -Dusethreads -DDEBUGGING=none -Dinc_version_list=5.18.0/x86_64-linux-thread-multi 5.18.0 5.18.1/x86_64-linux-thread-multi 5.18.1 -Dlibpth=/usr/local/lib64 /lib64 /usr/lib64 -Dnoextensions=ODBM_File'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='x86_64-pc-linux-gnu-gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O3 -march=native -pipe',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe'
    ccversion='', gccversion='4.7.3', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='x86_64-pc-linux-gnu-gcc', ldflags ='-Wl,-O1 -Wl,--as-needed'
    libpth=/usr/local/lib64 /lib64 /usr/lib64
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lgdbm_compat
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=/lib/libc-2.19.so, so=so, useshrplib=true, libperl=libperl.so.5.18.2
    gnulibc_version='2.19'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O3 -march=native -pipe -Wl,-O1 -Wl,--as-needed'

Characteristics of this binary (from libperl):
  Compile-time options: HAS_TIMES MULTIPLICITY PERLIO_LAYERS
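The encoding difference described in this message can be reproduced outside Perl. The following is a minimal Python sketch, not the tokenizer's own logic: Python's `str.isalnum` is only a stand-in for Perl's IsAlnum class, and the helper names `describe` and `normalize_nfc` are made up for illustration. It lists the code points of the decomposed spelling and shows that NFC composes it into the single-character form.

```python
import unicodedata

# "Jürgen" spelled with a plain u followed by U+0308 COMBINING DIAERESIS,
# as in the attached file, and the precomposed spelling with U+00FC.
decomposed = "Ju\u0308rgen"
precomposed = "J\u00fcrgen"


def describe(text):
    """Return (character name, general category, isalnum) for each code point."""
    return [
        (unicodedata.name(ch), unicodedata.category(ch), ch.isalnum())
        for ch in text
    ]


def normalize_nfc(text):
    """Compose base letters and combining marks where a single code point exists."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    for row in describe(decomposed):
        print(row)
    print(len(decomposed), len(precomposed))
    print(normalize_nfc(decomposed) == precomposed)
```

The combining mark has general category Mn (nonspacing mark), which is why a character-class test applied code point by code point can separate it from the preceding u.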
Re: [Moses-support] Moses tokenizer treats combining diaeresis inconsistently
On Dec 29, 2014, at 16:05, Kenneth Heafield <mo...@kheafield.com> wrote:
> [...]
> This is also a reason to turn Unicode normalization on. If the tokenizer did NFKC at the beginning, then the problem would go away.

If I understand the situation correctly, this would only fix this particular example and a few others like it. There are many base+combining grapheme clusters in Unicode text which cannot be normalized to a single pre-composed character. Vietnamese comes to mind.

- JB
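The limitation raised in this reply can be checked directly. The sketch below assumes Python's `unicodedata` module; the example cluster (q followed by a combining diaeresis) was picked for illustration because Unicode defines no single code point for it. The functions `is_word_char` and `split_words` are hypothetical helpers showing one possible workaround, keeping combining marks attached to their base letter. They are not how tokenizer.perl works.

```python
import unicodedata


def compose(text):
    """Apply NFC, which composes base+mark pairs only where Unicode defines
    a single precomposed code point."""
    return unicodedata.normalize("NFC", text)


def is_word_char(ch):
    """Treat letters, digits, and combining marks (categories Mn, Mc, Me)
    as word-internal characters."""
    return ch.isalnum() or unicodedata.category(ch).startswith("M")


def split_words(text):
    """Split on anything that is not a word character, so that marks stay
    attached to the base letter they follow."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


if __name__ == "__main__":
    print(len(compose("u\u0308")))   # u + diaeresis has a precomposed form
    print(len(compose("q\u0308")))   # q + diaeresis does not
    print(split_words("Ju\u0308rgen q\u0308 ."))
```

Whether a real tokenizer should treat marks this way depends on the language and the rest of the pipeline; the point here is only that normalization alone leaves some clusters as two code points.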
Re: [Moses-support] &apos; in tokenization
The escaping is necessary because Moses reserves these characters for other uses. When corpora are prepared consistently, the escaping has no effect on translation results. It looks like you have not prepared your corpora consistently. Note that my result (&apos;s) is different from yours (&apos; s):

    user@host:~$ echo "keep your notification's payload under 5 kb." | tokenizer.perl -l en
    Tokenizer Version 1.1
    Language: en
    Number of threads: 1
    keep your notification &apos;s payload under 5 kb .

Go back and double-check how you prepare your training corpus and your translation jobs.

On 12/29/2014 09:26 PM, Ihab Ramadan wrote:
> [...]
> Sentence after tokenization:
> Src: keep your notification &apos; s payload under 5 kb .
> Trg: اجعل حمولة الإعلام أقل من 5 كيلوبايت.
> [...]
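To make the point about reserved characters concrete, here is a small Python sketch of the kind of escaping involved. The replacement table is an assumption, reproduced from memory of the special-character step in tokenizer.perl, so check your copy of the script for the authoritative list. The function names `escape_moses` and `unescape_moses` are hypothetical.

```python
# Replacement table modelled on the special-character escaping in
# tokenizer.perl. Assumption: written from memory, not copied from the script.
MOSES_ESCAPES = [
    ("&", "&amp;"),   # must come first so later entities are not re-escaped
    ("|", "&#124;"),  # reserved as the factor separator
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def escape_moses(text):
    """Replace reserved characters with their entities, in table order."""
    for raw, entity in MOSES_ESCAPES:
        text = text.replace(raw, entity)
    return text


def unescape_moses(text):
    """Undo escape_moses; reverse order so that "&amp;" is restored last."""
    for raw, entity in reversed(MOSES_ESCAPES):
        text = text.replace(entity, raw)
    return text


if __name__ == "__main__":
    tokenized = "keep your notification 's payload under 5 kb ."
    print(escape_moses(tokenized))
    print(unescape_moses(escape_moses(tokenized)) == tokenized)
```

The escaping itself is harmless as long as the training corpus, the tuning set, and the text sent to the decoder all pass through the same tokenization and escaping steps; a mismatch such as `&apos; s` on one side and `&apos;s` on the other means the model never saw the tokens it is asked to translate.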
Re: [Moses-support] Moses tokenizer treats combining diaeresis inconsistently
Japanese is another language that suffers under standard Unicode NFKC, because the normalization applies changes that cannot be reversed.

On 12/30/2014 04:40 AM, John D Burger wrote:
> If I understand the situation correctly, this would only fix this particular example and a few others like it. There are many base+combining grapheme clusters in Unicode text which cannot be normalized to a single pre-composed character. Vietnamese comes to mind.
> [...]
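The irreversibility mentioned in this reply can be illustrated with a few characters common in Japanese text. This is a sketch using Python's `unicodedata`; the sample characters were chosen for illustration and do not come from the thread, and `nfkc` and `collisions` are hypothetical helper names.

```python
import unicodedata


def nfkc(text):
    """Apply Unicode compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)


def collisions(strings):
    """Group inputs by their NFKC form and keep the groups where several
    distinct inputs collapse to the same output."""
    groups = {}
    for s in strings:
        groups.setdefault(nfkc(s), []).append(s)
    return {form: members for form, members in groups.items() if len(members) > 1}


SAMPLES = [
    "\uff76\uff9e",  # half-width katakana KA + half-width voiced sound mark
    "\u30ac",        # katakana GA
    "\uff21",        # full-width Latin capital A
    "A",             # ASCII A
    "\u337b",        # square era name HEISEI (one code point)
]

if __name__ == "__main__":
    for s in SAMPLES:
        print([hex(ord(c)) for c in s], "->", [hex(ord(c)) for c in nfkc(s)])
    print(collisions(SAMPLES))
```

Once half-width and full-width forms have been folded together, the normalized text no longer records which form the input used, so the original cannot be reconstructed from the output alone.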
[Moses-support] how to compile with nplm library
Hi,

nplm is a toolkit for neural probabilistic language models. It can be used in Moses as a language model and as a bilingual LM (the neural network joint model, ACL 2014). Support for both has been added to mosesdecoder on GitHub. To use nplm in Moses, you have to compile Moses and link it against libnplm.a (generated by nplm).

Here is the problem: how do I compile Moses with libnplm.a? Do I need to modify the Jamroot file, and if so, how?

Thanks,
Xiaoqiang Feng
Re: [Moses-support] how to compile with nplm library
Hey,

First you need to check out and compile this fork of nplm:
https://github.com/rsennrich/nplm

Then you need to compile Moses with the nplm switch:

    ./bjam --with-nplm=path/to/nplm

Then you can see how to use it here:
http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel#ntoc31

On 30 Dec 2014 06:28, Xiaoqiang Feng <feng.x.q.2...@gmail.com> wrote:
> [...]
> Here is the problem: how to compile Moses with libnplm.a? Do I need to modify the Jamroot file, and how?