date:20141229

[Moses-support] training and tuning for POS or CCG

2014-12-29 Thread Eng HAR

I am working on Arabic-to-English translation with a factored model.

My question is: do I have to add POS tags to both the source and the target,
or only to the target language I am translating into (for training and tuning)?

If I have to add them to both, how can I add CCG supertags to the Arabic
side, given that no tool supports that?

Thanks in advance.

-- 
Lord, free us and our parents from the Fire

Hamdi Ahmed Rajeh
Hunan University - China
PhD Researcher
0086-15211108249
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

[Moses-support] &apos; in tokenization

2014-12-29 Thread Ihab Ramadan

Dear all,

When I tokenize my files, the tokenizer replaces apostrophes with “&apos;”,
which makes sense. However, it also seems to break the meaning and the order
of the words completely. For example:

Sentence before tokenization:

Src: keep your notification's payload under 5 kb.

Trg: اجعل حمولة الإعلام أقل من 5 كيلوبايت.

Sentence after tokenization :

Src: keep your notification &apos; s payload under 5 kb .

Trg: اجعل حمولة الإعلام أقل من 5 كيلوبايت .

If I translate “keep” without tokenization, Moses generates “اجعل”,
which is correct. After tokenization, Moses generates “الإعلام”,
which suggests that the alignment is broken.

Am I doing something wrong?

Am I missing something, or is this just the expected behavior when I use
tokenization?

Thanks


Best Regards

Ihab Ramadan | Senior Developer | Saudisoft - Egypt | http://www.saudisoft.com/
Tel +2 02 330 320 37 Ext-0 | Mob +201007570826 | Fax +20233032036



[Moses-support] Moses tokenizer treats combining diaeresis inconsistently

2014-12-29 Thread Kenneth Heafield

Dear Moses,

The attached file, taken from line 2345157 of
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
, tokenizes differently on different machines.

I'm running tokenizer.perl from head (481a07dc) with this perl:

This is perl 5, version 18, subversion 2 (v5.18.2) built for
x86_64-linux-thread-multi
(with 25 registered patches, see perl -V for more detail)

perl -V is attached from newer machines.

The input is Jürgen with a specific encoding:

uconv -f utf-8 -x any-name jur

\N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING
DIAERESIS}\N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN SMALL
LETTER E}\N{LATIN SMALL LETTER N}\N{control-000A}

So the umlaut is encoded as a normal u character followed by a
combining diaeresis marker.  This encoding is legal, but it differs from
the single-character canonical encoding of \N{LATIN SMALL LETTER U WITH
DIAERESIS}.

Older machines are treating \N{LATIN SMALL LETTER U}\N{COMBINING
DIAERESIS} as a single character and recognizing it as part of the
IsAlnum class.  Tokenizing on these machines outputs

Jürgen

Newer machines are treating them separately, recognizing \N{COMBINING
DIAERESIS} as a separate character that is not part of IsAlnum.  The
Moses tokenizer then treats it as something to split off, yielding this
tokenization:

Ju ̈ rgen

I thought it might be locale-related but IsAlnum is supposed to be
locale-agnostic.  I couldn't come up with environment variables that
made the new machines tokenize as a single word.

Maybe this is a perl bug, but the result is that two different machines
running the same perl script produce different tokenization :-(.

This is also a reason to turn Unicode normalization on.  If the
tokenizer did NFKC at the beginning, then the problem would go away.
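
The difference between the two spellings, and the effect of normalizing before tokenization, can be sketched in Python. This is only an illustration and not part of tokenizer.perl; the helper name normalize_line is hypothetical.

```python
import unicodedata

# Decomposed spelling from the report: "u" followed by COMBINING DIAERESIS (U+0308).
decomposed = "Ju\u0308rgen"
# Precomposed spelling: LATIN SMALL LETTER U WITH DIAERESIS (U+00FC).
precomposed = "J\u00fcrgen"


def normalize_line(line, form="NFKC"):
    """Hypothetical pre-tokenization step: normalize one line of text."""
    return unicodedata.normalize(form, line)


# The two spellings have different code point counts and compare unequal.
print(len(decomposed), len(precomposed))
print(decomposed == precomposed)

# After normalization the decomposed spelling matches the precomposed one.
print(normalize_line(decomposed) == precomposed)

# The combining diaeresis is a nonspacing mark (category Mn), not a letter,
# which is why a letters-and-digits test can reject it when seen on its own.
print(unicodedata.category("\u0308"))
```

With both spellings mapped to the same code points up front, the tokenizer would see identical input regardless of how the source text encoded the umlaut.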

Kenneth



Attachment: jur.gz (application/gzip)
Attachment: perl_V.txt (output of perl -V):

Summary of my perl5 (revision 5 version 18 subversion 2) configuration:
   
  Platform:
osname=linux, osvers=3.16.1, archname=x86_64-linux-thread-multi
uname='linux lister 3.16.1 #2 smp sun aug 31 21:04:00 edt 2014 x86_64 
intel(r) core(tm) i5-2430m cpu @ 2.40ghz genuineintel gnulinux '
config_args='-des -Duseshrplib -Darchname=x86_64-linux-thread 
-Dcc=x86_64-pc-linux-gnu-gcc -Doptimize=-O3 -march=native -pipe 
-Dldflags=-Wl,-O1 -Wl,--as-needed -Dprefix=/usr -Dinstallprefix=/usr 
-Dsiteprefix=/usr/local -Dvendorprefix=/usr -Dscriptdir=/usr/bin 
-Dprivlib=/usr/lib64/perl5/5.18.2 
-Darchlib=/usr/lib64/perl5/5.18.2/x86_64-linux-thread-multi 
-Dsitelib=/usr/local/lib64/perl5/5.18.2 
-Dsitearch=/usr/local/lib64/perl5/5.18.2/x86_64-linux-thread-multi 
-Dvendorlib=/usr/lib64/perl5/vendor_perl/5.18.2 
-Dvendorarch=/usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi 
-Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 
-Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 
-Dvendorman1dir=/usr/share/man/man1 -Dvendorman3dir=/usr/share/man/man3 
-Dman1ext=1 -Dman3ext=3pm -Dlibperl=libperl.so.5.18.2 -Dlocincpth=/usr/include  
-Dglibpth=/lib64 /usr/lib64  -Duselargefiles -Dd_semctl_semun -Dcf_by=Gentoo 
-Dmyhostname=localhost -Dperladmin=root@localhost -Dinstallusrbinperl=n -Ud_csh 
-Uusenm -Di_ndbm -Di_gdbm -Di_db -Dusethreads -DDEBUGGING=none 
-Dinc_version_list=5.18.0/x86_64-linux-thread-multi 5.18.0 
5.18.1/x86_64-linux-thread-multi 5.18.1  -Dlibpth=/usr/local/lib64 /lib64 
/usr/lib64 -Dnoextensions=ODBM_File'
hint=recommended, useposix=true, d_sigaction=define
useithreads=define, usemultiplicity=define
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=define, use64bitall=define, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
  Compiler:
cc='x86_64-pc-linux-gnu-gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE 
-fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-O3 -march=native -pipe',
cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe'
ccversion='', gccversion='4.7.3', gccosandvers=''
intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', 
lseeksize=8
alignbytes=8, prototype=define
  Linker and Libraries:
ld='x86_64-pc-linux-gnu-gcc', ldflags ='-Wl,-O1 -Wl,--as-needed'
libpth=/usr/local/lib64 /lib64 /usr/lib64
libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lgdbm_compat
perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
libc=/lib/libc-2.19.so, so=so, useshrplib=true, libperl=libperl.so.5.18.2
gnulibc_version='2.19'
  Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
cccdlflags='-fPIC', lddlflags='-shared -O3 -march=native -pipe -Wl,-O1 
-Wl,--as-needed'


Characteristics of this binary (from libperl): 
  Compile-time options: HAS_TIMES MULTIPLICITY PERLIO_LAYERS

Re: [Moses-support] Moses tokenizer treats combining diaeresis inconsistently

2014-12-29 Thread John D Burger

> This is also a reason to turn Unicode normalization on.  If the
> tokenizer did NFKC at the beginning, then the problem would go away.

If I understand the situation correctly, this would only fix this particular 
example and a few others like it. There are many base+combining grapheme 
clusters in Unicode text which cannot be normalized to a single pre-composed 
character. Vietnamese comes to mind.
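
A small Python check illustrates the point. It assumes "n" plus COMBINING DIAERESIS as the example of a cluster with no precomposed code point; the helper name composes_fully is hypothetical.

```python
import unicodedata


def composes_fully(cluster):
    """Return True if NFC turns the cluster into a single code point."""
    return len(unicodedata.normalize("NFC", cluster)) == 1


u_diaeresis = "u\u0308"  # has the precomposed form U+00FC
n_diaeresis = "n\u0308"  # assumed example: no precomposed form exists

print(composes_fully(u_diaeresis))
print(composes_fully(n_diaeresis))

# Even after normalization, the second cluster still contains a bare
# combining mark that a tokenizer has to keep attached to its base letter.
print([unicodedata.category(c) for c in unicodedata.normalize("NFC", n_diaeresis)])
```

So normalization narrows the problem but does not remove it: a tokenizer that wants stable output still needs to treat combining marks as part of the preceding letter.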

- JB


Re: [Moses-support] &apos; in tokenization

2014-12-29 Thread Tom Hoar



The escaping is necessary because Moses reserves these characters for 
other uses. When corpora are prepared consistently, the escaping has no 
effect on translation results. It looks like you have not prepared your 
corpora consistently. Note that my result (&apos;s) is different from yours 
(&apos; s):


user@host:~$ echo "keep your notification's payload under 5 kb." | tokenizer.perl -l en

Tokenizer Version 1.1
Language: en
Number of threads: 1
keep your notification &apos;s payload under 5 kb .

Go back and double-check how you prepare your training corpus and your 
translation jobs.
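
Why the inconsistency matters can be sketched in Python. The two strings are the tokenizations quoted in this thread; which one was used for training and which for decoding is an assumption made only for illustration, and the helper unseen_tokens is hypothetical.

```python
# Assumed for illustration: the corpus was prepared one way ...
train_style = "keep your notification &apos; s payload under 5 kb ."
# ... and the translation input was prepared another way.
decode_style = "keep your notification &apos;s payload under 5 kb ."


def unseen_tokens(sentence, vocabulary):
    """Return the tokens of the sentence that are missing from the vocabulary."""
    return [token for token in sentence.split() if token not in vocabulary]


# Vocabulary as the model would have seen it during training.
vocabulary = set(train_style.split())

# Tokens in the input that the model never saw in this form.
print(unseen_tokens(decode_style, vocabulary))
```

A token the model has never seen in that exact form cannot be matched against the phrase table, so running the same preparation steps on the training data and on every translation input avoids this kind of mismatch.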




Re: [Moses-support] Moses tokenizer treats combining diaeresis inconsistently

2014-12-29 Thread Tom Hoar

Japanese is another language that suffers under standard Unicode NFKC, 
because the normalization makes changes that cannot be reversed.
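
One plausible reading of this, sketched in Python: NFKC folds compatibility variants such as halfwidth katakana and fullwidth Latin letters into their standard forms, after which the original variant can no longer be recovered. The examples are my own choice and are not taken from the thread.

```python
import unicodedata

halfwidth_ka = "\uff76"  # HALFWIDTH KATAKANA LETTER KA
standard_ka = "\u30ab"   # KATAKANA LETTER KA
fullwidth_a = "\uff21"   # FULLWIDTH LATIN CAPITAL LETTER A

# NFKC maps the compatibility variants onto the standard characters.
print(unicodedata.normalize("NFKC", halfwidth_ka) == standard_ka)
print(unicodedata.normalize("NFKC", fullwidth_a) == "A")

# The standard form is already normalized, so both inputs end up identical
# and nothing in the output records which variant the text started with.
print(unicodedata.normalize("NFKC", standard_ka) == standard_ka)

# NFC, by contrast, leaves the halfwidth variant untouched.
print(unicodedata.normalize("NFC", halfwidth_ka) == halfwidth_ka)
```

If the width distinction matters for a given corpus, canonical normalization (NFC) is the more conservative choice, at the cost of leaving compatibility variants unmerged.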




[Moses-support] how to compile with nplm library

2014-12-29 Thread Xiaoqiang Feng

Hi,

NPLM is a toolkit for neural probabilistic language models. It can be used
in Moses as a language model and as a bilingual LM (the neural network
joint model, ACL 2014). Both have been added to mosesdecoder on GitHub.

To use NPLM in Moses, you have to compile Moses and link it against
libnplm.a (generated by NPLM).
Here is the problem: how do I compile Moses with libnplm.a? Do I need to
modify the Jamroot file, and if so, how?

Thanks,
Xiaoqiang Feng

Re: [Moses-support] how to compile with nplm library

2014-12-29 Thread Nikolay Bogoychev

Hey,

First you need to check out and compile this fork of nplm:
https://github.com/rsennrich/nplm

Then you need to compile Moses with the nplm switch:
./bjam --with-nplm=path/to/nplm

Then you can see how to use it here:
http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel#ntoc31
