Re: [Moses-support] Unicode Issues when Using Compact Phrase Table, Binaries vs. Own Build

2015-03-31 Thread Венцислав Жечев (Ventsislav Zhechev)
Hi,

Any clue what systems could be messed up? On Ubuntu I compiled Boost 1.57, cmph 
and Moses right out of the box, so I don’t see what I could have done wrong 
there.

I just checked and the gzipped phrase tables are proper UTF-8. I even ran the 
processPhraseTableMin binary from the website on the Ubuntu machine and still 
got the same results. That is, if I query the compact phrase table with the 
queryPhraseTableMin binary from the website, UTF-8 is recognised and I get 
results; if I use the queryPhraseTableMin that I compiled on the same system, 
UTF-8 is not recognised and I get no results.

Does anyone have an idea what could influence the compilation of Moses in a way 
that would prevent it from properly reading UTF-8?
Especially given that the Moses binaries for MacOS X from the website don’t 
seem to read UTF-8 properly (at least on my machine), and I didn’t compile 
those.


Cheers,

Ventzi
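
The UTF-8 check mentioned in this message could be sketched roughly as follows. This is only an illustration in Python, not a Moses tool: the function name check_phrase_table and the file name in the usage note are hypothetical. Besides strict UTF-8 decoding, it also counts lines that are not in NFC form, on the assumption that composed versus decomposed characters might play a role.

```python
import gzip
import unicodedata

def check_phrase_table(path: str) -> dict:
    """Count lines that are not valid UTF-8 or not in NFC form.

    Hypothetical helper for illustration; it reads the gzipped text
    phrase table line by line and never loads it fully into memory.
    """
    stats = {"lines": 0, "invalid_utf8": 0, "not_nfc": 0}
    with gzip.open(path, "rb") as handle:
        for raw in handle:
            stats["lines"] += 1
            try:
                # Strict decoding raises on any malformed byte sequence.
                text = raw.decode("utf-8")
            except UnicodeDecodeError:
                stats["invalid_utf8"] += 1
                continue
            # A line is in NFC form if normalizing it changes nothing.
            if unicodedata.normalize("NFC", text) != text:
                stats["not_nfc"] += 1
    return stats
```

Calling check_phrase_table("phrase-table.gz") (file name assumed) returns the three counters; non-zero values for invalid_utf8 or not_nfc would point at the data rather than at the binaries.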

 On 30.03.2015 at 11:08, moses-support-requ...@mit.edu wrote:
 
 Date: Mon, 30 Mar 2015 11:08:13 +0200
 From: Marcin Junczys-Dowmunt junc...@amu.edu.pl
 Subject: Re: [Moses-support] Unicode Issues when Using Compact Phrase
   Table, Binaries vs. Own Build
 To: moses-support@mit.edu
 Message-ID: 5519127d.7080...@amu.edu.pl
 Content-Type: text/plain; charset=utf-8
 
 Hi,
 The phrase table and, as far as I know, Moses in general are 
 Unicode-agnostic, as long as you use UTF-8. Input is handled as raw byte 
 sequences; most of the time there are only numeric identifiers.
 Sounds more like a couple of messed up systems on your side, especially 
 the part where self-compiled systems work or don't work. Cannot give you 
 much more insight, unfortunately.
 Best,
 Marcin


___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Unicode Issues when Using Compact Phrase Table, Binaries vs. Own Build

2015-03-30 Thread Marcin Junczys-Dowmunt
Forgot to add that we use the compact phrase table and Moses on older 
and newer Ubuntu versions with Arabic, Chinese, Korean, Japanese and Russian 
in both directions without any problems. Those puny German umlauts should not 
be a challenge. :)


On 30.03.2015 at 10:53, Венцислав Жечев (Ventsislav Zhechev) wrote:

Hi all,

I’m having this really weird Unicode issue when using compact phrase 
tables that could be related to endianness somehow, but I’ve no idea how.
I compiled the training tools from v3 on my Mac and built a few 
models using compact phrase (and reordering) tables and KenLM, 
including (for simplicity) a recasing model for DE (download it from 
https://autodesk.box.com/DE-Recaser). Things become strange when I 
try to use the models, though:
1. All works fine when I use the decoder binary I compiled myself on 
the Mac (10.10.2, self-built Boost 1.57)
2. Unicode input is not recognised when I use the binary from 
http://www.statmt.org/moses/RELEASE-3.0/binaries/macosx-yosemite/, i.e. words 
like ‘für’ or ‘ausführlich’ are marked as UNK.
3. Unicode input is not recognised when I use a binary I compiled 
myself on Ubuntu 12.04.5 (self-built Boost 1.57)
4. All works fine when I use the binary from 
http://www.statmt.org/moses/RELEASE-3.0/binaries/linux-64bit/


I tested the above with the queryPhraseTableMin tool (rather than the 
decoder) and got the same results, which is what makes me think this 
could be somehow related to binary incompatibility with the way the 
phrase table is compacted. Haven’t investigated deeper than that, though.



Any clues?
One would say, just use the Linux binary then on Linux... However, I 
have a number of CentOS/RHEL 5 and 6 boxes, where the pre-compiled 
binary doesn’t work, as the system glibc is too old. So there I need 
to compile Moses myself, but then Unicode isn’t recognised...




Cheers,

Ventzi

–––
*Dr. Ventsislav Zhechev*
Computational Linguist, Certified ScrumMaster®
Platform Architecture and Technologies
Localisation Services

*MAIN* +41 32 723 91 22
*FAX* +41 32 723 93 99

_http://VentsislavZhechev.eu_

*Autodesk, Inc.*
Rue de Puits-Godet 6
2000 Neuchâtel, Switzerland
_www.autodesk.com_







Re: [Moses-support] Unicode Issues when Using Compact Phrase Table, Binaries vs. Own Build

2015-03-30 Thread Marcin Junczys-Dowmunt

Hi,
The phrase table and, as far as I know, Moses in general are 
Unicode-agnostic, as long as you use UTF-8. Input is handled as raw byte 
sequences; most of the time there are only numeric identifiers.
Sounds more like a couple of messed up systems on your side, especially 
the part where self-compiled systems work or don't work. Cannot give you 
much more insight, unfortunately.

Best,
Marcin
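
A toy illustration of the point about raw byte sequences, written in Python and not taken from the Moses sources: if words are matched purely by their bytes, any difference in how the input was encoded makes the lookup miss, even when the text looks identical on screen. The names vocab, lookup and UNK are made up for this sketch.

```python
UNK = -1  # hypothetical marker for an unknown word

# Toy vocabulary: UTF-8 byte sequence -> numeric identifier.
vocab = {
    "f\u00fcr".encode("utf-8"): 0,
    "ausf\u00fchrlich".encode("utf-8"): 1,
}

def lookup(raw: bytes) -> int:
    """Exact byte-wise lookup; no decoding or normalization takes place."""
    return vocab.get(raw, UNK)

print(lookup("f\u00fcr".encode("utf-8")))    # same bytes as stored: 0
print(lookup("f\u00fcr".encode("latin-1")))  # Latin-1 bytes differ: -1
print(lookup("fu\u0308r".encode("utf-8")))   # decomposed 'ü' differs: -1
```

Under this model, an UNK for ‘für’ would mean that the bytes reaching the lookup differ from the bytes stored in the table, which is something a hex dump of the input on each system could confirm or rule out.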



Re: [Moses-support] Unicode Issues when Using Compact Phrase Table, Binaries vs. Own Build

2015-03-30 Thread Nikolay Bogoychev
Hey Венци,

Did you by any chance binarize your phrase tables from a raw text format or
from a gzipped file (or any other supported compressed text format)? I recently
ran into similar issues with my phrase table (ProbingPT) when the input
phrase table had not been compressed during binary creation. I wasn't able
to trace the issue; I just make sure I gzip any phrase table before
binarizing.

Cheers,

Nick
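
The workaround described above, compressing the text phrase table before binarizing it, could look roughly like this in Python. It is a sketch equivalent to running gzip on the command line; the function name gzip_file and the file names in the usage note are assumptions, and whether the step matters for the compact phrase table (as opposed to ProbingPT) is not established in this thread.

```python
import gzip
import shutil

def gzip_file(src: str, dst: str) -> None:
    """Write a gzip-compressed copy of src to dst, leaving src untouched."""
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        # Stream in chunks so large phrase tables are not read into memory.
        shutil.copyfileobj(fin, fout)
```

For example, gzip_file("phrase-table", "phrase-table.gz") (both names hypothetical) produces the compressed file that would then be handed to the binarization tool.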



Re: [Moses-support] Unicode Issues when Using Compact Phrase Table, Binaries vs. Own Build

2015-03-30 Thread Kenneth Heafield
Sounds like a case of composed characters.

Try passing the input through this:

uconv -f utf8 -t utf8 -x Any-NFKC --callback skip --remove-signature
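
To illustrate the composed-character hypothesis: the same visible word can have different UTF-8 byte sequences, depending on whether ‘ü’ is a single precomposed code point (U+00FC) or a plain ‘u’ followed by a combining diaeresis (U+0308). The sketch below uses Python's standard unicodedata module as a stand-in for the uconv call above. It only covers the NFKC step, not the skipping of invalid sequences or the removal of a byte-order mark, and the helper name normalize_nfkc is made up.

```python
import unicodedata

# The same visible word in two Unicode forms.
composed = "f\u00fcr"      # 'ü' as one precomposed code point (NFC)
decomposed = "fu\u0308r"   # 'u' followed by COMBINING DIAERESIS (NFD)

# The UTF-8 byte sequences differ, so a byte-wise comparison treats
# the two spellings as unrelated words.
print(composed.encode("utf-8"))    # b'f\xc3\xbcr'
print(decomposed.encode("utf-8"))  # b'fu\xcc\x88r'

def normalize_nfkc(text: str) -> str:
    """Return text in NFKC form, the form requested by the uconv call."""
    return unicodedata.normalize("NFKC", text)

# After normalization both spellings collapse to the same string.
assert normalize_nfkc(decomposed) == normalize_nfkc(composed)
```

If this is the cause, the training data and the decoder input would both need to go through the same normalization so that the bytes match.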
