Re: [Moses-support] Unicode Issues when Using Compact Phrase Table, Binaries vs. Own Build

2015-03-31 Thread Венцислав Жечев (Ventsislav Zhechev)
. Cheers, Ventzi > On 30.03.2015, at 11:08, moses-support-requ...@mit.edu wrote: > > Date: Mon, 30 Mar 2015 11:08:13 +0200 > From: Marcin Junczys-Dowmunt > Subject: Re: [Moses-support] Unicode Issues when Using Compact Phrase > Table, Binaries vs. Own Build > To:

Re: [Moses-support] Unicode Issues when Using Compact Phrase Table, Binaries vs. Own Build

2015-03-30 Thread Kenneth Heafield
Sounds like a case of composed characters. Try passing the input through this: uconv -f utf8 -t utf8 -x Any-NFKC --callback skip --remove-signature On 03/30/2015 04:53 AM, "Венцислав Жечев (Ventsislav Zhechev)" wrote: > Hi all, > > I’m having this really weird Unicode issue when using compact p
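
For readers without ICU's uconv at hand, the normalization step suggested above can be approximated in Python. This is only a sketch: it uses the standard library's unicodedata.normalize with the NFKC form, the helper name normalize_line is hypothetical, and dropping undecodable bytes with errors="ignore" is an assumption that only roughly mirrors --callback skip. Decoding as "utf-8-sig" strips a leading byte-order mark, similar in spirit to --remove-signature.

```python
import unicodedata


def normalize_line(raw: bytes) -> bytes:
    """Return the UTF-8 bytes of `raw` after NFKC normalization.

    Hypothetical helper, not part of Moses or ICU.
    """
    # "utf-8-sig" drops a leading BOM; errors="ignore" silently skips
    # byte sequences that are not valid UTF-8 (an assumption, see lead-in).
    text = raw.decode("utf-8-sig", errors="ignore")
    return unicodedata.normalize("NFKC", text).encode("utf-8")


# The same German word, once with a decomposed umlaut (u + U+0308)
# and once with the precomposed character U+00FC.
decomposed = "u\u0308ber".encode("utf-8")
composed = "\u00fcber".encode("utf-8")

print(decomposed == composed)                  # the raw bytes differ
print(normalize_line(decomposed) == composed)  # equal after NFKC
```

Applying the same normalization to both the training data and the decoder input keeps the two byte-identical for characters like these.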

Re: [Moses-support] Unicode Issues when Using Compact Phrase Table, Binaries vs. Own Build

2015-03-30 Thread Nikolay Bogoychev
Hey Венци, Did you by any chance binarize your phrase tables from raw text or from gzip-compressed input (or any other supported compressed text format)? I recently ran into similar issues with my phrase table (ProbingPT) if the input phrase table had not been compressed during binary creation. I wasn

Re: [Moses-support] Unicode Issues when Using Compact Phrase Table, Binaries vs. Own Build

2015-03-30 Thread Marcin Junczys-Dowmunt
Forgot to add that we use the compact phrase table and Moses on older and newer Ubuntu versions with Arabic, Chinese, Korean, Japanese, and Russian in both directions and have had no problems. Those puny German umlauts should not be a challenge. :) On 30.03.2015 at 11:08, Marcin Junczys-Dowmunt wrote: H

Re: [Moses-support] Unicode Issues when Using Compact Phrase Table, Binaries vs. Own Build

2015-03-30 Thread Marcin Junczys-Dowmunt
Hi, the phrase table and, as far as I know, Moses in general are Unicode-agnostic, as long as you use UTF-8. Input is handled as raw byte sequences; most of the time there are only numeric identifiers. Sounds more like a couple of messed-up systems on your side, especially the part where self-com
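
To illustrate what handling input as raw byte sequences implies, here is a toy Python sketch. The dictionary below is a hypothetical stand-in and not the actual compact phrase table data structure; the phrase and its identifier are made up. It shows that two strings which render identically can still be different byte sequences, so a byte-level lookup matches one and misses the other.

```python
# Toy byte-keyed table: phrase bytes -> numeric identifier (made-up values).
table = {"f\u00fcr".encode("utf-8"): 7}  # "für" with precomposed U+00FC

query_composed = "f\u00fcr".encode("utf-8")     # same bytes as the key
query_decomposed = "fu\u0308r".encode("utf-8")  # u followed by U+0308

print(query_composed)                # b'f\xc3\xbcr'
print(query_decomposed)              # b'fu\xcc\x88r'
print(table.get(query_composed))     # 7
print(table.get(query_decomposed))   # None: different bytes, no match
```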