Re: [Moses-support] Unicode Issues when Using Compact Phrase Table, Binaries vs. Own Build
Hi,

Any clue what systems could be messed up? On Ubuntu I compiled Boost 1.57, cmph and Moses right out of the box, so I don’t see what I could have done wrong there.

I just checked and the gzip phrase tables are proper UTF-8. I even ran the processPhraseTableMin binary from the website on the Ubuntu machine and still got the same results. That is, if I query the compact phrase table with the queryPhraseTableMin binary from the website, UTF-8 is recognised and I get results; if I use the queryPhraseTableMin that I compiled on the same system, UTF-8 is not recognised and I get no results.

Does anyone have an idea what could influence the compilation of Moses in a way that would prevent it from properly reading UTF-8? Especially given that the Moses binaries for MacOS X from the website don’t seem to read UTF-8 properly (at least on my machine), and I didn’t compile those.

Cheers,
Ventzi

On 30.03.2015 at 11:08, moses-support-requ...@mit.edu wrote:

Date: Mon, 30 Mar 2015 11:08:13 +0200
From: Marcin Junczys-Dowmunt junc...@amu.edu.pl
Subject: Re: [Moses-support] Unicode Issues when Using Compact Phrase Table, Binaries vs. Own Build
To: moses-support@mit.edu
Message-ID: 5519127d.7080...@amu.edu.pl
Content-Type: text/plain; charset=utf-8

Hi,

the phrase-table and as far as I know Moses in general are unicode-agnostic, as long as you use utf-8. Input is handled as raw byte sequences, most of the time there are numeric identifiers only. Sounds more like a couple of messed up systems on your side, especially the part where self-compiled systems work or don't work. Cannot give you much more insight, unfortunately.

Best,
Marcin

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
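One way to double-check the statement that the gzip phrase tables are proper UTF-8 is to decode them strictly, line by line. The following is only a minimal sketch in Python; the function name and the file name in the usage note are hypothetical and not part of Moses.

```python
import gzip


def first_invalid_utf8_line(path):
    """Return the 1-based number of the first line in a gzip file that is
    not valid UTF-8, or None if every line decodes cleanly."""
    with gzip.open(path, "rb") as handle:  # read raw bytes, no implicit decoding
        for number, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8", errors="strict")
            except UnicodeDecodeError:
                return number
    return None
```

Calling `first_invalid_utf8_line("phrase-table.gz")` (a hypothetical file name) and getting `None` back only says the bytes are well-formed UTF-8. It says nothing about whether a character such as 'ü' is stored in composed or decomposed form.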
Re: [Moses-support] Unicode Issues when Using Compact Phrase Table, Binaries vs. Own Build
Forgot to add that we use the compact phrase table and Moses on older and newer Ubuntu versions with Arabic, Chinese, Korean, Japanese and Russian in both directions without problems. Those puny German umlauts should not be a challenge. :)

On 30.03.2015 at 10:53, Венцислав Жечев (Ventsislav Zhechev) wrote:

Hi all,

I’m having this really weird Unicode issue when using compact phrase tables that could be related to endianness somehow, but I’ve no idea how.

I compiled the training tools from v3 on my Mac and built a few models using compact phrase (and reordering) tables and KenLM, including (for simplicity) a recasing model for DE (download it from https://autodesk.box.com/DE-Recaser). Things become strange when I try to use the models, though:

1. All works fine when I use the decoder binary I compiled myself on the Mac (10.10.2, self-built Boost 1.57)
2. Unicode input is not recognised when I use the binary from http://www.statmt.org/moses/RELEASE-3.0/binaries/macosx-yosemite/ i.e. words like ‘für’ or ‘ausführlich’ are marked as UNK.
3. Unicode input is not recognised when I use a binary I compiled myself on Ubuntu 12.04.5 (self-built Boost 1.57)
4. All works fine when I use the binary from http://www.statmt.org/moses/RELEASE-3.0/binaries/linux-64bit/

I tested the above with the queryPhraseTableMin tool (rather than the decoder) and got the same results, which is what makes me think this could be somehow related to binary incompatibility with the way the phrase table is compacted. Haven’t investigated deeper than that, though.

Any clues? One would say, just use the Linux binary then on Linux... However, I have a number of CentOS/RHEL 5 and 6 boxes, where the pre-compiled binary doesn’t work, as the system glibc is too old. So there I need to compile Moses myself, but then Unicode isn’t recognised...

Cheers,
Ventzi

–––
Dr. Ventsislav Zhechev
Computational Linguist, Certified ScrumMaster®
Platform Architecture and Technologies
Localisation Services
MAIN +41 32 723 91 22
FAX +41 32 723 93 99
http://VentsislavZhechev.eu

Autodesk, Inc.
Rue de Puits-Godet 6
2000 Neuchâtel, Switzerland
www.autodesk.com
Re: [Moses-support] Unicode Issues when Using Compact Phrase Table, Binaries vs. Own Build
Hi,

the phrase-table and as far as I know Moses in general are unicode-agnostic, as long as you use utf-8. Input is handled as raw byte sequences, most of the time there are numeric identifiers only. Sounds more like a couple of messed up systems on your side, especially the part where self-compiled systems work or don't work. Cannot give you much more insight, unfortunately.

Best,
Marcin
Re: [Moses-support] Unicode Issues when Using Compact Phrase Table, Binaries vs. Own Build
Hey Венци,

Did you by any chance binarize your phrase tables from raw text or from a gzip-compressed file (or any other supported compressed text format)? I recently ran into similar issues with my phrase table (ProbingPT) when the input phrase table had not been compressed before the binary was created. I wasn't able to trace the issue; I just make sure I gzip any phrase table before binarizing.

Cheers,
Nick
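The workaround Nick describes, compressing the text phrase table before binarizing it, can be scripted. This is only a sketch: the function name and the file names in the usage note are hypothetical, and it does not explain why compression would make a difference.

```python
import gzip
import shutil


def gzip_file(src_path, dst_path):
    """Write a gzip-compressed copy of src_path to dst_path, byte for byte."""
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream the bytes unchanged
```

For example, `gzip_file("phrase-table", "phrase-table.gz")` (hypothetical names) leaves the original file in place and writes the compressed copy next to it. The files are opened in binary mode, so the UTF-8 bytes are never re-encoded.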
Re: [Moses-support] Unicode Issues when Using Compact Phrase Table, Binaries vs. Own Build
Sounds like a case of composed characters. Try passing the input through this:

    uconv -f utf8 -t utf8 -x Any-NFKC --callback skip --remove-signature
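To illustrate why composed versus decomposed characters would matter when input is compared as raw bytes, here is a minimal Python sketch using the standard `unicodedata` module. It applies NFKC and drops a leading byte-order mark, roughly in the spirit of the uconv call in this message. It is not an exact equivalent (the `--callback skip` behaviour is not reproduced), and the function name is hypothetical.

```python
import unicodedata

BOM = "\ufeff"


def normalize_line(text, form="NFKC"):
    """Drop a leading byte-order mark and apply Unicode normalisation."""
    if text.startswith(BOM):
        text = text[len(BOM):]
    return unicodedata.normalize(form, text)


composed = "f\u00fcr"      # 'ü' as one code point (U+00FC)
decomposed = "fu\u0308r"   # 'u' followed by combining diaeresis (U+0308)

# The two spellings look identical but are different byte sequences,
# so a lookup that compares raw bytes treats them as different words.
print(composed.encode("utf-8"))    # b'f\xc3\xbcr'
print(decomposed.encode("utf-8"))  # b'fu\xcc\x88r'
print(normalize_line(decomposed) == composed)  # True
```

If the training data and the test input went through different normalisation forms, for instance because of different tools or platforms, then words like ‘für’ could end up as unknown even though both sides are valid UTF-8. Whether that is what happens in this thread is not confirmed here.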