Forgot to add that we use the compact phrase table and Moses on older
and newer Ubuntu versions with Arabic, Chinese, Korean, Japanese and
Russian, in both directions, without any problems. Those puny German
umlauts should not be a challenge. :)
On 30.03.2015 at 11:08, Marcin Junczys-Dowmunt wrote:
Hi,
the phrase table and, as far as I know, Moses in general are
Unicode-agnostic, as long as you use UTF-8. Input is handled as raw
byte sequences, and most of the time only numeric identifiers are used.
This sounds more like a couple of messed-up systems on your side,
especially the part where some self-compiled binaries work and others
don't. I cannot give you much more insight, unfortunately.
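To illustrate what byte-level handling implies, here is a minimal Python
sketch. It is not Moses code: the toy vocabulary and the lookup() helper
are hypothetical, and the two encodings shown are only examples of how
the same visible word can differ in bytes, not a diagnosis of the
problem reported here.

```python
import unicodedata


def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


word = "f\u00fcr"  # 'für' written with the precomposed character U+00FC

# The same visible word in two Unicode normalization forms.
nfc = unicodedata.normalize("NFC", word)  # precomposed u-umlaut
nfd = unicodedata.normalize("NFD", word)  # 'u' followed by combining diaeresis

# Hypothetical toy vocabulary: raw byte sequence -> numeric identifier.
vocab = {nfc.encode("utf-8"): 0}


def lookup(raw: bytes):
    """Return the numeric id for a byte sequence, or None for an unknown word."""
    return vocab.get(raw)


print(utf8_hex(nfc))                # 66 c3 bc 72
print(utf8_hex(nfd))                # 66 75 cc 88 72
print(lookup(nfc.encode("utf-8")))  # 0
print(lookup(nfd.encode("utf-8")))  # None, i.e. the word would count as unknown
```

With a lookup like this, any difference in the input bytes is enough to
turn a known word into an unknown one, so comparing hex dumps of the
input on the working and failing machines may be a useful first check.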
Best,
Marcin
On 30.03.2015 at 10:53, "Венцислав Жечев (Ventsislav Zhechev)" wrote:
Hi all,
I’m having this really weird Unicode issue when using compact phrase
tables that could be related to endianness somehow, but I’ve no idea how.
I compiled the training tools from v3 on my Mac and built a few
models using compact phrase (and reordering) tables and KenLM,
including (for simplicity) a recasing model for DE (download it from
https://autodesk.box.com/DE-Recaser). Things become strange when I
try to use the models, though:
1. All works fine when I use the decoder binary I compiled myself on
the Mac (10.10.2, self-built Boost 1.57)
2. Unicode input is not recognised when I use the binary from
http://www.statmt.org/moses/RELEASE-3.0/binaries/macosx-yosemite/ i.e. words
like ‘für’ or ‘ausführlich’ are marked as UNK.
3. Unicode input is not recognised when I use a binary I compiled
myself on Ubuntu 12.04.5 (self-built Boost 1.57)
4. All works fine when I use the binary from
http://www.statmt.org/moses/RELEASE-3.0/binaries/linux-64bit/
I tested the above with the queryPhraseTableMin tool (rather than the
decoder) and got the same results, which is what makes me think this
could be somehow related to binary incompatibility with the way the
phrase table is compacted. Haven’t investigated deeper than that, though.
Any clues?
One could say: just use the pre-compiled Linux binary on Linux, then.
However, I have a number of CentOS/RHEL 5 and 6 boxes where the
pre-compiled binary doesn’t work, as the system glibc is too old. So
there I need to compile Moses myself, but then Unicode isn’t recognised...
Cheers,
Ventzi
–––––––
*Dr. Ventsislav Zhechev*
Computational Linguist, Certified ScrumMaster®
Platform Architecture and Technologies
Localisation Services
*MAIN* +41 32 723 91 22
*FAX* +41 32 723 93 99
_http://VentsislavZhechev.eu_
*Autodesk, Inc.*
Rue de Puits-Godet 6
2000 Neuchâtel, Switzerland
_www.autodesk.com <http://www.autodesk.com/>_
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support