Sounds like a case of composed characters. Try passing the input through this:
uconv -f utf8 -t utf8 -x Any-NFKC --callback skip --remove-signature

On 03/30/2015 04:53 AM, "Венцислав Жечев (Ventsislav Zhechev)" wrote:
> Hi all,
>
> I’m having this really weird Unicode issue when using compact phrase
> tables that could be related to endianness somehow, but I’ve no idea how.
> I compiled the training tools from v3 on my Mac and built a few models
> using compact phrase (and reordering) tables and KenLM, including (for
> simplicity) a recasing model for DE (download it
> from https://autodesk.box.com/DE-Recaser). Things become strange when I
> try to use the models, though:
> 1. All works fine when I use the decoder binary I compiled myself on the
> Mac (10.10.2, self-built Boost 1.57)
> 2. Unicode input is not recognised when I use the binary
> from http://www.statmt.org/moses/RELEASE-3.0/binaries/macosx-yosemite/ i.e.
> words like ‘für’ or ‘ausführlich’ are marked as UNK.
> 3. Unicode input is not recognised when I use a binary I compiled myself
> on Ubuntu 12.04.5 (self-built Boost 1.57)
> 4. All works fine when I use the binary
> from http://www.statmt.org/moses/RELEASE-3.0/binaries/linux-64bit/
>
> I tested the above with the queryPhraseTableMin tool (rather than the
> decoder) and got the same results, which is what makes me think this
> could be somehow related to binary incompatibility with the way the
> phrase table is compacted. Haven’t investigated deeper than that, though.
>
>
> Any clues?
> One would say, just use the Linux binary then on Linux... However, I
> have a number of CentOS/RHEL 5 and 6 boxes, where the pre-compiled
> binary doesn’t work, as the system glibc is too old. So there I need to
> compile Moses myself, but then Unicode isn’t recognised...
>
>
>
> Cheers,
>
> Ventzi
>
> –––––––
> *Dr. Ventsislav Zhechev*
> Computational Linguist, Certified ScrumMaster®
> Platform Architecture and Technologies
> Localisation Services
>
> *MAIN* +41 32 723 91 22
> *FAX* +41 32 723 93 99
>
> _http://VentsislavZhechev.eu_
>
> *Autodesk, Inc.*
> Rue de Puits-Godet 6
> 2000 Neuchâtel, Switzerland
> _www.autodesk.com <http://www.autodesk.com/>_
>
>
>
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
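
P.S. If you want to check whether composed vs. decomposed characters are really the culprit before re-running anything, a rough Python sketch along these lines might help. It only uses the standard `unicodedata` module. The helper names (`normalize_nfkc`, `changed_by_nfkc`) are made up for illustration, and it assumes the phrase table was built from composed (NFC/NFKC) text, which I have not verified for your models. It applies the same normalization form as the uconv call, but does nothing about invalid bytes or a BOM.

```python
import unicodedata


def normalize_nfkc(text: str) -> str:
    """Return text in Unicode normalization form NFKC."""
    return unicodedata.normalize("NFKC", text)


def changed_by_nfkc(text: str) -> bool:
    """True if NFKC normalization would alter the text, i.e. it is not already NFKC."""
    return text != normalize_nfkc(text)


# The same word written two ways: they render identically but differ in code points.
composed = "f\u00fcr"     # 'ü' as the single code point U+00FC
decomposed = "fu\u0308r"  # 'u' followed by combining diaeresis U+0308

print(composed == decomposed)                  # False: a lookup keyed on one form misses the other
print(len(composed.encode("utf-8")))           # 4 bytes
print(len(decomposed.encode("utf-8")))         # 5 bytes
print(normalize_nfkc(decomposed) == composed)  # True: both forms agree after normalization
print(changed_by_nfkc(decomposed))             # True: this input would be rewritten
```

Running `changed_by_nfkc` over each line of the test input should show quickly whether any of it arrives in decomposed form. If nothing changes, normalization is probably not the cause and the binary incompatibility theory deserves a closer look.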