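Sounds like a case of composed versus decomposed characters: 'ü' can be
encoded either as one precomposed code point or as 'u' followed by a
combining diaeresis, and the two forms do not match byte for byte.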

Try normalizing the input to NFKC before decoding by passing it through this:

uconv -f utf8 -t utf8 -x Any-NFKC --callback skip --remove-signature
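
If uconv is not available on the box, something along these lines should
show the effect and apply a comparable normalization. This is only a
rough sketch using Python's standard unicodedata module; the helper name
normalize_line is made up for illustration, and I have not checked that
it matches ICU's Any-NFKC output in every corner case.

```python
import unicodedata

# The same word, 'für', in two encodings that look identical on screen.
composed = "f\u00fcr"      # 'ü' as one precomposed code point (U+00FC)
decomposed = "fu\u0308r"   # 'u' followed by combining diaeresis (U+0308)


def normalize_line(line: str) -> str:
    """Return the NFKC form of a line (hypothetical helper name)."""
    return unicodedata.normalize("NFKC", line)


# The strings differ, so a lookup keyed on one form misses the other.
print(composed == decomposed)        # False
print(composed.encode("utf-8"))      # b'f\xc3\xbcr'
print(decomposed.encode("utf-8"))    # b'fu\xcc\x88r'

# After normalization both forms collapse to the composed one.
print(normalize_line(decomposed) == composed)  # True
```

If the phrase table was built from composed text and the test input is
decomposed (or the other way round), words like 'für' would not be found
and would come out as UNK, so the training data and the input should go
through the same normalization.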

On 03/30/2015 04:53 AM, "Венцислав Жечев (Ventsislav Zhechev)" wrote:
> Hi all,
> 
> I’m having this really weird Unicode issue when using compact phrase
> tables that could be related to endianness somehow, but I’ve no idea how.
> I compiled the training tools from v3 on my Mac and built a few models
> using compact phrase (and reordering) tables and KenLM, including (for
> simplicity) a recasing model for DE (download it
> from https://autodesk.box.com/DE-Recaser). Things become strange when I
> try to use the models, though:
> 1. All works fine when I use the decoder binary I compiled myself on the
> Mac (10.10.2, self-built Boost 1.57)
> 2. Unicode input is not recognised when I use the binary
> from http://www.statmt.org/moses/RELEASE-3.0/binaries/macosx-yosemite/ i.e.
> words like ‘für’ or ‘ausführlich’ are marked as UNK.
> 3. Unicode input is not recognised when I use a binary I compiled myself
> on Ubuntu 12.04.5 (self-built Boost 1.57)
> 4. All works fine when I use the binary
> from http://www.statmt.org/moses/RELEASE-3.0/binaries/linux-64bit/ 
> 
> I tested the above with the queryPhraseTableMin tool (rather than the
> decoder) and got the same results, which is what makes me think this
> could be somehow related to binary incompatibility with the way the
> phrase table is compacted. Haven’t investigated deeper than that, though.
> 
> 
> Any clues?
> One would say, just use the Linux binary then on Linux... However, I
> have a number of CentOS/RHEL 5 and 6 boxes, where the pre-compiled
> binary doesn’t work, as the system glibc is too old. So there I need to
> compile Moses myself, but then Unicode isn’t recognised...
> 
> 
> 
> Cheers,
> 
> Ventzi
> 
> –––––––
> *Dr. Ventsislav Zhechev*
> Computational Linguist, Certified ScrumMaster®
> Platform Architecture and Technologies
> Localisation Services
> 
> *MAIN* +41 32 723 91 22
> *FAX* +41 32 723 93 99
> 
> _http://VentsislavZhechev.eu_
> 
> *Autodesk, Inc.*
> Rue de Puits-Godet 6
> 2000 Neuchâtel, Switzerland
> _www.autodesk.com <http://www.autodesk.com/>_
> 
> 
> 
> 
> 
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
> 