Ham, Michael wrote:
> Those escape numbers are Unicode characters. The Chinese character set
> does not exist in ASCII, so you have to use UTF-8.
Sorry if I wasn't clear: I'm talking about the Chinese side of
LDC2004E12, which is not in ASCII or Unicode; it's in GB18030.
Apparently, th
Subject: [Moses-support] OT: LDC2004E12
To: moses-support@mit.edu
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
Sorry for the slightly off-topic message, but at least it's about MT:
We're using the UN Chinese-English Parallel Text collection
(LDC2004E12) for some of our work. It has lots of odd sequences of
the form:
\x{a37e}
I presume these are hex codes indicating escaped characters or
somethi
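
One way to probe sequences of the form \x{a37e} is to treat the hex digits as raw bytes and try decoding them as GB18030. The sketch below is only an illustration under that assumption: the function name decode_escapes is hypothetical, and nothing in the corpus documentation is known to confirm that the escapes are GB18030 byte pairs. Bytes that do not decode are replaced rather than raising an error, so odd sequences remain visible in the output.

```python
import re

# Matches escapes of the form \x{a37e}: a backslash, "x", and hex digits in braces.
ESCAPE = re.compile(r"\\x\{([0-9a-fA-F]+)\}")


def decode_escapes(line, encoding="gb18030"):
    """Replace each \\x{....} escape with its bytes decoded in `encoding`.

    Assumption (not confirmed by the source): the hex digits are the raw
    bytes of one character in the given encoding.
    """
    def repl(match):
        hexdigits = match.group(1)
        # bytes.fromhex needs an even number of digits; pad if necessary.
        if len(hexdigits) % 2:
            hexdigits = "0" + hexdigits
        raw = bytes.fromhex(hexdigits)
        # errors="replace" keeps going on byte sequences that do not decode.
        return raw.decode(encoding, errors="replace")

    return ESCAPE.sub(repl, line)


if __name__ == "__main__":
    # 0xD6D0 is the GB2312/GB18030 encoding of the character for "middle".
    print(decode_escapes(r"before \x{d6d0} after"))
```

If the decoded characters look wrong in context, the same function can be called with a different encoding argument (for example "utf-16-be") to compare interpretations.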