Re: [Moses-support] OT: LDC2004E12

2008-07-14 Thread John D. Burger
Ham, Michael wrote:
> Those escape numbers are Unicode characters. The Chinese character set does not exist in ASCII, so you have to use UTF-8.
Sorry if I wasn't clear: I'm talking about the Chinese side of LDC2004E12, which is not in ASCII or Unicode; it's in GB18030. Apparently, th

Re: [Moses-support] OT: LDC2004E12

2008-07-13 Thread Ham, Michael
Sorry for the slightly off-topic message, but at least it's about MT: We're using the UN Chinese-English Parall

[Moses-support] OT: LDC2004E12

2008-07-11 Thread John D. Burger
Sorry for the slightly off-topic message, but at least it's about MT: We're using the UN Chinese-English Parallel Text collection (LDC2004E12) for some of our work. It has lots of odd sequences of the form: \x{a37e} I presume these are hex codes indicating escaped characters or somethi
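For anyone wanting to survey such sequences before deciding what they mean, here is a minimal Python sketch. It only assumes the escapes look exactly like the \x{a37e} example above (backslash, x, hex digits in braces). The helper names count_hex_escapes and try_decode_as_gb18030 are made up for illustration, and reading the hex digits as raw GB18030 bytes is just one guess to test, not an established fact about the corpus.

```python
import re
from collections import Counter

# Matches literal escapes such as \x{a37e}: backslash, x, hex digits in braces.
ESCAPE_RE = re.compile(r"\\x\{([0-9A-Fa-f]+)\}")


def count_hex_escapes(text):
    """Return a Counter mapping each escaped hex value (lowercased) to its frequency."""
    return Counter(m.group(1).lower() for m in ESCAPE_RE.finditer(text))


def try_decode_as_gb18030(hex_digits):
    """Assumption: the hex digits are raw GB18030 bytes.

    Returns the decoded string, or None if the digits do not form a
    valid GB18030 byte sequence.
    """
    if len(hex_digits) % 2:
        return None  # an odd number of hex digits cannot be whole bytes
    try:
        return bytes.fromhex(hex_digits).decode("gb18030")
    except UnicodeDecodeError:
        return None


# Hypothetical sample line; real input would be read from the corpus files.
sample = r"text \x{a37e} more \x{A37E} and \x{a1a1}"
counts = count_hex_escapes(sample)

for hex_value, n in counts.most_common():
    print(hex_value, n, repr(try_decode_as_gb18030(hex_value)))
```

A frequency table like this at least shows whether a handful of values dominate, which would hint at a systematic conversion problem rather than arbitrary escaped text.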