On 7/12/11 3:45 PM, [email protected] wrote:
Yes, Jörn. I don't think UTF-8 is a good choice for the default. None of the
data I use for Portuguese takes advantage of UTF-8 being the default, because
all the corpora I have are Latin1 and my system default is neither UTF-8 nor
Latin1.

Using the system default looks nice because we often have to use the
converter tools, and they write their output in the system default. If we
convert, train and evaluate on the same system, we would need to set the
encoding parameter only once.

This is actually a weakness. I have a MacBook, and my default encoding
is MacRoman. I once tried to write Japanese text to stdout, but that didn't
work with MacRoman, and more or less all characters were replaced with a
question mark (if I remember correctly).
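
To illustrate the effect, here is a minimal Java sketch. It uses ISO-8859-1
as a stand-in for MacRoman (an assumption made because ISO-8859-1 is
guaranteed to be available on every JVM, and it cannot represent Japanese
either). The class and method names are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyEncodingDemo {

    // Encode the text with the given charset and decode it again.
    // String.getBytes(Charset) silently substitutes the charset's
    // replacement byte ('?' for ISO-8859-1) for unmappable characters.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as Unicode escapes so the
        // source file itself does not depend on any particular encoding.
        String japanese = "\u65E5\u672C\u8A9E";

        // ISO-8859-1 cannot represent these characters.
        System.out.println(roundTrip(japanese, StandardCharsets.ISO_8859_1)); // prints ???

        // UTF-8 can represent them, so the round trip is lossless.
        System.out.println(roundTrip(japanese, StandardCharsets.UTF_8).equals(japanese)); // prints true
    }
}
```

No exception is thrown in the lossy case, which is why this kind of problem
is easy to miss until you look at the output.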

We might need to change that one day, so the output is always written to a file.

I don't really know which of the two ways is better: always specifying the
encoding or using the default. Anyway, I am +1 for both. If you think we
should go the more standard way and use the default encoding, then let's do
that.
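
The two options could also be combined: an optional encoding parameter that
falls back to the platform default when it is not given. A rough sketch of
that idea follows. The class and method names are hypothetical and are not
taken from the existing tools.

```java
import java.nio.charset.Charset;

class EncodingOption {

    // Hypothetical helper: returns the charset named by the user, or the
    // platform default when no encoding name was supplied.
    static Charset resolve(String encodingName) {
        if (encodingName == null || encodingName.isEmpty()) {
            return Charset.defaultCharset();
        }
        // Throws UnsupportedCharsetException if the name is unknown.
        return Charset.forName(encodingName);
    }

    public static void main(String[] args) {
        // The first command line argument, if present, is treated as the
        // encoding name (an assumption made only for this sketch).
        String name = args.length > 0 ? args[0] : null;
        System.out.println("Using encoding: " + resolve(name));
    }
}
```

With this, someone whose corpora are all Latin1 passes ISO-8859-1 explicitly,
and everyone else gets whatever their system uses.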

Jörn
