Jörn, I'm wondering how to implement the EncodingParameter interface.
This is not allowed, because annotation values must be compile-time constants:

    @ParameterDescription(valueName = "charsetName",
        description = "specifies the encoding which should be used for reading and writing text")
    @OptionalParameter(defaultValue = Charset.defaultCharset().name())
    Charset getEncoding();

Will we need some special handling in ArgumentParser for that? Maybe we could define a constant "DEFAULT_CHARSET" and handle it in ArgumentParser.parse?

On Tue, Jul 12, 2011 at 10:55 AM, Jörn Kottmann <[email protected]> wrote:

> On 7/12/11 3:45 PM, [email protected] wrote:
>
>> Yes, Jörn. I don't think UTF-8 is a good choice for the default. None of
>> the data I use for Portuguese benefits from UTF-8 being the default,
>> because all the corpora I have are Latin1 and my system default is
>> neither UTF-8 nor Latin1.
>>
>> Using the system default looks nice because we often have to use the
>> converter tools, and they output the system default. If we convert,
>> train and evaluate on the same system, we would need to set the
>> encoding parameter only once.
>>
>
> This is actually a weakness. I have a MacBook, and my default encoding
> is MacRoman. I once tried to write Japanese text to stdout, but that
> didn't work with MacRoman, and more or less all chars were replaced
> with a question mark (if I remember correctly).
>
> We might need to change that one day, so the output is always written
> to a file.
>
> I don't really know which of the two ways is better, always specifying
> the encoding or using the default; anyway, I am +1 for both. If you
> think we should go the more standard way and use the default encoding,
> then let's do that.
>
> Jörn
>
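
To make the "DEFAULT_CHARSET" idea from the top of this message concrete, here is a minimal sketch of how a sentinel constant could stand in for the platform default and be resolved at parse time. Everything in it is an assumption for illustration: the class EncodingDefaultSketch, the helpers resolveCharset and encodingFor, and the stripped-down @OptionalParameter annotation are hypothetical stand-ins, not the actual OpenNLP ArgumentParser code.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.nio.charset.Charset;

public class EncodingDefaultSketch {

    // Hypothetical sentinel: a string literal is a compile-time constant,
    // so it is legal as an annotation value, unlike a method call.
    public static final String DEFAULT_CHARSET = "DEFAULT_CHARSET";

    // Simplified stand-in for the @OptionalParameter annotation in the thread.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface OptionalParameter {
        String defaultValue() default "";
    }

    // Stand-in for the EncodingParameter interface under discussion.
    public interface EncodingParameter {
        @OptionalParameter(defaultValue = DEFAULT_CHARSET)
        Charset getEncoding();
    }

    // Maps the sentinel to the platform default; any other value is
    // treated as a charset name.
    public static Charset resolveCharset(String value) {
        if (DEFAULT_CHARSET.equals(value)) {
            return Charset.defaultCharset();
        }
        return Charset.forName(value);
    }

    // Roughly what the parser would do for a Charset-typed getter: use the
    // command-line value if one was given, otherwise the annotation default.
    public static Charset encodingFor(Class<?> paramInterface, String cliValue) {
        final Method getter;
        try {
            getter = paramInterface.getMethod("getEncoding");
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("no getEncoding() on " + paramInterface, e);
        }
        OptionalParameter optional = getter.getAnnotation(OptionalParameter.class);
        if (cliValue == null && optional == null) {
            throw new IllegalArgumentException("encoding is mandatory but was not given");
        }
        return resolveCharset(cliValue != null ? cliValue : optional.defaultValue());
    }

    public static void main(String[] args) {
        // No value on the command line: falls back to the platform default.
        System.out.println(encodingFor(EncodingParameter.class, null));
        // Explicit value: the named charset is used instead.
        System.out.println(encodingFor(EncodingParameter.class, "ISO-8859-1"));
    }
}
```

With this approach the annotation stays a constant expression, and the only special case lives in the one place that converts strings to Charset values. The downside is that help output generated from the annotation would show the literal sentinel unless it is translated there too.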
