TokenizerMEEvaluatorTool.java

[email protected] Tue, 12 Jul 2011 07:50:23 -0700

Jörn,

I'm wondering how to implement the EncodingParameter interface.


This is not allowed:

  @ParameterDescription(valueName = "charsetName",
      description = "specifies the encoding which should be used for reading and writing text")
  @OptionalParameter(defaultValue = Charset.defaultCharset().name())
  Charset getEncoding();

Will we need some special handling in ArgumentParser for that? Maybe we
could define a constant "DEFAULT_CHARSET" and handle it in ArgumentParser.parse?
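
A minimal sketch of that idea, to make the question concrete. Annotation
values must be compile-time constants, which is why the call to
Charset.defaultCharset().name() is rejected, while a plain string constant
is accepted. Everything here is an assumption for illustration: the nested
OptionalParameter annotation and EncodingParameter interface are simplified
stand-ins, not the real cmdline classes, and resolveCharset/encodingFor are
hypothetical helpers showing what the parser could do with the sentinel.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.nio.charset.Charset;

public class CharsetDefaultSketch {

  // Sentinel value: a compile-time constant, so it is legal in an annotation.
  public static final String DEFAULT_CHARSET = "DEFAULT_CHARSET";

  // Hypothetical, simplified stand-in for the cmdline OptionalParameter annotation.
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.METHOD)
  public @interface OptionalParameter {
    String defaultValue();
  }

  // Hypothetical, simplified stand-in for the EncodingParameter interface.
  public interface EncodingParameter {
    @OptionalParameter(defaultValue = DEFAULT_CHARSET)
    Charset getEncoding();
  }

  // Maps the sentinel to the platform default; any other value is a charset name.
  public static Charset resolveCharset(String value) {
    if (DEFAULT_CHARSET.equals(value)) {
      return Charset.defaultCharset();
    }
    return Charset.forName(value);
  }

  // Uses the user-supplied value if present, otherwise the annotated default.
  public static Charset encodingFor(Method getter, String userValue) {
    if (userValue != null) {
      return resolveCharset(userValue);
    }
    OptionalParameter opt = getter.getAnnotation(OptionalParameter.class);
    if (opt == null) {
      throw new IllegalArgumentException("No default declared for " + getter.getName());
    }
    return resolveCharset(opt.defaultValue());
  }

  public static void main(String[] args) throws Exception {
    Method getter = EncodingParameter.class.getMethod("getEncoding");
    System.out.println("no argument given: " + encodingFor(getter, null));
    System.out.println("explicit argument: " + encodingFor(getter, "ISO-8859-1"));
  }
}
```

The sentinel is only resolved when the arguments are parsed, so the default
still follows whatever system the tool runs on.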


On Tue, Jul 12, 2011 at 10:55 AM, Jörn Kottmann <[email protected]> wrote:

> On 7/12/11 3:45 PM, [email protected] wrote:
>
>> Yes, Jörn. I don't think UTF-8 is a good choice for the default. None of
>> the data I use for Portuguese benefits from UTF-8 being the default,
>> because all the corpora I have are Latin1 and my system default is neither
>> UTF-8 nor Latin1.
>>
>> Using the system default looks nice because we often have to use the
>> converter tools, and they output the system default. If we convert, train
>> and evaluate on the same system, we would need to set the encoding
>> parameter only once.
>>
>
> This is actually a weakness. I have a MacBook, and my default encoding
> is MacRoman. I once tried to write Japanese text to stdout, but that didn't
> work with MacRoman, and more or less all chars were replaced with a
> question mark (if I remember correctly).
>
> We might need to change that one day, so the output is always written to a
> file.
>
> I don't really know which of the two ways is better, always specifying the
> encoding or using the default; anyway, I am +1 for both. If you think we
> should go the more standard way and use the default encoding, then let's
> do that.
>
> Jörn
>

Re: svn commit: r1145578 - in /incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline: BasicEvaluationParameters.java sentdetect/SentenceDetectorEvaluatorTool.java tokenizer/TokenizerMEEvaluatorTool.java
