[email protected] Tue, 12 Jul 2011 06:46:46 -0700

William Colen


On Tue, Jul 12, 2011 at 10:34 AM, Jörn Kottmann <[email protected]> wrote:

> On 7/12/11 3:11 PM, [email protected] wrote:
>
>> Added: incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/BasicEvaluationParameters.java
>> URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/BasicEvaluationParameters.java?rev=1145578&view=auto
>> ==============================================================================
>> --- incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/BasicEvaluationParameters.java (added)
>> +++ incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/BasicEvaluationParameters.java Tue Jul
>>
> ...
>
>> +
>> +  @ParameterDescription(valueName = "charsetName", description = "specifies the encoding which should be used for reading and writing text")
>> +  @OptionalParameter(defaultValue="UTF-8")
>> +  Charset getEncoding();
>>
>
> We should decide how we handle this, and do it consistently.
> The trainers declare it as a mandatory parameter, the evaluators declare
> it as optional now and take UTF-8 as default.
>
> In my opinion we should either force the user to specify it, so that he
> has to think about the encoding, or use the platform default encoding,
> because that is the default a user would expect by convention: most
> software tools operate with the platform default encoding.
>
> Or is there a good reason to use UTF-8 as a default?
>
> I know that this is a decision which is difficult to get right.
> As far as I know, we have been criticized for the current way of doing
> it because people don't want to pass the encoding parameter all the time.
>

Yes, Jörn. I don't think UTF-8 is a good choice for the default. None of the
data I use for Portuguese benefits from UTF-8 being the default, because
all the corpora I have are Latin1 and my system default is neither UTF-8 nor
Latin1.

Using the system default looks nice because we often have to use the
converter tools, and they output the system default. If we convert, train
and evaluate on the same system, we would need to set the encoding parameter
only once.
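
To make the point concrete, here is a rough sketch of what goes wrong when
Latin1 data is read with a hard-coded UTF-8 default, and of what falling back
to the platform default could look like. This is only an illustration, not
OpenNLP code: the class EncodingDefaultSketch and the helpers resolveEncoding
and decode are hypothetical names, and the null check stands in for "the user
did not pass the encoding parameter".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDefaultSketch {

    // Hypothetical helper: use the charset the user named, otherwise fall
    // back to the platform default instead of a hard-coded UTF-8.
    public static Charset resolveEncoding(String charsetName) {
        return charsetName == null
                ? Charset.defaultCharset()
                : Charset.forName(charsetName);
    }

    // Decodes raw bytes with the given charset. Malformed input is replaced
    // with U+FFFD rather than raising an error.
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // A Portuguese sample, stored as Latin1 (ISO-8859-1) bytes.
        String text = "S\u00e3o Paulo";
        byte[] latin1Bytes = text.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("platform default: " + resolveEncoding(null));

        // Reading with the encoding the data was written in round-trips.
        System.out.println("read as ISO-8859-1 matches: "
                + decode(latin1Bytes, resolveEncoding("ISO-8859-1")).equals(text));

        // Reading the same bytes as UTF-8 does not round-trip.
        System.out.println("read as UTF-8 matches: "
                + decode(latin1Bytes, resolveEncoding("UTF-8")).equals(text));
    }
}
```

With a fallback like this, data written by a converter on the same machine
would be read back with the same charset without passing the parameter, while
an explicit value would still take precedence.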

Re: svn commit: r1145578 - in /incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline: BasicEvaluationParameters.java sentdetect/SentenceDetectorEvaluatorTool.java tokenizer/TokenizerMEEvaluatorTool.java