Please see my comments below.

On Fri, Nov 18, 2011 at 10:18 AM, Jörn Kottmann <[email protected]> wrote:

> On 11/18/11 2:26 AM, James Kosin wrote:
>
>> As everyone may know I'm on an encoding head hunt... but would like some
>> feedback on some changes coming soon.
>>
>> For the CLI only... really.   The CLI often times uses the platforms
>> default encoding; which may or may not be desirable.  One of the reasons
>>
+1. I think that continuing to use the platform encoding is fine, provided
that:
1) the user is aware of which encoding is being used (might be difficult
with pipes)
2) there is a way to specify exactly which encoding to use
3) there is a choice for both input and output


>> is that the input or output may become corrupted causing training issues
>> or even usage issues for the operator.  I'm not sure if the | pipe
>> operator has the same issues; however a recent check of some converted
>> files proved that the platform encoding may be undesirable, especially
>> if the output encoding is unable to handle the input characters from
>> another encoding.  Internally to the classes and opening and reading
>> files don't have this issue; so, the libraries themselves are safe.
>>
>
> In my opinion it was just a bad decision to let the format package write
> the transformed text to standard out.
> I suggest that we change it and always write to an output file instead.
>
I would make that an option. Console output might be useful to somebody; for
example, it allows tools to be combined easily without having to write to
disk. But there should always be an option to write to a file explicitly
(not via >) and to specify the encoding, also explicitly. This might be
needed to handle difficult cases. It might also be necessary to specify
different encodings for input and output, but then we might have to deal
with encoding incompatibility, and I would avoid taking this heavy burden on
our shoulders.
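
To make the incompatibility concern concrete, here is a minimal Java sketch
of how a tool could detect text that the chosen output encoding cannot
represent. The class and method names (EncodingCheck, firstUnencodable) are
hypothetical and not part of OpenNLP; only the standard java.nio.charset
API is assumed.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, not an OpenNLP class.
final class EncodingCheck {

    // Returns the index of the first character that the target charset
    // cannot encode, or -1 if the whole text can be encoded.
    static int firstUnencodable(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            int length = Character.charCount(codePoint);
            // Check one full code point at a time so that surrogate
            // pairs are never split.
            if (!encoder.canEncode(text.subSequence(i, i + length))) {
                return i;
            }
            i += length;
        }
        return -1;
    }
}
```

A tool could run such a check before writing and warn the user, rather than
silently replacing the characters it cannot encode.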


>
> We should maybe also echo the encoding to the console, so the user
> knows which one was used.
>
This might interfere with pipes, might it not? It is the same issue that we
had with the banner.
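
One possible way around this, sketched below under the assumption that the
notice is purely diagnostic, is to print it to standard error, since a
shell pipe only connects standard output. The class and method names
(EncodingNotice, describe, report) are hypothetical and not existing
OpenNLP code.

```java
import java.io.PrintStream;
import java.nio.charset.Charset;

// Hypothetical helper, not an OpenNLP class.
final class EncodingNotice {

    // Builds the one-line notice shown to the user.
    static String describe(Charset input, Charset output) {
        return "Input encoding: " + input.name()
                + ", output encoding: " + output.name();
    }

    // Writes the notice to the given diagnostics stream; passing
    // System.err keeps it out of the data that flows through a pipe.
    static void report(PrintStream diagnostics, Charset input, Charset output) {
        diagnostics.println(describe(input, output));
    }

    public static void main(String[] args) {
        Charset platform = Charset.defaultCharset();
        report(System.err, platform, platform);
    }
}
```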


>
> Should we also change our small demo tools? There I believe it is confusing
> when the user uses an encoding and then cannot see the result on the
> console.

My understanding is that there should be
1) an option to use pipes, for input or for output
2) an option to specify the input or output file explicitly
3) an option to specify the input and output encodings explicitly

IMO, this should give enough flexibility. For example, it allows one to
load a file from disk into the first tool (let's say, the tokenizer) using
some custom encoding, then pass it via a pipe (without slowing down the
process by using disk) to the POS tagger, then to the parser, and then to
disk. If one uses a system with a wide enough encoding, say Linux with utf8,
one can avoid specifying an encoding; in other cases one always has the
option to specify an encoding to be safe. A couple of examples:

1) via files: opennlp Tokenizer -input file-1252.txt -input-encoding
win1251 -output file-utf8.txt -output-encoding utf8
2) via pipes, where one should take care of the system encoding in the
middle so that the pipe does not corrupt anything. Suppose the system
encoding is utf8: opennlp
Tokenizer -input file-1252.txt -input-encoding win1251 -output-encoding
utf8 | opennlp POSTagger | opennlp Parser -output file-utf8.txt
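
On the Java side, the behaviour behind such options could look roughly like
the sketch below: read with an explicit input encoding, write with an
explicit output encoding, and fall back to the platform encoding when none
is given. The names (TranscodeSketch, charsetOrDefault, copy) are
hypothetical, the option parsing is left out, and any encoding name passed
in has to be one that the JVM recognizes.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper, not an OpenNLP class.
final class TranscodeSketch {

    // Uses the named charset if one was given, else the platform default.
    static Charset charsetOrDefault(String name) {
        return name == null ? Charset.defaultCharset() : Charset.forName(name);
    }

    // Decodes bytes from 'in' using inputCharset and re-encodes them to
    // 'out' using outputCharset. The streams may be files or System.in and
    // System.out, so the same code serves the file and the pipe case.
    static void copy(InputStream in, Charset inputCharset,
                     OutputStream out, Charset outputCharset) throws IOException {
        Reader reader = new BufferedReader(new InputStreamReader(in, inputCharset));
        Writer writer = new BufferedWriter(new OutputStreamWriter(out, outputCharset));
        char[] buffer = new char[8192];
        int read;
        while ((read = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, read);
        }
        // Flush but do not close: the caller owns the streams.
        writer.flush();
    }
}
```

In the piped case, each tool in the chain would decode its standard input
and encode its standard output this way, so the encodings used in the
middle must agree between neighbouring tools.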

Something like this... What do you think?

Aliaksandr
