I'm running Mac OS 10.4 with the original opennlp bash script. I saved the file
input.txt in UTF-8 and got the correct output both in the Terminal and in an
output file, which was also saved as UTF-8. My Terminal display is configured
for Unicode (UTF-8). I don't know whether these facts are of any help for
Linux users...

$ opennlp SimpleTokenizer < input.txt
Sein Song " Nightcall " hat den Film " Drive " mit Ryan Gosling erst so richtig 
bekannt gemacht . Wir haben uns mit Vincent Belorgey , besser bekannt als 
Kavinsky , über sein Debütalbum , seine Musik und die 80 er Jahre unterhalten .


Average: 33,3 sent/s 
Total: 1 sent
Runtime: 0.03s

$ opennlp SimpleTokenizer < input.txt > output.txt


Average: 111,1 sent/s 
Total: 1 sent
Runtime: 0.0090s

$ cat output.txt 
Sein Song " Nightcall " hat den Film " Drive " mit Ryan Gosling erst so richtig 
bekannt gemacht . Wir haben uns mit Vincent Belorgey , besser bekannt als 
Kavinsky , über sein Debütalbum , seine Musik und die 80 er Jahre unterhalten .







________________________________
From: Jörn Kottmann <[email protected]>
To: [email protected]
Sent: Friday, March 1, 2013 5:32
Subject: Re: German Umlauts broken while using Command Line?
 
The problem here is that the ASCII encoding can't encode the German umlauts,
so they are replaced with the question marks you see in the output.
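
To illustrate the replacement described above, here is a minimal sketch. The class name AsciiUmlautDemo and the method roundTrip are made up for this example and are not part of OpenNLP. It encodes a string with a given charset and decodes it again; with US-ASCII, every character that cannot be mapped comes back as a question mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of OpenNLP.
public class AsciiUmlautDemo {

    // Encode the text with the given charset, then decode the bytes again.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // for unmappable characters, which is '?' for US-ASCII.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String text = "\u00fcber sein Deb\u00fctalbum";
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII));
        System.out.println(roundTrip(text, StandardCharsets.UTF_8));
    }
}
```

The second line of output only shows the umlauts if the console encoding can represent them, which is the same issue discussed in this thread.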

Any ideas on how we can improve this? If we can't do much about it, we should
at least document the workaround of manually setting the encoding via
file.encoding.
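
One possible direction, sketched here only as an assumption and not as how the OpenNLP command line tools actually work, would be to stop relying on the JVM default and name the charset explicitly when wrapping stdin and stdout. The class Utf8Pipe and its methods are hypothetical names for this example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not taken from the OpenNLP sources.
public class Utf8Pipe {

    // Copy lines from in to out, decoding and encoding as UTF-8
    // no matter what file.encoding or the default charset is.
    static void copyLines(InputStream in, OutputStream out) {
        try {
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8));
            PrintStream writer = new PrintStream(out, true, "UTF-8");
            String line;
            while ((line = reader.readLine()) != null) {
                writer.print(line);
                writer.print('\n');
            }
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience helper for trying the copy on an in-memory string.
    static String roundTrip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        copyLines(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)), out);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        copyLines(System.in, System.out);
    }
}
```

Run as `java Utf8Pipe < input.txt > output.txt`, this would treat the input as UTF-8 even on a JVM whose default charset is US-ASCII. The trade-off is that a hard-coded charset is wrong for users whose files are not UTF-8, so a real fix would probably need an encoding option.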

Jörn

On 02/28/2013 06:29 PM, Stefan Matheis wrote:
>
> On Thursday, February 28, 2013 at 5:26 PM, Jörn Kottmann wrote:
>
>> Hmm, I'm pretty sure there is an encoding mismatch. Do you know which
>> encoding is used by your JVM? I would guess that it is not UTF-8. You can
>> probably get around the issue by re-encoding the input file to the
>> encoding the JVM is using.
>>  
>> Have a look here:
>> http://stackoverflow.com/questions/1749064/how-to-find-default-charset-encoding-in-java
>>  
>> Would be nice if you can run the println statements there.
>>  
>> Jörn
> Wherever this comes from ..
>
> $ java CharsetTest
> Default Charset=US-ASCII
> file.encoding=Latin-1
> Default Charset=US-ASCII
> Default Charset in Use=ASCII
>
> $ echo $JAVA_TOOL_OPTIONS
> (empty)
>
> $ export JAVA_TOOL_OPTIONS='-Dfile.encoding=UTF8'
>
> $ java CharsetTest
> Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
> Default Charset=UTF-8
> file.encoding=Latin-1
> Default Charset=UTF-8
> Default Charset in Use=UTF8
>
>
>
> But this change by itself didn't help: the output remains unchanged. So I
> took the road down to dirty-hack-land, applying the following change to
> bin/opennlp. It is certainly not how it should be done, but it works at
> least for the moment:
>
> -$JAVACMD -Xmx1024m -jar $OPENNLP_HOME/lib/opennlp-tools-*.jar $@
> +$JAVACMD -Xmx1024m -Dfile.encoding=UTF8 -jar $OPENNLP_HOME/lib/opennlp-tools-*.jar $@
>
>
>
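
For readers who want to repeat the check, the CharsetTest output quoted above is consistent with a small program along the following lines. This is a reconstruction based on the linked Stack Overflow question, so treat it as an assumption rather than the exact class that was run. Note that the program sets file.encoding itself at runtime, which would explain why that line reads Latin-1 in both runs while the default charset stays whatever the JVM picked at startup.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

// Reconstructed sketch; the class actually run in the thread may differ.
public class CharsetTest {

    // Ask a writer created without an explicit charset which encoding it uses.
    // getEncoding() returns the historical name, e.g. "UTF8" for UTF-8.
    static String getDefaultCharSet() {
        OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
        return writer.getEncoding();
    }

    public static void main(String[] args) {
        System.out.println("Default Charset=" + Charset.defaultCharset());
        // Changing the property after startup does not change the default charset.
        System.setProperty("file.encoding", "Latin-1");
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));
        System.out.println("Default Charset=" + Charset.defaultCharset());
        System.out.println("Default Charset in Use=" + getDefaultCharSet());
    }
}
```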
