Running Mac OS 10.4 with the original opennlp bash script, I saved the file input.txt in UTF-8 and got the correct output both in the Terminal and in an output file, which was also saved as UTF-8. My Terminal display is configured for UTF-8. I don't know whether these facts are of any help to Linux users...

$ opennlp SimpleTokenizer < input.txt
Sein Song " Nightcall " hat den Film " Drive " mit Ryan Gosling erst so richtig bekannt gemacht . Wir haben uns mit Vincent Belorgey , besser bekannt als Kavinsky , über sein Debütalbum , seine Musik und die 80 er Jahre unterhalten .

Average: 33,3 sent/s
Total: 1 sent
Runtime: 0.03s

$ opennlp SimpleTokenizer < input.txt > output.txt
Average: 111,1 sent/s
Total: 1 sent
Runtime: 0.0090s

$ cat output.txt
Sein Song " Nightcall " hat den Film " Drive " mit Ryan Gosling erst so richtig bekannt gemacht . Wir haben uns mit Vincent Belorgey , besser bekannt als Kavinsky , über sein Debütalbum , seine Musik und die 80 er Jahre unterhalten .

________________________________
From: Jörn Kottmann <[email protected]>
To: [email protected]
Sent: Friday, March 1, 2013 5:32
Subject: Re: German Umlauts broken while using Command Line?

The problem here is that the ASCII encoding can't encode the German umlauts, and therefore they are replaced with the question marks you see in the output.

Any ideas on how we can improve this? Anyway, if we can't do much about it, we should at least document the workaround of manually setting the encoding via file.encoding.

Jörn

On 02/28/2013 06:29 PM, Stefan Matheis wrote:
>
> On Thursday, February 28, 2013 at 5:26 PM, Jörn Kottmann wrote:
>
>> Hmm, pretty sure there is an encoding mismatch, do you know which
>> encoding is used by your JVM? I would guess that it is not UTF-8.
>> You can probably get around the issue by re-encoding the input
>> file to the encoding the JVM is using.
>>
>> Have a look here:
>> http://stackoverflow.com/questions/1749064/how-to-find-default-charset-encoding-in-java
>>
>> Would be nice if you can run the println statements there.
>>
>> Jörn
> Where ever this comes from ..
>
> $ java CharsetTest
> Default Charset=US-ASCII
> file.encoding=Latin-1
> Default Charset=US-ASCII
> Default Charset in Use=ASCII
>
> $ echo $JAVA_TOOL_OPTIONS
> (empty)
>
> $ export JAVA_TOOL_OPTIONS='-Dfile.encoding=UTF8'
>
> $ java CharsetTest
> Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
> Default Charset=UTF-8
> file.encoding=Latin-1
> Default Charset=UTF-8
> Default Charset in Use=UTF8
>
> But this change itself didn't help .. output remains unchanged, so i took the
> road down to dirty-hack-land, applying the following change to bin/opennlp -
> for sure not how it should be .. but works at least for the moment:
>
> -$JAVACMD -Xmx1024m -jar $OPENNLP_HOME/lib/opennlp-tools-*.jar $@
> +$JAVACMD -Xmx1024m -Dfile.encoding=UTF8 -jar $OPENNLP_HOME/lib/opennlp-tools-*.jar $@
>
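
For anyone who wants to reproduce the effect outside of OpenNLP, here is a minimal sketch of the behaviour described in this thread: text that is encoded with US-ASCII loses its umlauts, while the same text encoded with UTF-8 survives. The class name EncodingDemo and the helper roundTrip are made up for this illustration and are not part of OpenNLP; the explicit UTF-8 PrintStream is only one possible way to avoid depending on file.encoding, not what the opennlp script or tools actually do.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of OpenNLP.
class EncodingDemo {

    // Encode the text to bytes in the given charset and decode it again.
    // String.getBytes(Charset) replaces characters the charset cannot
    // represent with the charset's replacement byte, which is '?' for US-ASCII.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Unicode escapes keep this source file independent of its own encoding.
        String sample = "\u00fcber sein Deb\u00fctalbum";

        // These two values depend on the JVM, the platform and any
        // -Dfile.encoding or JAVA_TOOL_OPTIONS setting in effect.
        System.out.println("Default Charset=" + Charset.defaultCharset());
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));

        // Assumption for this sketch: write with an explicit UTF-8 stream
        // instead of relying on the JVM default charset.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("US-ASCII: " + roundTrip(sample, StandardCharsets.US_ASCII));
        out.println("UTF-8:    " + roundTrip(sample, StandardCharsets.UTF_8));
    }
}
```

Compile with javac EncodingDemo.java and run with java EncodingDemo. The US-ASCII line shows a question mark in place of each umlaut regardless of the default charset, while the UTF-8 line should display correctly as long as the terminal itself is set to UTF-8.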
