>> I also tried to use tika-app, since I saw in --help that I can pass the
>> --encoding parameter. So I ran:
To clarify (you may already understand this, sorry): the encoding parameter specifies the output encoding; it is not a hint to Tika in encoding detection.

With trunk and 1.12 in Tika app’s GUI, I’m getting proper extraction with “Testemail-empty-doesnotwork.eml”, but the umlauts are corrupt with “Test-email-empty-works.txt”. I get the same behavior when I redirect the output to a file:

    java -jar tika-app-1.12.jar Testemail-empty-doesnotwork.eml > testOut2.txt

Bizarrely, it looks like both files are being parsed by the RFC822Parser, and when I run the “detect” command-line option -d on both files with 1.12 and trunk, both say RFC822.

From: Philipp Steinkrüger [mailto:philipp.steinkrue...@uni-koeln.de]
Sent: Sunday, May 15, 2016 10:12 AM
To: user@tika.apache.org
Subject: Tika response encoding problem

Dear list,

I am running Tika server 1.14 on a Debian jessie. I start the server with this command:

    java -jar tika-server-1.14-SNAPSHOT.jar

If I send a file for metadata extraction like this:

    curl -T email.txt http://localhost:9998/meta

the response screws up any umlauts. The environment variables for the shell from which I start the server as well as execute the curl command are as follows:

    LANG=en_US.UTF-8
    LANGUAGE=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=en_US.UTF-8

I followed this page (https://perlgeek.de/en/article/set-up-a-clean-utf8-environment) to set up a clean unicode environment. The test case mentioned on that page works fine.

I also tried to use tika-app, since I saw in --help that I can pass the --encoding parameter.
So I ran:

(1) java -jar tika-app-1.14-SNAPSHOT.jar --encoding=unicode -m email.txt

and

(2) java -jar tika-app-1.14-SNAPSHOT.jar --encoding=UTF-8 -m email.txt

The output of umlauts does change, but in neither case is it right. For (1) the umlauts are represented by '??'; for (2) they are represented by 'ü' (that is a capital A with a ~ on top, followed by the quarter sign 1/4).

How can I fix this problem? Ultimately, I want to run queries to Tika from a Python script (with Chris Mattmann’s module). If this behaviour can be controlled from within Python, that would be fine for me. But since I got the problem also using curl and tika-app, I thought that the problem is more likely to be found in Tika itself.

I’d be very grateful for any assistance!

Best,
Philipp
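
A note on symptom (2): the pair 'ü' is exactly what the two UTF-8 bytes of 'ü' (0xC3 0xBC) look like when they are decoded as ISO-8859-1, so that output is consistent with UTF-8 text being re-read under a single-byte charset somewhere along the way. Where that would happen inside Tika is not established in this thread. The Python sketch below only reproduces the byte-level effect and its reversal; the helper names garble and repair are made up for illustration and are not part of Tika or of any Tika client module.

```python
def garble(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode those bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(text: str) -> str:
    """Undo garble(): recover the original bytes, then decode them as UTF-8.

    Raises UnicodeEncodeError or UnicodeDecodeError if the text was not
    produced by this particular mix-up.
    """
    return text.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    name = "Steinkrüger"
    broken = garble(name)
    print(broken)          # every 'ü' now shows up as two characters
    print(repair(broken))  # round-trips back to the original string
```

If the metadata coming back from Tika shows this pattern, applying repair to the affected values on the Python side may work as a stopgap, but that is a workaround for the symptom, not a fix for whichever step mis-decodes the message.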