[ https://issues.apache.org/jira/browse/TIKA-912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steffen Moldaner updated TIKA-912:
----------------------------------

    Attachment: TikaResource.java.patch

RFC 2068, section 3.7.1 states that
"When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.
Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value."

TikaResource uses UTF-8 without labeling the charset.

I attached a patch against Tika 1.5 based on Chris Wilson's solution.

> Response charset encoding not declared, and depends on host OS (Windows/Linux)
> ------------------------------------------------------------------------------
>
>                 Key: TIKA-912
>                 URL: https://issues.apache.org/jira/browse/TIKA-912
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.1
>         Environment: java version "1.6.0_26"
> Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
> Java HotSpot(TM) Server VM (build 20.1-b02, mixed mode)
> java version "1.6.0_31"
> Java(TM) SE Runtime Environment (build 1.6.0_31-b05)
> Java HotSpot(TM) Client VM (build 20.6-b01, mixed mode, sharing)
>            Reporter: Chris Wilson
>              Labels: newbie, patch
>         Attachments: TikaResource-utf8-response.patch, TikaResource.java.patch
>
>
> When the response to the /tika servlet contains non-ASCII characters, Tika
> doesn't tell us what encoding it's using, and the encoding differs depending
> on which OS the server is running on.
> This is a server running on Tomcat on Linux:
> {code}
> chris@lap-x201:~/projects/atamis-intranet/django/intranet$ curl -i -T
> documents/fixtures/smartquote-bullet.docx http://localhost:8080/tika/tika |
> hexdump -C
> 00000000  48 54 54 50 2f 31 2e 31  20 31 30 30 20 43 6f 6e  |HTTP/1.1 100 Con|
> 00000010  74 69 6e 75 65 0d 0a 0d  0a 48 54 54 50 2f 31 2e  |tinue....HTTP/1.|
> 00000020  31 20 32 30 30 20 4f 4b  0d 0a 53 65 72 76 65 72  |1 200 OK..Server|
> 00000030  3a 20 41 70 61 63 68 65  2d 43 6f 79 6f 74 65 2f  |: Apache-Coyote/|
> 00000040  31 2e 31 0d 0a 43 6f 6e  74 65 6e 74 2d 54 79 70  |1.1..Content-Typ|
> 00000050  65 3a 20 74 65 78 74 2f  70 6c 61 69 6e 0d 0a 54  |e: text/plain..T|
> 00000060  72 61 6e 73 66 65 72 2d  45 6e 63 6f 64 69 6e 67  |ransfer-Encoding|
> 00000070  3a 20 63 68 75 6e 6b 65  64 0d 0a 44 61 74 65 3a  |: chunked..Date:|
> 00000080  20 46 72 69 2c 20 30 34  20 4d 61 79 20 32 30 31  | Fri, 04 May 201|
> 00000090  32 20 31 39 3a 34 30 3a  35 34 20 47 4d 54 0d 0a  |2 19:40:54 GMT..|
> 000000a0  0d 0a e2 80 99 0a e2 80  a2 09 0a                 |...........|
> 000000ab
> {code}
> And this is a server running on Tomcat on Windows:
> {code}
> chris@lap-x201:~/projects/atamis-intranet/django/intranet$ curl -i -T
> documents/fixtures/smartquote-bullet.docx http://localhost:9080/tika/tika |
> hexdump -C
> 00000000  48 54 54 50 2f 31 2e 31  20 31 30 30 20 43 6f 6e  |HTTP/1.1 100 Con|
> 00000010  74 69 6e 75 65 0d 0a 0d  0a 48 54 54 50 2f 31 2e  |tinue....HTTP/1.|
> 00000020  31 20 32 30 30 20 4f 4b  0d 0a 53 65 72 76 65 72  |1 200 OK..Server|
> 00000030  3a 20 41 70 61 63 68 65  2d 43 6f 79 6f 74 65 2f  |: Apache-Coyote/|
> 00000040  31 2e 31 0d 0a 43 6f 6e  74 65 6e 74 2d 54 79 70  |1.1..Content-Typ|
> 00000050  65 3a 20 74 65 78 74 2f  70 6c 61 69 6e 0d 0a 54  |e: text/plain..T|
> 00000060  72 61 6e 73 66 65 72 2d  45 6e 63 6f 64 69 6e 67  |ransfer-Encoding|
> 00000070  3a 20 63 68 75 6e 6b 65  64 0d 0a 44 61 74 65 3a  |: chunked..Date:|
> 00000080  20 46 72 69 2c 20 30 34  20 4d 61 79 20 32 30 31  | Fri, 04 May 201|
> 00000090  32 20 31 39 3a 33 39 3a  35 32 20 47 4d 54 0d 0a  |2 19:39:52 GMT..|
> 000000a0  0d 0a 92 0a 95 09 0a                              |.......|
> 000000a7
> {code}
> As you can see, the data (last few bytes) is encoded differently. The Linux
> server encodes it as UTF-8, while Windows is using something strange,
> probably Windows-1252, where 0x92 is a curly quote and 0x95 is a bullet point.
> A client can't know what encoding the server used, because the Content-Type
> is just text/plain with no encoding.
> Ideally I would like it to use UTF-8 always, so that the client doesn't have
> to do extra work to decode it. The attached patch does that, and declares it.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
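
For anyone reading the thread without the attachments: the following is a minimal, standalone sketch of the behaviour described in the issue, not the attached patch and not TikaResource code. It assumes the two characters in the fixture are U+2019 (right single quote) and U+2022 (bullet), which is what the body bytes in the dumps decode to. The class name CharsetDemo and the helpers hex() and contentType() are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustrative only: hypothetical names, not part of Tika or the patch. */
final class CharsetDemo {
    // Assumed body text: U+2019, newline, U+2022, tab, newline.
    static final String SAMPLE = "\u2019\n\u2022\t\n";

    /** Formats bytes as space-separated lowercase hex, like the hexdump column. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Builds a Content-Type value that labels the charset explicitly. */
    static String contentType(Charset cs) {
        return "text/plain; charset=" + cs.name();
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");

        // Explicit charsets give the same bytes on every host OS.
        System.out.println("UTF-8:        " + hex(SAMPLE.getBytes(StandardCharsets.UTF_8)));
        System.out.println("windows-1252: " + hex(SAMPLE.getBytes(cp1252)));

        // getBytes() with no argument uses the JVM default charset, which on
        // older JVMs (such as the Java 6 ones in the report) follows the OS.
        System.out.println("default (" + Charset.defaultCharset() + "): "
                + hex(SAMPLE.getBytes()));

        // The header value a client needs in order to decode the body reliably.
        System.out.println("Content-Type: " + contentType(StandardCharsets.UTF_8));
    }
}
```

The general idea is to pick the charset explicitly when encoding the response and to state the same charset in the Content-Type header, so that neither side depends on a platform default.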