[ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778951#action_12778951 ]
Peter Wolanin commented on TIKA-324:
------------------------------------
On Mac OS 10.5 the LANG value looks correct:
$ echo $LANG
en_US.UTF-8
On CentOS 5, no value is set:
$ echo $LANG
If I set that value on CentOS (to the same as on my Mac), then the output is correct:
[r...@i:~] export LANG=en_US.UTF-8
[r...@i:~] java -jar tika-app-0.4.jar -t test.txt
Iñtërnâtiônàlizætiøn
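
The behaviour above is consistent with the text output being encoded with the JVM's platform default charset, which on Linux is derived from the locale (LANG). As a minimal sketch, and not the attached patch or Tika's actual code, the following Java shows how an ASCII-only charset turns the accented characters into '?', and how wrapping System.out in a PrintStream with an explicit UTF-8 encoding avoids depending on LANG. The class name Utf8OutputSketch and the helper encode are hypothetical names used only for illustration.

```java
import java.io.PrintStream;
import java.nio.charset.Charset;

// Hypothetical illustration only; not part of tika-app.
class Utf8OutputSketch {

    // Encode text with the given charset. Characters the charset cannot
    // represent are replaced with its default replacement byte ('?' for US-ASCII).
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) throws Exception {
        // Same sample string as test.txt, written with escapes so the
        // source file's own encoding does not matter.
        String text = "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n";

        // The charset the JVM picked up from the environment.
        System.out.println("default charset: " + Charset.defaultCharset());

        // What an ASCII-only default does to the sample text.
        Charset ascii = Charset.forName("US-ASCII");
        System.out.println(new String(encode(text, ascii), ascii));

        // Writing with an explicit encoding, independent of LANG.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(text);
    }
}
```

Whether the terminal then displays the UTF-8 bytes correctly still depends on the terminal's own encoding; the sketch only removes the dependency on the JVM default.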
> Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)
> --------------------------------------------------------------
>
> Key: TIKA-324
> URL: https://issues.apache.org/jira/browse/TIKA-324
> Project: Tika
> Issue Type: Bug
> Components: cli
> Affects Versions: 0.3, 0.4, 0.5
> Environment: Mac OS 10.5, java version "1.6.0_15"
> Reporter: Peter Wolanin
> Priority: Critical
> Attachments: test.txt, TIKA-324-0.5.patch, TIKA-324-macosx.patch,
> TIKA-324.patch, TIKA-324.patch
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> When using the -t flag to tika, multi-byte content is destroyed in the output.
> Example:
> $ java -jar tika-app-0.4.jar -t ./test.txt
> I?t?rn?ti?n?liz?ti?n
> $ java -jar tika-app-0.4.jar -x ./test.txt
> <?xml version="1.0" encoding="UTF-8"?>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <title/>
> </head>
> <body>
> <p>Iñtërnâtiônàlizætiøn
> </p>
> </body>
> </html>
> see also: http://drupal.org/node/622508#comment-2267918
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.