[
https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jukka Zitting resolved TIKA-324.
--------------------------------
Resolution: Fixed
Fix Version/s: 0.6
Assignee: Jukka Zitting
OK. I've committed the latest patch to trunk. On Mac OS X the code now never
uses the default platform encoding and defaults to UTF-8 instead.
People can still override the setting with an explicit --encoding argument.
For the CentOS case I recommend just setting the LANG environment variable
correctly, as that's also used by other programs and there is no other easy way
for Tika or Java to figure out which encoding should be used on that platform.
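The selection order described above (explicit argument first, then UTF-8 on
Mac OS X, then the platform default) can be sketched roughly as follows. This
is only an illustration, not the committed patch: the class and method names
are hypothetical, and it assumes the JVM reports "Mac OS X" in the os.name
system property.

```java
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical sketch; the names below are not taken from the Tika source.
public class OutputEncodingSketch {

    // Picks the name of the encoding to use for text output.
    static String chooseEncoding(String explicitEncoding, String osName) {
        if (explicitEncoding != null) {
            // An explicit setting (e.g. from --encoding) always wins.
            return explicitEncoding;
        }
        if (osName != null && osName.startsWith("Mac OS X")) {
            // Never trust the platform default on Mac OS X.
            return "UTF-8";
        }
        // Elsewhere, fall back to the JVM default, which is normally
        // derived from the locale settings (e.g. LANG).
        return Charset.defaultCharset().name();
    }

    // Wraps an output stream in a writer that uses the chosen encoding.
    static Writer openWriter(OutputStream out, String explicitEncoding)
            throws UnsupportedEncodingException {
        String encoding =
                chooseEncoding(explicitEncoding, System.getProperty("os.name"));
        return new OutputStreamWriter(out, encoding);
    }

    public static void main(String[] args) throws Exception {
        // Optional first argument stands in for an explicit encoding.
        Writer writer = openWriter(System.out, args.length > 0 ? args[0] : null);
        // "Internationalization" with accents, written as Unicode escapes
        // so the source file itself stays ASCII.
        writer.write("I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n\n");
        writer.flush();
    }
}
```

Keeping the choice in a small pure function makes the override order easy to
check without touching System.out.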
> Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)
> --------------------------------------------------------------
>
> Key: TIKA-324
> URL: https://issues.apache.org/jira/browse/TIKA-324
> Project: Tika
> Issue Type: Bug
> Components: cli
> Affects Versions: 0.3, 0.4, 0.5
> Environment: Mac OS 10.5, java version "1.6.0_15"
> Reporter: Peter Wolanin
> Assignee: Jukka Zitting
> Priority: Critical
> Fix For: 0.6
>
> Attachments: test.txt, TIKA-324-0.5.patch, TIKA-324-macosx.patch,
> TIKA-324.patch, TIKA-324.patch
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> When using the -t flag to tika, multi-byte content is destroyed in the output.
> Example:
> $ java -jar tika-app-0.4.jar -t ./test.txt
> I?t?rn?ti?n?liz?ti?n
> $ java -jar tika-app-0.4.jar -x ./test.txt
> <?xml version="1.0" encoding="UTF-8"?>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <title/>
> </head>
> <body>
> <p>Iñtërnâtiônàlizætiøn
> </p>
> </body>
> </html>
> see also: http://drupal.org/node/622508#comment-2267918
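For reference, one way output like the -t example above can arise in Java is
encoding text with a charset that cannot represent the characters: each
unmappable character is replaced by the charset's replacement byte. The sketch
below uses US-ASCII purely for illustration. It is an assumption that the
reported problem had a comparable cause (a mismatch between the encoding Java
picked and the one the terminal expected); the class and method names are
hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration; not part of Tika.
public class LossyEncodingDemo {

    // Encodes the text as US-ASCII and decodes it again. String.getBytes
    // replaces every unmappable character with the charset's default
    // replacement byte, which is '?' for US-ASCII.
    static String roundTripAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Same sample text as in the report, written as Unicode escapes.
        String sample = "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n";
        System.out.println(roundTripAscii(sample)); // prints I?t?rn?ti?n?liz?ti?n
    }
}
```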