[ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-324.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.6
         Assignee: Jukka Zitting

OK. I've committed the latest patch to trunk. The code now never uses the 
default platform encoding on Mac OS X, opting instead for UTF-8 as the default. 
People can still override the setting with an explicit --encoding argument.

For the CentOS case, I recommend simply setting the LANG environment variable 
correctly: other programs rely on it as well, and there is no other easy way 
for Tika or Java to determine which encoding to use on that platform.
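
The selection order described above (explicit setting first, then UTF-8 on 
Mac OS X, then the platform default) could be sketched roughly as follows. 
This is an illustration only, not the committed patch: the class name 
EncodingDefaults, the method chooseEncoding, and the use of the first 
command-line argument as the encoding name are all hypothetical.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical sketch; names are illustrative and not taken from the Tika source.
class EncodingDefaults {

    /**
     * Picks an output charset: an explicit name wins, then UTF-8 on Mac OS X,
     * otherwise whatever the platform default is.
     */
    static Charset chooseEncoding(String explicit, String osName, Charset platformDefault) {
        if (explicit != null) {
            return Charset.forName(explicit);
        }
        if (osName != null && osName.startsWith("Mac OS X")) {
            return Charset.forName("UTF-8");
        }
        return platformDefault;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the first argument, if present, is the encoding name.
        String explicit = args.length > 0 ? args[0] : null;
        Charset charset = chooseEncoding(
                explicit, System.getProperty("os.name"), Charset.defaultCharset());

        // Wrap System.out so that text is encoded with the chosen charset.
        Writer out = new OutputStreamWriter(System.out, charset);
        out.write("I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n\n");
        out.flush();
    }
}
```

On other platforms the sketch falls through to Charset.defaultCharset(), 
which the JVM normally derives from the locale, hence the LANG advice for 
CentOS.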

> Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)
> --------------------------------------------------------------
>
>                 Key: TIKA-324
>                 URL: https://issues.apache.org/jira/browse/TIKA-324
>             Project: Tika
>          Issue Type: Bug
>          Components: cli
>    Affects Versions: 0.3, 0.4, 0.5
>         Environment: Mac OS 10.5, java version "1.6.0_15"
>            Reporter: Peter Wolanin
>            Assignee: Jukka Zitting
>            Priority: Critical
>             Fix For: 0.6
>
>         Attachments: test.txt, TIKA-324-0.5.patch, TIKA-324-macosx.patch, 
> TIKA-324.patch, TIKA-324.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When using the -t flag to tika, multi-byte content is destroyed in the output.
> Example:
> $ java -jar tika-app-0.4.jar -t ./test.txt
> I?t?rn?ti?n?liz?ti?n
> $ java -jar tika-app-0.4.jar -x ./test.txt
> <?xml version="1.0" encoding="UTF-8"?>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <title/>
> </head>
> <body>
> <p>Iñtërnâtiônàlizætiøn
> </p>
> </body>
> </html>
> see also:  http://drupal.org/node/622508#comment-2267918
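
The kind of mangling reported above can be reproduced in isolation with a 
small sketch. It assumes US-ASCII as a stand-in for a non-UTF-8 default 
encoding; the actual default on the reporter's machine is not stated in the 
issue and may have been different. The class name MangleDemo is hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical demo; US-ASCII is an assumed stand-in for a non-UTF-8 default.
class MangleDemo {

    /** Encodes text with the given charset and decodes it again with the same one. */
    static String roundTrip(String text, Charset charset) {
        // String.getBytes(Charset) substitutes the charset's replacement
        // bytes for characters it cannot represent.
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String text = "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n";
        System.out.println(roundTrip(text, Charset.forName("US-ASCII")));
        System.out.println(roundTrip(text, Charset.forName("UTF-8")).equals(text));
    }
}
```

With US-ASCII every accented character is replaced by a question mark, 
which matches the shape of the -t output quoted above, while the UTF-8 
round trip preserves the text.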

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.