Tika CLI mangles utf-8 content in text (-t) mode
------------------------------------------------

                 Key: TIKA-324
                 URL: https://issues.apache.org/jira/browse/TIKA-324
             Project: Tika
          Issue Type: Bug
          Components: cli
    Affects Versions: 0.4, 0.3
         Environment: Mac OS 10.5, java version "1.6.0_15"
            Reporter: Peter Wolanin
            Priority: Critical
             Fix For: 0.5
         Attachments: test.txt


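When tika is run with the -t flag, non-ASCII (multi-byte UTF-8) characters in the input are replaced with "?" in the output. The -x (XHTML) output for the same file preserves them.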

Example:

{code}
$ java -jar tika-app-0.4.jar -t ./test.txt
I?t?rn?ti?n?liz?ti?n

$ java -jar tika-app-0.4.jar -x ./test.txt
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body>
<p>Iñtërnâtiônàlizætiøn
</p>
</body>
</html>
{code}
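
For illustration only: the "?" pattern above is what Java produces when text is written through a Writer whose charset cannot represent the characters. The sketch below reproduces that effect in isolation. It is not taken from the Tika source, and the report does not say which charset the CLI actually used. US-ASCII is an assumed stand-in for a lossy encoding, and EncodingDemo and write() are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: writes text through a Writer bound to the given
    // charset and returns the bytes that would reach the output stream.
    static byte[] write(String text, Charset charset) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            writer.write(text);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Same word as in test.txt, spelled with escapes so the example does
        // not depend on the encoding of this source file.
        String text = "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n";

        // Assumption: US-ASCII stands in for a charset that cannot map the
        // accented characters. OutputStreamWriter substitutes '?' for each.
        byte[] lossy = write(text, StandardCharsets.US_ASCII);
        System.out.println(new String(lossy, StandardCharsets.US_ASCII));

        // An explicit UTF-8 writer round-trips the text unchanged.
        byte[] utf8 = write(text, StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(text));
    }
}
```

If the -t path does write through a platform-default or otherwise lossy charset, wrapping the output stream in an explicit UTF-8 writer, as in the second call, would be one way to avoid the loss. Whether that applies to the CLI code is an assumption here.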


See also: http://drupal.org/node/622508#comment-2267918

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
