[jira] [Commented] (TIKA-1080) Arabic characters under windows

Jukka Zitting (JIRA) Thu, 07 Feb 2013 06:03:15 -0800

    [ https://issues.apache.org/jira/browse/TIKA-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573496#comment-13573496 ]


Jukka Zitting commented on TIKA-1080:
-------------------------------------

If you don't pass an option like --encoding=UTF-8 on the tika-app command 
line, text mode falls back to the system's default encoding (as reported by 
the Java runtime). Any characters that this encoding cannot represent end up 
as question marks. The simple fix is to specify the desired encoding 
explicitly with that command-line option.
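
To illustrate the effect, here is a minimal, self-contained sketch (not 
Tika's actual code). The class name EncodingDemo and the helper roundTrip are 
made up for this example, and ISO-8859-1 merely stands in for a default 
encoding that has no Arabic characters; the real default on a given Windows 
machine may differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: encode the text with the given charset and decode
    // it again, which is roughly what a reader of the output would see.
    public static String roundTrip(String text, Charset charset) {
        // String.getBytes(Charset) replaces unmappable characters with the
        // charset's replacement byte, which is '?' for ISO-8859-1.
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Five Arabic letters, written as escapes so the source file's own
        // encoding does not matter.
        String arabic = "\u0645\u0631\u062D\u0628\u0627";

        // This is the encoding text output uses when none is given.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // An encoding without Arabic characters loses every letter.
        System.out.println(
            roundTrip(arabic, StandardCharsets.ISO_8859_1).equals("?????")); // true

        // UTF-8 preserves the text unchanged.
        System.out.println(
            roundTrip(arabic, StandardCharsets.UTF_8).equals(arabic)); // true
    }
}
```

The first line printed shows which encoding the Java runtime reports as the 
default on the machine at hand, which is a quick way to check whether it can 
represent the characters in question.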

In contrast, HTML output defaults to UTF-8 (though you can override it with 
the --encoding option). Unlike plain text, an HTML document declares its 
encoding in its head, so HTML clients can read that declaration automatically 
and display the content correctly.

                
> Arabic characters under windows
> -------------------------------
>
>                 Key: TIKA-1080
>                 URL: https://issues.apache.org/jira/browse/TIKA-1080
>             Project: Tika
>          Issue Type: Bug
>          Components: parser, server
>    Affects Versions: 1.3
>         Environment: Windows 2003 or Windows 2008
>            Reporter: Alberto Ornaghi
>         Attachments: arabic.docx
>
>
> If Tika is executed under Windows, the text mode (--text) fails to 
> extract Arabic characters and outputs only question marks. The same 
> behaviour occurs if Tika is executed as a server. The issue is not present 
> in the GUI, only on the command line, and it is not present if the output 
> is HTML.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
