[ 
https://issues.apache.org/jira/browse/TIKA-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605548#comment-16605548
 ] 

Ichbiah commented on TIKA-2711:
-------------------------------

I can provide a longer text; the problem is the same. I wonder why Tika is not 
consistent: the text is identical, yet the Tika output differs depending on 
whether the file is in DOS or UNIX format. Tika produces the right text for the 
DOS file but not for the UNIX file, where the apostrophes are NOT rendered 
correctly.
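
One way to confirm that the two attachments really carry the same bytes apart 
from line endings is sketched below. It assumes that the DOS and UNIX versions 
differ only in CRLF versus LF. The helper names are hypothetical and not part 
of Tika. The listing of non-ASCII byte values is only there to show how the 
apostrophe is encoded in each file.

```python
# Sketch only: these helpers are hypothetical and not part of Tika.
# Assumption: the DOS and UNIX files differ only in CRLF vs LF line endings.
from pathlib import Path


def differs_only_in_line_endings(dos_bytes, unix_bytes):
    """Return True if both byte strings match once CRLF is folded to LF."""
    normalized_dos = dos_bytes.replace(b"\r\n", b"\n")
    normalized_unix = unix_bytes.replace(b"\r\n", b"\n")
    return normalized_dos == normalized_unix


def non_ascii_byte_values(data):
    """Return the distinct byte values >= 0x80 as sorted hex strings."""
    return sorted({f"0x{b:02x}" for b in data if b >= 0x80})


def compare_files(dos_path, unix_path):
    """Read both files as raw bytes and summarize how they compare."""
    dos_bytes = Path(dos_path).read_bytes()
    unix_bytes = Path(unix_path).read_bytes()
    return {
        "same_text": differs_only_in_line_endings(dos_bytes, unix_bytes),
        "dos_non_ascii": non_ascii_byte_values(dos_bytes),
        "unix_non_ascii": non_ascii_byte_values(unix_bytes),
    }
```

For example, compare_files("petit_dos.txt", "petit_unix.txt") would report 
whether the two attachments match once line endings are normalized, together 
with the non-ASCII byte values found in each.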

> When parsing a UNIX text file apostrophes are rendered as ?
> -----------------------------------------------------------
>
>                 Key: TIKA-2711
>                 URL: https://issues.apache.org/jira/browse/TIKA-2711
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.18
>         Environment: Windows 10
>            Reporter: Ichbiah
>            Priority: Minor
>             Fix For: 1.19
>
>         Attachments: long_text_dos.txt, long_text_unix.txt, petit_dos.txt, 
> petit_unix.txt
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> I have a small text file in two versions:
>  * a DOS version of the file
>  * a UNIX version of the file
> Both contain the same text, shown below:
> La politique macroéconomique cesse officiellement d’être 
> l’alpha et l’oméga de la lutte contre le chômage.
> When I parse them using tika-app.jar, the text is correctly extracted 
> from the DOS version of the file. For the UNIX version of the file, the 
> apostrophes are incorrectly rendered as question marks.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)