[ 
https://issues.apache.org/jira/browse/TIKA-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605660#comment-16605660
 ] 

Nick Burch commented on TIKA-2711:
----------------------------------

On the long text, Tika is detecting the DOS version as {{windows-1252}} 
encoding and the Unix one as {{ISO-8859-1}}, which are similar but not the 
same. Most likely it is the {{\r\n}} vs {{\n}} line endings that tip the 
detection one way or the other, as it's probability- and n-gram-based.
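
To illustrate why the choice between those two charsets matters, here is a
minimal sketch in plain Java (no Tika involved). It assumes the curly
apostrophe in the file is stored as the single byte 0x92, which has not been
checked against the attachments; the class and method names are made up for
the example. That byte is U+2019 in {{windows-1252}} but an unprintable C1
control character in {{ISO-8859-1}}, which would typically come out as a ?
when the text is written to a console.

```java
import java.nio.charset.Charset;

public class ApostropheDecoding {

    // Hypothetical helper: decode raw bytes with the named charset.
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Assumed bytes for "l'alpha" with a Windows curly apostrophe (0x92).
        byte[] raw = {'l', (byte) 0x92, 'a', 'l', 'p', 'h', 'a'};

        String asCp1252 = decode(raw, "windows-1252");
        String asLatin1 = decode(raw, "ISO-8859-1");

        // Print the code point that the second byte was decoded to.
        System.out.printf("windows-1252: U+%04X%n", (int) asCp1252.charAt(1)); // U+2019
        System.out.printf("ISO-8859-1:   U+%04X%n", (int) asLatin1.charAt(1)); // U+0092
    }
}
```

The same bytes give a right single quotation mark under one charset and a
control character under the other, so a wrong guess by the detector is enough
to lose the apostrophes.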

If you know what encoding you used, tell Tika that! It'll then be fine.

Or get your users to use plain ASCII apostrophes, not the dodgy Windows 
special ones ;)

> When parsing a UNIX text file apostrophes are rendered as ?
> -----------------------------------------------------------
>
>                 Key: TIKA-2711
>                 URL: https://issues.apache.org/jira/browse/TIKA-2711
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.18
>         Environment: Windows 10
>            Reporter: Ichbiah
>            Priority: Minor
>             Fix For: 1.19
>
>         Attachments: long_text_dos.txt, long_text_unix.txt, petit_dos.txt, 
> petit_unix.txt
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> I have a small text file in two versions:
>  * a dos version of the file
>  * a unix version of the file
> Both contain the same text below:
> La politique macroéconomique cesse officiellement d’être 
> l’alpha et l’oméga de la lutte contre le chômage.
> When I parse them using the tika-app.jar, the text is correctly "extracted" 
> from the DOS version of the file. For the UNIX version of the file the 
> apostrophes are incorrectly rendered as question marks.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
