[ 
https://issues.apache.org/jira/browse/TIKA-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605357#comment-16605357
 ] 

Nick Burch commented on TIKA-2711:
----------------------------------

Text files do not include any encoding information, so Tika has to guess the 
encoding before it can process the file. The more text Tika has to work with, 
the more accurate that guess can be.
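
To illustrate why a guess is needed at all, here is a minimal plain-JDK sketch 
(it does not use Tika, and it is not how Tika's detector works). It assumes, 
purely for illustration, that the word "d’être" was saved as windows-1252; the 
actual encoding of the attached files is not stated in the issue. The class and 
method names are made up for this example. The same six bytes decode without 
any error under several charsets, and nothing in the bytes says which result 
is the intended one:

```java
import java.nio.charset.Charset;

public class EncodingAmbiguity {
    // Decode the same raw bytes under a charset chosen by the caller.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Render a string as code points so the console encoding cannot hide differences.
    static String codePoints(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Assumed input: "d’être" as windows-1252 bytes.
        // 0x92 is the curly apostrophe and 0xEA is "ê" in that charset.
        byte[] raw = {0x64, (byte) 0x92, (byte) 0xEA, 0x74, 0x72, 0x65};

        // Every decode below "succeeds", but only one matches what the author typed.
        for (String name : new String[] {"windows-1252", "ISO-8859-1", "UTF-8"}) {
            System.out.println(name + " -> " + codePoints(decode(raw, name)));
        }
    }
}
```

With only a handful of non-ASCII bytes like these, a detector has very little 
evidence for choosing between the candidates, which is why a longer sample 
helps.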

Can you try giving Tika a much longer set of French text in the two formats, 
and see if it gets it right for both?

(IIRC we use the first few KB of text to do the analysis. Very short runs of 
text are always a problem for encoding and language detection, as there's not 
enough to go on to be sure which of the many possibilities is correct)

Alternatively, if you know for sure which text encoding was used, you can tell 
Tika, and that will help a lot!
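
As a rough plain-JDK analogy for why a known encoding helps (this is not 
Tika's API, and the class and method names are hypothetical), the sketch below 
reads the same bytes twice: once with the charset they were written in, and 
once with a wrong one. The sample text is taken from the issue, and UTF-8 is 
an assumption made only for the demonstration:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class KnownEncodingReader {
    // Read the whole stream as text using a charset the caller already knows.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // "l’alpha et l’oméga", written with escapes so the source file encoding does not matter.
        String original = "l\u2019alpha et l\u2019om\u00e9ga";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8); // assumed encoding for this demo

        // Declaring the correct charset recovers the text exactly.
        String right = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        // Declaring a wrong charset still "works" but mangles the apostrophes and accents.
        String wrong = readAll(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1);

        System.out.println("correct charset round-trips: " + right.equals(original));
        System.out.println("wrong charset round-trips:   " + wrong.equals(original));
    }
}
```

When the encoding is supplied up front, no guessing is involved, so even a 
very short file decodes correctly.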

> When parsing a UNIX text file apostrophes are rendered as ?
> -----------------------------------------------------------
>
>                 Key: TIKA-2711
>                 URL: https://issues.apache.org/jira/browse/TIKA-2711
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.18
>         Environment: Windows 10
>            Reporter: Ichbiah
>            Priority: Minor
>             Fix For: 1.19
>
>         Attachments: petit_dos.txt, petit_unix.txt
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> I have a small text file in two versions:
>  * a dos version of the file
>  * a unix version of the file
> Both contain the same text below:
> La politique macroéconomique cesse officiellement d’être 
> l’alpha et l’oméga de la lutte contre le chômage.
> When I parse them using tika-app.jar, the text is extracted correctly 
> from the DOS version of the file. For the UNIX version of the file, the 
> apostrophes are incorrectly rendered as question marks.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
