Olivier M created TIKA-1794:
-------------------------------

             Summary: TXTParser removes form feed characters
                 Key: TIKA-1794
                 URL: https://issues.apache.org/jira/browse/TIKA-1794
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.11
         Environment: Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
            Reporter: Olivier M
            Priority: Minor


I just noticed that Apache Tika does not preserve form feed characters
(U+000C, byte 0C in UTF-8) when parsing a plain-text file.

If I compare the hex bytes of the original file with the hex bytes of the
extracted text, I can see that each 0C byte is replaced by EF BF BD, which
is the UTF-8 encoding of the replacement character (U+FFFD).
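
The byte comparison described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the class name FormFeedCheck and the helper count are hypothetical, and the "extracted" bytes are a hand-built stand-in for the parser output reported here, not actual Tika output.

```java
import java.nio.charset.StandardCharsets;

public class FormFeedCheck {
    // Count how many times the byte sequence `pattern` occurs in `data`.
    static int count(byte[] data, byte[] pattern) {
        int hits = 0;
        for (int i = 0; i + pattern.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        byte[] formFeed = {0x0C};
        byte[] replacement = {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD};

        // Original text containing one form feed (\f is U+000C).
        byte[] original = "page one\fpage two".getBytes(StandardCharsets.UTF_8);
        // Assumption: stand-in for the extracted text, with the form feed
        // swapped for U+FFFD as described in this report.
        byte[] extracted = "page one\uFFFDpage two".getBytes(StandardCharsets.UTF_8);

        System.out.println("form feeds in original:    " + count(original, formFeed));
        System.out.println("form feeds in extracted:   " + count(extracted, formFeed));
        System.out.println("replacements in extracted: " + count(extracted, replacement));
    }
}
```

To check a real file, the two byte arrays could be read from the original file and from the saved extraction output instead of being built from string literals.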



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
