Axel Dörfler created TIKA-4255:
----------------------------------

             Summary: TextAndCSVParser ignores Metadata.CONTENT_ENCODING
                 Key: TIKA-4255
                 URL: https://issues.apache.org/jira/browse/TIKA-4255
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 2.9.2, 3.0.0-BETA, 2.6.0
            Reporter: Axel Dörfler


I pass a text document that contains only the string "ETL" to the auto-detect 
parser. I also pass the content type and content encoding via Metadata.

However, TextAndCSVParser ignores the provided encoding (since CSVParams has 
not been provided via TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE) and 
detects the encoding itself instead. It turns out that it detects an IBM424 
Hebrew charset and uses that, which results in rather surprising output.
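
To illustrate why a misdetected charset produces surprising output, here is a 
minimal sketch that uses only the JDK, not Tika. It decodes the same three 
bytes of "ETL" once as UTF-8 and once as IBM424. The class name 
EncodingMismatchDemo and the helper decode are hypothetical and chosen for 
illustration only. IBM424 is an extended charset that may be missing from some 
Java runtimes, so the sketch checks for it first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    // Hypothetical helper: decode raw bytes with the named charset.
    public static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The three bytes of the test document from the report.
        byte[] bytes = "ETL".getBytes(StandardCharsets.US_ASCII);

        // Decoding with the encoding the caller intended.
        System.out.println("UTF-8:  " + decode(bytes, "UTF-8"));

        // Decoding with the charset that was detected instead.
        // IBM424 is EBCDIC-based, so the same bytes map to other characters.
        if (Charset.isSupported("IBM424")) {
            System.out.println("IBM424: " + decode(bytes, "IBM424"));
        } else {
            System.out.println("IBM424 is not available in this runtime");
        }
    }
}
```

With only three bytes of input, a statistical detector has very little 
evidence to work with, which is presumably why the encoding supplied by the 
caller matters so much in this case.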

Tested with the versions listed above, though the bug is probably much older.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
