Axel Dörfler created TIKA-4255:
----------------------------------

             Summary: TextAndCSVParser ignores Metadata.CONTENT_ENCODING
                 Key: TIKA-4255
                 URL: https://issues.apache.org/jira/browse/TIKA-4255
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 2.9.2, 3.0.0-BETA, 2.6.0
            Reporter: Axel Dörfler

I pass a text consisting only of the string "ETL" to the auto-detect parser, and supply the content type and content encoding via Metadata. However, TextAndCSVParser ignores the provided encoding (since CSVParams has not been provided via TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE) and detects the encoding itself instead. It turns out that it detects an IBM424 Hebrew charset and uses that, which results in rather surprising output.

Tested with the versions mentioned above, though the bug is probably much older.
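
To illustrate why a wrong charset guess gives surprising output for such a short input, here is a minimal sketch that uses only the JDK and does not call Tika at all. It decodes the three ASCII bytes of "ETL" once as UTF-8 and once as IBM424. The class name CharsetMismatchDemo and the helper decode are hypothetical and exist only for this illustration; whether IBM424 is available depends on the Java runtime, so the sketch checks for it first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Hypothetical helper (not part of Tika): decode raw bytes with the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The same three bytes that the parser receives for the text "ETL".
        byte[] raw = "ETL".getBytes(StandardCharsets.US_ASCII);

        // Decoding with the encoding that was declared gives the text back.
        System.out.println("UTF-8:  " + decode(raw, "UTF-8"));

        // IBM424 is an EBCDIC-based code page, so the same bytes map to
        // different characters. It is an extended charset and may be missing
        // from a trimmed-down runtime, hence the availability check.
        if (Charset.isSupported("IBM424")) {
            System.out.println("IBM424: " + decode(raw, "IBM424"));
        } else {
            System.out.println("IBM424 is not available in this runtime");
        }
    }
}
```

With only three bytes to work from, a statistical detector has very little evidence, which is why honoring the encoding supplied by the caller matters most for short inputs like this one.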