[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-3515:
------------------------------
    Affects Version/s:     (was: 2.0.0-BETA)
                       2.0.0

> Tika CLI -t should use UTF-8 as default output encoding
> -------------------------------------------------------
>
>                 Key: TIKA-3515
>                 URL: https://issues.apache.org/jira/browse/TIKA-3515
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 2.0.0, 1.27
>         Environment: Windows 10, Liberica OpenJDK FULL x64 1.8.0_302
>            Reporter: Luís Filipe Nassif
>            Priority: Minor
>         Attachments: Korean lessons_ Lesson 2 – Learnkorean.com.pdf,
>                      LIVE-Seoul-ntfs-utf-16-be.txt,
>                      LIVE-Seoul-ntfs-utf-16-le.txt,
>                      LIVE-Seoul-ntfs-utf-8.txt,
>                      LIVE-Seoul-ntfs-utf-8.txt_-x_output.xml,
>                      LIVE-Seoul-ntfs-utf-8_-t_output.txt,
>                      Screen Shot 2021-08-06 at 5.50.04 PM.png,
>                      Screen Shot 2021-08-06 at 5.50.21 PM.png,
>                      Screen Shot tika-app.png,
>                      image-2021-08-09-14-37-30-552.png,
>                      image-2021-08-09-14-38-26-763.png
>
>
> Some Korean chars are extracted as squares. The encodings of plain texts are
> detected correctly. Maybe this is related with the content handler (just a
> guess). I'll attach the triggering files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)