[ https://issues.apache.org/jira/browse/TIKA-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734531#comment-14734531 ]
Nick Burch commented on TIKA-1728:
----------------------------------

Detection of the v5 file is handled by the OLE2 container-aware detector. We can't do it with magic, as there is no predictable place in the file to look for some unique bytes.

I think we still need to keep one of the formats as {{application/x-hwp}}, as that's what most other libraries/programs use. We just need to pick which one to make the default.

If you're able to put some time into building a parser with java-hwp, that'd be great! That's probably best tracked in a separate JIRA, though.

> Detection is not working properly for detecting HWP 5.0 file
> ------------------------------------------------------------
>
>                 Key: TIKA-1728
>                 URL: https://issues.apache.org/jira/browse/TIKA-1728
>             Project: Tika
>          Issue Type: Bug
>         Environment: OS: windows 7 and centos 6
> Java: 1.7
> Tika jar: tika-app-1.10.jar
> File: HWP 5.0
>            Reporter: mungeol heo
>         Attachments: HWP-document-file-formats-3.0-Korean.pdf,
> HWP-document-file-formats-5.0-Korean.pdf, error-message.png, test_3.0.hwp,
> test_5.0.hwp
>
>
> The HWP file format has two versions: HWP 3.0 and HWP 5.0.
> 'tika-app-1.10.jar' detects HWP 3.0 files correctly, but not HWP 5.0 files.
> The commands used and the results returned are shown below.
>
> java -jar tika-app-1.10.jar --detect test_3.0.hwp
>
> application/x-hwp
>
> java -jar tika-app-1.10.jar --detect test_5.0.hwp
>
> application/x-tika-msoffice

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
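
To illustrate why magic alone stops at {{application/x-tika-msoffice}} for the 5.0 file, here is a rough, standalone sketch. It is not Tika code. The class and method names are made up for illustration, the {{application/x-hwp-v5}} string is only a placeholder (the type name is still undecided in this thread), and the "FileHeader" / "DocInfo" entry names are an assumption about how an HWP 5.0 container is laid out. The only fixed bytes are the 8-byte OLE2 signature at offset 0, which every OLE2 file shares, so telling the formats apart needs a look at the entries inside the container.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; none of these names come from Tika.
class Ole2MagicSketch {
    // Signature found at offset 0 of every OLE2 compound file.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Generic type reported in the issue for test_5.0.hwp.
    static final String GENERIC_OLE2 = "application/x-tika-msoffice";

    // Placeholder type name, assumed here purely for the example.
    static final String HWP_V5 = "application/x-hwp-v5";

    // True if the leading bytes carry the OLE2 signature.
    static boolean hasOle2Magic(byte[] head) {
        if (head.length < OLE2_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // Magic-only detection: it can say "OLE2", but not which OLE2 format.
    static String detectByMagicOnly(byte[] head) {
        return hasOle2Magic(head) ? GENERIC_OLE2 : "application/octet-stream";
    }

    // Container-aware detection: decide from the top-level entry names.
    // Assumption: an HWP 5.0 file has "FileHeader" and "DocInfo" entries.
    static String detectByEntryNames(Set<String> topLevelNames) {
        if (topLevelNames.contains("FileHeader") && topLevelNames.contains("DocInfo")) {
            return HWP_V5;
        }
        return GENERIC_OLE2;
    }

    public static void main(String[] args) {
        // Synthetic 512-byte header: OLE2 signature followed by zero padding.
        byte[] head = Arrays.copyOf(OLE2_MAGIC, 512);
        System.out.println(detectByMagicOnly(head));

        // Entry names a container-aware detector might see (assumed values).
        Set<String> names = new HashSet<>(Arrays.asList("FileHeader", "DocInfo", "BodyText"));
        System.out.println(detectByEntryNames(names));
    }
}
```

In a real detector the entry names would come from parsing the OLE2 directory (for example via POI), which is the extra step a plain magic match cannot perform.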