[ https://issues.apache.org/jira/browse/TIKA-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-711:
------------------------------------

    Attachment: TIKA-711.patch

OK, after digging I found out that in fact POI's AbstractWordConverter converts ASCII 30 to Unicode non-breaking hyphen (U+2011) and ASCII 31 to Unicode zero-width space (U+200b), but Tika doesn't. This is why I see the "right" behavior when running POI's command-line conversion but not with Tika.

So I think the fix is simple here: just do that same mapping in WordExtractor.handleCharacterRun; attached patch does that, and enables the test case (now passing).

> Word parser doesn't extract optional hyphen correctly
> -----------------------------------------------------
>
>                 Key: TIKA-711
>                 URL: https://issues.apache.org/jira/browse/TIKA-711
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: TIKA-711.patch, TIKA-711.patch, testOptionalHyphen.doc,
> testOptionalHyphen.docx, testOptionalHyphen.pdf, testOptionalHyphen.ppt,
> testOptionalHyphen.pptx, testOptionalHyphen.rtf
>
>
> We seem not to extract the optional hyphen character correctly in
> the Word parser.
> You can create this char in Word by typing ctrl and -. It's hidden,
> normally; you have to turn on display of formatting marks to see it.
> Ideally we'd get U+00AD (unicode soft hyphen), I think.
> DOC produces a unicode replacement char, which is wrong.
> DOCX and PDF drop the char (which seems acceptable). RTF produces
> U+2027 (hyphenation point) which also seems OK (in TIKA-683 it will
> produce U+00AD).
> PPT and PPTX work correctly (U+00AD).
> So DOC is the only bug I think -- I haven't dug into what's wrong
> yet...
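
For illustration, the character mapping described in the comment above (ASCII 30 to U+2011, ASCII 31 to U+200B) might look roughly like the Java sketch below. This is not the attached patch and not POI's or Tika's actual code: the class name OptionalHyphenMapper and the method mapControlChars are hypothetical, and the sketch assumes the run text is already available as a String.

```java
// Hypothetical helper; the names here are illustrative only and are not
// part of Tika's WordExtractor or POI's AbstractWordConverter.
final class OptionalHyphenMapper {

    private static final char NON_BREAKING_HYPHEN = '\u2011';
    private static final char ZERO_WIDTH_SPACE = '\u200B';

    private OptionalHyphenMapper() {
    }

    /**
     * Replaces the two control characters named in the comment:
     * ASCII 30 becomes U+2011 and ASCII 31 becomes U+200B.
     * All other characters are copied through unchanged.
     */
    static String mapControlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == 30) {
                out.append(NON_BREAKING_HYPHEN);
            } else if (c == 31) {
                out.append(ZERO_WIDTH_SPACE);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

In a real fix, a mapping like this would presumably be applied to each character run's text before it is emitted, as the comment suggests for WordExtractor.handleCharacterRun.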