[ https://issues.apache.org/jira/browse/TIKA-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated TIKA-711:
------------------------------------

    Attachment: testOptionalHyphen.rtf
                testOptionalHyphen.pptx
                testOptionalHyphen.ppt
                testOptionalHyphen.pdf
                testOptionalHyphen.docx
                testOptionalHyphen.doc
                TIKA-711.patch

Patch.

> Word parser doesn't extract optional hyphen correctly
> -----------------------------------------------------
>
>                 Key: TIKA-711
>                 URL: https://issues.apache.org/jira/browse/TIKA-711
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>             Fix For: 1.0
>
>         Attachments: TIKA-711.patch, testOptionalHyphen.doc,
> testOptionalHyphen.docx, testOptionalHyphen.pdf, testOptionalHyphen.ppt,
> testOptionalHyphen.pptx, testOptionalHyphen.rtf
>
>
> We seem not to extract the optional hyphen character correctly in
> the Word parser.
> You can create this char in Word by typing ctrl and -.  It's hidden,
> normally; you have to turn on display of formatting marks to see it.
> Ideally we'd get U+00AD (unicode soft hyphen), I think.
> DOC produces a unicode replacement char, which is wrong.
> DOCX and PDF drop the char (which seems acceptable).  RTF produces
> U+2027 (hyphenation point) which also seems OK (in TIKA-683 it will
> produce U+00AD).
> PPT and PPTX work correctly (U+00AD).
> So DOC is the only bug I think -- I haven't dug into what's wrong
> yet...

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
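
The four outcomes listed in the issue (U+00AD as the ideal result, U+FFFD from DOC, U+2027 from RTF, and the character dropped by DOCX and PDF) can be told apart with a small check on the extracted text. The following is only a sketch: the class name OptionalHyphenCheck, the Outcome enum, and the sample word are hypothetical, and the code is neither part of Tika nor taken from TIKA-711.patch. It assumes the text has already been extracted into a String by whichever parser is under test.

```java
public class OptionalHyphenCheck {

    // Code points named in the issue description.
    static final char SOFT_HYPHEN = '\u00AD';        // ideal result
    static final char REPLACEMENT_CHAR = '\uFFFD';   // what DOC is reported to produce
    static final char HYPHENATION_POINT = '\u2027';  // what RTF is reported to produce

    // Hypothetical labels for the outcomes the issue lists.
    public enum Outcome { SOFT_HYPHEN, REPLACEMENT_CHAR, HYPHENATION_POINT, DROPPED }

    // Classify already-extracted text. The replacement character is checked
    // first because it is the one outcome the issue calls wrong.
    public static Outcome classify(String extracted) {
        if (extracted.indexOf(REPLACEMENT_CHAR) >= 0) {
            return Outcome.REPLACEMENT_CHAR;
        }
        if (extracted.indexOf(SOFT_HYPHEN) >= 0) {
            return Outcome.SOFT_HYPHEN;
        }
        if (extracted.indexOf(HYPHENATION_POINT) >= 0) {
            return Outcome.HYPHENATION_POINT;
        }
        // None of the three code points is present, so the optional hyphen
        // is treated as dropped.
        return Outcome.DROPPED;
    }

    public static void main(String[] args) {
        // Made-up sample strings standing in for extracted text; the real
        // contents of the testOptionalHyphen.* attachments are not shown here.
        String[] samples = {
            "opt" + SOFT_HYPHEN + "ional",
            "opt" + REPLACEMENT_CHAR + "ional",
            "opt" + HYPHENATION_POINT + "ional",
            "optional"
        };
        for (String sample : samples) {
            System.out.println(classify(sample));
        }
    }
}
```

Under the issue's own assessment, only REPLACEMENT_CHAR would count as a failure; DROPPED and HYPHENATION_POINT are described as acceptable.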