Word parser doesn't extract optional hyphen correctly -----------------------------------------------------
Key: TIKA-711 URL: https://issues.apache.org/jira/browse/TIKA-711 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Fix For: 1.0 We seem not to extract the optional hyphen character correctly in the Word parser. You can create this char in Word by typing ctrl and -. It's hidden, normally; you have to turn on display of formatting marks to see it. Ideally we'd get U+00AD (unicode soft hyphen), I think. DOC produces a unicode replacement char, which is wrong. DOCX and PDF drop the char (which seems acceptable). RTF produces U+2027 (hyphenation point) which also seems OK (in TIKA-683 it will produce U+00AD). PPT and PPTX work correctly (U+00AD). So DOC is the only bug I think -- I haven't dug into what's wrong yet... -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira