[ https://issues.apache.org/jira/browse/TIKA-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann updated TIKA-987:
-----------------------------------
    Fix Version/s:     (was: 1.14)
                   1.15

> Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted
> ------------------------------------------------------------
>
>                 Key: TIKA-987
>                 URL: https://issues.apache.org/jira/browse/TIKA-987
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>             Fix For: 1.15
>
>         Attachments: picture.doc, picture_3.doc
>
>
> I have two Word docs, both containing the same drawing, but one has
> text added.
> In the first case (picture.doc) the extraction is correct: it contains
> only an embedded image.wmf, and when I view the image it's correct.
> In the second case (picture_3.doc) the picture is extracted as "image"
> (no extension) and is 0 bytes, and an invalid character (mapped to the
> Unicode replacement char) is inserted before the image:
> {noformat}
> <title/>
> </head>
> <body><p>�<img src="embedded:image1" alt="image1"/></p>
> <p/>
> <p/>
> <p>vehicle
> </p>
> {noformat}
> (The text "vehicle" is extracted correctly, though.)
> I dug a bit, and the 2nd doc has an embedded {SHAPE \* MERGEFORMAT}
> field, on which we invoke WordExtractor.handleSpecialCharacterRuns;
> somehow that extracts the 0-byte, no-extension image as well as the
> invalid character. The first doc has no such field (at least not one
> that's handled with handleSpecialCharacterRuns...). Beyond that I'm not
> sure how to fix this... it could be that something is going wrong in
> how POI parses the Pictures from PictureSource.
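For anyone triaging similar documents, the symptom described above (a U+FFFD character directly in front of an `embedded:` image reference, plus a resource name with no extension) can be checked mechanically on the XHTML that the parser already produced. The following is only a rough sketch under that assumption: the class and method names are hypothetical, it uses plain `java.util.regex` rather than any Tika or POI API, and it does nothing to fix the underlying extraction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: scans already-extracted XHTML text
// for the symptom reported in this issue.
public class EmbeddedImageSymptomCheck {

    // Captures the text directly preceding an <img src="embedded:NAME" ...> tag
    // (group 1) and the embedded resource NAME (group 2).
    private static final Pattern IMG =
            Pattern.compile("([^<>]*)<img\\s+src=\"embedded:([^\"]+)\"");

    /** Returns the names of embedded images whose preceding text contains U+FFFD. */
    public static List<String> imagesAfterReplacementChar(String xhtml) {
        List<String> hits = new ArrayList<>();
        Matcher m = IMG.matcher(xhtml);
        while (m.find()) {
            if (m.group(1).indexOf('\uFFFD') >= 0) {
                hits.add(m.group(2));
            }
        }
        return hits;
    }

    /** True if the resource name has no file extension, like "image" in the report. */
    public static boolean lacksExtension(String name) {
        int dot = name.lastIndexOf('.');
        return dot <= 0 || dot == name.length() - 1;
    }

    public static void main(String[] args) {
        // Sample modelled on the picture_3.doc output quoted in the issue.
        String xhtml = "<body><p>\uFFFD<img src=\"embedded:image1\" alt=\"image1\"/></p>"
                + "<p>vehicle</p></body>";
        System.out.println(imagesAfterReplacementChar(xhtml));
        System.out.println(lacksExtension("image"));
        System.out.println(lacksExtension("image.wmf"));
    }
}
```

A check like this could be run over the output for a batch of .doc files to find other documents that hit the same code path. The 0-byte condition would still have to be verified separately on the extracted resource itself.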