[ https://issues.apache.org/jira/browse/TIKA-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17762096#comment-17762096 ]
Tim Allison commented on TIKA-4124:
-----------------------------------

It's not clear what the licenses are, but there are some example files and a helpful discussion here: https://github.com/jgm/pandoc/issues/3883

Looks like the alt chunks can be RTF, HTML or a bunch of other file formats. Ideally, we'd inline the content, but it will be simpler to handle these like attachments, as in the example code fix in the issue description (quoted below).

> embedded html of type http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk is not parsed
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-4124
>                 URL: https://issues.apache.org/jira/browse/TIKA-4124
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Tim Barrett
>            Priority: Minor
>
> Word documents that may have been created using third party programs such as
> docx4j sometimes contain embedded html. This is not parsed by Tika. The
> embedded HTML file usually resides within the main folder of the docx
> internal structure.
>
> Changing the code in:
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart()
> as follows, handles this (the final else if)
>
> if (POIXMLDocument.OLE_OBJECT_REL_TYPE.equals(type)
>         && TYPE_OLE_OBJECT.equals(target.getContentType())) {
>     handleEmbeddedOLE(target, xhtml, sourceDesc + rel.getId(), parentMetadata);
>     if (targetURI != null) {
>         handledTarget.add(targetURI.toString());
>     }
> } else if (RELATION_MEDIA.equals(type) || RELATION_VIDEO.equals(type)
>         || RELATION_AUDIO.equals(type)
>         || PackageRelationshipTypes.IMAGE_PART.equals(type)
>         || POIXMLDocument.PACK_OBJECT_REL_TYPE.equals(type)
>         || POIXMLDocument.OLE_OBJECT_REL_TYPE.equals(type)) {
>     handleEmbeddedFile(target, xhtml, sourceDesc + rel.getId());
>     if (targetURI != null) {
>         handledTarget.add(targetURI.toString());
>     }
> } else if (XSSFRelation.VBA_MACROS.getRelation().equals(type)) {
>     handleMacros(target, xhtml);
>     if (targetURI != null) {
>         handledTarget.add(targetURI.toString());
>     }
> } else if (type.endsWith("aFChunk")) {
>     handleEmbeddedFile(target, xhtml, sourceDesc + rel.getId());
> }
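
For anyone who wants to check whether a given docx actually carries alt chunks before touching the extractor, here is a minimal sketch that scans the text of a relationships part (for the main document that is word/_rels/document.xml.rels) for the aFChunk relationship type. It is not Tika or POI code: the class name AltChunkRels, the method altChunkTargets and the sample target afchunk.html are hypothetical, and it compares against the full relationship type URI from the issue title rather than using endsWith("aFChunk") as in the proposed patch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of Tika or POI.
final class AltChunkRels {

    // Relationship type taken from the title of TIKA-4124.
    static final String ALT_CHUNK_REL_TYPE =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk";

    // Namespace of the <Relationships>/<Relationship> elements in a .rels part.
    static final String RELS_NS =
            "http://schemas.openxmlformats.org/package/2006/relationships";

    /** Returns the Target attribute of every aFChunk relationship in the given .rels XML. */
    static List<String> altChunkTargets(String relsXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(relsXml.getBytes(StandardCharsets.UTF_8)));

            NodeList rels = doc.getElementsByTagNameNS(RELS_NS, "Relationship");
            List<String> targets = new ArrayList<>();
            for (int i = 0; i < rels.getLength(); i++) {
                Element rel = (Element) rels.item(i);
                if (ALT_CHUNK_REL_TYPE.equals(rel.getAttribute("Type"))) {
                    targets.add(rel.getAttribute("Target"));
                }
            }
            return targets;
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers get a single unchecked failure mode.
            throw new IllegalArgumentException("Could not parse relationships XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample: one image relationship and one alt chunk relationship.
        String sample =
                "<Relationships xmlns=\"" + RELS_NS + "\">"
                + "<Relationship Id=\"rId1\" Type=\"http://schemas.openxmlformats.org/"
                + "officeDocument/2006/relationships/image\" Target=\"media/image1.png\"/>"
                + "<Relationship Id=\"rId2\" Type=\"" + ALT_CHUNK_REL_TYPE
                + "\" Target=\"afchunk.html\"/>"
                + "</Relationships>";
        System.out.println(altChunkTargets(sample));
    }
}
```

Matching the whole URI is a deliberate choice in this sketch: it avoids picking up an unrelated relationship type that merely ends in the same suffix. Whether the extractor should do the same, or stay with endsWith, is a judgment call for the actual fix.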