[ https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127015#comment-14127015 ]
Hudson commented on TIKA-1413:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.6 #181 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/181/])
TIKA-1413 - Remove embedded thumbnail from body (thaichat04: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1623819)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/rtf/RTFParserTest.java

> OOXML thumbnail name added to body
> ----------------------------------
>
>                 Key: TIKA-1413
>                 URL: https://issues.apache.org/jira/browse/TIKA-1413
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Andrzej Bialecki
>
> AbstractOOXMLExtractor.handleThumbnail processes thumbnails using
> EmbeddedDocumentExtractor, but with the outputHtml flag set to true (unlike
> other embedded parts in handleEmbeddedParts(...)).
> This results in adding the thumbnail name to the main body of the document
> (as a package-entry), which in my opinion is wrong.
> Example:
> {code}
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="meta:slide-count" content="1"/>
> <meta name="cp:revision" content="5"/>
> <meta name="meta:last-author" content="Nick Burch"/>
> <meta name="Slide-Count" content="1"/>
> <meta name="Last-Author" content="Nick Burch"/>
> <meta name="meta:save-date" content="2010-09-08T16:15:14Z"/>
> <meta name="Content-Length" content="202969"/>
> <meta name="subject" content="Gym class featuring a brown fox and lazy dog"/>
> <meta name="Application-Name" content="Microsoft Office PowerPoint"/>
> <meta name="Author" content="Nevin Nollop"/>
> <meta name="dcterms:created" content="1601-01-01T00:00:00Z"/>
> <meta name="Application-Version" content="12.0000"/>
> <meta name="date" content="2010-09-08T16:15:14Z"/>
> <meta name="Total-Time" content="2"/>
> <meta name="extended-properties:Template" content=""/>
> <meta name="publisher" content=""/>
> <meta name="creator" content="Nevin Nollop"/>
> <meta name="Word-Count" content="9"/>
> <meta name="meta:paragraph-count" content="1"/>
> <meta name="extended-properties:AppVersion" content="12.0000"/>
> <meta name="Creation-Date" content="1601-01-01T00:00:00Z"/>
> <meta name="meta:author" content="Nevin Nollop"/>
> <meta name="cp:subject" content="Gym class featuring a brown fox and lazy dog"/>
> <meta name="extended-properties:Application" content="Microsoft Office PowerPoint"/>
> <meta name="resourceName" content="testPPT_embeded.pptx"/>
> <meta name="Paragraph-Count" content="1"/>
> <meta name="dc:title" content="The quick brown fox jumps over the lazy dog"/>
> <meta name="Last-Save-Date" content="2010-09-08T16:15:14Z"/>
> <meta name="custom:Version" content="1"/>
> <meta name="Revision-Number" content="5"/>
> <meta name="Last-Printed" content="1601-01-01T00:00:00Z"/>
> <meta name="meta:print-date" content="1601-01-01T00:00:00Z"/>
> <meta name="meta:creation-date" content="1601-01-01T00:00:00Z"/>
> <meta name="dcterms:modified" content="2010-09-08T16:15:14Z"/>
> <meta name="Template" content=""/>
> <meta name="dc:creator" content="Nevin Nollop"/>
> <meta name="meta:word-count" content="9"/>
> <meta name="extended-properties:Company" content=""/>
> <meta name="Last-Modified" content="2010-09-08T16:15:14Z"/>
> <meta name="extended-properties:PresentationFormat" content="On-screen Show (4:3)"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
> <meta name="modified" content="2010-09-08T16:15:14Z"/>
> <meta name="xmpTPg:NPages" content="1"/>
> <meta name="extended-properties:TotalTime" content="2"/>
> <meta name="dc:publisher" content=""/>
> <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
> <meta name="Presentation-Format" content="On-screen Show (4:3)"/>
> <title>The quick brown fox jumps over the lazy dog</title>
> </head>
> <body><p>The quick brown fox jumps over the lazy dog</p>
> <div class="embedded" id="slide1_rId4"/>
> <div class="embedded" id="slide1_rId5"/>
> <div class="embedded" id="slide1_rId6"/>
> <div class="embedded" id="slide1_rId7"/>
> <div class="embedded" id="slide1_rId8"/>
> <div class="embedded" id="slide1_rId9"/>
> <div class="embedded" id="thumbnail_0.jpeg"/><div class="package-entry"><h1>thumbnail_0.jpeg</h1></div></body></html>
> {code}
> The extracted plain text looks like this (using tika-app):
> {code}
> The quick brown fox jumps over the lazy dog
> thumbnail_0.jpeg
> {code}
> The fix is trivial - change the flag in AbstractOOXMLExtractor:158 to false.
> I think also that the id attribute should be set to the real thumbnail path
> within the package (i.e. tPart.getPartName().getName()) instead of the
> artificially created sequential name.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
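
For readers who want to see why a single boolean changes the extracted text, here is a minimal self-contained sketch. It is not Tika's code: the class ThumbnailFlagSketch and the methods emitEmbedded and bodyText are hypothetical names, and the sketch only assumes, based on the output quoted in the issue, that an extractor called with outputHtml set to true writes a package-entry div containing the resource name as an h1. It uses only the SAX classes shipped with the JDK.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class ThumbnailFlagSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Hypothetical stand-in for an embedded-document extractor (assumption, not Tika's code).
    // When outputHtml is true it emits <div class="package-entry"><h1>name</h1></div>.
    public static void emitEmbedded(ContentHandler handler, String name, boolean outputHtml)
            throws SAXException {
        if (outputHtml) {
            AttributesImpl attrs = new AttributesImpl();
            attrs.addAttribute("", "class", "class", "CDATA", "package-entry");
            handler.startElement(XHTML, "div", "div", attrs);
            handler.startElement(XHTML, "h1", "h1", new AttributesImpl());
            handler.characters(name.toCharArray(), 0, name.length());
            handler.endElement(XHTML, "h1", "h1");
            handler.endElement(XHTML, "div", "div");
        }
        // A real extractor would go on to parse the embedded stream here,
        // whatever the value of the flag; this sketch skips that step.
    }

    // Collects the character data a plain-text handler would see for one embedded part.
    public static String bodyText(String name, boolean outputHtml) {
        final StringBuilder text = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            emitEmbedded(collector, name, outputHtml);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println("outputHtml=true  -> [" + bodyText("thumbnail_0.jpeg", true) + "]");
        System.out.println("outputHtml=false -> [" + bodyText("thumbnail_0.jpeg", false) + "]");
    }
}
```

In this model the thumbnail name reaches the text handler only when the flag is true, which mirrors the stray thumbnail_0.jpeg line in the tika-app output quoted above. With the flag set to false, nothing is written to the body.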