[ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112875#comment-13112875 ]
Michael McCandless commented on TIKA-712:
-----------------------------------------

Maybe, until we work this out, we should turn off extracting anything from the master slides?  Chris is about to build the release bits for 0.10...

So I did some sleuthing.  This is all new to me so this is really just speculative but I think I learned a few things:

  * Each slide refers to a slideLayouts/slideLayoutN.xml, from the _rels/slideN.xml.rels file.

  * In turn, each slideLayoutN.xml refers to a slideMaster/slideMasterN.xml, from the _rels/slideLayoutN.xml.rels file.

  * Simply editing footer text on the slide's master is not sufficient to see that text on the slide; you must also go to Insert -> Header & Footer and check the box to display footer/slide number/date and time.

  * If I enable footers like that, the slideN.xml actually includes the footer text; now, I'm not sure why Tika didn't see this before we changed anything.

  * If, instead, I go to the slide master and manually insert my own text box, then it comes through on the slides, however Tika (current trunk) fails to extract this onto the slide even though PowerPoint renders it... so we are still missing something here, maybe because we only render the master for the slide and not its layout?

  * That manually inserted element has a unique {{<p:nvPr userDrawn="1"/>}} under p:sp -> p:nvSpPr... maybe POI/Tika can interpret that to mean "include this text".

  * I suspect the p:ph element (under p:sp -> p:nvSpPr -> p:nvPr) may be important here... it seems to specify the "type" of the element, and it seems to be included in all the "boilerplate" elements but NOT in the new element I added to the master.  You can see it in my examples above (type="ftr" and type="title").  Maybe POI/Tika can interpret the presence of this p:ph element to mean that text should not be included in the slide?

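To make the last two points concrete, here is a minimal sketch of the heuristic in Python, using only the standard library. It is not what POI or Tika does today; it only encodes the guess above that a p:sp without a p:ph child is user-drawn text worth extracting. The function names and the sample master XML are made up for illustration (the sample is not taken from the attached files), and real masters may well contain cases this rule gets wrong.

```python
import xml.etree.ElementTree as ET

# Standard PresentationML / DrawingML namespace URIs.
NS = {
    "p": "http://schemas.openxmlformats.org/presentationml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
}

# Invented sample: one footer placeholder and one manually inserted text box.
SAMPLE_MASTER = """<p:sldMaster
    xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
    xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
  <p:cSld><p:spTree>
    <p:sp>
      <p:nvSpPr>
        <p:cNvPr id="2" name="Footer"/><p:cNvSpPr/>
        <p:nvPr><p:ph type="ftr"/></p:nvPr>
      </p:nvSpPr>
      <p:txBody><a:bodyPr/><a:p><a:r><a:t>Master footer</a:t></a:r></a:p></p:txBody>
    </p:sp>
    <p:sp>
      <p:nvSpPr>
        <p:cNvPr id="3" name="TextBox"/><p:cNvSpPr/>
        <p:nvPr userDrawn="1"/>
      </p:nvSpPr>
      <p:txBody><a:bodyPr/><a:p><a:r><a:t>Custom text box</a:t></a:r></a:p></p:txBody>
    </p:sp>
  </p:spTree></p:cSld>
</p:sldMaster>"""


def classify_master_shapes(master_xml):
    """Return (text, has_placeholder, user_drawn) for every p:sp element."""
    root = ET.fromstring(master_xml)
    results = []
    for sp in root.iter("{%s}sp" % NS["p"]):
        nv_pr = sp.find("p:nvSpPr/p:nvPr", NS)
        # p:ph marks a placeholder ("boilerplate") shape such as ftr or title.
        has_ph = nv_pr is not None and nv_pr.find("p:ph", NS) is not None
        # userDrawn is an XML boolean, so accept both "1" and "true".
        user_drawn = nv_pr is not None and nv_pr.get("userDrawn") in ("1", "true")
        # Concatenate all a:t text runs inside the shape.
        text = "".join(t.text or "" for t in sp.iter("{%s}t" % NS["a"]))
        results.append((text, has_ph, user_drawn))
    return results


def text_to_include(master_xml):
    """Hypothetical rule: keep text only from shapes that have no p:ph."""
    return [text for text, has_ph, _ in classify_master_shapes(master_xml)
            if not has_ph]


if __name__ == "__main__":
    for row in classify_master_shapes(SAMPLE_MASTER):
        print(row)
    print(text_to_include(SAMPLE_MASTER))
```

Filtering on the absence of p:ph rather than on userDrawn="1" is a judgment call: I only saw userDrawn on the one element I added, so I don't know how reliably it is written.
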
I'm not yet sure how to boil this all down to what POI/Tika can concretely use to identify what should be included and what should not but it seems like progress...

> Master slide text isn't extracted
> ---------------------------------
>
>                 Key: TIKA-712
>                 URL: https://issues.apache.org/jira/browse/TIKA-712
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>         Attachments: TIKA-712-master-slide.xml, TIKA-712.patch, testPPT_masterFooter.ppt, testPPT_masterFooter.pptx, testPPT_masterFooter2.ppt, testPPT_masterFooter2.pptx
>
>
> It looks like we are not getting text from the master slide for PPT and PPTX.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira