[ 
https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112875#comment-13112875
 ] 

Michael McCandless commented on TIKA-712:
-----------------------------------------

Maybe, until we work this out, we should turn off extracting anything
from the master slides?  Chris is about to build the release bits for
0.10...

So I did some sleuthing.  This is all new to me, so this is really
just speculation, but I think I learned a few things:

  * Each slide refers to a slideLayouts/slideLayoutN.xml, from the
    _rels/slideN.xml.rels file.

  * In turn, each slideLayoutN.xml refers to a
    slideMasters/slideMasterN.xml, from the
    _rels/slideLayoutN.xml.rels file.

  * Simply editing footer text on the slide's master is not sufficient
    to see that text on the slide; you must also go to Insert ->
    Header & Footer and check the box to display footer/slide
    number/date and time.

  * If I enable footers like that, slideN.xml itself includes the
    footer text, so I'm not sure why Tika didn't see it before we
    changed anything.

  * If, instead, I go to the slide master and manually insert my own
    text box, it comes through on the slides.  However, Tika (current
    trunk) fails to extract this text for the slide even though
    PowerPoint renders it... so we are still missing something here,
    maybe because we only process the master for the slide and not
    its layout?

  * That manually inserted element has a unique {{<p:nvPr
    userDrawn="1"/>}} under p:sp -> p:nvSpPr... maybe POI/Tika can
    interpret that to mean "include this text".

  * I suspect the p:ph element (under p:sp -> p:nvSpPr -> p:nvPr) may
    be important here... it seems to specify the "type" of the
    element, and it seems to be included in all the "boilerplate"
    elements but NOT in the new element I added to the master.  You
    can see it in my examples above (type="ftr" and type="title").
    Maybe POI/Tika can interpret the presence of this p:ph element
    to mean that text should not be included in the slide?
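
To make the p:ph idea concrete, here is a minimal sketch of the
heuristic, assuming it holds up: keep the text of any p:sp on the
master that has no p:ph placeholder, and skip the rest.  It uses only
the JDK's DOM parser, not POI or Tika; the class and method names
(MasterShapeFilter, nonPlaceholderTexts) are made up for illustration,
and whether this rule is actually correct is exactly the open question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of POI or Tika.
class MasterShapeFilter {
    static final String P_NS =
        "http://schemas.openxmlformats.org/presentationml/2006/main";
    static final String A_NS =
        "http://schemas.openxmlformats.org/drawingml/2006/main";

    /**
     * Returns the text of each p:sp in the given slide master XML that
     * has no p:ph placeholder element (assumed here to mean the shape
     * was drawn by the user and should show up on the slides).
     */
    public static List<String> nonPlaceholderTexts(String masterXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(masterXml)));

            List<String> texts = new ArrayList<>();
            NodeList shapes = doc.getElementsByTagNameNS(P_NS, "sp");
            for (int i = 0; i < shapes.getLength(); i++) {
                Element shape = (Element) shapes.item(i);
                // p:ph sits under p:sp -> p:nvSpPr -> p:nvPr; a descendant
                // search is enough to detect it inside a single shape.
                if (shape.getElementsByTagNameNS(P_NS, "ph").getLength() > 0) {
                    continue; // boilerplate placeholder (ftr, title, ...)
                }
                // Concatenate all a:t text runs of the shape.
                StringBuilder sb = new StringBuilder();
                NodeList runs = shape.getElementsByTagNameNS(A_NS, "t");
                for (int j = 0; j < runs.getLength(); j++) {
                    sb.append(runs.item(j).getTextContent());
                }
                if (sb.length() > 0) {
                    texts.add(sb.toString());
                }
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse master XML", e);
        }
    }
}
```

Given a master with one type="ftr" placeholder and one text box marked
userDrawn="1", this should return only the text box's text.  It
deliberately ignores the userDrawn attribute; checking that instead
would be the other candidate rule from the list above.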

I'm not yet sure how to boil this all down to something POI/Tika can
concretely use to identify what should be included and what should
not, but it seems like progress...
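
Separately, the slide -> layout -> master chain from the first two
bullets can be followed mechanically by reading the .rels parts.  This
is only a rough sketch using the JDK's DOM parser, not how POI resolves
relationships; RelsChain and targetOf are hypothetical names, and the
.rels XML is assumed to have already been read out of the package into
a string.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of POI or Tika.
class RelsChain {
    static final String RELS_NS =
        "http://schemas.openxmlformats.org/package/2006/relationships";

    /**
     * Returns the Target of the first Relationship whose Type ends with
     * "/" + kind (for example "slideLayout" or "slideMaster"), or null
     * if the .rels part has no such relationship.
     */
    public static String targetOf(String relsXml, String kind) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(relsXml)));

            NodeList rels = doc.getElementsByTagNameNS(RELS_NS, "Relationship");
            for (int i = 0; i < rels.getLength(); i++) {
                Element rel = (Element) rels.item(i);
                if (rel.getAttribute("Type").endsWith("/" + kind)) {
                    return rel.getAttribute("Target");
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse .rels XML", e);
        }
    }
}
```

Calling targetOf on _rels/slideN.xml.rels with "slideLayout" gives the
layout part, and calling it on that layout's .rels with "slideMaster"
gives the master.  The returned targets are relative paths, so they
still need resolving against the referring part's location.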


> Master slide text isn't extracted
> ---------------------------------
>
>                 Key: TIKA-712
>                 URL: https://issues.apache.org/jira/browse/TIKA-712
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>         Attachments: TIKA-712-master-slide.xml, TIKA-712.patch, 
> testPPT_masterFooter.ppt, testPPT_masterFooter.pptx, 
> testPPT_masterFooter2.ppt, testPPT_masterFooter2.pptx
>
>
> It looks like we are not getting text from the master slide for PPT
> and PPTX.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira