[ https://issues.apache.org/jira/browse/TIKA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126541#comment-15126541 ]
Tim Allison commented on TIKA-1841:
-----------------------------------

Y, I was hoping to get here on earlier work with ppt/pptx...but I was afraid of breaking the backwards compatibility. If we aren't concerned with that too much on this issue, y, let's go forth and make the output structure as equal as possible.

Thank you!

> Different XML output structure for PPT and PPTX
> -----------------------------------------------
>
>                 Key: TIKA-1841
>                 URL: https://issues.apache.org/jira/browse/TIKA-1841
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Sam H
>
> Issue is slightly related to TIKA-1840
> I've noticed that the XML structure of Powerpoint (PPT) and PPTX files is
> different.
> The structure for PPTX seems as follows:
> {code}
> <div class="slide-content"></div>
> <div class="slide-master-content" />
> <div class="slide-notes"></div> //optional
> <div class="slide-comment"></div> //optional
> ...
> <div class="slide-content"></div>
> <div class="slide-master-content" />
> <div class="slide-notes"></div> //optional
> <div class="slide-comment"></div> //optional
> {code}
> Note that there's no parent slide element to indicate the start and end of
> each slide.
> For powerpoint the structure is as follows:
> {code}
> <div class="slideShow">
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div>
>   </div>
>   ...
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div>
>   </div>
> </div>
> <div class="slide-notes">
> {code}
> In my application, I'm using XPath to get the desired information. As the
> XML structure is different, I have to differentiate my XPath queries whether
> the file is PPT (old) or PPTX (new). It would be nice for Tika to return the
> same XML for both.
> I would propose changing the XML structure to this:
> {code}
> <div class="slideShow">
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div>
>   </div>
>   ...
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div>
>   </div>
> </div>
> {code}
> So, essentially, like the current PPT output, but without the list of notes
> at the end (as this is also omitted for PPTX).
> On the one hand this generalizes PPT(X) handling, on the other it can break
> existing (external) functionality relying on a specific XML output format.
> I don't know if this is something the project wants fixed or not. If so, I'm
> willing to donate my time.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
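
For illustration only, here is a rough sketch of how a single XPath-style query could serve both PPT and PPTX if the proposed structure were adopted. It is not Tika code: the sample document, the `PARTS` tuple and the `slides` helper are made up for this example, and the XHTML namespace that real Tika output carries is left out to keep the paths short.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample following the structure proposed in the issue.
# Real Tika output is namespaced XHTML; the namespace is omitted here for brevity.
SAMPLE = """
<body>
  <div class="slideShow">
    <div class="slide">
      <div class="slide-master-content"></div>
      <div class="slide-content">First slide</div>
      <div class="slide-notes">Note for slide one</div>
      <div class="slide-comment"></div>
    </div>
    <div class="slide">
      <div class="slide-master-content"></div>
      <div class="slide-content">Second slide</div>
    </div>
  </div>
</body>
"""

# Class names taken from the issue description.
PARTS = ("slide-master-content", "slide-content", "slide-notes", "slide-comment")


def slides(xml_text):
    """Return one dict per slide, mapping each part's class name to its text."""
    root = ET.fromstring(xml_text)
    result = []
    # One path works for every slide because each slide has its own parent div.
    for slide in root.findall(".//div[@class='slideShow']/div[@class='slide']"):
        entry = {}
        for part in PARTS:
            node = slide.find(f"div[@class='{part}']")
            # Optional parts (notes, comments) may be absent; record them as None.
            entry[part] = None if node is None else "".join(node.itertext()).strip()
        result.append(entry)
    return result


if __name__ == "__main__":
    for number, entry in enumerate(slides(SAMPLE), start=1):
        print(number, entry["slide-content"], entry["slide-notes"])
```

With the current PPTX output there is no per-slide parent element, so a caller would instead have to group sibling divs by position, which is the format-specific branching the issue is trying to remove.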