[ https://issues.apache.org/jira/browse/TIKA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177637#comment-15177637 ]
ASF GitHub Bot commented on TIKA-1841:
--------------------------------------

GitHub user zetisam opened a pull request:

    https://github.com/apache/tika/pull/86

    fix for TIKA-1841 contributed by zetisam

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zetisam/tika TIKA-1841

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/86.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #86

----
commit ea82d8538dbd7a1f68d4d290ad0c115f62b29c76
Author: Sam Heijens <sam.heij...@zeticon.com>
Date:   2016-02-15T15:09:51Z

    fix for TIKA-1841 contributed by zetisam

----

> Different XML output structure for PPT and PPTX
> -----------------------------------------------
>
>                 Key: TIKA-1841
>                 URL: https://issues.apache.org/jira/browse/TIKA-1841
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Sam H
>
> Issue is slightly related to TIKA-1840
> I've noticed that the XML structure of PowerPoint (PPT) and PPTX files is
> different.
> The structure for PPTX seems as follows:
> {code}
> <div class="slide-content"></div>
> <div class="slide-master-content" />
> <div class="slide-notes"></div> //optional
> <div class="slide-comment"></div> //optional
> ...
> <div class="slide-content"></div>
> <div class="slide-master-content" />
> <div class="slide-notes"></div> //optional
> <div class="slide-comment"></div> //optional
> {code}
> Note that there's no parent slide element to indicate the start and end of
> each slide.
> For PowerPoint the structure is as follows:
> {code}
> <div class="slideShow">
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div>
>   </div>
>   ...
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div>
>   </div>
> </div>
> <div class="slide-notes">
> {code}
> In my application, I'm using XPath to get the desired information. As the
> XML structure is different, I have to differentiate my XPath queries whether
> the file is PPT (old) or PPTX (new). It would be nice for Tika to return the
> same XML for both.
> I would propose changing the XML structure to this:
> {code}
> <div class="slideShow">
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div>
>   </div>
>   ...
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div>
>   </div>
> </div>
> {code}
> So, essentially, like the current PPT output, but without the list of notes
> at the end (as this is also omitted for PPTX).
> On the one hand this generalizes PPT(X) handling, on the other it can break
> existing (external) functionality relying on a specific XML output format.
> I don't know if this is something the project wants fixed or not. If so, I'm
> willing to donate my time.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
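
For illustration, the per-format branching the reporter describes might look roughly like the sketch below. It is not taken from the pull request or from Tika itself: the sample documents are simplified, un-namespaced stand-ins for the two layouts quoted in the issue (real Tika output is namespaced XHTML with more markup), and the `slides` helper and its grouping rule for the flat PPTX layout are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Simplified stand-ins for the two layouts quoted in the issue (assumption:
# no XHTML namespace, and only the slide-related divs are present).
PPTX_BODY = """<body>
<div class="slide-content">A</div>
<div class="slide-master-content" />
<div class="slide-notes">note A</div>
<div class="slide-content">B</div>
<div class="slide-master-content" />
</body>"""

PPT_BODY = """<body>
<div class="slideShow">
<div class="slide">
<div class="slide-master-content"></div>
<div class="slide-content">A</div>
<div class="slide-notes">note A</div>
</div>
<div class="slide">
<div class="slide-master-content"></div>
<div class="slide-content">B</div>
</div>
</div>
</body>"""


def slides(xml_text):
    """Return one {class: text} dict per slide for either layout (hypothetical helper)."""
    body = ET.fromstring(xml_text)
    wrapped = body.findall(".//div[@class='slide']")
    if wrapped:
        # PPT layout: every slide has its own parent element, so one query suffices.
        return [
            {d.get("class"): (d.text or "") for d in slide.findall("div")}
            for slide in wrapped
        ]
    # PPTX layout: no parent element, so a new slide is assumed to start
    # at each slide-content div and to run until the next one.
    result = []
    for d in body.findall("div"):
        if d.get("class") == "slide-content" or not result:
            result.append({})
        result[-1][d.get("class")] = d.text or ""
    return result


if __name__ == "__main__":
    print(slides(PPT_BODY))
    print(slides(PPTX_BODY))
```

With the structure proposed in the issue, the second branch would no longer be needed, since both formats would expose a `slide` parent element per slide.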