Sam H created TIKA-1841:
---------------------------

             Summary: Different XML output structure for PPT and PPTX
                 Key: TIKA-1841
                 URL: https://issues.apache.org/jira/browse/TIKA-1841
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.11
            Reporter: Sam H


This issue is loosely related to TIKA-1840.

I've noticed that the XML output structure for PowerPoint (PPT) and PPTX files 
is different. 

The structure for PPTX appears to be as follows:
{code}
<div class="slide-content"></div>
<div class="slide-master-content" />
<div class="slide-notes"></div> //optional
<div class="slide-comment"></div> //optional
...
<div class="slide-content"></div>
<div class="slide-master-content" />
<div class="slide-notes"></div> //optional
<div class="slide-comment"></div> //optional
{code}

Note that there's no parent slide element to indicate the start and end of each 
slide.

For PPT, the structure is as follows:
{code}
<div class="slideShow">
  <div class="slide">
    <div class="slide-master-content"></div>
    <div class="slide-content"></div>
    <div class="slide-notes"></div> //added in TIKA-1840
    <div class="slide-comment"></div> 
  </div>
  ...
  <div class="slide">
    <div class="slide-master-content"></div>
    <div class="slide-content"></div>
    <div class="slide-notes"></div> //added in TIKA-1840
    <div class="slide-comment"></div>
  </div>
</div>
<div class="slide-notes"></div> //list of notes at the end
{code}

In my application, I'm using XPath to extract the desired information. Because 
the XML structure differs, I have to write different XPath queries depending on 
whether the file is PPT (old) or PPTX (new). It would be nice for Tika to return 
the same XML structure for both.
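
To illustrate the difference, here is a minimal sketch of fetching the notes of 
slide N under each layout. The class name, the method name, and the two sample 
documents are hypothetical; the samples are simplified stand-ins for the 
structures shown above and leave out the XHTML namespace and wrapper elements 
that real Tika output has.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class SlideNotesXPath {
    // Hypothetical sample mimicking the nested PPT layout (no namespaces).
    static final String PPT_LIKE =
        "<body><div class=\"slideShow\">"
        + "<div class=\"slide\">"
        + "<div class=\"slide-master-content\"></div>"
        + "<div class=\"slide-content\">A</div>"
        + "<div class=\"slide-notes\">n1</div>"
        + "</div>"
        + "<div class=\"slide\">"
        + "<div class=\"slide-master-content\"></div>"
        + "<div class=\"slide-content\">B</div>"
        + "</div>"
        + "</div></body>";

    // Hypothetical sample mimicking the flat PPTX layout (no parent slide element).
    static final String PPTX_LIKE =
        "<body>"
        + "<div class=\"slide-content\">A</div>"
        + "<div class=\"slide-master-content\"/>"
        + "<div class=\"slide-notes\">n1</div>"
        + "<div class=\"slide-content\">B</div>"
        + "<div class=\"slide-master-content\"/>"
        + "</body>";

    // Returns the notes text of the given 1-based slide, or "" if it has none.
    static String notesForSlide(String xml, int slide, boolean flatPptxLayout) {
        String expr = flatPptxLayout
            // Flat layout: a notes div belongs to slide N if exactly N
            // slide-content siblings precede it.
            ? "//div[@class='slide-notes']"
                + "[count(preceding-sibling::div[@class='slide-content'])=" + slide + "]"
            // Nested layout: select the Nth slide div, then its notes child.
            : "(//div[@class='slide'])[" + slide + "]/div[@class='slide-notes']";
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(notesForSlide(PPT_LIKE, 1, false));
        System.out.println(notesForSlide(PPTX_LIKE, 1, true));
    }
}
```

With a parent slide element, the query is a plain child lookup; without one, 
the slide boundaries have to be reconstructed by counting siblings.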

I would propose changing the XML structure to this:

{code}
<div class="slideShow">
  <div class="slide">
    <div class="slide-master-content"></div>
    <div class="slide-content"></div>
    <div class="slide-notes"></div> //added in TIKA-1840
    <div class="slide-comment"></div> 
  </div>
  ...
  <div class="slide">
    <div class="slide-master-content"></div>
    <div class="slide-content"></div>
    <div class="slide-notes"></div> //added in TIKA-1840
    <div class="slide-comment"></div>
  </div>
</div>
{code}

So, essentially, like the current PPT output, but without the list of notes at 
the end (as this is also omitted for PPTX).

On the one hand, this unifies PPT(X) handling; on the other hand, it could break 
existing (external) code that relies on the current XML output format.

I don't know whether this is something the project wants changed. If so, I'm 
willing to contribute my time.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
