[ 
https://issues.apache.org/jira/browse/TIKA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115857#comment-15115857
 ] 

Nick Burch commented on TIKA-1841:
----------------------------------

I think it would be good to have the PPT and PPTX parsers return XHTML that is 
as close to identical as we can reasonably get for equivalent input files.

Looking at the XHTML examples given, wrapping things up in a per-slide block 
seems more sensible and useful to me.

We do try to avoid making breaking changes where we can. Here, though, the only 
way I can think of to avoid one is to duplicate all the text and the markup, 
which would be an even more breaking change. So it seems that our best bet 
would be to rationalise the output and warn in the changelog.

I think we should have some test PowerPoint files, each with both a .ppt and a 
.pptx version. It might be good if we could write a unit test that verifies 
that the two parsers correctly do the slide -> contents + slide -> notes 
markup, and that both produce the same output. Any chance you'd be able to 
write that?
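
As a rough idea of the structural half of such a test, here is a minimal 
sketch using only the JDK's XPath support. The two XHTML strings are 
hypothetical stand-ins for what each parser might emit for the same deck (they 
are not real Tika output and omit the XHTML namespace), and the class and 
method names are made up for illustration. A real test would parse the paired 
.ppt and .pptx files through Tika instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SlideStructureCheck {

    // Hypothetical sample output for a two-slide deck; the second slide has notes.
    static final String PPT_XHTML =
        "<div class=\"slideShow\">"
      + "<div class=\"slide\">"
      + "<div class=\"slide-master-content\"></div>"
      + "<div class=\"slide-content\">Title</div>"
      + "</div>"
      + "<div class=\"slide\">"
      + "<div class=\"slide-master-content\"></div>"
      + "<div class=\"slide-content\">Agenda</div>"
      + "<div class=\"slide-notes\">Remember the demo</div>"
      + "</div>"
      + "</div>";

    // Assumed to be identical in structure once both parsers are aligned.
    static final String PPTX_XHTML = PPT_XHTML;

    /**
     * Returns one entry per slide: the class names of the slide's child divs,
     * in document order, joined with commas.
     */
    static List<String> slideOutline(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList slides = (NodeList) xpath.evaluate(
                "//div[@class='slide']", doc, XPathConstants.NODESET);
            List<String> outline = new ArrayList<>();
            for (int i = 0; i < slides.getLength(); i++) {
                // Select the class attribute of each direct child div of this slide.
                NodeList classes = (NodeList) xpath.evaluate(
                    "div/@class", slides.item(i), XPathConstants.NODESET);
                List<String> names = new ArrayList<>();
                for (int j = 0; j < classes.getLength(); j++) {
                    names.add(classes.item(j).getNodeValue());
                }
                outline.add(String.join(",", names));
            }
            return outline;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read XHTML", e);
        }
    }

    public static void main(String[] args) {
        List<String> ppt = slideOutline(PPT_XHTML);
        List<String> pptx = slideOutline(PPTX_XHTML);
        System.out.println("PPT outline:  " + ppt);
        System.out.println("PPTX outline: " + pptx);
        System.out.println("Same structure: " + ppt.equals(pptx));
    }
}
```

Comparing per-slide outlines rather than raw strings keeps the check focused 
on the slide -> contents and slide -> notes nesting, so small differences in 
whitespace or text extraction would not fail it.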

Let's give it a few more days for everyone else interested to review and 
comment on this before we finalise the XHTML representation for PowerPoint 
slideshows and update the parsers to match.

> Different XML output structure for PPT and PPTX
> -----------------------------------------------
>
>                 Key: TIKA-1841
>                 URL: https://issues.apache.org/jira/browse/TIKA-1841
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Sam H
>
> This issue is slightly related to TIKA-1840.
> I've noticed that the XML structure of the output for PowerPoint (PPT) and 
> PPTX files is different. 
> The structure for PPTX seems as follows:
> {code}
> <div class="slide-content"></div>
> <div class="slide-master-content" />
> <div class="slide-notes"></div> //optional
> <div class="slide-comment"></div> //optional
> ...
> <div class="slide-content"></div>
> <div class="slide-master-content" />
> <div class="slide-notes"></div> //optional
> <div class="slide-comment"></div> //optional
> {code}
> Note that there's no parent slide element to indicate the start and end of 
> each slide.
> For PPT the structure is as follows:
> {code}
> <div class="slideShow">
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div> 
>   </div>
>   ...
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div>
>   </div>
> </div>
> <div class="slide-notes"></div> //list of notes at the end
> {code}
> In my application, I'm using XPath to get the desired information. As the 
> XML structure is different, I have to use different XPath queries depending 
> on whether the file is PPT (old) or PPTX (new). It would be nice for Tika to 
> return the same XML for both.
> I would propose changing the XML structure to this:
> {code}
> <div class="slideShow">
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div> 
>   </div>
>   ...
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div>
>   </div>
> </div>
> {code}
> So, essentially, like the current PPT output, but without the list of notes 
> at the end (as this is also omitted for PPTX).
> On the one hand this generalizes PPT(X) handling, on the other it can break 
> existing (external) functionality relying on a specific XML output format.
> I don't know if this is something the project wants fixed or not. If so, I'm 
> willing to donate my time.
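
To illustrate the XPath use case described in the issue, here is a minimal 
sketch of a single query that would work for both formats if the proposed 
structure were adopted. The sample XHTML is hypothetical (not real Tika output, 
and without the XHTML namespace), and the class and method names are invented 
for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SlideNotesQuery {

    // Hypothetical output in the proposed structure: two slides, notes on the second.
    static final String SAMPLE_XHTML =
        "<div class=\"slideShow\">"
      + "<div class=\"slide\">"
      + "<div class=\"slide-content\">Title</div>"
      + "</div>"
      + "<div class=\"slide\">"
      + "<div class=\"slide-content\">Agenda</div>"
      + "<div class=\"slide-notes\">Remember the demo</div>"
      + "</div>"
      + "</div>";

    /**
     * Returns the notes text of the given slide (1-based), or an empty string
     * if that slide does not exist or has no notes.
     */
    static String notesForSlide(String xhtml, int slideNumber) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // The per-slide wrapper lets one expression address a slide by position.
            String expr = "//div[@class='slideShow']/div[@class='slide'][" + slideNumber + "]"
                + "/div[@class='slide-notes']";
            return xpath.evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Slide 1 notes: [" + notesForSlide(SAMPLE_XHTML, 1) + "]");
        System.out.println("Slide 2 notes: [" + notesForSlide(SAMPLE_XHTML, 2) + "]");
    }
}
```

Without a parent slide element, as in the current PPTX output, there is no 
element to index by position, so the notes for a given slide cannot be 
addressed this directly.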



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
