[ https://issues.apache.org/jira/browse/TIKA-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14938273#comment-14938273 ]
Tim Allison commented on TIKA-1755:
-----------------------------------

Current patch gets us this with PPTX:
{noformat}
<body><div class="slide-content"><table><tr> <td>Row 1 Col 1</td> <td>Row 1 Col 2</td> <td>Row 1 Col 3</td></tr>
<tr> <td>Row 2 Col 1</td> <td>Row 2 Col 2</td> <td>Row 2 Col 3</td></tr>
</table>
<p>Here is a text box</p>
<p>Footnote appears here[1]</p>
<p>Bold italic underline superscript subscript</p>
<p>Here is a list:</p>
<p>Bullet 1</p>
<p>Bullet 2</p>
<p>Bullet 3</p>
<p>Here is a numbered list:</p>
<p>Number bullet 1</p>
<p>Number bullet 2</p>
<p>Number bullet 3</p>
<p> Keyword1 Keyword2</p>
<p>This is a hyperlink</p>
<p> Subject is here</p>
<p>Suddenly some Japanese text:</p>
<p>????????????</p>
<p>?????</p>
<p>And then some Gothic text:</p>
<p>??????</p>
<p>Here is a citation:</p>
<p>(Kramer)</p>
<p>Figure 1 This is a caption for Figure 1</p>
<p> </p>
<p>Row 1 column 1</p>
<p>Row 2 column 1</p>
<p>Row 1 column 2</p>
<p>Row 2 column 2</p>
<p> </p>
<p> </p>
<p>[1] This is a footnote.</p>
</div>
<div class="slide-master-content" />
<div class="slide-notes"><p>1</p>
<p>This is the footer text.</p>
<p>This is the header text.</p>
</div>
<div class="embedded" id="/docProps/thumbnail.jpeg" /></body></html>
{noformat}

and this for PPT:
{noformat}
<body><div class="slideShow"><div class="slide"><div class="slide-master-content" />
<div class="slide-content"><p />
<p />
<p />
<p>Here is a text box</p>
<p />
<p>Footnote appears here[1]</p>
<p>Bolditalicunderlinesuperscriptsubscript</p>
<p>Here is a list:</p>
<p>Bullet 1</p>
<p>Bullet 2</p>
<p>Bullet 3</p>
<p>Here is a numbered list:</p>
<p>Number bullet 1</p>
<p>Number bullet 2</p>
<p>Number bullet 3</p>
<p>Keyword1 Keyword2</p>
<p>This is a hyperlink</p>
<p>Subject is here</p>
<p>Suddenly some Japanese text:</p>
<p>????????????</p>
<p>?????</p>
<p>And then some Gothic text:</p>
<p>??????</p>
<p>Here is a citation:</p>
<p>(Kramer)</p>
<p>Figure 1 This is a caption for Figure 1</p>
<p />
<p>Row 1 column 1</p>
<p>Row 2 column 1</p>
<p>Row 1 column 2</p>
<p>Row 2 column 2</p>
<p />
<p />
<p />
<p>[1]This is a footnote.</p>
</div>
<table><tr> <td>Row 1 Col 1</td> <td>Row 1 Col 2</td> <td>Row 1 Col 3</td></tr>
<tr> <td>Row 2 Col 1</td> <td>Row 2 Col 2</td> <td>Row 2 Col 3</td></tr>
</table>
</div>
</div>
<div class="slide-notes"><p />
<p>*</p>
<p>This is the footer text.</p>
<p>This is the header text.</p>
</div>
</body></html>
{noformat}

> Make ppt and pptx paragraph/div breaks more consistent
> ------------------------------------------------------
>
> Key: TIKA-1755
> URL: https://issues.apache.org/jira/browse/TIKA-1755
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
>
> In working on [~kiwiwings]'s patch for the new handling of PPT/X, I found
> that our PPT/PPTX parsers behave very differently with <p> and <div> breaks,
> especially now that we've applied the upgrades from TIKA-1707.
> I propose adding quite a few more <p> to capture the sentence/bullet level
> breaks in PPTX as we're now doing for PPT.
> There are a handful of other things that we could clean up (table handling)
> as well.
> Some of these changes may be relevant to this
> [discussion|http://mail-archives.apache.org/mod_mbox/tika-dev/201306.mbox/%3ccal8pwky96_gkjmps6zxuoe7h7-byvpxjbktbuy1goku3skz...@mail.gmail.com%3E].
> [~shaie], any input?
> Patch and example output to follow.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)