[ https://issues.apache.org/jira/browse/TIKA-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14938273#comment-14938273 ]

Tim Allison commented on TIKA-1755:
-----------------------------------

The current patch gives us this output for PPTX:
{noformat}
<body><div class="slide-content"><table><tr>    <td>Row 1 Col 1</td>    <td>Row 1 Col 2</td>    <td>Row 1 Col 3</td></tr>
<tr>    <td>Row 2 Col 1</td>    <td>Row 2 Col 2</td>    <td>Row 2 Col 3</td></tr>
</table>
<p>Here is a text box</p>
<p>Footnote appears here[1]</p>
<p>Bold italic underline superscript subscript</p>
<p>Here is a list:</p>
<p>Bullet 1</p>
<p>Bullet 2</p>
<p>Bullet 3</p>
<p>Here is a numbered list:</p>
<p>Number bullet 1</p>
<p>Number bullet 2</p>
<p>Number bullet 3</p>
<p> Keyword1 Keyword2</p>
<p>This is a hyperlink</p>
<p> Subject is here</p>
<p>Suddenly some Japanese text:</p>
<p>????????????</p>
<p>?????</p>
<p>And then some Gothic text:</p>
<p>??????</p>
<p>Here is a citation:</p>
<p>(Kramer)</p>
<p>Figure 1 This is a caption for Figure 1</p>
<p>
</p>
<p>Row 1 column 1</p>
<p>Row 2 column 1</p>
<p>Row 1 column 2</p>
<p>Row 2 column 2</p>
<p>
</p>
<p>
</p>
<p>[1] This is a footnote.</p>
</div>
<div class="slide-master-content" />
<div class="slide-notes"><p>1</p>
<p>This is the footer text.</p>
<p>This is the header text.</p>
</div>
<div class="embedded" id="/docProps/thumbnail.jpeg" /></body></html>
{noformat} 
and this for PPT:
{noformat}
<body><div class="slideShow"><div class="slide"><div class="slide-master-content" />
<div class="slide-content"><p />
<p />
<p />
<p>Here is a text box</p>
<p />
<p>Footnote appears here[1]</p>
<p>Bolditalicunderlinesuperscriptsubscript</p>
<p>Here is a list:</p>
<p>Bullet 1</p>
<p>Bullet 2</p>
<p>Bullet 3</p>
<p>Here is a numbered list:</p>
<p>Number bullet 1</p>
<p>Number bullet 2</p>
<p>Number bullet 3</p>
<p>Keyword1 Keyword2</p>
<p>This is a hyperlink</p>
<p>Subject is here</p>
<p>Suddenly some Japanese text:</p>
<p>????????????</p>
<p>?????</p>
<p>And then some Gothic text:</p>
<p>??????</p>
<p>Here is a citation:</p>
<p>(Kramer)</p>
<p>Figure 1 This is a caption for Figure 1</p>
<p />
<p>Row 1 column 1</p>
<p>Row 2 column 1</p>
<p>Row 1 column 2</p>
<p>Row 2 column 2</p>
<p />
<p />
<p />
<p>[1]This is a footnote.</p>
</div>
<table><tr>     <td>Row 1 Col 1</td>    <td>Row 1 Col 2</td>    <td>Row 1 Col 3</td></tr>
<tr>    <td>Row 2 Col 1</td>    <td>Row 2 Col 2</td>    <td>Row 2 Col 3</td></tr>
</table>
</div>
</div>
<div class="slide-notes"><p />
<p>*</p>
<p>This is the footer text.</p>
<p>This is the header text.</p>
</div>
</body></html>
{noformat}
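
The two outputs above still differ in a few paragraphs (for example, PPT runs "Bolditalicunderlinesuperscriptsubscript" together, and the footnote spacing differs). A rough way to list such differences is to collect the non-empty <p> text from each output and compare the two lists. The sketch below is only an illustration: the class and method names (ParagraphDiff, paragraphs, onlyIn) are hypothetical and not part of Tika, it uses only the JDK XML parser, and it assumes each input is a complete, well-formed XHTML string (the excerpts above are cut off at the top, so they would need their opening tags restored first).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika: compares <p> text in two XHTML outputs.
class ParagraphDiff {

    // Collect the text of every non-empty <p>, with whitespace runs collapsed.
    static List<String> paragraphs(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        NodeList nodes = doc.getElementsByTagName("p");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            String text = nodes.item(i).getTextContent().trim().replaceAll("\\s+", " ");
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        return out;
    }

    // Paragraphs present in `left` but missing from `right`.
    static List<String> onlyIn(List<String> left, List<String> right) {
        List<String> diff = new ArrayList<>(left);
        diff.removeAll(right);
        return diff;
    }

    public static void main(String[] args) throws Exception {
        // Shortened stand-ins for the PPTX and PPT outputs shown above.
        String pptx = "<body><div class=\"slide-content\">"
                + "<p>Here is a text box</p><p>\n</p>"
                + "<p>Bold italic underline superscript subscript</p></div></body>";
        String ppt = "<body><div class=\"slide-content\"><p />"
                + "<p>Here is a text box</p>"
                + "<p>Bolditalicunderlinesuperscriptsubscript</p></div></body>";
        System.out.println("Only in PPTX: " + onlyIn(paragraphs(pptx), paragraphs(ppt)));
        System.out.println("Only in PPT:  " + onlyIn(paragraphs(ppt), paragraphs(pptx)));
    }
}
```

Empty paragraphs are dropped on purpose, so this only flags differences in text content, not in the number of blank <p> elements each parser emits.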

> Make ppt and pptx paragraph/div breaks more consistent
> ------------------------------------------------------
>
>                 Key: TIKA-1755
>                 URL: https://issues.apache.org/jira/browse/TIKA-1755
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>
> In working on [~kiwiwings]'s patch for the new handling of PPT/X, I found 
> that our PPT/PPTX parsers behave very differently with <p> and <div> breaks, 
> especially now that we've applied the upgrades from TIKA-1707.
> I propose adding quite a few more <p> to capture the sentence/bullet level 
> breaks in PPTX as we're now doing for PPT.
> There are a handful of other things that we could clean up (table handling) 
> as well.
> Some of these changes may be relevant to this 
> [discussion|http://mail-archives.apache.org/mod_mbox/tika-dev/201306.mbox/%3ccal8pwky96_gkjmps6zxuoe7h7-byvpxjbktbuy1goku3skz...@mail.gmail.com%3E].
>   [~shaie], any input?
> Patch and example output to follow.



