[ 
https://issues.apache.org/jira/browse/TIKA-735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-735:
------------------------------------

    Attachment: embeddedText.odp

ODP document that produces the text output shown in the description when run through TikaCLI -x.
                
> OpenOffice parser: embedded OLE docs are extracted at the end, as extra 
> <html>...</html>
> ----------------------------------------------------------------------------------------
>
>                 Key: TIKA-735
>                 URL: https://issues.apache.org/jira/browse/TIKA-735
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: embeddedText.odp
>
>
> When an OpenOffice presentation (ODP) embeds OLE objects (in this
> case, OpenOffice text), the text from the embedded objects comes out
> at the end of the output, after the text of all the slides.
> It's great that we are extracting the embedded text, but it'd be
> better if each embedded object's text were inlined on the slide that
> embedded it.
> I have a simple test ODP with two slides.  Each slide has its own
> text and also embeds a text OLE object with text of its own.  This is
> the output:
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?><html 
> xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="20970"/>
> <meta name="Content-Type" 
> content="application/vnd.oasis.opendocument.presentation"/>
> <meta name="resourceName" content="embeddedText.odp"/>
> <title/>
> </head>
> <body><div/>
> <div><p>Main text on page 1</p>
> </div>
> <object/><div><div/>
> </div>
> <div/>
> <div><ul>     <li><p>Main text on page 2</p>
> </li>
> </ul>
> </div>
> <object/><div><div/>
> </div>
> </body></html><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="20970"/>
> <meta name="Content-Type" 
> content="application/vnd.oasis.opendocument.presentation"/>
> <meta name="resourceName" content="embeddedText.odp"/>
> <title/>
> </head>
> <body><p>Here is some embedded text on page 1</p>
> </body></html><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="20970"/>
> <meta name="Content-Type" 
> content="application/vnd.oasis.opendocument.presentation"/>
> <meta name="resourceName" content="embeddedText.odp"/>
> <title/>
> </head>
> <body><p>Here is some embedded text on page 2</p>
> </body></html>
> {noformat}
> You can see "Here is some embedded text on page N" comes out at the end,
> after the main text "Main text on page N" for both slides.
> It's also odd that we get a new html/head/meta/body for each embedded
> doc (there should be only one for the overall document).
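
The "only one html for the overall document" expectation can be checked mechanically. Below is a minimal sketch, assuming the TikaCLI -x output has already been captured into a string. The helper names (count_html_roots, is_single_document) and the abbreviated SAMPLE are hypothetical, for illustration only; they are not part of Tika.

```python
import re
import xml.etree.ElementTree as ET

# Matches an opening <html> tag (with or without attributes), never </html>.
HTML_OPEN = re.compile(r"<html[\s>]")


def count_html_roots(output: str) -> int:
    """Count opening <html> tags in captured extractor output."""
    return len(HTML_OPEN.findall(output))


def is_single_document(output: str) -> bool:
    """Return True if the output parses as one well-formed XML document."""
    try:
        # Bytes input lets the parser honour any XML encoding declaration.
        ET.fromstring(output.encode("utf-8"))
    except ET.ParseError:
        # Several concatenated root elements are rejected by the parser.
        return False
    return True


# Abbreviated stand-in for the reported output: the main document followed
# by one embedded document, each wrapped in its own <html> element.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>Main text on page 1</p></body></html>"
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>Here is some embedded text on page 1</p></body></html>"
)

if __name__ == "__main__":
    print(count_html_roots(SAMPLE))
    print(is_single_document(SAMPLE))
```

A check like this only detects the extra html wrappers; it says nothing about whether the embedded text ends up on the slide that embeds it.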

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
