[ https://issues.apache.org/jira/browse/TIKA-735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated TIKA-735:
------------------------------------

    Attachment: embeddedText.odp

ODP document that leads to above text output from TikaCLI -x.

> OpenOffice parser: embedded OLE docs are extracted at the end, as extra <html>...</html>
> ----------------------------------------------------------------------------------------
>
>                 Key: TIKA-735
>                 URL: https://issues.apache.org/jira/browse/TIKA-735
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: embeddedText.odp
>
>
> When I have an OpenOffice presentation (ODP) that embeds (OLE)
> objects, in this case OpenOffice text, text from the embedded objects
> is at the end of the presentation.
> It's great that we are extracting the embedded text, but it'd be
> better if each embedded object's text were inlined on the slide that
> embedded it.
> I have a simple test ODP with two slides. Each slide has its own
> text, and then embeds a text OLE object with text as well, and this is
> the output:
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="20970"/>
> <meta name="Content-Type" content="application/vnd.oasis.opendocument.presentation"/>
> <meta name="resourceName" content="embeddedText.odp"/>
> <title/>
> </head>
> <body><div/>
> <div><p>Main text on page 1</p>
> </div>
> <object/><div><div/>
> </div>
> <div/>
> <div><ul> <li><p>Main text on page 2</p>
> </li>
> </ul>
> </div>
> <object/><div><div/>
> </div>
> </body></html><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="20970"/>
> <meta name="Content-Type" content="application/vnd.oasis.opendocument.presentation"/>
> <meta name="resourceName" content="embeddedText.odp"/>
> <title/>
> </head>
> <body><p>Here is some embedded text on page 1</p>
> </body></html><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="20970"/>
> <meta name="Content-Type" content="application/vnd.oasis.opendocument.presentation"/>
> <meta name="resourceName" content="embeddedText.odp"/>
> <title/>
> </head>
> <body><p>Here is some embedded text on page 2</p>
> </body></html>
> {noformat}
> You can see "Here is some embedded text on page N" comes out at the end,
> after the main text "Main text on page N" for both slides.
> It's also odd that we get a new html/head/meta/body for each embedded
> doc (there should be only one for the overall document).
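
For anyone who wants to check their own output for the two symptoms described in the issue, here is a minimal sketch. It is not part of Tika; the helper names (count_html_roots, text_after_first_document) and the abbreviated SAMPLE_OUTPUT string are assumptions made up for illustration, and the sample is a shortened form of the output quoted in the report (head/meta elements omitted). It counts how many <html> root elements appear and lists any text emitted after the first </html>.

```python
import re

# Abbreviated form of the TikaCLI -x output quoted in the issue
# (head/meta elements left out to keep the sample short).
SAMPLE_OUTPUT = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><div><p>Main text on page 1</p></div>"
    "<div><ul><li><p>Main text on page 2</p></li></ul></div>"
    "</body></html>"
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>Here is some embedded text on page 1</p></body></html>"
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>Here is some embedded text on page 2</p></body></html>"
)


def count_html_roots(output: str) -> int:
    """Count opening <html> tags; a single well-formed document has one."""
    return len(re.findall(r"<html[\s>]", output))


def text_after_first_document(output: str) -> str:
    """Return the text content that appears after the first </html>."""
    end = output.find("</html>")
    if end == -1:
        return ""
    trailing = output[end + len("</html>"):]
    # Replace tags with spaces, then collapse whitespace to single spaces.
    return " ".join(re.sub(r"<[^>]+>", " ", trailing).split())


if __name__ == "__main__":
    print("html roots:", count_html_roots(SAMPLE_OUTPUT))
    print("trailing text:", text_after_first_document(SAMPLE_OUTPUT))
```

On the sample above this reports three <html> roots, with both "Here is some embedded text on page N" strings falling after the first document closes. This is a plain regex check, so it only suits quick inspection of output like the one quoted, not general XHTML parsing.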