[ https://issues.apache.org/jira/browse/TIKA-735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118833#comment-13118833 ]
Michael McCandless commented on TIKA-735:
-----------------------------------------
Ahhh, I see.

So it looks like our default behavior for embedded docs is to fully
extract them, concatenated to the end of the XHTML, as "full" XHTML
docs (i.e. a new <html>...</html> each time).

But maybe we can change TikaCLI so that content from sub-docs is
optionally "inlined" instead.

I see TikaCLI already has the -z option, which extracts embedded
docs to separate fileN files in the current dir...
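
To make the "inlined" idea concrete, here is a rough sketch using only the JDK's SAX interfaces (org.xml.sax), not Tika's own classes. It assumes the embedded document arrives as a complete XHTML event stream, and it passes that stream through a filter that drops the document boundaries, the html/body wrappers, and everything inside head. The names InlineEmbeddedSketch, BodyOnlyHandler, and MarkupSink are hypothetical, and the events are generated by hand rather than by a real parser.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch, not Tika code: forwards only an embedded document's body content. */
final class InlineEmbeddedSketch {

    /** Drops document boundaries, the html/body wrappers, and the whole head. */
    static final class BodyOnlyHandler extends DefaultHandler {
        private final ContentHandler target;
        private int headDepth = 0; // greater than zero while inside <head>

        BodyOnlyHandler(ContentHandler target) { this.target = target; }

        private static boolean isWrapper(String name) {
            return name.equals("html") || name.equals("body");
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            if (headDepth > 0 || local.equals("head")) { headDepth++; return; }
            if (!isWrapper(local)) target.startElement(uri, local, qName, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            if (headDepth > 0) { headDepth--; return; }
            if (!isWrapper(local)) target.endElement(uri, local, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (headDepth == 0) target.characters(ch, start, length);
        }
        // startDocument/endDocument are inherited no-ops, so the outer document stays open.
    }

    /** Minimal serializer so the demo can show the resulting markup (no escaping). */
    static final class MarkupSink extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        @Override public void startElement(String uri, String local, String qName, Attributes atts) {
            out.append('<').append(local).append('>');
        }
        @Override public void endElement(String uri, String local, String qName) {
            out.append("</").append(local).append('>');
        }
        @Override public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    private static void open(ContentHandler h, String name) throws SAXException {
        h.startElement(XHTML, name, name, new AttributesImpl());
    }
    private static void close(ContentHandler h, String name) throws SAXException {
        h.endElement(XHTML, name, name);
    }
    private static void text(ContentHandler h, String s) throws SAXException {
        h.characters(s.toCharArray(), 0, s.length());
    }

    /** Simulates one slide whose embedded document is emitted where the object sits. */
    static String demo() {
        try {
            MarkupSink sink = new MarkupSink();
            sink.startDocument();
            open(sink, "html"); open(sink, "body"); open(sink, "div");
            open(sink, "p"); text(sink, "Main text on page 1"); close(sink, "p");

            // Embedded document: a full XHTML event stream, routed through the filter.
            ContentHandler embedded = new BodyOnlyHandler(sink);
            embedded.startDocument();
            open(embedded, "html");
            open(embedded, "head"); open(embedded, "title");
            close(embedded, "title"); close(embedded, "head");
            open(embedded, "body");
            open(embedded, "p"); text(embedded, "Here is some embedded text on page 1");
            close(embedded, "p");
            close(embedded, "body"); close(embedded, "html");
            embedded.endDocument();

            close(sink, "div"); close(sink, "body"); close(sink, "html");
            sink.endDocument();
            return sink.out.toString();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the embedded paragraph lands inside the slide's div, right after the slide's own text, and the output has a single html/body pair. Whether TikaCLI could wire something like this in as an option depends on how it hands embedded documents to the parser, which the sketch does not model.
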
> OpenOffice parser: embedded OLE docs are extracted at the end, as extra
> <html>...</html>
> ----------------------------------------------------------------------------------------
>
> Key: TIKA-735
> URL: https://issues.apache.org/jira/browse/TIKA-735
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Priority: Minor
> Attachments: embeddedText.odp
>
>
> When I have an OpenOffice presentation (ODP) that embeds (OLE)
> objects, in this case OpenOffice text, text from the embedded objects
> is at the end of the presentation.
> It's great that we are extracting the embedded text, but it'd be
> better if each embedded object's text were inlined on the slide that
> embedded it.
> I have a simple test ODP with two slides. Each slide has its own
> text, and then embeds a text OLE object with text as well, and this is
> the output:
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?><html
> xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="20970"/>
> <meta name="Content-Type"
> content="application/vnd.oasis.opendocument.presentation"/>
> <meta name="resourceName" content="embeddedText.odp"/>
> <title/>
> </head>
> <body><div/>
> <div><p>Main text on page 1</p>
> </div>
> <object/><div><div/>
> </div>
> <div/>
> <div><ul> <li><p>Main text on page 2</p>
> </li>
> </ul>
> </div>
> <object/><div><div/>
> </div>
> </body></html><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="20970"/>
> <meta name="Content-Type"
> content="application/vnd.oasis.opendocument.presentation"/>
> <meta name="resourceName" content="embeddedText.odp"/>
> <title/>
> </head>
> <body><p>Here is some embedded text on page 1</p>
> </body></html><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="20970"/>
> <meta name="Content-Type"
> content="application/vnd.oasis.opendocument.presentation"/>
> <meta name="resourceName" content="embeddedText.odp"/>
> <title/>
> </head>
> <body><p>Here is some embedded text on page 2</p>
> </body></html>
> {noformat}
> You can see "Here is some embedded text on page N" comes out at the end,
> after the main text "Main text on page N" for both slides.
> It's also odd that we get a new html/head/meta/body for each embedded
> doc (there should be only one for the overall document).