[ 
https://issues.apache.org/jira/browse/TIKA-735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118833#comment-13118833
 ] 

Michael McCandless commented on TIKA-735:
-----------------------------------------

Ahhh, I see.

So it looks like our default behavior for embedded docs is to fully
extract them and append them to the end of the XHTML as "full" XHTML
docs (i.e., a new <html>...</html> each time).

But maybe we can change TikaCLI so that content from sub-docs is
optionally "inlined" instead.

I see TikaCLI already has the -z option, which extracts embedded
docs to separate fileN files in the current dir...
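
To make the "inlined" idea concrete, here is a rough sketch using only
the JDK's org.xml.sax classes. It is not TikaCLI or Tika parser code:
InlineEmbeddedSketch, InliningHandler and demo() are hypothetical names,
and the SAX events are fired by hand rather than by a real parser. The
assumption is simply that an embedded doc's events can be routed through
a wrapper that drops the extra html/head/body envelope and forwards the
rest to the outer handler.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class InlineEmbeddedSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Hypothetical wrapper: forwards an embedded doc's body content only. */
    static class InliningHandler extends DefaultHandler {
        private final ContentHandler outer;
        private int skipDepth = 0; // > 0 while inside the embedded <head>

        InliningHandler(ContentHandler outer) { this.outer = outer; }

        // startDocument/endDocument are inherited no-ops, so the outer
        // document is never restarted or closed by the embedded one.

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) throws SAXException {
            if (skipDepth > 0 || local.equals("head")) { skipDepth++; return; }
            if (local.equals("html") || local.equals("body")) return;
            outer.startElement(uri, local, qName, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName)
                throws SAXException {
            if (skipDepth > 0) { skipDepth--; return; }
            if (local.equals("html") || local.equals("body")) return;
            outer.endElement(uri, local, qName);
        }

        @Override
        public void characters(char[] ch, int start, int len)
                throws SAXException {
            if (skipDepth == 0) outer.characters(ch, start, len);
        }
    }

    static void open(ContentHandler h, String n) throws SAXException {
        h.startElement(XHTML, n, n, new AttributesImpl());
    }

    static void close(ContentHandler h, String n) throws SAXException {
        h.endElement(XHTML, n, n);
    }

    static void text(ContentHandler h, String s) throws SAXException {
        h.characters(s.toCharArray(), 0, s.length());
    }

    /** Simulates one slide with one embedded doc; returns the merged markup. */
    static String demo() {
        final StringBuilder out = new StringBuilder();
        ContentHandler outer = new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                out.append('<').append(l).append('>');
            }
            @Override
            public void endElement(String u, String l, String q) {
                out.append("</").append(l).append('>');
            }
            @Override
            public void characters(char[] ch, int start, int len) {
                out.append(ch, start, len);
            }
        };
        try {
            outer.startDocument();
            open(outer, "html"); open(outer, "body"); open(outer, "div");
            open(outer, "p"); text(outer, "Main text on page 1"); close(outer, "p");

            // Embedded doc: a full XHTML document, routed through the wrapper.
            ContentHandler inl = new InliningHandler(outer);
            inl.startDocument();
            open(inl, "html");
            open(inl, "head"); open(inl, "title"); close(inl, "title"); close(inl, "head");
            open(inl, "body");
            open(inl, "p"); text(inl, "Here is some embedded text on page 1"); close(inl, "p");
            close(inl, "body"); close(inl, "html");
            inl.endDocument();

            close(outer, "div"); close(outer, "body"); close(outer, "html");
            outer.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With a wrapper along these lines, the embedded paragraph lands inside
the slide's own <div>, and there is only one html/body for the whole
document. How a parser would actually hook this in is left open here.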

                
> OpenOffice parser: embedded OLE docs are extracted at the end, as extra 
> <html>...</html>
> ----------------------------------------------------------------------------------------
>
>                 Key: TIKA-735
>                 URL: https://issues.apache.org/jira/browse/TIKA-735
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: embeddedText.odp
>
>
> When I have an OpenOffice presentation (ODP) that embeds (OLE)
> objects, in this case OpenOffice text, text from the embedded objects
> is at the end of the presentation.
> It's great that we are extracting the embedded text, but it'd be
> better if each embedded object's text were inlined on the slide that
> embedded it.
> I have a simple test ODP with two slides.  Each slide has its own
> text, and then embeds a text OLE object with text as well, and this is
> the output:
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?><html 
> xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="20970"/>
> <meta name="Content-Type" 
> content="application/vnd.oasis.opendocument.presentation"/>
> <meta name="resourceName" content="embeddedText.odp"/>
> <title/>
> </head>
> <body><div/>
> <div><p>Main text on page 1</p>
> </div>
> <object/><div><div/>
> </div>
> <div/>
> <div><ul>     <li><p>Main text on page 2</p>
> </li>
> </ul>
> </div>
> <object/><div><div/>
> </div>
> </body></html><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="20970"/>
> <meta name="Content-Type" 
> content="application/vnd.oasis.opendocument.presentation"/>
> <meta name="resourceName" content="embeddedText.odp"/>
> <title/>
> </head>
> <body><p>Here is some embedded text on page 1</p>
> </body></html><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="20970"/>
> <meta name="Content-Type" 
> content="application/vnd.oasis.opendocument.presentation"/>
> <meta name="resourceName" content="embeddedText.odp"/>
> <title/>
> </head>
> <body><p>Here is some embedded text on page 2</p>
> </body></html>
> {noformat}
> You can see "Here is some embedded text on page N" comes out at the end,
> after the main text "Main text on page N" for both slides.
> It's also odd that we get a new html/head/meta/body for each embedded
> doc (there should be only one for the overall document).
