[jira] [Commented] (TIKA-735) OpenOffice parser: embedded OLE docs are extracted at the end, as extra ...

Jukka Zitting (Commented) (JIRA) Sat, 01 Oct 2011 11:12:59 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118861#comment-13118861
 ]


Jukka Zitting commented on TIKA-735:
------------------------------------

A parser should always produce valid XHTML output. If an embedded 
document is fed into a recursive parse() call, the EmbeddedContentHandler 
and BodyContentHandler classes can (and should) be used to include only the 
extracted body content of the embedded document. See the 
ParsingEmbeddedDocumentExtractor class for how this is done. In fact, I'd 
recommend simply using the ParsingEmbeddedDocumentExtractor class directly, 
just like the package, POIFS, and OOXML parsers already do.
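
To illustrate the idea of forwarding only the body content of a nested 
document, here is a minimal sketch that uses only the JDK's SAX classes. 
BodyOnlyHandler, XmlSink and BodyOnlyDemo are hypothetical stand-ins written 
for this example; they are not Tika classes and only approximate the combined 
effect of wrapping a handler in EmbeddedContentHandler and BodyContentHandler.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in (not a Tika class): forwards only what lies inside
// <body>, and never forwards startDocument()/endDocument().
class BodyOnlyHandler extends DefaultHandler {
    private final ContentHandler outer;
    private int depthInBody = 0; // 0 = outside <body>, 1 = directly inside it

    BodyOnlyHandler(ContentHandler outer) { this.outer = outer; }

    // startDocument()/endDocument() stay the inherited no-ops, so the nested
    // document cannot restart or terminate the outer one.

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
        if (depthInBody > 0) {
            outer.startElement(uri, local, qName, atts);
            depthInBody++;
        } else if ("body".equals(local)) {
            depthInBody = 1; // enter <body>, but do not forward the tag itself
        }
    }

    @Override
    public void endElement(String uri, String local, String qName)
            throws SAXException {
        if (depthInBody > 1) {
            depthInBody--;
            outer.endElement(uri, local, qName);
        } else if (depthInBody == 1) {
            depthInBody = 0; // this is the closing </body>
        }
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        if (depthInBody > 0) outer.characters(ch, start, length);
    }
}

// Hypothetical sink that records element names and text as a string.
class XmlSink extends DefaultHandler {
    final StringBuilder out = new StringBuilder();
    @Override public void startElement(String u, String local, String q, Attributes a) {
        out.append('<').append(local).append('>');
    }
    @Override public void endElement(String u, String local, String q) {
        out.append("</").append(local).append('>');
    }
    @Override public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);
    }
}

class BodyOnlyDemo {
    static final String XHTML = "http://www.w3.org/1999/xhtml";
    static final Attributes NONE = new AttributesImpl();

    static void element(ContentHandler h, String name, String text) throws SAXException {
        h.startElement(XHTML, name, name, NONE);
        h.characters(text.toCharArray(), 0, text.length());
        h.endElement(XHTML, name, name);
    }

    // Plays the role of a recursive parse() call: emits a complete XHTML document.
    static void emitEmbedded(ContentHandler h, String text) throws SAXException {
        h.startDocument();
        h.startElement(XHTML, "html", "html", NONE);
        h.startElement(XHTML, "head", "head", NONE);
        element(h, "title", "");
        h.endElement(XHTML, "head", "head");
        h.startElement(XHTML, "body", "body", NONE);
        element(h, "p", text);
        h.endElement(XHTML, "body", "body");
        h.endElement(XHTML, "html", "html");
        h.endDocument();
    }

    static String render() throws SAXException {
        XmlSink sink = new XmlSink();
        sink.startDocument();
        sink.startElement(XHTML, "body", "body", NONE);
        sink.startElement(XHTML, "div", "div", NONE);
        element(sink, "p", "Main text on page 1");
        // Wrap the outer handler so only the embedded body content gets through.
        emitEmbedded(new BodyOnlyHandler(sink), "Here is some embedded text on page 1");
        sink.endElement(XHTML, "div", "div");
        sink.endElement(XHTML, "body", "body");
        sink.endDocument();
        return sink.out.toString();
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(render());
    }
}
```

With the wrapper in place, the embedded paragraph lands inside the slide's 
div and no second html/head pair reaches the outer handler. In Tika itself 
the corresponding wrap would presumably look like 
new EmbeddedContentHandler(new BodyContentHandler(handler)), but check 
ParsingEmbeddedDocumentExtractor for the authoritative version.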

Anyway, as Nick mentioned elsewhere, it's probably not worth fixing the 
current code, since it will likely be rewritten to use the ODF toolkit 
anyway.
                
> OpenOffice parser: embedded OLE docs are extracted at the end, as extra 
> <html>...</html>
> ----------------------------------------------------------------------------------------
>
>                 Key: TIKA-735
>                 URL: https://issues.apache.org/jira/browse/TIKA-735
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: embeddedText.odp
>
>
> When I have an OpenOffice presentation (ODP) that embeds (OLE)
> objects, in this case OpenOffice text, text from the embedded objects
> is at the end of the presentation.
> It's great that we are extracting the embedded text, but it'd be
> better if each embedded object's text were inlined on the slide that
> embedded it.
> I have a simple test ODP with two slides.  Each slide has its own
> text, and then embeds a text OLE object with text as well, and this is
> the output:
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?><html 
> xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="20970"/>
> <meta name="Content-Type" 
> content="application/vnd.oasis.opendocument.presentation"/>
> <meta name="resourceName" content="embeddedText.odp"/>
> <title/>
> </head>
> <body><div/>
> <div><p>Main text on page 1</p>
> </div>
> <object/><div><div/>
> </div>
> <div/>
> <div><ul>     <li><p>Main text on page 2</p>
> </li>
> </ul>
> </div>
> <object/><div><div/>
> </div>
> </body></html><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="20970"/>
> <meta name="Content-Type" 
> content="application/vnd.oasis.opendocument.presentation"/>
> <meta name="resourceName" content="embeddedText.odp"/>
> <title/>
> </head>
> <body><p>Here is some embedded text on page 1</p>
> </body></html><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="20970"/>
> <meta name="Content-Type" 
> content="application/vnd.oasis.opendocument.presentation"/>
> <meta name="resourceName" content="embeddedText.odp"/>
> <title/>
> </head>
> <body><p>Here is some embedded text on page 2</p>
> </body></html>
> {noformat}
> You can see "Here is some embedded text on page N" comes out at the end,
> after the main text "Main text on page N" for both slides.
> It's also odd that we get a new html/head/meta/body for each embedded
> doc (there should be only one for the overall document).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
