OpenOffice parser: embedded OLE docs are extracted at the end, as extra 
<html>...</html>
----------------------------------------------------------------------------------------

                 Key: TIKA-735
                 URL: https://issues.apache.org/jira/browse/TIKA-735
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Michael McCandless
            Priority: Minor


When I have an OpenOffice presentation (ODP) that embeds (OLE)
objects, in this case OpenOffice text objects, the text from the
embedded objects comes out at the end of the extracted output, after
all of the slides.

It's great that we are extracting the embedded text, but it'd be
better if each embedded object's text were inlined on the slide that
embedded it.

I have a simple test ODP with two slides.  Each slide has its own
text and also embeds a text OLE object that contains text.  This is
the output:

{noformat}
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="20970"/>
<meta name="Content-Type" 
content="application/vnd.oasis.opendocument.presentation"/>
<meta name="resourceName" content="embeddedText.odp"/>
<title/>
</head>
<body><div/>
<div><p>Main text on page 1</p>
</div>
<object/><div><div/>
</div>
<div/>
<div><ul>       <li><p>Main text on page 2</p>
</li>
</ul>
</div>
<object/><div><div/>
</div>
</body></html><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="20970"/>
<meta name="Content-Type" 
content="application/vnd.oasis.opendocument.presentation"/>
<meta name="resourceName" content="embeddedText.odp"/>
<title/>
</head>
<body><p>Here is some embedded text on page 1</p>
</body></html><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="20970"/>
<meta name="Content-Type" 
content="application/vnd.oasis.opendocument.presentation"/>
<meta name="resourceName" content="embeddedText.odp"/>
<title/>
</head>
<body><p>Here is some embedded text on page 2</p>
</body></html>
{noformat}

You can see "Here is some embedded text on page N" comes out at the end,
after the main text "Main text on page N" for both slides.

It's also odd that we get a new html/head/meta/body for each embedded
doc (there should be only one for the overall document).
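
To illustrate the second point, here is a minimal sketch of a SAX handler
that keeps only what sits inside an embedded document's <body> and drops
the extra html/head/meta wrapper. It uses only the JDK's SAX classes. The
names InlineEmbeddedSketch, BodyOnlyHandler and bodyOf are made up for this
example and are not part of Tika, and this is not necessarily how the
parser should be fixed. The sketch also drops attributes and does not
re-escape text.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not Tika code.
class InlineEmbeddedSketch {

    /** Copies only the markup found inside <body>; everything else is skipped. */
    static class BodyOnlyHandler extends DefaultHandler {
        private final StringBuilder out;
        private int bodyDepth = 0; // 0 = outside <body>, 1 = directly inside it

        BodyOnlyHandler(StringBuilder out) {
            this.out = out;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (bodyDepth > 0) {
                // Nested element inside <body>: emit it (attributes are dropped here).
                out.append('<').append(qName).append('>');
                bodyDepth++;
            } else if ("body".equals(qName)) {
                bodyDepth = 1; // entering <body>; the tag itself is not emitted
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (bodyDepth > 1) {
                out.append("</").append(qName).append('>');
                bodyDepth--;
            } else if (bodyDepth == 1) {
                bodyDepth = 0; // leaving </body>
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (bodyDepth > 0) {
                out.append(ch, start, length);
            }
        }
    }

    /** Parses one XHTML document and returns only the contents of its body. */
    static String bodyOf(String xhtml) {
        StringBuilder out = new StringBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), new BodyOnlyHandler(out));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse embedded XHTML", e);
        }
        return out.toString();
    }
}
```

Calling InlineEmbeddedSketch.bodyOf(...) on one of the trailing
<html>...</html> blocks above would leave just the <p> element, which could
then be written at the point where the slide currently emits an empty
<object/>.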
