[ 
https://issues.apache.org/jira/browse/TIKA-736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136090#comment-13136090
 ] 

Uwe Schindler commented on TIKA-736:
------------------------------------

{quote}
Oh, this is because XHTMLContentHandler, on seeing the end of header /
start of body will output <meta> tags for all metadata present in the
Metadata class at that time. So... if new entries are added to
Metadata after the body tag is started they won't make it into the
<head>...</head>. Looks like this was done under TIKA-478.
{quote}

Oh, that was long after my initial submission of this parser :-) I was not 
aware that the metadata is now also replicated into the HTML head, in addition 
to the separate Metadata class.

With the current parser it can also happen that the footer/header/master slide 
text comes before or after the main text, depending on the order of the files. 
For indexing purposes, such as with Lucene, that's not an issue at all. Indexing 
was the only reason the original version of this parser was created (as always, 
for PANGAEA), so order did not have any effect.

Could we work around the whole thing without the need for a random-access ZIP 
file, if we only serialized the body and inserted it later (e.g. using a 
caching SAX filter)? In general the text-only part is much smaller than a ZIP 
file with large 1000 dpi images, so caching it might not be an issue (not the 
whole DOM tree, of course) :-)
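
A minimal sketch of what such a caching SAX filter might look like, using only 
the standard org.xml.sax API. The class name BodyBufferingHandler and its 
flushTo method are hypothetical and not part of Tika. The sketch assumes the 
parser feeds it the body events first and replays them once all metadata is 
known:

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper (not a Tika class): records SAX events for later replay. */
class BodyBufferingHandler extends DefaultHandler {

    /** One recorded SAX event that can be replayed against any handler. */
    private interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final List<Event> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Copy the attributes: a SAX parser may reuse the Attributes object.
        final Attributes copy = new AttributesImpl(atts);
        events.add(target -> target.startElement(uri, localName, qName, copy));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add(target -> target.endElement(uri, localName, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the text: the char[] buffer belongs to the parser.
        final char[] copy = new String(ch, start, length).toCharArray();
        events.add(target -> target.characters(copy, 0, copy.length));
    }

    /** Replays everything recorded so far, in order, then clears the buffer. */
    public void flushTo(ContentHandler target) throws SAXException {
        for (Event e : events) {
            e.replay(target);
        }
        events.clear();
    }
}
```

With something like this, the parser could send the body of content.xml into 
the buffer, read the remaining ZIP entries to complete the Metadata, and only 
then start the real XHTML output and call flushTo on it. Only element and text 
events are held in memory, never the image data.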
                
> OpenOffice parser: master footer text isn't extracted
> -----------------------------------------------------
>
>                 Key: TIKA-736
>                 URL: https://issues.apache.org/jira/browse/TIKA-736
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: TIKA-736.patch, TIKA-736.patch, testMasterFooter.odp
>
>
> If I edit the footer text on the master slide of an OpenOffice presentation, 
> I see that text rendered on the slide, but it's not extracted by Tika.
> Digging into the document, curiously the footer text is in the styles.xml, 
> under office:master-styles -> style:master-page -> draw:frame -> 
> draw:text-box -> text:p.  I think somehow we're not linking up each slide's 
> master text elements to that slide, similar to TIKA-712.


        
