[jira] [Updated] (TIKA-736) OpenOffice parser: master footer text isn't extracted

Michael McCandless (Updated) (JIRA) Wed, 26 Oct 2011 03:57:59 -0700

     [ 
https://issues.apache.org/jira/browse/TIKA-736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael McCandless updated TIKA-736:
------------------------------------

    Attachment: TIKA-736.patch

This turned out to be fairly simple to fix, so I worked out a patch,
and I think it's worth fixing in our current ODF parser, since we're
not sure when we'll cut over to the ODFToolkit-based solution.

Basically I also recurse into styles.xml, using the content parser. It
doesn't seem to have the same "problem" as PPT/PPTX (TIKA-712), where
we get the boilerplate text out, except in one case that I could find
(page numbers would output <number> placeholder text), so I fixed
OpenDocumentContentParser to not output text for text:page-number
elements.  (Separately, I noticed we don't properly extract page
numbers for ODP files today... I'll open a new issue.)
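
To illustrate the text:page-number change, here is a minimal sketch using
only the JDK's SAX API. It is not the code from the attached patch: the
class name PageNumberSkipSketch, the TextCollector handler and the sample
XML are all hypothetical, and the real OpenDocumentContentParser is
structured differently. It only shows the idea of collecting character
data while suppressing anything inside text:page-number.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PageNumberSkipSketch {
    // ODF "text" namespace URI.
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    // Hypothetical handler: collects text, but drops everything nested
    // inside a text:page-number element.
    static class TextCollector extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        private int skipDepth = 0; // > 0 while inside text:page-number

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (skipDepth > 0 || (TEXT_NS.equals(uri) && "page-number".equals(local))) {
                skipDepth++;
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (skipDepth > 0) {
                skipDepth--;
            } else if (TEXT_NS.equals(uri) && "p".equals(local)) {
                out.append('\n'); // one line per paragraph
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (skipDepth == 0) {
                out.append(ch, start, length);
            }
        }
    }

    static String extractText(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();
        TextCollector handler = new TextCollector();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up fragment resembling a master-page footer paragraph.
        String xml = "<text:p xmlns:text=\"" + TEXT_NS + "\">Master footer "
                + "<text:page-number>&lt;number&gt;</text:page-number></text:p>";
        System.out.print(extractText(xml));
    }
}
```

With this sample input the footer text is kept and the <number>
placeholder is dropped.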

I also noticed that, because the OpenDocumentParser is strictly streaming
(single-pass through the ZipFile), we can easily fail to insert the
meta tags into the output XHTML, if we encounter "meta.xml" after
"content.xml".  This is maybe not so bad, because the metadata will
still have the fields... but we could fix it, by using random-access
ZipFile instead if we had already opened a ZipFile (e.g.
AutoDetectParser), or if the InputStream is a TikaInputStream with a
File.  I put a TODO to do this...
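
A rough sketch of what the random-access approach could look like, using
only java.util.zip. This is an assumption about one possible shape of the
TODO, not code from the patch: the class ZipOrderSketch and its methods
are hypothetical, and the sample archive is written deliberately with
content.xml before meta.xml to mimic the problematic ordering.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ZipOrderSketch {
    // Returns the order in which entries would be handed to sub-parsers:
    // meta.xml first (if present), then everything else in archive order.
    static List<String> visitOrder(File file) throws IOException {
        List<String> visited = new ArrayList<>();
        try (ZipFile zip = new ZipFile(file)) {
            // Random access: look up meta.xml directly, wherever it is stored.
            if (zip.getEntry("meta.xml") != null) {
                visited.add("meta.xml");
            }
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (!"meta.xml".equals(name)) {
                    // A real parser would pass zip.getInputStream(entry)
                    // to the matching sub-parser here.
                    visited.add(name);
                }
            }
        }
        return visited;
    }

    // Writes a tiny archive with content.xml stored BEFORE meta.xml.
    static File writeSample() throws IOException {
        File file = File.createTempFile("sketch", ".odp");
        file.deleteOnExit();
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(file))) {
            for (String name : new String[] {"content.xml", "meta.xml"}) {
                out.putNextEntry(new ZipEntry(name));
                out.write("<x/>".getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return file;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(visitOrder(writeSample()));
    }
}
```

A single-pass ZipInputStream would see content.xml first for this
archive; with ZipFile the metadata entry can be processed up front, so
the meta tags are available before any body content is emitted.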

Also, I moved up the XHTMLContentHandler wrapping into
OpenDocumentParser (from OpenDocumentContentParser), so that we don't
emit head/body tags twice.  I think we need to do this for
TIKA-735 too.

This fix is not perfect, since (just like TIKA-712, for ppt/pptx) it
outputs the master text only once (as if it were its own slide),
instead of inlining it into each slide that referenced that master,
but I think it's at least better than what we have today (no master
text is extracted)... progress not perfection.

                
> OpenOffice parser: master footer text isn't extracted
> -----------------------------------------------------
>
>                 Key: TIKA-736
>                 URL: https://issues.apache.org/jira/browse/TIKA-736
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>         Attachments: TIKA-736.patch, TIKA-736.patch, testMasterFooter.odp
>
>
> If I edit the footer text on the master slide of an OpenOffice presentation, 
> I see that text rendered on the slide, but it's not extracted by Tika.
> Digging into the document, curiously the footer text is in the styles.xml, 
> under office:master-styles -> style:master-page -> draw:frame -> 
> draw:text-box -> text:p.  I think somehow we're not linking up each slide's 
> master text elements to that slide, similar to TIKA-712.

