[jira] [Commented] (TIKA-736) OpenOffice parser: master footer text isn't extracted

Michael McCandless (Commented) (JIRA) Wed, 26 Oct 2011 09:17:58 -0700

    [ https://issues.apache.org/jira/browse/TIKA-736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136070#comment-13136070 ]


Michael McCandless commented on TIKA-736:
-----------------------------------------

bq. Can you also check that parsing styles.xml of e.g. writer or calc documents 
does no harm?

Good idea, Uwe! I tested this.

On a fresh Writer (.odt) doc, no text comes out of the styles.xml
(good).  If I then edit the footer, Tika misses that text today, but
the patch gets it (I added a test).

On a fresh Calc (.ods) doc, there is some minor "placeholder" text:

<pre>
  <p>???</p>
  <p>Page</p>
  <p>??? (???)</p>
  <p>10/26/2011, 11:13:57</p>
  <p>Page  / 99 </p>
</pre>

I've fixed the "99" by also filtering for "text:page-count" in ODCP
(OpenDocumentContentParser); the date/time is apparently when the doc
was created.  I think the rest of the boilerplate text is acceptable?
E.g., you can see this text (Page 1) when you do Page Preview or print...

When I then edited the footer in the Calc doc, Tika misses that text
today, but the patch gets it (I added a test for this too).
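
To make the styles.xml handling above concrete, here is a minimal
sketch in Python (not the actual patch, which is Java inside Tika).
It assumes the standard ODF namespace URIs, collects every text:p
under office:master-styles, and skips the page-number and page-count
fields so that placeholder values like "99" are dropped.  The function
names and the SKIPPED set are hypothetical, chosen only for this
illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard ODF namespace URIs.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}

# Field elements whose rendered placeholder value should not be extracted
# (hypothetical filter list for this sketch).
SKIPPED = {
    "{%s}page-number" % NS["text"],
    "{%s}page-count" % NS["text"],
}


def visible_text(elem):
    """Concatenate the text under elem, dropping SKIPPED elements
    but keeping the text that follows them."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in SKIPPED:
            parts.append(visible_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


def master_style_paragraphs(odf_file):
    """Return the text of each text:p found under office:master-styles
    in the styles.xml entry of an ODF package (path or file object)."""
    with zipfile.ZipFile(odf_file) as zf:
        root = ET.fromstring(zf.read("styles.xml"))
    master = root.find("office:master-styles", NS)
    if master is None:
        return []
    return [visible_text(p) for p in master.iter("{%s}p" % NS["text"])]
```

Calling master_style_paragraphs("testMasterFooter.odp") on the attached
test document would be one way to try it; I have not listed the output
here since it depends on the file's contents.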

bq. About the order: I have it somewhere in the back of my head, that the order 
of files in the ZIP file is somehow part of the standard. At least I know, that 
the MIME_TYPE file must be the first one in the ZIP file, to make detection of 
format easy.

I haven't been able to find mention of this in the spec... I'm looking
at http://docs.oasis-open.org/office/v1.1/OS/OpenDocument-v1.1.odt and
it just describes the general ZIP format as far as I can tell...
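
Independent of what the spec says, it is easy to check how a given
file is laid out.  The sketch below assumes the "MIME_TYPE file"
mentioned above is the package entry named "mimetype", and reports
whether that entry is physically first in the archive.  The function
name is hypothetical; it uses only Python's standard zipfile module.

```python
import zipfile


def mimetype_is_first_entry(odf_file):
    """Return True if the entry stored earliest in the ZIP archive
    is named 'mimetype' (path or file object accepted)."""
    with zipfile.ZipFile(odf_file) as zf:
        infos = zf.infolist()
    if not infos:
        return False
    # header_offset is the byte offset of the entry's local file header,
    # so the minimum identifies the physically first entry.
    first = min(infos, key=lambda info: info.header_offset)
    return first.filename == "mimetype"
```

This only inspects one file; it says nothing about whether the
ordering is required.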

bq. I still don't get the reason for problems with metadata if the order of 
files is different.

Oh, this is because XHTMLContentHandler, on seeing the end of the head /
start of the body, outputs <meta> tags for all metadata present in the
Metadata object at that time.  So... if new entries are added to
Metadata after the body tag is started, they won't make it into the
<head>...</head>.  Looks like this was done under TIKA-478.
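
Here is a toy model of that ordering problem, in Python.  It is not
Tika's XHTMLContentHandler; ToyXhtmlHandler and render are
hypothetical names, and the only behavior modeled is "write the head
once, lazily, from whatever metadata exists when the body starts".

```python
class ToyXhtmlHandler:
    """Hypothetical stand-in that writes <head> the first time body
    content arrives, using the metadata present at that moment."""

    def __init__(self, metadata):
        self.metadata = metadata  # shared dict, may grow later
        self.out = []
        self.head_written = False

    def start_body(self):
        if self.head_written:
            return
        self.out.append("<head>")
        for name, value in self.metadata.items():
            # No escaping here; this is only a sketch.
            self.out.append('<meta name="%s" content="%s"/>' % (name, value))
        self.out.append("</head><body>")
        self.head_written = True

    def characters(self, text):
        self.start_body()
        self.out.append("<p>%s</p>" % text)

    def end_document(self):
        self.start_body()
        self.out.append("</body>")
        return "".join(self.out)


def render(entries):
    """entries: ordered ('meta', dict) or ('text', str) pairs, in the
    order the archive entries happen to be read."""
    metadata = {}
    handler = ToyXhtmlHandler(metadata)
    for kind, payload in entries:
        if kind == "meta":
            metadata.update(payload)
        else:
            handler.characters(payload)
    return handler.end_document()
```

With render([("meta", {"title": "T"}), ("text", "hi")]) the title ends
up in the head; with the two entries swapped, the head is already
closed by the time the title is added, so it is silently dropped.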

                
> OpenOffice parser: master footer text isn't extracted
> -----------------------------------------------------
>
>                 Key: TIKA-736
>                 URL: https://issues.apache.org/jira/browse/TIKA-736
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: TIKA-736.patch, TIKA-736.patch, testMasterFooter.odp
>
>
> If I edit the footer text on the master slide of an OpenOffice presentation, 
> I see that text rendered on the slide, but it's not extracted by Tika.
> Digging into the document, curiously the footer text is in the styles.xml, 
> under office:master-styles -> style:master-page -> draw:frame -> 
> draw:text-box -> text:p.  I think somehow we're not linking up each slide's 
> master text elements to that slide, similar to TIKA-712.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
