[ 
https://issues.apache.org/jira/browse/TIKA-736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13133995#comment-13133995
 ] 

Uwe Schindler commented on TIKA-736:
------------------------------------

The current ODF parser is very lightweight and memory efficient (I hope 
ODFToolkit uses a streaming API, too, comparable with SAX). It is very elegant, 
but limited. I would be against a parser like the OpenXML one that builds huge 
DOM/object trees, as this is also something of a security risk: if your parser 
gets a huge document that doesn't fit in memory, it crashes your app.

The current parser streams the document's XML files through the SAX API and 
converts them to HTML by replacing element names and making some structural 
modifications (I wrote it a few years ago and donated it to TIKA). I have no 
problem with nuking it once ODFToolkit is out, but please, please, please use a 
streaming API without large DOM/object trees and temporary files. Alternatively, 
leave both parsers available (I would also take care of the current one).
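
To illustrate the streaming approach described above, here is a minimal sketch 
of a SAX handler that rewrites ODF element names to HTML ones as events arrive, 
without building a tree. It is not the actual Tika parser code: the class name 
OdfToHtmlSketch, the convert method, and the small element mapping are 
assumptions made for illustration, and a real converter would handle many more 
elements, attributes, and structural cases.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the real Tika implementation.
public class OdfToHtmlSketch {
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    // Assumed, deliberately tiny mapping from ODF text elements to HTML elements.
    static final Map<String, String> NAMES = Map.of(
            "p", "p",
            "h", "h1",
            "list", "ul",
            "list-item", "li");

    public static String convert(String xml) throws Exception {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                String html = TEXT_NS.equals(uri) ? NAMES.get(local) : null;
                if (html != null) {
                    out.append('<').append(html).append('>');
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                String html = TEXT_NS.equals(uri) ? NAMES.get(local) : null;
                if (html != null) {
                    out.append("</").append(html).append('>');
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Pass text through, escaping the characters that are unsafe in HTML text.
                for (int i = start; i < start + length; i++) {
                    char c = ch[i];
                    if (c == '&') {
                        out.append("&amp;");
                    } else if (c == '<') {
                        out.append("&lt;");
                    } else {
                        out.append(c);
                    }
                }
            }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // match on namespace URI, not on the prefix
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return out.toString();
    }
}
```

The handler keeps no per-document state, so the parser's memory use does not 
grow with document size. In this sketch only the output buffer grows; a real 
parser would forward events to a downstream ContentHandler instead of 
collecting a string.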
                
> OpenOffice parser: master footer text isn't extracted
> -----------------------------------------------------
>
>                 Key: TIKA-736
>                 URL: https://issues.apache.org/jira/browse/TIKA-736
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>         Attachments: TIKA-736.patch, testMasterFooter.odp
>
>
> If I edit the footer text on the master slide of an OpenOffice presentation, 
> I see that text rendered on the slide, but it's not extracted by Tika.
> Digging into the document, curiously the footer text is in the styles.xml, 
> under office:master-styles -> style:master-page -> draw:frame -> 
> draw:text-box -> text:p.  I think somehow we're not linking up each slide's 
> master text elements to that slide, similar to TIKA-712.
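
The element path in the report (style:master-page -> ... -> text:p inside 
styles.xml) can also be followed in a streaming fashion. The following is a 
hedged sketch only, not the attached patch: MasterPageTextSketch and extract 
are hypothetical names, and it merely collects paragraph text found inside any 
style:master-page, without linking it back to individual slides.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; does not reproduce TIKA-736.patch.
public class MasterPageTextSketch {
    static final String STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0";
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    public static List<String> extract(String stylesXml) throws Exception {
        List<String> paragraphs = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private int masterDepth = 0;   // > 0 while inside style:master-page
            private StringBuilder current; // non-null while inside a text:p of interest

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (STYLE_NS.equals(uri) && "master-page".equals(local)) {
                    masterDepth++;
                } else if (masterDepth > 0 && TEXT_NS.equals(uri) && "p".equals(local)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (STYLE_NS.equals(uri) && "master-page".equals(local)) {
                    masterDepth--;
                } else if (current != null && TEXT_NS.equals(uri) && "p".equals(local)) {
                    paragraphs.add(current.toString());
                    current = null;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) {
                    current.append(ch, start, length);
                }
            }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(stylesXml)), handler);
        return paragraphs;
    }
}
```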


        
