Small improvements to how embedded docs are parsed in 
AbstractPOIFSExtractor.handleEmbeddedOfficeDoc
----------------------------------------------------------------------------------------------------

                 Key: TIKA-751
                 URL: https://issues.apache.org/jira/browse/TIKA-751
             Project: Tika
          Issue Type: Improvement
          Components: parser
            Reporter: Michael McCandless
            Assignee: Michael McCandless
             Fix For: 1.0


I noticed some minor things in this method:

  * It does too much work (writes the tmpFile out) even if the
    EmbeddedDocumentExtractor didn't want to actually parse the
    file.

  * It writes the tmpFile even when it won't use it, in the
    OLE10_NATIVE case (because we use a TikaInputStream from the
    in-RAM byte[] instead).
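
To make the intended ordering concrete, here is a minimal sketch using
only the JDK. It is not the actual patch: Entry, handle, shouldParse and
tmpFilesWritten are hypothetical names made up for illustration, and the
Predicate merely stands in for asking the EmbeddedDocumentExtractor
whether it wants the document. The point is only that the tmp file is
written after both checks have passed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Predicate;

public class LazySpoolSketch {
    /** Hypothetical stand-in for an embedded entry; not a Tika or POI type. */
    static final class Entry {
        final String name;
        final byte[] bytes;
        final boolean ole10Native;

        Entry(String name, byte[] bytes, boolean ole10Native) {
            this.name = name;
            this.bytes = bytes;
            this.ole10Native = ole10Native;
        }
    }

    /** Counts tmp files actually written, so the behavior can be observed. */
    static int tmpFilesWritten = 0;

    static String handle(Entry entry, Predicate<String> shouldParse) throws IOException {
        // 1. Ask first: if nobody wants this document, do no work at all.
        if (!shouldParse.test(entry.name)) {
            return "skipped";
        }
        // 2. Bytes already in RAM: stream from the byte[] and skip the tmp file.
        if (entry.ole10Native) {
            try (InputStream in = new ByteArrayInputStream(entry.bytes)) {
                return "in-memory:" + in.readAllBytes().length;
            }
        }
        // 3. Only now pay for the tmp file, and always clean it up.
        Path tmp = Files.createTempFile("embedded-", ".bin");
        try {
            Files.write(tmp, entry.bytes);
            tmpFilesWritten++;
            try (InputStream in = Files.newInputStream(tmp)) {
                return "tmp-file:" + in.readAllBytes().length;
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

With this ordering, a rejected document and an OLE10_NATIVE document
both leave tmpFilesWritten untouched; only the remaining case touches
disk.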

Also I fixed a typo in the method name (embeded -> embedded) -- is
that OK?  It's a protected method, and a few of the office parsers
invoke it.

Finally, I cut over to TemporaryResources to track the possible tmpFile
and to open the TikaInputStream against it.
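
For readers unfamiliar with the idea, the general pattern is an object
that remembers every tmp file it hands out and deletes them all when it
is closed. The sketch below is a JDK-only illustration of that pattern,
not Tika's TemporaryResources class; TempFileTracker and its methods are
hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration: tracks tmp files and deletes them on close(). */
public class TempFileTracker implements AutoCloseable {
    private final List<Path> files = new ArrayList<>();

    /** Creates a tmp file and remembers it for later cleanup. */
    public Path createTempFile() throws IOException {
        Path p = Files.createTempFile("tracked-", ".tmp");
        files.add(p);
        return p;
    }

    /** Number of tmp files currently tracked. */
    public int trackedCount() {
        return files.size();
    }

    /** Deletes every tracked file; rethrows the first failure, if any. */
    @Override
    public void close() throws IOException {
        IOException first = null;
        for (Path p : files) {
            try {
                Files.deleteIfExists(p);
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                }
            }
        }
        files.clear();
        if (first != null) {
            throw first;
        }
    }
}
```

Used in a try-with-resources block, the tracker costs nothing when no
tmp file is ever requested, which suits a method where the tmp file is
only "possible".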

Separately, it's currently inefficient that we must serialize a sub-dir
(DirectoryEntry) in the NPOIFSFileSystem to a tmp file, only to re-parse
it back into an NPOIFSFileSystem in OfficeParser. I'd like to look into
instead (somehow) passing the NPOIFSFileSystem's DirectoryEntry directly
to OfficeParser... but that looks like a bigger change.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
