Small improvements to how embedded docs are parsed in
AbstractPOIFSExtractor.handleEmbeddedOfficeDoc
----------------------------------------------------------------------------------------------------
Key: TIKA-751
URL: https://issues.apache.org/jira/browse/TIKA-751
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 1.0
I noticed some minor things in this method:
* It does too much work (writes the tmpFile out) even if the
EmbeddedDocumentExtractor doesn't actually want to parse the
file.
* It writes the tmpFile even though it won't use it in the
OLE10_NATIVE case (because we use a TikaInputStream from the in-RAM
byte[] instead).
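
The intended ordering of those two checks can be sketched roughly as
follows. This is a stdlib-only illustration, not the actual Tika code:
the Extractor interface, the handle method, and the isOle10Native flag
are hypothetical stand-ins for EmbeddedDocumentExtractor and the real
method, and a plain ByteArrayInputStream stands in for the
TikaInputStream built from the in-RAM byte[].

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class EmbeddedDocSketch {
    /** Hypothetical stand-in for the extractor; not the real Tika interface. */
    interface Extractor {
        boolean shouldParse(String name);
        void parse(InputStream in, String name) throws IOException;
    }

    /** Returns which path was taken: "skipped", "in-memory" or "temp-file". */
    static String handle(Extractor extractor, String name, byte[] data,
                         boolean isOle10Native) throws IOException {
        // 1. Ask first, so no temp file is written for unwanted documents.
        if (!extractor.shouldParse(name)) {
            return "skipped";
        }
        // 2. The bytes are already in RAM: stream them directly, no temp file.
        if (isOle10Native) {
            try (InputStream in = new ByteArrayInputStream(data)) {
                extractor.parse(in, name);
            }
            return "in-memory";
        }
        // 3. Only now pay for the temp file, and always clean it up.
        Path tmp = Files.createTempFile("embedded-", ".bin");
        try {
            Files.write(tmp, data);
            try (InputStream in = Files.newInputStream(tmp)) {
                extractor.parse(in, name);
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
        return "temp-file";
    }
}
```

The point of the sketch is only the order of the checks: the
"do we want this at all?" question comes before any disk I/O.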
Also I fixed a typo in the method name (embeded -> embedded) -- is
that OK? It's a protected method, and a few of the office parsers
invoke it.
Finally, I cut over to TemporaryResources to track the possible tmpFile
and open a TikaInputStream against it.
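
As a rough illustration of what "tracking the possible tmpFile" buys
us, here is a stdlib-only sketch of the pattern. TempTracker and its
methods are hypothetical names made up for this example; it is not
Tika's TemporaryResources class, only the same general idea of
creating temp files through one object that deletes them on close.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: remembers every temp file it hands out. */
final class TempTracker implements AutoCloseable {
    private final List<Path> files = new ArrayList<>();

    /** Creates a temp file and records it for later cleanup. */
    Path createTempFile() throws IOException {
        Path p = Files.createTempFile("embedded-", ".tmp");
        files.add(p);
        return p;
    }

    /** Number of temp files currently tracked. */
    int trackedCount() {
        return files.size();
    }

    /** Deletes all tracked files; safe to call if none were created. */
    @Override
    public void close() throws IOException {
        for (Path p : files) {
            Files.deleteIfExists(p);
        }
        files.clear();
    }
}
```

Used in a try-with-resources block, the temp file (if one was ever
created) goes away on every exit path, including exceptions.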
Separately, it's currently inefficient that we must serialize a sub-dir
(DirectoryEntry) in the NPOIFSFileSystem to a tmp file only to re-parse
it back to an NPOIFSFileSystem in OfficeParser; I'd like to look into
instead (somehow) directly passing the NPOIFSFileSystem's DirectoryEntry
to OfficeParser... but that looks like a bigger change.