[ https://issues.apache.org/jira/browse/TIKA-751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-751:
------------------------------------

    Attachment: TIKA-751.patch

Patch.

> Small improvements to how embedded docs are parsed in AbstractPOIFSExtractor.handleEmbeddedOfficeDoc
> ----------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-751
>                 URL: https://issues.apache.org/jira/browse/TIKA-751
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 1.0
>
>         Attachments: TIKA-751.patch
>
>
> I noticed some minor things in this method:
>   * It does too much work (writes the tmpFile out) even when the
>     EmbeddedDocumentExtractor doesn't want to parse the file.
>   * It writes the tmpFile even when it won't use it, in the
>     OLE10_NATIVE case (because we use a TikaInputStream from the
>     in-RAM byte[] instead).
> Also I fixed a typo in the method name (embeded -> embedded) -- is
> that OK?  It's a protected method, and a few of the office parsers
> invoke it.
> Finally, I cut over to TemporaryResources to track the possible
> tmpFile and to open a TikaInputStream against it.
> Separately, it's inefficient that we must currently serialize a
> sub-dir (DirectoryEntry) in the NPOIFileSystem to a tmp file only to
> re-parse it back to an NPOIFileSystem in OfficeParser; I'd like to
> look into instead (somehow) directly passing the NPOIFileSystem's
> DirectoryEntry to OfficeParser... but that looks like a bigger change.
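
The first two points in the description (do no work when the extractor declines the document, and skip the tmp file when the bytes are already in RAM) can be illustrated with a minimal, self-contained sketch in plain Java. This is not the attached patch and does not use Tika's API: `Extractor`, `EntrySource`, `RecordingExtractor`, and `handleEmbedded` are hypothetical stand-ins invented for the illustration, and the temp file is managed with `java.nio.file.Files` rather than TemporaryResources.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class LazyTempFileSketch {

    /** Hypothetical stand-in for an embedded-document extractor (not Tika's interface). */
    public interface Extractor {
        boolean shouldParse(String name);
        void parse(InputStream stream, String name) throws IOException;
    }

    /** Hypothetical source that serializes an embedded entry on demand. */
    public interface EntrySource {
        byte[] serialize() throws IOException;
    }

    /** Simple extractor that counts the bytes it was asked to parse. */
    public static final class RecordingExtractor implements Extractor {
        private final boolean accept;
        public int parsedBytes = 0;

        public RecordingExtractor(boolean accept) { this.accept = accept; }

        @Override public boolean shouldParse(String name) { return accept; }

        @Override public void parse(InputStream stream, String name) throws IOException {
            while (stream.read() != -1) parsedBytes++;
        }
    }

    /**
     * Parses one embedded entry and returns true only if a temp file was written.
     * inRamBytes may be null, meaning the entry is not already held in memory.
     */
    public static boolean handleEmbedded(String name, byte[] inRamBytes,
                                         EntrySource source, Extractor extractor)
            throws IOException {
        // Ask first: if the extractor declines, nothing is serialized or written.
        if (!extractor.shouldParse(name)) {
            return false;
        }
        // Bytes already in RAM: stream them directly, no temp file needed.
        if (inRamBytes != null) {
            try (InputStream in = new ByteArrayInputStream(inRamBytes)) {
                extractor.parse(in, name);
            }
            return false;
        }
        // Only now pay for serialization to a temp file, and always clean it up.
        Path tmp = Files.createTempFile("embedded-", ".bin");
        try {
            Files.write(tmp, source.serialize());
            try (InputStream in = Files.newInputStream(tmp)) {
                extractor.parse(in, name);
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        RecordingExtractor accepting = new RecordingExtractor(true);
        boolean wrote = handleEmbedded("sub.doc", null, () -> new byte[] {1, 2, 3}, accepting);
        System.out.println("temp file written: " + wrote + ", bytes parsed: " + accepting.parsedBytes);
    }
}
```

The ordering is the whole point of the sketch: the cheap `shouldParse` check comes first, the in-memory path second, and the tmp file is created only when neither shortcut applies.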