[ https://issues.apache.org/jira/browse/TIKA-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127758#comment-13127758 ]
Michael McCandless commented on TIKA-753:
-----------------------------------------

I noticed that parsing an embedded Office document is inefficient: we take the NPOIFSFileSystem we had already parsed (from the full document) and write the "sub-directory" containing the embedded document to a temp file, only to re-parse it once we've recursed to the inner detector/parser.

I worked out a patch that instead passes the sub-directory of the embedded document directly to the inner detector/parser. This gives a good speedup in my test case: I have a private test set of 2,080 Word docs; parsing them (and their embedded docs) takes 16.1 sec on trunk and 10.7 sec with this patch, about 34% faster (best of 10).

The change has a few parts:

  * Fixed all Office parsers to alternatively take the document root (DirectoryNode) directly; this was straightforward (but touched a lot of sources) because internally these parsers were extracting that root anyway.

  * Fixed AbstractPOIFSExtractor to skip the serialization to a temp file and instead set the document's root as the openContainer on an otherwise empty (new byte[0]) TikaInputStream.

  * Fixed OfficeParser and POIFSContainerDetector to recognize a DirectoryNode on the incoming TikaInputStream, and parse/detect that directly.

The one catch I hit was a failure in POIContainerExtractionTest, due to the already-fixed bug 51949 in POI (NPE on double-close of ZipFileZipEntrySource). I added a workaround for this in ParsingEmbeddedDocumentExtractor, with a TODO to remove it once POI releases and we upgrade. It's important to remove the workaround, because we are now double-opening the ZIP archive for embedded OOXML docs...

I also converted a couple of if/else string-equality chains into HashMap lookups.
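For readers unfamiliar with the last point, the conversion from an if/else chain of string comparisons to a HashMap lookup might look roughly like the sketch below. This is not code from the patch: the class name `EntryTypeLookup`, the method names, and the particular entry-name-to-media-type table are assumptions made purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: EntryTypeLookup and its table are hypothetical,
// not the classes or mappings touched by the TIKA-753 patch.
final class EntryTypeLookup {
    private static final String UNKNOWN = "application/octet-stream";

    // Built once; each lookup is a single hash probe instead of N comparisons.
    private static final Map<String, String> TYPE_BY_ENTRY = new HashMap<>();
    static {
        TYPE_BY_ENTRY.put("WordDocument", "application/msword");
        TYPE_BY_ENTRY.put("Workbook", "application/vnd.ms-excel");
        TYPE_BY_ENTRY.put("PowerPoint Document", "application/vnd.ms-powerpoint");
    }

    // "Before" shape: a chain of string equality checks.
    static String typeByChain(String entryName) {
        if ("WordDocument".equals(entryName)) {
            return "application/msword";
        } else if ("Workbook".equals(entryName)) {
            return "application/vnd.ms-excel";
        } else if ("PowerPoint Document".equals(entryName)) {
            return "application/vnd.ms-powerpoint";
        } else {
            return UNKNOWN;
        }
    }

    // "After" shape: the same mapping expressed as a table lookup.
    static String typeByMap(String entryName) {
        return TYPE_BY_ENTRY.getOrDefault(entryName, UNKNOWN);
    }
}
```

Both methods return the same result for any input; the map version keeps the mapping in one place, which makes it easier to extend as more entry names are added.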
> Improve performance when parsing embedded Office docs
> -----------------------------------------------------
>
>                 Key: TIKA-753
>                 URL: https://issues.apache.org/jira/browse/TIKA-753
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 1.0