[ https://issues.apache.org/jira/browse/TIKA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492522#comment-15492522 ]
Tim Barrett commented on TIKA-2058:
-----------------------------------

private void processMsgEmbeddedInMsg(InformationGranule msgGranule, Path resourceFilePath,
        ResourceSet parentResourceSet, AttachmentChunks attachment) throws Throwable {

    InputStream embeddedMsgFilePathInputStream = null;
    OutputStream outStream = null;
    POIFSFileSystem poifsFileSystem = null;

    try {
        MAPIMessage embeddedMAPIMessage = attachment.getEmbeddedMessage();

        poifsFileSystem = new POIFSFileSystem();
        EntryUtils.copyNodes(attachment.attachmentDirectory.getDirectory(), poifsFileSystem.getRoot());

        Path targetDir = FileSystems.getDefault().getPath(resourceFilePath.getParent().toString() + "/attachments");

        /*
         * Creates directory if not already there
         */
        try {
            Files.createDirectory(targetDir);
        } catch (IOException ignore) {
        }

        String embeddedMessageName = null;
        try {
            String conversationTopic = embeddedMAPIMessage.getConversationTopic();
            conversationTopic = NalandaStringUtilities.stripSpecialCharactersFromString(conversationTopic);
            embeddedMessageName = conversationTopic + ".msg";
        } catch (ChunkNotFoundException cnfe) {
            embeddedMessageName = this.messageNameCounter + ".msg";
            this.messageNameCounter++;
        }

        if (embeddedMessageName != null) {

            if (embeddedMessageName.length() > 200) {
                logger.warn("Embedded attachment has filename longer than 200 characters: " + embeddedMessageName);
                StringBuilder strBldrEmbeddedFileName = new StringBuilder();
                strBldrEmbeddedFileName.append(UUID.randomUUID().toString());
                strBldrEmbeddedFileName.append(".msg");
                embeddedMessageName = strBldrEmbeddedFileName.toString();
                logger.warn("Embedded attachment has filename with long name saved as " + embeddedMessageName);
            }

            File msgFileToWrite = new File(targetDir.toString() + "/" + embeddedMessageName);
            outStream = new FileOutputStream(msgFileToWrite);
            poifsFileSystem.writeFilesystem(outStream);
            outStream.close();

            Path embeddedMsgFilePath = FileSystems.getDefault().getPath(msgFileToWrite.getPath());
            embeddedMsgFilePathInputStream = Files.newInputStream(embeddedMsgFilePath);

            NalandaResourceHandler attachmentResourceHandler = new NalandaResourceHandler(this.parentResourceSet,
                    this.jsonParseFailures, this.jsonPasswordFailures, this.filesCouldNotParseList);

            boolean isEmbeddedInMsg = true;
            attachmentResourceHandler.processEmbeddedResource(msgGranule, msgFileToWrite.getName(),
                    embeddedMsgFilePathInputStream, parentResourceSet, embeddedMsgFilePath, null, null, null,
                    isEmbeddedInMsg);
        }
    } catch (Throwable t) {
        logger.warn("Exception occurred processing embedded message in: " + msgGranule.getValue()
                + " embedded message has not been processed", t);
    } finally {
        if (poifsFileSystem != null) {
            // poifsFileSystem.close();
        }
        if (embeddedMsgFilePathInputStream != null) {
            embeddedMsgFilePathInputStream.close();
        }
        if (outStream != null) {
            outStream.close();
        }
    }
}

> Memory Leak in Tika version 1.13 when parsing millions of files
> ---------------------------------------------------------------
>
>                 Key: TIKA-2058
>                 URL: https://issues.apache.org/jira/browse/TIKA-2058
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.13
>            Reporter: Tim Barrett
>         Attachments: Yourkit screenshot.png, poi-3.15-beta1-p1.jar,
> poi-3.15-beta1-p1.pom, prevents-OOM-when-writable-is-false.patch,
> screenshot-1.png, screenshot-2.png, screenshot-3.png
>
>
> We have an application using Tika which parses roughly 7,000,000 files of
> different types, many of the files are MSG files with attachments. This works
> correctly with Tika 1.9, and has been in production for over a year, with
> parsing runs taking place every few weeks. The same application runs into
> insufficient memory problems (java heap) when using Tika 1.13.
> I have used lsof and file leak detector to track down open files, however
> neither shows any open files when the application is running. I did find an
> issue with open files https://issues.apache.org/jira/browse/TIKA-2015,
> however there was a workaround for this and this is not the issue.
> I am sorry to have to report this with a level of vagueness, but with lsof
> turning nothing up I am a bit stuck as to how to investigate further. We are
> more than willing to help by testing on the basis of any ideas provided.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)