[ 
https://issues.apache.org/jira/browse/TIKA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492536#comment-15492536
 ] 

Tim Barrett commented on TIKA-2058:
-----------------------------------

private void processFileEmbeddedInMsg(InformationGranule msgGranule, Path resourceFilePath,
        ResourceSet parentResourceSet, AttachmentChunks attachment)
        throws IOException, Throwable, SAXException, TikaException {

    ByteArrayInputStream byteInputStream = null;

    try {

        if (attachment.attachData != null) {

            boolean isEmbeddedInMessage = true;

            Path msgGranuleParentPath = resourceFilePath.getParent();

            byteInputStream = new ByteArrayInputStream(attachment.attachData.getValue());

            String embeddedFileName = null;

            if (attachment.attachLongFileName != null && !attachment.attachLongFileName.toString().isEmpty()) {

                embeddedFileName = attachment.attachLongFileName.toString();

            } else {

                if (attachment.attachFileName != null && !attachment.attachFileName.toString().isEmpty()) {

                    embeddedFileName = attachment.attachFileName.toString();

                }

            }

            if (embeddedFileName != null) {

                if (embeddedFileName.length() > 200) {

                    logger.warn("Embedded attachment has filename longer than 200 characters: " + embeddedFileName);

                    String embeddedFileExtension = NalandaStringUtilities.getTailLastOccurrence('.', embeddedFileName);

                    StringBuilder strBldrEmbeddedFileName = new StringBuilder();
                    strBldrEmbeddedFileName.append(UUID.randomUUID().toString());
                    strBldrEmbeddedFileName.append(".");
                    strBldrEmbeddedFileName.append(embeddedFileExtension);

                    embeddedFileName = strBldrEmbeddedFileName.toString();

                    logger.warn("Embedded attachment has filename with long name saved as " + embeddedFileName);

                }

                NalandaResourceHandler attachmentResourceHandler = new NalandaResourceHandler(this.parentResourceSet,
                        this.jsonParseFailures, this.jsonPasswordFailures, this.filesCouldNotParseList);

                Path embeddedResourcePath = this.writeAttachmentToAttachmentsFolder((Resource) msgGranule, embeddedFileName,
                        byteInputStream, parentResourceSet, msgGranuleParentPath, false, null);

                if (ResourceSetAccessor.getResourceType(new File(embeddedFileName)).equals(RESOURCE_TYPE.ZIP)) {

                    embeddedResourcePath = embeddedResourcePath.getParent();

                }

                attachmentResourceHandler.processEmbeddedResource(msgGranule, embeddedFileName, null, parentResourceSet,
                        embeddedResourcePath, null, null, null, isEmbeddedInMessage);

            }
        }

    } finally {

        if (byteInputStream != null) {

            byteInputStream.close();

        }

    }
}
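
For readers without access to the application's helper classes, the long-filename step in the method above can be sketched in isolation. This is only an illustration, not the application's actual code: the class name AttachmentNames and the method shorten are hypothetical, and String.lastIndexOf stands in for NalandaStringUtilities.getTailLastOccurrence, whose exact behaviour (for example when the name contains no dot) is not shown in this comment.

```java
import java.util.UUID;

// Hypothetical stand-alone version of the filename-shortening step.
class AttachmentNames {

    // Same threshold as in the method above.
    static final int MAX_NAME_LENGTH = 200;

    // Returns the name unchanged if it is short enough; otherwise returns
    // a random UUID followed by the original extension, if there is one.
    static String shorten(String name) {
        if (name == null || name.length() <= MAX_NAME_LENGTH) {
            return name;
        }
        // Assumption: the extension is whatever follows the last '.'.
        int dot = name.lastIndexOf('.');
        String extension = (dot >= 0 && dot < name.length() - 1)
                ? name.substring(dot + 1)
                : "";
        String id = UUID.randomUUID().toString();
        return extension.isEmpty() ? id : id + "." + extension;
    }
}
```

As in the original method, the extension itself is not length-checked, so a very long name whose last dot appears early would still produce a long result.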



> Memory Leak in Tika version 1.13 when parsing millions of files
> ---------------------------------------------------------------
>
>                 Key: TIKA-2058
>                 URL: https://issues.apache.org/jira/browse/TIKA-2058
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.13
>            Reporter: Tim Barrett
>         Attachments: Yourkit screenshot.png, poi-3.15-beta1-p1.jar, 
> poi-3.15-beta1-p1.pom, prevents-OOM-when-writable-is-false.patch, 
> screenshot-1.png, screenshot-2.png, screenshot-3.png
>
>
> We have an application using Tika that parses roughly 7,000,000 files of
> different types; many of them are MSG files with attachments. This works
> correctly with Tika 1.9 and has been in production for over a year, with
> parsing runs taking place every few weeks. With Tika 1.13, the same
> application runs out of memory (Java heap).
> I have used lsof and File Leak Detector to look for open files, but
> neither shows any open files while the application is running. I did find an
> issue with open files, https://issues.apache.org/jira/browse/TIKA-2015,
> but there was a workaround for it and it is not the cause here.
> I am sorry to have to report this so vaguely, but with lsof
> turning nothing up I am a bit stuck as to how to investigate further. We are
> more than willing to help by testing any ideas provided.
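
Since lsof turns up no open files, one possible next step is to watch the Java heap directly during a parsing run. The following is a minimal sketch under stated assumptions, not part of the reported application: the class HeapWatch and its method onFileParsed are hypothetical names, and the only real API used is the standard java.lang.management MemoryMXBean.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper: reports heap usage every `interval` parsed files.
class HeapWatch {

    private final MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
    private final int interval;
    private long processed = 0;

    HeapWatch(int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.interval = interval;
    }

    // Call once per parsed file. Returns a report line on every
    // `interval`-th call and null otherwise.
    String onFileParsed() {
        processed++;
        if (processed % interval != 0) {
            return null;
        }
        MemoryUsage heap = memory.getHeapMemoryUsage();
        long usedMb = heap.getUsed() / (1024 * 1024);
        // getMax() returns -1 when the maximum is undefined.
        long maxMb = heap.getMax() < 0 ? -1 : heap.getMax() / (1024 * 1024);
        return "files=" + processed + " heapUsedMB=" + usedMb + " heapMaxMB=" + maxMb;
    }
}
```

Logging the returned line, for instance every 10,000 files, would show whether used heap climbs steadily under 1.13 compared with 1.9. A heap dump taken at a high-water mark could then be inspected in a profiler.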



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
