[ https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195991#comment-14195991 ]
Tim Allison commented on TIKA-1464:
-----------------------------------

I haven't run into this on the 1M-document govdocs1 corpus, which is roughly 23% pdfs, 21% html, 8% doc, 6% xls and 5% ppt, with only a handful of (doc|xls|ppt)x and 0 msg files.

Can you tell, from [~lfcnassif]'s recommendation or from experimentation, whether you run into this issue if you process only pdfs, only MSOffice or only msgs?

> Too many open files in system when parsing thousands of files
> -------------------------------------------------------------
>
> Key: TIKA-1464
> URL: https://issues.apache.org/jira/browse/TIKA-1464
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.6
> Environment: Os-X 10.10, Windows 8.1 (probably all op systems)
> Reporter: Tim Barrett
> Priority: Blocker
> Labels: TooManyOpenFilesInSystem
>
> Our big data project parses many thousands of different kinds of files
> sequentially. Up to and including Tika 1.5 this has been trouble free and
> Tika has been a pleasure to use. The files parsed are PDF, MSOffice and MSG
> files in roughly equal measure.
> We switched to Tika 1.6 last week and this was a good enhancement for us as a
> number of files (MSOffice) that previously failed to parse do now parse
> correctly under Tika 1.6.
> However we have seen that a Too many open files in system exception is raised
> somewhere above 10000 files having been parsed. On a windows server this
> exception is not raised but the system eventually begins to crawl.
> Watching the system's behaviour with the apache tmp files we see that the
> apache tika files *are* being deleted from the file system, but lsof is
> showing all these files as remaining open by the running process using Tika.
> It would appear that the files are being deleted but handles to these files
> are not being cleared.
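
The symptom quoted above (temp files deleted from disk but still listed by lsof) is what an unclosed stream looks like on a Unix-like system. The following is a minimal sketch of that effect in plain Java, not Tika code and not a diagnosis of this issue. The class and method names are hypothetical, and the descriptor count relies on the JDK's `com.sun.management.UnixOperatingSystemMXBean`, which is not available on Windows (the sketch returns -1 there).

```java
import com.sun.management.UnixOperatingSystemMXBean;
import java.io.IOException;
import java.io.InputStream;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; nothing here comes from Tika itself.
public class FdLeakSketch {

    /** Open descriptors held by this JVM, or -1 where the count is unavailable. */
    public static long openFds() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            return ((UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
        }
        return -1;
    }

    /**
     * Creates n temp files, opens a stream on each, then deletes the file
     * while the stream is still open. The open streams are returned so the
     * caller decides when to close them.
     */
    public static List<InputStream> openAndDelete(int n) throws IOException {
        List<InputStream> held = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            Path tmp = Files.createTempFile("fd-sketch-", ".tmp");
            held.add(Files.newInputStream(tmp));
            // The directory entry is removed, but the descriptor stays open.
            Files.delete(tmp);
        }
        return held;
    }

    public static void main(String[] args) throws IOException {
        long before = openFds();
        List<InputStream> held = openAndDelete(50);
        long whileHeld = openFds();
        for (InputStream in : held) {
            in.close(); // closing releases the descriptor
        }
        long afterClose = openFds();
        System.out.println("before=" + before
                + " whileHeld=" + whileHeld
                + " afterClose=" + afterClose);
    }
}
```

Logging `openFds()` after each batch of a single file type (only PDF, only MSOffice, only MSG) would be one way to approach the per-type question in the comment above. In calling code, opening each input in a try-with-resources block ensures the stream is closed even when parsing throws.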