[jira] [Commented] (TIKA-1464) Too many open files in system when parsing thousands of files

2014-11-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195991#comment-14195991
 ] 

Tim Allison commented on TIKA-1464:
---

I haven't run into this on the 1m doc govdocs1 corpus, which is roughly 23% 
pdfs, 21% html, 8% doc, 6% xls and 5% ppt, with only a handful of (doc|xls|ppt)x 
and 0 msg files.  Can you tell, from [~lfcnassif]'s recommendation or from 
experimentation, whether you run into this issue if you process only pdfs, only 
MSOffice or only msgs?
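
One way to set up that experiment is to bucket the corpus by file extension and 
run each bucket through the parser separately. A minimal JDK-only sketch follows; 
the class name CorpusBuckets, the method byExtension, and the use of extensions 
as a stand-in for real type detection are assumptions for illustration, not 
something taken from this issue:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: groups a corpus by file extension so that each
// group (pdf only, msg only, ...) can be processed in its own run.
class CorpusBuckets {

    /** Lower-cased extension of a path, or "" when the name has no dot. */
    static String extensionOf(Path p) {
        String name = p.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Walks root recursively and groups regular files by extension (sorted keys). */
    static Map<String, List<Path>> byExtension(Path root) throws IOException {
        // Files.walk keeps directory handles open, so the stream is closed here.
        try (Stream<Path> files = Files.walk(root)) {
            return files.filter(Files::isRegularFile)
                        .collect(Collectors.groupingBy(CorpusBuckets::extensionOf,
                                                       TreeMap::new,
                                                       Collectors.toList()));
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, List<Path>> buckets = byExtension(Paths.get(args[0]));
        buckets.forEach((ext, paths) ->
                System.out.println(ext + ": " + paths.size() + " files"));
    }
}
```

Running the existing pipeline once per bucket (for example only the "msg" list) 
and watching lsof during each run should show which type builds up handles.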

 Too many open files in system when parsing thousands of files
 -

 Key: TIKA-1464
 URL: https://issues.apache.org/jira/browse/TIKA-1464
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
 Environment: Os-X 10.10, Windows 8.1 (probably all op systems)
Reporter: Tim Barrett
Priority: Blocker
  Labels: TooManyOpenFilesInSystem

 Our big data project parses many thousands of different kinds of files 
 sequentially. Up to and including Tika 1.5 this has been trouble free and 
 Tika has been a pleasure to use. The files parsed are PDF, MSOffice and MSG 
 files in roughly equal measure.
 We switched to Tika 1.6 last week and this was a good enhancement for us as a 
 number of files (MSOffice) that previously failed to parse do now parse 
 correctly under Tika 1.6.
 However we have seen that a Too many open files in system exception is raised 
 somewhere above 1 files having been parsed. On a windows server this 
 exception is not raised but the system eventually begins to crawl.
 Watching the system's behaviour with the apache tmp files we see that the 
 apache tika files *are* being deleted from the file system, but lsof is 
 showing all these files as remaining open by the running process using Tika. 
 It would appear that the files are being deleted but handles to these files 
 are not being cleared.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1464) Too many open files in system when parsing thousands of files

2014-11-04 Thread Tim Barrett (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196046#comment-14196046
 ] 

Tim Barrett commented on TIKA-1464:
---

I did suspect that the file types might be the culprit. Another of our 
projects contains only PDF and MSOffice files, so no MSG files. That one runs 
without problems, although it is not as large as the set which eventually errors 
out. So I cannot be 100% sure that MSG files are the culprit, but I have a 
sneaking suspicion that they *are*. Many of our msg files contain 
embedded msg files, and/or PDF, MSOffice, image files etc. I am 100% confident 
we are not leaking unclosed input streams: as I have already pointed out, 
pre-1.6 Tika runs smoothly without any build-up of open files.



[jira] [Commented] (TIKA-1464) Too many open files in system when parsing thousands of files

2014-11-04 Thread Tim Barrett (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196301#comment-14196301
 ] 

Tim Barrett commented on TIKA-1464:
---

Parsing a few thousand files that are a mixture of PDF, MSOffice, text and 
various image files, without *any* MSG files at all, is stable: no tmp files are 
shown as remaining open while the process is running.

Indexing over 1000 MSG files immediately shows tika tmp files being left open 
(even though they are deleted from the disk).
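
The "deleted but still open" behaviour can be reproduced and measured from 
inside the JVM without lsof. The sketch below is a hedged illustration using 
only the JDK; the class name OpenFdProbe and its methods are made up for this 
example, and the descriptor count is only available where the JVM exposes 
com.sun.management.UnixOperatingSystemMXBean (so not on Windows):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical probe: shows that deleting a file does not release its
// descriptor while a stream on it is still open.
class OpenFdProbe {

    /** Open descriptor count for this JVM, or -1 if the platform bean does not expose it. */
    static long openFds() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os)
                    .getOpenFileDescriptorCount();
        }
        return -1;
    }

    /** Opens n streams on temp files, deletes the files, and returns the still-open streams. */
    static List<FileInputStream> openAndDelete(int n) throws IOException {
        List<FileInputStream> streams = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            File tmp = File.createTempFile("probe-", ".tmp");
            streams.add(new FileInputStream(tmp));
            // On POSIX systems the name disappears here, but the descriptor
            // stays allocated until the stream is closed.
            tmp.delete();
        }
        return streams;
    }

    public static void main(String[] args) throws IOException {
        long before = openFds();
        List<FileInputStream> held = openAndDelete(50);
        long during = openFds();
        for (FileInputStream in : held) {
            in.close();
        }
        long after = openFds();
        System.out.println("before=" + before + " during=" + during + " after=" + after);
    }
}
```

Logging openFds() every few hundred documents in the real indexing loop would 
give a trend line for the MSG-only and no-MSG runs.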



[jira] [Commented] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-04 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196343#comment-14196343
 ] 

Hong-Thai Nguyen commented on TIKA-1463:


Thanks [~lfcnassif], it does indeed also work without .exe. BTW, a path 
containing a space is buggy.
I am leaving this fix in because adding .exe only on Windows doesn't hurt 
anything.

 TesseractOCRParser does not work in Windows
 ---

 Key: TIKA-1463
 URL: https://issues.apache.org/jira/browse/TIKA-1463
 Project: Tika
  Issue Type: Bug
Reporter: Hong-Thai Nguyen

 STR:
 * Case 1:
 ** Set tesseractPath to a common installation path of Tesseract:  
 C:\Program Files (x86)\Tesseract-OCR
 ** The check for an available Tesseract command always returns false
 * Case 2:
 ** Even with tesseractPath set to a value without spaces, say C:\Tesseract-OCR
 ** the command run to check for tesseract on Windows is not correct: 
 C:\Tesseract-OCR\tesseract; it must be C:\Tesseract-OCR\tesseract.exe
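
 Both cases can be sketched in plain Java. This is a hedged illustration, not 
 the code in TesseractOCRParser: the class TesseractCommand and its methods are 
 hypothetical, and "-v" as the version flag is an assumption about the 
 installed tesseract. The point is that the path is passed to ProcessBuilder as 
 a single list element, so a space in the directory needs no manual quoting, 
 and that ".exe" is appended only on Windows:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not Tika's implementation.
class TesseractCommand {

    /** Builds the executable path; osName is a parameter so the logic can be checked on any OS. */
    static String executable(String tesseractPath, String osName) {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        String dir = tesseractPath;
        // Add a separator only when a directory was given and it lacks one.
        if (!dir.isEmpty() && !dir.endsWith("/") && !dir.endsWith("\\")) {
            dir += windows ? "\\" : "/";
        }
        return dir + (windows ? "tesseract.exe" : "tesseract");
    }

    /** Argument list for an availability check; each element is passed as one argument. */
    static List<String> versionCheck(String tesseractPath, String osName) {
        return Arrays.asList(executable(tesseractPath, osName), "-v");
    }

    public static void main(String[] args) throws Exception {
        String path = args.length > 0 ? args[0] : "";
        List<String> cmd = versionCheck(path, System.getProperty("os.name"));
        // start() throws IOException if the executable cannot be found.
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        System.out.println("exit code: " + p.waitFor());
    }
}
```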





[jira] [Commented] (TIKA-1464) Too many open files in system when parsing thousands of files

2014-11-04 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196414#comment-14196414
 ] 

Luis Filipe Nassif commented on TIKA-1464:
--

Could you run the file leak detector while processing the MSG files? It can 
provide the stack trace of where those file handles were opened.
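
For anyone unfamiliar with the idea, here is a hedged sketch of what such a 
detector records. It is not the tool itself: a real detector hooks the JDK's 
own file classes, whereas this hypothetical TrackedInputStream only covers 
streams that you wrap yourself. Each open stream is remembered together with a 
Throwable captured at the point where it was opened, and forgotten on close():

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical wrapper illustrating the bookkeeping behind a leak report.
class TrackedInputStream extends FilterInputStream {

    /** Streams that are currently open, mapped to a Throwable captured where they were opened. */
    static final Map<TrackedInputStream, Throwable> OPEN = new ConcurrentHashMap<>();

    TrackedInputStream(InputStream in, String label) {
        super(in);
        // The Throwable is never thrown; it only carries the opening stack trace.
        OPEN.put(this, new Throwable("opened " + label
                + " by thread:" + Thread.currentThread().getName()));
    }

    @Override
    public void close() throws IOException {
        OPEN.remove(this);
        super.close();
    }

    /** Prints one stack trace per tracked stream that has not been closed yet. */
    static void dump() {
        System.err.println(OPEN.size() + " tracked streams are open");
        for (Throwable where : OPEN.values()) {
            where.printStackTrace();
        }
    }
}
```

Calling TrackedInputStream.dump() after a batch then lists the code paths 
whose streams were never closed.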



[jira] [Issue Comment Deleted] (TIKA-1464) Too many open files in system when parsing thousands of files

2014-11-04 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1464:
--
Comment: was deleted

(was: On Windows 7 with Tika 1.7-SNAPSHOT, on a batch of 3k msg files that have 
many attachments, the most I can get with a 4 thread process is 12 descriptors 
open at a time according to the leak detector.

The Windows task manager shows no more than 300 files open at a time for the 
full process.

{noformat}
12 descriptors are open
#1 ...file1.msg by thread:pool-1-thread-7 on Tue Nov 04 15:21:03 EST 2014
    at java.io.FileInputStream.<init>(FileInputStream.java:147)
    at org.mitre.tallison.complexfiles.tika.TikaDefaultTextExtractor.extract(Unknown Source)
...
#2 ...\Local\Temp\apache-tika-7913996149639657350.tmp by thread:pool-1-thread-7 on Tue Nov 04 15:21:03 EST 2014
    at java.io.FileInputStream.<init>(FileInputStream.java:147)
    at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:542)
    at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:378)
    at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:168)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
...
#3 ...Local\Temp\apache-tika-7913996149639657350.tmp by thread:pool-1-thread-7 on Tue Nov 04 15:21:03 EST 2014
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:242)
    at org.apache.poi.poifs.nio.FileBackedDataSource.newSrcFile(FileBackedDataSource.java:130)
    at org.apache.poi.poifs.nio.FileBackedDataSource.<init>(FileBackedDataSource.java:46)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:192)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:163)
    at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:381)
    at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:168)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
...
#4 ...file2.msg by thread:pool-1-thread-4 on Tue Nov 04 15:21:03 EST 2014
    at java.io.FileInputStream.<init>(FileInputStream.java:147)
    at org.mitre.tallison.complexfiles.tika.TikaDefaultTextExtractor.extract(Unknown Source)

#5 ...Local\Temp\apache-tika-4841525150295414987.tmp by thread:pool-1-thread-4 on Tue Nov 04 15:21:03 EST 2014
    at java.io.FileInputStream.<init>(FileInputStream.java:147)
    at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:542)
    at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:378)
    at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:168)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.mitre.tallison.complexfiles.tika.RecursiveMetadataParser.parse(Unknown Source)

#6 ...Local\Temp\apache-tika-4841525150295414987.tmp by thread:pool-1-thread-4 on Tue Nov 04 15:21:03 EST 2014
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:242)
    at org.apache.poi.poifs.nio.FileBackedDataSource.newSrcFile(FileBackedDataSource.java:130)
    at org.apache.poi.poifs.nio.FileBackedDataSource.<init>(FileBackedDataSource.java:46)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:192)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:163)
    at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:381)
    at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:168)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)

#7 ...file3.msg by thread:pool-1-thread-6 on Tue Nov 04 15:21:03 EST 2014
    at java.io.FileInputStream.<init>(FileInputStream.java:147)
    at org.mitre.tallison.complexfiles.tika.TikaDefaultTextExtractor.extract(Unknown Source)

#8 ...Local\Temp\apache-tika-877604211166291703.tmp by thread:pool-1-thread-6 on Tue Nov 04 15:21:03 EST 2014
    at 

[jira] [Commented] (TIKA-1464) Too many open files in system when parsing thousands of files

2014-11-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196765#comment-14196765
 ] 

Tim Allison commented on TIKA-1464:
---

On Windows 7, on a batch of 3k msg files that have many attachments, the 
most I can get with a 4 thread process is 12 descriptors open at a time 
according to the leak detector.

The Windows task manager shows no more than 300 files open at a time for the 
full process.

[jira] [Comment Edited] (TIKA-1464) Too many open files in system when parsing thousands of files

2014-11-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196765#comment-14196765
 ] 

Tim Allison edited comment on TIKA-1464 at 11/4/14 8:35 PM:


On Windows 7 with Tika 1.7-SNAPSHOT, on a batch of 3k msg files that have many 
attachments, the most I can get with a 4 thread process is 12 descriptors open 
at a time according to the leak detector.

The Windows task manager shows no more than 300 files open at a time for the 
full process.

[jira] [Commented] (TIKA-1464) Too many open files in system when parsing thousands of files

2014-11-04 Thread frank (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197270#comment-14197270
 ] 

frank commented on TIKA-1464:
-

I suggest setting the fix version to 1.7.
