[ https://issues.apache.org/jira/browse/TIKA-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rob Tulloh updated TIKA-836:
----------------------------

    Description:
We are seeing that tika sometimes takes a very long time to parse some content (likely PDF). For example, with the following EML file that contains 4 documents (2 PDF, 1 MS Excel, 1 text):

{noformat}
fgrep --binary-file=text Content-Type: XXX.eml
Content-Type: multipart/mixed;
Content-Type: multipart/alternative;
Content-Type: text/plain;
Content-Type: text/html;
Content-Type: application/octet-stream;
Content-Type: application/octet-stream;
Content-Type: application/vnd.ms-excel;

du -sh XXX.eml
6.0M XXX.eml
{noformat}

Note that it takes tika nearly 30 minutes to process this content even though the source is only 6M in size:

{noformat}
time java -Xmx2G -jar ../../tika-app-1.0.jar -m XXX.eml >meta.out
WARN - Did not found XRef object at specified startxref position 230521
WARN - Did not found XRef object at specified startxref position 3742379

real 29m16.913s
user 18m17.050s
sys 0m19.465s
{noformat}

Is there any way to configure tika (in particular via solr) to process files more quickly?

  was:
We are seeing that tika sometimes takes a very long time to parse some content (likely PDF). For example, with the following EML file that contains 4 documents (2 PDF, 1 MS Excel, 1 text):

{noformat}
fgrep --binary-file=text Content-Type: XXX.eml
Content-Type: multipart/mixed;
Content-Type: multipart/alternative;
Content-Type: text/plain;
Content-Type: text/html;
Content-Type: application/octet-stream;
Content-Type: application/octet-stream;
Content-Type: application/vnd.ms-excel;

du -sh XXX.eml
6.0M 1326391801.eml
{noformat}

Note that it takes tika nearly 30 minutes to process this content even though the source is only 6M in size:

{noformat}
time java -Xmx2G -jar ../../tika-app-1.0.jar -m XXX.eml >meta.out
WARN - Did not found XRef object at specified startxref position 230521
WARN - Did not found XRef object at specified startxref position 3742379

real 29m16.913s
user 18m17.050s
sys 0m19.465s
{noformat}

Is there any way to configure tika (in particular via solr) to process files more quickly?

> parsing really slow on some documents
> -------------------------------------
>
>                 Key: TIKA-836
>                 URL: https://issues.apache.org/jira/browse/TIKA-836
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>         Environment: CentOS 4.x/5.x/6.x
>            Reporter: Rob Tulloh
>
> We are seeing that tika sometimes takes a very long time to parse some
> content (likely PDF). For example, with the following EML file that contains
> 4 documents (2 PDF, 1 MS Excel, 1 text):
> {noformat}
> fgrep --binary-file=text Content-Type: XXX.eml
> Content-Type: multipart/mixed;
> Content-Type: multipart/alternative;
> Content-Type: text/plain;
> Content-Type: text/html;
> Content-Type: application/octet-stream;
> Content-Type: application/octet-stream;
> Content-Type: application/vnd.ms-excel;
> du -sh XXX.eml
> 6.0M XXX.eml
> {noformat}
> Note that it takes tika nearly 30 minutes to process this content even though
> the source is only 6M in size:
> {noformat}
> time java -Xmx2G -jar ../../tika-app-1.0.jar -m XXX.eml >meta.out
> WARN - Did not found XRef object at specified startxref position 230521
> WARN - Did not found XRef object at specified startxref position 3742379
> real 29m16.913s
> user 18m17.050s
> sys 0m19.465s
> {noformat}
> Is there any way to configure tika (in particular via solr) to process files
> more quickly?
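
To narrow down which attachment accounts for the parse time, it may help to list each MIME part of the EML with its decoded size, a slightly richer version of the fgrep listing above. The following is only a minimal sketch using Python's standard `email` package; the function name `list_parts` is illustrative and is not part of Tika or Solr.

```python
import email
from email import policy


def list_parts(raw_bytes):
    """Return (content_type, filename, decoded_size) for each leaf MIME part."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # multipart containers carry no payload of their own
        # decode=True undoes base64 / quoted-printable transfer encoding
        payload = part.get_payload(decode=True) or b""
        parts.append((part.get_content_type(), part.get_filename(), len(payload)))
    return parts
```

Calling `list_parts(open("XXX.eml", "rb").read())` returns one tuple per part, so the two `application/octet-stream` parts (presumably the PDFs) can be told apart by file name and size before each is fed to tika separately.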