RE: Memory issues with PDF parser
1) Right, the NPE is caused by the exception returning null when we call getMessage(). In TIKA-1605, we modified all code in the project to check for null returned by getMessage(). So, in the fixed version, you'll still get your good old IOException. I can't tell from your stacktrace what caused the IOException.

2) Yes, regular builds of 1.9's app (and other modules) are available via Jenkins here: https://builds.apache.org/view/Tika/job/tika-trunk-jdk1.7/org.apache.tika$tika-app/

3) Ok, makes sense.

For kicks, you may want to change opening the file to:

is = TikaInputStream.get(file)

or maybe:

is = TikaInputStream.get(file, metadata)

And you'll want to surround your closing of the IS in a try/catch block. Or use IOUtils.closeQuietly.

Finally, are you able to share the particular file that caused the IOException?

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapa...@fitchratings.com]
Sent: Thursday, June 04, 2015 10:20 AM
To: Allison, Timothy B.; talli...@apache.org
Cc: user@tika.apache.org; Sauparna Sarkar
Subject: RE: Memory issues with PDF parser

Hi Timothy,

Thanks for the prompt reply.

1.) Wouldn't fixing the null pointer exception in turn throw the IO exception? I saw that the null pointer exception was thrown inside the catch block of the IO exception. Is there a root cause for the IO exception? Is that also fixed?

I am including the code that threw the null pointer exception in Tika 1.8.

Exception:

10:53:12,218 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) java.lang.NullPointerException
10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)

Code in the pdf parser:

catch (IOException e) {
    //nonseq parser throws IOException for bad password
    //At the Tika level, we want the same exception to be thrown
    if (e.getMessage().contains("Error (CryptographyException)")) {
        metadata.set("pdf:encrypted", Boolean.toString(true));
        throw new EncryptedDocumentException(e);
    }

2.) Do you have a snapshot or beta version of Tika 1.9 that I could try with our PDF corpus? It would also help in your developer testing.

3.) For the inline images, we have just kept the defaults (which is to skip them, as you had mentioned). I have not done any memory profiling yet. I will also try that.

Thanks,
MG

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, June 04, 2015 7:19 AM
To: Mouthgalya Ganapathy; talli...@apache.org
Cc: user@tika.apache.org
Subject: RE: Memory issues with PDF parser

Hi Mouthgalya,

We fixed that NPE in https://issues.apache.org/jira/browse/TIKA-1605, and the fix will be available in Tika 1.9, which should be out within a week.

As for memory issues, we worked around a memory leak in PDFBox with static caching of fonts for Tika 1.7 (may have been 1.8), but there may be others. One potential memory hog is the processing of inline images within PDFs...have you configured Tika to pull those out (default is to skip them)?

Other than that, I'd recommend dropping a note to the PDFBox users list to get help in diagnosing memory consumption with PDFBox. Have you tried any memory profiling?

Best,
Tim

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapa...@fitchratings.com]
Sent: Wednesday, June 03, 2015 3:25 PM
To: talli...@apache.org
Subject: Memory issues with PDF parser

Hi all,

I am trying to use Apache Tika 1.8 for extracting contents from PDFs. I have the below code for extracting it. It works well for a few files. But if I read many files, I see an out of memory exception. I also see a null pointer exception in the PDF parser. I think the null pointer exception is because of the memory exception. Any suggestions?

Tika version:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-server</artifactId>
    <version>1.8</version>
</dependency>

I am running it as a part of J2EE APP in JBoss 1.7

Code:-

//Parse the pdf content using Apache Tika
InputStream is = null;
try {
    is = new BufferedInputStream(new FileInputStream(input));
    //Disable write limit.
    contenthandler = new BodyContentHandler(-1);
    metadata = new Metadata();
    pdfparser = new PDFParser();
    context = new ParseContext();
    pdfparser.parse(is, contenthandler, metadata, context);
    docBody=contenthandler.toString();
    //System.out.println(contenthandler.toString());
} catch (Exception e) {
    System.out.println("Exception in updating docbody for report == " + report.getDocID());
    if(is==null
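
The null check described in point 1 of the reply above can be sketched roughly as follows. This is only an illustration using the JDK, not the actual TIKA-1605 patch; the class and method names (NullSafeMessageCheck, messageContains) are hypothetical.

```java
import java.io.IOException;

// Hypothetical helper for illustration; this is not Tika's code.
class NullSafeMessageCheck {

    // Returns true only when the exception carries a message that contains the marker.
    // A null exception or a null message yields false instead of a NullPointerException.
    static boolean messageContains(Throwable t, String marker) {
        String message = (t == null) ? null : t.getMessage();
        return message != null && message.contains(marker);
    }

    public static void main(String[] args) {
        String marker = "Error (CryptographyException)";

        // An IOException created without a message: getMessage() returns null.
        IOException noMessage = new IOException();
        // An IOException whose message carries the marker text quoted in the thread.
        IOException encrypted = new IOException(marker);

        System.out.println(messageContains(noMessage, marker)); // prints false
        System.out.println(messageContains(encrypted, marker)); // prints true
    }
}
```

With a check like this the catch block no longer fails on a message-less IOException, so the original IOException is what reaches the caller.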
RE: Memory issues with PDF parser
You will get the same exception. If you run the pure Tika app commandline on a triggering file, does it at least show you the "caused by" clause that might give more information?

Other question: Are you sure that you want to avoid parsing attachments?

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapa...@fitchratings.com]
Sent: Thursday, June 04, 2015 2:55 PM
To: Allison, Timothy B.
Cc: user@tika.apache.org; Sauparna Sarkar
Subject: RE: Memory issues with PDF parser

Thanks for the update Timothy,

I see that Tika 1.9-SNAPSHOT is available in the Maven repo. I am going to try that and will use TikaInputStreams. I will update the results.

Given below is the IO exception that I get when I use AutoDetectParser to extract PDF contents. I had used Tika 1.6 and PDFBox 1.8.9. I am guessing I will get the same or a similar exception when I run it with 1.9-SNAPSHOT.

1:27:53,921 WARN [org.hornetq.core.client.impl.ClientSessionImpl] (Thread-4 (HornetQ-client-global-threads-248507153)) resetting session after failure
[Server:research-etl-server] 21:29:16,314 INFO [stdout] (Thread-12 (HornetQ-client-global-threads-248507153)) Exception in updating docbody for report == RPT_720610
[Server:research-etl-server] 21:29:23,817 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@29fe5969
[Server:research-etl-server] 21:29:23,818 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:250)
[Server:research-etl-server] 21:29:23,818 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
[Server:research-etl-server] 21:29:23,820 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
[Server:research-etl-server] 21:29:23,820 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at com.fitch.researchapi.dao.ResearchReportMDAO.updateDocBody(ResearchReportMDAO.java:888)
[Server:research-etl-server] 21:29:23,820 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at com.fitch.researchapi.dao.ResearchReportMDAO.loadFile_NEW(ResearchReportMDAO.java:983)
[Server:research-etl-server] 21:29:23,821 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at com.fitch.researchapi.dao.ResearchReportMDAO.upsert_NEW(ResearchReportMDAO.java:678)
[Server:research-etl-server] 21:29:23,821 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at com.fitch.research.ejb.ResearchReportManagerBean.processResearchReport(ResearchReportManagerBean.java:70)
[Server:research-etl-server] 21:29:23,822 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
[Server:research-etl-server] 21:29:23,822 WARN [org.hornetq.core.server.impl.ServerSessionImpl] (hornetq-failure-check-thread) Cleared up resources for session dc692df4-0a50-11e5-8aa3-005056900299
[Server:research-etl-server] 21:29:23,822 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[Server:research-etl-server] 21:29:23,823 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at java.lang.reflect.Method.invoke(Method.java:597)
[Server:research-etl-server] 21:29:23,823 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)
[Server:research-etl-server] 21:29:23,823 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
[Server:research-etl-server] 21:29:23,824 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53)
[Server:research-etl-server] 21:29:23,824 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36)
[Server:research-etl-server] 21:29:23,824 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

Thanks,
Mouthgalya Ganapathy
Product Development Team

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, June 04, 2015 12:50 PM
To: Mouthgalya Ganapathy
Cc: user@tika.apache.org; Sauparna Sarkar
Subject: RE: Memory issues
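
The stack trace above stops at the wrapping exception and shows no "caused by" clause. One way to surface the underlying cause from application code is to walk the getCause() chain before logging. The sketch below uses only the JDK; the class and method names (CauseChain, rootCause, describe) are hypothetical, and the RuntimeException stands in for the wrapping exception seen in the log.

```java
import java.io.IOException;

// Hypothetical helper for illustration; not part of Tika.
class CauseChain {

    // Follows getCause() links and returns the innermost exception
    // (or the argument itself when it has no cause).
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    // Renders every link of the chain, outermost first, one per line.
    static String describe(Throwable t) {
        StringBuilder sb = new StringBuilder();
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (sb.length() > 0) {
                sb.append("\nCaused by: ");
            }
            sb.append(c.getClass().getName()).append(": ").append(c.getMessage());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a parser exception that wraps an IOException.
        Exception wrapped = new RuntimeException(
                "Illegal IOException from parser",
                new IOException("underlying failure"));

        System.out.println(describe(wrapped));
        System.out.println("Root cause: " + rootCause(wrapped));
    }
}
```

Logging the result of describe(e) in the catch block, instead of only e.getMessage(), would keep the inner IOException visible even when the container truncates the trace.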
RE: Memory issues with PDF parser
Hi Mouthgalya,

We fixed that NPE in https://issues.apache.org/jira/browse/TIKA-1605, and the fix will be available in Tika 1.9, which should be out within a week.

As for memory issues, we worked around a memory leak in PDFBox with static caching of fonts for Tika 1.7 (may have been 1.8), but there may be others. One potential memory hog is the processing of inline images within PDFs...have you configured Tika to pull those out (default is to skip them)?

Other than that, I'd recommend dropping a note to the PDFBox users list to get help in diagnosing memory consumption with PDFBox. Have you tried any memory profiling?

Best,
Tim

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapa...@fitchratings.com]
Sent: Wednesday, June 03, 2015 3:25 PM
To: talli...@apache.org
Subject: Memory issues with PDF parser

Hi all,

I am trying to use Apache Tika 1.8 for extracting contents from PDFs. I have the below code for extracting it. It works well for a few files. But if I read many files, I see an out of memory exception. I also see a null pointer exception in the PDF parser. I think the null pointer exception is because of the memory exception. Any suggestions?

Tika version:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-server</artifactId>
    <version>1.8</version>
</dependency>

I am running it as a part of J2EE APP in JBoss 1.7

Code:-

//Parse the pdf content using Apache Tika
InputStream is = null;
try {
    is = new BufferedInputStream(new FileInputStream(input));
    //Disable write limit.
    contenthandler = new BodyContentHandler(-1);
    metadata = new Metadata();
    pdfparser = new PDFParser();
    context = new ParseContext();
    pdfparser.parse(is, contenthandler, metadata, context);
    docBody=contenthandler.toString();
    //System.out.println(contenthandler.toString());
} catch (Exception e) {
    System.out.println("Exception in updating docbody for report == " + report.getDocID());
    if(is==null)
        System.out.println("The input stream is a null object");
    e.printStackTrace();
    logger.log(Level.SEVERE, e.getMessage(), e);
} finally {
    if (is != null)
        is.close();
    contenthandler=null;
    metadata=null;
    pdfparser=null;
    context =null;
}

Exception:-

I am just including the null pointer exception in the parser below.

10:53:11,696 INFO [stdout] (Thread-11 (HornetQ-client-global-threads-1619682129)) Exception in updating docbody for report == RPT_764268
10:53:12,218 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) java.lang.NullPointerException
10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)
10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at com.fitch.researchapi.dao.ResearchReportMDAO.updateDocBody(ResearchReportMDAO.java:881)
10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at com.fitch.researchapi.dao.ResearchReportMDAO.loadFile_NEW(ResearchReportMDAO.java:965)
10:53:12,220 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at com.fitch.researchapi.dao.ResearchReportMDAO.upsert_NEW(ResearchReportMDAO.java:676)
10:53:12,220 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at com.fitch.research.ejb.ResearchReportManagerBean.processResearchReport(ResearchReportManagerBean.java:70)
10:53:12,221 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
10:53:12,221 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
10:53:12,222 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at java.lang.reflect.Method.invoke(Method.java:597)
10:53:12,222 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)
10:53:12,223 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,223 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53)
10:53:12,223 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at
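
The finally block in the quoted code calls is.close() directly, and close() can itself throw an IOException that would escape the method. A minimal sketch of closing the stream quietly, using only the JDK, is shown here. The class and the closeQuietly helper are hypothetical stand-ins written for this example (they are not the Commons IO implementation), and a ByteArrayInputStream takes the place of the PDF file stream.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration.
class QuietClose {

    // Closes the resource if it is non-null and swallows any IOException from close(),
    // so a failure while closing cannot mask an exception thrown while parsing.
    static void closeQuietly(Closeable resource) {
        if (resource == null) {
            return;
        }
        try {
            resource.close();
        } catch (IOException ignored) {
            // Intentionally ignored: there is nothing useful to do if close() fails.
        }
    }

    public static void main(String[] args) {
        InputStream is = null;
        try {
            // Stand-in for opening the PDF file.
            is = new ByteArrayInputStream(new byte[] {1, 2, 3});
            System.out.println(is.read()); // prints 1
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            closeQuietly(is);
        }
    }
}
```

On Java 7 or later a try-with-resources statement achieves much the same effect without a helper.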