
RE: Memory issues with PDF parser

2015-06-04 Thread Allison, Timothy B.

1)  Right, the NPE is caused by the exception returning null when we call 
getMessage().  In TIKA-1605, we modified all code in the project to check for 
null returned by getMessage().  So, in the fixed version, you'll still get 
your good old IOException.  I can't tell from your stack trace what caused the 
IOException.

2)  Yes, regular builds of 1.9's app (and other modules) are available via 
Jenkins here: 
https://builds.apache.org/view/Tika/job/tika-trunk-jdk1.7/org.apache.tika$tika-app/

3)  Ok, makes sense.
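
To illustrate point 1, here is a minimal sketch of the kind of null check described there, using only the JDK. The class and method names (NullSafeMessageCheck, isCryptoError) are made up for this example and are not Tika's actual code; the marker string is the one from the PDFParser snippet quoted further down in this thread.

```java
import java.io.IOException;

final class NullSafeMessageCheck {
    // Marker text taken from the PDFParser snippet quoted in this thread.
    static final String CRYPTO_MARKER = "Error (CryptographyException)";

    /** Returns true only when the exception has a message containing the marker. */
    static boolean isCryptoError(IOException e) {
        String msg = e.getMessage(); // null for e.g. new IOException()
        return msg != null && msg.contains(CRYPTO_MARKER);
    }

    public static void main(String[] args) {
        // No message: the unguarded contains() call would throw an NPE here.
        System.out.println(isCryptoError(new IOException()));              // prints false
        System.out.println(isCryptoError(new IOException(CRYPTO_MARKER))); // prints true
    }
}
```

With a guard like this, an IOException without a message simply falls through and can be rethrown as the original IOException instead of being hidden behind an NPE.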

For kicks, you may want to change opening the file to:
is = TikaInputStream.get(file)
or maybe:
is = TikaInputStream.get(file, metadata)

And you'll want to wrap the closing of the InputStream in a try/catch block, 
or use IOUtils.closeQuietly.
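
A rough sketch of the close-in-finally pattern, in plain JDK Java so it runs without Tika on the classpath. The closeQuietly helper below is hand-rolled for illustration (it is not the commons-io method, and the boolean return value is my own addition), and the ByteArrayInputStream is only a stand-in for the stream you would get from TikaInputStream.get(file).

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;

final class QuietClose {
    /** Closes the resource if non-null; returns false if close() threw. */
    static boolean closeQuietly(Closeable c) {
        if (c == null) {
            return true; // nothing to close
        }
        try {
            c.close();
            return true;
        } catch (IOException e) {
            // Swallowed on purpose so a failed close cannot mask
            // an exception thrown earlier by the parse.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream is = null;
        try {
            // Stand-in for: is = TikaInputStream.get(file);
            is = new ByteArrayInputStream(new byte[] {1, 2, 3});
            System.out.println(is.read()); // prints 1
        } finally {
            closeQuietly(is);
        }
    }
}
```

The point of the design is that the finally block itself can no longer throw, so whatever exception came out of the parse is the one you see in the logs.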

Finally, are you able to share the particular file that caused the IOException?
From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapa...@fitchratings.com]
Sent: Thursday, June 04, 2015 10:20 AM
To: Allison, Timothy B.; talli...@apache.org
Cc: user@tika.apache.org; Sauparna Sarkar
Subject: RE: Memory issues with PDF parser

Hi Timothy,
Thanks for the prompt reply.


1.) Wouldn't fixing the null pointer exception in turn throw the IO 
exception? I saw that the null pointer exception was thrown inside the catch 
block of the IO exception. Is there a known root cause for the IO exception? 
Is that also fixed?



I am including the code that threw the null pointer exception in Tika 1.8.



Exception:
10:53:12,218 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129)) java.lang.NullPointerException
10:53:12,219 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)



Code in the pdf parser:
catch (IOException e) {
    //nonseq parser throws IOException for bad password
    //At the Tika level, we want the same exception to be thrown
    if (e.getMessage().contains("Error (CryptographyException)")) {
        metadata.set("pdf:encrypted", Boolean.toString(true));
        throw new EncryptedDocumentException(e);
    }


2.) Do you have a snapshot or beta version of Tika 1.9 that I could try with 
our PDF corpus? It would also help in your developer testing.

3.) For the inline images, we have just kept the defaults (which is to skip 
them, as you mentioned). I have not done any memory profiling so far. I 
will also try that.



Thanks,
MG

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, June 04, 2015 7:19 AM
To: Mouthgalya Ganapathy; talli...@apache.org
Cc: user@tika.apache.org
Subject: RE: Memory issues with PDF parser

Hi Mouthgalya,
  We fixed that NPE in https://issues.apache.org/jira/browse/TIKA-1605, and the 
fix will be available in Tika 1.9, which should be out within a week.
  As for memory issues, we worked around a memory leak in PDFBox with static 
caching of fonts for Tika 1.7 (may have been 1.8), but there may be others.  
One potential memory hog is the processing of inline images within PDFs...have 
you configured Tika to pull those out (default is to skip them)?  Other than 
that, I'd recommend dropping a note to the PDFBox users list to get help in 
diagnosing memory consumption with PDFBox.  Have you tried any memory profiling?

  Best,

Tim

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapa...@fitchratings.com]
Sent: Wednesday, June 03, 2015 3:25 PM
To: talli...@apache.org
Subject: Memory issues with PDF parser

Hi all,
I am trying to use Apache Tika 1.8 to extract content from PDFs. I have the 
code below for extracting it. It works well for a few files, but if I read many 
files, I see an out of memory exception.
I also see a null pointer exception in the PDF parser. I think the null pointer 
exception is caused by the memory exception.
Any suggestions?

Tika version:
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-server</artifactId>
  <version>1.8</version>
</dependency>

I am running it as a part of J2EE APP in JBoss 1.7

Code:-

//Parse the pdf content using Apache Tika
InputStream is = null;
try {
  is = new BufferedInputStream(new FileInputStream(input));
  //Disable write limit.
  contenthandler = new BodyContentHandler(-1);
  metadata = new Metadata();
  pdfparser = new PDFParser();
  context = new ParseContext();
  pdfparser.parse(is, contenthandler, metadata, context);
  docBody = contenthandler.toString();
  //System.out.println(contenthandler.toString());
}
catch (Exception e) {
  System.out.println("Exception in updating docbody for report == "
      + report.getDocID());
   if(is==null

RE: Memory issues with PDF parser

2015-06-04 Thread Allison, Timothy B.

You will get the same exception.  If you run the plain Tika app from the 
command line on a triggering file, does it at least show you a "Caused by" 
clause that might give more information?
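
If the command line is not an option, the same information can be pulled out in your own catch block by walking the exception's cause chain. This is only a sketch with plain JDK classes; the class name CauseChain and the "outer"/"inner" messages are invented for the example and do not come from your logs.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class CauseChain {
    /** Returns one line per throwable in the chain, outermost first. */
    static List<String> describe(Throwable t) {
        List<String> lines = new ArrayList<String>();
        String prefix = "";
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            lines.add(prefix + cur.getClass().getName() + ": " + cur.getMessage());
            prefix = "Caused by: ";
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for a wrapper exception around a parser's IOException.
        Exception wrapped = new RuntimeException("outer", new IOException("inner"));
        for (String line : describe(wrapped)) {
            System.out.println(line);
        }
        // prints:
        // java.lang.RuntimeException: outer
        // Caused by: java.io.IOException: inner
    }
}
```

Logging the output of something like describe(e) next to the report ID should show which underlying IOException the wrapper exception is carrying.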

Other question: Are you sure that you want to avoid parsing attachments?


From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapa...@fitchratings.com]
Sent: Thursday, June 04, 2015 2:55 PM
To: Allison, Timothy B.
Cc: user@tika.apache.org; Sauparna Sarkar
Subject: RE: Memory issues with PDF parser

Thanks for the update, Timothy.
I see that Tika 1.9-SNAPSHOT is available in the Maven repo. I am going to try 
that and will use TikaInputStream. I will update you with the results.

Given below is the IO exception that I get when I use AutoDetectParser to 
extract PDF contents. I had used Tika 1.6 and PDFBox 1.8.9. I am guessing I 
will get the same or a similar exception when I run it with 1.9-SNAPSHOT.

1:27:53,921 WARN  [org.hornetq.core.client.impl.ClientSessionImpl] (Thread-4 
(HornetQ-client-global-threads-248507153)) resetting session after failure
[Server:research-etl-server] 21:29:16,314 INFO  [stdout] (Thread-12 
(HornetQ-client-global-threads-248507153)) Exception in updating docbody for 
report == RPT_720610
[Server:research-etl-server] 21:29:23,817 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.pdf.PDFParser@29fe5969
[Server:research-etl-server] 21:29:23,818 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:250)
[Server:research-etl-server] 21:29:23,818 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
[Server:research-etl-server] 21:29:23,820 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
[Server:research-etl-server] 21:29:23,820 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
com.fitch.researchapi.dao.ResearchReportMDAO.updateDocBody(ResearchReportMDAO.java:888)
[Server:research-etl-server] 21:29:23,820 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
com.fitch.researchapi.dao.ResearchReportMDAO.loadFile_NEW(ResearchReportMDAO.java:983)
[Server:research-etl-server] 21:29:23,821 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
com.fitch.researchapi.dao.ResearchReportMDAO.upsert_NEW(ResearchReportMDAO.java:678)
[Server:research-etl-server] 21:29:23,821 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
com.fitch.research.ejb.ResearchReportManagerBean.processResearchReport(ResearchReportManagerBean.java:70)
[Server:research-etl-server] 21:29:23,822 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
[Server:research-etl-server] 21:29:23,822 WARN  
[org.hornetq.core.server.impl.ServerSessionImpl] (hornetq-failure-check-thread) 
Cleared up resources for session dc692df4-0a50-11e5-8aa3-005056900299
[Server:research-etl-server] 21:29:23,822 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[Server:research-etl-server] 21:29:23,823 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
java.lang.reflect.Method.invoke(Method.java:597)
[Server:research-etl-server] 21:29:23,823 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)
[Server:research-etl-server] 21:29:23,823 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
[Server:research-etl-server] 21:29:23,824 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53)
[Server:research-etl-server] 21:29:23,824 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36)
[Server:research-etl-server] 21:29:23,824 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)



Thanks,
Mouthgalya Ganapathy
Product Development Team
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, June 04, 2015 12:50 PM
To: Mouthgalya Ganapathy
Cc: user@tika.apache.org; Sauparna Sarkar
Subject: RE: Memory issues

RE: Memory issues with PDF parser

2015-06-04 Thread Allison, Timothy B.

Hi Mouthgalya,
  We fixed that NPE in https://issues.apache.org/jira/browse/TIKA-1605, and the 
fix will be available in Tika 1.9, which should be out within a week.
  As for memory issues, we worked around a memory leak in PDFBox with static 
caching of fonts for Tika 1.7 (may have been 1.8), but there may be others.  
One potential memory hog is the processing of inline images within PDFs...have 
you configured Tika to pull those out (default is to skip them)?  Other than 
that, I'd recommend dropping a note to the PDFBox users list to get help in 
diagnosing memory consumption with PDFBox.  Have you tried any memory profiling?

  Best,

Tim

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapa...@fitchratings.com]
Sent: Wednesday, June 03, 2015 3:25 PM
To: talli...@apache.org
Subject: Memory issues with PDF parser

Hi all,
I am trying to use Apache Tika 1.8 to extract content from PDFs. I have the 
code below for extracting it. It works well for a few files, but if I read many 
files, I see an out of memory exception.
I also see a null pointer exception in the PDF parser. I think the null pointer 
exception is caused by the memory exception.
Any suggestions?

Tika version:
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-server</artifactId>
  <version>1.8</version>
</dependency>

I am running it as a part of J2EE APP in JBoss 1.7

Code:-

//Parse the pdf content using Apache Tika
InputStream is = null;
try {
  is = new BufferedInputStream(new FileInputStream(input));
  //Disable write limit.
  contenthandler = new BodyContentHandler(-1);
  metadata = new Metadata();
  pdfparser = new PDFParser();
  context = new ParseContext();
  pdfparser.parse(is, contenthandler, metadata, context);
  docBody = contenthandler.toString();
  //System.out.println(contenthandler.toString());
}
catch (Exception e) {
  System.out.println("Exception in updating docbody for report == "
      + report.getDocID());
  if (is == null)
    System.out.println("The input stream is a null object");
  e.printStackTrace();
  logger.log(Level.SEVERE, e.getMessage(), e);
}
finally {
  if (is != null) is.close();
  contenthandler = null;
  metadata = null;
  pdfparser = null;
  context = null;
}


Exception:-
I am just including the null pointer exception in the parser below.

10:53:11,696 INFO  [stdout] (Thread-11 
(HornetQ-client-global-threads-1619682129)) Exception in updating docbody for 
report == RPT_764268
10:53:12,218 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129)) java.lang.NullPointerException
10:53:12,219 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)
10:53:12,219 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
com.fitch.researchapi.dao.ResearchReportMDAO.updateDocBody(ResearchReportMDAO.java:881)
10:53:12,219 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
com.fitch.researchapi.dao.ResearchReportMDAO.loadFile_NEW(ResearchReportMDAO.java:965)
10:53:12,220 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
com.fitch.researchapi.dao.ResearchReportMDAO.upsert_NEW(ResearchReportMDAO.java:676)
10:53:12,220 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
com.fitch.research.ejb.ResearchReportManagerBean.processResearchReport(ResearchReportManagerBean.java:70)
10:53:12,221 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
10:53:12,221 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
10:53:12,222 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
java.lang.reflect.Method.invoke(Method.java:597)
10:53:12,222 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)
10:53:12,223 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,223 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53)
10:53:12,223 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at
