What JDK are you using?
Java 8? 11? 13? i.e. a version that is currently in active support?
Are you using the latest release of that version?
Have you switched on GC logging and checked whether GC is the issue?
Is it constantly doing GC? You might need to tweak the arguments,
depending on which collector you are using.
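
As a concrete starting point: on JDK 9+ GC logging is enabled with
-Xlog:gc* (on JDK 8 it was -XX:+PrintGCDetails -Xloggc:<file>). You can
also ask a running JVM which collectors are active and how often they
have fired; a minimal sketch (class name is my own invention):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    public static void main(String[] args) {
        // Print each active collector with its cumulative collection
        // count and total collection time so far.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

If the counts climb rapidly while the big PDF is being parsed, that
points at the heap, not the parser.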

If you take a look at the classic GC diagram from, say, here:
https://geekspearls.blogspot.com/2016/02/how-java-garbage-collection-works.html

if your file is 400MB and it isn't streamed, then eden might need to be
larger than its default size; otherwise eden fills up instantly and
objects get promoted straight to the tenured generation.
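
One way to sanity-check this (a sketch, not specific to your setup):
the young generation can be sized explicitly with -Xmn (or
-XX:NewSize/-XX:MaxNewSize), and you can print the eden/survivor/tenured
pool sizes from inside the JVM:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class HeapPools {
    public static void main(String[] args) {
        // List each memory pool (eden, survivor, old gen, plus non-heap
        // pools) with its current usage and maximum size; a max of -1
        // means the limit is undefined for that pool.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            long usedMb = pool.getUsage().getUsed() / (1024 * 1024);
            long maxMb = pool.getUsage().getMax() / (1024 * 1024);
            System.out.printf("%s: used=%d MB, max=%d MB%n",
                    pool.getName(), usedMb, maxMb);
        }
    }
}
```

Run it with and without -Xmn to see how eden changes; whether a larger
eden actually helps still depends on the collector in use.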

The GC logs should give you an idea.

Have you tried restarting and seeing whether parsing is faster when the
big PDF is the first file to be processed?

John

On Thu, 14 Nov 2019 at 16:16, Ribeaud, Christian (Ext)
<christian.ribe...@novartis.com> wrote:
>
> Good evening,
>
>
>
> No, I am NOT using tika-server. And uh, I am a bit surprised to hear (read) 
> that PDFBox does NOT stream the PDF.
>
> So let’s wait for PDFBox colleagues feedback. Thanks anyway for yours.
>
>
>
> christian
>
>
>
> From: Tim Allison <talli...@apache.org>
> Sent: Donnerstag, 14. November 2019 15:07
> To: user@tika.apache.org
> Cc: us...@pdfbox.apache.org
> Subject: Re: Parsing huge PDF (400Mb, 2700 pages)
>
>
>
> CC'ing colleagues on PDFBox...any recommendations?
>
>
>
> Sergey's recommendation is great for documents that can be parsed via 
> streaming.  However, PDFBox does not currently parse PDFs in a streaming 
> mode.  It builds the full document tree -- PDFBox colleagues let me know if 
> I'm wrong.
>
>
>
> On Thu, Nov 14, 2019 at 5:51 AM Sergey Beryozkin <sberyoz...@gmail.com> wrote:
>
> Hi,
>
> Are you using tika-server? If so, and you can submit the data as a
> multipart/form-data payload, that may help: CXF (used by tika-server)
> makes a best effort to save multipart payloads to temporary locations
> on disk, and thus minimizes the memory requirements.
>
>
>
> Cheers, Sergey
>
>
>
>
>
> On Thu, Nov 14, 2019 at 10:21 AM Ribeaud, Christian (Ext) 
> <christian.ribe...@novartis.com> wrote:
>
> Hi,
>
> My application handles all kinds of documents (mainly PDFs). In a very few
> cases, you might encounter huge PDFs (< 500MB).
>
> At around 400MB I am hitting the wall: parsing takes ages (although it is
> quite fast at the beginning). I've tried several ideas but none of them
> brought the desired improvement.
>
> I have the impression that memory plays a role. I have no more than 3GB (and
> I think this should be enough, as we are streaming the document and using an
> event-based XML parser).
>
> Are there things I should be aware of?
>
> Any hint would be very welcome. Thanks and have a nice day,
>
> christian