Hi,

I’ve read about the on-demand parser. I might have a look.

Unfortunately, I am NOT allowed to share the PDF.

What I am trying to do is the following: I am writing an AWS Lambda function that 
parses the PDF page by page. The text is extracted and sent to Elasticsearch.

Because of the Lambda environment, I have limited resources: at most 3 GB of memory 
and 15 minutes of runtime.

This setup works marvelously for the majority of the PDFs. With the ones bigger 
than around 400 MB, however, I overrun the time limit.

The problem is NOT Tika related, it is PDFBox related (I checked). So I will have 
to find another strategy for the time being.
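One such strategy (not from this thread, just a hedged sketch) is to split the work across several Lambda invocations: extract pages until a safety margin before the deadline is reached, remember the last completed page, and resume from there in the next invocation. The page-processing callback below is a stand-in for the real extract-and-index step.

```java
import java.util.function.IntConsumer;

public class ChunkedRun {
    /**
     * Processes pages [startPage, totalPages) until less than
     * safetyMarginMillis remains before deadlineMillis.
     * Returns the first page NOT yet processed (== totalPages when finished),
     * so a follow-up invocation knows where to resume.
     */
    static int processUntilDeadline(int startPage, int totalPages,
                                    long deadlineMillis, long safetyMarginMillis,
                                    IntConsumer processPage) {
        int page = startPage;
        while (page < totalPages
                && System.currentTimeMillis() + safetyMarginMillis < deadlineMillis) {
            // Stand-in for: extract text of `page`, send it to Elasticsearch.
            processPage.accept(page);
            page++;
        }
        return page;
    }

    public static void main(String[] args) {
        // Demo with a generous deadline; in a real Lambda the deadline would
        // come from the invocation context (an assumption, not shown here).
        long deadline = System.currentTimeMillis() + 5_000;
        int resumeAt = processUntilDeadline(0, 100, deadline, 1_000,
                p -> { /* pretend to process page p */ });
        System.out.println("resume at page " + resumeAt);
    }
}
```

The resume index could be persisted (e.g. in the triggering message or a small state store) so repeated invocations eventually cover the whole document.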

Thanks to all for the feedback. Very appreciated.

Kind regards and have a nice evening,

christian

From: Tilman Hausherr <thaush...@t-online.de>
Sent: Thursday, 14 November 2019 18:05
To: user@tika.apache.org
Cc: us...@pdfbox.apache.org
Subject: Re: Parsing huge PDF (400Mb, 2700 pages)

The PDF can be much bigger than 3GB when decompressed.

What you could try

1) using a scratch file (will be even slower) when opening the document
2) the on-demand parser, see
https://issues.apache.org/jira/browse/PDFBOX-4569

There is a branch on the SVN server; you have to build from source.
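Option 1 above (a scratch file) can be sketched roughly as follows with the PDFBox 2.x `MemoryUsageSetting` API, extracting one page at a time to keep peak memory lower. This is only a sketch under the assumption that the standard `PDDocument.load(File, MemoryUsageSetting)` overload is used; it is not tested against the 400 MB documents from this thread.

```java
import java.io.File;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ScratchFileExtract {
    public static void main(String[] args) throws Exception {
        // Buffer decoded document data in a temp file instead of the heap.
        // MemoryUsageSetting.setupMixed(maxBytes) would keep a bounded
        // amount in memory and spill the rest to disk.
        try (PDDocument doc = PDDocument.load(new File(args[0]),
                MemoryUsageSetting.setupTempFileOnly())) {
            PDFTextStripper stripper = new PDFTextStripper();
            for (int p = 1; p <= doc.getNumberOfPages(); p++) {
                // Restrict extraction to a single page per pass.
                stripper.setStartPage(p);
                stripper.setEndPage(p);
                String text = stripper.getText(doc);
                // send `text` for page p to Elasticsearch here
            }
        }
    }
}
```

As Tilman notes, the scratch file trades memory for speed, so this alone may not help with the 15-minute limit.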

Tilman

On 14.11.2019 at 17:15, Ribeaud, Christian (Ext) wrote:
Good evening,

No, I am NOT using tika-server. And uh, I am a bit surprised to hear (read) 
that PDFBox does NOT stream the PDF.
So let’s wait for feedback from the PDFBox colleagues. Thanks anyway for yours.

christian

From: Tim Allison <talli...@apache.org>
Sent: Thursday, 14 November 2019 15:07
To: user@tika.apache.org
Cc: us...@pdfbox.apache.org
Subject: Re: Parsing huge PDF (400Mb, 2700 pages)

CC'ing colleagues on PDFBox...any recommendations?

Sergey's recommendation is great for documents that can be parsed via streaming.  
However, PDFBox does not currently parse PDFs in streaming mode; it builds the 
full document tree -- PDFBox colleagues, let me know if I'm wrong.

On Thu, Nov 14, 2019 at 5:51 AM Sergey Beryozkin <sberyoz...@gmail.com> wrote:
Hi,
Are you using tika-server? If yes, and you can submit the data as a 
multipart/form-data payload, that may help: CXF (used by tika-server) makes a best 
effort to save multipart payloads to temporary locations on disk, which minimizes 
the memory requirements.

Cheers, Sergey


On Thu, Nov 14, 2019 at 10:21 AM Ribeaud, Christian (Ext) 
<christian.ribe...@novartis.com> wrote:
Hi,

My application handles all kinds of documents (mainly PDFs). In a very few cases, 
huge PDFs (up to 500 MB) can be expected.

At around 400 MB I hit a wall: parsing takes ages (although it is quite fast at 
the beginning). I've tried several ideas, but none of them brought the desired 
improvement.

I have the impression that memory plays a role. I have no more than 3 GB (and I 
think this should be enough, as we are streaming the document and using an 
event-based XML parser).

Are there things I should be aware of?

Any hint would be very welcome. Thanks and have a nice day,

christian

