Re: ExtractText command line tool - modify code

Tilman Hausherr Tue, 01 Mar 2016 10:28:54 -0800

On 01.03.2016 at 13:33, Nicolas Paris wrote:

Hello,


My use case is that I extract text from the same PDF in two ways: one sorted
and one non-sorted.
This process takes 2 seconds. That is too long (I have 1M PDFs to extract).

I wonder if it would be feasible to modify the code (
https://github.com/apache/pdfbox/blob/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractText.java)
in order to combine the two actions into one.

The output would be something like
extractSorted
separator
extractNonSorted

And the command line would be "pdfbox..extractText -combine -nonSort -sort".

Maybe this is not a good idea. In that case, do you have any advice on how to
improve extraction performance?

You could write software that does both extractions in parallel (it should use different PDDocument objects).
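
A minimal sketch of that idea, not a tested implementation: two tasks run concurrently and their results are joined with a separator, as in the output layout proposed above. The class name ParallelExtract and the Extractor interface are hypothetical and exist only for illustration. The PDFBox calls appear in a comment only and assume the 2.0 API (PDDocument.load, PDFTextStripper.setSortByPosition, PDFTextStripper.getText); each task would load its own PDDocument so that nothing is shared between threads.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelExtract {

    /**
     * Hypothetical abstraction over one extraction run. With PDFBox 2.0 an
     * implementation might look like this (assumption, not verified here):
     *
     *   try (PDDocument doc = PDDocument.load(file)) {   // own document per call
     *       PDFTextStripper stripper = new PDFTextStripper();
     *       stripper.setSortByPosition(sortByPosition);
     *       return stripper.getText(doc);
     *   }
     */
    public interface Extractor {
        String extract(boolean sortByPosition) throws Exception;
    }

    /** Runs the sorted and non-sorted extraction concurrently and joins the results. */
    public static String extractBoth(Extractor extractor, String separator) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Callable<String> sortedTask = () -> extractor.extract(true);
            Callable<String> unsortedTask = () -> extractor.extract(false);
            Future<String> sorted = pool.submit(sortedTask);
            Future<String> unsorted = pool.submit(unsortedTask);
            // get() blocks until each task finishes and rethrows failures
            // wrapped in an ExecutionException.
            return sorted.get() + separator + unsorted.get();
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this actually halves the time depends on how many cores are free; with 1M files it may be just as effective to process several files at once, each in a single thread.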

Re performance: the current snapshot is a bit faster than RC3, thanks to PDFBOX-3224, which improved performance by about 20%.

I don't have a suggestion for how to improve performance, other than to use a fast computer with enough memory. Or try other products:

https://pdfliberation.wordpress.com/

But I think PDFBox is not that bad, considering this project:
https://github.com/jsonstein/HRC-emails-PDF2TXT

Tilman


