On 01.03.2016 at 20:29, Nicolas Paris wrote:
2016-03-01 19:28 GMT+01:00 Tilman Hausherr <[email protected]>:
On 01.03.2016 at 13:33, Nicolas Paris wrote:
Hello,
My use case: I extract text from the same PDF in two ways, once sorted and
once unsorted.
This process takes 2 seconds. That is too long (I have 1M PDFs to extract).
I wonder whether it would be feasible to modify the code (
https://github.com/apache/pdfbox/blob/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractText.java
)
in order to combine the two actions into one.
The output would be something like
extractSorted
separator
extractNonSorted
And the command line would be "pdfbox..extractText -combine -nonSort -sort".
Maybe this is not a good idea. If so, do you have any advice on how to
improve extraction performance?
You could write a program that does both extractions in parallel (it should
use different PDDocument objects).
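A minimal sketch of the parallel idea, using only the JDK so that it runs without PDFBox on the classpath. The names `ParallelExtract` and `runAll` are hypothetical, and each task is a stand-in: with PDFBox, each task would load its own PDDocument, configure its own PDFTextStripper with setSortByPosition(true) or setSortByPosition(false), and return getText(document).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of PDFBox.
final class ParallelExtract {

    /**
     * Runs every extraction task on its own thread and returns the
     * results in the same order as the tasks were given.
     * Assumption: each task opens and closes its OWN document object,
     * so no document is shared between threads.
     */
    static List<String> runAll(List<Callable<String>> tasks) {
        // newFixedThreadPool rejects 0 threads, hence the lower bound of 1.
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, tasks.size()));
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (Callable<String> task : tasks) {
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that task is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while extracting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("extraction task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Note that each thread would parse the PDF again, so this trades extra CPU and memory for lower wall-clock time per file.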
I made it work just by editing the Java file I mentioned, around line
230. By adding a second
stripper.writeText( document, output );
call with a different configuration, I am able to double performance (for the
use case described in my previous email). I could do that in 2 threads, but I
already run the command in multiple Linux processes.
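For readers who want the shape of that change without patching ExtractText.java, here is a rough sketch. It is not the actual patch: `CombinedExtract`, `Pass` and `writeCombined` are hypothetical names, and the PDFBox calls appear only in comments so that the block compiles with the JDK alone. The point is that the document is loaded once and written twice with a separator in between.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper, not part of PDFBox.
final class CombinedExtract {

    /** One extraction pass: a stand-in for a configured stripper.writeText(document, out). */
    interface Pass {
        void writeTo(Writer out) throws IOException;
    }

    /**
     * Writes the sorted pass, then the separator, then the unsorted pass
     * to the same output.
     *
     * With PDFBox, assuming an already loaded PDDocument "document" and a
     * PDFTextStripper "stripper", the two passes could look like:
     *   Pass sorted   = w -> { stripper.setSortByPosition(true);  stripper.writeText(document, w); };
     *   Pass unsorted = w -> { stripper.setSortByPosition(false); stripper.writeText(document, w); };
     */
    static void writeCombined(Pass sorted, Pass unsorted, String separator, Writer out) {
        try {
            sorted.writeTo(out);
            out.write(separator);
            unsorted.writeTo(out);
            out.flush();
        } catch (IOException e) {
            // Rethrown unchecked so callers do not need a throws clause.
            throw new UncheckedIOException(e);
        }
    }
}
```

Whatever separator is chosen should be a string that cannot occur in the extracted text, otherwise the two halves cannot be split reliably afterwards.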
Re performance: the current snapshot is a bit faster than RC3, thanks to
PDFBOX-3224, which improved performance by about 20%.
You mean the GitHub version I cloned and compiled is not RC3?
Sorry, that is of course the latest snapshot (mirror), so you do already
have max speed.
Tilman
I don't have a suggestion for how to improve performance... use a fast
computer with enough memory. Or try other products:
https://pdfliberation.wordpress.com/
Thanks for the link, I didn't know about them. Actually, I have already
tested others (Python pdfminer, Linux pdf2html), but the ability to "sort"
the text is very important for my PDFs.
But I think PDFBox is not that bad, considering this project:
https://github.com/jsonstein/HRC-emails-PDF2TXT
Tilman
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]