[ https://issues.apache.org/jira/browse/PDFBOX-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386144#comment-15386144 ]
Tilman Hausherr commented on PDFBOX-3429:
-----------------------------------------

No need to apologize. Your suspicion is a possible logical conclusion of your observation.

Here's a thesis about lock profiling:
http://www-public.tem-tsp.eu/~thomas_g/research/etudiants/theses/david-phd-thesis.pdf

Sadly, only one product in the first is free, hprof. I tried it, but did not get output like that shown in the thesis: the table at the end did not show any classes or lost time. Either there is nothing, or my test code (which strips 8 identical PDF files at the same time) didn't find it.

The thesis author has his own profiler, Free Lunch, which you can get here:
https://github.com/flodav/FreeLunch
However, it means building a special JVM. A binary is provided for Linux x64 architectures only.

That is this guy:
https://www.linkedin.com/in/florian-david-97824248
http://floriandavid.org/work.php

> Improve ExtractText Concurrency
> -------------------------------
>
>                 Key: PDFBOX-3429
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3429
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.1
>         Environment: Win7, jdk1.8.0_60 x64
>            Reporter: Luis Filipe Nassif
>            Priority: Minor
>              Labels: optimization
>         Attachments: cpu-pdfbox-2.0.1.png, cpu-pdfbox1.8.10.png
>
>
> While testing Tika 1.13, which uses PDFBox 2.0.1, from a multithreaded text
> extraction application, I noted CPU usage around 80% on my 6-core computer
> when processing a dataset of ~75 thousand PDFs (18GB). It took 5min25sec
> to complete the text extraction. With Tika 1.10, which uses PDFBox 1.8.10,
> CPU usage stays around 100%. It took 4min37sec to complete. The dataset is
> read from a ramdrive, so there is no I/O bottleneck. I suspect there is some
> new synchronization code that blocks the threads for a non-trivial amount of
> time, resulting in less CPU usage than before.
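The kind of check described in the comment (running several identical jobs at once and looking for time lost to locks) can also be approximated with the JDK's built-in thread statistics, without a special profiler or JVM. The following is only a minimal sketch under stated assumptions: `ContentionProbe`, `Result` and `measure` are hypothetical names, not part of PDFBox or Tika, and the demo task is a stand-in where a real test would run text extraction. It only counts blocking on `synchronized` monitors; waits on `java.util.concurrent` locks are reported as waiting, not blocked, and would not show up here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper for illustration; not part of PDFBox or Tika.
public class ContentionProbe {

    /** Totals summed over all worker threads. */
    public static final class Result {
        public final long blockedCount;  // times a worker blocked entering a monitor
        public final long blockedMillis; // time spent blocked, or -1 if the JVM cannot measure it
        public final long wallMillis;    // elapsed wall-clock time for the whole run

        Result(long blockedCount, long blockedMillis, long wallMillis) {
            this.blockedCount = blockedCount;
            this.blockedMillis = blockedMillis;
            this.wallMillis = wallMillis;
        }
    }

    /** Runs {@code task} on {@code threads} threads at once and sums their monitor contention. */
    public static Result measure(int threads, Runnable task) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        final boolean timed = mx.isThreadContentionMonitoringSupported();
        if (timed) {
            mx.setThreadContentionMonitoringEnabled(true);
        }
        AtomicLong count = new AtomicLong();
        AtomicLong millis = new AtomicLong();
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                task.run();
                // Sample from inside the worker: getThreadInfo returns null for dead threads.
                ThreadInfo info = mx.getThreadInfo(Thread.currentThread().getId());
                if (info != null) {
                    count.addAndGet(info.getBlockedCount());
                    if (timed) {
                        millis.addAndGet(info.getBlockedTime());
                    }
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        long wall = (System.nanoTime() - start) / 1_000_000;
        return new Result(count.get(), timed ? millis.get() : -1, wall);
    }

    public static void main(String[] args) {
        final Object lock = new Object();
        // Stand-in workload: 8 threads fighting over one monitor.
        // A real test would replace this with the text extraction under suspicion.
        Result r = measure(8, () -> {
            for (int i = 0; i < 20; i++) {
                synchronized (lock) {
                    try {
                        Thread.sleep(5);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        });
        System.out.println("blocked count: " + r.blockedCount
                + ", blocked ms: " + r.blockedMillis
                + ", wall ms: " + r.wallMillis);
    }
}
```

If the summed blocked time is a noticeable fraction of the wall time multiplied by the thread count, that would be consistent with the synchronization suspicion in the issue; a value near zero would suggest looking elsewhere. The numbers say nothing about which lock is responsible, so a thread dump or a profiler would still be needed to locate it.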