Poor text extraction performance in PDFTextStripper.java
--------------------------------------------------------
Key: PDFBOX-956
URL: https://issues.apache.org/jira/browse/PDFBOX-956
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.4.0
Reporter: Kevin Jackson
Fix For: 1.5.0
The worst-case performance of the suppressDuplicateOverlappingText logic in
processTextPosition is O(n^2).
The patch uses a TreeMap to achieve O(n log n) performance.
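
For illustration only, here is a minimal sketch of how a TreeMap range lookup
can replace a linear scan when checking for overlapping duplicate glyphs. It is
not the attached patch: the class DuplicateGlyphFilter, the method isDuplicate,
and the caller-supplied tolerance are hypothetical names and assumptions made
up for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of PDFBox.
final class DuplicateGlyphFilter {
    // For each character string: x position -> sorted y positions already accepted.
    private final Map<String, TreeMap<Float, TreeSet<Float>>> seen = new HashMap<>();

    /**
     * Returns true if the same text was already accepted within `tolerance`
     * of (x, y) on both axes. Otherwise records the position and returns false.
     */
    boolean isDuplicate(String text, float x, float y, float tolerance) {
        TreeMap<Float, TreeSet<Float>> byX =
                seen.computeIfAbsent(text, k -> new TreeMap<>());

        // Range view over the x window: found by tree search, not a full scan.
        NavigableMap<Float, TreeSet<Float>> xMatches =
                byX.subMap(x - tolerance, true, x + tolerance, true);

        for (TreeSet<Float> ys : xMatches.values()) {
            // Same idea on the y axis for each x candidate in the window.
            if (!ys.subSet(y - tolerance, true, y + tolerance, true).isEmpty()) {
                return true;
            }
        }

        // No overlap found: remember this position for later lookups.
        byX.computeIfAbsent(x, k -> new TreeSet<>()).add(y);
        return false;
    }
}
```

Each lookup costs a tree search plus the number of x positions that fall inside
the tolerance window, so the total stays close to O(n log n) as long as that
window holds only a few entries. A list scan would instead compare each glyph
against every earlier occurrence of the same character.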
Text extraction from the example PDF took over 2 hours before this patch and
less than 10 minutes after.
BTW: The extracted text is also quite different from what Adobe Reader
produces. Not sure which is correct, but for this document it doesn't matter.