[
https://issues.apache.org/jira/browse/PDFBOX-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273931#comment-13273931
]
Michael McCandless commented on PDFBOX-1305:
--------------------------------------------
I just tested this on PDFBox's current trunk (to be 1.7.0) and ExtractText ran
in ~9 seconds (on a recent Ivy Bridge machine)...
Could you be seeing the slowness that was fixed in PDFBOX-956?
> Text extraction takes huge amount of time on some files
> -------------------------------------------------------
>
> Key: PDFBOX-1305
> URL: https://issues.apache.org/jira/browse/PDFBOX-1305
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.6.0
> Environment: Same phenomena on Windows 7, Solaris 10 and CentOS 5.7.
> Same result with JDK 7u4 and JDK 6u32
> Reporter: Roger Håkansson
> Attachments: 20020101ab3x012a.pdf
>
>
> I've got 1.2M single-page PDF files which I'm indexing using Solr (which is
> using Tika, which is using PDFBox) and some of them take between 20 minutes
> and an hour to index.
> This is a huge problem for me: in 48 hours I've indexed about 45k files, and
> 19 hours of that time was spent on just 279 files.
> I've traced it to PDFBox taking a lot of time extracting the text from the
> documents.
> I've tested extracting the text using pdfbox-app's ExtractText with the same
> result, the text is extracted but it takes forever...
> The attached file took about 23 minutes (using ExtractText) and in the result
> I can see a lot of "rubbish text" which I don't see in the text extracted from
> files that take a normal amount of time (up to a few seconds per file) to
> parse.
> When running truss (on Solaris, strace on Linux) on the Java process, I can
> see a lot of SEGV due to FLTBOUNDS. I don't know if it's related to this
> problem, but I just want to mention it.