Your document has 265 pages. What are you comparing with what: your
document against another document, or PDFBox against other code? I have run
your document and it runs at about the same speed as most others. On my
machine it takes 50 seconds for the first 200 pages. The time will depend at
least on the speed of your machine and on the number of processors that can
be used in parallel.
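
To compare like with like, it may help to time the same step on each
machine and to report the processor count alongside it. Below is a minimal
sketch in plain Java, using only the standard library. The names
ExtractionTimer and timeMillis are hypothetical and are not part of PDFBox,
and the sleep is only a stand-in for whatever extraction call you actually
run.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of PDFBox: times an arbitrary task.
public class ExtractionTimer {

    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    public static <T> long timeMillis(Callable<T> task) throws Exception {
        long start = System.nanoTime();
        task.call();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        // Processor count as seen by the JVM; useful context for any timing.
        int cpus = Runtime.getRuntime().availableProcessors();

        // Placeholder workload (an assumption): replace the sleep with the
        // real text-extraction call for the document under test.
        long elapsed = timeMillis(() -> {
            Thread.sleep(50);
            return null;
        });

        System.out.println("processors=" + cpus + " elapsedMs=" + elapsed);
    }
}
```

Timing the same page range on two machines, or with two libraries, then gives
numbers that can be compared directly.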


On Mon, Dec 23, 2013 at 3:12 PM, Clemens Wyss DEV <[email protected]> wrote:

> I have opened an issue for this:
> https://issues.apache.org/jira/browse/PDFBOX-1821
>
> -----Original Message-----
> From: Clemens Wyss - MySign AG [mailto:[email protected]]
> Sent: Sunday, 22 December 2013 17:37
> To: '[email protected]'
> Subject: Parsing a pdf file takes 3 minutes
>
> I initially posted this question on the Tika mailing list, and I even
> created an issue for it:
> https://issues.apache.org/jira/browse/TIKA-1213
> Hopefully now being on the right list, I re-phrase the problem I am
> confronted with:
> I have several PDF documents which take up to 3 minutes to be
> parsed/extracted (for later Lucene indexing).
> For example, the PDF attached to the JIRA issue requires 3 minutes.
>
> How/why is this possible? How can I improve on this?
>
> Any help appreciated
> Clemens
>



-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
