Your document has 265 pages. What are you comparing with what? Your document against another document, or PDFBox against other code? I have run your document and it parses at about the same speed as most others: on my machine the first 200 pages take about 50 seconds. The time will depend at least on the speed of your machine and on the number of processors available for parallel work.
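
To make such comparisons concrete, it helps to time the same step on each machine and reduce it to seconds per page. The following is only a minimal sketch using the Java standard library: the class name `ParseTiming` and its methods are made up for illustration and are not part of PDFBox, the timed task is a placeholder for whatever extraction call you actually use, and the extrapolation assumes that all pages cost roughly the same, which may not hold for your document.

```java
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; not part of PDFBox or Tika.
public class ParseTiming {

    /** Runs the task once and returns the elapsed wall-clock time in seconds. */
    public static double timeSeconds(Callable<?> task) throws Exception {
        long start = System.nanoTime();
        task.call();
        return (System.nanoTime() - start) / 1e9;
    }

    /** Average seconds per page for a measured run. */
    public static double secondsPerPage(double elapsedSeconds, int pages) {
        if (pages <= 0) {
            throw new IllegalArgumentException("pages must be positive");
        }
        return elapsedSeconds / pages;
    }

    /** Linear extrapolation to the whole document (assumes uniform cost per page). */
    public static double estimateSeconds(double elapsedSeconds, int pagesMeasured, int totalPages) {
        return secondsPerPage(elapsedSeconds, pagesMeasured) * totalPages;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder workload: replace the sleep with the real extraction call.
        double elapsed = timeSeconds(() -> {
            Thread.sleep(100);
            return null;
        });
        System.out.printf("placeholder task took %.2f s%n", elapsed);

        // Figures quoted in this thread: 50 s for the first 200 of 265 pages.
        System.out.printf("%.2f s/page, roughly %.1f s for 265 pages%n",
                secondsPerPage(50.0, 200), estimateSeconds(50.0, 200, 265));
    }
}
```

Running the same measurement on your machine and on another one, or on this document and on a different one, would show which of the two comparisons explains the 3 minutes.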

On Mon, Dec 23, 2013 at 3:12 PM, Clemens Wyss DEV <[email protected]> wrote:

> Opened an issue for this:
> https://issues.apache.org/jira/browse/PDFBOX-1821
>
> -----Original Message-----
> From: Clemens Wyss - MySign AG [mailto:[email protected]]
> Sent: Sunday, 22 December 2013 17:37
> To: '[email protected]'
> Subject: Parsing a pdf file takes 3minutes
>
> I initially posted this question in the tika-mailing list, and I even
> created an issue for it:
> https://issues.apache.org/jira/browse/TIKA-1213
> Hopefully now being on the right list, I re-phrase the problem I am
> confronted with:
> I have (several) pdf documents which take up to 3 minutes to be
> parsed/extracted (for later lucene indexing).
> For example the pdf which is attached to the jira issue requires 3 minutes.
>
> How/why is this possible? How can I improve on this?
>
> Any help appreciated
> Clemens
>

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

