PDFBox colleagues,

Any ideas?

---------- Forwarded message ---------
From: Tim Allison <talli...@apache.org>
Date: Tue, Feb 26, 2019 at 12:13 PM
Subject: Re: Very slow PDF parsing.
To: <u...@tika.apache.org>

Sorry...that's an OCR tool. One thing that can slow down processing
dramatically is if you have tesseract installed (try typing 'tesseract' on
your command line) and if you've turned it on for PDFs. I suspect this isn't
your problem, though.

On Tue, Feb 26, 2019 at 12:08 PM Slava G <slav...@gmail.com> wrote:

> Thanks Tim,
> But frankly speaking, it's a shame, but I don't know what tesseract is in
> this context 🙂
>
> Thanks
>
> On Tue, Feb 26, 2019, 19:04 Tim Allison <talli...@apache.org> wrote:
>
>> Thank you, Slava!
>>
>> Do you have tesseract installed?
>>
>> Colleagues on PDFBox, any recommendations?
>>
>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <slav...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I have a large PDF (about 65 MB) that contains mainly text and some images.
>> >
>> > Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
>> > running on a Xeon server with a 4-core CPU, 30 GB RAM, and an SSD,
>> > running CentOS Linux).
>> >
>> > Please advise if there is anything I can do to speed it up. Or maybe
>> > it's a bug in PDFBox?
>> >
>> > When I print the Java stack, I always see this stack:
>> >
>> >     at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>> >     at java.util.HashMap$TreeNode.find(Unknown Source)
>> >     at java.util.HashMap$TreeNode.find(Unknown Source)
>> >     at java.util.HashMap$TreeNode.find(Unknown Source)
>> >     at java.util.HashMap$TreeNode.find(Unknown Source)
>> >     at java.util.HashMap$TreeNode.find(Unknown Source)
>> >     at java.util.HashMap$TreeNode.find(Unknown Source)
>> >     at java.util.HashMap$TreeNode.find(Unknown Source)
>> >     at java.util.HashMap$TreeNode.find(Unknown Source)
>> >     at java.util.HashMap$TreeNode.find(Unknown Source)
>> >     at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>> >     at java.util.HashMap.getNode(Unknown Source)
>> >     at java.util.HashMap.containsKey(Unknown Source)
>> >     at java.util.HashSet.contains(Unknown Source)
>> >     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>> >     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>> >     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>> >     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>> >     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>> >     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >     at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>> >     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>> >     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>> >     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>> >     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>> >     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>> >     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>> >     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>> >     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>> >
>> > P.S. Btw, the PDF is not encrypted at all.
>> >
>> > Thanks
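
On the tesseract question raised in this thread: the check suggested there (typing 'tesseract' on the command line) can also be done from Java on the same machine and under the same user that runs Tika. The following is only a minimal sketch. The class name `TesseractCheck` and the helper `canRun` are hypothetical names made up for this illustration, and it assumes that `tesseract --version` exits with status 0 when the tool is installed. It says nothing about whether OCR is actually enabled for PDFs in a given Tika configuration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika or PDFBox.
public class TesseractCheck {

    // Returns true if the command can be started from the PATH and exits with 0.
    static boolean canRun(String command, String... args) {
        List<String> cmd = new ArrayList<>();
        cmd.add(command);
        cmd.addAll(Arrays.asList(args));
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.redirectErrorStream(true); // merge stderr into stdout so one drain is enough
        try {
            Process p = pb.start(); // throws IOException if the executable is not found
            try (InputStream in = p.getInputStream()) {
                byte[] buf = new byte[4096];
                while (in.read(buf) != -1) {
                    // discard output; only the exit status matters here
                }
            }
            if (!p.waitFor(10, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                return false;
            }
            return p.exitValue() == 0;
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Assumption: "tesseract --version" returns 0 when tesseract is installed.
    static boolean isTesseractOnPath() {
        return canRun("tesseract", "--version");
    }

    public static void main(String[] args) {
        System.out.println("tesseract on PATH: " + isTesseractOnPath());
    }
}
```

If this prints `false`, OCR via tesseract is unlikely to be what slows down parsing, which would be consistent with the stack trace above pointing into PDFBox rather than into an OCR call.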