This is the default in Tika, where maxMainMemoryBytes defaults to 500MB.
Slava, how are you calling this in Tika? With a TikaInputStream via tika-app or tika-server or something else?

    MemoryUsageSetting memoryUsageSetting = MemoryUsageSetting.setupMainMemoryOnly();
    if (localConfig.getMaxMainMemoryBytes() >= 0) {
        memoryUsageSetting = MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
    }
    if (tstream != null && tstream.hasFile()) {
        // File based -- send file directly to PDFBox
        pdfDocument = PDDocument.load(tstream.getPath().toFile(), password, memoryUsageSetting);
    } else {
        pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), password, memoryUsageSetting);
    }

On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <thaush...@t-online.de> wrote:

> Hi,
>
> As usual, it would be nice to have the PDF, so that we could run the
> profiler.
>
> The HashSet is used to avoid decrypting objects twice.
>
> The "not encrypted" file is likely encrypted with an empty user password.
>
> It would also be interesting to hear what parameter is passed to
> MemoryUsageSetting when load() is called.
>
> Tilman
>
>
>
> Am 26.02.2019 um 18:14 schrieb Tim Allison:
> > PDFBox Colleagues,
> >    Any ideas?
> >
> > ---------- Forwarded message ---------
> > From: Tim Allison <talli...@apache.org>
> > Date: Tue, Feb 26, 2019 at 12:13 PM
> > Subject: Re: Very slow PDF parsing.
> > To: <u...@tika.apache.org>
> >
> >
> > Sorry...that's an OCR tool. One thing that can slow down processing
> > dramatically is if you have tesseract installed (try typing 'tesseract'
> > on your command line) and if you've turned it on for PDFs. I suspect
> > this isn't your problem, though.
> >
> >
> >
> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <slav...@gmail.com> wrote:
> >
> >> Thanks Tim,
> >> But frankly speaking, it's a shame, but I don't know what tesseract is
> >> in this context 🙂
> >>
> >> Thanks
> >>
> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <talli...@apache.org> wrote:
> >>
> >>> Thank you, Slava!
> >>>
> >>> Do you have tesseract installed?
> >>>
> >>> Colleagues on PDFBox, any recommendations?
> >>>
> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <slav...@gmail.com> wrote:
> >>>> Hi,
> >>>>
> >>>> I have a large PDF (about 65 MB) that contains mainly text and some
> >>>> images.
> >>>>
> >>>> Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
> >>>> running on a Xeon server with a 4-core CPU, 30 GB RAM, and an SSD
> >>>> disk, running CentOS Linux).
> >>>> Please advise if there is anything I can do to speed it up. Or maybe
> >>>> it's a bug in PDFBox?
> >>>> When I print the Java stack, I see this stack all the time:
> >>>>
> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
> >>>> at java.util.HashMap.getNode(Unknown Source)
> >>>> at java.util.HashMap.containsKey(Unknown Source)
> >>>> at java.util.HashSet.contains(Unknown Source)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
> >>>>
> >>>>
> >>>> P.S. Btw, the PDF is not encrypted at all.
> >>>>
> >>>> Thanks
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> For additional commands, e-mail: users-h...@pdfbox.apache.org
>
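
A footnote on the HashSet in the stack trace: Tilman's remark that it is used to avoid decrypting objects twice can be sketched roughly as below. This is only an illustration under assumptions. DecryptOnceSketch, Node, and decryptCalls are hypothetical names, not PDFBox classes, and the real SecurityHandler logic is more involved.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these types exist in PDFBox.
class DecryptOnceSketch {

    // Stand-in for a parsed object that may contain other objects.
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }
    }

    // Objects that were already handled. Node does not override
    // hashCode()/equals(), so membership here is identity-based.
    private final Set<Node> seen = new HashSet<>();

    // Counts how many objects were actually processed.
    int decryptCalls = 0;

    void decrypt(Node node) {
        // Set.add returns false when the element is already present,
        // so shared or cyclic references are processed only once.
        if (!seen.add(node)) {
            return;
        }
        decryptCalls++; // the real decryption work would happen here
        for (Node child : node.children) {
            decrypt(child);
        }
    }

    public static void main(String[] args) {
        Node root = new Node("root");
        Node a = new Node("a");
        Node b = new Node("b");
        Node shared = new Node("shared");
        root.children.add(a);
        root.children.add(b);
        a.children.add(shared);
        b.children.add(shared);    // same object reachable twice
        shared.children.add(root); // cycle back to the root

        DecryptOnceSketch sketch = new DecryptOnceSketch();
        sketch.decrypt(root);
        System.out.println(sketch.decryptCalls); // prints 4
    }
}
```

HashSet.add and HashSet.contains go through the element's hashCode() and equals(). Node above inherits the identity-based defaults from Object, whereas the trace shows COSString.equals being called during the lookup, so in that case each comparison depends on how that class implements equals.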