Any chance you can share the file directly with me or someone else on the PDFBox team?
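For context on the stack trace quoted below: the time is going into HashMap$TreeNode.find calling COSString.equals underneath HashSet.contains. A pattern like that can show up when many keys in one HashSet share a hash code and the key class is not Comparable, so each contains() has to walk a large part of a single bucket. I have not confirmed that this is what happens with this file. The sketch below uses a hypothetical Key class (not PDFBox code) and only the plain JDK to show the effect by counting equals() calls.

```java
import java.util.HashSet;
import java.util.Set;

public class HashCollisionDemo {

    /** Number of equals() calls made since the last reset. */
    static long equalsCalls = 0;

    /** Hypothetical key: not Comparable, hash quality chosen by the caller. */
    static final class Key {
        final int id;
        final boolean collide;

        Key(int id, boolean collide) {
            this.id = id;
            this.collide = collide;
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Key && ((Key) o).id == id;
        }

        @Override
        public int hashCode() {
            // collide == true puts every key into the same bucket
            return collide ? 1 : Integer.hashCode(id);
        }
    }

    /** Fills a HashSet with n keys, then counts equals() calls for n lookups. */
    static long equalsCallsForLookups(int n, boolean collide) {
        Set<Key> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(new Key(i, collide));
        }
        equalsCalls = 0; // count lookups only, not insertions
        for (int i = 0; i < n; i++) {
            // fresh instance, so the identity shortcut (==) never applies
            if (!seen.contains(new Key(i, collide))) {
                throw new IllegalStateException("key " + i + " not found");
            }
        }
        return equalsCalls;
    }

    public static void main(String[] args) {
        int n = 2000;
        System.out.println("spread hashes:    " + equalsCallsForLookups(n, false));
        System.out.println("colliding hashes: " + equalsCallsForLookups(n, true));
    }
}
```

The sketch counts equals() calls instead of measuring wall-clock time so that the comparison does not depend on the machine. With colliding hashes the count grows much faster than the number of lookups, which is the kind of behaviour that would keep a thread inside TreeNode.find for a long time.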
On Wed, Feb 27, 2019 at 11:24 AM Slava G <slav...@gmail.com> wrote:
> After 3h 40m it's still parsing using PDFBox 2.0.14 app...
> Thanks
>
> On Wed, Feb 27, 2019 at 3:29 PM Slava G <slav...@gmail.com> wrote:
>
>> With 2.0.14 it's 40 minutes running, no result, still working...
>> Seems that issue is still there.
>> Thanks
>>
>> On Wed, Feb 27, 2019 at 2:52 PM Slava G <slav...@gmail.com> wrote:
>>
>>> Checking with 2.0.14. Started as an app. Will update soon.
>>>
>>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <talli...@apache.org> wrote:
>>>
>>>> Any chance you could try with the 2.0.14 release candidate...unless you
>>>> have already?
>>>>
>>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>>>
>>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <slav...@gmail.com> wrote:
>>>>
>>>>> Well, I ran (as was suggested) PDFBox app to extract text, so far 2
>>>>> hours and still counting...
>>>>> It seems to be a PDFBox issue.
>>>>>
>>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jbdat...@gmail.com> wrote:
>>>>>
>>>>>> Why don't you do a basic test with tika server in a 3thrd and a
>>>>>> *wget* or *curl* bash client to parse your 65 MB PDF?
>>>>>> It can be easier to investigate the problem.
>>>>>>
>>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>>>>
>>>>>> On Tue, Feb 26, 2019 at 23:05, Cristian Vat <cristian....@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Just looking at the stack trace it won't be the same anymore due to
>>>>>>> PDFBOX-4453
>>>>>>> Some changes present in not yet released pdfbox 2.0.14 and it
>>>>>>> changes how decryption is handled. Not sure if related though.
>>>>>>>
>>>>>>> Can you duplicate the problem without Tika using just PDFBox
>>>>>>> command-line ExtractText command (
>>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>>>>>
>>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <slav...@gmail.com> wrote:
>>>>>>>
>>>>>>>> This is the code:
>>>>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>>>>>>> PDFParser tmpPdf = new PDFParser();
>>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>>>>>> config.setMaxMainMemoryBytes(31457280);
>>>>>>>> config.setExtractAcroFormContent(false);
>>>>>>>> config.setExtractBookmarksText(false);
>>>>>>>> config.setCatchIntermediateIOExceptions(true);
>>>>>>>> Metadata metadata = new Metadata();
>>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new ParseContext());
>>>>>>>>
>>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <talli...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> This is the default in Tika, where the default for
>>>>>>>>> maxMainMemoryBytes=500MB.
>>>>>>>>>
>>>>>>>>> Slava, how are you calling this in Tika? With a TikaInputStream
>>>>>>>>> via tika-app or tika-server or something else?
>>>>>>>>>
>>>>>>>>> MemoryUsageSetting memoryUsageSetting =
>>>>>>>>>         MemoryUsageSetting.setupMainMemoryOnly();
>>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>>>>>>     memoryUsageSetting =
>>>>>>>>>             MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>>>>>>> }
>>>>>>>>> if (tstream != null && tstream.hasFile()) {
>>>>>>>>>     // File based -- send file directly to PDFBox
>>>>>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(),
>>>>>>>>>             password, memoryUsageSetting);
>>>>>>>>> } else {
>>>>>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>>>>>>>             password, memoryUsageSetting);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr
>>>>>>>>> <thaush...@t-online.de> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> As usual, it would be nice to have the PDF, so that we could run
>>>>>>>>>> the profiler.
>>>>>>>>>>
>>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>>>>>>
>>>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>>>>>>> password.
>>>>>>>>>>
>>>>>>>>>> It would also be interesting to hear what parameter is passed to
>>>>>>>>>> MemoryUsageSetting when load() is called.
>>>>>>>>>>
>>>>>>>>>> Tilman
>>>>>>>>>>
>>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>>>>>>> > PDFBox Colleagues,
>>>>>>>>>> > Any ideas?
>>>>>>>>>> >
>>>>>>>>>> > ---------- Forwarded message ---------
>>>>>>>>>> > From: Tim Allison <talli...@apache.org>
>>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>>>>>>> > Subject: Re: Very slow PDF parsing.
>>>>>>>>>> > To: <u...@tika.apache.org>
>>>>>>>>>> >
>>>>>>>>>> > Sorry...that's an OCR tool. One thing that can slow down processing
>>>>>>>>>> > dramatically is if you have tesseract installed (try typing 'tesseract' on
>>>>>>>>>> > your command line) and if you've turned it on for PDFs. I suspect this
>>>>>>>>>> > isn't your problem, though.
>>>>>>>>>> >
>>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <slav...@gmail.com> wrote:
>>>>>>>>>> >
>>>>>>>>>> >> Thanks Tim,
>>>>>>>>>> >> But frankly speaking, it's a shame, but I don't know what tesseract is in
>>>>>>>>>> >> this context 🙂
>>>>>>>>>> >>
>>>>>>>>>> >> Thanks
>>>>>>>>>> >>
>>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <talli...@apache.org> wrote:
>>>>>>>>>> >>
>>>>>>>>>> >>> Thank you, Slava!
>>>>>>>>>> >>>
>>>>>>>>>> >>> Do you have tesseract installed?
>>>>>>>>>> >>>
>>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>>>>>>>> >>>
>>>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <slav...@gmail.com> wrote:
>>>>>>>>>> >>>> Hi,
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> I have a large PDF (about 65 MB) that contains mainly text and some images.
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
>>>>>>>>>> >>>> running on a Xeon server with a 4-core CPU and 30GB RAM with SSD disk, running
>>>>>>>>>> >>>> CentOS Linux).
>>>>>>>>>> >>>> Please advise if there is anything I can do to speed it up. Or maybe it's a bug
>>>>>>>>>> >>>> in PDFBox?
>>>>>>>>>> >>>> When I print the Java stack, I see this stack all the time:
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Thanks
>>>>>>>>>>
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>>>>>>>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
>>>>>>>>>>