After 3h 40m it's still parsing using PDFBox 2.0.14 app...

Thanks

On Wed, Feb 27, 2019 at 3:29 PM Slava G <slav...@gmail.com> wrote:

> With 2.0.14 it's been running for 40 minutes, no result, still working...
> It seems the issue is still there.
> Thanks
>
> On Wed, Feb 27, 2019 at 2:52 PM Slava G <slav...@gmail.com> wrote:
>
>> Checking with 2.0.14. Started as an app. Will update soon.
>>
>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <talli...@apache.org> wrote:
>>
>>> Any chance you could try with the 2.0.14 release candidate...unless you have already?
>>>
>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>>
>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <slav...@gmail.com> wrote:
>>>
>>>> Well, I ran (as was suggested) the PDFBox app to extract text; so far 2 hours and still counting...
>>>> It seems to be a PDFBox issue.
>>>>
>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jbdat...@gmail.com> wrote:
>>>>
>>>>> Why don't you do a basic test with tika server in a 3thrd and a *wget* or *curl* bash client to parse your 65 MB PDF.
>>>>> It can be easier to investigate the problem.
>>>>>
>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>>>
>>>>> On Tue, Feb 26, 2019 at 11:05 PM Cristian Vat <cristian....@gmail.com> wrote:
>>>>>
>>>>>> Just looking at the stack trace: it won't be the same anymore due to PDFBOX-4453.
>>>>>> That change is in the not yet released PDFBox 2.0.14 and it changes how decryption is handled. Not sure if it's related though.
>>>>>>
>>>>>> Can you duplicate the problem without Tika using just the PDFBox command-line ExtractText command ( https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>>>>
>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <slav...@gmail.com> wrote:
>>>>>>
>>>>>>> This is the code:
>>>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>>>>>> PDFParser tmpPdf = new PDFParser();
>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>>>>> config.setMaxMainMemoryBytes(31457280);
>>>>>>> config.setExtractAcroFormContent(false);
>>>>>>> config.setExtractBookmarksText(false);
>>>>>>> config.setCatchIntermediateIOExceptions(true);
>>>>>>> Metadata metadata = new Metadata();
>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new ParseContext());
>>>>>>>
>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <talli...@apache.org> wrote:
>>>>>>>
>>>>>>>> This is the default in Tika, where the default for maxMainMemoryBytes=500MB.
>>>>>>>>
>>>>>>>> Slava, how are you calling this in Tika? With a TikaInputStream via tika-app or tika-server or something else?
>>>>>>>>
>>>>>>>> MemoryUsageSetting memoryUsageSetting = MemoryUsageSetting.setupMainMemoryOnly();
>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>>>>>     memoryUsageSetting = MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>>>>>> }
>>>>>>>> if (tstream != null && tstream.hasFile()) {
>>>>>>>>     // File based -- send file directly to PDFBox
>>>>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password, memoryUsageSetting);
>>>>>>>> } else {
>>>>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), password, memoryUsageSetting);
>>>>>>>> }
>>>>>>>>
>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <thaush...@t-online.de> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> As usual, it would be nice to have the PDF, so that we could run the profiler.
>>>>>>>>>
>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>>>>>
>>>>>>>>> The "not encrypted" file is likely encrypted with an empty user password.
>>>>>>>>>
>>>>>>>>> It would also be interesting to hear what parameter is passed to MemoryUsageSetting when load() is called.
>>>>>>>>>
>>>>>>>>> Tilman
>>>>>>>>>
>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>>>>>> > PDFBox Colleagues,
>>>>>>>>> > Any ideas?
>>>>>>>>> >
>>>>>>>>> > ---------- Forwarded message ---------
>>>>>>>>> > From: Tim Allison <talli...@apache.org>
>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>>>>>> > Subject: Re: Very slow PDF parsing.
>>>>>>>>> > To: <u...@tika.apache.org>
>>>>>>>>> >
>>>>>>>>> > Sorry...that's an OCR tool. One thing that can slow down processing dramatically is if you have tesseract installed (try typing 'tesseract' on your commandline) and if you've turned it on for PDFs. I suspect this isn't your problem, though.
>>>>>>>>> >
>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <slav...@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> >> Thanks Tim,
>>>>>>>>> >> But frankly speaking, I'm ashamed to say I don't know what tesseract is in this context 🙂
>>>>>>>>> >>
>>>>>>>>> >> Thanks
>>>>>>>>> >>
>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <talli...@apache.org> wrote:
>>>>>>>>> >>
>>>>>>>>> >>> Thank you, Slava!
>>>>>>>>> >>>
>>>>>>>>> >>> Do you have tesseract installed?
>>>>>>>>> >>>
>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>>>>>>> >>>
>>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <slav...@gmail.com> wrote:
>>>>>>>>> >>>> Hi,
>>>>>>>>> >>>>
>>>>>>>>> >>>> I have a large PDF (about 65 MB) that contains mainly text and some images.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Parsing such a PDF can take about 2 days or even more (Tika 1.19.1 running on a Xeon server with a 4-core CPU, 30 GB RAM and an SSD disk, running CentOS Linux).
>>>>>>>>> >>>> Please advise if there is anything I can do to speed it up. Or maybe it's a bug in PDFBox?
>>>>>>>>> >>>> When I print the Java stack, I see this stack all the time:
>>>>>>>>> >>>>
>>>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>>>>> >>>>
>>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Thanks
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>>>>>>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
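
A note on the quoted stack trace: it shows HashSet.contains() descending through nested HashMap$TreeNode.find() frames and ending in COSString.equals(). One possible reading (an assumption, nobody in this thread confirms it) is that many keys end up in the same hash bucket, so each lookup walks a tree bin and calls equals() many times instead of once. The sketch below only illustrates that general java.util.HashSet behaviour. The Key class and the countEqualsCalls helper are hypothetical names made up for the illustration; this is not PDFBox code, and it does not examine how COSString actually computes its hash.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

public class HashCollisionDemo {
    // Counts every equals() call made on Key instances.
    static final AtomicLong EQUALS_CALLS = new AtomicLong();

    /** Hypothetical key type for illustration only; NOT PDFBox's COSString. */
    static final class Key {
        final int id;
        final boolean collide;

        Key(int id, boolean collide) {
            this.id = id;
            this.collide = collide;
        }

        @Override
        public int hashCode() {
            // When 'collide' is set, every key reports the same hash,
            // so all keys share one bucket of the underlying HashMap.
            return collide ? 42 : Integer.hashCode(id);
        }

        @Override
        public boolean equals(Object o) {
            EQUALS_CALLS.incrementAndGet();
            return o instanceof Key && ((Key) o).id == id;
        }
    }

    /** Fills a HashSet with n keys, then counts equals() calls made by n lookups. */
    static long countEqualsCalls(boolean collide, int n) {
        Set<Key> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(new Key(i, collide));
        }
        EQUALS_CALLS.set(0); // measure the lookups only, not the inserts
        for (int i = 0; i < n; i++) {
            seen.contains(new Key(i, collide));
        }
        return EQUALS_CALLS.get();
    }

    public static void main(String[] args) {
        int n = 2000;
        System.out.println("distinct hashes:  " + countEqualsCalls(false, n) + " equals() calls");
        System.out.println("colliding hashes: " + countEqualsCalls(true, n) + " equals() calls");
    }
}
```

If a profiler run on the real PDF (as Tilman asks for) showed a similar gap between lookups and equals() calls, that would point at hashing of the keys in the decryption HashSet rather than at Tika or the memory settings.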