Tim, which email address should I send the PDF to?
Thanks

On Thu, Feb 28, 2019 at 3:57 PM Slava G <slav...@gmail.com> wrote:
> I will, once I get the customer's approval.
> Meanwhile, I can run any checks if you specify what to check.
> Thanks
>
> On Thu, Feb 28, 2019 at 3:56 PM Tim Allison <talli...@apache.org> wrote:
>
>> Any chance you can share the file directly with me or someone else on the
>> PDFBox team?
>>
>> On Wed, Feb 27, 2019 at 11:24 AM Slava G <slav...@gmail.com> wrote:
>>
>> > After 3h 40m it's still parsing using the PDFBox 2.0.14 app...
>> > Thanks
>> >
>> > On Wed, Feb 27, 2019 at 3:29 PM Slava G <slav...@gmail.com> wrote:
>> >
>> >> With 2.0.14 it has been running for 40 minutes with no result; still working...
>> >> It seems the issue is still there.
>> >> Thanks
>> >>
>> >> On Wed, Feb 27, 2019 at 2:52 PM Slava G <slav...@gmail.com> wrote:
>> >>
>> >>> Checking with 2.0.14. Started as an app. Will update soon.
>> >>>
>> >>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <talli...@apache.org> wrote:
>> >>>
>> >>>> Any chance you could try with the 2.0.14 release candidate...unless you
>> >>>> have already?
>> >>>>
>> >>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>> >>>>
>> >>>>
>> >>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <slav...@gmail.com> wrote:
>> >>>>
>> >>>>> Well, I ran the PDFBox app (as suggested) to extract text: 2 hours so
>> >>>>> far and still counting...
>> >>>>> It seems to be a PDFBox issue.
>> >>>>>
>> >>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jbdat...@gmail.com> wrote:
>> >>>>>
>> >>>>>> Why don't you do a basic test with tika server in a 3thrd and a
>> >>>>>> *wget* or *curl* bash client to parse your 65 MB PDF.
>> >>>>>> That may make the problem easier to investigate.
>> >>>>>>
>> >>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Tue, Feb 26, 2019 at 11:05 PM Cristian Vat <cristian....@gmail.com> wrote:
>> >>>>>>
>> >>>>>>> Just looking at the stack trace: it won't be the same anymore due to
>> >>>>>>> PDFBOX-4453.
>> >>>>>>> That change is in the not-yet-released PDFBox 2.0.14 and it
>> >>>>>>> changes how decryption is handled. Not sure if it's related, though.
>> >>>>>>>
>> >>>>>>> Can you reproduce the problem without Tika, using just the PDFBox
>> >>>>>>> command-line ExtractText command
>> >>>>>>> (https://pdfbox.apache.org/2.0/commandline.html) on that file?
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <slav...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>>> This is the code:
>> >>>>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>> >>>>>>>> PDFParser tmpPdf = new PDFParser();
>> >>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>> >>>>>>>> config.setMaxMainMemoryBytes(31457280);
>> >>>>>>>> config.setExtractAcroFormContent(false);
>> >>>>>>>> config.setExtractBookmarksText(false);
>> >>>>>>>> config.setCatchIntermediateIOExceptions(true);
>> >>>>>>>> Metadata metadata = new Metadata();
>> >>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>> >>>>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new ParseContext());
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <talli...@apache.org> wrote:
>> >>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> This is the default in Tika, where the default for
>> >>>>>>>>> maxMainMemoryBytes=500MB.
>> >>>>>>>>>
>> >>>>>>>>> Slava, how are you calling this in Tika? With a TikaInputStream
>> >>>>>>>>> via tika-app or tika-server or something else?
>> >>>>>>>>>
>> >>>>>>>>> MemoryUsageSetting memoryUsageSetting =
>> >>>>>>>>>     MemoryUsageSetting.setupMainMemoryOnly();
>> >>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>> >>>>>>>>>     memoryUsageSetting =
>> >>>>>>>>>         MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>> >>>>>>>>> }
>> >>>>>>>>> if (tstream != null && tstream.hasFile()) {
>> >>>>>>>>>     // File based -- send file directly to PDFBox
>> >>>>>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(),
>> >>>>>>>>>         password, memoryUsageSetting);
>> >>>>>>>>> } else {
>> >>>>>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>> >>>>>>>>>         password, memoryUsageSetting);
>> >>>>>>>>> }
>> >>>>>>>>>
>> >>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <thaush...@t-online.de> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Hi,
>> >>>>>>>>>>
>> >>>>>>>>>> As usual, it would be nice to have the PDF, so that we could run
>> >>>>>>>>>> the profiler.
>> >>>>>>>>>>
>> >>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>> >>>>>>>>>>
>> >>>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>> >>>>>>>>>> password.
>> >>>>>>>>>>
>> >>>>>>>>>> It would also be interesting to hear what parameter is passed to
>> >>>>>>>>>> MemoryUsageSetting when load() is called.
>> >>>>>>>>>>
>> >>>>>>>>>> Tilman
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>> >>>>>>>>>> > PDFBox Colleagues,
>> >>>>>>>>>> > Any ideas?
>> >>>>>>>>>> >
>> >>>>>>>>>> > ---------- Forwarded message ---------
>> >>>>>>>>>> > From: Tim Allison <talli...@apache.org>
>> >>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>> >>>>>>>>>> > Subject: Re: Very slow PDF parsing.
>> >>>>>>>>>> > To: <u...@tika.apache.org>
>> >>>>>>>>>> >
>> >>>>>>>>>> >
>> >>>>>>>>>> > Sorry...that's an OCR tool.
>> >>>>>>>>>> > One thing that can slow down processing
>> >>>>>>>>>> > dramatically is if you have tesseract installed (try typing
>> >>>>>>>>>> > 'tesseract' on your command line) and if you've turned it on for
>> >>>>>>>>>> > PDFs. I suspect this isn't your problem, though.
>> >>>>>>>>>> >
>> >>>>>>>>>> >
>> >>>>>>>>>> >
>> >>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <slav...@gmail.com> wrote:
>> >>>>>>>>>> >
>> >>>>>>>>>> >> Thanks Tim,
>> >>>>>>>>>> >> But frankly, I'm embarrassed to say I don't know what tesseract
>> >>>>>>>>>> >> is in this context 🙂
>> >>>>>>>>>> >>
>> >>>>>>>>>> >> Thanks
>> >>>>>>>>>> >>
>> >>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <talli...@apache.org> wrote:
>> >>>>>>>>>> >>
>> >>>>>>>>>> >>> Thank you, Slava!
>> >>>>>>>>>> >>>
>> >>>>>>>>>> >>> Do you have tesseract installed?
>> >>>>>>>>>> >>>
>> >>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>> >>>>>>>>>> >>>
>> >>>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <slav...@gmail.com> wrote:
>> >>>>>>>>>> >>>> Hi,
>> >>>>>>>>>> >>>>
>> >>>>>>>>>> >>>> I have a large PDF (about 65 MB) that contains mainly text and
>> >>>>>>>>>> >>>> some images.
>> >>>>>>>>>> >>>>
>> >>>>>>>>>> >>>> Parsing this PDF can take about 2 days or even more (Tika
>> >>>>>>>>>> >>>> 1.19.1 running on a Xeon server with a 4-core CPU, 30 GB RAM,
>> >>>>>>>>>> >>>> and an SSD, running CentOS Linux).
>> >>>>>>>>>> >>>> Please advise if there is anything I can do to speed it up. Or
>> >>>>>>>>>> >>>> maybe it's a bug in PDFBox?
>> >>>>>>>>>> >>>> When I print the Java stack, I always see this stack:
>> >>>>>>>>>> >>>>
>> >>>>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>> >>>>>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>> >>>>>>>>>> >>>>
>> >>>>>>>>>> >>>>
>> >>>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>> >>>>>>>>>> >>>>
>> >>>>>>>>>> >>>> Thanks
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> ---------------------------------------------------------------------
>> >>>>>>>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>> >>>>>>>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org