Re: Fwd: Very slow PDF parsing.

Tim Allison Thu, 28 Feb 2019 05:56:21 -0800

Any chance you can share the file directly with me or someone else on the
PDFBox team?


On Wed, Feb 27, 2019 at 11:24 AM Slava G <slav...@gmail.com> wrote:

> After 3h 40m, the PDFBox 2.0.14 app is still parsing...
> Thanks
>
> On Wed, Feb 27, 2019 at 3:29 PM Slava G <slav...@gmail.com> wrote:
>
>> With 2.0.14 it has been running for 40 minutes with no result and is still
>> working... It seems the issue is still there.
>> Thanks
>>
>> On Wed, Feb 27, 2019 at 2:52 PM Slava G <slav...@gmail.com> wrote:
>>
>>> Checking with 2.0.14. Started as an app. Will update soon.
>>>
>>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <talli...@apache.org> wrote:
>>>
>>>> Any chance you could try with the 2.0.14 release candidate...unless you
>>>> have already?
>>>>
>>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>>>
>>>>
>>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <slav...@gmail.com> wrote:
>>>>
>>>>> Well, I ran the PDFBox app to extract text (as was suggested): 2 hours so
>>>>> far and still counting...
>>>>> It seems to be a PDFBox issue.
>>>>>
>>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jbdat...@gmail.com> wrote:
>>>>>
>>>>>> Why don't you run a basic test with tika-server in a separate thread and
>>>>>> a *wget* or *curl* bash client to parse your 65 MB PDF?
>>>>>> That could make the problem easier to investigate.
>>>>>>
>>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 26, 2019 at 11:05 PM Cristian Vat <cristian....@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Just looking at the stack trace: it won't be the same anymore because of
>>>>>>> PDFBOX-4453. Those changes are in the not-yet-released PDFBox 2.0.14 and
>>>>>>> alter how decryption is handled. Not sure if that's related, though.
>>>>>>>
>>>>>>> Can you duplicate the problem without Tika using just PDFBox
>>>>>>> command-line ExtractText command (
>>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <slav...@gmail.com> wrote:
>>>>>>>
>>>>>>>> This is the code:
>>>>>>>> // Path-based TikaInputStream, so Tika can hand the file directly to PDFBox
>>>>>>>> InputStream inputStream = TikaInputStream.get(inputFile.toPath());
>>>>>>>> PDFParser tmpPdf = new PDFParser();
>>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>>>>>> config.setMaxMainMemoryBytes(31457280); // 30 MB
>>>>>>>> config.setExtractAcroFormContent(false);
>>>>>>>> config.setExtractBookmarksText(false);
>>>>>>>> config.setCatchIntermediateIOExceptions(true);
>>>>>>>> Metadata metadata = new Metadata();
>>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>>>>>> // textHandler is a ContentHandler that is not shown here
>>>>>>>> tmpPdf.parse(inputStream, textHandler, metadata, new ParseContext());
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <talli...@apache.org> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The code below is what Tika does by default; the default for
>>>>>>>>> maxMainMemoryBytes is 500MB.
>>>>>>>>>
>>>>>>>>> Slava, how are you calling this in Tika?  With a TikaInputStream
>>>>>>>>> via tika-app or tika-server or something else?
>>>>>>>>>
>>>>>>>>> MemoryUsageSetting memoryUsageSetting = MemoryUsageSetting.setupMainMemoryOnly();
>>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>>>>>>     memoryUsageSetting = MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>>>>>>> }
>>>>>>>>> if (tstream != null && tstream.hasFile()) {
>>>>>>>>>     // File based -- send file directly to PDFBox
>>>>>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password, memoryUsageSetting);
>>>>>>>>> } else {
>>>>>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), password, memoryUsageSetting);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <thaush...@t-online.de> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> As usual, it would be nice to have the PDF, so that we could run
>>>>>>>>>> the profiler.
>>>>>>>>>>
>>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>>>>>>
>>>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>>>>>>> password.
>>>>>>>>>>
>>>>>>>>>> It would also be interesting to hear what parameter is passed to
>>>>>>>>>> MemoryUsageSetting when load() is called.
>>>>>>>>>>
>>>>>>>>>> Tilman
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>>>>>>> > PDFBox Colleagues,
>>>>>>>>>> >    Any ideas?
>>>>>>>>>> >
>>>>>>>>>> > ---------- Forwarded message ---------
>>>>>>>>>> > From: Tim Allison <talli...@apache.org>
>>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>>>>>>> > Subject: Re: Very slow PDF parsing.
>>>>>>>>>> > To: <u...@tika.apache.org>
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down processing
>>>>>>>>>> > dramatically is if you have tesseract installed (try typing 'tesseract' on
>>>>>>>>>> > your commandline) and if you've turned it on for PDFs.  I suspect this
>>>>>>>>>> > isn't your problem, though.
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <slav...@gmail.com> wrote:
>>>>>>>>>> >
>>>>>>>>>> >> Thanks Tim,
>>>>>>>>>> >> But frankly, I'm ashamed to say I don't know what tesseract is in
>>>>>>>>>> >> this context 🙂
>>>>>>>>>> >>
>>>>>>>>>> >> Thanks
>>>>>>>>>> >>
>>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <talli...@apache.org> wrote:
>>>>>>>>>> >>
>>>>>>>>>> >>> Thank you, Slava!
>>>>>>>>>> >>>
>>>>>>>>>> >>> Do you have tesseract installed?
>>>>>>>>>> >>>
>>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>>>>>>>> >>>
>>>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <slav...@gmail.com> wrote:
>>>>>>>>>> >>>> Hi,
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> I have a large PDF (about 65 MB) that contains mainly text and some images.
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
>>>>>>>>>> >>>> running on a Xeon server with a 4-core CPU, 30 GB RAM and an SSD disk,
>>>>>>>>>> >>>> running CentOS Linux).
>>>>>>>>>> >>>> Please advise if there is anything I can do to speed it up. Or maybe
>>>>>>>>>> >>>> it's a bug in PDFBox?
>>>>>>>>>> >>>> When I print the Java stack, I always see this stack:
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>>>>>> >>>>
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Thanks
>>>>>>>>>>
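
A note on the stack trace quoted above: nearly all of the time is spent in
HashMap$TreeNode.find, ending in COSString.equals. One possible reading (an
assumption; the thread does not confirm the root cause) is that many keys in
the HashSet share a hash code, so each lookup walks a large tree bin and calls
equals() on many nodes instead of finishing in near-constant time. The sketch
below reproduces that effect with plain JDK classes only. The names
CollidingKeysDemo and Key, and the constant hash code, are hypothetical and
exist purely for illustration; none of this is PDFBox code.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

public class CollidingKeysDemo {

    /** Counts calls to Key.equals() so the cost is visible without timing anything. */
    static long equalsCalls;

    /** Hypothetical key type for illustration only; it is not PDFBox's COSString. */
    static final class Key {
        private final String value;
        private final boolean constantHash;

        Key(String value, boolean constantHash) {
            this.value = value;
            this.constantHash = constantHash;
        }

        @Override
        public boolean equals(Object other) {
            equalsCalls++;
            return other instanceof Key && value.equals(((Key) other).value);
        }

        @Override
        public int hashCode() {
            // A constant hash forces every key into the same bucket.
            return constantHash ? 1 : value.hashCode();
        }
    }

    /** Adds n keys to a HashSet, looks each one up again, and returns the equals() call count. */
    static long countWithHashSet(int n, boolean constantHash) {
        equalsCalls = 0;
        Set<Key> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(new Key("obj-" + i, constantHash));
        }
        for (int i = 0; i < n; i++) {
            if (!seen.contains(new Key("obj-" + i, constantHash))) {
                throw new IllegalStateException("key " + i + " should be present");
            }
        }
        return equalsCalls;
    }

    /** Same workload with an identity-based set, which compares references and never calls equals(). */
    static long countWithIdentitySet(int n) {
        equalsCalls = 0;
        Set<Key> seen = Collections.newSetFromMap(new IdentityHashMap<Key, Boolean>());
        Key[] keys = new Key[n];
        for (int i = 0; i < n; i++) {
            keys[i] = new Key("obj-" + i, true);
            seen.add(keys[i]);
        }
        for (Key key : keys) {
            if (!seen.contains(key)) {
                throw new IllegalStateException("instance should be present");
            }
        }
        return equalsCalls;
    }

    public static void main(String[] args) {
        int n = 2000;
        System.out.println("HashSet, well-spread hashes: " + countWithHashSet(n, false));
        System.out.println("HashSet, colliding hashes:   " + countWithHashSet(n, true));
        System.out.println("Identity set:                " + countWithIdentitySet(n));
    }
}
```

Counting equals() calls rather than measuring wall-clock time keeps the
comparison deterministic. Note that an identity-based set changes semantics:
two distinct but equal objects count as separate entries, which may or may not
suit a "have I already processed this object" check.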
