Re: Fwd: Very slow PDF parsing.

Slava G Thu, 28 Feb 2019 08:27:28 -0800

Tim, which email address should I send the PDF to?
Thanks

On Thu, Feb 28, 2019 at 3:57 PM Slava G <slav...@gmail.com> wrote:


> I will, once I get the customer's approval.
> Meanwhile, I can run any checks if you tell me what to check.
> Thanks
>
> On Thu, Feb 28, 2019 at 3:56 PM Tim Allison <talli...@apache.org> wrote:
>
>> Any chance you can share the file directly with me or someone else on the
>> PDFBox team?
>>
>> On Wed, Feb 27, 2019 at 11:24 AM Slava G <slav...@gmail.com> wrote:
>>
>> > After 3h 40m it's still parsing using PDFBox 2.0.14 app...
>> > Thanks
>> >
>> > On Wed, Feb 27, 2019 at 3:29 PM Slava G <slav...@gmail.com> wrote:
>> >
>> >> With 2.0.14 it has been running for 40 minutes with no result, and it is
>> >> still working... It seems the issue is still there.
>> >> Thanks
>> >>
>> >> On Wed, Feb 27, 2019 at 2:52 PM Slava G <slav...@gmail.com> wrote:
>> >>
>> >>> Checking with 2.0.14. Started as an app. Will update soon.
>> >>>
>> >>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <talli...@apache.org> wrote:
>> >>>
>> >>>> Any chance you could try with the 2.0.14 release candidate...unless you
>> >>>> have already?
>> >>>>
>> >>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>> >>>>
>> >>>>
>> >>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <slav...@gmail.com> wrote:
>> >>>>
>> >>>>> Well, I ran the PDFBox app to extract text (as suggested); so far 2
>> >>>>> hours and still counting...
>> >>>>> It seems to be a PDFBox issue.
>> >>>>>
>> >>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jbdat...@gmail.com> wrote:
>> >>>>>
>> >>>>>> Why don't you do a basic test with tika server in a 3thrd and a
>> >>>>>> *wget* or *curl* bash client to parse your 65 MB PDF?
>> >>>>>> It can be easier to investigate the problem.
>> >>>>>>
>> >>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Tue, Feb 26, 2019 at 11:05 PM Cristian Vat <cristian....@gmail.com>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>> Just looking at the stack trace: it won't be the same anymore due to
>> >>>>>>> PDFBOX-4453.
>> >>>>>>> Some changes are present in the not yet released PDFBox 2.0.14, and
>> >>>>>>> they change how decryption is handled. Not sure if it's related, though.
>> >>>>>>>
>> >>>>>>> Can you duplicate the problem without Tika using just PDFBox
>> >>>>>>> command-line ExtractText command (
>> >>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <slav...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>>> This is the code :
>> >>>>>>>> InputStream inputStream = TikaInputStream.get(inputFile.toPath());
>> >>>>>>>> PDFParser tmpPdf = new PDFParser();
>> >>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>> >>>>>>>> config.setMaxMainMemoryBytes(31457280); // 30 MB
>> >>>>>>>> config.setExtractAcroFormContent(false);
>> >>>>>>>> config.setExtractBookmarksText(false);
>> >>>>>>>> config.setCatchIntermediateIOExceptions(true);
>> >>>>>>>> Metadata metadata = new Metadata();
>> >>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>> >>>>>>>> tmpPdf.parse(inputStream, textHandler, metadata, new ParseContext());
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <talli...@apache.org>
>> >>>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> This is the default in Tika, where the default for
>> >>>>>>>>> maxMainMemoryBytes=500MB.
>> >>>>>>>>>
>> >>>>>>>>> Slava, how are you calling this in Tika?  With a TikaInputStream
>> >>>>>>>>> via tika-app or tika-server or something else?
>> >>>>>>>>>
>> >>>>>>>>> MemoryUsageSetting memoryUsageSetting =
>> >>>>>>>>>         MemoryUsageSetting.setupMainMemoryOnly();
>> >>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>> >>>>>>>>>     memoryUsageSetting =
>> >>>>>>>>>             MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>> >>>>>>>>> }
>> >>>>>>>>> if (tstream != null && tstream.hasFile()) {
>> >>>>>>>>>     // File based -- send file directly to PDFBox
>> >>>>>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(),
>> >>>>>>>>>             password, memoryUsageSetting);
>> >>>>>>>>> } else {
>> >>>>>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>> >>>>>>>>>             password, memoryUsageSetting);
>> >>>>>>>>> }
>> >>>>>>>>>
>> >>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
>> >>>>>>>>> thaush...@t-online.de> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Hi,
>> >>>>>>>>>>
>> >>>>>>>>>> As usual, it would be nice to have the PDF, so that we could run the
>> >>>>>>>>>> profiler.
>> >>>>>>>>>>
>> >>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>> >>>>>>>>>>
>> >>>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>> >>>>>>>>>> password.
>> >>>>>>>>>>
>> >>>>>>>>>> It would also be interesting to hear what parameter is passed to
>> >>>>>>>>>> MemoryUsageSetting when load() is called.
>> >>>>>>>>>>
>> >>>>>>>>>> Tilman
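
A minimal sketch of the "visited set" idea mentioned above, using only the JDK. It is not PDFBox's SecurityHandler code: the `Node` class, the `process` method and the counter are hypothetical stand-ins. The stack trace in this thread shows `COSString.equals` being called from inside `HashSet.contains`; one way a guard set can avoid content comparisons altogether is to track objects by identity, which is what this sketch assumes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class VisitedSetSketch {
    /** Hypothetical stand-in for a parsed object; not a PDFBox class. */
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }
    }

    // Identity-based set: membership is decided by reference (==), so no
    // equals()/hashCode() on the node contents is ever invoked.
    private final Set<Node> visited =
            Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());

    int processed = 0;

    void process(Node node) {
        if (!visited.add(node)) {
            return; // already handled once; this also stops cycles
        }
        processed++; // placeholder for the real per-object work
        for (Node child : node.children) {
            process(child);
        }
    }

    public static void main(String[] args) {
        Node root = new Node("root");
        Node shared = new Node("shared");
        root.children.add(shared);
        root.children.add(shared); // same object referenced twice
        shared.children.add(root); // cycle back to the root

        VisitedSetSketch sketch = new VisitedSetSketch();
        sketch.process(root);
        System.out.println(sketch.processed); // prints 2
    }
}
```

Each object is handled once even though `shared` is referenced twice and the graph contains a cycle.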
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>> >>>>>>>>>> > PDFBox Colleagues,
>> >>>>>>>>>> >    Any ideas?
>> >>>>>>>>>> >
>> >>>>>>>>>> > ---------- Forwarded message ---------
>> >>>>>>>>>> > From: Tim Allison <talli...@apache.org>
>> >>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>> >>>>>>>>>> > Subject: Re: Very slow PDF parsing.
>> >>>>>>>>>> > To: <u...@tika.apache.org>
>> >>>>>>>>>> >
>> >>>>>>>>>> >
>> >>>>>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down processing
>> >>>>>>>>>> > dramatically is if you have tesseract installed (try typing 'tesseract'
>> >>>>>>>>>> > on your commandline) and if you've turned it on for PDFs.  I suspect
>> >>>>>>>>>> > this isn't your problem, though.
>> >>>>>>>>>> >
>> >>>>>>>>>> >
>> >>>>>>>>>> >
>> >>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <slav...@gmail.com> wrote:
>> >>>>>>>>>> >
>> >>>>>>>>>> >> Thanks Tim,
>> >>>>>>>>>> >> But frankly speaking, I'm ashamed to say I don't know what tesseract
>> >>>>>>>>>> >> is in this context 🙂
>> >>>>>>>>>> >>
>> >>>>>>>>>> >> Thanks
>> >>>>>>>>>> >>
>> >>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <talli...@apache.org> wrote:
>> >>>>>>>>>> >>
>> >>>>>>>>>> >>> Thank you, Slava!
>> >>>>>>>>>> >>>
>> >>>>>>>>>> >>> Do you have tesseract installed?
>> >>>>>>>>>> >>>
>> >>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>> >>>>>>>>>> >>>
>> >>>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <slav...@gmail.com> wrote:
>> >>>>>>>>>> >>>> Hi,
>> >>>>>>>>>> >>>>
>> >>>>>>>>>> >>>> I have a large PDF (about 65 MB) that contains mainly text and some
>> >>>>>>>>>> >>>> images.
>> >>>>>>>>>> >>>>
>> >>>>>>>>>> >>>> Parsing such a PDF can take about 2 days or even more (Tika 1.19.1
>> >>>>>>>>>> >>>> running on a Xeon server with a 4-core CPU, 30 GB RAM and an SSD
>> >>>>>>>>>> >>>> disk, running CentOS Linux).
>> >>>>>>>>>> >>>> Please advise if there is anything I can do to speed it up. Or maybe
>> >>>>>>>>>> >>>> it's a bug in PDFBox?
>> >>>>>>>>>> >>>> When I print the Java stack, I always see this stack:
>> >>>>>>>>>> >>>>
>> >>>>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>> >>>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>> >>>>>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>> >>>>>>>>>> >>>>
>> >>>>>>>>>> >>>>
>> >>>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>> >>>>>>>>>> >>>>
>> >>>>>>>>>> >>>> Thanks
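
The observation above (the same frames showing up every time the stack is printed) can be reproduced programmatically. The following is a rough sketch using only the JDK; the class name `StackSampler`, the method `sampleTopFrames` and the busy-loop worker are hypothetical and merely stand in for a long-running parse thread. In practice a tool such as `jstack` run against the JVM process gives the same kind of information without code changes.

```java
import java.util.ArrayList;
import java.util.List;

public class StackSampler {
    /**
     * Samples the target thread's stack a fixed number of times and returns
     * the top frame seen in each sample. If the same frame dominates, the
     * thread is likely spending most of its time there.
     */
    static List<String> sampleTopFrames(Thread target, int samples, long intervalMillis) {
        List<String> topFrames = new ArrayList<>();
        for (int i = 0; i < samples; i++) {
            StackTraceElement[] stack = target.getStackTrace();
            // getStackTrace() returns an empty array for a thread that is not running
            topFrames.add(stack.length > 0 ? stack[0].toString() : "<no frames>");
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop sampling
                break;
            }
        }
        return topFrames;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the slow parsing thread.
        Thread worker = new Thread(() -> {
            long x = 0;
            while (!Thread.currentThread().isInterrupted()) {
                x += System.nanoTime() % 7;
            }
        }, "busy-worker");
        worker.setDaemon(true);
        worker.start();

        for (String frame : sampleTopFrames(worker, 5, 100)) {
            System.out.println(frame);
        }
        worker.interrupt();
    }
}
```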
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>>
>
