Re: AW: Re: Possible memory leak when extracting text?

Søren Pedersen Wed, 15 May 2019 02:28:41 -0700

OK, but is that a problem in my local app, or is it a problem with the files on 
the snapshot repo?


I now see that when my app downloads the file from the FTP server, the file 
seems to be corrupted, and I am probably running into the exact problem you 
linked to. This is what the file looks like after I have fetched it via FTP: 
https://we.tl/t-RglMuSLEIz

When opening that file in a PDF reader I see that the text is all jumbled up. 
When I use v. 2.0.15 to extract text from it I get that OutOfMemoryError.

When I use 2.0.16-SNAPSHOT I get this error instead:

SPE@spe-imac:~/Downloads$ java -jar 
pdfbox-app-2.0.16-20190513.182615-76.jar ExtractText 
4236a711-0f64-44ed-a2f2-e6342153809b.pdf
May 15, 2019 11:12:06 AM org.apache.pdfbox.pdfparser.COSParser 
validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using 
workaround to read the stream, stream start position: 130, length: 189182, 
expected end position: 189312
May 15, 2019 11:12:06 AM org.apache.pdfbox.pdfparser.COSParser 
validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using 
workaround to read the stream, stream start position: 249137, length: 11400, 
expected end position: 260537
May 15, 2019 11:12:06 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
WARNING: Could not read embedded OTF for font CIDFont+F1
java.io.IOException: LangSysRecords not alphabetically sorted by LangSys tag: 
ltÒ <= scÊ
at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:125)
at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:353)
at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:79)
at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:27)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:73)
at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:109)
at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:62)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createDescendantFont(PDFontFactory.java:139)
at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:192)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:97)
at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:61)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:869)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:505)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:375)
at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:272)
at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:96)
at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)

May 15, 2019 11:12:07 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 
findFontOrSubstitute
WARNING: Using fallback font LiberationSans for CID-keyed TrueType font 
CIDFont+F1

So at least v. 2.0.16 does not go out of memory :)

So my real problem seems to be in the FTP transfer.
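
One way to confirm that the transfer is what damages the file is to compare a checksum of the original with a checksum of the downloaded copy. Below is a minimal sketch using only the JDK. The class name TransferCheck and its method names are made up for illustration and are not part of PDFBox. A frequent cause of this kind of damage is an FTP client transferring in ASCII mode instead of binary mode, but that is an assumption here and has not been confirmed in this thread.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not part of PDFBox): compares contents by SHA-256 digest.
final class TransferCheck {

    private TransferCheck() {
    }

    // Returns the SHA-256 digest of the given bytes as lowercase hex.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // True when both byte arrays produce the same digest, i.e. the copy is intact.
    static boolean sameContent(byte[] original, byte[] downloaded) {
        return sha256Hex(original).equals(sha256Hex(downloaded));
    }
}
```

With both files on disk, TransferCheck.sameContent(Files.readAllBytes(originalPath), Files.readAllBytes(downloadedPath)) returns false if even a single byte differs. Here originalPath and downloadedPath are placeholders for the two file locations.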

On 15 May 2019, 09.43 +0200, Tilman Hausherr <thaush...@t-online.de>, wrote:
> this is some problem with the numbers of components being different. Try
> pdfbox-app instead.
> Tilman
>
>
> --- Original message ---
> From: Søren Pedersen
> Subject: Re: Possible memory leak when extracting text?
> Date: 15.05.2019, 9:08
> To: users@pdfbox.apache.org
>
> I have been trying to add the 2.0.16-SNAPSHOT version as a dependency to my
> application, but I keep having issues. I added this to my pom file:
>
> <repositories>
> <repository>
> <id>repository.apache.org.snapshots</id>
> <name>Apache snapshots repo</name>
> <url>https://repository.apache.org/content/groups/snapshots/</url>
> <snapshots>
> <enabled>true</enabled>
> </snapshots>
> <releases>
> <enabled>false</enabled>
> </releases>
> </repository>
> </repositories>
>
> And then I added this under dependencies:
>
> <dependency>
> <groupId>org.apache.pdfbox</groupId>
> <artifactId>pdfbox</artifactId>
> <version>2.0.16-SNAPSHOT</version>
> </dependency>
>
> When I run “mvn compile” I get this error:
>
> [ERROR] Failed to execute goal on project pdftextextractor: Could not
> resolve dependencies for project nu.optimise:pdftextextractor:jar:1.0:
> Failed to collect dependencies at
> org.apache.pdfbox:pdfbox:jar:2.0.16-SNAPSHOT: Failed to read artifact
> descriptor for org.apache.pdfbox:pdfbox:jar:2.0.16-SNAPSHOT: Could not find
> artifact org.apache.pdfbox:pdfbox-parent:pom:2.0.16-20190513.180308-43 in
> repository.apache.org.snapshots
> (https://repository.apache.org/content/groups/snapshots/) -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
>
>
> I am probably missing something obvious, but I haven’t been working with
> Java for that long, so I have no clue what to do (my googling skills did
> not prevail).
>
> Do you have any tips?
>
> Thanks a lot in advance!
>
> Best regards,
> Søren
>
>
> On 11 May 2019, 11.04 +0200, Tilman Hausherr <thaush...@t-online.de>, wrote:
> > The reason I mentioned 2.0.16 is because of this bug:
> > https://issues.apache.org/jira/browse/PDFBOX-4489
> >
> > that one happened with a corrupt file. Yours isn't, but it might be if
> > it gets corrupted in transfer or in filtering.
> >
> > 2.0.16 snapshot is here:
> >
> > https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.16-SNAPSHOT/
> >
> > Tilman
> >
> > Am 11.05.2019 um 06:54 schrieb Søren Pedersen:
> > > Ok, that is very interesting. Thanks a lot for looking into this!
> > >
> > > I am a bit baffled as to why we experience the memory leak then, but I
> > > guess I will have to dig more into it.
> > >
> > > Best regards,
> > > Søren
> > > On 10 May 2019, 18.30 +0200, Andreas Lehmkuehler <andr...@lehmi.de>, wrote:
> > > > Am 10.05.19 um 15:52 schrieb Søren Pedersen:
> > > > > I have done some more testing, and I found that when I run on
> > > > > Windows there are no problems, but when I run on Linux I get the memory
> > > > > leak. Tilman, would you be able to run the same test on a Linux box? - or
> > > > > maybe using a Linux Docker container, like I showed originally?
> > > > I've extracted the text on linux (fedora 30, openjdk 1.8.0_212) without any
> > > > problems using
> > > >
> > > > java -Xmx9m -jar pdfbox-app-2.0.15.jar ExtractText
> > > >
> > > > where -Xmx9m is the smallest working value
> > > >
> > > > Andreas
> > > >
> > > > > We would prefer to run our app on Linux, but this looks like a
> > > > > blocker for that unfortunately :(
> > > > >
> > > > > Best regards,
> > > > > Søren Pedersen
> > > > > On 10 May 2019, 09.32 +0200, Søren Pedersen <sh.peder...@gmail.com>, wrote:
> > > > > > Ok, thanks a lot for looking into this Tilman. I will try your
> > > > > > suggestion and keep fiddling with it :)
> > > > > >
> > > > > > Have a great weekend!
> > > > > > On 10 May 2019, 08.12 +0200, Tilman Hausherr <thaush...@t-online.de>, wrote:
> > > > > > > Am 10.05.2019 um 07:22 schrieb Søren Pedersen:
> > > > > > > > We have an application that can index the contents of PDF files, so that we
> > > > > > > > can use that for a search algorithm. We use the Apache PDFBox library for
> > > > > > > > extracting text from a PDF, like this (where inputStream is a
> > > > > > > > ByteArrayInputStream containing the contents of the PDF file):
> > > > > > > >
> > > > > > > > PDFTextStripper pdfStripper = new PDFTextStripper();
> > > > > > > > pdDoc = PDDocument.load(inputStream,
> > > > > > > > MemoryUsageSetting.setupTempFileOnly());
> > > > > > > > String parsedText = pdfStripper.getText(pdDoc);
> > > > > > >
> > > > > > > You can pass the byte[] directly to load(). Also make sure that the
> > > > > > > bytes are not altered in any way, e.g. through an incorrectly configured
> > > > > > > web download, or an incorrectly configured resource loading
> > > > > > > ("filtering" option must be false).
> > > > > > >
> > > > > > >
> > > > > > > Also retry with 2.0.16 snapshot.
> > > > > > >
> > > > > > > Tilman
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> > > > > > > > For additional commands, e-mail: users-h...@pdfbox.apache.org
> > > > > > >
> > > >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> <mailto:users-unsubscr...@pdfbox.apache.org>
> > For additional commands, e-mail: users-h...@pdfbox.apache.org
> <mailto:users-h...@pdfbox.apache.org>
> >
