Re: Possible memory leak when extracting text?

Tilman Hausherr Wed, 15 May 2019 09:11:21 -0700

On 15.05.2019 at 11:27, Søren Pedersen wrote:

OK, but is that a problem in my local app, or is it a problem with the files on 
the snapshot repo?

I suspect that it is a problem with the build process. Maybe it is because
we changed something a few months ago. I have the problem too sometimes. I
didn't really research this because I do the builds myself anyway, the
pdfbox-app can be used too, and I don't know enough to fix it.


I now see that when my app downloads the file from the FTP server, the file 
seems to be corrupted, and I am probably running into the exact problem you 
linked to. This is what the file looks like after I have fetched it via FTP: 
https://we.tl/t-RglMuSLEI


LOL, the good old ftp ascii transfer.
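
To illustrate why an ASCII-mode transfer is harmful for a PDF, here is a minimal sketch that only simulates one typical effect of such a transfer, namely rewriting bare LF bytes as CR LF. The class and method names are made up for this example and are not part of PDFBox or of any FTP library. Real ASCII transfers may alter the data in other ways as well, depending on client and server.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class AsciiTransferDemo {

    // Hypothetical helper: mimics one typical effect of an ASCII-mode
    // transfer by rewriting every bare LF (0x0A) as CR LF (0x0D 0x0A).
    public static byte[] simulateAsciiTransfer(byte[] original) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(original.length);
        int previous = -1;
        for (byte b : original) {
            if (b == '\n' && previous != '\r') {
                out.write('\r');
            }
            out.write(b);
            previous = b;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Binary payload that happens to contain two 0x0A bytes.
        byte[] payload = {0x25, 0x50, 0x44, 0x46, 0x0A, (byte) 0xE2, 0x0A, 0x00};
        byte[] received = simulateAsciiTransfer(payload);

        System.out.println("sent bytes:     " + payload.length);
        System.out.println("received bytes: " + received.length);
        System.out.println("identical:      " + Arrays.equals(payload, received));
    }
}
```

Every inserted byte shifts the data that follows it, so the stream lengths and offsets recorded inside the PDF no longer match the file, and compressed streams and embedded fonts get damaged. The remedy is to switch the FTP client to binary mode. If the application happens to use Apache Commons Net (an assumption, the thread does not say which client is used), that is done with FTPClient.setFileType(FTP.BINARY_FILE_TYPE) after login.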

The message you showed is related to this issue:

https://issues.apache.org/jira/browse/PDFBOX-4489

Tilman


When opening that file in a PDF reader I see that the text is all jumbled up. 
When I use v. 2.0.15 to extract text from it I get that OutOfMemoryError.

When I use 2.0.16-SNAPSHOT I get this error instead:

SPE@spe-imac:~/Downloads$ java -jar pdfbox-app-2.0.16-20190513.182615-76.jar ExtractText 4236a711-0f64-44ed-a2f2-e6342153809b.pdf
May 15, 2019 11:12:06 AM org.apache.pdfbox.pdfparser.COSParser validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 130, length: 189182, expected end position: 189312
May 15, 2019 11:12:06 AM org.apache.pdfbox.pdfparser.COSParser validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 249137, length: 11400, expected end position: 260537
May 15, 2019 11:12:06 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
WARNING: Could not read embedded OTF for font CIDFont+F1
java.io.IOException: LangSysRecords not alphabetically sorted by LangSys tag: ltÒ <= scÊ
at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:125)
at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:353)
at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:79)
at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:27)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:73)
at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:109)
at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:62)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createDescendantFont(PDFontFactory.java:139)
at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:192)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:97)
at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:61)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:869)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:505)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:375)
at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:272)
at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:96)
at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)

May 15, 2019 11:12:07 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 findFontOrSubstitute
WARNING: Using fallback font LiberationSans for CID-keyed TrueType font CIDFont+F1

So at least v. 2.0.16 does not run out of memory :)

My real problem seems to be in the FTP transfer, then.

On 15 May 2019, 09.43 +0200, Tilman Hausherr <thaush...@t-online.de>, wrote:

This is some problem with the numbers of the components being different. Try
pdfbox-app instead.
Tilman





--- Original Message ---
From: Søren Pedersen
Subject: Re: Possible memory leak when extracting text?
Date: 15.05.2019, 9:08
To: users@pdfbox.apache.org





I have been trying to add the 2.0.16-SNAPSHOT version as a dependency to my
application, but I keep having issues. I added this to my pom file:

<repositories>
<repository>
<id>repository.apache.org.snapshots</id>
<name>Apache snapshots repo</name>
<url>https://repository.apache.org/content/groups/snapshots/</url>
<snapshots>
<enabled>true</enabled>
</snapshots>
<releases>
<enabled>false</enabled>
</releases>
</repository>
</repositories>

And then I added this under dependencies:

<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.16-SNAPSHOT</version>
</dependency>

When I run “mvn compile” I get this error:

[ERROR] Failed to execute goal on project pdftextextractor: Could not
resolve dependencies for project nu.optimise:pdftextextractor:jar:1.0:
Failed to collect dependencies at
org.apache.pdfbox:pdfbox:jar:2.0.16-SNAPSHOT: Failed to read artifact
descriptor for org.apache.pdfbox:pdfbox:jar:2.0.16-SNAPSHOT: Could not find
artifact org.apache.pdfbox:pdfbox-parent:pom:2.0.16-20190513.180308-43 in
repository.apache.org.snapshots
(https://repository.apache.org/content/groups/snapshots/) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions,
please read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException


I am probably missing something obvious, but I haven’t been working with
Java for that long, so I have no clue what to do (my googling skills did
not prevail).

Do you have any tips?

Thanks a lot in advance!

Best regards,
Søren


On 11 May 2019, 11.04 +0200, Tilman Hausherr <thaush...@t-online.de>, wrote:

I mentioned 2.0.16 because of this bug:
https://issues.apache.org/jira/browse/PDFBOX-4489

That one happened with a corrupt file. Yours isn't, but it might be if
it gets corrupted in transfer or in filtering.

2.0.16 snapshot is here:

https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.16-SNAPSHOT/

Tilman

On 11.05.2019 at 06:54, Søren Pedersen wrote:

Ok, that is very interesting. Thanks a lot for looking into this!

I am a bit baffled as to why we experience the memory leak then, but I
guess I will have to dig more into it.

Best regards,
Søren
On 10 May 2019, 18.30 +0200, Andreas Lehmkuehler <andr...@lehmi.de>, wrote:

On 10.05.19 at 15:52, Søren Pedersen wrote:

I have done some more testing, and I found that when I run on Windows
there are no problems, but when I run on Linux I get the memory leak.
Tilman, would you be able to run the same test on a Linux box? Or maybe
using a Linux Docker container, like I showed originally?

I've extracted the text on Linux (Fedora 30, OpenJDK 1.8.0_212) without any
problems using

java -Xmx9m -jar pdfbox-app-2.0.15.jar ExtractText

where -Xmx9m is the smallest working value.
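
When comparing a run that works with one that does not (Windows vs. Linux, local vs. Docker), it can help to log which heap limit and runtime each JVM actually uses. The following is only a minimal sketch using standard JDK calls. The class name is made up for this example.

```java
public class HeapLimitReport {

    // Maximum amount of memory the JVM will attempt to use, in MiB.
    // This is roughly what -Xmx sets. The exact value can differ slightly
    // depending on the JVM and garbage collector.
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MiB): " + maxHeapMiB());
        System.out.println("java.version:   " + System.getProperty("java.version"));
        System.out.println("os.name:        " + System.getProperty("os.name"));
    }
}
```

Running it once with and once without an explicit -Xmx value in each environment shows whether the two setups start from the same limits.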

Andreas

We would prefer to run our app on Linux, but this looks like a blocker
for that unfortunately :(

Best regards,
Søren Pedersen
On 10 May 2019, 09.32 +0200, Søren Pedersen <sh.peder...@gmail.com>, wrote:

Ok, thanks a lot for looking into this, Tilman. I will try your suggestion
and keep fiddling with it :)

Have a great weekend!
On 10 May 2019, 08.12 +0200, Tilman Hausherr <thaush...@t-online.de>, wrote:

On 10.05.2019 at 07:22, Søren Pedersen wrote:

We have an application that can index the contents of PDF files, so that we
can use that for a search algorithm. We use the Apache PDFBox library for
extracting text from a PDF, like this (where inputStream is a
ByteArrayInputStream containing the contents of the PDF file):

PDFTextStripper pdfStripper = new PDFTextStripper();
pdDoc = PDDocument.load(inputStream, MemoryUsageSetting.setupTempFileOnly());
String parsedText = pdfStripper.getText(pdDoc);

You can pass the byte[] directly to load(). Also make sure that the bytes
are not altered in any way, e.g. through an incorrectly configured web
download, or an incorrectly configured resource loading (the "filtering"
option must be false).
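
One way to check that the bytes are unaltered is to compare a checksum of the original file with a checksum of the byte[] that is about to be handed to load(). The sketch below uses only the JDK. The class and method names are made up for this example, and the header test is a deliberately simple sanity check, not a full validation of the file.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PdfBytesCheck {

    // Hex-encoded SHA-256 of the given bytes. Compute this for the original
    // file and for the bytes the application received; the values must match.
    public static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Quick sanity check: does the data begin with the "%PDF-" marker?
    public static boolean startsWithPdfHeader(byte[] data) {
        byte[] header = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < header.length) {
            return false;
        }
        for (int i = 0; i < header.length; i++) {
            if (data[i] != header[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println("header ok: " + startsWithPdfHeader(sample));
        System.out.println("sha256:    " + sha256Hex(sample));
    }
}
```

If the checksum computed inside the application differs from the one computed on the source file (for example with sha256sum on the server), the file was changed somewhere between storage and parsing, and the problem is not in PDFBox.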


Also retry with 2.0.16 snapshot.

Tilman

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org

