OK, but is that a problem in my local app, or is it a problem with the files on the snapshot repo?
I now see that when my app downloads the file from the FTP server, the file seems to be corrupted, and I am probably running into the exact problem you linked to. This is what the file looks like after I have fetched it via FTP: https://we.tl/t-RglMuSLEIz

When I open that file in a PDF reader, the text is all jumbled up. When I use v. 2.0.15 to extract text from it, I get that OutOfMemoryError. When I use 2.0.16-SNAPSHOT, I get this error instead:

SPE@spe-imac:~/Downloads$ java -jar pdfbox-app-2.0.16-20190513.182615-76.jar ExtractText 4236a711-0f64-44ed-a2f2-e6342153809b.pdf
May 15, 2019 11:12:06 AM org.apache.pdfbox.pdfparser.COSParser validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 130, length: 189182, expected end position: 189312
May 15, 2019 11:12:06 AM org.apache.pdfbox.pdfparser.COSParser validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 249137, length: 11400, expected end position: 260537
May 15, 2019 11:12:06 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
WARNING: Could not read embedded OTF for font CIDFont+F1
java.io.IOException: LangSysRecords not alphabetically sorted by LangSys tag: ltÒ <= scÊ
    at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:125)
    at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
    at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
    at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:353)
    at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
    at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:79)
    at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:27)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
    at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:73)
    at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:109)
    at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:62)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createDescendantFont(PDFontFactory.java:139)
    at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:192)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:97)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:61)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:869)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:505)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:375)
    at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:272)
    at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:96)
    at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
May 15, 2019 11:12:07 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 findFontOrSubstitute
WARNING: Using fallback font LiberationSans for CID-keyed TrueType font CIDFont+F1

So at least v. 2.0.16 does not go out of memory :)

My real problem seems to be in the FTP transfer, then.

On 15 May 2019, 09.43 +0200, Tilman Hausherr <thaush...@t-online.de>, wrote:
> this is some problem with the numbers of components being different. Try
> pdfbox-app instead.
> Tilman
>
> --- Original message ---
> From: Søren Pedersen
> Subject: Re: Possible memory leak when extracting text?
> Date: 15.05.2019, 9:08
> To: users@pdfbox.apache.org
>
> I have been trying to add the 2.0.16-SNAPSHOT version as a dependency to my
> application, but I keep having issues. I added this to my pom file:
>
> <repositories>
>   <repository>
>     <id>repository.apache.org.snapshots</id>
>     <name>Apache snapshots repo</name>
>     <url>https://repository.apache.org/content/groups/snapshots/</url>
>     <snapshots>
>       <enabled>true</enabled>
>     </snapshots>
>     <releases>
>       <enabled>false</enabled>
>     </releases>
>   </repository>
> </repositories>
>
> And then I added this under dependencies:
>
> <dependency>
>   <groupId>org.apache.pdfbox</groupId>
>   <artifactId>pdfbox</artifactId>
>   <version>2.0.16-SNAPSHOT</version>
> </dependency>
>
> When I run “mvn compile” I get this error:
>
> [ERROR] Failed to execute goal on project pdftextextractor: Could not
> resolve dependencies for project nu.optimise:pdftextextractor:jar:1.0:
> Failed to collect dependencies at
> org.apache.pdfbox:pdfbox:jar:2.0.16-SNAPSHOT: Failed to read artifact
> descriptor for org.apache.pdfbox:pdfbox:jar:2.0.16-SNAPSHOT: Could not find
> artifact org.apache.pdfbox:pdfbox-parent:pom:2.0.16-20190513.180308-43 in
> repository.apache.org.snapshots (
> https://repository.apache.org/content/groups/snapshots/) -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
>
> I am probably missing something obvious, but I haven’t been working with
> Java for that long, so I have no clue what to do (my googling skills did
> not prevail).
>
> Do you have any tips?
>
> Thanks a lot in advance!
>
> Best regards,
> Søren
>
> On 11 May 2019, 11.04 +0200, Tilman Hausherr <thaush...@t-online.de>, wrote:
> > The reason I mentioned 2.0.16 is because of this bug:
> > https://issues.apache.org/jira/browse/PDFBOX-4489
> >
> > that one happened with a corrupt file. Yours isn't, but it might be if
> > it gets corrupted in transfer or in filtering.
> >
> > 2.0.16 snapshot is here:
> > https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.16-SNAPSHOT/
> >
> > Tilman
> >
> > On 11.05.2019 at 06:54, Søren Pedersen wrote:
> > > Ok, that is very interesting. Thanks a lot for looking into this!
> > >
> > > I am a bit baffled as to why we experience the memory leak then, but I
> > > guess I will have to dig more into it.
> > >
> > > Best regards,
> > > Søren
> > > On 10 May 2019, 18.30 +0200, Andreas Lehmkuehler <andr...@lehmi.de>, wrote:
> > > > On 10.05.19 at 15:52, Søren Pedersen wrote:
> > > > > I have done some more testing, and I found that when I run on
> > > > > Windows there are no problems, but when I run on Linux I get the
> > > > > memory leak. Tilman, would you be able to run the same test on a
> > > > > Linux box, or maybe in a Linux Docker container, like I showed
> > > > > originally?
> > > > I've extracted the text on Linux (Fedora 30, OpenJDK 1.8.0_212) without any
> > > > problems using
> > > >
> > > > java -Xmx9m -jar pdfbox-app-2.0.15.jar ExtractText
> > > >
> > > > where -Xmx9m is the smallest working value
> > > >
> > > > Andreas
> > > >
> > > > > We would prefer to run our app on Linux, but this looks like a
> > > > > blocker for that unfortunately :(
> > > > >
> > > > > Best regards,
> > > > > Søren Pedersen
> > > > > On 10 May 2019, 09.32 +0200, Søren Pedersen <sh.peder...@gmail.com>, wrote:
> > > > > > Ok, thanks a lot for looking into this Tilman. I will try your
> > > > > > suggestion and keep fiddling with it :)
> > > > > >
> > > > > > Have a great weekend!
> > > > > > On 10 May 2019, 08.12 +0200, Tilman Hausherr <thaush...@t-online.de>, wrote:
> > > > > > > On 10.05.2019 at 07:22, Søren Pedersen wrote:
> > > > > > > > We have an application that can index the contents of PDF files, so that we
> > > > > > > > can use that for a search algorithm.
> > > > > > > > We use the Apache PDFBox library for
> > > > > > > > extracting text from a PDF, like this (where inputStream is a
> > > > > > > > ByteArrayInputStream containing the contents of the PDF file):
> > > > > > > >
> > > > > > > > PDFTextStripper pdfStripper = new PDFTextStripper();
> > > > > > > > pdDoc = PDDocument.load(inputStream,
> > > > > > > >     MemoryUsageSetting.setupTempFileOnly());
> > > > > > > > String parsedText = pdfStripper.getText(pdDoc);
> > > > > > >
> > > > > > > You can pass the byte[] directly to load(). Also make sure that the
> > > > > > > bytes are not altered in any way, e.g. through an incorrectly configured
> > > > > > > web download, or incorrectly configured resource loading
> > > > > > > (the "filtering" option must be false).
> > > > > > >
> > > > > > > Also retry with 2.0.16 snapshot.
> > > > > > >
> > > > > > > Tilman
> > > > > > >
> > > > > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> > > > > > > For additional commands, e-mail: users-h...@pdfbox.apache.org
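
For anyone who lands on this thread with the same symptom: the advice above that the bytes must not be altered in transfer can be checked before the file ever reaches PDFBox. The following is only a minimal sketch using the Java standard library. The class and method names (TransferCheck, sha256Hex, matchesExpected) are hypothetical and not part of PDFBox, and the expected digest is assumed to come from running a tool such as sha256sum on the original file on the server.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper (not part of PDFBox): checks that a transfer left the bytes unchanged. */
public final class TransferCheck {

    private TransferCheck() {
    }

    /** Returns the SHA-256 digest of the given bytes as lowercase hex. */
    public static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Compares the digest of the downloaded bytes with a digest computed
     * on the original file (assumption: obtained out of band).
     */
    public static boolean matchesExpected(byte[] downloaded, String expectedHex) {
        return sha256Hex(downloaded).equalsIgnoreCase(expectedHex.trim());
    }
}
```

If matchesExpected returns false for the bytes fetched over FTP, the file was changed on the way, and one common cause of that (an assumption here, since the thread does not say which FTP client is used) is a client that transfers in ASCII mode instead of binary mode.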