this is some problem with the numbers of components being different. Try
pdfbox-app instead.
Tilman
--- Original message ---
From: Søren Pedersen
Subject: Re: Possible memory leak when extracting text?
Date: 15.05.2019, 9:08
To: users@pdfbox.apache.org
I have been trying to add the 2.0.16-SNAPSHOT version as a dependency to my
application, but I keep having issues. I added this to my pom file:
<repositories>
<repository>
<id>repository.apache.org.snapshots</id>
<name>Apache snapshots repo</name>
<url>https://repository.apache.org/content/groups/snapshots/</url>
<snapshots>
<enabled>true</enabled>
</snapshots>
<releases>
<enabled>false</enabled>
</releases>
</repository>
</repositories>
And then I added this under dependencies:
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.16-SNAPSHOT</version>
</dependency>
When I run “mvn compile” I get this error:
[ERROR] Failed to execute goal on project pdftextextractor: Could not
resolve dependencies for project nu.optimise:pdftextextractor:jar:1.0:
Failed to collect dependencies at
org.apache.pdfbox:pdfbox:jar:2.0.16-SNAPSHOT: Failed to read artifact
descriptor for org.apache.pdfbox:pdfbox:jar:2.0.16-SNAPSHOT: Could not find
artifact org.apache.pdfbox:pdfbox-parent:pom:2.0.16-20190513.180308-43 in
repository.apache.org.snapshots
(https://repository.apache.org/content/groups/snapshots/) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions,
please read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
I am probably missing something obvious, but I haven’t been working with
Java for that long, so I have no clue what to do (my googling skills did
not prevail).
Do you have any tips?
Thanks a lot in advance!
Best regards,
Søren
On 11 May 2019, 11.04 +0200, Tilman Hausherr <thaush...@t-online.de>, wrote:
The reason I mentioned 2.0.16 is because of this bug:
https://issues.apache.org/jira/browse/PDFBOX-4489
that one happened with a corrupt file. Yours isn't, but it might be if
it gets corrupted in transfer or in filtering.
2.0.16 snapshot is here:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.16-SNAPSHOT/
Tilman
On 11.05.2019 at 06:54, Søren Pedersen wrote:
Ok, that is very interesting. Thanks a lot for looking into this!
I am a bit baffled as to why we experience the memory leak then, but I
guess I will have to dig more into it.
Best regards,
Søren
On 10 May 2019, 18.30 +0200, Andreas Lehmkuehler <andr...@lehmi.de>, wrote:
On 10.05.19 at 15:52, Søren Pedersen wrote:
I have done some more testing, and I found that when I run on
Windows there are no problems, but when I run on Linux I get the memory
leak. Tilman, would you be able to run the same test on a Linux box? - or
maybe using a Linux Docker container, like I showed originally?
I've extracted the text on Linux (Fedora 30, OpenJDK 1.8.0_212) without any
problems using
java -Xmx9m -jar pdfbox-app-2.0.15.jar ExtractText
where -Xmx9m is the smallest working value
Andreas
We would prefer to run our app on Linux, but this looks like a
blocker for that unfortunately :(
Best regards,
Søren Pedersen
On 10 May 2019, 09.32 +0200, Søren Pedersen <sh.peder...@gmail.com>, wrote:
Ok, thanks a lot for looking into this Tilman. I will try your
suggestion and keep fiddling with it :)
Have a great weekend!
On 10 May 2019, 08.12 +0200, Tilman Hausherr <thaush...@t-online.de>, wrote:
On 10.05.2019 at 07:22, Søren Pedersen wrote:
We have an application that can index the contents of PDF files, so that we
can use that for a search algorithm. We use the Apache PDFBox library for
extracting text from a PDF, like this (where inputStream is a
ByteArrayInputStream containing the contents of the PDF file):
PDFTextStripper pdfStripper = new PDFTextStripper();
pdDoc = PDDocument.load(inputStream, MemoryUsageSetting.setupTempFileOnly());
String parsedText = pdfStripper.getText(pdDoc);
You can pass the byte[] directly to load(). Also make sure that the
bytes are not altered in any way, e.g. through an incorrectly configured
web download, or an incorrectly configured resource loading
("filtering" option must be false).
Also retry with 2.0.16 snapshot.
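One possible way to check that the bytes are not altered is to compare a SHA-256 digest of the byte[] the application actually receives with a digest of the original file. The following is only a sketch using the JDK; the class name PdfBytesCheck and the method sha256Hex are made up for illustration and are not part of PDFBox.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of PDFBox: fingerprints PDF bytes so that
// two copies of the same file can be compared.
final class PdfBytesCheck {

    // Returns the SHA-256 digest of the given bytes as lowercase hex.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                // Mask to 0..255 so negative bytes print as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    // Prints the length and digest of the file given as first argument.
    public static void main(String[] args) throws Exception {
        byte[] pdfBytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(pdfBytes.length + " bytes, sha256=" + sha256Hex(pdfBytes));
    }
}
```

Calling sha256Hex on the byte[] just before it is handed to load() and comparing the result with, for example, the output of sha256sum for the original file should show whether a download or resource filtering step changed the content.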
Tilman
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org