Oh. I had looked at the trunk and not at 3.0. That was likely a mistake in refactoring. Fixed in

 https://issues.apache.org/jira/browse/PDFBOX-5757

and you can get a snapshot here
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.2-SNAPSHOT/

Tilman


On 01.02.2024 15:25, Lars Juel Jensen wrote:
That is weird. The source file I am looking at for version 3.0.1 does not
pass it:
https://github.com/apache/pdfbox/blob/3.0.1/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java#L91

On Wed, Jan 31, 2024 at 4:57 PM Tilman Hausherr <thaush...@t-online.de>
wrote:

On 31.01.2024 16:19, Lars Juel Jensen wrote:
Well, that's my problem. It works with PDFBox 2 with reasonably sized
files, but it crashes on the big ones. Reading the migration guide for
PDFBox 3.0, I thought I saw light at the end of the tunnel, as it says I
can create my own reader and stream cache. I see that I can provide my own
RandomAccessRead when I call Loader.loadPDF, but the loadPDF method that
takes a StreamCacheCreateFunction does not work as promised: the
StreamCacheCreateFunction is not passed from PDFParser to COSParser in the
PDFParser constructor. This works in v3.0.0, but not in v3.0.1. I guess
this is a bug?
I don't know if there is a bug, but it is passed:

      public PDFParser(RandomAccessRead source, String decryptionPassword,
              InputStream keyStore, String alias,
              StreamCacheCreateFunction streamCacheCreateFunction) throws IOException
      {
          super(source, decryptionPassword, keyStore, alias, streamCacheCreateFunction);
      }

and here's COSParser:

      public COSParser(RandomAccessRead source, String password,
              InputStream keyStore, String keyAlias,
              StreamCacheCreateFunction streamCacheCreateFunction) throws IOException
      {
          super(source);
          this.password = password;
          this.keyAlias = keyAlias;
          fileLen = source.length();
          keyStoreInputStream = keyStore;
          init(streamCacheCreateFunction);
      }
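
To illustrate the kind of refactoring mistake being discussed (a
constructor that accepts a factory function but does not forward it to its
superclass), here is a minimal sketch. All class names are hypothetical
stand-ins, not the PDFBox classes, and the fallback to a default cache is an
assumption made only for the illustration; this thread does not show what
the 3.0.1 code actually does when the function is missing.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a base parser that builds its cache from a factory.
class BaseParser {
    private final String cacheKind;

    BaseParser(Supplier<String> cacheFactory) {
        // Assumption for this sketch: fall back to a default when no factory is given.
        Supplier<String> effective =
                cacheFactory != null ? cacheFactory : () -> "default-memory-cache";
        this.cacheKind = effective.get();
    }

    String cacheKind() {
        return cacheKind;
    }
}

// Correct: the caller's factory is forwarded to the superclass.
class ForwardingParser extends BaseParser {
    ForwardingParser(Supplier<String> cacheFactory) {
        super(cacheFactory);
    }
}

// Buggy: the parameter is accepted but silently dropped.
class DroppingParser extends BaseParser {
    DroppingParser(Supplier<String> cacheFactory) {
        super(null);
    }
}

class ForwardingDemo {
    public static void main(String[] args) {
        Supplier<String> tempOnly = () -> "temp-file-cache";
        System.out.println(new ForwardingParser(tempOnly).cacheKind());
        System.out.println(new DroppingParser(tempOnly).cacheKind());
    }
}
```

In the sketch, both constructors compile and both calls succeed, which is
why a dropped argument like this can go unnoticed until memory use is
measured.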

If you think 3.0.1 has a bigger memory footprint than 3.0.0, can you
create a scenario to reproduce this? Preferably without using a container.

Tilman

On Wed, Jan 31, 2024 at 3:46 PM Tilman Hausherr <thaush...@t-online.de>
wrote:

On 31.01.2024 14:48, Lars Juel Jensen wrote:
This creates another problem for me. I am running PDFBox in an on-premises
Kubernetes cluster with limited resources. I cannot set up persistent
volume claims or ephemeral volumes, and I cannot change how my pods are
started. All I have is an emptyDir mounted on /tmp, where the temporary
files go. The emptyDir is mapped to a portion of the Kubernetes node's
memory, and this memory is shared with many other services. All in all, I
need to keep a very low memory and temp-file footprint, hence the
InputStream. Using RandomAccessReadBuffer with an InputStream loads the
entire PDF into memory, and I can encounter PDF documents over 1 GB in
size, so loading everything into memory is not an option.
You can try to create your own class extending RandomAccessRead.
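
As a rough idea of what such a class could be built on, the sketch below
spools an InputStream to a temporary file and then reads it by offset with
java.io.RandomAccessFile. It uses only the JDK; the class name
SpooledRandomAccess and its methods are hypothetical and it does not
implement the PDFBox RandomAccessRead interface, so a real implementation
would still have to map that interface's methods onto something like this.
It also only saves memory when the temp directory is backed by real disk.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of PDFBox: file-backed random access over a stream.
final class SpooledRandomAccess implements Closeable {
    private final Path tempFile;
    private final RandomAccessFile file;

    SpooledRandomAccess(InputStream in) throws IOException {
        tempFile = Files.createTempFile("spooled-", ".bin");
        try {
            // Files.copy streams through a small buffer, so heap use does not
            // grow with the size of the input.
            Files.copy(in, tempFile, StandardCopyOption.REPLACE_EXISTING);
            file = new RandomAccessFile(tempFile.toFile(), "r");
        } catch (IOException e) {
            Files.deleteIfExists(tempFile);
            throw e;
        }
    }

    long length() throws IOException {
        return file.length();
    }

    void seek(long position) throws IOException {
        file.seek(position);
    }

    long getPosition() throws IOException {
        return file.getFilePointer();
    }

    // Returns the next byte as 0..255, or -1 at end of file.
    int read() throws IOException {
        return file.read();
    }

    int read(byte[] b, int offset, int length) throws IOException {
        return file.read(b, offset, length);
    }

    @Override
    public void close() throws IOException {
        try {
            file.close();
        } finally {
            Files.deleteIfExists(tempFile);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "%PDF-1.7 sample bytes".getBytes(StandardCharsets.US_ASCII);
        try (SpooledRandomAccess r = new SpooledRandomAccess(new ByteArrayInputStream(sample))) {
            r.seek(1);
            byte[] buf = new byte[3];
            int n = r.read(buf, 0, buf.length);
            System.out.println(new String(buf, 0, n, StandardCharsets.US_ASCII));
        }
    }
}
```

Closing the object deletes the temp file, so it fits a try-with-resources
block around whatever consumes the data.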

If your /tmp is mapped on main memory, then it doesn't make sense to use
a temp file at all, you're just wasting time.

Btw PDFBox 2 was also loading the whole PDF file into memory (or into a
scratch file) and had an even bigger footprint because it was also
parsing the complete PDF. So if your project was working with PDFBox 2
then it should work with PDFBox 3.

Tilman



On Wed, Jan 31, 2024 at 10:10 AM Tilman Hausherr <thaush...@t-online.de>
wrote:

On 31.01.2024 09:50, Lars Juel Jensen wrote:
In PDFBox2 I could do:

PDDocument.load(inputStream, MemoryUsageSetting.setupTempFileOnly())

But there is no equivalent to this in PDFBox 3. How do I read a PDF from
an InputStream?

Loader.loadPDF(new RandomAccessReadBuffer(inputStream),
        IOUtils.createTempFileOnlyStreamCache());

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org



