Re: Replace methods using an InputStream from Loader.loadPDF

2022-08-01 Thread Andreas Lehmkuehler

On 01.08.22 at 20:20, Tilman Hausherr wrote:

+1 but
- the explanation below (when to use which class) should be in the javadoc
- the removal should be in the migration guide

It is already on my TODO list

Andreas



Tilman

On 31.07.2022 at 15:18, Andreas Lehmkuehler wrote:

Hi fellow devs,


There was a discussion on JIRA [1] about the changed behaviour of the parser
due to the removal of the ScratchFileBuffer when reading a PDF.


Additionally, there was the post "High memory usage with pdfbox 3" on
users@pdfbox addressing the very same topic.


After explaining myself and my changes twice, I came to the conclusion that I
will have to do so again and again in the future unless we change the API of
Loader.loadPDF.


People simply notice that all methods for loading a PDF have moved from
PDDocument to Loader. They expect the very same behaviour from a similar API,
and that is understandable from a user's point of view.


We have to remove the loadPDF variants taking an InputStream and replace them
with variants taking a RandomAccessRead.


When it comes to InputStreams, users have to decide how to proceed:
* copy the InputStream to memory by using RandomAccessReadBuffer
* copy the InputStream to a file and use RandomAccessReadBufferedFile or
RandomAccessReadMemoryMappedFile
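
As a rough illustration of the two options above, here is a minimal JDK-only
sketch of the copying step itself. The class and method names
(InputStreamSources, copyToMemory, copyToTempFile) are hypothetical, the sample
bytes are a placeholder rather than a real PDF, and readAllBytes assumes Java 9
or later. The PDFBox calls appear only in comments and assume the
RandomAccessRead-based loadPDF variant proposed in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class InputStreamSources {

    /** Option 1: drain the whole stream into a byte array on the heap. */
    static byte[] copyToMemory(InputStream in) throws IOException {
        return in.readAllBytes(); // requires Java 9+
    }

    /** Option 2: spool the stream to a temporary file; the caller deletes it. */
    static Path copyToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("pdf-input-", ".pdf");
        // createTempFile already created the file, so it must be replaced.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "%PDF-1.7 placeholder".getBytes(StandardCharsets.US_ASCII);

        byte[] inMemory = copyToMemory(new ByteArrayInputStream(sample));
        // Assumed PDFBox 3 usage (not compiled here):
        //   Loader.loadPDF(new RandomAccessReadBuffer(inMemory))
        System.out.println("in memory: " + inMemory.length + " bytes");

        Path onDisk = copyToTempFile(new ByteArrayInputStream(sample));
        try {
            // Assumed PDFBox 3 usage (not compiled here):
            //   Loader.loadPDF(new RandomAccessReadBufferedFile(onDisk.toFile()))
            System.out.println("on disk: " + Files.size(onDisk) + " bytes");
        } finally {
            Files.delete(onDisk);
        }
    }
}
```

The first option keeps the whole document on the heap, while the second trades
heap usage for disk I/O and a temporary file that has to be cleaned up.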


This would make it more transparent what happens under the hood when using the 
different kinds of loadPDF methods:


* a byte array as source is already in memory, so the obvious choice is to use
RandomAccessReadBuffer as a wrapper
* a file as source targets a local file, and the most obvious choice is to use
RandomAccessReadBufferedFile as a wrapper. We should document that
RandomAccessReadMemoryMappedFile is offered as an alternative in this case
* RandomAccessRead as source is the most obvious one, and the user decides how
to create it. Additionally, it is possible to implement a custom loading
and/or caching mechanism
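
To make the difference between the two file-backed choices more concrete, here
is a minimal JDK-only sketch of the two general strategies their names suggest:
explicit seek-and-read versus a memory-mapped region. This is an analogy under
stated assumptions, not PDFBox's implementation; the class and method names
(RandomAccessStrategies, readWithSeek, readMapped) are hypothetical, and the
sample content is a placeholder rather than a real PDF.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class RandomAccessStrategies {

    /** Seek to the offset and read: only the requested bytes land on the heap. */
    static byte[] readWithSeek(Path file, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] out = new byte[length];
            raf.seek(offset);
            raf.readFully(out);
            return out;
        }
    }

    /** Map the requested region read-only: the OS pages the data in on demand. */
    static byte[] readMapped(Path file, long offset, int length) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer region =
                    channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] out = new byte[length];
            region.get(out);
            return out;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("random-access-", ".bin");
        // Deleting a mapped file can fail on some platforms, so defer cleanup.
        tmp.toFile().deleteOnExit();
        Files.write(tmp, "%PDF-1.7 placeholder %%EOF".getBytes(StandardCharsets.US_ASCII));

        // A PDF parser typically jumps to the end of the file first.
        long tailOffset = Files.size(tmp) - 5;
        String viaSeek = new String(readWithSeek(tmp, tailOffset, 5), StandardCharsets.US_ASCII);
        String viaMap = new String(readMapped(tmp, tailOffset, 5), StandardCharsets.US_ASCII);
        System.out.println("seek: " + viaSeek + ", mapped: " + viaMap);
    }
}
```

Both methods return the same bytes; they differ in where the cost lands (heap
buffers versus address space managed by the operating system), which is the
kind of trade-off the javadoc could spell out for the two file wrappers.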


I know this will lead to some changes in the codebases of our users, but they
have to adapt in any case because the method was moved, so why not change the
data type as well?



WDYT? Am I missing something?

Andreas

[1] https://issues.apache.org/jira/browse/PDFBOX-5462

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org







Re: Replace methods using an InputStream from Loader.loadPDF

2022-08-01 Thread Tilman Hausherr

+1 but
- the explanation below (when to use which class) should be in the javadoc
- the removal should be in the migration guide

Tilman




Re: Replace methods using an InputStream from Loader.loadPDF

2022-07-31 Thread sahy...@fileaffairs.de

Hi,

I'm very much in favour of simplifying as much as possible and not doing
too much magic under the hood that can be better handled individually
by a developer. This will also leave room for an individual to come up
with an optimized version for specific use cases.

+1 from my side.

BR
Maruan



-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de

Managing Director: Maruan Sahyoun
Commercial register: AG Düsseldorf, HRB 53837
VAT ID: DE248275827
