[jira] [Commented] (PDFBOX-5483) Replace methods using an InputStream from Loader.loadPDF

Jira Sun, 30 Oct 2022 05:18:06 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626221#comment-17626221
 ]


Andreas Lehmkühler commented on PDFBOX-5483:
--------------------------------------------

[~mkl] I totally understand your point of view, but sorry, I don't like it. Let 
me explain why.

The code of {{org.apache.pdfbox.io}} is still a work in progress. Most likely 
the following features will be added in the near future:
* setting the buffer size for {{org.apache.pdfbox.io.RandomAccessReadBuffer}}
* setting the buffer size for 
{{org.apache.pdfbox.io.RandomAccessReadBufferedFile}}
* paging support for memory-mapped files, so that we might want to set the 
buffer size for 
{{org.apache.pdfbox.io.RandomAccessReadMemoryMappedFile}} as well
* I'm thinking of a replacement for the current implementation and usage of 
{{org.apache.pdfbox.io.ScratchFile}}. Something that isn't buried somewhere in 
{{org.apache.pdfbox.cos}}

Maybe there will be some other implementations of 
{{org.apache.pdfbox.io.RandomAccessRead}} and I'm pretty sure there are other 
things I can't imagine now.

However, if the code is located somewhere in the parser and/or loader, all of 
those modifications require changes to the parser/loader code and, depending 
on the kind of change, different method signatures. IMHO that code should not 
be responsible for managing the source of the data. That stuff belongs in 
{{org.apache.pdfbox.io}}.

That said, if someone wants to provide some convenience code, it should be 
added somewhere within {{org.apache.pdfbox.io}}.
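
To make the kind of convenience code meant here a bit more concrete, below is a 
minimal sketch of the two InputStream options named in the issue description 
(copy to memory, or copy to a file). It uses only the JDK. The class 
{{InputStreamSources}} and its method names are hypothetical and do not exist in 
PDFBox. The assumption is that the resulting byte array or file would then be 
wrapped in one of the {{RandomAccessRead}} implementations mentioned in this 
issue.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration only; not part of PDFBox.
final class InputStreamSources {

    private InputStreamSources() {
    }

    // Option 1: copy the whole stream into memory.
    // The returned array is what an in-memory wrapper such as
    // RandomAccessReadBuffer would be built on (assumption).
    static byte[] copyToMemory(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    // Option 2: copy the stream to a temporary file.
    // The returned path is what a file-based wrapper such as
    // RandomAccessReadBufferedFile or RandomAccessReadMemoryMappedFile
    // would be opened on (assumption). The caller deletes the file.
    static Path copyToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("pdf-source-", ".tmp");
        // createTempFile already created the file, so overwrite it.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }
}
```

Either way the caller decides explicitly where the data ends up, and the 
parser/loader only ever sees the resulting source.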



> Replace methods using an InputStream from Loader.loadPDF
> --------------------------------------------------------
>
>                 Key: PDFBOX-5483
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5483
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Andreas Lehmkühler
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>             Fix For: 3.0.0 PDFBox
>
>
> As discussed on dev@pdfbox
> {quote}
> We have to remove the loadPDF variants using InputStream and replace them 
> with RandomAccessRead.
> When it comes to InputStreams, users have to decide how to proceed:
> * copy the InputStream to memory by using RandomAccessReadBuffer
> * copy the InputStream to a file and use RandomAccessReadBufferedFile or 
> RandomAccessReadMemoryMappedFile
> This would make it more transparent what happens under the hood when using 
> the different kinds of loadPDF methods:
> * a byte array as source is already in memory and the obvious choice is to 
> use RandomAccessReadBuffer as a wrapper
> * a file as source targets a local file and the most obvious choice is to use 
> RandomAccessReadBufferedFile as a wrapper. We should document that 
> RandomAccessReadMemoryMappedFile is offered as an alternative in this case
> * RandomAccessRead as source is the most obvious one and the user decides how 
> to create it. Additionally, it is possible to implement a custom caching 
> and/or loading mechanism
> {quote}
> see PDFBOX-5462 and [High memory usage with pdfbox 
> 3|https://lists.apache.org/thread/6mmgp23v8b2yztj4hghkgkd14s1gzs8g] as well



