[
https://issues.apache.org/jira/browse/PDFBOX-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626221#comment-17626221
]
Andreas Lehmkühler commented on PDFBOX-5483:
--------------------------------------------
[~mkl] I totally understand your point of view, but sorry, I don't like it. Let
me explain why.
The code of {{org.apache.pdfbox.io}} is still a work in progress. Most likely
the following features will be added in the near future:
* setting the buffer size for {{org.apache.pdfbox.io.RandomAccessReadBuffer}}
* setting the buffer size for
{{org.apache.pdfbox.io.RandomAccessReadBufferedFile}}
* paging support for memory mapped files, so that we might want to set the
buffer size for
{{org.apache.pdfbox.io.RandomAccessReadMemoryMappedFile}} as well
* I'm thinking of a replacement for the current implementation and usage of
{{org.apache.pdfbox.io.ScratchFile}}. Something that isn't buried somewhere in
{{org.apache.pdfbox.cos}}
Maybe there will be some other implementations of
{{org.apache.pdfbox.io.RandomAccessRead}} and I'm pretty sure there are other
things I can't imagine now.
However, if the code is located somewhere in the parser and/or loader, all of
those modifications require changes within the parser/loader code and,
depending on the kind of change, different method signatures. IMHO that code
should not be responsible for managing the source of the data. That stuff
belongs to {{org.apache.pdfbox.io}}.
That said, if someone wants to provide some convenience code, it should be
added somewhere within {{org.apache.pdfbox.io}}.
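To illustrate the separation I have in mind, here is a rough sketch in plain Java. It does not use the PDFBox classes at all: {{SeekableSource}}, {{MemorySource}}, {{TempFileSource}}, {{Sources}} and {{SketchLoader}} are hypothetical names made up for this example and only stand in for the idea of {{RandomAccessRead}}, its implementations, some convenience code in the io package and the parser/loader. It assumes Java 9+ for {{InputStream.readAllBytes()}}.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical stand-in for a random-access read abstraction; NOT the real PDFBox interface.
interface SeekableSource extends AutoCloseable {
    long length() throws IOException;
    int readAt(long position) throws IOException; // returns -1 outside the data
    @Override
    void close() throws IOException;
}

// Keeps everything in memory (the "copy the InputStream to memory" choice).
final class MemorySource implements SeekableSource {
    private final byte[] data;
    MemorySource(byte[] data) { this.data = data; }
    @Override public long length() { return data.length; }
    @Override public int readAt(long position) {
        return (position < 0 || position >= data.length) ? -1 : data[(int) position] & 0xFF;
    }
    @Override public void close() { }
}

// Reads from a temporary file (the "copy the InputStream to a file" choice).
final class TempFileSource implements SeekableSource {
    private final Path path;
    private final RandomAccessFile file;
    TempFileSource(Path path) throws IOException {
        this.path = path;
        this.file = new RandomAccessFile(path.toFile(), "r");
    }
    @Override public long length() throws IOException { return file.length(); }
    @Override public int readAt(long position) throws IOException {
        if (position < 0 || position >= file.length()) return -1;
        file.seek(position);
        return file.read();
    }
    @Override public void close() throws IOException {
        file.close();
        Files.deleteIfExists(path); // the temp file is owned by this source
    }
}

// Hypothetical convenience code: it lives next to the sources, not in the loader.
final class Sources {
    private Sources() { }
    static SeekableSource inMemory(InputStream in) throws IOException {
        return new MemorySource(in.readAllBytes()); // Java 9+
    }
    static SeekableSource viaTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("source-sketch", ".tmp");
        try {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return new TempFileSource(tmp);
        } catch (IOException e) {
            Files.deleteIfExists(tmp);
            throw e;
        }
    }
}

// Stand-in for the parser/loader: it only ever sees the interface.
final class SketchLoader {
    private SketchLoader() { }
    static String readHeader(SeekableSource source, int maxBytes) throws IOException {
        StringBuilder header = new StringBuilder();
        for (long pos = 0; pos < maxBytes; pos++) {
            int b = source.readAt(pos);
            if (b < 0) break;
            header.append((char) b);
        }
        return header.toString();
    }
}

final class SourceSketchDemo {
    public static void main(String[] args) throws IOException {
        byte[] bytes = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        // The caller decides how the stream is buffered; the loader call is identical.
        try (SeekableSource mem = Sources.inMemory(new ByteArrayInputStream(bytes));
             SeekableSource tmp = Sources.viaTempFile(new ByteArrayInputStream(bytes))) {
            System.out.println(SketchLoader.readHeader(mem, 8));
            System.out.println(SketchLoader.readHeader(tmp, 8));
        }
    }
}
```

In this sketch, adding a buffer size parameter or a new kind of source would only touch the source classes and {{Sources}}; the signature of {{SketchLoader.readHeader}} would stay as it is.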
> Replace methods using an InputStream from Loader.loadPDF
> --------------------------------------------------------
>
> Key: PDFBOX-5483
> URL: https://issues.apache.org/jira/browse/PDFBOX-5483
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing
> Affects Versions: 3.0.0 PDFBox
> Reporter: Andreas Lehmkühler
> Assignee: Andreas Lehmkühler
> Priority: Major
> Fix For: 3.0.0 PDFBox
>
>
> As discussed on dev@pdfbox
> {quote}
> We have to remove the loadPDF variants using InputStream and replace them
> with RandomAccessRead.
> When it comes to InputStreams, users have to decide how to proceed:
> * copy the InputStream to memory by using RandomAccessReadBuffer
> * copy the InputStream to a file and use RandomAccessReadBufferedFile or
> RandomAccessReadMemoryMappedFile
> This would make it more transparent what happens under the hood when using
> the different kinds of loadPDF methods:
> * a byte array as source is already in memory and the obvious choice is to
> use RandomAccessReadBuffer as a wrapper
> * a file as source targets a local file and the most obvious choice is to use
> RandomAccessReadBufferedFile as a wrapper. We should document that
> RandomAccessReadMemoryMappedFile is offered as an alternative in this case
> * RandomAccessRead as source is the most obvious one and the user decides how
> to create it. Additionally, it is possible to implement a custom loading
> and/or caching mechanism
> {quote}
> see PDFBOX-5462 and [High memory usage with pdfbox
> 3|https://lists.apache.org/thread/6mmgp23v8b2yztj4hghkgkd14s1gzs8g] as well