[
https://issues.apache.org/jira/browse/PDFBOX-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andreas Lehmkühler updated PDFBOX-4542:
---------------------------------------
Labels: memory performance (was: Memory memory performance)
> Suggestion: Don't load large streams completely into memory, reference them
> instead
> -----------------------------------------------------------------------------------
>
> Key: PDFBOX-4542
> URL: https://issues.apache.org/jira/browse/PDFBOX-4542
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing, PDModel
> Affects Versions: 2.0.14
> Reporter: Jonathan
> Priority: Minor
> Labels: memory, performance
>
> We process large PDF files, many of which contain large image streams, and
> we wanted to avoid loading entire streams into memory. Instead, we
> implemented a mechanism that merely references their location on disk.
> We did this by subclassing COSStream and then overriding
> COSParser.parseCOSStream(COSDictionary) to conditionally create our stream.
> Here is the code. It is still a work in progress; I have just refactored
> the entire mechanism.
> {code:java}
> public class ReferencedCOSStream
> extends COSStream
> {
> //~ Instance members ------------------------------------------------------
> boolean isReference = false;
> File reference = null;
> long offset = -1;
> long length = -1;
> //~ Constructors ----------------------------------------------------------
> private ReferencedCOSStream(final ScratchFile scratchFile)
> {
> super(scratchFile);
> }
> //~ Methods ---------------------------------------------------------------
> public static ReferencedCOSStream createFromCOSStream(final COSStream
> stream)
> {
> final ReferencedCOSStream out = new
> ReferencedCOSStream(stream.getScratchFile());
> for (final Map.Entry<COSName, COSBase> entry : stream.entrySet())
> {
> out.setItem(entry.getKey(), entry.getValue());
> }
> return out;
> }
> @Override
> public COSInputStream createInputStream(final DecodeOptions options)
> throws IOException
> {
> if (this.isReference)
> {
> final InputStream in = new SlicedFileInputStream(this.reference,
> this.offset, this.length);
> return COSInputStream.create(getFilterList(), this, in,
> this.getScratchFile(), options);
> }
> else
> {
> return super.createInputStream(options);
> }
> }
> @Override
> public InputStream createRawInputStream()
> throws IOException
> {
> if (this.isReference)
> {
> return new SlicedFileInputStream(this.reference, this.offset,
> this.length);
> }
> else
> {
> return super.createRawInputStream();
> }
> }
> @Override
> public OutputStream createOutputStream(final COSBase filters)
> throws IOException
> {
> this.isReference = false;
> return super.createOutputStream(filters);
> }
> @Override
> public OutputStream createRawOutputStream()
> throws IOException
> {
> this.isReference = false;
> return super.createRawOutputStream();
> }
> public void setReference(final File file,
> final long offset,
> final long length)
> {
> this.isReference = true;
> this.reference = file;
> this.offset = offset;
> this.length = length;
> this.setLong(COSName.LENGTH, length);
> }
> //~ Inner Classes ---------------------------------------------------------
> private class SlicedFileInputStream
> extends FileInputStream
> {
> //~ Instance members ------------------------------------------------------
> private long index;
> private final long length;
> //~ Constructors ----------------------------------------------------------
> public SlicedFileInputStream(final File file,
> final long offset,
> final long length)
> throws FileNotFoundException, IOException
> {
> super(file);
> this.length = length;
> this.skip(offset);
> this.index = 0;
> }
> //~ Methods ---------------------------------------------------------------
> @Override
> public int available()
> throws IOException
> {
> final long remaining = length - index;
> if (remaining < 0)
> {
> return 0;
> }
> // clamp instead of casting, so slices larger than 2 GiB do not overflow
> return (int) Math.min(remaining, Integer.MAX_VALUE);
> }
> @Override
> public int read()
> throws IOException
> {
> // without this override, single-byte reads would run past the slice
> if (this.available() <= 0)
> {
> return -1;
> }
> final int value = super.read();
> if (value >= 0)
> {
> index++;
> }
> return value;
> }
> @Override
> public int read(final byte[] b)
> throws IOException
> {
> return this.read(b, 0, b.length);
> }
> @Override
> public int read(final byte[] b,
> final int off,
> final int len)
> throws IOException
> {
> if (len == 0)
> {
> return 0;
> }
> final int toRead = Math.min(this.available(), len);
> if (toRead <= 0)
> {
> return -1;
> }
> // honour the caller's offset and count only the bytes actually read
> final int count = super.read(b, off, toRead);
> if (count > 0)
> {
> index += count;
> }
> return count;
> }
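> @Override
> public long skip(final long n)
> throws IOException
> {
> // skip() may skip fewer bytes than requested, so count the actual number
> final long skipped = super.skip(n);
> index += skipped;
> return skipped;
> }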
> @Override
> public FileChannel getChannel()
> {
> throw new UnsupportedOperationException("Obtaining a FileChannel is "
> + "not supported because a correct offset cannot be ensured.");
> }
> }
> }
> {code}
> {code:java}
> @Override
> protected COSStream parseCOSStream(final COSDictionary dic)
> throws IOException
> {
> /*
> * This needs to be dic.getItem because when we are parsing, the
> underlying object might still be null.
> */
> final COSNumber streamLengthObj =
> getLength(dic.getItem(COSName.LENGTH), dic.getCOSName(COSName.TYPE));
> COSStream stream = document.createCOSStream(dic);
> // read 'stream'; this was already tested in parseObjectsDynamically()
> readString();
> skipWhiteSpaces();
> if (streamLengthObj == null)
> {
> if (isLenient)
> {
> LOG.warn("The stream doesn't provide any stream length, using "
> + "fallback readUntilEnd, at offset " + source.getPosition());
> }
> else
> {
> throw new IOException("Missing length for stream.");
> }
> }
> if ((streamLengthObj != null) && (streamLengthObj.longValue() >= 1024))
> {
> final long streamBegPos = source.getPosition();
> final ReferencedCOSStream refStream =
> ReferencedCOSStream.createFromCOSStream(stream);
> try
> {
> readValidStream(null, streamLengthObj);
> }
> finally
> {
> stream.setItem(COSName.LENGTH, streamLengthObj);
> }
> refStream.setReference(new File(reference), streamBegPos,
> source.getPosition() - streamBegPos);
> stream = refStream;
> }
> else
> {
> try(final OutputStream out = stream.createRawOutputStream())
> {
> if ((streamLengthObj != null) &&
> validateStreamLength(streamLengthObj.longValue()))
> {
> readValidStream(out, streamLengthObj);
> }
> else
> {
> readUntilEndStream(new EndstreamOutputStream(out));
> }
> }
> finally
> {
> stream.setItem(COSName.LENGTH, streamLengthObj);
> }
> }
> final String endStream = readString();
> if (endStream.equals("endobj") && isLenient)
> {
> LOG.warn("stream ends with 'endobj' instead of 'endstream' at offset "
> + source.getPosition());
> // avoid follow-up warning about missing endobj
> source.rewind(ENDOBJ.length);
> }
> else if ((endStream.length() > 9) && isLenient &&
> endStream.substring(0, 9).equals(ENDSTREAM_STRING))
> {
> LOG.warn("stream ends with '" + endStream + "' instead of "
> + "'endstream' at offset " + source.getPosition());
> // unread the "extra" bytes
> source.rewind(endStream.substring(9).getBytes(ISO_8859_1).length);
> }
> else if (!endStream.equals(ENDSTREAM_STRING))
> {
> throw new IOException("Error reading stream, expected='endstream' "
> + "actual='" + endStream + "' at offset " + source.getPosition());
> }
> return stream;
> }
> {code}
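
The core idea of the parser override above, remembering where a large payload starts and moving past it instead of copying it, can be illustrated without PDFBox. The following is only a minimal sketch under stated assumptions: StreamExtent and ExtentRecorder are hypothetical names, a RandomAccessFile stands in for the parser's source, and the 1024-byte threshold is simply the value used in the code above.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical value type: where a stream's payload lives in the source file. */
final class StreamExtent
{
    final long offset;
    final long length;

    StreamExtent(final long offset, final long length)
    {
        this.offset = offset;
        this.length = length;
    }
}

/** Hypothetical helper, not part of PDFBox. */
final class ExtentRecorder
{
    /** Same threshold as in the parser override: smaller payloads stay in memory. */
    static final long REFERENCE_THRESHOLD = 1024;

    private ExtentRecorder()
    {
    }

    /**
     * Records the payload's extent and moves the file pointer past it without
     * reading the bytes. Returns null if the payload is below the threshold,
     * in which case the file pointer is left untouched.
     */
    static StreamExtent skipPayload(final RandomAccessFile file, final long declaredLength)
        throws IOException
    {
        if (declaredLength < REFERENCE_THRESHOLD)
        {
            return null;
        }
        final long start = file.getFilePointer();
        if (start + declaredLength > file.length())
        {
            throw new IOException("Declared length exceeds the file size");
        }
        file.seek(start + declaredLength);
        return new StreamExtent(start, declaredLength);
    }
}
```

A caller would fall back to the normal in-memory path whenever skipPayload returns null.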
> ReferencedCOSStream exposes the underlying data in exactly the same way as
> COSStream does, but instead of keeping the content in memory, it opens a
> FileInputStream whenever the content is requested. SlicedFileInputStream
> extends FileInputStream and imitates the behaviour of an InputStream that
> covers only this specific chunk of data.
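
The slicing behaviour described above can also be sketched independently of PDFBox. This is a minimal sketch and not the reporter's code: BoundedSliceInputStream and its open helper are hypothetical names, and it wraps an arbitrary InputStream through FilterInputStream rather than extending FileInputStream, which is an assumption made here so that every read path goes through the overridden methods.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: exposes bytes [offset, offset + length) of a stream. */
class BoundedSliceInputStream extends FilterInputStream
{
    private long remaining;

    BoundedSliceInputStream(final InputStream in, final long offset, final long length)
        throws IOException
    {
        super(in);
        // skip() may skip fewer bytes than requested, so loop until the offset is reached
        long toSkip = offset;
        while (toSkip > 0)
        {
            final long skipped = in.skip(toSkip);
            if (skipped > 0)
            {
                toSkip -= skipped;
            }
            else if (in.read() >= 0)
            {
                toSkip--; // no progress from skip(): consume a single byte instead
            }
            else
            {
                throw new IOException("Offset " + offset + " lies beyond the end of the data");
            }
        }
        this.remaining = length;
    }

    /** Opens a slice of a file and closes the file again if positioning fails. */
    static BoundedSliceInputStream open(final File file, final long offset, final long length)
        throws IOException
    {
        final FileInputStream fis = new FileInputStream(file);
        try
        {
            return new BoundedSliceInputStream(fis, offset, length);
        }
        catch (IOException e)
        {
            fis.close();
            throw e;
        }
    }

    @Override
    public int read() throws IOException
    {
        if (remaining <= 0)
        {
            return -1;
        }
        final int value = in.read();
        if (value >= 0)
        {
            remaining--;
        }
        return value;
    }

    // read(byte[]) needs no override: FilterInputStream routes it to this method
    @Override
    public int read(final byte[] b, final int off, final int len) throws IOException
    {
        if (len == 0)
        {
            return 0;
        }
        if (remaining <= 0)
        {
            return -1;
        }
        final int count = in.read(b, off, (int) Math.min(len, remaining));
        if (count > 0)
        {
            remaining -= count; // count only the bytes actually delivered
        }
        return count;
    }

    @Override
    public long skip(final long n) throws IOException
    {
        if (n <= 0 || remaining <= 0)
        {
            return 0;
        }
        final long skipped = in.skip(Math.min(n, remaining));
        remaining -= skipped;
        return skipped;
    }

    @Override
    public int available() throws IOException
    {
        return (int) Math.min(in.available(), remaining);
    }

    @Override
    public boolean markSupported()
    {
        return false; // mark/reset would desynchronise the remaining-byte counter
    }
}
```

Tracking the remaining byte count instead of an index keeps the end-of-slice check in one place, and decrementing it only by the number of bytes actually read keeps the count correct after short reads.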
> I needed to expose some APIs for these classes. The method
> ReferencedCOSStream.createFromCOSStream(COSStream) would be better placed in
> PDDocument, creating the stream directly; I just didn't want to modify
> PDDocument as well.
> Right now, encrypted streams are loaded into memory by the SecurityHandler
> directly after creation. If you want to accept this proposal, it might make
> sense to move the decryption handling into COSStream and
> ReferencedCOSStream as well and perform it on request.
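
Decrypting on request, as suggested above, could look roughly like the following. This is only a sketch under stated assumptions and not PDFBox's SecurityHandler: RawSource and LazilyDecryptedStream are hypothetical names, AES/CBC with an explicitly supplied key and IV stands in for whatever the document's security handler would actually use, and the raw source stands in for something like createRawInputStream().

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical: opens the raw, still encrypted bytes on every call. */
interface RawSource
{
    InputStream open() throws IOException;
}

/** Hypothetical: keeps only a reference to the encrypted bytes and decrypts while reading. */
class LazilyDecryptedStream
{
    private final RawSource source;
    private final byte[] key;
    private final byte[] iv;

    LazilyDecryptedStream(final RawSource source, final byte[] key, final byte[] iv)
    {
        this.source = source;
        this.key = key.clone();
        this.iv = iv.clone();
    }

    /**
     * Nothing is read or decrypted until the caller pulls bytes from the
     * returned stream; each call starts from a fresh cipher and raw stream.
     */
    InputStream createInputStream() throws IOException
    {
        try
        {
            // assumption: AES/CBC with PKCS5 padding; a real handler would pick the algorithm
            final Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return new CipherInputStream(source.open(), cipher);
        }
        catch (GeneralSecurityException e)
        {
            throw new IOException("Could not set up decryption", e);
        }
    }
}
```

With this shape, the encrypted payload never has to be held in memory in full; only the buffers used by the cipher stream are.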