[
https://issues.apache.org/jira/browse/PDFBOX-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120538#comment-14120538
]
John Hewson edited comment on PDFBOX-2310 at 9/3/14 9:53 PM:
-------------------------------------------------------------
{code}
I get 107 matches in 24 files.
{code}
Only class member variables and any code which loops over page resources are
relevant; I only looked for private fields.
However, as you've spotted, even short-term retention of the PDImageXObject
cache is a problem, and the file from PDFBOX-2101 is now having issues with
memory usage. This is due to a number of large images on a single page: because
PDResources retains the PDImageXObject instances during the loop over the
page's resources, we end up accumulating cached images.
However, something's not right here: PDFToImage can render the document without
any memory issues, yet it does not call PDImageXObject#clear() and it loops
over the PDResources in exactly the same manner. There's something specific
about ExtractImages which is causing it to use more memory.
As the author of the PDFormXObject#getImage() method, I'm beginning to wonder if
it should simply not cache images, as they're just so large. Downstream callers
such as PageDrawer could have their own, much smarter caching policies, such as
LRU or some scheme which takes memory pressure into account, such as a
SoftReference.
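To make the two caching policies mentioned above concrete, here is a minimal
sketch using only plain JDK classes. The names LruCache and SoftCache are
hypothetical and are not part of PDFBox; this only illustrates what a
downstream caller could do, not how PageDrawer actually behaves.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache: drops the least recently accessed entry once maxEntries is exceeded. */
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently accessed
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each insertion; returning true evicts the eldest entry
        return size() > maxEntries;
    }
}

/** Hypothetical soft cache: the GC may reclaim values when memory is tight. */
class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new HashMap<>();

    void put(K key, V value) {
        map.put(key, new SoftReference<>(value));
    }

    /** Returns the cached value, or null if it was never cached or has been collected. */
    V get(K key) {
        SoftReference<V> ref = map.get(key);
        if (ref == null) {
            return null;
        }
        V value = ref.get();
        if (value == null) {
            map.remove(key); // referent was collected; drop the stale entry
        }
        return value;
    }
}
```

With either policy the caller has to be prepared for a cache miss and decode
the image again. The LRU variant bounds the number of entries but not their
size in bytes, so with very large images the SoftReference variant (or an LRU
weighted by image size) may be the safer choice.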
> codeToGID NPE
> -------------
>
> Key: PDFBOX-2310
> URL: https://issues.apache.org/jira/browse/PDFBOX-2310
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel
> Affects Versions: 2.0.0
> Reporter: simon steiner
> Assignee: John Hewson
> Fix For: 2.0.0
>
> Attachments: expected.pdf
>
>
> java -jar ~/pdf-box-svn/app/target/pdfbox-app-2.0.0-SNAPSHOT.jar PDFToImage
> expected.pdf
> Exception in thread "main" java.lang.NullPointerException
> at
> org.apache.pdfbox.pdmodel.font.PDType0Font.codeToGID(PDType0Font.java:306)