[ https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Hewson updated PDFBOX-2370: -------------------------------- Fix Version/s: 2.0.0 > Move caching outside of PDResources > ----------------------------------- > > Key: PDFBOX-2370 > URL: https://issues.apache.org/jira/browse/PDFBOX-2370 > Project: PDFBox > Issue Type: Improvement > Components: PDModel > Affects Versions: 2.0.0 > Reporter: John Hewson > Fix For: 2.0.0 > > > *Note:* This issue is based on a discussion which occurred regarding > PDFBOX-2301 but is actually a separate issue. > Currently we cache the page resources in PDResources which belongs to a > specific PDPage. This causes two problems, 1) users who want to hold many > PDPage objects in memory will have high memory use (but this is often by > accident*). 2) By caching resources in PDPage we only get to keep that cache > for the lifetime of the page, which e.g. in PDFRenderer is a single page > only. That means that a font which appears on 40 pages has to be parsed 40 > times, which causes slow running times, but also memory thrashing as objects > are destroyed frequently only to be re-created. > What PDFRenderer really needs is not page-wide caching but document-wide > caching, so that it can cache fonts, cmaps, color profiles, etc. only once. > But that won't work for images, because they're too large. What we're > beginning to realise is that caching is use-case specific and probably > shouldn't be built-in to PDFBox's pdmodel. Instead we should removing > resource caching from PDPage/PDResources and implement custom caching in > PDFRenderer and other downstream classes such as PDFTextStripper. I'll > happily volunteer myself. The existing high-level PDFBox APIs will continue > to "just work" and power users will get a level of control that they > appreciate. > This strategy could be enhanced by removing memory-hungry methods on > PDResources such as getFonts() and getXObjects() which force all resources of > a particular type to be loaded, whether or not they are needed, or actually > used in the content stream. They would be replaced by methods to retrieve a > single resource, e.g. getFont(name). > --- > \* There probably isn't a legitimate use case for 1) any more, we've solved > the issues which we used to have with image caching (in fact, the > clearCache() method actually no longer needs to be called by PDFRenderer, > though it currently is). The real problem is that it's easy to accidentally > retain PDPage objects, the PDDocument#getDocumentCatalog().getAllPages() > method is dangerous as looping over it will cause pages to be retained during > processing, like so: > {code} > for (PDPage page : document.getDocumentCatalog().getAllPages()) // > java.util.List > { > // ... this is idiomatic in PDFBox 1.8 > } > // List returned by getAllPages() kept in scope until here (bad) > {code} > I added of couple of methods a while ago to avoid this by fetching each > PDPage one at a time, and this is now used internally in PDFBox to avoid the > memory problems we used to have: > {code} > for (int i = 0; i < document.getNumberOfPages(); i++) > { > PDPage page = document.getPage(i); > // ... this is the new 2.0 way > // current page falls out of scope here (good) > } > {code} > To solve this problem, we could change getAllPages() so that instead of > returning a List it returns an Iterator<PDPage>, which would provide a nicer > API than getPage(int) and most existing code will continue to work. This is > also an opportunity to also fix type safety issues due to PDPageNode and > incorrect handling of the page tree (this is similar to the issue we had > recently with the acroform field tree). -- This message was sent by Atlassian JIRA (v6.3.4#6332)