[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources

2015-09-17 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2370:

Attachment: PDFBOX-2370-002701.pdf

I changed the test file (by modifying part of the page tree) so that only the 
first 5 pages are shown. The file size is unchanged: loading and saving the 
file would make the effect go away, because PDFBox saves resources as direct 
objects.

> Move caching outside of PDResources
> ---
>
> Key: PDFBOX-2370
> URL: https://issues.apache.org/jira/browse/PDFBOX-2370
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 2.0.0
>Reporter: John Hewson
>Assignee: John Hewson
>Priority: Blocker
> Fix For: 2.0.0
>
> Attachments: PDFBOX-2370-002701.pdf, PDFBOX-2370-002701.pdf
>
>
> *Note:* This issue is based on a discussion which occurred regarding 
> PDFBOX-2301 but is actually a separate issue.
> Currently we cache the page resources in PDResources, which belongs to a 
> specific PDPage. This causes two problems: 1) Users who hold many PDPage 
> objects in memory will see high memory use (though this often happens by 
> accident*). 2) By caching resources in PDPage we only keep that cache for 
> the lifetime of the page, which in PDFRenderer, for example, is a single 
> page. That means that a font which appears on 40 pages has to be parsed 40 
> times, which causes slow running times as well as memory thrashing, because 
> objects are destroyed frequently only to be re-created.
> What PDFRenderer really needs is not page-wide caching but document-wide 
> caching, so that it can cache fonts, cmaps, color profiles, etc. only once. 
> But that won't work for images, because they're too large. What we're 
> beginning to realise is that caching is use-case specific and probably 
> shouldn't be built into PDFBox's pdmodel. Instead we should remove 
> resource caching from PDPage/PDResources and implement custom caching in 
> PDFRenderer and other downstream classes such as PDFTextStripper. I'll 
> happily volunteer myself. The existing high-level PDFBox APIs will continue 
> to "just work" and power users will get a level of control that they 
> appreciate.
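
A minimal sketch of what document-wide caching could look like, in plain Java 
with no PDFBox dependency. The class name DocumentResourceCache, its methods, 
and the string stand-ins for fonts are all hypothetical and are not part of the 
PDFBox API; the point is only that a cache owned by the renderer, rather than by 
a page, parses a shared font once however many pages use it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical cache owned by a renderer for a whole document, not by a page. */
final class DocumentResourceCache<K, V> {
    private final Map<K, V> parsed = new HashMap<>();
    private int parseCount = 0;

    /** Returns the cached value for key, invoking the parser only on first use. */
    V get(K key, Function<K, V> parser) {
        V value = parsed.get(key);
        if (value == null) {
            value = parser.apply(key);
            parsed.put(key, value);
            parseCount++;
        }
        return value;
    }

    /** Number of times a parser was actually invoked. */
    int parseCount() {
        return parseCount;
    }

    public static void main(String[] args) {
        DocumentResourceCache<String, String> fonts = new DocumentResourceCache<>();
        // The same font is requested once per page for 40 pages.
        for (int page = 0; page < 40; page++) {
            fonts.get("F1", name -> "parsed:" + name);
        }
        System.out.println(fonts.parseCount()); // prints 1
    }
}
```

Large objects such as images would simply not be routed through such a cache, 
which is the use-case-specific choice the description argues for.
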
> This strategy could be enhanced by removing memory-hungry methods on 
> PDResources such as getFonts() and getXObjects() which force all resources of 
> a particular type to be loaded, whether or not they are needed, or actually 
> used in the content stream. They would be replaced by methods to retrieve a 
> single resource, e.g. getFont(name).
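
To illustrate the difference between bulk getters and single-resource lookup, 
here is a small sketch in plain Java. LazyResources and its Supplier-based 
loaders are hypothetical stand-ins for a resource dictionary; only the 
getFont(name) idea comes from the description above, and the real signature may 
differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical resource holder that loads one named entry at a time. */
final class LazyResources {
    // Each loader stands in for the (expensive) parsing of one font.
    private final Map<String, Supplier<String>> entries = new LinkedHashMap<>();
    private int loads = 0;

    void put(String name, Supplier<String> loader) {
        entries.put(name, loader);
    }

    /** Loads only the named font; returns null if there is no such entry. */
    String getFont(String name) {
        Supplier<String> loader = entries.get(name);
        if (loader == null) {
            return null;
        }
        loads++;
        return loader.get();
    }

    /** Number of fonts that have actually been loaded. */
    int loads() {
        return loads;
    }

    public static void main(String[] args) {
        LazyResources resources = new LazyResources();
        resources.put("F1", () -> "font F1");
        resources.put("F2", () -> "font F2");
        resources.put("F3", () -> "font F3");
        // Only F2 is loaded; F1 and F3 are never touched.
        System.out.println(resources.getFont("F2") + ", loads=" + resources.loads());
        // prints "font F2, loads=1"
    }
}
```

The sketch deliberately does no caching of its own, in line with the proposal 
to leave caching to downstream classes.
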
> ---
> \* There probably isn't a legitimate use case for 1) any more: we've solved 
> the issues which we used to have with image caching (in fact, the 
> clearCache() method no longer needs to be called by PDFRenderer, though it 
> currently is). The real problem is that it's easy to accidentally retain 
> PDPage objects. The PDDocument#getDocumentCatalog().getAllPages() method is 
> dangerous because looping over it causes pages to be retained during 
> processing, like so:
> {code}
> for (PDPage page : document.getDocumentCatalog().getAllPages()) // java.util.List
> {
>     // ... this is idiomatic in PDFBox 1.8
> }
> // List returned by getAllPages() kept in scope until here (bad)
> {code}
> I added a couple of methods a while ago to avoid this by fetching each 
> PDPage one at a time, and this is now used internally in PDFBox to avoid the 
> memory problems we used to have:
> {code}
> for (int i = 0; i < document.getNumberOfPages(); i++)
> {
>     PDPage page = document.getPage(i);
>     // ... this is the new 2.0 way
>     // current page falls out of scope here (good)
> }
> {code}
> To solve this problem, we could change getAllPages() so that instead of 
> returning a List it returns an Iterator<PDPage>, which would provide a nicer 
> API than getPage(int), and most existing code would continue to work. This is 
> also an opportunity to fix type safety issues due to PDPageNode and 
> incorrect handling of the page tree (this is similar to the issue we had 
> recently with the acroform field tree).
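
A rough sketch of the iterator idea, again in plain Java with hypothetical 
names (LazyPages, and an IntFunction standing in for getPage(int)). One 
assumption worth stating: for existing enhanced-for loops to keep compiling, 
the returned type would have to be an Iterable rather than a bare Iterator, so 
that is what the sketch implements.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

/** Hypothetical lazy view over pages: each page is fetched only when requested. */
final class LazyPages<P> implements Iterable<P> {
    private final int count;
    private final IntFunction<P> fetch; // stands in for document.getPage(i)

    LazyPages(int count, IntFunction<P> fetch) {
        this.count = count;
        this.fetch = fetch;
    }

    @Override
    public Iterator<P> iterator() {
        return new Iterator<P>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < count;
            }

            @Override
            public P next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                // No list of pages is built; the caller holds one page at a time.
                return fetch.apply(index++);
            }
        };
    }

    public static void main(String[] args) {
        LazyPages<String> pages = new LazyPages<>(3, i -> "page " + i);
        for (String page : pages) {
            System.out.println(page);
        }
    }
}
```

Because the iterator never materialises a list, a page becomes unreachable as 
soon as the loop body finishes with it, which matches the behaviour of the 
index-based loop shown above.
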



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources

2015-09-17 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2370:

Attachment: (was: PDFBOX-2370-002701.pdf)




[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources

2015-09-17 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2370:

Attachment: PDFBOX-2370-002701.pdf




[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources

2014-10-21 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2370:

Priority: Blocker  (was: Critical)



[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources

2014-10-10 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2370:

Component/s: PDModel



[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources

2014-10-10 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2370:

Affects Version/s: 2.0.0



[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources

2014-10-10 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2370:

Fix Version/s: 2.0.0
