[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources

2015-09-17 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2370:

Attachment: (was: PDFBOX-2370-002701.pdf)

> Move caching outside of PDResources
> ---
>
> Key: PDFBOX-2370
> URL: https://issues.apache.org/jira/browse/PDFBOX-2370
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 2.0.0
>Reporter: John Hewson
>Assignee: John Hewson
>Priority: Blocker
> Fix For: 2.0.0
>
> Attachments: PDFBOX-2370-002701.pdf
>
>
> *Note:* This issue is based on a discussion which occurred regarding 
> PDFBOX-2301 but is actually a separate issue.
> Currently we cache the page resources in PDResources which belongs to a 
> specific PDPage. This causes two problems, 1) users who want to hold many 
> PDPage objects in memory will have high memory use (but this is often by 
> accident*). 2) By caching resources in PDPage we only get to keep that cache 
> for the lifetime of the page, which e.g. in PDFRenderer is a single page 
> only. That means that a font which appears on 40 pages has to be parsed 40 
> times, which causes slow running times, but also memory thrashing as objects 
> are destroyed frequently only to be re-created.
> What PDFRenderer really needs is not page-wide caching but document-wide 
> caching, so that it can cache fonts, cmaps, color profiles, etc. only once. 
> But that won't work for images, because they're too large. What we're 
> beginning to realise is that caching is use-case specific and probably 
> shouldn't be built-in to PDFBox's pdmodel. Instead we should removing 
> resource caching from PDPage/PDResources and implement custom caching in 
> PDFRenderer and other downstream classes such as PDFTextStripper. I'll 
> happily volunteer myself. The existing high-level PDFBox APIs will continue 
> to "just work" and power users will get a level of control that they 
> appreciate.
> This strategy could be enhanced by removing memory-hungry methods on 
> PDResources such as getFonts() and getXObjects() which force all resources of 
> a particular type to be loaded, whether or not they are needed, or actually 
> used in the content stream. They would be replaced by methods to retrieve a 
> single resource, e.g. getFont(name).
> ---
> \* There probably isn't a legitimate use case for 1) any more, we've solved 
> the issues which we used to have with image caching (in fact, the 
> clearCache() method actually no longer needs to be called by PDFRenderer, 
> though it currently is). The real problem is that it's easy to accidentally 
> retain PDPage objects, the PDDocument#getDocumentCatalog().getAllPages() 
> method is dangerous as looping over it will cause pages to be retained during 
> processing, like so:
> {code}
> for (PDPage page : document.getDocumentCatalog().getAllPages()) // 
> java.util.List
> {
>  // ... this is idiomatic in PDFBox 1.8
> } 
> // List returned by getAllPages() kept in scope until here (bad)
> {code}
> I added of couple of methods a while ago to avoid this by fetching each 
> PDPage one at a time, and this is now used internally in PDFBox to avoid the 
> memory problems we used to have:
> {code}
> for (int i = 0; i < document.getNumberOfPages(); i++)
> {
> PDPage page = document.getPage(i);
> // ... this is the new 2.0 way
> // current page falls out of scope here (good)
> }
> {code}
> To solve this problem, we could change getAllPages() so that instead of 
> returning a List it returns an Iterator, which would provide a nicer 
> API than getPage(int) and most existing code will continue to work. This is 
> also an opportunity to also fix type safety issues due to PDPageNode and 
> incorrect handling of the page tree (this is similar to the issue we had 
> recently with the acroform field tree).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources

2015-09-17 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2370:

Attachment: PDFBOX-2370-002701.pdf

I changed the test file (modified part of the page tree) so that only the first 
5 pages are shown. Size is the same because the effect would go away if I'd 
load and save the file, because resources are saved as direct objects by PDFBox.

> Move caching outside of PDResources
> ---
>
> Key: PDFBOX-2370
> URL: https://issues.apache.org/jira/browse/PDFBOX-2370
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 2.0.0
>Reporter: John Hewson
>Assignee: John Hewson
>Priority: Blocker
> Fix For: 2.0.0
>
> Attachments: PDFBOX-2370-002701.pdf, PDFBOX-2370-002701.pdf
>
>
> *Note:* This issue is based on a discussion which occurred regarding 
> PDFBOX-2301 but is actually a separate issue.
> Currently we cache the page resources in PDResources which belongs to a 
> specific PDPage. This causes two problems, 1) users who want to hold many 
> PDPage objects in memory will have high memory use (but this is often by 
> accident*). 2) By caching resources in PDPage we only get to keep that cache 
> for the lifetime of the page, which e.g. in PDFRenderer is a single page 
> only. That means that a font which appears on 40 pages has to be parsed 40 
> times, which causes slow running times, but also memory thrashing as objects 
> are destroyed frequently only to be re-created.
> What PDFRenderer really needs is not page-wide caching but document-wide 
> caching, so that it can cache fonts, cmaps, color profiles, etc. only once. 
> But that won't work for images, because they're too large. What we're 
> beginning to realise is that caching is use-case specific and probably 
> shouldn't be built-in to PDFBox's pdmodel. Instead we should removing 
> resource caching from PDPage/PDResources and implement custom caching in 
> PDFRenderer and other downstream classes such as PDFTextStripper. I'll 
> happily volunteer myself. The existing high-level PDFBox APIs will continue 
> to "just work" and power users will get a level of control that they 
> appreciate.
> This strategy could be enhanced by removing memory-hungry methods on 
> PDResources such as getFonts() and getXObjects() which force all resources of 
> a particular type to be loaded, whether or not they are needed, or actually 
> used in the content stream. They would be replaced by methods to retrieve a 
> single resource, e.g. getFont(name).
> ---
> \* There probably isn't a legitimate use case for 1) any more, we've solved 
> the issues which we used to have with image caching (in fact, the 
> clearCache() method actually no longer needs to be called by PDFRenderer, 
> though it currently is). The real problem is that it's easy to accidentally 
> retain PDPage objects, the PDDocument#getDocumentCatalog().getAllPages() 
> method is dangerous as looping over it will cause pages to be retained during 
> processing, like so:
> {code}
> for (PDPage page : document.getDocumentCatalog().getAllPages()) // 
> java.util.List
> {
>  // ... this is idiomatic in PDFBox 1.8
> } 
> // List returned by getAllPages() kept in scope until here (bad)
> {code}
> I added of couple of methods a while ago to avoid this by fetching each 
> PDPage one at a time, and this is now used internally in PDFBox to avoid the 
> memory problems we used to have:
> {code}
> for (int i = 0; i < document.getNumberOfPages(); i++)
> {
> PDPage page = document.getPage(i);
> // ... this is the new 2.0 way
> // current page falls out of scope here (good)
> }
> {code}
> To solve this problem, we could change getAllPages() so that instead of 
> returning a List it returns an Iterator, which would provide a nicer 
> API than getPage(int) and most existing code will continue to work. This is 
> also an opportunity to also fix type safety issues due to PDPageNode and 
> incorrect handling of the page tree (this is similar to the issue we had 
> recently with the acroform field tree).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources

2015-09-17 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2370:

Attachment: PDFBOX-2370-002701.pdf

> Move caching outside of PDResources
> ---
>
> Key: PDFBOX-2370
> URL: https://issues.apache.org/jira/browse/PDFBOX-2370
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 2.0.0
>Reporter: John Hewson
>Assignee: John Hewson
>Priority: Blocker
> Fix For: 2.0.0
>
> Attachments: PDFBOX-2370-002701.pdf
>
>
> *Note:* This issue is based on a discussion which occurred regarding 
> PDFBOX-2301 but is actually a separate issue.
> Currently we cache the page resources in PDResources which belongs to a 
> specific PDPage. This causes two problems, 1) users who want to hold many 
> PDPage objects in memory will have high memory use (but this is often by 
> accident*). 2) By caching resources in PDPage we only get to keep that cache 
> for the lifetime of the page, which e.g. in PDFRenderer is a single page 
> only. That means that a font which appears on 40 pages has to be parsed 40 
> times, which causes slow running times, but also memory thrashing as objects 
> are destroyed frequently only to be re-created.
> What PDFRenderer really needs is not page-wide caching but document-wide 
> caching, so that it can cache fonts, cmaps, color profiles, etc. only once. 
> But that won't work for images, because they're too large. What we're 
> beginning to realise is that caching is use-case specific and probably 
> shouldn't be built-in to PDFBox's pdmodel. Instead we should removing 
> resource caching from PDPage/PDResources and implement custom caching in 
> PDFRenderer and other downstream classes such as PDFTextStripper. I'll 
> happily volunteer myself. The existing high-level PDFBox APIs will continue 
> to "just work" and power users will get a level of control that they 
> appreciate.
> This strategy could be enhanced by removing memory-hungry methods on 
> PDResources such as getFonts() and getXObjects() which force all resources of 
> a particular type to be loaded, whether or not they are needed, or actually 
> used in the content stream. They would be replaced by methods to retrieve a 
> single resource, e.g. getFont(name).
> ---
> \* There probably isn't a legitimate use case for 1) any more, we've solved 
> the issues which we used to have with image caching (in fact, the 
> clearCache() method actually no longer needs to be called by PDFRenderer, 
> though it currently is). The real problem is that it's easy to accidentally 
> retain PDPage objects, the PDDocument#getDocumentCatalog().getAllPages() 
> method is dangerous as looping over it will cause pages to be retained during 
> processing, like so:
> {code}
> for (PDPage page : document.getDocumentCatalog().getAllPages()) // 
> java.util.List
> {
>  // ... this is idiomatic in PDFBox 1.8
> } 
> // List returned by getAllPages() kept in scope until here (bad)
> {code}
> I added of couple of methods a while ago to avoid this by fetching each 
> PDPage one at a time, and this is now used internally in PDFBox to avoid the 
> memory problems we used to have:
> {code}
> for (int i = 0; i < document.getNumberOfPages(); i++)
> {
> PDPage page = document.getPage(i);
> // ... this is the new 2.0 way
> // current page falls out of scope here (good)
> }
> {code}
> To solve this problem, we could change getAllPages() so that instead of 
> returning a List it returns an Iterator, which would provide a nicer 
> API than getPage(int) and most existing code will continue to work. This is 
> also an opportunity to also fix type safety issues due to PDPageNode and 
> incorrect handling of the page tree (this is similar to the issue we had 
> recently with the acroform field tree).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources

2014-10-21 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2370:

Priority: Blocker  (was: Critical)

> Move caching outside of PDResources
> ---
>
> Key: PDFBOX-2370
> URL: https://issues.apache.org/jira/browse/PDFBOX-2370
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 2.0.0
>Reporter: John Hewson
>Priority: Blocker
> Fix For: 2.0.0
>
>
> *Note:* This issue is based on a discussion which occurred regarding 
> PDFBOX-2301 but is actually a separate issue.
> Currently we cache the page resources in PDResources which belongs to a 
> specific PDPage. This causes two problems, 1) users who want to hold many 
> PDPage objects in memory will have high memory use (but this is often by 
> accident*). 2) By caching resources in PDPage we only get to keep that cache 
> for the lifetime of the page, which e.g. in PDFRenderer is a single page 
> only. That means that a font which appears on 40 pages has to be parsed 40 
> times, which causes slow running times, but also memory thrashing as objects 
> are destroyed frequently only to be re-created.
> What PDFRenderer really needs is not page-wide caching but document-wide 
> caching, so that it can cache fonts, cmaps, color profiles, etc. only once. 
> But that won't work for images, because they're too large. What we're 
> beginning to realise is that caching is use-case specific and probably 
> shouldn't be built-in to PDFBox's pdmodel. Instead we should removing 
> resource caching from PDPage/PDResources and implement custom caching in 
> PDFRenderer and other downstream classes such as PDFTextStripper. I'll 
> happily volunteer myself. The existing high-level PDFBox APIs will continue 
> to "just work" and power users will get a level of control that they 
> appreciate.
> This strategy could be enhanced by removing memory-hungry methods on 
> PDResources such as getFonts() and getXObjects() which force all resources of 
> a particular type to be loaded, whether or not they are needed, or actually 
> used in the content stream. They would be replaced by methods to retrieve a 
> single resource, e.g. getFont(name).
> ---
> \* There probably isn't a legitimate use case for 1) any more, we've solved 
> the issues which we used to have with image caching (in fact, the 
> clearCache() method actually no longer needs to be called by PDFRenderer, 
> though it currently is). The real problem is that it's easy to accidentally 
> retain PDPage objects, the PDDocument#getDocumentCatalog().getAllPages() 
> method is dangerous as looping over it will cause pages to be retained during 
> processing, like so:
> {code}
> for (PDPage page : document.getDocumentCatalog().getAllPages()) // 
> java.util.List
> {
>  // ... this is idiomatic in PDFBox 1.8
> } 
> // List returned by getAllPages() kept in scope until here (bad)
> {code}
> I added of couple of methods a while ago to avoid this by fetching each 
> PDPage one at a time, and this is now used internally in PDFBox to avoid the 
> memory problems we used to have:
> {code}
> for (int i = 0; i < document.getNumberOfPages(); i++)
> {
> PDPage page = document.getPage(i);
> // ... this is the new 2.0 way
> // current page falls out of scope here (good)
> }
> {code}
> To solve this problem, we could change getAllPages() so that instead of 
> returning a List it returns an Iterator, which would provide a nicer 
> API than getPage(int) and most existing code will continue to work. This is 
> also an opportunity to also fix type safety issues due to PDPageNode and 
> incorrect handling of the page tree (this is similar to the issue we had 
> recently with the acroform field tree).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources

2014-10-13 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2370:

Priority: Critical  (was: Major)

> Move caching outside of PDResources
> ---
>
> Key: PDFBOX-2370
> URL: https://issues.apache.org/jira/browse/PDFBOX-2370
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 2.0.0
>Reporter: John Hewson
>Priority: Critical
> Fix For: 2.0.0
>
>
> *Note:* This issue is based on a discussion which occurred regarding 
> PDFBOX-2301 but is actually a separate issue.
> Currently we cache the page resources in PDResources which belongs to a 
> specific PDPage. This causes two problems, 1) users who want to hold many 
> PDPage objects in memory will have high memory use (but this is often by 
> accident*). 2) By caching resources in PDPage we only get to keep that cache 
> for the lifetime of the page, which e.g. in PDFRenderer is a single page 
> only. That means that a font which appears on 40 pages has to be parsed 40 
> times, which causes slow running times, but also memory thrashing as objects 
> are destroyed frequently only to be re-created.
> What PDFRenderer really needs is not page-wide caching but document-wide 
> caching, so that it can cache fonts, cmaps, color profiles, etc. only once. 
> But that won't work for images, because they're too large. What we're 
> beginning to realise is that caching is use-case specific and probably 
> shouldn't be built-in to PDFBox's pdmodel. Instead we should removing 
> resource caching from PDPage/PDResources and implement custom caching in 
> PDFRenderer and other downstream classes such as PDFTextStripper. I'll 
> happily volunteer myself. The existing high-level PDFBox APIs will continue 
> to "just work" and power users will get a level of control that they 
> appreciate.
> This strategy could be enhanced by removing memory-hungry methods on 
> PDResources such as getFonts() and getXObjects() which force all resources of 
> a particular type to be loaded, whether or not they are needed, or actually 
> used in the content stream. They would be replaced by methods to retrieve a 
> single resource, e.g. getFont(name).
> ---
> \* There probably isn't a legitimate use case for 1) any more, we've solved 
> the issues which we used to have with image caching (in fact, the 
> clearCache() method actually no longer needs to be called by PDFRenderer, 
> though it currently is). The real problem is that it's easy to accidentally 
> retain PDPage objects, the PDDocument#getDocumentCatalog().getAllPages() 
> method is dangerous as looping over it will cause pages to be retained during 
> processing, like so:
> {code}
> for (PDPage page : document.getDocumentCatalog().getAllPages()) // 
> java.util.List
> {
>  // ... this is idiomatic in PDFBox 1.8
> } 
> // List returned by getAllPages() kept in scope until here (bad)
> {code}
> I added of couple of methods a while ago to avoid this by fetching each 
> PDPage one at a time, and this is now used internally in PDFBox to avoid the 
> memory problems we used to have:
> {code}
> for (int i = 0; i < document.getNumberOfPages(); i++)
> {
> PDPage page = document.getPage(i);
> // ... this is the new 2.0 way
> // current page falls out of scope here (good)
> }
> {code}
> To solve this problem, we could change getAllPages() so that instead of 
> returning a List it returns an Iterator, which would provide a nicer 
> API than getPage(int) and most existing code will continue to work. This is 
> also an opportunity to also fix type safety issues due to PDPageNode and 
> incorrect handling of the page tree (this is similar to the issue we had 
> recently with the acroform field tree).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources

2014-10-10 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2370:

Fix Version/s: 2.0.0

> Move caching outside of PDResources
> ---
>
> Key: PDFBOX-2370
> URL: https://issues.apache.org/jira/browse/PDFBOX-2370
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 2.0.0
>Reporter: John Hewson
> Fix For: 2.0.0
>
>
> *Note:* This issue is based on a discussion which occurred regarding 
> PDFBOX-2301 but is actually a separate issue.
> Currently we cache the page resources in PDResources which belongs to a 
> specific PDPage. This causes two problems, 1) users who want to hold many 
> PDPage objects in memory will have high memory use (but this is often by 
> accident*). 2) By caching resources in PDPage we only get to keep that cache 
> for the lifetime of the page, which e.g. in PDFRenderer is a single page 
> only. That means that a font which appears on 40 pages has to be parsed 40 
> times, which causes slow running times, but also memory thrashing as objects 
> are destroyed frequently only to be re-created.
> What PDFRenderer really needs is not page-wide caching but document-wide 
> caching, so that it can cache fonts, cmaps, color profiles, etc. only once. 
> But that won't work for images, because they're too large. What we're 
> beginning to realise is that caching is use-case specific and probably 
> shouldn't be built-in to PDFBox's pdmodel. Instead we should removing 
> resource caching from PDPage/PDResources and implement custom caching in 
> PDFRenderer and other downstream classes such as PDFTextStripper. I'll 
> happily volunteer myself. The existing high-level PDFBox APIs will continue 
> to "just work" and power users will get a level of control that they 
> appreciate.
> This strategy could be enhanced by removing memory-hungry methods on 
> PDResources such as getFonts() and getXObjects() which force all resources of 
> a particular type to be loaded, whether or not they are needed, or actually 
> used in the content stream. They would be replaced by methods to retrieve a 
> single resource, e.g. getFont(name).
> ---
> \* There probably isn't a legitimate use case for 1) any more, we've solved 
> the issues which we used to have with image caching (in fact, the 
> clearCache() method actually no longer needs to be called by PDFRenderer, 
> though it currently is). The real problem is that it's easy to accidentally 
> retain PDPage objects, the PDDocument#getDocumentCatalog().getAllPages() 
> method is dangerous as looping over it will cause pages to be retained during 
> processing, like so:
> {code}
> for (PDPage page : document.getDocumentCatalog().getAllPages()) // 
> java.util.List
> {
>  // ... this is idiomatic in PDFBox 1.8
> } 
> // List returned by getAllPages() kept in scope until here (bad)
> {code}
> I added of couple of methods a while ago to avoid this by fetching each 
> PDPage one at a time, and this is now used internally in PDFBox to avoid the 
> memory problems we used to have:
> {code}
> for (int i = 0; i < document.getNumberOfPages(); i++)
> {
> PDPage page = document.getPage(i);
> // ... this is the new 2.0 way
> // current page falls out of scope here (good)
> }
> {code}
> To solve this problem, we could change getAllPages() so that instead of 
> returning a List it returns an Iterator, which would provide a nicer 
> API than getPage(int) and most existing code will continue to work. This is 
> also an opportunity to also fix type safety issues due to PDPageNode and 
> incorrect handling of the page tree (this is similar to the issue we had 
> recently with the acroform field tree).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources

2014-10-10 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2370:

Affects Version/s: 2.0.0

> Move caching outside of PDResources
> ---
>
> Key: PDFBOX-2370
> URL: https://issues.apache.org/jira/browse/PDFBOX-2370
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 2.0.0
>Reporter: John Hewson
> Fix For: 2.0.0
>
>
> *Note:* This issue is based on a discussion which occurred regarding 
> PDFBOX-2301 but is actually a separate issue.
> Currently we cache the page resources in PDResources which belongs to a 
> specific PDPage. This causes two problems, 1) users who want to hold many 
> PDPage objects in memory will have high memory use (but this is often by 
> accident*). 2) By caching resources in PDPage we only get to keep that cache 
> for the lifetime of the page, which e.g. in PDFRenderer is a single page 
> only. That means that a font which appears on 40 pages has to be parsed 40 
> times, which causes slow running times, but also memory thrashing as objects 
> are destroyed frequently only to be re-created.
> What PDFRenderer really needs is not page-wide caching but document-wide 
> caching, so that it can cache fonts, cmaps, color profiles, etc. only once. 
> But that won't work for images, because they're too large. What we're 
> beginning to realise is that caching is use-case specific and probably 
> shouldn't be built-in to PDFBox's pdmodel. Instead we should removing 
> resource caching from PDPage/PDResources and implement custom caching in 
> PDFRenderer and other downstream classes such as PDFTextStripper. I'll 
> happily volunteer myself. The existing high-level PDFBox APIs will continue 
> to "just work" and power users will get a level of control that they 
> appreciate.
> This strategy could be enhanced by removing memory-hungry methods on 
> PDResources such as getFonts() and getXObjects() which force all resources of 
> a particular type to be loaded, whether or not they are needed, or actually 
> used in the content stream. They would be replaced by methods to retrieve a 
> single resource, e.g. getFont(name).
> ---
> \* There probably isn't a legitimate use case for 1) any more, we've solved 
> the issues which we used to have with image caching (in fact, the 
> clearCache() method actually no longer needs to be called by PDFRenderer, 
> though it currently is). The real problem is that it's easy to accidentally 
> retain PDPage objects, the PDDocument#getDocumentCatalog().getAllPages() 
> method is dangerous as looping over it will cause pages to be retained during 
> processing, like so:
> {code}
> for (PDPage page : document.getDocumentCatalog().getAllPages()) // 
> java.util.List
> {
>  // ... this is idiomatic in PDFBox 1.8
> } 
> // List returned by getAllPages() kept in scope until here (bad)
> {code}
> I added of couple of methods a while ago to avoid this by fetching each 
> PDPage one at a time, and this is now used internally in PDFBox to avoid the 
> memory problems we used to have:
> {code}
> for (int i = 0; i < document.getNumberOfPages(); i++)
> {
> PDPage page = document.getPage(i);
> // ... this is the new 2.0 way
> // current page falls out of scope here (good)
> }
> {code}
> To solve this problem, we could change getAllPages() so that instead of 
> returning a List it returns an Iterator, which would provide a nicer 
> API than getPage(int) and most existing code will continue to work. This is 
> also an opportunity to also fix type safety issues due to PDPageNode and 
> incorrect handling of the page tree (this is similar to the issue we had 
> recently with the acroform field tree).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources

2014-10-10 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2370:

Component/s: PDModel

> Move caching outside of PDResources
> ---
>
> Key: PDFBOX-2370
> URL: https://issues.apache.org/jira/browse/PDFBOX-2370
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 2.0.0
>Reporter: John Hewson
> Fix For: 2.0.0
>
>
> *Note:* This issue is based on a discussion which occurred regarding 
> PDFBOX-2301 but is actually a separate issue.
> Currently we cache the page resources in PDResources which belongs to a 
> specific PDPage. This causes two problems, 1) users who want to hold many 
> PDPage objects in memory will have high memory use (but this is often by 
> accident*). 2) By caching resources in PDPage we only get to keep that cache 
> for the lifetime of the page, which e.g. in PDFRenderer is a single page 
> only. That means that a font which appears on 40 pages has to be parsed 40 
> times, which causes slow running times, but also memory thrashing as objects 
> are destroyed frequently only to be re-created.
> What PDFRenderer really needs is not page-wide caching but document-wide 
> caching, so that it can cache fonts, cmaps, color profiles, etc. only once. 
> But that won't work for images, because they're too large. What we're 
> beginning to realise is that caching is use-case specific and probably 
> shouldn't be built-in to PDFBox's pdmodel. Instead we should removing 
> resource caching from PDPage/PDResources and implement custom caching in 
> PDFRenderer and other downstream classes such as PDFTextStripper. I'll 
> happily volunteer myself. The existing high-level PDFBox APIs will continue 
> to "just work" and power users will get a level of control that they 
> appreciate.
> This strategy could be enhanced by removing memory-hungry methods on 
> PDResources such as getFonts() and getXObjects() which force all resources of 
> a particular type to be loaded, whether or not they are needed, or actually 
> used in the content stream. They would be replaced by methods to retrieve a 
> single resource, e.g. getFont(name).
> ---
> \* There probably isn't a legitimate use case for 1) any more, we've solved 
> the issues which we used to have with image caching (in fact, the 
> clearCache() method actually no longer needs to be called by PDFRenderer, 
> though it currently is). The real problem is that it's easy to accidentally 
> retain PDPage objects, the PDDocument#getDocumentCatalog().getAllPages() 
> method is dangerous as looping over it will cause pages to be retained during 
> processing, like so:
> {code}
> for (PDPage page : document.getDocumentCatalog().getAllPages()) // 
> java.util.List
> {
>  // ... this is idiomatic in PDFBox 1.8
> } 
> // List returned by getAllPages() kept in scope until here (bad)
> {code}
> I added of couple of methods a while ago to avoid this by fetching each 
> PDPage one at a time, and this is now used internally in PDFBox to avoid the 
> memory problems we used to have:
> {code}
> for (int i = 0; i < document.getNumberOfPages(); i++)
> {
> PDPage page = document.getPage(i);
> // ... this is the new 2.0 way
> // current page falls out of scope here (good)
> }
> {code}
> To solve this problem, we could change getAllPages() so that instead of 
> returning a List it returns an Iterator, which would provide a nicer 
> API than getPage(int) and most existing code will continue to work. This is 
> also an opportunity to also fix type safety issues due to PDPageNode and 
> incorrect handling of the page tree (this is similar to the issue we had 
> recently with the acroform field tree).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)