[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources
[ https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-2370:
    Attachment: PDFBOX-2370-002701.pdf

I changed the test file (modified part of the page tree) so that only the first 5 pages are shown. The file size is unchanged, because the effect would go away if I loaded and saved the file: PDFBox saves resources as direct objects.

> Move caching outside of PDResources
> ---
>
>                 Key: PDFBOX-2370
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2370
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: PDModel
>    Affects Versions: 2.0.0
>            Reporter: John Hewson
>            Assignee: John Hewson
>            Priority: Blocker
>             Fix For: 2.0.0
>
>         Attachments: PDFBOX-2370-002701.pdf, PDFBOX-2370-002701.pdf
>
>
> *Note:* This issue is based on a discussion which occurred regarding
> PDFBOX-2301 but is actually a separate issue.
>
> Currently we cache the page resources in PDResources, which belongs to a
> specific PDPage. This causes two problems: 1) users who want to hold many
> PDPage objects in memory will have high memory use (but this is often by
> accident*); 2) by caching resources in PDPage we only get to keep that cache
> for the lifetime of the page, which e.g. in PDFRenderer is a single page
> only. That means that a font which appears on 40 pages has to be parsed 40
> times, which causes slow running times and also memory thrashing, as objects
> are destroyed frequently only to be re-created.
>
> What PDFRenderer really needs is not page-wide caching but document-wide
> caching, so that it can cache fonts, cmaps, color profiles, etc. only once.
> But that won't work for images, because they're too large. What we're
> beginning to realise is that caching is use-case specific and probably
> shouldn't be built into PDFBox's pdmodel. Instead we should remove
> resource caching from PDPage/PDResources and implement custom caching in
> PDFRenderer and other downstream classes such as PDFTextStripper. I'll
> happily volunteer myself. The existing high-level PDFBox APIs will continue
> to "just work" and power users will get a level of control that they
> appreciate.
>
> This strategy could be enhanced by removing memory-hungry methods on
> PDResources such as getFonts() and getXObjects(), which force all resources of
> a particular type to be loaded, whether or not they are needed or actually
> used in the content stream. They would be replaced by methods to retrieve a
> single resource, e.g. getFont(name).
>
> ---
> \* There probably isn't a legitimate use case for 1) any more: we've solved
> the issues which we used to have with image caching (in fact, the
> clearCache() method no longer needs to be called by PDFRenderer,
> though it currently is). The real problem is that it's easy to accidentally
> retain PDPage objects. The PDDocument#getDocumentCatalog().getAllPages()
> method is dangerous, as looping over it will cause pages to be retained during
> processing, like so:
> {code}
> for (PDPage page : document.getDocumentCatalog().getAllPages()) // java.util.List
> {
>     // ... this is idiomatic in PDFBox 1.8
> }
> // List returned by getAllPages() kept in scope until here (bad)
> {code}
> I added a couple of methods a while ago to avoid this by fetching each
> PDPage one at a time, and this is now used internally in PDFBox to avoid the
> memory problems we used to have:
> {code}
> for (int i = 0; i < document.getNumberOfPages(); i++)
> {
>     PDPage page = document.getPage(i);
>     // ... this is the new 2.0 way
>     // current page falls out of scope here (good)
> }
> {code}
> To solve this problem, we could change getAllPages() so that instead of
> returning a List it returns an Iterator<PDPage>, which would provide a nicer
> API than getPage(int), and most existing code will continue to work. This is
> also an opportunity to fix type safety issues due to PDPageNode and
> incorrect handling of the page tree (this is similar to the issue we had
> recently with the acroform field tree).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
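As a rough illustration of the document-wide caching idea in the issue description above, here is a minimal sketch of a cache that lives outside the page objects and is keyed by resource name. It does not use any PDFBox class: `DocumentResourceCache`, its loader function, and `getLoadCount()` are hypothetical names invented for this example, and the "parsing" step is simulated with a string.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch, not PDFBox API: a cache owned by the caller
// (e.g. a renderer) that outlives any single page.
public class DocumentResourceCache<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final Function<K, V> loader; // stands in for an expensive parse
    private int loadCount = 0;

    public DocumentResourceCache(Function<K, V> loader) {
        this.loader = loader;
    }

    // Returns the cached value, invoking the loader only on the first request.
    // Assumes the loader never returns null.
    public V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = loader.apply(key);
            loadCount++;
            cache.put(key, value);
        }
        return value;
    }

    public int getLoadCount() {
        return loadCount;
    }

    public static void main(String[] args) {
        DocumentResourceCache<String, String> fonts =
                new DocumentResourceCache<>(name -> "parsed:" + name);
        // The same font is requested on 40 pages but loaded only once.
        for (int page = 0; page < 40; page++) {
            fonts.get("F1");
        }
        System.out.println(fonts.getLoadCount()); // prints 1
    }
}
```

Because the cache belongs to whoever creates it, a renderer could keep one instance for fonts and deliberately skip caching for large images, which is the use-case-specific control the description argues for.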
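The proposal above to hand out pages one at a time instead of as a List can be sketched as follows. This is only an assumption about how such an API might look: `LazyPages` and its fetcher function are hypothetical and unrelated to the real PDFBox classes, and the sketch exposes an `Iterable` rather than a bare `Iterator` so that an enhanced for loop still compiles.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

// Hypothetical sketch, not PDFBox API: fetches each page on demand,
// so no list of all pages is ever held in memory.
public class LazyPages<T> implements Iterable<T> {
    private final int count;
    private final IntFunction<T> fetcher; // stands in for a getPage(int) call

    public LazyPages(int count, IntFunction<T> fetcher) {
        this.count = count;
        this.fetcher = fetcher;
    }

    @Override
    public Iterator<T> iterator() {
        return new Iterator<T>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < count;
            }

            @Override
            public T next() {
                if (next >= count) {
                    throw new NoSuchElementException();
                }
                // Only the page being returned is created here.
                return fetcher.apply(next++);
            }
        };
    }

    public static void main(String[] args) {
        LazyPages<String> pages = new LazyPages<>(3, i -> "page-" + i);
        for (String page : pages) {
            System.out.println(page);
            // The previous page is unreachable once the loop advances.
        }
    }
}
```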
[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources
Tilman Hausherr updated PDFBOX-2370:
    Attachment: (was: PDFBOX-2370-002701.pdf)
[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources
Tilman Hausherr updated PDFBOX-2370:
    Attachment: PDFBOX-2370-002701.pdf
[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources
John Hewson updated PDFBOX-2370:
    Priority: Blocker (was: Critical)
[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources
John Hewson updated PDFBOX-2370:
    Component/s: PDModel
[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources
John Hewson updated PDFBOX-2370:
    Affects Version/s: 2.0.0
[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources
John Hewson updated PDFBOX-2370:
    Fix Version/s: 2.0.0