[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732099#comment-16732099 ] Gary Potagal commented on PDFBOX-4188: -- I saw some activity on this ticket, so I reviewed it and have a couple of questions: 1. Am I correct that, without changing code, PDFBOX_LEGACY_MODE is going to be used? 2. With the default PDFBOX_LEGACY_MODE, the updated memory management presented in this ticket would still be hugely beneficial when merging a large number of small files. Are there any plans to review it or change the default? 3. Does the "Structure Tree" limitation still exist? Thank you!
> "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
>
> Key: PDFBOX-4188
> URL: https://issues.apache.org/jira/browse/PDFBOX-4188
> Project: PDFBox
> Issue Type: Improvement
> Affects Versions: 2.0.9, 3.0.0 PDFBox
> Reporter: Gary Potagal
> Priority: Major
> Attachments: PDFBOX-4188-MemoryManagerPatch.zip, PDFBOX-4188-breakingTest.zip, PDFBOX-4188_memory_diagram.png, PDFMergerUtility.java-20180412.patch
>
> On 06.04.2018 at 23:10, Gary Potagal wrote:
>
> We wanted to address one more merge issue in org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting). We need to merge a large number of small files. We use mixed mode, memory and disk, for the cache. Initially, we would often get "Maximum allowed scratch file memory exceeded." unless we turned off the check by passing "-1" to org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting. I believe this is what the users who opened PDFBOX-3721 were running into.
> Our research indicates that the core issue with the memory model is that instead of sharing a single cache, it breaks the cache up into equal-sized, fixed partitions based on the number of input + output files being merged. This means that each partition must be big enough to hold the final output file.
> When 400 1-page files are merged, this creates 401 partitions, each of which needs to be big enough to hold the final 400 pages. Even worse, the merge algorithm needs to keep all files open until the end.
> Given this, near the end of the merge, we're actually caching 400 x 1-page input files and 1 x 400-page output file, or 801 pages. However, with the partitioned cache, we need to declare room for 401 x 400 pages, or 160,400 pages in total, when specifying "maxStorageBytes". This would be a very high number, usually in GBs.
>
> Given the current limitation that we need to keep all the input files open until the output file is written (HUGE), we came up with 2 options. (See PDFBOX-4182)
>
> 1. Good: Split the cache in half, give one half to the output file, and segment the other half across the input files (still keeping them open until the end).
> 2. Better: Dynamically allocate in 16-page (64K) chunks from memory or disk on demand, and release the cache as documents are closed after the merge. This is our current implementation until PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004 are addressed.
>
> We would like to submit our current implementation as a patch to 2.0.10 and 3.0.0, unless this is already addressed.
>
> Thank you
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
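The accounting described in the report can be checked with a short, self-contained sketch. This is a model of the bookkeeping only, not PDFBox code; the class and method names are illustrative:

```java
// Models the scratch-memory accounting described above: the partitioned
// cache must budget (inputs + 1) partitions, each sized to hold the final
// output, while far fewer pages are ever actually live near the end of
// the merge. Names are illustrative, not PDFBox API.
public class ScratchAccounting {
    static long partitionedBudgetPages(int inputFiles, int outputPages) {
        int partitions = inputFiles + 1;          // one per input, plus the output
        return (long) partitions * outputPages;   // each must hold the final file
    }

    static long livePages(int inputFiles, int pagesPerInput, int outputPages) {
        // all inputs stay open until the end, plus the finished output
        return (long) inputFiles * pagesPerInput + outputPages;
    }

    public static void main(String[] args) {
        // 400 one-page inputs merged into one 400-page output
        System.out.println(partitionedBudgetPages(400, 400)); // 160400
        System.out.println(livePages(400, 1, 400));           // 800 (the report counts 801)
    }
}
```

The ratio, 160,400 budgeted pages against roughly 800 live ones, is about 200x, which is why callers end up declaring multi-GB "maxStorageBytes" values for merges whose live working set is tiny.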
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440811#comment-16440811 ] Gary Potagal commented on PDFBOX-4188: -- Hello [~msahyoun] and [~tilman] - should we continue to work on this patch for 2.0.10, or do you want to come back to this for 3.0? Thank you
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439923#comment-16439923 ] Gary Potagal commented on PDFBOX-4188: -- [~msahyoun] - Sorry, I just reviewed the code more closely. What I'm seeing is:
- org.apache.pdfbox.io.MemoryUsageSetting#getPartitionedCopy was only used in the org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting) method.
- getPartitionedCopy creates a new instance of MemoryUsageSetting with limits determined by parallelUseCount. It is basically a copy constructor. As a utility method it will still function just as before.
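For readers following along, here is a minimal, self-contained model of the partitioned-copy behavior described in this comment: each of the parallelUseCount users gets an equal, fixed share of the configured limits. The real MemoryUsageSetting carries more state; the field names here are illustrative:

```java
// Simplified model of the getPartitionedCopy behavior discussed above:
// basically a copy constructor that divides the configured limits evenly
// among parallelUseCount users. Field names are illustrative.
public class MemSettingModel {
    final long maxMainMemoryBytes;
    final long maxStorageBytes;

    MemSettingModel(long maxMainMemoryBytes, long maxStorageBytes) {
        this.maxMainMemoryBytes = maxMainMemoryBytes;
        this.maxStorageBytes = maxStorageBytes;
    }

    // Each parallel user receives a fixed 1/parallelUseCount share.
    MemSettingModel getPartitionedCopy(int parallelUseCount) {
        return new MemSettingModel(maxMainMemoryBytes / parallelUseCount,
                                   maxStorageBytes / parallelUseCount);
    }
}
```

With 401 parallel users (400 inputs plus 1 output), a 1 GB scratch limit shrinks to roughly 2.6 MB per document, which is why the per-partition limit is hit so quickly in this scenario.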
[jira] [Commented] (PDFBOX-4190) Allow caller to control openAction on merged documents
[ https://issues.apache.org/jira/browse/PDFBOX-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439881#comment-16439881 ] Gary Potagal commented on PDFBOX-4190: -- Unfortunately, the order of the selected documents is controlled by the user. MergeOptions sounds like a great idea. Would this be for 3.0.0, or can it be added to 2.x?
> Allow caller to control openAction on merged documents
>
> Key: PDFBOX-4190
> URL: https://issues.apache.org/jira/browse/PDFBOX-4190
> Project: PDFBox
> Issue Type: Improvement
> Affects Versions: 2.0.9, 3.0.0 PDFBox
> Reporter: Gary Potagal
> Priority: Major
>
> It would be great if the openAction behavior were configurable. When documents are merged, we are required to open on the first page.
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439858#comment-16439858 ] Gary Potagal commented on PDFBOX-4188: -- [~msahyoun] - Probably nothing good. In our code, we took that method out.
[jira] [Commented] (PDFBOX-4190) Allow caller to control openAction on merged documents
[ https://issues.apache.org/jira/browse/PDFBOX-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439850#comment-16439850 ] Gary Potagal commented on PDFBOX-4190: -- If we're sending the results of the merge to the ServletOutputStream, is there a way to set it that I'm missing?
[jira] [Commented] (PDFBOX-4190) Allow caller to control openAction on merged documents
[ https://issues.apache.org/jira/browse/PDFBOX-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439745#comment-16439745 ] Gary Potagal commented on PDFBOX-4190: -- [~tilman] org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting) creates a new Destination inside of the method when you merge documents.
[jira] [Comment Edited] (PDFBOX-4190) Allow caller to control openAction on merged documents
[ https://issues.apache.org/jira/browse/PDFBOX-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439745#comment-16439745 ] Gary Potagal edited comment on PDFBOX-4190 at 4/16/18 5:21 PM: --- [~tilman] org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting) creates a new Destination inside of the method body when you merge documents.
[jira] [Updated] (PDFBOX-4190) Allow caller to control openAction on merged documents
[ https://issues.apache.org/jira/browse/PDFBOX-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Potagal updated PDFBOX-4190: - Description: It would be great if openAction behavior was configurable. When documents are merged, we are required to open on the first page.
[jira] [Updated] (PDFBOX-4190) Allow caller to control openAction on merged documents
[ https://issues.apache.org/jira/browse/PDFBOX-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Potagal updated PDFBOX-4190: - Description: (was: the original text duplicated from PDFBOX-4188)
[jira] [Created] (PDFBOX-4190) Allow caller to control openAction on merged documents
Gary Potagal created PDFBOX-4190: Summary: Allow caller to control openAction on merged documents Key: PDFBOX-4190 URL: https://issues.apache.org/jira/browse/PDFBOX-4190 Project: PDFBox Issue Type: Improvement Affects Versions: 2.0.9, 3.0.0 PDFBox Reporter: Gary Potagal
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439685#comment-16439685 ] Gary Potagal commented on PDFBOX-4188: -- [~msahyoun] - I've attached [^PDFBOX-4188_memory_diagram.png], which demonstrates the problem. It's harder to diagram, but the real scope of the problem becomes a lot worse the more files you add to the merge. We hope you can see the problem in the test that was submitted.
- The problem starts in PDFMergerUtility when memory is partitioned (Line 288). We're eliminating memory partitioning, so the patch can't be split into two parts.
- There's one very important point: MemoryUsageSetting is a *single* object that's shared between all ScratchFiles. All ScratchFiles must reserve pages with MemoryUsageSetting, thus -- pages (in main memory and on disk) are allocated only when they are needed -- total limits are tracked in a single place, so whatever settings are passed into the PDFMergerUtility will be the maximum memory limits used during the merge.
- I'll open another ticket for openAction.
- MappedByteBuffer is used when there is a need to read the content of a file multiple times. Is that done during the merge?
- If the patch is acceptable, we'll clean it up to meet coding conventions.
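A minimal, self-contained model of the shared-cache idea in this comment: one global page budget that every scratch file draws from on demand, in 16-page chunks, instead of fixed per-document partitions. The class and method names are hypothetical, not the patch's actual API:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative model of a single shared page budget: every scratch file
// reserves chunks from the same counter, so the configured limit is the
// true maximum used across the whole merge. Names are hypothetical.
class SharedPageBudget {
    static final int PAGE_SIZE = 4096;   // 4 KB pages
    static final int CHUNK_PAGES = 16;   // allocate in 16-page (64 KB) chunks
    private final long maxPages;
    private final AtomicLong usedPages = new AtomicLong();

    SharedPageBudget(long maxStorageBytes) {
        this.maxPages = maxStorageBytes / PAGE_SIZE;
    }

    // Reserve one chunk on demand; fail only when the *global* limit is
    // exceeded, never because a per-document partition ran out.
    boolean reserveChunk() {
        long next = usedPages.addAndGet(CHUNK_PAGES);
        if (next > maxPages) {
            usedPages.addAndGet(-CHUNK_PAGES); // roll back the reservation
            return false;
        }
        return true;
    }

    // Release a chunk as soon as its document is closed after the merge.
    void releaseChunk() {
        usedPages.addAndGet(-CHUNK_PAGES);
    }
}
```

Under this model, merging 400 one-page inputs into a 400-page output only needs a budget covering the roughly 800 pages that are actually live, rather than one sized for 401 partitions of 400 pages each.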
[jira] [Updated] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Potagal updated PDFBOX-4188: - Attachment: PDFBOX-4188_memory_diagram.png
[jira] [Comment Edited] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437703#comment-16437703 ] Gary Potagal edited comment on PDFBOX-4188 at 4/13/18 6:26 PM: --- Added [^PDFBOX-4188-MemoryManagerPatch.zip]. It assumes that [^PDFBOX-4188-breakingTest.zip] is already applied and the pdf used in the test exists. - This should optimize both modes, but especially the LEGACY mode. - Java doc explains what was changed (Hopefully) - Test are passing with long defaultMemory = 1 * MEG; runMergeTest("pdf_sample_1-100pages", defaultMemory, 10 * MEG); runMergeTest("pdf_sample_1-200pages", defaultMemory, 15 * MEG); runMergeTest("pdf_sample_1-300pages", defaultMemory, 25 * MEG); runMergeTest("pdf_sample_1-400pages", defaultMemory, 30 * MEG); - It would be great if openAction behavior was configurable. When documents are merged, we would like for them to open on the first page. Please let us know what you think and if you have any questions. Thank you. was (Author: gary.potagal): Added [^PDFBOX-4188-MemoryManagerPatch]. It assumes that [^PDFBOX-4188-breakingTest.zip] is already applied and the pdf used in the test exists. - This should optimize both modes, but especially the LEGACY mode. - Java doc explains what was changed (Hopefully) - Test are passing with long defaultMemory = 1 * MEG; runMergeTest("pdf_sample_1-100pages", defaultMemory, 10 * MEG); runMergeTest("pdf_sample_1-200pages", defaultMemory, 15 * MEG); runMergeTest("pdf_sample_1-300pages", defaultMemory, 25 * MEG); runMergeTest("pdf_sample_1-400pages", defaultMemory, 30 * MEG); - It would be great if openAction behavior was configurable. When documents are merged, we would like for them to open on the first page. Please let us know what you think and if you have any questions. Thank you. > "Maximum allowed scratch file memory exceeded." 
> Exception when merging large number of small PDFs
> --
>
> Key: PDFBOX-4188
> URL: https://issues.apache.org/jira/browse/PDFBOX-4188
> Project: PDFBox
> Issue Type: Improvement
> Affects Versions: 2.0.9, 3.0.0 PDFBox
> Reporter: Gary Potagal
> Priority: Major
> Attachments: PDFBOX-4188-MemoryManagerPatch.zip, PDFBOX-4188-breakingTest.zip, PDFMergerUtility.java-20180412.patch
>
> Am 06.04.2018 um 23:10 schrieb Gary Potagal:
>
> We wanted to address one more merge issue in org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).
> We need to merge a large number of small files. We use mixed mode, memory and disk, for the cache. Initially, we would often get "Maximum allowed scratch file memory exceeded." unless we turned off the check by passing "-1" to org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting. I believe this is what the users that opened PDFBOX-3721 were running into.
> Our research indicates that the core issue with the memory model is that instead of sharing a single cache, it breaks the cache up into equal-sized fixed partitions based on the number of input + output files being merged. This means that each partition must be big enough to hold the final output file.
> When 400 one-page files are merged, this creates 401 partitions, each of which needs to be big enough to hold the final 400 pages. Even worse, the merge algorithm needs to keep all files open until the end.
> Given this, near the end of the merge, we're actually caching 400 x 1-page input files and 1 x 400-page output file, or 801 pages.
> However, with the partitioned cache, we need to declare room for 401 x 400 pages, or 160,400 pages in total, when specifying "maxStorageBytes". This is a very high number, usually in the gigabytes.
>
> Given the current limitation that we need to keep all the input files open until the output file is written (HUGE), we came up with two options. (See PDFBOX-4182)
>
> 1. Good: Split the cache in ½, give ½ to the output file, and segment the other ½ across the input files (still keeping them open until the end).
> 2. Better: Dynamically allocate 16-page (64K) chunks from memory or disk on demand, and release the cache as documents are closed after the merge. This is our current implementation until PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004 are addressed.
>
> We would like to submit our current implementation as a patch to 2.0.10 and 3.0.0, unless this is already addressed.
>
> Thank you
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail:
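The sizing arithmetic in the quoted description can be sketched in plain Java. This is an illustrative calculation only, not PDFBox code; the 400-file scenario and the 401-partition figure come straight from the report above.

```java
// Under the partitioned model, the cache is split into (inputs + 1)
// equal partitions, and each partition must be able to hold the final
// output document. The declared budget therefore grows quadratically
// with the number of merged files, while the pages actually alive near
// the end of the merge (all inputs plus the output) grow only linearly.
public class ScratchSpaceMath {

    /** Pages of scratch space the partitioned cache must declare. */
    static long declaredPages(int inputFiles, int pagesPerInput) {
        long outputPages = (long) inputFiles * pagesPerInput;
        long partitions = inputFiles + 1L;   // one extra partition for the output
        return partitions * outputPages;
    }

    public static void main(String[] args) {
        // 400 one-page inputs -> 401 partitions x 400 pages = 160,400 pages,
        // versus roughly 800 pages actually in use near the end of the merge.
        System.out.println(declaredPages(400, 1)); // prints 160400
    }
}
```

At the roughly 64 KB per 16 pages mentioned later in the thread, a budget of 160,400 pages lands in the range the report describes, while the merged file itself is only about 2 MB.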
[jira] [Updated] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Potagal updated PDFBOX-4188: - Attachment: PDFBOX-4188-MemoryManagerPatch.zip
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434644#comment-16434644 ] Gary Potagal commented on PDFBOX-4188: -- I'm working on merging the patch that we did for 2.0.4 to the current trunk. I'll try to have it available shortly for your review.
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434416#comment-16434416 ] Gary Potagal commented on PDFBOX-4188: -- [~tilman] - we created a breaking test and it's attached: [^PDFBOX-4188-breakingTest.zip]. The patch is binary, so you need to apply it in the checked-out trunk directory using the command:
trunk> patch -p0 --binary -i PDFBOX-4188-breakingTest.diff
patching file pdfbox/src/test/java/org/apache/pdfbox/multipdf/PdfMergeUtilityPagesTest.java
patching file pdfbox/src/test/resources/input/merge/pages/pdf_sample_1.pdf
The test does the following:
# Creates four folders containing copies of the simple one-page pdf_sample_1.pdf file. Each folder contains an increasing number of copies: 100, 200, 300, 400. Each file is about 8K.
# Merges all files in each folder. The values of maxStorageBytes in the test are just enough to let the test pass. If you decrease them slightly, the exception will be thrown.
Output looks like this:
Apr 11, 2018 2:25:32 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest runMergeTest
INFO: Test Name: pdf_sample_1-100pages; Files: 100; Pages: 100; Time(s): 0.781; Pages/Second: 128.041; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 74; Total Sources Size(K): 775; Merged File Size(K): 522; Ratio MaxStorageBytes/Merged File Size: 145
Apr 11, 2018 2:25:34 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest runMergeTest
INFO: Test Name: pdf_sample_1-200pages; Files: 200; Pages: 200; Time(s): 1.486; Pages/Second: 134.590; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 315; Total Sources Size(K): 1,551; Merged File Size(K): 1,042; Ratio MaxStorageBytes/Merged File Size: 309
Apr 11, 2018 2:25:37 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest runMergeTest
INFO: Test Name: pdf_sample_1-300pages; Files: 300; Pages: 300; Time(s): 3.532; Pages/Second: 84.938; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 710; Total Sources Size(K): 2,327; Merged File Size(K): 1,562; Ratio MaxStorageBytes/Merged File Size: 465
Apr 11, 2018 2:25:42 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest runMergeTest
INFO: Test Name: pdf_sample_1-400pages; Files: 400; Pages: 400; Time(s): 4.677; Pages/Second: 85.525; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 1,240; Total Sources Size(K): 3,103; Merged File Size(K): 2,082; Ratio MaxStorageBytes/Merged File Size: 609
Apr 11, 2018 2:25:42 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest testPerformanceMerge
INFO: Summary: Pages: 1000, Time(s): 10.476, Pages/Second: 95.456
As you can see, to merge 400 one-page 8K files, we need to set maxStorageBytes to ~1.2 GB. The resulting file is ~2,000 K.
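The "better" option proposed in this thread (allocate fixed-size chunks on demand, release them when a document closes) can be sketched with a small shared pool. This is a hypothetical illustration, not the attached patch: the class `ChunkPool`, its methods, and the 4096-byte page size are assumptions for the sketch; only the 16-page/64K chunk size comes from the report.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical shared chunk pool: 64 KB chunks (16 x 4 KB pages) are handed
// out on demand and recycled when a document closes, so peak scratch usage
// tracks what is actually open instead of a worst-case per-file partition.
public class ChunkPool {
    static final int PAGES_PER_CHUNK = 16;            // 16 pages = 64 KB per chunk
    private final Deque<byte[]> free = new ArrayDeque<>();
    private int liveChunks = 0;
    private int peakChunks = 0;

    /** Allocate one chunk, reusing a released one when possible. */
    synchronized byte[] allocate() {
        byte[] chunk = free.isEmpty() ? new byte[PAGES_PER_CHUNK * 4096] : free.pop();
        liveChunks++;
        peakChunks = Math.max(peakChunks, liveChunks);
        return chunk;
    }

    /** Return a chunk to the pool, e.g. when its document is closed. */
    synchronized void release(byte[] chunk) {
        liveChunks--;
        free.push(chunk);
    }

    int peakChunks() { return peakChunks; }

    public static void main(String[] args) {
        ChunkPool pool = new ChunkPool();
        List<byte[]> open = new ArrayList<>();
        // Current limitation: all 400 one-page inputs stay open to the end...
        for (int i = 0; i < 400; i++) open.add(pool.allocate());
        // ...plus the growing 400-page output (400 / 16 = 25 chunks).
        for (int i = 0; i < 25; i++) open.add(pool.allocate());
        System.out.println(pool.peakChunks()); // prints 425 (~27 MB)
        for (byte[] chunk : open) pool.release(chunk);
    }
}
```

Even with every input held open until the end, on-demand allocation peaks at about 425 chunks (~27 MB) for the 400-file scenario, versus the gigabyte-range budget the partitioned cache demands.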
[jira] [Updated] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Potagal updated PDFBOX-4188: - Attachment: PDFBOX-4188-breakingTest.zip
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434191#comment-16434191 ] Gary Potagal commented on PDFBOX-4188: -- We don't know what PDFs we're going to get, so we are trying to make this generic.
[jira] [Updated] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Potagal updated PDFBOX-4188: - Description: (full text as quoted in the issue description above)
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434177#comment-16434177 ] Gary Potagal commented on PDFBOX-4188: --
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Saturday, April 07, 2018 1:48 AM
To: dev@pdfbox.apache.org
Subject: Re: "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
Hi,
Please also have a look at the comments in https://issues.apache.org/jira/browse/PDFBOX-4182 and submit your patch proposal there or in a new issue. It should be against the trunk. Note that this doesn't mean your patch will be accepted; it just means I'd like to see it, because I haven't fully understood your post and many attachment types don't get through here. A breaking test would be interesting: is it possible to use (or better, create) 400 identical small PDFs, merge them, and does it break?
Tilman
[jira] [Updated] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Potagal updated PDFBOX-4188: - Description:
Am 06.04.2018 um 23:10 schrieb Gary Potagal: We wanted to address one more merge issue in org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting). We need to merge a large number of small files. We use mixed mode, memory and disk for cache. Initially, we would often get "Maximum allowed scratch file memory exceeded." unless we turned off the check by passing "-1" to org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting. I believe this is what the users that opened https://issues.apache.org/jira/browse/PDFBOX-3721 were running into. Our research indicates that the core issue with the memory model is that instead of sharing a single cache, it breaks it up into equal-sized fixed partitions based on the number of input + output files being merged. This means that each partition must be big enough to hold the final output file. When 400 1-page files are merged, this creates 401 partitions, each of which needs to be big enough to hold the final 400 pages. Even worse, the merge algorithm needs to keep all files open until the end. Given this, near the end of the merge, we're actually caching 400 x 1-page input files, and 1 x 400-page output file, or 801 pages. However, with the partitioned cache, we need to declare room for 401 x 400 pages, or 160,400 pages in total when specifying "maxStorageBytes". This would be a very high number, usually in GIGs. Given the current limitation that we need to keep all the input files open until the output file is written (HUGE), we came up with 2 options. See (https://issues.apache.org/jira/browse/PDFBOX-4182) 1. Good: Split the cache in ½, give ½ to the output file, and segment the other ½ across the input files. (Still keeping them open until the end.) 2. Better: Dynamically allocate in 16-page (64K) chunks from memory or disk on demand, releasing cache as documents are closed after merge. This is our current implementation until PDFBOX-3999 is addressed. We would like to submit our current implementation as a patch to 2.0.10 and 3.0.0, unless this is already addressed. Thank you
was: I have been running some tests trying to merge a large number (2618) of small pdf documents, between 100kb and 130kb, into a single large pdf (288.433kb). Memory consumption seems to be the main limitation. ScratchFileBuffer seems to account for the majority of the memory usage. (see screenshot from MAT in attachment) (I would include the hprof in attachment so you can analyze it yourselves, but it's rather large) Note that it seems impossible to generate a large pdf using a small memory footprint. I personally thought that using MemorySettings with temporary file only would allow me to generate arbitrarily large pdf files, but it doesn't seem to help. I've run mergeDocuments with memory settings: * MemoryUsageSetting.setupMixed(1024L * 1024L, 1024L * 1024L * 1024L * 1024L * 1024L) * MemoryUsageSetting.setupTempFileOnly() Refactored version completes with *4GB* heap: with temp file only, completes 2618 documents in 1.760 min *VS* *8GB* heap: with temp file only, completes 2618 documents in 2.0 min Heaps of 6GB or less result in OOM. (Didn't try different sizes between 6GB and 8GB) It looks like the loop in mergeDocuments accumulates PDDocument objects in a list, which are closed after the merge is completed. Refactoring the code to close these as they are used, instead of accumulating them and closing all at the end, improves memory usage considerably (although it doesn't seem to be completely eliminated based on MAT analysis). Another change I've implemented is to only create the InputStream when the file needs to be read and to close it alongside the PDDocument. (Some InputStreams contain buffers, and depending on the size of the buffers and/or the stream type, accumulating all the streams is a potential memory hog.) These changes seem to have a beneficial effect in the sense that I can process the same number of pdfs with about half the memory. I'd appreciate it if you could roll these changes into the main codebase. (I've respected Java 6 compatibility.) I've included in attachment the java files of the new implementation: * Suppliers * Supplier * PDFMergerUtilityUsingSupplier PDFMergerUtilityUsingSupplier can replace the previous version. No signature changes, only internal code changes. (Just rename the class to PDFMergerUtility if you decide to implement the changes.) In attachment you can also find some screenshots from VisualVM showing the memory usage of the original version and the refactored version, as well as some info produced by MAT after analysing the heap. If you know of any other means, without running into memory issues, to merge large sets of pdf files into a large single pdf, I'd love to hear about it! I'd also suggest that there should be further improvements made in memory usage in general, as pdfbox seems to consume a lot of memory in general.
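The Supplier-based change described above (the attached Suppliers/Supplier/PDFMergerUtilityUsingSupplier classes are not reproduced in this thread) can be sketched roughly as follows. StreamSupplier and consume are assumed names for illustration, not the attachment's actual code:

```java
import java.io.IOException;
import java.io.InputStream;

// Sketch of the lazy-stream idea: the InputStream is not created up front for
// every source, but only when the merge actually reads that source, and it is
// closed as soon as the source has been consumed. This keeps at most one
// stream (and its internal buffers) alive at a time.
public class LazySourceSketch {

    // Deferred stream factory; the stream is opened only when get() is called.
    interface StreamSupplier {
        InputStream get() throws IOException;
    }

    // Open on demand, drain, and close in the same scope.
    static int consume(StreamSupplier supplier) throws IOException {
        InputStream in = supplier.get();   // opened only now
        try {
            int bytes = 0;
            while (in.read() != -1) {
                bytes++;
            }
            return bytes;
        } finally {
            in.close();                    // closed alongside its "document"
        }
    }
}
```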
[jira] [Updated] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Potagal updated PDFBOX-4188: - Affects Version/s: 3.0.0 PDFBox
[jira] [Created] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
Gary Potagal created PDFBOX-4188: Summary: "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs Key: PDFBOX-4188 URL: https://issues.apache.org/jira/browse/PDFBOX-4188 Project: PDFBox Issue Type: Improvement Affects Versions: 2.0.9 Reporter: Gary Potagal
I have been running some tests trying to merge a large number (2618) of small pdf documents, between 100kb and 130kb, into a single large pdf (288.433kb). Memory consumption seems to be the main limitation. ScratchFileBuffer seems to account for the majority of the memory usage. (see screenshot from MAT in attachment) (I would include the hprof in attachment so you can analyze it yourselves, but it's rather large) Note that it seems impossible to generate a large pdf using a small memory footprint. I personally thought that using MemorySettings with temporary file only would allow me to generate arbitrarily large pdf files, but it doesn't seem to help. I've run mergeDocuments with memory settings: * MemoryUsageSetting.setupMixed(1024L * 1024L, 1024L * 1024L * 1024L * 1024L * 1024L) * MemoryUsageSetting.setupTempFileOnly() Refactored version completes with *4GB* heap: with temp file only, completes 2618 documents in 1.760 min *VS* *8GB* heap: with temp file only, completes 2618 documents in 2.0 min Heaps of 6GB or less result in OOM. (Didn't try different sizes between 6GB and 8GB) It looks like the loop in mergeDocuments accumulates PDDocument objects in a list, which are closed after the merge is completed. Refactoring the code to close these as they are used, instead of accumulating them and closing all at the end, improves memory usage considerably (although it doesn't seem to be completely eliminated based on MAT analysis). Another change I've implemented is to only create the InputStream when the file needs to be read and to close it alongside the PDDocument. (Some InputStreams contain buffers, and depending on the size of the buffers and/or the stream type, accumulating all the streams is a potential memory hog.) These changes seem to have a beneficial effect in the sense that I can process the same number of pdfs with about half the memory. I'd appreciate it if you could roll these changes into the main codebase. (I've respected Java 6 compatibility.) I've included in attachment the java files of the new implementation: * Suppliers * Supplier * PDFMergerUtilityUsingSupplier PDFMergerUtilityUsingSupplier can replace the previous version. No signature changes, only internal code changes. (Just rename the class to PDFMergerUtility if you decide to implement the changes.) In attachment you can also find some screenshots from VisualVM showing the memory usage of the original version and the refactored version, as well as some info produced by MAT after analysing the heap. If you know of any other means, without running into memory issues, to merge large sets of pdf files into a large single pdf, I'd love to hear about it! I'd also suggest that there should be further improvements made in memory usage in general, as pdfbox seems to consume a lot of memory in general.
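The close-as-you-go refactor described in the report can be sketched generically. Source and mergeInto below are hypothetical stand-ins for PDDocument and the merger internals, not real PDFBox API:

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.List;

// Sketch of the refactor: instead of accumulating every opened source in a
// list and closing them all after the merge, each source is opened, consumed,
// and closed inside the loop, so at most one input is alive at a time.
public class IncrementalMerge {

    interface Source extends Closeable {
        void mergeInto(StringBuilder destination) throws IOException;
    }

    static String merge(List<? extends Source> sources) throws IOException {
        StringBuilder destination = new StringBuilder();
        for (Source source : sources) {
            try {
                source.mergeInto(destination);   // consume this input
            } finally {
                source.close();                  // release its cache immediately
            }
        }
        return destination.toString();
    }
}
```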
[jira] [Commented] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails
[ https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430445#comment-16430445 ] Gary Potagal commented on PDFBOX-4158: -- Yes, we completed testing Friday and are no longer seeing a memory leak / orphaned scratch files on disk. This ticket can be closed. Thank you. > COSDocument and PDFMerger may not close all IO resources if closing of one > fails > > > Key: PDFBOX-4158 > URL: https://issues.apache.org/jira/browse/PDFBOX-4158 > Project: PDFBox > Issue Type: Bug > Components: PDModel >Affects Versions: 2.0.4, 2.0.9, 3.0.0 PDFBox >Reporter: Maruan Sahyoun >Assignee: Maruan Sahyoun >Priority: Minor > Fix For: 2.0.10, 3.0.0 PDFBox > > Attachments: BiggestObjectAllocationGraph.png, BiggestObjectList.png, > PDFBOX-4158.patch > > > As observed on the users mailing list {{COSDocument.close}} and > {{PDFMergerUtility.mergeDocuments}} might not close all IO resources if > closing of one of the resources fails
[jira] [Comment Edited] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails
[ https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417321#comment-16417321 ] Gary Potagal edited comment on PDFBOX-4158 at 3/28/18 1:33 PM: --- [~msahyoun] - what about not having a nested try\{ } on line 294 and just having try, catch, finally? If an IOException occurs anywhere in the try, it will be caught by the catch, and in the finally, firstException will not be null. Otherwise, you might swallow an Exception that occurs in the finally. It could be argued that the method will not notify the caller if it gets errors closing assets, but then we're making presumptions on behalf of the caller. Thanks! was (Author: gary.potagal): [~msahyoun] - what about not having a nested try\{ } on line 94 and just having try, catch, finally? If an IOException occurs anywhere in the try, it will be caught by the catch, and in the finally, firstException will not be null. Otherwise, you might swallow an Exception that occurs in the finally. It could be argued that the method will not notify the caller if it gets errors closing assets, but then we're making presumptions on behalf of the caller. Thanks!
[jira] [Commented] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails
[ https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417321#comment-16417321 ] Gary Potagal commented on PDFBOX-4158: -- [~msahyoun] - what about not having a nested try\{ } on line 94 and just having try, catch, finally? If an IOException occurs anywhere in the try, it will be caught by the catch, and in the finally, firstException will not be null. Otherwise, you might swallow an Exception that occurs in the finally. It could be argued that the method will not notify the caller if it gets errors closing assets, but then we're making presumptions on behalf of the caller. Thanks!
[jira] [Commented] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails
[ https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416109#comment-16416109 ] Gary Potagal commented on PDFBOX-4158: -- [~msahyoun] - this is much cleaner - will copy your changes into our code. Thank you again.
[jira] [Comment Edited] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails
[ https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415831#comment-16415831 ] Gary Potagal edited comment on PDFBOX-4158 at 3/27/18 3:54 PM: --- [~msahyoun] - I see that you also added more info to the logging message. Thank you for the code review, I will merge your changes into our code was (Author: gary.potagal): [~msahyoun] - I see that you also added more logging to the logging message. Thank you for the code review, I will merge your changes into our code
[jira] [Commented] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails
[ https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415831#comment-16415831 ] Gary Potagal commented on PDFBOX-4158: -- [~msahyoun] - I see that you also added more logging to the logging message. Thank you for the code review, I will merge your changes into our code
[jira] [Commented] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails
[ https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414464#comment-16414464 ] Gary Potagal commented on PDFBOX-4158: -- [~msahyoun] - ended up following your first advice to keep code consistent with the project. Patch [^PDFBOX-4158.patch] is attached. Please let me know if I should do anything else. Thank you.
[jira] [Updated] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails
[ https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Potagal updated PDFBOX-4158: - Attachment: PDFBOX-4158.patch
[jira] [Comment Edited] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails
[ https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405110#comment-16405110 ] Gary Potagal edited comment on PDFBOX-4158 at 3/19/18 4:53 PM: --- Hi, changed the attached image [^BiggestObjectAllocationGraph.png] to show the full picture. I was going to catch Throwable, log at WARN and throw IOException with the count of errors, something like "Encountered N errors in attempt to close m documents", but I can re-throw the first Exception if you prefer. Will take about a week to complete and validate in our test environment. Thanks! was (Author: gary.potagal): Hi, changed the attached image: https://issues.apache.org/jira/secure/attachment/12915154/BiggestObjectAllocationGraph.png to show full picture. I was going to catch Throwable, log at WARN and throw IOException with the count of errors, something like "Encountered N errors in attempt to close m documents", but I can re-throw first Exception if you prefer. Will take about a week to complete and validate in our test environment. Thanks!
[jira] [Commented] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails
[ https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405110#comment-16405110 ] Gary Potagal commented on PDFBOX-4158: -- Hi, changed the attached image: https://issues.apache.org/jira/secure/attachment/12915154/BiggestObjectAllocationGraph.png to show full picture. I was going to catch Throwable, log at WARN and throw IOException with the count of errors, something like "Encountered N errors in attempt to close m documents", but I can re-throw first Exception if you prefer. Will take about a week to complete and validate in our test environment. Thanks!
[jira] [Updated] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails
[ https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Potagal updated PDFBOX-4158: - Attachment: (was: BiggestObjectAllocationGraph.png)
[jira] [Updated] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails
[ https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Potagal updated PDFBOX-4158: - Attachment: BiggestObjectAllocationGraph.png
[jira] [Updated] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails
[ https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Potagal updated PDFBOX-4158: - Attachment: BiggestObjectList.png BiggestObjectAllocationGraph.png
[jira] [Commented] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails
[ https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405071#comment-16405071 ] Gary Potagal commented on PDFBOX-4158: -- We tried using PDFBox 2.0.4 to merge PDF documents, in order to allow the user to print all documents at once from the browser. The resulting document is sent to the print window. In order to improve customer experience, documents are merged to the Servlet HTTP output stream. It's possible for the user to close the window or for the network to time out, resulting in the Exception below. We used JProfiler to diagnose the memory leak and followed these instructions: [https://www.ej-technologies.com/resources/jprofiler/help/doc/#jprofiler.heapWalker.memoryLeaks] BiggestObject and Allocation screen captures are attached. We also observed that the scratch file is not cleaned up. In reviewing the code: * org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(MemoryUsageSetting memUsageSetting): the finally\{ } block loops over (PDDocument doc : tobeclosed) without catching Exception, probably causing the leak that we're seeing. * org.apache.pdfbox.cos.COSDocument#close(): if closing a COSStream or COSObject results in an IOException, close on scratchFile is never reached.
As this is bringing down our production servers, I'll try to add debugging and use try / catch / finally on every operation to see if the memory leak can be avoided, and will submit a fix for your review.
==
org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe
at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:396)
at org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:426)
at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:283)
at org.apache.catalina.connector.OutputBuffer.writeByte(OutputBuffer.java:440)
at org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:81)
at org.apache.catalina.filters.ExpiresFilter$XServletOutputStream.write(ExpiresFilter.java:1016)
at org.apache.pdfbox.pdfwriter.COSStandardOutputStream.write(COSStandardOutputStream.java:144)
at org.apache.pdfbox.cos.COSName.writePDF(COSName.java:702)
at org.apache.pdfbox.pdfwriter.COSWriter.visitFromName(COSWriter.java:1155)
at org.apache.pdfbox.cos.COSName.accept(COSName.java:672)
at org.apache.pdfbox.pdfwriter.COSWriter.visitFromDictionary(COSWriter.java:995)
at org.apache.pdfbox.cos.COSDictionary.accept(COSDictionary.java:1325)
at org.apache.pdfbox.pdfwriter.COSWriter.doWriteObject(COSWriter.java:522)
at org.apache.pdfbox.pdfwriter.COSWriter.doWriteObjects(COSWriter.java:460)
at org.apache.pdfbox.pdfwriter.COSWriter.doWriteBody(COSWriter.java:444)
at org.apache.pdfbox.pdfwriter.COSWriter.visitFromDocument(COSWriter.java:1096)
at org.apache.pdfbox.cos.COSDocument.accept(COSDocument.java:419)
at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1367)
at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1254)
at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1232)
at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:280)
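One way to address the uncaught-exception close loop described in this comment is a loop that records the first failure, keeps closing the remaining documents, and rethrows at the end. This is a generic sketch of that pattern, not the actual PDFBOX-4158 patch:

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.List;

// Sketch: close every resource even if some closes fail, then rethrow only
// the first IOException so later failures cannot mask it. If one close throws
// and is not caught (as in the reported finally block), the remaining
// documents and the scratch file are never closed, leaking resources.
public class CloseAllRethrowFirst {

    static void closeAll(List<? extends Closeable> resources) throws IOException {
        IOException firstException = null;
        for (Closeable resource : resources) {
            try {
                resource.close();
            } catch (IOException e) {
                if (firstException == null) {
                    firstException = e;   // remember the first failure only
                }
            }
        }
        if (firstException != null) {
            throw firstException;         // report it after all closes ran
        }
    }
}
```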