[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2019-01-02 Thread Gary Potagal (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732099#comment-16732099
 ] 

Gary Potagal commented on PDFBOX-4188:
--

I saw some activity on this ticket, so I reviewed it again and have a couple of questions:

1. Am I correct that, without code changes, PDFBOX_LEGACY_MODE is going to 
be used?
2. Even with the default PDFBOX_LEGACY_MODE, the updated memory management 
presented in this ticket would still be hugely beneficial when merging a large 
number of small files.  Any plans to review it or change the default?
3. Does the "Structure Tree" limitation still exist?

Thank you!

>  "Maximum allowed scratch file memory exceeded." Exception when merging large 
> number of small PDFs
> --
>
> Key: PDFBOX-4188
> URL: https://issues.apache.org/jira/browse/PDFBOX-4188
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.9, 3.0.0 PDFBox
>Reporter: Gary Potagal
>Priority: Major
> Attachments: PDFBOX-4188-MemoryManagerPatch.zip, 
> PDFBOX-4188-breakingTest.zip, PDFBOX-4188_memory_diagram.png, 
> PDFMergerUtility.java-20180412.patch
>
>
>  
> On 06.04.2018 at 23:10, Gary Potagal wrote:
>  
> We wanted to address one more merge issue in 
> org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).
> We need to merge a large number of small files.  We use mixed mode (memory 
> and disk) for the cache.  Initially, we would often get "Maximum allowed scratch 
> file memory exceeded." unless we turned off the check by passing "-1" to 
> org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting.  I believe this 
> is what the users that opened PDFBOX-3721 were running into.
> Our research indicates that the core issue with the memory model is that 
> instead of sharing a single cache, it breaks the cache up into equal-sized fixed 
> partitions based on the number of input + output files being merged.  This 
> means that each partition must be big enough to hold the final output file.  
> When 400 1-page files are merged, this creates 401 partitions, each of 
> which needs to be big enough to hold the final 400 pages.  Even worse, the 
> merge algorithm needs to keep all files open until the end.
> Given this, near the end of the merge, we're actually caching 400 x 1-page 
> input files, and 1 x 400-page output file, or 800 pages.
> However, with the partitioned cache, we need to declare room for 401 x 
> 400 pages, or 160,400 pages in total, when specifying "maxStorageBytes".  This 
> is a very large number, usually gigabytes.
>  
> Given the current limitation that we need to keep all the input files open 
> until the output file is written (HUGE), we came up with 2 options.  (See 
> PDFBOX-4182)  
>  
> 1.  Good: Split the cache in ½, give ½ to the output file, and segment the 
> other ½ across the input files (still keeping them open until the end).
> 2.  Better: Dynamically allocate 16-page (64K) chunks from memory or disk 
> on demand, releasing cache as documents are closed after the merge.  This is our 
> current implementation until PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004 are 
> addressed.
>  
> We would like to submit our current implementation as a Patch to 2.0.10 and 
> 3.0.0, unless this is already addressed.
>  
>  Thank you
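For illustration, the arithmetic in the quoted description can be sketched in Java. This is a toy model: the class and method names are mine, and the figures simply mirror the 400-file example from the ticket, not PDFBox internals.

```java
// Toy model of the scratch-cache arithmetic from the ticket. Class and
// method names are illustrative only, not PDFBox API.
public class ScratchMath {

    // Pages that must be reserved under the partitioned model:
    // (inputs + 1 output) partitions, each sized for the final document.
    static long partitionedPages(long inputFiles, long pagesPerInput) {
        long finalPages = inputFiles * pagesPerInput;
        return (inputFiles + 1) * finalPages;
    }

    // Pages actually alive near the end of the merge if a single cache
    // were shared: every input still open, plus the finished output.
    static long sharedPeakPages(long inputFiles, long pagesPerInput) {
        long inputPages = inputFiles * pagesPerInput;
        long outputPages = inputFiles * pagesPerInput;
        return inputPages + outputPages;
    }

    public static void main(String[] args) {
        // 400 one-page inputs: 401 partitions x 400 pages = 160,400 pages
        // reserved, versus roughly 800 pages actually needed at peak.
        System.out.println(partitionedPages(400, 1)); // 160400
        System.out.println(sharedPeakPages(400, 1));  // 800
    }
}
```

The gap between the two numbers is the ~200x over-reservation the ticket describes.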



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-17 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440811#comment-16440811
 ] 

Gary Potagal commented on PDFBOX-4188:
--

Hello [~msahyoun] and [~tilman] - should we continue to work on this patch for 
2.0.10 or do you want to come back to this for 3.0?  Thank you







[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-16 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439923#comment-16439923
 ] 

Gary Potagal commented on PDFBOX-4188:
--

[~msahyoun] - Sorry, I just reviewed the code more carefully.  What I'm seeing is:

- org.apache.pdfbox.io.MemoryUsageSetting#getPartitionedCopy was only used in 
the 
org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting)
 method

- getPartitionedCopy creates a new instance of MemoryUsageSetting with limits 
determined by parallelUseCount.  It is basically a copy constructor.  As a 
utility method, it will still function just as before.
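As a rough sketch of what such a partitioned copy amounts to: each byte limit is split evenly across parallelUseCount users. The method below is my own illustration of that behavior, not the actual PDFBox source; the pass-through of -1 assumes the "unrestricted" convention mentioned earlier in the thread.

```java
// Illustrative sketch of a "partitioned copy": each limit is divided by
// parallelUseCount so every parallel user gets an equal fixed share.
// This mirrors the behavior described above; it is not PDFBox code.
public class PartitionSketch {

    // A negative limit conventionally means "unrestricted" and is kept as-is.
    static long partitionedLimit(long totalBytes, int parallelUseCount) {
        if (totalBytes < 0) {
            return totalBytes;
        }
        return totalBytes / parallelUseCount;
    }
}
```

Under this scheme, merging 400 files against a fixed total budget leaves each of the 401 partitions with only 1/401 of it, which is why callers end up raising maxStorageBytes into the gigabytes.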

 







[jira] [Commented] (PDFBOX-4190) Allow caller to control openAction on merged documents

2018-04-16 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439881#comment-16439881
 ] 

Gary Potagal commented on PDFBOX-4190:
--

Unfortunately, the order of selected documents is controlled by the user.  
MergeOptions sounds like a great idea.  Would this be for 3.0.0 or can it be 
added to 2.x?

> Allow caller to control openAction on merged documents
> --
>
> Key: PDFBOX-4190
> URL: https://issues.apache.org/jira/browse/PDFBOX-4190
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.9, 3.0.0 PDFBox
>Reporter: Gary Potagal
>Priority: Major
>
> It would be great if openAction behavior was configurable. When documents are 
> merged, we are required to open on the first page.






[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-16 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439858#comment-16439858
 ] 

Gary Potagal commented on PDFBOX-4188:
--

[~msahyoun] - Probably nothing good.  In our code, we took that method out.







[jira] [Commented] (PDFBOX-4190) Allow caller to control openAction on merged documents

2018-04-16 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439850#comment-16439850
 ] 

Gary Potagal commented on PDFBOX-4190:
--

If we're sending results of the merge to the ServletOutputStream, is there a 
way to set it that I'm missing?







[jira] [Commented] (PDFBOX-4190) Allow caller to control openAction on merged documents

2018-04-16 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439745#comment-16439745
 ] 

Gary Potagal commented on PDFBOX-4190:
--

[~tilman] 
org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting)
 - creates a new Destination inside of the method when you merge documents.  







[jira] [Comment Edited] (PDFBOX-4190) Allow caller to control openAction on merged documents

2018-04-16 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439745#comment-16439745
 ] 

Gary Potagal edited comment on PDFBOX-4190 at 4/16/18 5:21 PM:
---

[~tilman] 
org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting)
 - creates a new Destination inside of the method body when you merge 
documents.  








[jira] [Updated] (PDFBOX-4190) Allow caller to control openAction on merged documents

2018-04-16 Thread Gary Potagal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Potagal updated PDFBOX-4190:
-
Description: 
It would be great if openAction behavior was configurable. When documents are 
merged, we are required to open on the first page.








[jira] [Updated] (PDFBOX-4190) Allow caller to control openAction on merged documents

2018-04-16 Thread Gary Potagal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Potagal updated PDFBOX-4190:
-
Description: (was: [the original PDFBOX-4188 description text, removed])








[jira] [Created] (PDFBOX-4190) Allow caller to control openAction on merged documents

2018-04-16 Thread Gary Potagal (JIRA)
Gary Potagal created PDFBOX-4190:


 Summary: Allow caller to control openAction on merged documents
 Key: PDFBOX-4190
 URL: https://issues.apache.org/jira/browse/PDFBOX-4190
 Project: PDFBox
  Issue Type: Improvement
Affects Versions: 2.0.9, 3.0.0 PDFBox
Reporter: Gary Potagal


 







[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-16 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439685#comment-16439685
 ] 

Gary Potagal commented on PDFBOX-4188:
--

[~msahyoun]

- I've attached [^PDFBOX-4188_memory_diagram.png] to demonstrate the problem.  
It's harder to diagram, but the real scope of the problem becomes much worse 
the more files you add to the merge.  We hope you can see the problem in the 
test that was submitted.
- The problem starts in PDFMergerUtility when memory is partitioned (Line 288). 
 We're eliminating memory partitioning, so the patch can't be split into two 
parts.  There's one very important point - MemoryUsageSetting is a *single* 
object that's shared between all ScratchFiles.  All ScratchFiles must reserve 
pages with MemoryUsageSetting, thus:
-- Pages (in main memory and on disk) are allocated only when they are needed
-- Total limits are tracked in a single place, so whatever settings are passed 
into the PDFMergerUtility will be the maximum memory limits used during the 
merge.
- I'll open another ticket for openAction.
- MappedByteBuffer is used when there is a need to read the content of a file 
multiple times.  Is that done during the merge? 
- If the patch is acceptable, we'll clean it up to meet coding conventions.
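The shared-budget behavior in the bullets above can be sketched as a toy chunk pool: one global limit, fixed 16-page (64 KB) chunks reserved on demand, and chunks returned when their owning document is closed. ChunkPool and its method names are hypothetical, chosen for illustration; this is not PDFBox API.

```java
// Toy sketch of a single shared scratch budget: fixed 16-page (64 KB)
// chunks are reserved on demand against one global limit and released
// when their owning document is closed. Hypothetical names, not PDFBox API.
public class ChunkPool {
    public static final long CHUNK_BYTES = 16 * 4096; // 16 pages x 4 KB

    private final long maxBytes;   // total budget shared by ALL documents
    private long allocatedBytes;   // currently reserved across the pool

    public ChunkPool(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    // Reserve one chunk; fails only when the global budget is exhausted,
    // never because of a per-document partition.
    public synchronized boolean acquire() {
        if (allocatedBytes + CHUNK_BYTES > maxBytes) {
            return false;
        }
        allocatedBytes += CHUNK_BYTES;
        return true;
    }

    // Return a chunk to the pool when a merged document is closed.
    public synchronized void release() {
        allocatedBytes = Math.max(0, allocatedBytes - CHUNK_BYTES);
    }

    public synchronized long usedBytes() {
        return allocatedBytes;
    }
}
```

The key property is that releases from closed input documents immediately free budget for the growing output document, so the caller's limit bounds actual peak usage rather than a worst-case per-partition reservation.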







[jira] [Updated] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-16 Thread Gary Potagal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Potagal updated PDFBOX-4188:
-
Attachment: PDFBOX-4188_memory_diagram.png

>  "Maximum allowed scratch file memory exceeded." Exception when merging large 
> number of small PDFs
> --
>
> Key: PDFBOX-4188
> URL: https://issues.apache.org/jira/browse/PDFBOX-4188
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.9, 3.0.0 PDFBox
>Reporter: Gary Potagal
>Priority: Major
> Attachments: PDFBOX-4188-MemoryManagerPatch.zip, 
> PDFBOX-4188-breakingTest.zip, PDFBOX-4188_memory_diagram.png, 
> PDFMergerUtility.java-20180412.patch
>
>
>  
> Am 06.04.2018 um 23:10 schrieb Gary Potagal:
>  
> We wanted to address one more merge issue in 
> org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).
> We need to merge a large number of small files.  We use mixed mode, memory 
> and disk for cache.  Initially, we would often get "Maximum allowed scratch 
> file memory exceeded.", unless we turned off the check by passing "-1" to 
> org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting.  I believe, this 
> is what the users that opened PDFBOX-3721 where running into.
> Our research indicates that the core issue with the memory model is that 
> instead of sharing a single cache, it breaks it up into equal sized fixed 
> partitions based on the number of input + output files being merged.  This 
> means that each partition must be big enough to hold the final output file.  
> When 400 1-page files are merged, this creates 401 partitions, but each of 
> which needs to be big enough to hold the final 400 pages.  Even worse, the 
> merge algorithm needs to keep all files open until the end.
> Given this, near the end of the merge, we're actually caching 400 x 1-page 
> input files, and 1 x 400-page output file, or 801 pages.
> However, with the partitioned cache, we need to declare room for 401  x 
> 400-pages, or 160,400 pages in total when specifying "maxStorageBytes".  This 
> would be a very high number, usually in GIGs.
>  
> Given the current limitation that we need to keep all the input files open 
> until the output file is written (HUGE), we came up with 2 options.  (See 
> PDFBOX-4182)  
>  
> 1.  Good: Split the cache in ½, give ½ to the output file, and segment the 
> other ½ across the input files. (Still keeping them open until then end).
> 2.  Better: Dynamically allocate in 16-page (64 KB) chunks from memory or disk 
> on demand, and release the cache as documents are closed after the merge.  This 
> is our current implementation until PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004 
> are addressed.
>  
> We would like to submit our current implementation as a Patch to 2.0.10 and 
> 3.0.0, unless this is already addressed.
>  
>  Thank you
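The over-allocation described in the quoted report can be checked with a bit of standalone arithmetic. The sketch below uses the ticket's 400-file example; the per-file and per-partition page counts come straight from the report (which tallies the live pages as 801; the sketch counts 400 input pages + 400 output pages = 800):

```java
// Back-of-the-envelope check of the partitioned-cache math from the ticket.
public class PartitionedCacheMath {
    public static void main(String[] args) {
        int inputFiles = 400;          // 400 one-page source PDFs
        int outputPages = 400;         // merged result: 400 pages

        // Partitioned model: one equal-sized partition per open file
        // (400 inputs + 1 output), each sized to hold the final output.
        int partitions = inputFiles + 1;
        int partitionedPages = partitions * outputPages;   // 160,400 pages

        // What is actually live near the end of the merge: every
        // one-page input plus the full 400-page output.
        int actualPages = inputFiles + outputPages;        // 800 pages

        System.out.println("declared capacity: " + partitionedPages + " pages");
        System.out.println("actually cached:   " + actualPages + " pages");
        System.out.println("over-allocation:   " + (partitionedPages / actualPages) + "x");
    }
}
```

So the fixed-partition scheme reserves roughly 200 times more scratch space than the merge ever uses at once, which is why `maxStorageBytes` has to be set to gigabytes for a few megabytes of input.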



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Comment Edited] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-13 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437703#comment-16437703
 ] 

Gary Potagal edited comment on PDFBOX-4188 at 4/13/18 6:26 PM:
---

Added [^PDFBOX-4188-MemoryManagerPatch.zip].  It assumes that 
[^PDFBOX-4188-breakingTest.zip] is already applied and that the PDF used in the 
test exists.
 
 - This should optimize both modes, but especially the LEGACY mode.
 - The Javadoc explains what was changed.
 - Tests pass with:

long defaultMemory = 1 * MEG;

runMergeTest("pdf_sample_1-100pages", defaultMemory, 10 * MEG);
runMergeTest("pdf_sample_1-200pages", defaultMemory, 15 * MEG);
runMergeTest("pdf_sample_1-300pages", defaultMemory, 25 * MEG);
runMergeTest("pdf_sample_1-400pages", defaultMemory, 30 * MEG);

 - It would be great if the openAction behavior were configurable.  When 
documents are merged, we would like them to open on the first page.

Please let us know what you think and if you have any questions.  Thank you.
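The patch's strategy is the "Better" option from the ticket: allocate scratch space in 16-page (64 KB) chunks on demand and release it when a document is closed. The following standalone simulation (the ~4 KB-per-page figure is an assumption derived from the ticket's 16 pages = 64 KB chunk size, not PDFBox code) illustrates why a ~30 MB budget suffices for the 400-file case:

```java
// Simulates on-demand chunk allocation during a 400-file merge.
public class ChunkBudgetSimulation {
    static final int PAGE_BYTES = 4 * 1024;                  // assumed ~4 KB per cached page
    static final int CHUNK_PAGES = 16;                       // 16 pages per chunk
    static final int CHUNK_BYTES = CHUNK_PAGES * PAGE_BYTES; // 64 KB

    // Number of chunks needed to hold the given page count (round up).
    static int chunksFor(int pages) {
        return (pages + CHUNK_PAGES - 1) / CHUNK_PAGES;
    }

    public static void main(String[] args) {
        int inputFiles = 400;
        int liveInputChunks = 0;
        int outputPages = 0;
        int peakChunks = 0;

        // Merge loop: each one-page input holds one chunk while open
        // (inputs stay open until the output is written), and the
        // output document grows by one page per input.
        for (int i = 0; i < inputFiles; i++) {
            liveInputChunks += chunksFor(1);
            outputPages++;
            peakChunks = Math.max(peakChunks, liveInputChunks + chunksFor(outputPages));
        }

        long peakBytes = (long) peakChunks * CHUNK_BYTES;
        System.out.println("peak chunks: " + peakChunks);                       // 425
        System.out.println("peak usage:  " + peakBytes / (1024 * 1024) + " MB"); // ~26 MB
    }
}
```

Under these assumptions the peak is about 26 MB, comfortably inside the 30 MB `maxStorageBytes` used in the 400-page test above, versus the ~1.2 GB the fixed-partition model demands.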












[jira] [Updated] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-13 Thread Gary Potagal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Potagal updated PDFBOX-4188:
-
Attachment: PDFBOX-4188-MemoryManagerPatch.zip







[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434644#comment-16434644
 ] 

Gary Potagal commented on PDFBOX-4188:
--

I'm working on merging the patch that we did for 2.0.4 to the current trunk.  
I'll try to have it available shortly for your review.







[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434416#comment-16434416
 ] 

Gary Potagal commented on PDFBOX-4188:
--

[~tilman] - we created a breaking test and it's attached 
[^PDFBOX-4188-breakingTest.zip].  

The patch is binary, so you need to apply it in the checked-out trunk 
directory using the command:

trunk> patch -p0 --binary -i PDFBOX-4188-breakingTest.diff

patching file 
pdfbox/src/test/java/org/apache/pdfbox/multipdf/PdfMergeUtilityPagesTest.java
patching file pdfbox/src/test/resources/input/merge/pages/pdf_sample_1.pdf

 

The test does the following:
 # Creates four folders containing copies of the simple one-page 
pdf_sample_1.pdf file. Each folder contains an increasing number of copies: 
100, 200, 300, and 400.  Each file is about 8 KB.
 # Merges all files in each folder.  The maxStorageBytes values in the test are 
just enough to let it pass.  If you decrease them slightly, the exception will 
be thrown.

 

Output looks like this:

Apr 11, 2018 2:25:32 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest 
runMergeTest
INFO: Test Name: pdf_sample_1-100pages; Files: 100; Pages: 100; Time(s): 0.781; 
Pages/Second: 128.041; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 74; 
Total Sources Size(K): 775; Merged File Size(K): 522; Ratio 
MaxStorageBytes/Merged File Size: 145
Apr 11, 2018 2:25:34 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest 
runMergeTest
INFO: Test Name: pdf_sample_1-200pages; Files: 200; Pages: 200; Time(s): 1.486; 
Pages/Second: 134.590; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 315; 
Total Sources Size(K): 1,551; Merged File Size(K): 1,042; Ratio 
MaxStorageBytes/Merged File Size: 309
Apr 11, 2018 2:25:37 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest 
runMergeTest
INFO: Test Name: pdf_sample_1-300pages; Files: 300; Pages: 300; Time(s): 3.532; 
Pages/Second: 84.938; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 710; 
Total Sources Size(K): 2,327; Merged File Size(K): 1,562; Ratio 
MaxStorageBytes/Merged File Size: 465
Apr 11, 2018 2:25:42 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest 
runMergeTest
INFO: Test Name: pdf_sample_1-400pages; Files: 400; Pages: 400; Time(s): 4.677; 
Pages/Second: 85.525; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 1,240; 
Total Sources Size(K): 3,103; Merged File Size(K): 2,082; Ratio 
MaxStorageBytes/Merged File Size: 609
Apr 11, 2018 2:25:42 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest 
testPerformanceMerge
INFO: Summary: Pages: 1000, Time(s): 10.476, Pages/Second: 95.456

As you can see, to merge 400 one-page 8 KB files, we need to set maxStorageBytes 
to ~1.2 GB.  The resulting file is ~2,000 KB.
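The "Ratio MaxStorageBytes/Merged File Size" column in the log can be reproduced directly from the logged sizes (integer division, with both sides expressed in KB):

```java
// Recomputes the ratio column from the test log's MaxStorageBytes (MB)
// and Merged File Size (KB) values.
public class RatioCheck {
    static long ratio(long maxStorageMB, long mergedKB) {
        return (maxStorageMB * 1024) / mergedKB;
    }

    public static void main(String[] args) {
        System.out.println(ratio(74, 522));     // 100-page run -> 145
        System.out.println(ratio(315, 1042));   // 200-page run -> 309
        System.out.println(ratio(710, 1562));   // 300-page run -> 465
        System.out.println(ratio(1240, 2082));  // 400-page run -> 609
    }
}
```

The ratio grows from 145x to 609x as the file count rises, matching the quadratic blow-up of the partitioned model: each added input adds both a partition and a page to the size every partition must hold.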

 


[jira] [Updated] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Gary Potagal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Potagal updated PDFBOX-4188:
-
Attachment: PDFBOX-4188-breakingTest.zip







[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434191#comment-16434191
 ] 

Gary Potagal commented on PDFBOX-4188:
--

We don't know what PDFs we're going to get, so we are trying to make this generic.

 







[jira] [Updated] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Gary Potagal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Potagal updated PDFBOX-4188:
-


[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434177#comment-16434177
 ] 

Gary Potagal commented on PDFBOX-4188:
--

From: Tilman Hausherr [mailto:thaush...@t-online.de] 
 Sent: Saturday, April 07, 2018 1:48 AM
 To: dev@pdfbox.apache.org
 Subject: Re: "Maximum allowed scratch file memory exceeded." Exception when 
merging large number of small PDFs

 

Hi,

 

Please have also a look at the comments in

https://issues.apache.org/jira/browse/PDFBOX-4182  

Please submit your patch proposal there or in a new issue. It should be against 
the trunk. Note that this doesn't mean your patch will be accepted; it just 
means I'd like to see it, because I haven't fully understood your post, and 
many attachment types don't get through here.

 

A breaking test would be interesting: is it possible to use (or better, create) 
400 identical small PDFs, merge them, and does it break?
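Such a breaking test might look like the following sketch, assuming PDFBox 2.0.x on the classpath; the directory name and memory limits are arbitrary illustration values chosen to provoke the "Maximum allowed scratch file memory exceeded." exception under the partitioned model:

```java
// Sketch of the breaking test suggested above (hypothetical values).
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

public class MergeBreakingTest {
    public static void main(String[] args) throws IOException {
        File dir = new File("merge-test");
        dir.mkdirs();

        // Create 400 identical one-page PDFs and register them as sources.
        PDFMergerUtility merger = new PDFMergerUtility();
        for (int i = 0; i < 400; i++) {
            File f = new File(dir, "page-" + i + ".pdf");
            PDDocument doc = new PDDocument();
            doc.addPage(new PDPage());
            doc.save(f);
            doc.close();
            merger.addSource(f);
        }

        merger.setDestinationFileName(new File(dir, "merged.pdf").getPath());
        // Mixed mode with a deliberately small scratch-file cap: under the
        // partitioned model this is expected to fail with
        // "Maximum allowed scratch file memory exceeded."
        merger.mergeDocuments(
                MemoryUsageSetting.setupMixed(1024 * 1024, 16L * 1024 * 1024));
    }
}
```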

 

Tilman

 




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Updated] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Gary Potagal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Potagal updated PDFBOX-4188:
-
Description: 
 

Am 06.04.2018 um 23:10 schrieb Gary Potagal:

 

We wanted to address one more merge issue in 
org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).

We need to merge a large number of small files.  We use mixed mode, memory and 
disk for cache.  Initially, we would often get "Maximum allowed scratch file 
memory exceeded.", unless we turned off the check by passing "-1" to 
org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting.  I believe, this is 
what the users that opened https://issues.apache.org/jira/browse/PDFBOX-3721 
were running into.

Our research indicates that the core issue with the memory model is that 
instead of sharing a single cache, it breaks it up into equal sized fixed 
partitions based on the number of input + output files being merged.  This 
means that each partition must be big enough to hold the final output file.  
When 400 1-page files are merged, this creates 401 partitions, each of which 
needs to be big enough to hold the final 400 pages.  Even worse, the 
merge algorithm needs to keep all files open until the end.

Given this, near the end of the merge, we're actually caching 400 x 1-page 
input files and 1 x 400-page output file, or 800 pages in total.

However, with the partitioned cache, we need to declare room for 401 x 
400 pages, or 160,400 pages in total, when specifying "maxStorageBytes".  This 
would be a very high number, usually in gigabytes.

 

Given the current limitation that we need to keep all the input files open 
until the output file is written (HUGE), we came up with 2 options.  See 
(https://issues.apache.org/jira/browse/PDFBOX-4182)  

 

1.  Good: Split the cache in ½, give ½ to the output file, and segment the 
other ½ across the input files (still keeping them open until the end).

2.  Better: Dynamically allocate 16-page (64 KB) chunks from memory or disk on 
demand, and release the cache as documents are closed after the merge.  This is 
our current implementation until PDFBOX-3999 is addressed.

 

We would like to submit our current implementation as a Patch to 2.0.10 and 
3.0.0, unless this is already addressed.

 

 Thank you

  was:
I have been running some tests trying to merge a large number (2618) of small 
PDF documents, between 100 KB and 130 KB each, into a single large PDF (288,433 KB).

Memory consumption seems to be the main limitation.

ScratchFileBuffer seems to consume the majority of the memory usage.

(see screenshot from mat in attachment)

(I would include the hprof in attachment so you can analyze yourselves but it's 
rather large)

Note that it seems impossible to generate a large pdf using a small memory 
footprint.

I personally thought that using MemorySettings with temporary file only would 
allow me to generate arbitrarily large pdf files but it doesn't seem to help.

I've run the mergeDocuments with  memory settings:
 * MemoryUsageSetting.setupMixed(1024L * 1024L, 1024L * 1024L * 1024L * 1024L * 
1024L)

 * MemoryUsageSetting.setupTempFileOnly()

Refactored version completes with *4GB* heap:

with temp file only completes 2618 documents in 1.760 min

*VS*

*8GB* heap:

with temp file only completes 2618 documents in 2.0 min

Heaps of 6gb or less result in OOM. (Didn't try different sizes between 6GB and 
8GB)

 It looks like the loop in the mergeDocuments accumulates PDDocument objects in 
a list which are closed after the merge is completed.

Refactoring the code to close these as they are used, instead of accumulating 
them and closing all at the end, improves memory usage considerably (although 
it doesn't seem to be eliminated completely, based on MAT analysis).

Another change I've implemented is to only create the inputstream when the file 
needs to be read and to close it alongside the PDDocument.

(Some inputstreams contain buffers and depending on the size of the buffers and 
or the stream type accumulating all the streams is a potential memory-hog.)
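The deferred-stream idea described here can be sketched as follows (a hypothetical illustration; the attached Supplier/Suppliers/PDFMergerUtilityUsingSupplier classes are not reproduced):

```java
// Hypothetical sketch of the lazy-stream approach described above: defer
// opening each input file until it is actually merged, so streams (and any
// buffers they hold) can be closed together with their PDDocument instead
// of being accumulated up front.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

interface StreamSupplier {
    InputStream get() throws IOException; // opened only when needed
}

public class LazySources {
    public static StreamSupplier forFile(final String path) {
        // Java 6 compatible: anonymous class instead of a lambda
        return new StreamSupplier() {
            public InputStream get() throws IOException {
                return new FileInputStream(path);
            }
        };
    }
}
```

The merge loop would then call `get()` for one source at a time, closing the returned stream alongside the corresponding PDDocument before moving to the next file.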

These changes seem beneficial: I can process the same number of PDFs with 
about half the memory.

 I'd appreciate it if you could roll these changes into the main codebase.

(I've respected java 6 compatibility.)

I've included in attachment the java files of the new implementation:
 * Suppliers
 * Supplier
 * PDFMergerUtilityUsingSupplier

PDFMergerUtilityUsingSupplier can replace the previous version. No signature 
changes, only internal code changes (just rename the class to PDFMergerUtility 
if you decide to implement the changes).

 In attachment you can also find some screenshots from visualvm showing the 
memory usage of the original version and the refactored version as well as some 
info produced by mat after analysing the heap.

If you know of any other means, without 

[jira] [Updated] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Gary Potagal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Potagal updated PDFBOX-4188:
-
Affects Version/s: 3.0.0 PDFBox




[jira] [Created] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Gary Potagal (JIRA)
Gary Potagal created PDFBOX-4188:


 Summary:  "Maximum allowed scratch file memory exceeded." 
Exception when merging large number of small PDFs
 Key: PDFBOX-4188
 URL: https://issues.apache.org/jira/browse/PDFBOX-4188
 Project: PDFBox
  Issue Type: Improvement
Affects Versions: 2.0.9
Reporter: Gary Potagal


I have been running some tests trying to merge a large number (2618) of small 
PDF documents, between 100 KB and 130 KB each, into a single large PDF (288,433 KB).

Memory consumption seems to be the main limitation.

ScratchFileBuffer seems to consume the majority of the memory usage.

(see screenshot from mat in attachment)

(I would include the hprof in attachment so you can analyze yourselves but it's 
rather large)

Note that it seems impossible to generate a large pdf using a small memory 
footprint.

I personally thought that using MemorySettings with temporary file only would 
allow me to generate arbitrarily large pdf files but it doesn't seem to help.

I've run the mergeDocuments with  memory settings:
 * MemoryUsageSetting.setupMixed(1024L * 1024L, 1024L * 1024L * 1024L * 1024L * 
1024L)

 * MemoryUsageSetting.setupTempFileOnly()

Refactored version completes with *4GB* heap:

with temp file only completes 2618 documents in 1.760 min

*VS*

*8GB* heap:

with temp file only completes 2618 documents in 2.0 min

Heaps of 6gb or less result in OOM. (Didn't try different sizes between 6GB and 
8GB)

 It looks like the loop in the mergeDocuments accumulates PDDocument objects in 
a list which are closed after the merge is completed.

Refactoring the code to close these as they are used, instead of accumulating 
them and closing all at the end, improves memory usage considerably (although 
it doesn't seem to be eliminated completely, based on MAT analysis).

Another change I've implemented is to only create the inputstream when the file 
needs to be read and to close it alongside the PDDocument.

(Some inputstreams contain buffers and depending on the size of the buffers and 
or the stream type accumulating all the streams is a potential memory-hog.)

These changes seem beneficial: I can process the same number of PDFs with 
about half the memory.

 I'd appreciate it if you could roll these changes into the main codebase.

(I've respected java 6 compatibility.)

I've included in attachment the java files of the new implementation:
 * Suppliers
 * Supplier
 * PDFMergerUtilityUsingSupplier

PDFMergerUtilityUsingSupplier can replace the previous version. No signature 
changes, only internal code changes (just rename the class to PDFMergerUtility 
if you decide to implement the changes).

 In attachment you can also find some screenshots from visualvm showing the 
memory usage of the original version and the refactored version as well as some 
info produced by mat after analysing the heap.

If you know of any other means, without running into memory issues, to merge 
large sets of pdf files into a large single pdf I'd love to hear about it!

I'd also suggest further improvements to memory usage, as PDFBox seems to 
consume a lot of memory in general.






[jira] [Commented] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails

2018-04-09 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430445#comment-16430445
 ] 

Gary Potagal commented on PDFBOX-4158:
--

Yes, we completed testing Friday and are no longer seeing a memory leak / 
orphaned scratch files on disk.  This ticket can be closed.  Thank you.

> COSDocument and PDFMerger may not close all IO resources if closing of one 
> fails
> 
>
> Key: PDFBOX-4158
> URL: https://issues.apache.org/jira/browse/PDFBOX-4158
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.4, 2.0.9, 3.0.0 PDFBox
>Reporter: Maruan Sahyoun
>Assignee: Maruan Sahyoun
>Priority: Minor
> Fix For: 2.0.10, 3.0.0 PDFBox
>
> Attachments: BiggestObjectAllocationGraph.png, BiggestObjectList.png, 
> PDFBOX-4158.patch
>
>
> As observed on the users mailing list  {{COSDocument.close}} and 
> {{PDFMergerUtility.mergeDocuments}} might not close all IO resources if 
> closing of one of the resources fails






[jira] [Comment Edited] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails

2018-03-28 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417321#comment-16417321
 ] 

Gary Potagal edited comment on PDFBOX-4158 at 3/28/18 1:33 PM:
---

[~msahyoun] - what about not having a nested try { } on line 294 and just 
having try, catch, finally?  If an IOException occurs anywhere in the try, it 
will be caught by the catch, and in the finally, firstException will not be 
null.  Otherwise, you might swallow an exception that occurs in the finally.  
It could be argued that the method will not notify the caller if it gets errors 
closing assets, but then we're making presumptions on behalf of the caller.  
Thanks! 
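The close-everything-and-rethrow-the-first-failure shape under discussion might look like this (a hypothetical helper for illustration, not the actual PDFBox code):

```java
// Sketch of the single try/catch/finally close pattern discussed above:
// remember the first IOException, keep closing the remaining resources,
// and rethrow the first failure at the end so the caller is notified.
import java.io.Closeable;
import java.io.IOException;
import java.util.List;

public class CloseAll {
    static void closeAll(List<? extends Closeable> resources) throws IOException {
        IOException firstException = null;
        for (Closeable c : resources) {
            try {
                c.close();
            } catch (IOException e) {
                if (firstException == null) {
                    firstException = e; // remember only the first failure
                }
            }
        }
        if (firstException != null) {
            throw firstException; // report the first error to the caller
        }
    }
}
```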


was (Author: gary.potagal):
[~msahyoun] - what about not having a nested try { } on line 94 and just 
having try, catch, finally?  If an IOException occurs anywhere in the try, it 
will be caught by the catch, and in the finally, firstException will not be 
null.  Otherwise, you might swallow an exception that occurs in the finally.  
It could be argued that the method will not notify the caller if it gets errors 
closing assets, but then we're making presumptions on behalf of the caller.  
Thanks! 




[jira] [Commented] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails

2018-03-28 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417321#comment-16417321
 ] 

Gary Potagal commented on PDFBOX-4158:
--

[~msahyoun] - what about not having a nested try { } on line 94 and just 
having try, catch, finally?  If an IOException occurs anywhere in the try, it 
will be caught by the catch, and in the finally, firstException will not be 
null.  Otherwise, you might swallow an exception that occurs in the finally.  
It could be argued that the method will not notify the caller if it gets errors 
closing assets, but then we're making presumptions on behalf of the caller.  
Thanks! 




[jira] [Commented] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails

2018-03-27 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416109#comment-16416109
 ] 

Gary Potagal commented on PDFBOX-4158:
--

[~msahyoun] - this is much cleaner - will copy your changes into our code.  
Thank you again.




[jira] [Comment Edited] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails

2018-03-27 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415831#comment-16415831
 ] 

Gary Potagal edited comment on PDFBOX-4158 at 3/27/18 3:54 PM:
---

[~msahyoun] - I see that you also added more info to the logging message.  
Thank you for the code review, I will merge your changes into our code


was (Author: gary.potagal):
[~msahyoun] - I see that you also added more logging to the logging message.  
Thank you for the code review, I will merge your changes into our code




[jira] [Commented] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails

2018-03-27 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415831#comment-16415831
 ] 

Gary Potagal commented on PDFBOX-4158:
--

[~msahyoun] - I see that you also added more logging to the logging message.  
Thank you for the code review, I will merge your changes into our code




[jira] [Commented] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails

2018-03-26 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414464#comment-16414464
 ] 

Gary Potagal commented on PDFBOX-4158:
--

[~msahyoun] - ended up following your first advice to keep code consistent with 
the project.  Patch [^PDFBOX-4158.patch] is attached.  Please let me know if I 
should do anything else.  Thank you.




[jira] [Updated] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails

2018-03-26 Thread Gary Potagal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Potagal updated PDFBOX-4158:
-
Attachment: PDFBOX-4158.patch




[jira] [Comment Edited] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails

2018-03-19 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405110#comment-16405110
 ] 

Gary Potagal edited comment on PDFBOX-4158 at 3/19/18 4:53 PM:
---

Hi, changed the attached image [^BiggestObjectAllocationGraph.png] to show 
the full picture.

I was going to catch  Throwable, log at WARN and throw IOException with the 
count of errors, something like "Encountered N errors in attempt to close m 
documents", but I can re-throw first Exception if you prefer.
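The log-and-count alternative described here could be sketched as follows (a hypothetical helper for illustration, not the submitted patch):

```java
// Hypothetical sketch of the alternative described above: log every close
// failure at WARN, then surface a single summary IOException with the
// error count ("Encountered N errors in attempt to close M documents").
import java.io.Closeable;
import java.io.IOException;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

public class CloseWithSummary {
    private static final Logger LOG = Logger.getLogger("CloseWithSummary");

    static void closeAll(List<? extends Closeable> documents) throws IOException {
        int errors = 0;
        for (Closeable doc : documents) {
            try {
                doc.close();
            } catch (Throwable t) {
                errors++;
                LOG.log(Level.WARNING, "Failed to close document", t);
            }
        }
        if (errors > 0) {
            throw new IOException("Encountered " + errors
                    + " errors in attempt to close " + documents.size() + " documents");
        }
    }
}
```

Compared with rethrowing the first exception, this variant loses the original cause from the thrown exception's type but guarantees every failure is at least logged.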

Will take about a week to complete and validate in our test environment. 

Thanks! 


was (Author: gary.potagal):
Hi, changed the attached image:  
https://issues.apache.org/jira/secure/attachment/12915154/BiggestObjectAllocationGraph.png
 to show full picture.

I was going to catch  Throwable, log at WARN and throw IOException with the 
count of errors, something like "Encountered N errors in attempt to close m 
documents", but I can re-throw first Exception if you prefer.

Will take about a week to complete and validate in our test environment. 

Thanks! 




[jira] [Commented] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails

2018-03-19 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405110#comment-16405110
 ] 

Gary Potagal commented on PDFBOX-4158:
--

Hi, changed the attached image:  
https://issues.apache.org/jira/secure/attachment/12915154/BiggestObjectAllocationGraph.png
 to show full picture.

I was going to catch  Throwable, log at WARN and throw IOException with the 
count of errors, something like "Encountered N errors in attempt to close m 
documents", but I can re-throw first Exception if you prefer.

Will take about a week to complete and validate in our test environment. 

Thanks! 




[jira] [Updated] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails

2018-03-19 Thread Gary Potagal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Potagal updated PDFBOX-4158:
-
Attachment: (was: BiggestObjectAllocationGraph.png)




[jira] [Updated] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails

2018-03-19 Thread Gary Potagal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Potagal updated PDFBOX-4158:
-
Attachment: BiggestObjectAllocationGraph.png




[jira] [Updated] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails

2018-03-19 Thread Gary Potagal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Potagal updated PDFBOX-4158:
-
Attachment: BiggestObjectList.png
BiggestObjectAllocationGraph.png




[jira] [Commented] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails

2018-03-19 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405071#comment-16405071
 ] 

Gary Potagal commented on PDFBOX-4158:
--

We tried using PDFBox 2.0.4 to merge PDF documents in order to let the user 
print all documents at once from the browser. The resulting document is sent 
to the print window. To improve the customer experience, documents are merged 
directly to the servlet HTTP output stream. It is possible for the user to 
close the window, or for the network to time out, resulting in the exception below.

We used JProfiler to diagnose the memory leak, following these instructions: 
[https://www.ej-technologies.com/resources/jprofiler/help/doc/#jprofiler.heapWalker.memoryLeaks]

BiggestObject and Allocation screen captures are attached. We also observed 
that the scratch file is not cleaned up. 

In reviewing the code:

 * org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(MemoryUsageSetting 
memUsageSetting): the finally\{ } block loops over (PDDocument doc : tobeclosed) 
without catching Exception, probably causing the leak that we're seeing.
 * org.apache.pdfbox.cos.COSDocument#close(): if closing a COSStream or COSObject 
results in an IOException, close on scratchFile is never reached. 

As this is bringing down our production servers, I'll try to add debugging and 
use try / catch / finally on every operation to see whether the memory leak can 
be avoided, and will submit a fix for your review.
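The fix direction for both spots is the same pattern, sketched below with a hypothetical SafeClose helper (not the actual patch): close each resource in its own try/catch so that one failure cannot skip the remaining closes, always reach the scratch file, and re-throw the first error at the end:

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.List;

// Hypothetical sketch of the fix direction for COSDocument.close():
// each stream close is wrapped in its own try/catch so a failure cannot
// skip closing the scratch file; the first error is re-thrown only after
// every resource has been attempted.
public final class SafeClose
{
    private SafeClose()
    {
    }

    public static void close(List<? extends Closeable> streams, Closeable scratchFile)
            throws IOException
    {
        IOException firstError = null;
        for (Closeable stream : streams)
        {
            try
            {
                stream.close();
            }
            catch (IOException e)
            {
                if (firstError == null)
                {
                    firstError = e;
                }
            }
        }
        try
        {
            // always reached, unlike the code described above
            scratchFile.close();
        }
        catch (IOException e)
        {
            if (firstError == null)
            {
                firstError = e;
            }
        }
        if (firstError != null)
        {
            throw firstError;
        }
    }
}
```

The same pattern applies to the loop over tobeclosed in mergeDocuments: catch per iteration, never per loop.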

==

org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe
    at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:396)
    at org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:426)
    at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:283)
    at org.apache.catalina.connector.OutputBuffer.writeByte(OutputBuffer.java:440)
    at org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:81)
    at org.apache.catalina.filters.ExpiresFilter$XServletOutputStream.write(ExpiresFilter.java:1016)
    at org.apache.pdfbox.pdfwriter.COSStandardOutputStream.write(COSStandardOutputStream.java:144)
    at org.apache.pdfbox.cos.COSName.writePDF(COSName.java:702)
    at org.apache.pdfbox.pdfwriter.COSWriter.visitFromName(COSWriter.java:1155)
    at org.apache.pdfbox.cos.COSName.accept(COSName.java:672)
    at org.apache.pdfbox.pdfwriter.COSWriter.visitFromDictionary(COSWriter.java:995)
    at org.apache.pdfbox.cos.COSDictionary.accept(COSDictionary.java:1325)
    at org.apache.pdfbox.pdfwriter.COSWriter.doWriteObject(COSWriter.java:522)
    at org.apache.pdfbox.pdfwriter.COSWriter.doWriteObjects(COSWriter.java:460)
    at org.apache.pdfbox.pdfwriter.COSWriter.doWriteBody(COSWriter.java:444)
    at org.apache.pdfbox.pdfwriter.COSWriter.visitFromDocument(COSWriter.java:1096)
    at org.apache.pdfbox.cos.COSDocument.accept(COSDocument.java:419)
    at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1367)
    at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1254)
    at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1232)
    at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:280)
