[jira] [Created] (PDFBOX-4297) Allow to space efficiently analyse large PDFs

2018-08-21 Thread Ralf Hauser (JIRA)
Ralf Hauser created PDFBOX-4297:
---

 Summary: Allow to space efficiently analyse large PDFs
 Key: PDFBOX-4297
 URL: https://issues.apache.org/jira/browse/PDFBOX-4297
 Project: PDFBox
  Issue Type: Improvement
Reporter: Ralf Hauser


Assume you get a large PDF (300+ MB) and need to know:

1) the file names of embedded files if any

2) whether it is encrypted (symmetric or asymmetric)

3) certification level (and whether it is signed)

This should not use more than 5 MB of (extra) memory.

 

P.S.: This seems to be an example of https://pdfbox.apache.org/ideas.html "Handle large 
PDF files".
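
The memory bound asked for above can be illustrated with a rough sketch. This is not PDFBox API and not a replacement for a real parser: it is only a heuristic that streams the file through a fixed-size buffer and looks for a name token such as /Encrypt or /EmbeddedFiles. The class and method names (BoundedTokenScan, containsToken) are made up for illustration, and the scan can give false positives (token inside page content) or false negatives (token inside a compressed object stream).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BoundedTokenScan {

    /** Fixed chunk size; buffer memory stays near this value regardless of file size. */
    private static final int CHUNK = 64 * 1024;

    /** Returns true if the ASCII token occurs anywhere in the file (heuristic only). */
    public static boolean containsToken(Path file, String token) throws IOException {
        byte[] needle = token.getBytes(StandardCharsets.US_ASCII);
        if (needle.length == 0) {
            return true;
        }
        // Keep the last (needle.length - 1) bytes so a token split across chunks is still found.
        int overlap = needle.length - 1;
        byte[] buf = new byte[CHUNK + overlap];
        int carried = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buf, carried, CHUNK)) != -1) {
                int len = carried + read;
                if (indexOf(buf, len, needle) >= 0) {
                    return true;
                }
                carried = Math.min(overlap, len);
                System.arraycopy(buf, len - carried, buf, 0, carried);
            }
        }
        return false;
    }

    /** Naive byte search within the first len bytes of hay. */
    private static int indexOf(byte[] hay, int len, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= len; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (hay[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }
}
```

A call such as BoundedTokenScan.containsToken(path, "/Encrypt") would only hint that an encryption dictionary may be present; deciding between symmetric and asymmetric encryption, or reading the certification level, still needs the actual dictionaries to be parsed.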

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-4290) Memory Leak in SoftReferenceCache

2018-08-21 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587829#comment-16587829
 ] 

Tilman Hausherr commented on PDFBOX-4290:
-

[~galekseev] Sorry, I don't know. Andreas does these. A good idea would be to 
do it with a new PDFBox release, but there haven't been many changes (summer 
heat / vacation etc.).

> Memory Leak in SoftReferenceCache
> -
>
> Key: PDFBOX-4290
> URL: https://issues.apache.org/jira/browse/PDFBOX-4290
> Project: PDFBox
>  Issue Type: Bug
>  Components: JBIG2
>Affects Versions: 3.0.0 JBIG2
>Reporter: Grigoriy Alekseev
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 3.0.2 JBIG2
>
>
> Keys in a HashMap are not garbage-collected because they are not wrapped in 
> weak references.
> For details please see [https://github.com/levigo/jbig2-imageio/issues/53]
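
As a minimal sketch of the idea in the description, and not the actual jbig2-imageio fix: if the map itself holds its keys weakly and its values softly, neither side pins memory once the key is unreachable elsewhere. The class name WeakKeySoftCache is hypothetical.

```java
import java.lang.ref.SoftReference;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

public class WeakKeySoftCache<K, V> {

    // WeakHashMap holds keys through weak references, so an entry disappears
    // once its key is no longer strongly reachable from anywhere else.
    private final Map<K, SoftReference<V>> cache =
            Collections.synchronizedMap(new WeakHashMap<>());

    /** Stores the value behind a SoftReference so it can be reclaimed under memory pressure. */
    public void put(K key, V value) {
        cache.put(key, new SoftReference<>(value));
    }

    /** Returns the cached value, or null if it is absent or was already reclaimed. */
    public V get(K key) {
        SoftReference<V> ref = cache.get(key);
        return ref == null ? null : ref.get();
    }

    /** Removes the entry and returns its value if it is still available. */
    public V remove(K key) {
        SoftReference<V> ref = cache.remove(key);
        return ref == null ? null : ref.get();
    }
}
```

With a plain HashMap in place of the WeakHashMap, the keys would stay strongly reachable through the map even after all values had been cleared, which is the leak the issue describes.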






[jira] [Commented] (PDFBOX-4267) Incorrect rendering when /Matte entry

2018-08-21 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587790#comment-16587790
 ] 

ASF subversion and git services commented on PDFBOX-4267:
-

Commit 1838574 from til...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1838574 ]

PDFBOX-4267: matte entry should use the color space of the image, not the color 
space of the SMask, by Jani Pehkonen

> Incorrect rendering when /Matte entry
> -
>
> Key: PDFBOX-4267
> URL: https://issues.apache.org/jira/browse/PDFBOX-4267
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.11, 3.0.0 PDFBox
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.12, 3.0.0 PDFBox
>
> Attachments: bugzilla888437.pdf, gs-bugzilla688797-reduced.pdf, 
> gs-bugzilla688797.pdf
>
>
> The image softmask in the attached file has a {{/Matte 0 0 0}} entry. PDFBox 
> displays the PDF differently than Adobe Reader, the reflection shown by 
> PDFBox is barely visible. When the /Matte entry is deleted, then it is barely 
> visible in Adobe Reader too. So the /Matte entry does make some difference, 
> although I don't understand how.
> In the PDF specification, the formula shown is c' = m + α × (c - m). So 0 
> should have no effect?!
> I looked at the code of PDF.js, they have a special handling when alpha is 0, 
> don't know why.
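
A small worked sketch may help with the question in the description. With m = 0 the formula reduces to c' = α × c, i.e. the stored samples are premultiplied by alpha, so a zero matte is not a no-op: to recover c, a renderer has to invert the formula, c = m + (c' - m) / α. That inverse is undefined for α = 0, which may be why PDF.js treats that case specially. The class and method names below are illustrative only and are not taken from PDFBox or PDF.js.

```java
public class MatteUnblend {

    /** Preblending as given in the description: c' = m + alpha * (c - m), values in [0, 1]. */
    public static float preblend(float c, float m, float alpha) {
        return m + alpha * (c - m);
    }

    /** Inverse of preblend: c = m + (c' - m) / alpha, clamped to [0, 1]. */
    public static float unblend(float cPrime, float m, float alpha) {
        if (alpha == 0f) {
            // The original colour cannot be recovered; the pixel is fully
            // transparent anyway, so the stored sample is passed through.
            return cPrime;
        }
        float c = m + (cPrime - m) / alpha;
        return Math.max(0f, Math.min(1f, c));
    }
}
```

For example, a sample c = 0.8 preblended with m = 0 and α = 0.25 is stored as 0.2; rendering the stored value without unblending would make it far too dark, which would match a reflection that is "barely visible".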






[jira] [Commented] (PDFBOX-4267) Incorrect rendering when /Matte entry

2018-08-21 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587791#comment-16587791
 ] 

ASF subversion and git services commented on PDFBOX-4267:
-

Commit 1838575 from til...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1838575 ]

PDFBOX-4267: matte entry should use the color space of the image, not the color 
space of the SMask, by Jani Pehkonen







[jira] [Commented] (PDFBOX-4296) Question: Performance

2018-08-21 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587694#comment-16587694
 ] 

Tilman Hausherr commented on PDFBOX-4296:
-

I remember that we had the problem that we read images twice, but I think this 
was fixed with the subsampling change. What comment do you mean re 
transparency? Anyway, that won't apply to you; transparency is relevant when 
rendering the whole PDF. There is no backlog re performance. Can you show a PDF 
where the image extraction is very slow? Just "slower than poppler" doesn't 
mean much: poppler is (AFAIK) written in C++, while PDFBox is written in Java. 
We're also dependent on the image handling libraries.

You can speed up JPEG extraction by using the embedded image data directly; see 
the ExtractImages tool.

> Question: Performance
> -
>
> Key: PDFBOX-4296
> URL: https://issues.apache.org/jira/browse/PDFBOX-4296
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Rendering
>Affects Versions: 2.0.11
>Reporter: Daniel Persson
>Priority: Trivial
>  Labels: performance
>
> Hi Team.
> We use a tool we built using PDFBox to extract text for about 10k pages per 
> day. Then we have another tool to extract images using Poppler.
> We want to use PDFBox for both tasks but sadly we see a slowdown on the order 
> of 3x with PDFBox.
> Do you have any backlog / technical debt / ideas on how to improve 
> performance?
> We have tried -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true 
> and that made image generation much slower.
> We have set System.setProperty("sun.java2d.cmm", 
> "sun.java2d.cmm.kcms.KcmsServiceProvider") in code.
> We use image libraries from twelvemonkeys, pdfbox and the standard jai 
> project.
> I've read in the code that we do double writes for images using transparency 
> which might be a culprit.
> I have been allowed to put some time into the project if we have some solid 
> leads or a roadmap to reach better performance.
> Hope it's okay to track this issue here instead of a question on the mailing 
> list.
> Best regards
> Daniel






[jira] [Commented] (PDFBOX-4267) Incorrect rendering when /Matte entry

2018-08-21 Thread Jani Pehkonen (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587552#comment-16587552
 ] 

Jani Pehkonen commented on PDFBOX-4267:
---

One small problem remaining. The Matte entry should use the color space of the 
image, not the color space of the SMask. This line in method 
PDImageXObject.extractMatte():
{code:java}
matte = softMask.getColorSpace().toRGB(matte);{code}
I think it should be:
{code:java}
matte = getColorSpace().toRGB(matte);{code}
This probably won't cause any visible difference in PDF rendering because 
typically the Matte entry is an array of zeros, so toRGB(matte) will be black.







[jira] [Created] (PDFBOX-4296) Question: Performance

2018-08-21 Thread Daniel Persson (JIRA)
Daniel Persson created PDFBOX-4296:
--

 Summary: Question: Performance
 Key: PDFBOX-4296
 URL: https://issues.apache.org/jira/browse/PDFBOX-4296
 Project: PDFBox
  Issue Type: Improvement
  Components: Rendering
Affects Versions: 2.0.11
Reporter: Daniel Persson


Hi Team.

We use a tool we built using PDFBox to extract text for about 10k pages per 
day. Then we have another tool to extract images using Poppler.

We want to use PDFBox for both tasks but sadly we see a slowdown on the order 
of 3x with PDFBox.

Do you have any backlog / technical debt / ideas on how to improve performance?

We have tried -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true and 
that made image generation much slower.
We have set System.setProperty("sun.java2d.cmm", 
"sun.java2d.cmm.kcms.KcmsServiceProvider") in code.

We use image libraries from twelvemonkeys, pdfbox and the standard jai project.
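
For readers trying the sun.java2d.cmm setting mentioned above, a minimal sketch of where it would go: the property is read when the Java colour management module is first initialised, so it has to be set before any colour or image class is used. The class and method names (CmmSetup, useKcms) are hypothetical, and whether the KCMS provider class is present at all depends on the JDK in use.

```java
public class CmmSetup {

    /** Hypothetical helper: call once, before any java.awt.color or ImageIO class is touched. */
    public static void useKcms() {
        System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
    }

    public static void main(String[] args) {
        useKcms();
        // Open documents and extract images only after this point.
        System.out.println("sun.java2d.cmm = " + System.getProperty("sun.java2d.cmm"));
    }
}
```

The same value can be passed on the command line with -Dsun.java2d.cmm=..., which avoids ordering problems inside the application.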

I've read in the code that we do double writes for images using transparency 
which might be a culprit.

I have been allowed to put some time into the project if we have some solid 
leads or a roadmap to reach better performance.

Hope it's okay to track this issue here instead of a question on the mailing 
list.

Best regards

Daniel


