[jira] [Comment Edited] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images

Tilman Hausherr (JIRA) Mon, 02 Jul 2018 08:54:07 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530102#comment-16530102
 ]


Tilman Hausherr edited comment on PDFBOX-4184 at 7/2/18 3:53 PM:
-----------------------------------------------------------------

I looked at the sizes of the PDF test result files. Have a look at 
bitmask4babgr.pdf and intargb.pdf. The difference isn't just the space needed 
for the extra dictionary: in bitmask4babgr.pdf, the first image had a 
compressed size of 214 and now has a size of 701.

OTOH the file PDFBOX-4184-032163.pdf had a size of 36240 and is now 31607, and 
only 27007 after modifying estCompressSum() to use

sum += Math.abs(aDataRawRowSub);

I'm wondering about the logic of chooseDataRowToWrite(). You're choosing the 
compression method based on the result of estCompressSum() which is the sum of 
the byte values. How would this have any influence on compression? Why would a 
sequence of 00 have a different compression length than a sequence of FF? Your 
comment mentions "This is just the recommend algorithm in the spec" and 
surprisingly, this is true:

[https://medium.com/@duhroach/how-png-works-f1174e3cc7b7]
 That article recommends summing the absolute values of the bytes treated as 
signed values (the change I tried above), but this doesn't make things better 
for the non-photo files.

Same here with more details:
 [https://www.w3.org/TR/PNG-Encoders.html#E.Filter-selection]
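
To make the difference between the two scoring variants concrete, here is a 
minimal Java sketch. The class and method names are hypothetical and are not 
taken from the patch; only the idea (score each candidate filtered row by a 
sum over its bytes and keep the row with the smallest score) comes from the 
PNG encoder notes linked above.

```java
public final class FilterHeuristicSketch {

    /** Sum of the bytes read as unsigned values (0..255). */
    public static long sumUnsigned(byte[] filteredRow) {
        long sum = 0;
        for (byte b : filteredRow) {
            sum += b & 0xFF;
        }
        return sum;
    }

    /** Sum of the absolute values of the bytes read as signed values (-128..127). */
    public static long sumAbsSigned(byte[] filteredRow) {
        long sum = 0;
        for (byte b : filteredRow) {
            sum += Math.abs(b); // b is promoted to int, so -128 maps to 128
        }
        return sum;
    }

    /** Index of the candidate row with the smallest signed-abs sum (first wins on ties). */
    public static int chooseRow(byte[][] candidates) {
        int best = 0;
        long bestSum = Long.MAX_VALUE;
        for (int i = 0; i < candidates.length; i++) {
            long s = sumAbsSigned(candidates[i]);
            if (s < bestSum) {
                bestSum = s;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        byte[] minusOnes = {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF};
        byte[] mixed = {0x10, 0x20, 0x30, 0x40};
        System.out.println(sumUnsigned(minusOnes));  // 1020
        System.out.println(sumAbsSigned(minusOnes)); // 4
        System.out.println(chooseRow(new byte[][] {mixed, minusOnes})); // 1
    }
}
```

With unsigned summing, a row of FF bytes gets the worst possible score even 
though each byte is only a difference of -1. The signed variant scores such a 
row as low as a row of 01 bytes. The score is still only a proxy for the 
deflate output size, so it can pick badly for images with few colors.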

I think we should count colors and/or consider the bit depth, or the geometric 
size of the image, e.g. something below 25x25 is probably an icon rather than 
a photograph.
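
A rough sketch of such a check, assuming a BufferedImage as input. The class 
name, the method name and the color limit of 256 are assumptions made for 
illustration; only the 25x25 threshold comes from the paragraph above.

```java
import java.awt.image.BufferedImage;
import java.util.HashSet;
import java.util.Set;

public final class PredictorGateSketch {

    /** Images smaller than this in both dimensions are treated as icons. */
    static final int ICON_MAX_SIDE = 25;

    /** Assumed limit: more distinct colors than this suggests a photograph. */
    static final int MAX_FLAT_COLORS = 256;

    /** Returns true if the image is small or uses only a few distinct colors. */
    public static boolean looksLikeIcon(BufferedImage image) {
        if (image.getWidth() < ICON_MAX_SIDE && image.getHeight() < ICON_MAX_SIDE) {
            return true;
        }
        Set<Integer> colors = new HashSet<>();
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                colors.add(image.getRGB(x, y));
                if (colors.size() > MAX_FLAT_COLORS) {
                    return false; // stop early, this is likely a photograph
                }
            }
        }
        return true;
    }
}
```

An encoder could skip the predictor-based path when looksLikeIcon() returns 
true and fall back to plain Flate compression for those images.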

The current situation might have a negative impact on the openhtmltopdf 
project, because many web pages have small icons.


> [PATCH]: Support simple lossless compression of 16 bit RGB images
> -----------------------------------------------------------------
>
>                 Key: PDFBOX-4184
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4184
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Writing
>    Affects Versions: 2.0.9
>            Reporter: Emmeran Seehuber
>            Priority: Minor
>             Fix For: 2.0.12, 3.0.0 PDFBox
>
>         Attachments: 16bit.png, LoadGovdocs.java, 
> lossless_predictor_based_imageencoding.patch, 
> lossless_predictor_based_imageencoding_v2.patch, 
> lossless_predictor_based_imageencoding_v3.patch, 
> lossless_predictor_based_imageencoding_v4.patch, 
> lossless_predictor_based_imageencoding_v5.patch, 
> lossless_predictor_based_imageencoding_v6.patch, 
> pdfbox_support_16bit_image_write.patch, png16-arrow-bad-no-smask.pdf, 
> png16-arrow-bad.pdf, png16-arrow-good-no-mask.pdf, png16-arrow-good.pdf
>
>
> The attached patch adds support to write 16 bit per component images 
> correctly. I've integrated a test for this here: 
> [https://github.com/rototor/pdfbox-graphics2d/commit/8bf089cb74945bd4f0f15054754f51dd5b361fe9]
> It only supports 16-Bit TYPE_CUSTOM with DataType == USHORT images - but this 
> is what you usually get when you read a 16 bit PNG file.
> This would also fix [https://github.com/danfickle/openhtmltopdf/issues/173].
> The patch is against 2.0.9, but should apply to 3.0.0 too.
> There is still some room for improvements when writing lossless images, as 
> the images are currently not efficiently encoded. I.e. you could use PNG 
> encodings to get a better compression. (By adding a COSName.DECODE_PARMS with 
> a COSName.PREDICTOR == 15 and encoding the images as PNG). But this is 
> something for a later patch. It would also need another API, as there is a 
> tradeoff speed vs compression ratio. 
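
For illustration, a minimal sketch of what encoding one row with the PNG Sub 
filter could look like. With a PNG predictor, each row in the stream starts 
with a filter-type byte followed by the filtered bytes. The class and method 
names are hypothetical and this is not the code from the attached patches.

```java
public final class PngSubFilterSketch {

    /** PNG filter type 1 (Sub): each byte minus the matching byte of the pixel to the left. */
    static final byte FILTER_SUB = 1;

    /**
     * Returns the filter-type byte followed by the Sub-filtered row.
     * bytesPerPixel would be 6 for 16 bit RGB and 3 for 8 bit RGB.
     */
    public static byte[] encodeRowSub(byte[] raw, int bytesPerPixel) {
        byte[] out = new byte[raw.length + 1];
        out[0] = FILTER_SUB;
        for (int i = 0; i < raw.length; i++) {
            // The first pixel has no left neighbour, so 0 is used instead.
            int left = i >= bytesPerPixel ? raw[i - bytesPerPixel] & 0xFF : 0;
            out[i + 1] = (byte) ((raw[i] & 0xFF) - left); // difference modulo 256
        }
        return out;
    }
}
```

For a smooth gradient the filtered row consists mostly of small repeating 
differences, which usually compress better than the raw samples.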


