Hi,
This can happen when the resources are stored in a parent of the page (you
can see this in PDFDebugger). You can work around it by keeping a set of
the images you have already handled. From the source code of ExtractImages:
if (seen.contains(xobject.getCOSObject()))
{
    // skip duplicate image
    return;
}
seen.add(xobject.getCOSObject());
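
Applied to your loop over the pages, a minimal sketch could look like the
following. It assumes PDFBox 3.x (Loader.loadPDF); the file names and the
compression step itself are placeholders you would replace with your own code:

import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSStream;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public class CompressSharedImages
{
    public static void main(String[] args) throws IOException
    {
        // remembers the underlying image streams that were already processed,
        // so shared images are compressed only once
        Set<COSStream> seen = new HashSet<>();

        try (PDDocument doc = Loader.loadPDF(new File("input.pdf")))
        {
            for (PDPage page : doc.getPages())
            {
                PDResources res = page.getResources();
                if (res == null)
                {
                    continue;
                }
                for (COSName name : res.getXObjectNames())
                {
                    PDXObject xobject = res.getXObject(name);
                    if (!(xobject instanceof PDImageXObject))
                    {
                        continue;
                    }
                    if (seen.contains(xobject.getCOSObject()))
                    {
                        // same image stream as on another page, skip it
                        continue;
                    }
                    seen.add(xobject.getCOSObject());

                    // ... compress / re-encode the image and replace the
                    // XObject here (your existing compression code) ...
                }
            }
            doc.save(new File("output.pdf"));
        }
    }
}

With this check the work drops from 281 * 281 image visits to 281, since each
shared stream is only touched the first time it is encountered.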
Tilman
On 22.07.2025 at 15:24, Richard Kwasnicki wrote:
Hey,
I have a PDF file with 281 pages; each page is basically just one big image.
When I load it with PDFBox, my aim is to compress the images to make them
smaller.
My approach is to load the document, iterate over every page, and check for
each resource on it whether it is of type PDImageXObject. Then I do some
compression.
The crazy thing is, my file somehow has on every page a reference to every
resource. It seems that all the images are somehow shared... So my program now
does 281 * 281 compressions, which is really slow.
I'm not sure what the best way is to detect shared resources; is there some
easy way? Also, if you see other approaches serving the same purpose of
compressing large images, I would be interested...
Best, Richard