[ https://issues.apache.org/jira/browse/PDFBOX-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr closed PDFBOX-6031.
-----------------------------------
Resolution: Won't Fix
> PDFStreamEngine: inconsistent processPage behaviour in multithreading
> ---------------------------------------------------------------------
>
> Key: PDFBOX-6031
> URL: https://issues.apache.org/jira/browse/PDFBOX-6031
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 3.0.5 PDFBox
> Reporter: Zer Jun Eng
> Priority: Blocker
> Attachments: Catalogo_Egitto_2025.pdf,
> image-2025-07-07-22-35-15-823.png
>
>
> Dear PDFBox developers,
> I modified the
> [PrintImageLocations.java|https://github.com/apache/pdfbox/blob/3.0.5/examples/src/main/java/org/apache/pdfbox/examples/util/PrintImageLocations.java]
> example to count the number of unique images in a PDF document. The minimal
> reproducible code is below:
> {code:java}
> import java.io.File;
> import java.io.IOException;
> import java.util.List;
> import java.util.Set;
> import java.util.concurrent.Callable;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
> import org.apache.pdfbox.Loader;
> import org.apache.pdfbox.contentstream.PDFStreamEngine;
> import org.apache.pdfbox.contentstream.operator.DrawObject;
> import org.apache.pdfbox.contentstream.operator.Operator;
> import org.apache.pdfbox.contentstream.operator.OperatorName;
> import org.apache.pdfbox.contentstream.operator.state.Concatenate;
> import org.apache.pdfbox.contentstream.operator.state.Restore;
> import org.apache.pdfbox.contentstream.operator.state.Save;
> import org.apache.pdfbox.contentstream.operator.state.SetGraphicsStateParameters;
> import org.apache.pdfbox.contentstream.operator.state.SetMatrix;
> import org.apache.pdfbox.cos.COSBase;
> import org.apache.pdfbox.cos.COSName;
> import org.apache.pdfbox.cos.COSObjectKey;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> import org.apache.pdfbox.pdmodel.graphics.PDXObject;
> import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
> import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
> /**
>  * Adapted from
>  * https://github.com/apache/pdfbox/blob/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/PrintImageLocations.java
>  */
> public class CountUniqueImages {
>     private final Set<COSObjectKey> uniqueImageKeys = ConcurrentHashMap.newKeySet();
>
>     public int countUniqueImages(File file, int nThreads)
>             throws IOException, InterruptedException {
>         try (PDDocument document = Loader.loadPDF(file);
>                 ExecutorService executor = Executors.newFixedThreadPool(nThreads)) {
>             for (PDPage page : document.getPages()) {
>                 ImageEngine imageEngine = new ImageEngine(page);
>                 executor.submit(imageEngine);
>             }
>             executor.shutdown();
>             executor.awaitTermination(1, TimeUnit.MINUTES);
>             return uniqueImageKeys.size();
>         }
>     }
>
>     final class ImageEngine extends PDFStreamEngine implements Callable<Object> {
>         private static final Object DONE = new Object();
>         private final PDPage page;
>
>         public ImageEngine(PDPage page) {
>             this.page = page;
>             addOperator(new Concatenate(this));
>             addOperator(new DrawObject(this));
>             addOperator(new SetGraphicsStateParameters(this));
>             addOperator(new Save(this));
>             addOperator(new Restore(this));
>             addOperator(new SetMatrix(this));
>         }
>
>         @Override
>         protected void processOperator(Operator operator, List<COSBase> operands)
>                 throws IOException {
>             String operation = operator.getName();
>             if (OperatorName.DRAW_OBJECT.equals(operation)) {
>                 COSName objectName = (COSName) operands.get(0);
>                 PDXObject xobject = getResources().getXObject(objectName);
>                 if (xobject instanceof PDImageXObject) {
>                     PDImageXObject imageXObj = (PDImageXObject) xobject;
>                     COSObjectKey key = imageXObj.getCOSObject().getKey();
>                     uniqueImageKeys.add(key);
>                 } else if (xobject instanceof PDFormXObject) {
>                     PDFormXObject form = (PDFormXObject) xobject;
>                     showForm(form);
>                 }
>             } else {
>                 super.processOperator(operator, operands);
>             }
>         }
>
>         @Override
>         public Object call() throws Exception {
>             processPage(page);
>             return DONE;
>         }
>     }
> }
> {code}
> Below is the JUnit test to verify the correctness of the multithreaded
> implementation. I have also attached the PDF file used for testing:
> {code:java}
> import static org.junit.jupiter.api.Assertions.*;
> import java.io.File;
> import java.io.IOException;
> import org.junit.jupiter.api.Test;
> class CountUniqueImagesTest {
>     @Test
>     void testSingleThreaded() throws IOException, InterruptedException {
>         CountUniqueImages counter = new CountUniqueImages();
>         int count = counter.countUniqueImages(
>                 new File("src/test/resources/Catalogo_Egitto_2025.pdf"), 1);
>         assertEquals(122, count);
>     }
>
>     @Test
>     void testMultiThreaded() throws IOException, InterruptedException {
>         CountUniqueImages counter = new CountUniqueImages();
>         int count = counter.countUniqueImages(
>                 new File("src/test/resources/Catalogo_Egitto_2025.pdf"), 4);
>         assertEquals(122, count);
>     }
> }
> {code}
> I am getting inconsistent results when using multithreading. The PDF file is
> expected to contain 122 unique images. Out of 100 test runs, the
> multithreaded test case fails 19 times. In those cases, the code does not
> correctly count the number of unique images.
> !image-2025-07-07-22-35-15-823.png!
> I have read the
> [FAQ|https://pdfbox.apache.org/3.0/faq.html#is-pdfbox-thread-safe%3F] and I
> understand that PDFBox is not thread-safe. Therefore, this issue might be
> related to or a duplicate of
> https://issues.apache.org/jira/browse/PDFBOX-5541 or
> https://issues.apache.org/jira/browse/PDFBOX-5542. However, I'm still
> wondering if this might be a bug, since my code only performs read-only
> operations.
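
For readers landing on this thread: the reproducer above shares one PDDocument across all worker threads and discards both the Future returned by executor.submit() and the boolean returned by awaitTermination(), so an exception thrown inside a worker, or a timeout, would show up only as a lower count. Below is a minimal sketch of a different arrangement, in which every worker opens its own handle and failures are rethrown through Future.get(). It uses only the JDK. PerWorkerScan, PageSource, keysOnPage and collectKeys are hypothetical names made up for this illustration, not PDFBox API; in a PDFBox setting the opener would presumably wrap something like Loader.loadPDF(file), which is an assumption on my part. Whether this removes the inconsistency reported here has not been verified in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerWorkerScan {

    /** Hypothetical stand-in for a document handle owned by a single thread. */
    public interface PageSource<K> extends AutoCloseable {
        Set<K> keysOnPage(int pageIndex) throws Exception;

        @Override
        void close() throws Exception;
    }

    /**
     * Splits page indices 0..pageCount-1 across nThreads workers. Each worker
     * calls the opener once, so no handle is ever shared between threads.
     */
    public static <K> Set<K> collectKeys(int pageCount, int nThreads,
            Callable<PageSource<K>> opener)
            throws InterruptedException, ExecutionException {
        Set<K> keys = ConcurrentHashMap.newKeySet();
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int w = 0; w < nThreads; w++) {
            final int worker = w;
            tasks.add(() -> {
                try (PageSource<K> source = opener.call()) {
                    // Worker w handles pages w, w + nThreads, w + 2 * nThreads, ...
                    for (int p = worker; p < pageCount; p += nThreads) {
                        keys.addAll(source.keysOnPage(p));
                    }
                }
                return null;
            });
        }
        ExecutorService executor = Executors.newFixedThreadPool(nThreads);
        try {
            // invokeAll blocks until every task has finished.
            for (Future<Void> result : executor.invokeAll(tasks)) {
                // get() rethrows a worker failure wrapped in ExecutionException.
                result.get();
            }
        } finally {
            executor.shutdown();
        }
        return keys;
    }

    public static void main(String[] args) throws Exception {
        // Fake source for demonstration: page p reports the single key p % 5.
        Set<Integer> keys = PerWorkerScan.<Integer>collectKeys(20, 4,
                () -> new PageSource<Integer>() {
                    @Override
                    public Set<Integer> keysOnPage(int pageIndex) {
                        return Set.of(pageIndex % 5);
                    }

                    @Override
                    public void close() {
                    }
                });
        System.out.println(keys.size()); // prints 5
    }
}
```

The trade-off is that the file is parsed once per worker rather than once in total, which costs time and memory; in exchange, no parser state is shared, and a failing page surfaces as an exception instead of a silently smaller set.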