Zer Jun Eng created PDFBOX-6031:
-----------------------------------
Summary: PDFStreamEngine: inconsistent processPage behaviour in multithreading
Key: PDFBOX-6031
URL: https://issues.apache.org/jira/browse/PDFBOX-6031
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 3.0.5 PDFBox
Reporter: Zer Jun Eng
Attachments: Catalogo_Egitto_2025.pdf, image-2025-07-07-22-35-15-823.png
Dear PDFBox developers,
I modified the
[PrintImageLocations.java|https://github.com/apache/pdfbox/blob/3.0.5/examples/src/main/java/org/apache/pdfbox/examples/util/PrintImageLocations.java]
example to count the number of unique images in a PDF document. The minimal
reproducible code is below:
{code:java}
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.contentstream.operator.DrawObject;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorName;
import org.apache.pdfbox.contentstream.operator.state.Concatenate;
import org.apache.pdfbox.contentstream.operator.state.Restore;
import org.apache.pdfbox.contentstream.operator.state.Save;
import org.apache.pdfbox.contentstream.operator.state.SetGraphicsStateParameters;
import org.apache.pdfbox.contentstream.operator.state.SetMatrix;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSObjectKey;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
/**
 * Adapted from
 * https://github.com/apache/pdfbox/blob/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/PrintImageLocations.java
 */
public class CountUniqueImages {
    private final Set<COSObjectKey> uniqueImageKeys = ConcurrentHashMap.newKeySet();

    public int countUniqueImages(File file, int nThreads) throws IOException, InterruptedException {
        try (PDDocument document = Loader.loadPDF(file);
             ExecutorService executor = Executors.newFixedThreadPool(nThreads)) {
            for (PDPage page : document.getPages()) {
                ImageEngine imageEngine = new ImageEngine(page);
                executor.submit(imageEngine);
            }
            executor.shutdown();
            executor.awaitTermination(1, TimeUnit.MINUTES);
            return uniqueImageKeys.size();
        }
    }

    final class ImageEngine extends PDFStreamEngine implements Callable<Object> {
        private static final Object DONE = new Object();
        private final PDPage page;

        public ImageEngine(PDPage page) {
            this.page = page;
            addOperator(new Concatenate(this));
            addOperator(new DrawObject(this));
            addOperator(new SetGraphicsStateParameters(this));
            addOperator(new Save(this));
            addOperator(new Restore(this));
            addOperator(new SetMatrix(this));
        }

        @Override
        protected void processOperator(Operator operator, List<COSBase> operands)
                throws IOException {
            String operation = operator.getName();
            if (OperatorName.DRAW_OBJECT.equals(operation)) {
                COSName objectName = (COSName) operands.get(0);
                PDXObject xobject = getResources().getXObject(objectName);
                if (xobject instanceof PDImageXObject) {
                    PDImageXObject imageXObj = (PDImageXObject) xobject;
                    COSObjectKey key = imageXObj.getCOSObject().getKey();
                    uniqueImageKeys.add(key);
                } else if (xobject instanceof PDFormXObject) {
                    PDFormXObject form = (PDFormXObject) xobject;
                    showForm(form);
                }
            } else {
                super.processOperator(operator, operands);
            }
        }

        @Override
        public Object call() throws Exception {
            processPage(page);
            return DONE;
        }
    }
}
{code}
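One caveat about the reproducer above: the {{Future}} returned by {{executor.submit(...)}} is discarded, so any exception thrown inside {{call()}} is captured by that {{Future}} and never reported, and the return value of {{awaitTermination}} is not checked either. The following is a minimal sketch, using only the JDK, of how task failures could be surfaced. The class {{SurfaceTaskFailures}} and the method {{countFailures}} are hypothetical names introduced for illustration, and the failing task only simulates an error; it is not a claim about what PDFBox actually throws.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SurfaceTaskFailures {
    /** Runs all tasks on a fixed pool and returns how many ended with an exception. */
    static int countFailures(List<Callable<Object>> tasks, int nThreads)
            throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(nThreads);
        try {
            int failures = 0;
            // invokeAll blocks until every task has completed, normally or not.
            for (Future<Object> future : executor.invokeAll(tasks)) {
                try {
                    future.get(); // rethrows a task's exception wrapped in ExecutionException
                } catch (ExecutionException e) {
                    failures++;
                    System.err.println("Task failed: " + e.getCause());
                }
            }
            return failures;
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<Callable<Object>> tasks = new ArrayList<>();
        tasks.add(() -> "ok");
        // Simulated failure standing in for an error raised while processing a page.
        tasks.add(() -> { throw new IOException("simulated parse error"); });
        tasks.add(() -> "ok");
        System.out.println(countFailures(tasks, 2)); // prints 1
    }
}
```

Applied to the reproducer, the same pattern would mean collecting the {{ImageEngine}} instances in a list, passing them to {{invokeAll}}, and calling {{get()}} on each result, so that a page that failed to process shows up as an error rather than as a lower image count.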
Below is the JUnit test to verify the correctness of the multithreaded
implementation. I have also attached the PDF file used for testing:
{code:java}
import static org.junit.jupiter.api.Assertions.*;
import java.io.File;
import java.io.IOException;
import org.junit.jupiter.api.Test;
class CountUniqueImagesTest {
    @Test
    void testSingleThreaded() throws IOException, InterruptedException {
        CountUniqueImages counter = new CountUniqueImages();
        int count =
                counter.countUniqueImages(new File("src/test/resources/Catalogo_Egitto_2025.pdf"), 1);
        assertEquals(122, count);
    }

    @Test
    void testMultiThreaded() throws IOException, InterruptedException {
        CountUniqueImages counter = new CountUniqueImages();
        int count =
                counter.countUniqueImages(new File("src/test/resources/Catalogo_Egitto_2025.pdf"), 4);
        assertEquals(122, count);
    }
}
{code}
I am getting inconsistent results when using multithreading. The PDF file is
expected to contain 122 unique images. Out of 100 test runs, the multithreaded
test case fails 19 times; in those runs, the returned count of unique images
is not 122.
!image-2025-07-07-22-35-15-823.png!
I have read the
[FAQ|https://pdfbox.apache.org/3.0/faq.html#is-pdfbox-thread-safe%3F] and I
understand that PDFBox is not thread-safe. Therefore, this issue might be
related to or a duplicate of https://issues.apache.org/jira/browse/PDFBOX-5541
or https://issues.apache.org/jira/browse/PDFBOX-5542. However, I'm still
wondering if this might be a bug, since my code only performs read-only
operations.
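To illustrate why read-only calls are not necessarily safe to run concurrently, here is a minimal JDK-only sketch. It assumes, as a hypothesis rather than a confirmed diagnosis of this issue, that a getter may populate shared state lazily on first access. {{LazyStore}} and {{readAllGuarded}} are hypothetical names and are not PDFBox classes. The sketch serializes every access on a single lock, which makes the result deterministic at the cost of parallelism.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LazyReadDemo {
    /** Hypothetical stand-in for a structure whose entries are loaded on first access. */
    static final class LazyStore {
        private final Map<Integer, String> cache = new HashMap<>(); // not thread-safe
        private int loads = 0;

        // Looks like a plain getter, but writes to the cache on a miss.
        String get(int key) {
            String value = cache.get(key);
            if (value == null) {
                loads++;
                value = "object-" + key;
                cache.put(key, value);
            }
            return value;
        }

        int loads() {
            return loads;
        }
    }

    /** Reads keys 0..keys-1 from nThreads workers, serializing every access on one lock. */
    static int readAllGuarded(int keys, int nThreads) throws InterruptedException {
        LazyStore store = new LazyStore();
        Object lock = new Object();
        ExecutorService executor = Executors.newFixedThreadPool(nThreads);
        for (int t = 0; t < nThreads; t++) {
            executor.execute(() -> {
                for (int k = 0; k < keys; k++) {
                    synchronized (lock) { // without this lock, loads and cache could be corrupted
                        store.get(k);
                    }
                }
            });
        }
        executor.shutdown();
        if (!executor.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("workers did not finish in time");
        }
        synchronized (lock) {
            return store.loads();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // With the lock in place, each key is loaded exactly once.
        System.out.println(readAllGuarded(100, 4)); // prints 100
    }
}
```

If something similar applies here, two ways to test the hypothesis would be to guard the {{processPage}} calls with a shared lock, or to give each worker its own {{PDDocument}} loaded from the same file, and then check whether the count stays at 122.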