[
https://issues.apache.org/jira/browse/PDFBOX-6010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17952029#comment-17952029
]
Tilman Hausherr edited comment on PDFBOX-6010 at 5/16/25 7:22 AM:
------------------------------------------------------------------
How-to questions should be asked on the users mailing list or on Stack Overflow.
But I'll still answer this one: the best approach is to look at the source code
of {{ExtractImages.java}} and adjust it to your needs. It walks through the page
content stream (and other streams) like a renderer would. You extend the
{{PDFGraphicsStreamEngine}} class and implement the {{drawImage()}} method.
The downside is that you may get some images many times (you need a {{Set}} to
skip the duplicates), and you will miss orphan images (images that are in the
file but are never drawn by a content stream).
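
A minimal sketch of that approach, assuming PDFBox 2.0.x (the version named in
the report; 3.x loads documents with Loader.loadPDF instead of PDDocument.load).
The class name ImageCollector and its handle() method are made up for
illustration. The duplicate check follows the idea of keeping a Set of the
underlying COSStream objects. As far as I can tell the real tool also covers
cases this sketch ignores, such as soft masks and images used in patterns.

```java
import java.awt.geom.Point2D;
import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSStream;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

// Hypothetical helper: walks one page like a renderer and reports each image once.
class ImageCollector extends PDFGraphicsStreamEngine {
    private final Set<COSStream> seen; // shared across pages to skip repeats
    int collected = 0;                 // number of images passed to handle()

    ImageCollector(PDPage page, Set<COSStream> seen) {
        super(page);
        this.seen = seen;
    }

    void run() throws IOException {
        processPage(getPage());
    }

    @Override
    public void drawImage(PDImage pdImage) throws IOException {
        if (pdImage instanceof PDImageXObject) {
            COSStream stream = ((PDImageXObject) pdImage).getCOSObject();
            if (!seen.add(stream)) {
                return; // this image stream was already handled
            }
        }
        // Inline images have no shared stream, so they are always handled.
        handle(pdImage);
    }

    // Hypothetical hook: replace with code that saves pdImage.getImage().
    private void handle(PDImage pdImage) {
        collected++;
        System.out.println(pdImage.getWidth() + "x" + pdImage.getHeight()
                + " ." + pdImage.getSuffix());
    }

    // Path and shading operators are irrelevant for image collection: no-ops.
    @Override public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) { }
    @Override public void clip(int windingRule) { }
    @Override public void moveTo(float x, float y) { }
    @Override public void lineTo(float x, float y) { }
    @Override public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) { }
    @Override public Point2D getCurrentPoint() { return new Point2D.Float(0, 0); }
    @Override public void closePath() { }
    @Override public void endPath() { }
    @Override public void strokePath() { }
    @Override public void fillPath(int windingRule) { }
    @Override public void fillAndStrokePath(int windingRule) { }
    @Override public void shadingFill(COSName shadingName) { }

    public static void main(String[] args) throws IOException {
        Set<COSStream> seen = new HashSet<>();
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            for (PDPage page : doc.getPages()) {
                new ImageCollector(page, seen).run();
            }
        }
    }
}
```

Run it with the PDF path as the only argument. Because the engine only follows
what the content streams actually draw, there is no hand-written recursion over
PDResources and therefore no depth limit to tune.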
> PDF Image Extraction resulting in an infinite recursion
> -------------------------------------------------------
>
> Key: PDFBOX-6010
> URL: https://issues.apache.org/jira/browse/PDFBOX-6010
> Project: PDFBox
> Issue Type: Bug
> Reporter: Kabir Soneja
> Priority: Major
> Labels: how-to
>
> Hi,
> I am working on extracting images from a PDF using PDFBox version 2.0.34.
> To do so we have our own recursive logic that walks all PDResources of each
> page and checks every object to filter out the images. This logic has a max
> depth of 25 to avoid infinite recursion.
> When trying the image extraction for the same PDF with the CLI, the image is
> extracted within a second, which suggests that PDFBox itself handles image
> extraction with an ImageGraphicsEngine defined in its source code.
> Can you help me understand:
> * To handle image extraction, is there any API directly provided by
> PDFBox?
> * Is there any way to reuse the image extraction logic within the source
> code, i.e., is it exposed as a public API?
> * Any other suggestions to handle image extraction gracefully with/without
> recursion?
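
On the last question in the report: if the runaway recursion comes from
resources that refer back to themselves, one general alternative to a fixed
depth cap is to remember which objects were already visited and to use an
explicit stack instead of recursive calls. The sketch below is plain Java, not
PDFBox API; Node is a hypothetical stand-in for a resource container such as a
page or a form XObject.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

class ResourceWalk {
    // Hypothetical stand-in for a resource container or an image object.
    static final class Node {
        final String name;
        final boolean image;
        final List<Node> children = new ArrayList<>();

        Node(String name, boolean image) {
            this.name = name;
            this.image = image;
        }
    }

    // Returns the names of all reachable images, each reported once.
    static List<String> collectImages(Node root) {
        // Identity-based set: two references to the same object count as one.
        Set<Node> visited = Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
        Deque<Node> stack = new ArrayDeque<>();
        List<String> images = new ArrayList<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            if (!visited.add(node)) {
                continue; // already seen: breaks cycles and skips duplicates
            }
            if (node.image) {
                images.add(node.name);
            }
            for (Node child : node.children) {
                stack.push(child);
            }
        }
        return images;
    }

    public static void main(String[] args) {
        Node page = new Node("page", false);
        Node form = new Node("form", false);
        Node img = new Node("Im0", true);
        page.children.add(form);
        page.children.add(img);
        form.children.add(img);  // same image referenced a second time
        form.children.add(form); // self-reference, i.e. a cycle
        System.out.println(collectImages(page)); // prints [Im0]
    }
}
```

The walk ends once every reachable object has been visited, so no depth limit
is needed and a cyclic structure cannot loop forever.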