Thanks Peter! Now I get it.

Tilman

On 14.09.2025 at 15:30, [email protected] wrote:

> I don't understand the end of the sentence ("not by..."), but I'd say that the word "entirely" means your elements are missing.

The "… /not by nesting of the associated content items/" phrasing refers to the way that marked-content sequences (i.e., *BDC*/*BMC* and *EMC*) in content streams encapsulate graphical content and may be nested. So content stream ordering and nesting are NOT what define the hierarchical relationship among structure elements; only the K entries do.
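
To make the distinction concrete, here is a minimal sketch in Python. It is a toy model only: the tuples stand in for content stream operators and the plain dicts for structure element dictionaries. The names `content_ops`, `content_depths`, `structure_tree` and `structure_paths` are made up for this illustration and are not PDFBox API.

```python
# Toy model only: tuples stand in for content stream operators and plain
# dicts for structure element dictionaries. Not PDFBox API, not a PDF parser.
content_ops = [
    ("BDC", 0), ("EMC", None),   # MCID 0, not nested in anything
    ("BDC", 1), ("EMC", None),   # MCID 1, also at the top level
]

def content_depths(ops):
    """Nesting depth of each MCID inside the content stream."""
    depth, depths = 0, {}
    for op, mcid in ops:
        if op == "BDC":
            depths[mcid] = depth
            depth += 1
        elif op == "EMC":
            depth -= 1
    return depths

# The /K entries alone define the hierarchy: Span is a child of P here.
structure_tree = {
    "S": "Document",
    "K": [{"S": "P", "K": [0, {"S": "Span", "K": [1]}]}],
}

def structure_paths(node, path=(), out=None):
    """Map each MCID to the chain of structure types reached by following /K."""
    out = {} if out is None else out
    path = path + (node["S"],)
    for kid in node["K"]:
        if isinstance(kid, int):
            out[kid] = path          # an integer kid is a marked-content ID
        else:
            structure_paths(kid, path, out)
    return out

print(content_depths(content_ops))
print(structure_paths(structure_tree))
```

Both MCIDs sit at the same depth in the content stream, yet they end up at different levels of the structure hierarchy, because only the K entries are consulted when walking the tree.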

Hope that is clearer and helps.

*From:* Tilman Hausherr <[email protected]>
*Sent:* Friday, 12 September 2025 7:18 PM
*To:* [email protected]
*Subject:* Re: Issue with Tagged PDF with Artifact elements in the Structure Tree - artifact not in its parent's list of children

Yeah it's weird, the effect in image2.pdf is with 17, 19 and 21. These elements are missing in the /K hierarchy. I don't know enough to decide whether it is a bug or not. I do have some understanding of the structure tree stuff, but it's not perfect.

The specification has this: "The hierarchical relationship among structure elements shall be represented entirely by the K entries of the structure element dictionaries, not by nesting of the associated content items."

I don't understand the end of the sentence ("not by..."), but I'd say that the word "entirely" means your elements are missing.

What I don't understand is why PAC doesn't complain.

It's definitely not a PDFBox bug. PDFBox just shows what is there. If you suspect that the parser is broken, open the file with a different tool, e.g. RUPS.

I have written a tool to detect this problem; I wonder whether it occurs with our test set.
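
For illustration, a check of this kind could be sketched as follows. This is a toy model in Python, not the actual tool and not PDFBox API: structure elements are plain dicts with `S`, `P` and `K` keys, and `find_orphans` is a hypothetical helper name. It also assumes the full set of elements has already been collected by some route other than /K (for example the ParentTree), since the missing elements cannot be reached through /K.

```python
def find_orphans(elements):
    """Return elements whose /P parent does not list them in its /K."""
    orphans = []
    for elem in elements:
        parent = elem.get("P")
        if parent is None:
            continue  # root of the toy tree, nothing to check
        # Compare by identity: the dicts reference each other cyclically.
        if not any(kid is elem for kid in parent.get("K", [])):
            orphans.append(elem)
    return orphans

root = {"S": "Document", "P": None, "K": []}
table = {"S": "Table", "P": root, "K": []}
artifact = {"S": "Artifact", "P": table, "K": []}
root["K"].append(table)
# artifact is deliberately NOT appended to table["K"], mirroring the report.

# Assumption: in a real file this list would come from a source other
# than /K, because /K alone never reaches the orphaned element.
all_elements = [root, table, artifact]
print([e["S"] for e in find_orphans(all_elements)])
```

The Artifact element points at its parent through `P`, but the parent's `K` does not point back, so it is the only element reported.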

Tilman

On 11.09.2025 at 18:40, Mark Gibson wrote:

    Hi

    We have some PDFs that are directly exported from Excel.  They
    export as accessible – tagged pdfs with structure tree.

    Within the structure tree are elements of type “Artifact”, often
    used for non-content aspects like background colors, etc.

    When PDFBox (both v2 and v3) reads these PDFs (visible using the
    PDFBox Debugger as seen in attached png, as well as just straight
    up in code), there seems to be a structure tree discrepancy with
    some parent-child relationships.  The Artifact element (found in
    the structure tree) has a pointer back to its parent.  That parent
    has a list of children.  I’d expect that list of children to
    include the Artifact.  However, artifacts are never in their
    parent’s list of children.

    I’m trying to find out if this is expected and part of the PDF
    spec, or a bug in PDFBox.  This is currently causing issues for us
    in FOP when we’re rendering accessible PDF outputs – when
    importing these PDF image files, the import fails and they never
    show up in the final PDF output.  Ultimately, I’m trying to understand if
    the fix should be in PDFBox or FOP.

    I’ve attached two example PDFs, along with an image of the
    structure tree of one of them highlighting the issue.

    Many thanks

    Mark


