[jira] [Commented] (PDFBOX-4951) Sequences of DIN SPEC 91379 with combining letters are rendered incorrectly

Tilman Hausherr (Jira) Tue, 02 Jun 2026 07:41:08 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085520#comment-18085520
 ]


Tilman Hausherr commented on PDFBOX-4951:
-----------------------------------------

Here's copilot:

====

My review: {*}promising feature work, but not ready to merge yet{*}.
 *  
[GlyphLayoutProcessor for correct glyph layout and support of DIN 91379 (Issue 
PDFBOX-4951)#436|https://github.com/apache/pdfbox/pull/436]
1apache/pdfbox · vk-github18 opened on Apr 5
 
h2. PR summary

 * *Intent:* add a new {{GlyphLayoutProcessor}} path for correct glyph 
shaping/layout, bidi handling, kerning/ligatures, DIN 91379 examples, and 
AcroForm integration.
 * *Scope:* fairly large. It adds new core classes in {{{}pdfbox{}}}, modifies 
{{PDAbstractContentStream}} and {{{}PDAcroForm{}}}/appearance generation, and 
adds many example/resource files.
 * *Discussion activity:* I found one PR comment from the author on *May 31, 
2026* listing follow-up fixes/refactors.
 * *Reviews / inline review comments:* none returned.
 * *CI / checks:* I couldn’t retrieve check-run results from the available 
calls, so merge readiness is assessed from code only.
 * *Risk:* {*}medium-high{*}, because it changes central text rendering 
behavior in {{{}PDAbstractContentStream.showText(){}}}.

h2. Core changes

This PR introduces a second rendering path for text:
 * If a {{GlyphLayoutProcessor}} is attached and the current font is a 
supported {{{}PDType0Font{}}}, {{showText()}} routes through glyph-layout logic.
 * That logic uses Java AWT font layout ({{{}layoutGlyphVector{}}}) to:
 ** shape glyphs,
 ** handle bidi runs,
 ** emit positioned glyph output using {{{}TJ{}}},
 ** and support kerning / ligature options through {{{}FontOptions{}}}.

It also wires the same processor into {*}AcroForm appearance generation{*}, 
which is a meaningful extension beyond normal content streams.

The examples/resources demonstrate:
 * Bengali shaping,
 * Arabic/bidi text,
 * DIN 91379 coverage,
 * ligatures/kerning,
 * SMP characters,
 * missing glyph behavior.

h2. Review findings
h3. 1. *Potential correctness bug in {{showTextPDType0Font}} range handling*

In {{{}GlyphLayoutProcessor{}}}, {{showTextPDType0Font}} calls:
apache / pdfbox / filesv1
 
int[] glyphCodes = glyphVector.getGlyphCodes(start, end - start, new int[end - 
start]);
contentStream.writeTextPDType0Font(glyphCodes);


The method contract is documented as {{start}} / {{{}end{}}}, but the 
implementation treats {{end}} as exclusive and computes {{{}end - start{}}}. 
That’s internally consistent for the current call site, but it is easy to 
misuse and the naming is misleading. Since this is a new helper in core 
rendering code, I’d flag this as a {*}maintainability/correctness risk{*}.

*Why it matters:* small API ambiguities in low-level text output code tend to 
become future bugs quickly.

*Suggested fix:* rename parameters to {{startInclusive}} / 
{{{}endExclusive{}}}, or change the helper to accept {{start}} + {{{}length{}}}.
----
h3. 2. *Likely bidi ordering bug for mixed-direction text*

The mixed bidi path in {{GlyphLayoutProcessor.showText()}} splits runs and 
emits each substring separately:
apache / pdfbox / filesv2
 
for (int i = 0; i < bidi.getRunCount(); i++) {
    int run = bidi.baseIsLeftToRight() ? i : bidi.getRunCount() - 1 - i;
    int start = bidi.getRunStart(run);
    int limit = bidi.getRunLimit(run);
    String part = text.substring(start, limit);
    showTextUni(part);
}


This is a red flag. Unicode bidi reordering is subtler than “reverse runs when 
base direction is RTL.” Mixed neutral characters, numbers, and embedding levels 
can make run concatenation incorrect when you simply substring and render 
sequentially.

*Why it matters:* this PR explicitly claims correct glyph layout and bidi 
support; mixed bidi is one of the hardest cases, and this implementation looks 
simplified.

*Recommendation:* ask for tests covering mixed Arabic/Latin/numbers/punctuation 
cases before merge.
----
h3. 3. *Possible resource-leak / null-handling issues in font loading helpers*

Multiple example/helper methods do:
Java
 
InputStream fontStream = this.getClass().getResourceAsStream(fontPath);
PDType0Font font = glyphLayoutProcessor.loadFont(pdDocument, fontStream);


There’s no null-check on {{{}getResourceAsStream(){}}}. In core loader code, 
{{GlyphLayoutFontLoader.loadFont(...)}} also assumes a non-null 
{{{}inputStream{}}}.

*Why it matters:* for examples this is less severe, but in the new reusable API 
surface a clearer failure message would help. Right now callers may get a 
generic exception path instead of an immediate, descriptive error.

*Suggested improvement:* use {{Objects.requireNonNull(inputStream, "...")}} 
early in loader methods.
h2. Important design concerns
h3. 4. {{GlyphLayoutProcessor}} is stateful and thread-hostile

The class stores mutable state:
 * current {{font}}
 * {{fontSize}}
 * {{contentStream}}

and comments say “Use an object of this class only in one thread.”

That’s acceptable for an internal helper, but the PR exposes it as a public API 
integrated into content streams and forms. This makes misuse easy, especially 
if a caller reuses one processor across streams/documents.

*Concern:* API ergonomics are fragile. A stateless or more tightly scoped 
design would be safer.
----
h3. 5. Behavioral change at a very central integration point

{{PDAbstractContentStream.showText()}} now conditionally diverts from the 
traditional path:
apache / pdfbox / files
 
if(glyphLayoutProcessor != null && glyphLayoutProcessor.supportsFont(font))
{
    glyphLayoutProcessor.showText(text);
}
else
{
    showTextInternal(text);            
    write(" ");
    writeOperator(OperatorName.SHOW_TEXT);
}


This is powerful, but it also means:
 * behavior depends on hidden stream state,
 * text output semantics can change substantially,
 * and debugging output differences may be harder for users.

I wouldn’t block solely on that, but it raises the bar for tests and docs.
h2. What looks good
 * Clean separation between font loading ({{{}GlyphLayoutFontLoader{}}}) and 
shaping/output ({{{}GlyphLayoutProcessor{}}}).
 * Good attempt to preserve fallback behavior when no processor is set.
 * AcroForm integration is thoughtful and likely valuable.
 * {{FontOptions}} for ligatures and kerning is a nice extensibility hook.
 * Examples are extensive and should help users validate behavior visually.

h2. Merge readiness

*Assessment: not ready yet.*
h3. Blockers I’d want addressed
 # *Add targeted tests* for:
 ** mixed bidi strings,
 ** glyph positioning adjustments,
 ** forms with {{{}embedSubset=false{}}},
 ** missing glyph error behavior.
 # *Clarify / harden low-level helper APIs*
 ** especially {{showTextPDType0Font(...)}} semantics.
 # *Add stronger null/error handling*
 ** for font/resource loading.
 # *Document rendering guarantees and limitations*
 ** especially what “correct glyph layout” covers.

h2. Possible improvements
 * Add unit tests for {{hasAdjustments()}} and {{{}GlyphsAndPositions{}}}.
 * Consider making {{GlyphsAndPositions}} package-private if it’s not meant as 
public API.
 * Replace {{replaceAll("\t", " ")}} with {{replace("\t", " ")}} in examples.
 * Close document/content stream resources with try-with-resources in more 
examples.

 

> Sequences of DIN SPEC 91379 with combining letters are rendered incorrectly
> ---------------------------------------------------------------------------
>
>                 Key: PDFBOX-4951
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4951
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering
>    Affects Versions: 2.0.21
>            Reporter: Volker Kunert
>            Priority: Major
>         Attachments: DIN_SPEC_91379_Sequences-aa.pdf, 
> DIN_SPEC_91379_Sequences-ab.pdf, DIN_SPEC_91379_Sequences-ac.pdf, 
> DIN_SPEC_91379_Sequences.txt, DefaultScriptProcessor.java, DejaVuSans.ttf, 
> DoGlyphLayoutBidi.pdf, DoGlyphLayoutDinSpec91379.pdf, 
> DoGlyphLayoutDinSpec91379Form.pdf, DoGlyphPositionBengali.pdf, 
> ExamplePdfboxFopPos-By-Tilman.pdf, ExamplePdfboxFopPos.java, 
> ExamplePdfboxFopPos.pdf, ExamplePdfboxFopPosForm.java, 
> ExamplePdfboxFopPosForm.pdf, FiraCode-Regular.ttf, 
> FontForge-Lohit-Bengali.png, TestPdfbox.java, TestPdfboxFop2.java, 
> TestPdfboxFop2.pdf, TestPdfboxJava2D.java, TestPdfboxJava2D.pdf, bidi-1.png, 
> bidi-2.png, bidi.png, example-PDFBOX-3147-NotoSansThaiLooped-Regular.png, 
> image-2026-05-23-16-16-53-442.png, image-2026-05-23-16-17-28-172.png, 
> image-2026-05-26-16-49-45-529.png, ligatures-kerning.png, 
> patch-2020-10-02.txt, pdfbox.patch, pdfbox.pdf, screenshot-1.png
>
>
> Accented Letters composed of Unicode base letter and combining accent are 
> rendered wrong. E.g. with 0041 030B LATIN CAPITAL LETTER A WITH COMBINING 
> DOUBLE ACUTE ACCENT the accent appears at the right hand side of the letter 
> A, not above the letter A.
> The position is wrong for most of the sequences defined in the following spec:
> DIN SPEC 91379: Characters in Unicode for the electronic processing of names 
> and data 
>  exchange in Europe; with digital attachment
>  [https://www.xoev.de/downloads-2316#StringLatin]
>  [https://www.din.de/de/wdc-beuth:din21:301228458]
>  
> The correct rendering should look like the output of hb-view 2.6.8, see files 
> DIN_SPEC_91379_Sequences*.pdf.
> The output of PDFBox is appended in pdfbox.pdf, which is created by running 
> TestPdfbox.java. The sequences are read from file 
> DIN_SPEC_91379_Sequences.txt.
>  
> Font used for testing: NotoSansMono-Regular.ttf, see 
> [https://www.google.com/get/noto/] 
> download: 
> [https://noto-website-2.storage.googleapis.com/pkgs/NotoSansMono-hinted.zip]
>  See also FOP-2969
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (PDFBOX-4951) Sequences of DIN SPEC 91379 with combining letters are rendered incorrectly

Reply via email to