[ 
https://issues.apache.org/jira/browse/PDFBOX-4975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18073716#comment-18073716
 ] 

Stefan Ziegler commented on PDFBOX-4975:
----------------------------------------

{{PDPageContentStream.appendRawCommands()}} is {{@Deprecated}}, but no viable 
replacement exists for raw byte-level text output (Type 3 fonts, raw-GID 
{{Tj}}, custom-encoded fonts).

We are building a PostScript-to-PDF converter on top of PDFBox 3.x. In three 
distinct situations we are forced to call 
{{contentStream.appendRawCommands(String)}} because no supported public API in 
{{PDAbstractContentStream}} / {{PDPageContentStream}} can do what is needed. 
The method is marked {{@Deprecated}} with the only Javadoc justification being 
_"Usage of this method is discouraged"_ — with no suggested alternative.
----
h3. Use Case 1 — Raw GID {{Tj}} for CIDFont / Type 0 fonts
{code:java}
contentStream.appendRawCommands(String.format("<%04X> Tj\n", gid));
{code}
When a PostScript font is mapped to a {{PDType0Font}} backed by a CIDFont, the 
character selector in the PDF content stream must be a 2-byte hex string 
referring to the *glyph ID* directly (e.g. {{<0041> Tj}}). The 
{{showText(String)}} path in {{PDAbstractContentStream.showTextInternal()}} 
calls {{font.encode(text)}}, which goes through a Unicode-based encoding 
lookup. We already know the GID; there is *no public API to write a pre-encoded 
byte sequence as a {{Tj}} operand* without going through the Unicode→encoding 
round-trip.
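
For illustration, the operand we currently emit could be built by a small helper like the one below. This is only a sketch: {{GidOperand}} and {{hexTj}} are hypothetical names of ours, not PDFBox API, and it assumes a 2-byte code whose value equals the GID.

```java
// Hypothetical helper, not part of PDFBox.
final class GidOperand {
    private GidOperand() {
    }

    /** Formats a glyph ID as a 2-byte hex string operand followed by Tj. */
    static String hexTj(int gid) {
        // Assumption: the font uses 2-byte codes, so the GID must fit in 16 bits.
        if (gid < 0 || gid > 0xFFFF) {
            throw new IllegalArgumentException("GID out of 2-byte range: " + gid);
        }
        return String.format("<%04X> Tj\n", gid);
    }
}
```

The resulting string is what we pass to {{appendRawCommands(String)}} today.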
----
h3. Use Case 2 — Type 3 font rendering
{code:java}
// renderType3String() and op_show() for Type3 fonts:
contentStream.appendRawCommands(hex.toString()); // e.g. "<41> Tj\n"
{code}
{{PDType3Font.encode(int unicode)}} throws 
{{UnsupportedOperationException("Not implemented: Type3")}} by design. This 
means {{showText(String)}} *cannot be used at all* for Type 3 fonts: it will 
always throw. The only way to write the {{Tj}} operator with a raw charcode 
byte is via {{appendRawCommands}}. There is no alternative in the current API.
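
A minimal sketch of how the {{hex}} buffer above might be assembled from single-byte Type 3 charcodes follows. {{Type3Operand}} and its method are hypothetical names used only for this example and are not PDFBox API.

```java
// Hypothetical helper, not part of PDFBox.
final class Type3Operand {
    private Type3Operand() {
    }

    /** Writes each charcode byte as two hex digits inside <...>, followed by Tj. */
    static String hexTj(byte[] codes) {
        StringBuilder hex = new StringBuilder("<");
        for (byte b : codes) {
            // Mask to 0..255 so bytes >= 0x80 are not sign-extended.
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.append("> Tj\n").toString();
    }
}
```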
----
h3. Use Case 3 — Custom-encoded / raw-byte fonts ({{showTextBytes()}})
{code:java}
// showTextBytes(): writes (xx) Tj for Standard-14 / custom-encoded Type1 fonts
contentStream.appendRawCommands(sb.toString()); // e.g. "(Hello) Tj\n"
{code}

PostScript uses raw charcode bytes for its text operators, not Unicode strings. 
The byte values in a PostScript string correspond directly to positions in the 
font's {{Encoding}} array. The PDFBox {{showText(String text)}} path calls 
{{font.encode(text)}}, which performs a *Unicode → charcode* lookup. 
However, we have already resolved the correct charcode bytes from the 
PostScript execution context. Passing the bytes back through a Unicode→encode 
round-trip is lossy and incorrect for custom encodings where the mapping may 
not be bijective. We need to write the pre-resolved charcode bytes directly as 
a PDF string operand followed by {{Tj}}.
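
As a rough sketch of what writing such a literal string involves, the helper below escapes the bytes that must not appear unescaped in a PDF literal string. {{LiteralOperand}} is a hypothetical name and not PDFBox API. Writing every non-printable byte as an octal escape is a conservative choice we assume here, not a requirement.

```java
// Hypothetical helper, not part of PDFBox.
final class LiteralOperand {
    private LiteralOperand() {
    }

    /** Writes pre-resolved charcode bytes as a literal string operand followed by Tj. */
    static String literalTj(byte[] codes) {
        StringBuilder sb = new StringBuilder("(");
        for (byte b : codes) {
            int c = b & 0xFF;
            if (c == '(' || c == ')' || c == '\\') {
                // Parentheses and backslash are escaped with a backslash.
                sb.append('\\').append((char) c);
            } else if (c < 0x20 || c > 0x7E) {
                // Non-printable bytes are written as three-digit octal escapes.
                sb.append(String.format("\\%03o", c));
            } else {
                sb.append((char) c);
            }
        }
        return sb.append(") Tj\n").toString();
    }
}
```

No Unicode lookup happens at any point: the charcodes from the PostScript string go into the operand unchanged.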
----
h3. Root cause

{{PDAbstractContentStream}} exposes:
 * {{protected void write(String)}}: protected, so not callable from outside 
the class hierarchy
 * {{protected void writeBytes(byte[])}}: same restriction
 * {{public void showText(String)}}: always calls {{font.encode(text)}}, 
which throws for Type 3 and is incorrect for pre-encoded bytes

There is *no public method* to write a pre-encoded byte sequence as a PDF 
string operand (hex {{<...>}} or literal {{(...)}}) followed by {{Tj}}, 
bypassing the Unicode encoding layer.
----
h3. Feature Request

Please either:
 # *Provide a public {{showEncodedText(byte[] encodedBytes)}} method* on 
{{PDAbstractContentStream}} that writes the bytes directly as a PDF string 
operand + {{Tj}}, bypassing {{font.encode()}}. This would be similar to what 
{{appendRawCommands}} currently enables, but with proper operand formatting and 
without the deprecation stigma.
 # *Or, at minimum, make {{writeBytes(byte[])}} and {{write(String)}} public* 
(or package-private with a published escape hatch) so that callers who 
understand the PDF content stream format can safely write pre-encoded text 
without relying on the deprecated method.

Without one of these, removing the {{appendRawCommands}} calls would require 
either reimplementing PDFBox's internal content stream writer or forking the 
library — neither of which is acceptable.
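
To make the first option concrete, here is one possible shape for such a method, written against a plain {{OutputStream}} so that it stands alone. Everything in it is an assumption for discussion: {{EncodedTextWriter}} is a hypothetical class, {{showEncodedText}} does not exist in PDFBox today, and a real implementation might emit a literal string where this sketch emits hex.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a content stream; not part of PDFBox.
final class EncodedTextWriter {
    private static final byte[] HEX =
            "0123456789ABCDEF".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] TJ =
            "> Tj\n".getBytes(StandardCharsets.US_ASCII);

    private final OutputStream out;

    EncodedTextWriter(OutputStream out) {
        this.out = out;
    }

    /** Writes already-encoded bytes as a hex string operand followed by Tj. */
    void showEncodedText(byte[] encodedBytes) throws IOException {
        out.write('<');
        for (byte b : encodedBytes) {
            out.write(HEX[(b >> 4) & 0x0F]); // high nibble
            out.write(HEX[b & 0x0F]);        // low nibble
        }
        out.write(TJ);
    }
}
```

The caller supplies only the bytes, and the method formats the operand, which is the part {{appendRawCommands}} leaves to us.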

> Make PDPageContentStream non-final 
> -----------------------------------
>
>                 Key: PDFBOX-4975
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4975
>             Project: PDFBox
>          Issue Type: Wish
>            Reporter: Richard
>            Priority: Major
>
> Currently {{PDPageContentStream}} is final.
> There are some situations where it would be useful to define our own 
> {{PDPageContentStream}}.
> For example, in my use-case I want to be able to try multiple fonts in case 
> the characters in the text are not all in one font. The library I'm using 
> accepts {{PDPageContentStream}} instances, so it would be much easier to pass 
> a subclass of {{PDPageContentStream}} with the desired behavior rather than 
> overhaul the library itself.



