[
https://issues.apache.org/jira/browse/PDFBOX-4975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18073716#comment-18073716
]
Stefan Ziegler commented on PDFBOX-4975:
----------------------------------------
{{PDPageContentStream.appendRawCommands()}} is {{@Deprecated}} but no viable
replacement exists for raw byte-level text output (Type 3 fonts, raw-GID
{{Tj}}, custom-encoded fonts).
We are building a PostScript-to-PDF converter on top of PDFBox 3.x. In three
distinct situations we are forced to call
{{contentStream.appendRawCommands(String)}} because no supported public API in
{{PDAbstractContentStream}} / {{PDPageContentStream}} can do what is needed.
The method is marked {{@Deprecated}} with the only Javadoc justification being
_"Usage of this method is discouraged"_ — with no suggested alternative.
----
h3. Use Case 1 — Raw GID {{Tj}} for CIDFont / Type 0 fonts
{code:java}
contentStream.appendRawCommands(String.format("<%04X> Tj\n", gid));
{code}
When a PostScript font is mapped to a {{PDType0Font}} backed by a CIDFont, the
character selector in the PDF content stream must be a 2-byte hex string
referring to the *glyph ID* directly (e.g. {{<0041> Tj}}). The
{{showText(String)}} path in {{PDAbstractContentStream.showTextInternal()}}
calls {{font.encode(text)}}, which goes through Unicode-based encoding
lookup. We already know the GID; there is *no public API to write a pre-encoded
byte sequence as a {{Tj}} operand* without going through the Unicode→encoding
round-trip.
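To make the expected operand concrete, here is a minimal sketch of how we format GIDs today before handing the string to {{appendRawCommands}}. The class and method names ({{GidOperands}}, {{hexTj}}) are ours and purely illustrative; nothing here is PDFBox API, and it assumes each GID fits in two bytes.

```java
// Illustrative helper, not part of PDFBox: formats glyph IDs as a single
// hex-string operand followed by the Tj operator.
final class GidOperands {
    private GidOperands() {}

    /** Returns e.g. "<0041> Tj\n" for a single GID 0x41. */
    static String hexTj(int... gids) {
        StringBuilder sb = new StringBuilder("<");
        for (int gid : gids) {
            // Assumption: 2-byte codes, so a GID must fit in 16 bits.
            if (gid < 0 || gid > 0xFFFF) {
                throw new IllegalArgumentException("GID out of 2-byte range: " + gid);
            }
            sb.append(String.format("%04X", gid));
        }
        return sb.append("> Tj\n").toString();
    }
}
```

The resulting string is what currently has to go through the deprecated call, because no public method accepts it as an already-encoded operand.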
----
h3. Use Case 2 — Type 3 font rendering
{code:java}
// renderType3String() and op_show() for Type3 fonts:
contentStream.appendRawCommands(hex.toString()); // e.g. "<41> Tj\n"
{code}
{{PDType3Font.encode(int unicode)}} throws {{UnsupportedOperationException("Not
implemented: Type3")}} by design. This means {{showText(String)}} *cannot be
used at all* for Type 3 fonts — it will always throw. The only way to write the
{{Tj}} operator with a raw charcode byte is via {{appendRawCommands}}.
There is no alternative in the current API.
----
h3. Use Case 3 — Custom-encoded / raw-byte fonts ({{showTextBytes()}})
{code:java}
// showTextBytes(): writes (xx) Tj for Standard-14 / custom-encoded Type1 fonts
contentStream.appendRawCommands(sb.toString()); // e.g. "(Hello) Tj\n"
{code}
PostScript uses raw charcode bytes for its text operators, not Unicode strings.
The byte values in a PostScript string correspond directly to positions in the
font's {{Encoding}} array. The PDFBox {{showText(String text)}} path calls
{{font.encode(text)}}, which performs a *Unicode → charcode* lookup.
However, we have already resolved the correct charcode bytes from the
PostScript execution context. Passing the bytes back through a Unicode→encode
round-trip is lossy and incorrect for custom encodings where the mapping may
not be bijective. We need to write the pre-resolved charcode bytes directly as
a PDF string operand followed by {{Tj}}.
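For illustration, a rough sketch of the kind of escaping our {{showTextBytes()}} has to do by hand before calling {{appendRawCommands}}. The names {{LiteralStringOperands}} and {{literalTj}} are hypothetical and the code is independent of PDFBox; it only assumes the usual PDF literal-string rules (backslash-escape parentheses and backslashes, octal escapes for bytes outside printable ASCII).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative helper, not part of PDFBox: writes pre-resolved charcode
// bytes as a PDF literal string operand followed by Tj.
final class LiteralStringOperands {
    private LiteralStringOperands() {}

    static byte[] literalTj(byte[] codes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write('(');
        for (byte b : codes) {
            int c = b & 0xFF; // treat the byte as an unsigned charcode
            if (c == '(' || c == ')' || c == '\\') {
                // Delimiters and the escape character need a backslash.
                out.write('\\');
                out.write(c);
            } else if (c < 0x20 || c > 0x7E) {
                // Non-printable or high bytes: three-digit octal escape.
                out.write('\\');
                out.write('0' + ((c >> 6) & 7));
                out.write('0' + ((c >> 3) & 7));
                out.write('0' + (c & 7));
            } else {
                out.write(c);
            }
        }
        byte[] tail = ") Tj\n".getBytes(StandardCharsets.US_ASCII);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }
}
```

No Unicode lookup is involved at any point: the input bytes are the charcodes, which is exactly what the {{showText(String)}} path cannot express.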
----
h3. Root cause
{{PDAbstractContentStream}} exposes:
* {{protected void write(String)}}: protected, so not callable from code
outside the class hierarchy
* {{protected void writeBytes(byte[])}}: same restriction
* {{public void showText(String)}}: always calls {{font.encode(text)}},
which throws for Type 3 and is incorrect for pre-encoded bytes
There is *no public method* to write a pre-encoded byte sequence as a PDF
string operand (hex {{<...>}} or literal {{(...)}}) followed by {{Tj}},
bypassing the Unicode encoding layer.
----
h3. Feature Request
Please either:
# *Provide a public {{showEncodedText(byte[] encodedBytes)}} method* on
{{PDAbstractContentStream}} that writes the bytes directly as a PDF string
operand followed by {{Tj}}, bypassing {{font.encode()}}. This is similar to
what {{appendRawCommands}} currently enables, but with proper operand
formatting and without the deprecation stigma.
# *Or, at minimum, make {{writeBytes(byte[])}} and {{write(String)}} public*
(or keep them {{protected}} and publish a supported escape hatch) so that
callers who understand the PDF content stream format can safely write
pre-encoded text without relying on the deprecated method.
Without one of these, removing the {{appendRawCommands}} calls would require
either reimplementing PDFBox's internal content stream writer or forking the
library — neither of which is acceptable.
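To show roughly what we have in mind for option 1, here is a standalone sketch of the output such a method could produce. It is written against a plain {{OutputStream}} rather than PDFBox internals, so {{EncodedTextSketch}}, its {{showEncodedText}} signature, and the {{render}} helper are all assumptions for discussion, not a proposal for the exact implementation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the requested method; not part of PDFBox.
final class EncodedTextSketch {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    private EncodedTextSketch() {}

    /** Writes the already-encoded bytes as a hex string operand plus Tj. */
    static void showEncodedText(OutputStream out, byte[] encodedBytes) throws IOException {
        out.write('<');
        for (byte b : encodedBytes) {
            out.write(HEX[(b >> 4) & 0x0F]); // high nibble
            out.write(HEX[b & 0x0F]);        // low nibble
        }
        out.write("> Tj\n".getBytes(StandardCharsets.US_ASCII));
    }

    /** Convenience wrapper that returns the emitted operator as a string. */
    static String render(byte[] encodedBytes) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            showEncodedText(buf, encodedBytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buf.toByteArray(), StandardCharsets.US_ASCII);
    }
}
```

A hex string is used here because it needs no escaping and covers both the 1-byte Type 3 case ({{<41> Tj}}) and the 2-byte CIDFont case ({{<0041> Tj}}) with the same code path.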
> Make PDPageContentStream non-final
> -----------------------------------
>
> Key: PDFBOX-4975
> URL: https://issues.apache.org/jira/browse/PDFBOX-4975
> Project: PDFBox
> Issue Type: Wish
> Reporter: Richard
> Priority: Major
>
> Currently {{PDPageContentStream}} is final.
> There are some situations where it would be useful to define our own
> {{PDPageContentStream}}.
> For example, in my use-case I want to be able to try multiple fonts in case
> the characters in the text are not all in one font. The library I'm using
> accepts {{PDPageContentStream}} instances, so it would be much easier to pass
> a subclass of {{PDPageContentStream}} with the desired behavior rather than
> overhaul the library itself.