[ 
https://issues.apache.org/jira/browse/PDFBOX-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18073711#comment-18073711
 ] 

Stefan Ziegler commented on PDFBOX-2838:
----------------------------------------

*Use Case 3 — Custom-encoded / raw-byte fonts*

PostScript text operators take raw charcode bytes, not Unicode strings. 
Each byte value in a PostScript string is an index into the font's 
{{Encoding}} array. The PDFBox {{showText(String text)}} path calls 
{{font.encode(text)}}, which performs a Unicode → charcode lookup. However, 
we have already resolved the correct charcode bytes from the PostScript 
execution context. Decoding those bytes to Unicode and re-encoding them is 
lossy for custom encodings: several charcodes may share one Unicode value, 
and some may have none at all, so the mapping is not guaranteed to be 
bijective. We need to write the pre-resolved charcode bytes directly as a PDF 
string operand followed by {{Tj}}.
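
A minimal sketch of why the round trip loses information. The encoding table and the class {{RoundTripDemo}} are hypothetical and only stand in for a custom font encoding; nothing here is PDFBox code.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: a hypothetical custom encoding, not PDFBox code.
final class RoundTripDemo {
    // charcode -> Unicode. Codes 0x01 and 0x02 both map to 'A',
    // so the reverse (Unicode -> charcode) lookup is ambiguous.
    static final Map<Integer, Character> TO_UNICODE = new TreeMap<>();
    static {
        TO_UNICODE.put(0x01, 'A');
        TO_UNICODE.put(0x02, 'A');
        TO_UNICODE.put(0x03, 'B');
    }

    // Decode charcode bytes to a Unicode string.
    static String decode(byte[] codes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            Character c = TO_UNICODE.get(b & 0xFF);
            if (c == null) {
                throw new IllegalArgumentException("unmapped charcode " + (b & 0xFF));
            }
            sb.append(c.charValue());
        }
        return sb.toString();
    }

    // Stand-in for a Unicode -> charcode lookup: picks the lowest matching code.
    static byte[] encode(String text) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            out[i] = lookup(text.charAt(i));
        }
        return out;
    }

    private static byte lookup(char ch) {
        for (Map.Entry<Integer, Character> e : TO_UNICODE.entrySet()) {
            if (e.getValue() == ch) {
                return (byte) e.getKey().intValue();
            }
        }
        throw new IllegalArgumentException("no charcode for U+" + Integer.toHexString(ch));
    }

    public static void main(String[] args) {
        byte[] original = {0x01, 0x02, 0x03};
        byte[] roundTripped = encode(decode(original));
        System.out.println(Arrays.toString(roundTripped)); // prints [1, 1, 3]
    }
}
```

The second byte comes back as 0x01 instead of 0x02, so the re-encoded string can select a different slot in the {{Encoding}} array than the one PostScript resolved.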


h3. Root cause

{{PDAbstractContentStream}} exposes:
 * {{protected void write(String)}}: protected, so not callable by code that 
neither subclasses the stream nor lives in its package
 * {{protected void writeBytes(byte[])}}: same
 * {{public void showText(String)}}: always calls {{font.encode(text)}}, 
which throws for Type 3 fonts and is incorrect for pre-encoded bytes

There is *no public method* to write a pre-encoded byte sequence as a PDF 
string operand (hex {{<...>}} or literal {{(...)}}) followed by {{Tj}}, 
bypassing the Unicode encoding layer.
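
For reference, the hex form of the operand is simple to produce. The following is a sketch under the assumption that a hex string is acceptable for the target font; {{HexTextOperand}} and {{showTextHex}} are hypothetical names, not PDFBox API.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of PDFBox: formats pre-encoded charcode
// bytes as a PDF hex string operand followed by the Tj operator.
final class HexTextOperand {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    static byte[] showTextHex(byte[] encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(encoded.length * 2 + 6);
        out.write('<');
        for (byte b : encoded) {
            out.write(HEX[(b >> 4) & 0x0F]); // high nibble
            out.write(HEX[b & 0x0F]);        // low nibble
        }
        out.write('>');
        out.write(' ');
        out.write('T');
        out.write('j');
        out.write('\n');
        return out.toByteArray();
    }
}
```

The hex form needs no escaping, which makes it the safer choice when the bytes may contain parentheses, backslashes, or line-ending bytes.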


h3. Feature Request

Please either:
 # *Provide a public {{showEncodedText(byte[] encodedBytes)}} method* on 
{{PDAbstractContentStream}} that writes the bytes directly as a PDF string 
operand followed by {{Tj}}, bypassing {{font.encode()}}. This would cover what 
{{appendRawCommands}} currently enables, but with proper operand formatting 
and without relying on a deprecated method.
 # *Or, at minimum, make {{writeBytes(byte[])}} and {{write(String)}} public* 
(or keep them non-public but provide a documented escape hatch) so that 
callers who understand the PDF content stream format can safely write 
pre-encoded text without relying on the deprecated method.

Without one of these, removing the {{appendRawCommands}} calls would require 
either reimplementing PDFBox's internal content stream writer or forking the 
library — neither of which is acceptable.
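
To make option 1 concrete, here is a rough sketch of the behaviour such a method could have, written against a stand-in class rather than PDFBox itself. {{SketchContentStream}} is hypothetical, the text-mode check is an assumption about how the method would be guarded, and the literal-string form is used only to show the escaping involved.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical stand-in for a content stream; not PDFBox code.
final class SketchContentStream {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private boolean inTextMode;

    void beginText() {
        writeAscii("BT\n");
        inTextMode = true;
    }

    void endText() {
        writeAscii("ET\n");
        inTextMode = false;
    }

    // Proposed behaviour: write the bytes as a literal string operand plus Tj,
    // with no Unicode lookup in between.
    void showEncodedText(byte[] encodedBytes) {
        if (!inTextMode) {
            throw new IllegalStateException("showEncodedText requires beginText() first");
        }
        out.write('(');
        for (byte b : encodedBytes) {
            int c = b & 0xFF;
            switch (c) {
                case '(':
                case ')':
                case '\\':
                    out.write('\\'); // escape delimiters and the escape character
                    out.write(c);
                    break;
                case '\r':
                    writeAscii("\\r"); // an unescaped CR would be read back as LF
                    break;
                case '\n':
                    writeAscii("\\n");
                    break;
                default:
                    out.write(c); // all other bytes pass through unchanged
            }
        }
        writeAscii(") Tj\n");
    }

    byte[] toByteArray() {
        return out.toByteArray();
    }

    private void writeAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            out.write(s.charAt(i));
        }
    }
}
```

A caller would invoke {{showEncodedText}} between {{beginText()}} and {{endText()}} after selecting the font, exactly where {{showText(String)}} is used today.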

> Please make PDPageContentStream non-final
> -----------------------------------------
>
>                 Key: PDFBOX-2838
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2838
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: PDModel
>    Affects Versions: 2.0.0
>            Reporter: Philip Helger
>            Assignee: John Hewson
>            Priority: Major
>             Fix For: 2.0.0
>
>         Attachments: PDPageContentStreamWithCache.java
>
>
> Please make PDPageContentStream non-final, as in certain cases it might be 
> helpful to cache the last set data on a PDPageContentStream (such as the 
> last used Font) to avoid bloating the created PDF. Therefore the methods must 
> be overridable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

