[jira] [Commented] (PDFBOX-922) True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)

John Hewson (JIRA) Sat, 14 Jun 2014 14:51:23 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031714#comment-14031714
 ]


John Hewson commented on PDFBOX-922:
------------------------------------

{quote}
I meant the cmap table in TTF actually. They do have cmaps which map from some 
specific encoding's values to glyph indexes. I can understand that my phrasing 
was confusing.
{quote}

Ok, that makes more sense! When the font is subset the cmap table will get 
rewritten, but that's not going to be a problem. It's basically internal to the 
font.

{quote}
There must be some confusion about the 0x10000 CID limit. I simply meant that 
assuming a font contains a glyph which has unicode codepoint above 0x10000, it 
follows that rendering that glyph requires the CIDs to not be treated as UCS-2 
values, because there is no way to represent that codepoint in UCS-2. I was 
mostly trying to weigh between different alternatives. I still like identity 
mappings because that means that conversion from unicode to appropriate GID is 
the simplest possible, at least for TTF fonts with Windows Unicode cmap table.
{quote}

Perhaps we're making the same observation: that CIDs can't be used to represent 
all Unicode points, so identity mapping breaks at some point. The reason you 
can't really do an identity mapping to GID is that GID is the index of the 
glyph in the font, so if you had a font with a single Unicode character, say 
U+2265, you'd need 8,804 empty glyphs in the font prior to it. You can however 
do an identity mapping if you are willing to use GIDs in your strings _but_ 
you'd need to re-encode your strings after subsetting the font in order to do 
this, which is a major hassle.
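
To make the re-encoding hassle concrete, here is a minimal sketch of what subsetting does to strings that were written with GIDs. It is not PDFBox code: the class name, the method names and the GIDs 37 and 500 are made up for illustration, and I'm assuming a subsetter that keeps .notdef at GID 0 and packs the remaining used glyphs in ascending order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch, not part of PDFBox.
class SubsetReencodeSketch {

    // Build the old-GID -> new-GID table for a subset font.
    // Assumption: GID 0 (.notdef) is always kept and stays at index 0,
    // the other used glyphs follow in ascending order of their old GID.
    static Map<Integer, Integer> subsetGidMap(int[] usedGids) {
        TreeSet<Integer> sorted = new TreeSet<Integer>();
        sorted.add(0);
        for (int gid : usedGids) {
            sorted.add(gid);
        }
        Map<Integer, Integer> oldToNew = new LinkedHashMap<Integer, Integer>();
        int next = 0;
        for (int oldGid : sorted) {
            oldToNew.put(oldGid, next++);
        }
        return oldToNew;
    }

    // Rewrite a string of old GIDs so that it refers to the subset font.
    static int[] reencode(int[] gidString, Map<Integer, Integer> oldToNew) {
        int[] out = new int[gidString.length];
        for (int i = 0; i < gidString.length; i++) {
            out[i] = oldToNew.get(gidString[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> map = subsetGidMap(new int[] {500, 37});
        int[] rewritten = reencode(new int[] {500, 37, 500}, map);
        System.out.println(java.util.Arrays.toString(rewritten));
    }
}
```

Every string already written with the old GIDs has to go through something like reencode() once the subset is known, which is why identity-to-GID is awkward in practice.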

{quote}
I know the standard says that PDF String encoding is controlled by a BOM 
appearing at the beginning, but this probably refers to other kinds of text, 
not the kind of text you print on a page! For instance, according to my 
testing, if you actually write text in CID keyed font, your BOM will be treated 
as CID and mapped to a character – or if you try to write with a font that is 
defined to have 8-bit characters, prepending it with a BOM just generates the 
BOM's characters in the text. It was this latter behavior that I spotted 
originally – I tried to generate the three dots ("…") character with 
PDFont.HELVETICA, and saw the BOM characters appear in the text string, along 
with extra spaces between glyphs that were the null bytes in UTF-16 encoding.
{quote}
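
The bytes behind that observation can be reproduced with plain Java, outside PDFBox. This is only a sketch of the byte-level effect (the class and method names are made up); it shows what Java's "UTF-16" charset emits, which is a big-endian BOM followed by two bytes per BMP character. With a simple 8-bit font each of those bytes selects its own glyph.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not part of PDFBox.
class BomBytesSketch {

    // Format bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode with Java's "UTF-16" charset, which prepends a big-endian BOM.
    static String utf16Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_16));
    }

    public static void main(String[] args) {
        // U+2026 HORIZONTAL ELLIPSIS: BOM (FE FF) followed by 20 26
        System.out.println(utf16Hex("\u2026")); // FE FF 20 26
        // ASCII text: every character gets a leading 00 byte
        System.out.println(utf16Hex("Hi"));     // FE FF 00 48 00 69
    }
}
```

Read one byte at a time, FE and FF are ordinary character codes (þ and ÿ in WinAnsiEncoding), and the 00 bytes are the extra codes that showed up between the glyphs.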

Yeah, looking at the spec you're right that the BOM doesn't apply to content 
stream text - I hadn't realised that. However, it seems that composite fonts 
can use encodings that are not fixed to 16 bits:

{quote}
"When the current font is composite, the text-showing operators shall behave 
differently than with simple fonts. For simple fonts, each byte of a string to 
be shown selects one glyph, whereas for composite fonts, a sequence of one or 
more bytes are decoded to select a glyph from the descendant CIDFont."
{quote}

It looks like the (PDF) CMap controls the code length:

{quote}
"The codespace ranges in the CMap (delimited by begincodespacerange and 
endcodespacerange) specify how many bytes are extracted from the string for 
each successive character code. A codespace range shall be specified by a pair 
of codes of some particular length giving the lower and upper bounds of that 
range. A code shall be considered to match the range if it is the same length 
as the bounding codes and the value of each of its bytes lies between the 
corresponding bytes of the lower and upper bounds. The code length shall not 
be greater than 4."
{quote}
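
As a rough sketch of the matching rule in that passage (not PDFBox code, and not a full CMap parser), the following splits a string into character codes using a list of codespace ranges. The class and method names are made up, it tries the shortest code length first, and it simply throws on bytes that match no range, which is a simplification of what a real reader has to do with invalid codes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of PDFBox.
class CodespaceSketch {

    // One codespace range: lower and upper bound, one array entry per byte.
    static final class Range {
        final int[] low;
        final int[] high;

        Range(int[] low, int[] high) {
            this.low = low;
            this.high = high;
        }

        // A code matches if it has the same length as the bounds and each
        // byte lies between the corresponding bytes of the bounds.
        boolean matches(byte[] s, int offset, int length) {
            if (length != low.length) {
                return false;
            }
            for (int i = 0; i < length; i++) {
                int b = s[offset + i] & 0xFF;
                if (b < low[i] || b > high[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    // Split a string into character codes of 1 to 4 bytes each.
    static List<Long> decode(byte[] s, Range[] ranges) {
        List<Long> codes = new ArrayList<Long>();
        int pos = 0;
        while (pos < s.length) {
            int matchedLen = 0;
            for (int len = 1; len <= 4 && pos + len <= s.length && matchedLen == 0; len++) {
                for (Range r : ranges) {
                    if (r.matches(s, pos, len)) {
                        matchedLen = len;
                        break;
                    }
                }
            }
            if (matchedLen == 0) {
                throw new IllegalArgumentException("no codespace range matches at offset " + pos);
            }
            long code = 0;
            for (int i = 0; i < matchedLen; i++) {
                code = (code << 8) | (s[pos + i] & 0xFF);
            }
            codes.add(code);
            pos += matchedLen;
        }
        return codes;
    }

    public static void main(String[] args) {
        // A mixed 1-byte / 2-byte codespace: <00>-<80> and <8140>-<9FFC>
        Range[] mixed = {
            new Range(new int[] {0x00}, new int[] {0x80}),
            new Range(new int[] {0x81, 0x40}, new int[] {0x9F, 0xFC})
        };
        byte[] s = {0x41, (byte) 0x81, 0x40, 0x42};
        System.out.println(decode(s, mixed));
    }
}
```

With a single range <0000>-<FFFF>, the same loop always consumes exactly two bytes per code.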

I guess we can just always generate 16-bit CMaps for composite fonts and be 
done with it.
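
For what it's worth, the writing side of a fixed 16-bit codespace is tiny. A minimal sketch, assuming every character code has already been resolved to a value in 0x0000-0xFFFF (the class and method names are made up, this is not existing PDFBox code):

```java
// Hypothetical sketch, not part of PDFBox.
class TwoByteCodesSketch {

    // Write each code as two bytes, high byte first, as a fixed
    // <0000>-<FFFF> codespace expects.
    static byte[] encode(int[] codes) {
        byte[] out = new byte[codes.length * 2];
        for (int i = 0; i < codes.length; i++) {
            int code = codes[i];
            if (code < 0 || code > 0xFFFF) {
                throw new IllegalArgumentException(
                        "code does not fit in 16 bits: 0x" + Integer.toHexString(code));
            }
            out[2 * i] = (byte) (code >> 8);
            out[2 * i + 1] = (byte) code;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = encode(new int[] {0x2265});
        System.out.printf("%02X %02X%n", bytes[0] & 0xFF, bytes[1] & 0xFF); // 22 65
    }
}
```

Anything above 0xFFFF is rejected here, which is the same limit as in the 0x10000 discussion above: such a code point needs a code assigned by some other mapping before it can be written.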

> True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-922
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-922
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Writing
>    Affects Versions: 1.3.1
>         Environment: JDK 1.6 / OS irrelevant, tried against 1.3.1 and 1.2.0
>            Reporter: Thanos Agelatos
>            Assignee: Andreas Lehmkühler
>         Attachments: pdfbox-unicode.diff, pdfbox-unicode2.diff
>
>
> PDFBox cannot embed Identity-H or Identity-V type TTF fonts in the PDF it 
> creates, making it impossible to create PDFs in any language apart from 
> English and ones supported in WinAnsiEncoding. This behaviour occurs 
> because the method PDTrueTypeFont.loadTTF has WinAnsiEncoding hardcoded, 
> and there are no Identity-H or Identity-V Encoding classes provided (to set 
> afterwards via PDFont.setFont() ).
> This excludes the following languages plus many others:
> - Greek
> - Bulgarian
> - Swedish
> - Baltic languages
> - Maltese
> The PDF created contains garbled characters and/or squares.
> Simple test case:
> PDDocument doc = null;
> try {
>     doc = new PDDocument();
>     PDPage page = new PDPage();
>     doc.addPage(page);
>     // extract fonts for fields
>     byte[] arialNorm = extractFont("arial.ttf");
>     //byte[] arialBold = extractFont("arialbd.ttf");
>     //PDFont font = PDType1Font.HELVETICA;
>     PDFont font = PDTrueTypeFont.loadTTF(doc, new ByteArrayInputStream(arialNorm));
>
>     PDPageContentStream contentStream = new PDPageContentStream(doc, page);
>     contentStream.beginText();
>     contentStream.setFont(font, 12);
>     contentStream.moveTextPositionByAmount(100, 700);
>     // text here may appear garbled; insert any text in Greek or Bulgarian or Maltese
>     contentStream.drawString("Hello world from PDFBox ελληνικά");
>     contentStream.endText();
>     contentStream.close();
>     doc.save("pdfbox.pdf");
>     System.out.println(" created!");
> } catch (Exception ioe) {
>     ioe.printStackTrace();
> } finally {
>     if (doc != null) {
>         try { doc.close(); } catch (Exception e) {}
>     }
> }



--
This message was sent by Atlassian JIRA
(v6.2#6252)
