I'm trying to extract plain text with PoDoFo from a PDF that uses a
TrueType font (sample document attached). The font dictionary has no
Encoding entry. Here is an excerpt from Adobe's PDF ISO document about this
case:
*A TrueType font program’s built-in encoding maps directly from character
codes to glyph descriptions by means of an internal data structure called a
“cmap” (not to be confused with the CMap described in 9.7.5, "CMaps").
...
A “cmap” table may contain one or more subtables that represent multiple
encodings intended for use on different platforms (such as Mac OS and
Windows). Each subtable shall be identified by the two numbers, such as
(3, 1), that represent a combination of a platform ID and a
platform-specific encoding ID, respectively.*
...
*When the font has no Encoding entry, or the font descriptor’s Symbolic
flag is set (in which case the Encoding entry is ignored), this shall occur:
• If the font contains a (3, 0) subtable, the range of character codes
  shall be one of these: 0x0000 - 0x00FF, 0xF000 - 0xF0FF, 0xF100 - 0xF1FF,
  or 0xF200 - 0xF2FF. Depending on the range of codes, each byte from the
  string shall be prepended with the high byte of the range, to form a
  two-byte character, which shall be used to select the associated glyph
  description from the subtable.
• Otherwise, if the font contains a (1, 0) subtable, single bytes from the
  string shall be used to look up the associated glyph descriptions from
  the subtable.*
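
To make the rule quoted above concrete, here is a rough C++ sketch of the
subtable selection and the two-byte code construction. It assumes the raw
bytes of the font program's "cmap" table are already in memory (how to get
them out of the PDF is not shown), and it relies on the standard TrueType
layout of that table: a 16-bit version, a 16-bit record count, then 8-byte
records of platform ID, encoding ID and a 32-bit offset, all big-endian.
The names Bytes, FindCmapSubtable, SelectSymbolicSubtable and MakeSymbolCode
are made up for illustration and are not part of PoDoFo.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Read big-endian integers; the caller guarantees the bytes are in range.
static std::uint16_t ReadU16(const Bytes& d, std::size_t pos)
{
    return static_cast<std::uint16_t>((d[pos] << 8) | d[pos + 1]);
}

static std::uint32_t ReadU32(const Bytes& d, std::size_t pos)
{
    return (static_cast<std::uint32_t>(ReadU16(d, pos)) << 16) | ReadU16(d, pos + 2);
}

// Hypothetical helper: returns the offset of the (platformId, encodingId)
// subtable relative to the start of the "cmap" table, or -1 if absent.
std::int64_t FindCmapSubtable(const Bytes& cmap, std::uint16_t platformId,
                              std::uint16_t encodingId)
{
    if (cmap.size() < 4)
        return -1;
    const std::size_t numTables = ReadU16(cmap, 2);
    for (std::size_t i = 0; i < numTables; ++i) {
        const std::size_t rec = 4 + i * 8;   // records follow the 4-byte header
        if (rec + 8 > cmap.size())
            break;                           // truncated table
        if (ReadU16(cmap, rec) == platformId && ReadU16(cmap, rec + 2) == encodingId) {
            const std::uint32_t offset = ReadU32(cmap, rec + 4);
            return offset < cmap.size() ? static_cast<std::int64_t>(offset) : -1;
        }
    }
    return -1;
}

struct SubtableChoice {
    bool found;
    std::uint16_t platformId;
    std::uint16_t encodingId;
    std::int64_t offset;
};

// Order from the quoted rule: prefer (3, 0), otherwise fall back to (1, 0).
SubtableChoice SelectSymbolicSubtable(const Bytes& cmap)
{
    std::int64_t off = FindCmapSubtable(cmap, 3, 0);
    if (off >= 0)
        return SubtableChoice{true, 3, 0, off};
    off = FindCmapSubtable(cmap, 1, 0);
    if (off >= 0)
        return SubtableChoice{true, 1, 0, off};
    return SubtableChoice{false, 0, 0, -1};
}

// For a (3, 0) subtable: prepend the high byte of the subtable's code range
// (0x00, 0xF0, 0xF1 or 0xF2) to one byte taken from the string.
std::uint16_t MakeSymbolCode(std::uint8_t stringByte, std::uint8_t rangeHighByte)
{
    return static_cast<std::uint16_t>((rangeHighByte << 8) | stringByte);
}
```

Working out which of the four ranges a given (3, 0) subtable uses, and the
glyph lookup inside the subtable itself, are left out here; both depend on
the subtable format.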

The PdfFontFactory::CreateFont method does not handle this case, since it
requires both a font descriptor and an encoding to create a TrueType font.
I would like to try implementing this myself, but I'm not sure where to
start. Obviously I first need to get to the cmap table somehow, but I have
no idea how. In the attached PDF, each text block's font dictionary has
these entries:

BaseFont=KAIXMV+Calibri-Bold
FirstChar=33
FontDescriptor dictionary
LastChar=59
Subtype=TrueType
ToUnicode dictionary
Type=Font
Widths array

The ToUnicode dictionary has these entries:
Filter=FlateDecode
Length reference
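
The Filter and Length keys suggest that ToUnicode is a stream rather than a
plain dictionary. If so, its decompressed content would normally be a
ToUnicode CMap, which maps character codes to Unicode text in sections
delimited by beginbfchar/endbfchar and beginbfrange/endbfrange. As a rough
sketch, assuming the stream has already been inflated into a std::string,
the bfchar sections could be read like this. ParseBfChar is a made-up name,
not a PoDoFo function, and bfrange sections are not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical helper: collect "<src> <dst>" pairs from every bfchar section.
// The destination is kept as UTF-16 code units, four hex digits per unit.
std::map<std::uint32_t, std::u16string> ParseBfChar(const std::string& cmapText)
{
    std::map<std::uint32_t, std::u16string> result;
    std::size_t pos = 0;
    while ((pos = cmapText.find("beginbfchar", pos)) != std::string::npos) {
        pos += 11;                                   // skip "beginbfchar"
        const std::size_t end = cmapText.find("endbfchar", pos);
        if (end == std::string::npos)
            break;                                   // unterminated section
        bool haveSrc = false;
        std::uint32_t src = 0;
        while (true) {
            const std::size_t open = cmapText.find('<', pos);
            if (open == std::string::npos || open >= end)
                break;
            const std::size_t close = cmapText.find('>', open);
            if (close == std::string::npos || close > end)
                break;
            std::string hex;                         // hex digits only
            for (std::size_t i = open + 1; i < close; ++i) {
                if (std::isxdigit(static_cast<unsigned char>(cmapText[i])))
                    hex += cmapText[i];
            }
            if (!haveSrc) {
                // First token of a pair: the source character code.
                src = hex.empty() ? 0 : static_cast<std::uint32_t>(std::stoul(hex, nullptr, 16));
                haveSrc = true;
            } else {
                // Second token: the Unicode text for that code.
                std::u16string dst;
                for (std::size_t i = 0; i + 4 <= hex.size(); i += 4)
                    dst += static_cast<char16_t>(std::stoul(hex.substr(i, 4), nullptr, 16));
                result[src] = dst;
                haveSrc = false;
            }
            pos = close + 1;
        }
        pos = end + 9;                               // skip "endbfchar"
    }
    return result;
}
```

With a table like this, each byte of a shown string could be looked up
directly, without going through the font program's cmap at all, provided
the ToUnicode CMap covers every code the document uses.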

The cmap doesn't seem to be there, and the PDF ISO document doesn't provide
any useful details. Does anyone have any hints on this?

Attachment: test page.pdf
Description: Adobe PDF document

_______________________________________________
Podofo-users mailing list
Podofo-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/podofo-users
