You need to dig into the font data/format itself.  Since you have access to 
FreeType, you should be able to use its public APIs to get what you need.

Leonard
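
As a rough illustration of what "digging into the font data" can look like 
without any library, here is a minimal sketch that walks the TrueType table 
directory, finds the "cmap" table, and lists the (platform ID, encoding ID) 
pair of each subtable. It assumes the font program bytes have already been 
extracted from the PDF and decompressed; the helper names readU16, readU32 
and listCmapSubtables are made up for this example and are not part of 
PoDoFo or FreeType.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// TrueType stores all multi-byte values big-endian.
static std::uint16_t readU16(const std::vector<std::uint8_t>& d, std::size_t pos) {
    return static_cast<std::uint16_t>((d[pos] << 8) | d[pos + 1]);
}

static std::uint32_t readU32(const std::vector<std::uint8_t>& d, std::size_t pos) {
    return (static_cast<std::uint32_t>(d[pos]) << 24) |
           (static_cast<std::uint32_t>(d[pos + 1]) << 16) |
           (static_cast<std::uint32_t>(d[pos + 2]) << 8) |
           static_cast<std::uint32_t>(d[pos + 3]);
}

// Hypothetical helper: returns the (platform ID, encoding ID) of every
// "cmap" subtable, or an empty list if the table is missing or truncated.
std::vector<std::pair<int, int> > listCmapSubtables(const std::vector<std::uint8_t>& font) {
    std::vector<std::pair<int, int> > result;
    if (font.size() < 12) return result;           // 12-byte font header
    std::size_t numTables = readU16(font, 4);
    for (std::size_t i = 0; i < numTables; ++i) {
        std::size_t rec = 12 + 16 * i;             // 16-byte table records
        if (rec + 16 > font.size()) return result;
        if (font[rec] == 'c' && font[rec + 1] == 'm' &&
            font[rec + 2] == 'a' && font[rec + 3] == 'p') {
            std::size_t cmap = readU32(font, rec + 8);  // table offset in file
            if (cmap + 4 > font.size()) return result;
            std::size_t numSubtables = readU16(font, cmap + 2);
            for (std::size_t j = 0; j < numSubtables; ++j) {
                std::size_t enc = cmap + 4 + 8 * j;     // 8-byte encoding records
                if (enc + 8 > font.size()) break;
                result.push_back(std::make_pair(
                    static_cast<int>(readU16(font, enc)),
                    static_cast<int>(readU16(font, enc + 2))));
            }
            return result;
        }
    }
    return result;
}
```

With FreeType available, the same information should be reachable through 
the charmap list of a loaded face, so a hand-written parser like this is 
mainly useful for understanding the layout.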

From: Filip Djumic <theprop...@gmail.com>
Date: Wednesday, July 17, 2013 9:02 PM
To: "podofo-users@lists.sourceforge.net" <podofo-users@lists.sourceforge.net>
Subject: [Podofo-users] Text extraction for TrueType fonts without encoding 
entry

I'm trying to extract plain text with PoDoFo from a PDF that uses a 
TrueType font; I attached a sample document. The font dictionary has no 
Encoding entry. Here is an excerpt from Adobe's PDF ISO document about this 
case:

A TrueType font program’s built-in encoding maps directly from character codes 
to glyph descriptions by means of an internal data structure called a “cmap” 
(not to be confused with the CMap described in 9.7.5, "CMaps").
...
A “cmap” table may contain one or more subtables that represent multiple 
encodings intended for use on different platforms (such as Mac OS and 
Windows). Each subtable shall be identified by the two numbers, such as 
(3, 1), that represent a combination of a platform ID and a platform-specific 
encoding ID, respectively.
...
When the font has no Encoding entry, or the font descriptor’s Symbolic flag is 
set (in which case the Encoding entry is ignored), this shall occur:
• If the font contains a (3, 0) subtable, the range of character codes shall 
  be one of these: 0x0000 - 0x00FF, 0xF000 - 0xF0FF, 0xF100 - 0xF1FF, or 
  0xF200 - 0xF2FF. Depending on the range of codes, each byte from the string 
  shall be prepended with the high byte of the range, to form a two-byte 
  character, which shall be used to select the associated glyph description 
  from the subtable.
• Otherwise, if the font contains a (1, 0) subtable, single bytes from the 
  string shall be used to look up the associated glyph descriptions from the 
  subtable.
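
The selection rule quoted above can be sketched roughly as follows. This is 
only an illustration under stated assumptions: the list of subtables present 
in the font and the high byte of the (3, 0) range (0x00, 0xF0, 0xF1 or 0xF2) 
are taken as given, and the names SubtableId, CmapLookup, hasSubtable and 
selectLookup are made up for the example and are not PoDoFo or FreeType API.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// (platform ID, encoding ID) pair identifying a "cmap" subtable.
typedef std::pair<int, int> SubtableId;

struct CmapLookup {
    bool found;            // false if neither (3, 0) nor (1, 0) is present
    SubtableId subtable;   // subtable to consult
    std::uint16_t code;    // character code to look up in that subtable
};

static bool hasSubtable(const std::vector<SubtableId>& available,
                        int platform, int encoding) {
    for (std::size_t i = 0; i < available.size(); ++i)
        if (available[i].first == platform && available[i].second == encoding)
            return true;
    return false;
}

// Hypothetical helper: highByte is an assumption supplied by the caller,
// who has to determine which code range the (3, 0) subtable uses.
CmapLookup selectLookup(const std::vector<SubtableId>& available,
                        std::uint8_t highByte, std::uint8_t byteFromString) {
    if (hasSubtable(available, 3, 0)) {
        // Prepend the high byte of the range to form a two-byte code.
        std::uint16_t code =
            static_cast<std::uint16_t>((highByte << 8) | byteFromString);
        return CmapLookup{true, SubtableId(3, 0), code};
    }
    if (hasSubtable(available, 1, 0)) {
        // Single bytes from the string are used directly.
        return CmapLookup{true, SubtableId(1, 0), byteFromString};
    }
    return CmapLookup{false, SubtableId(0, 0), 0};
}
```

For example, with a (3, 0) subtable in the 0xF000 - 0xF0FF range, the string 
byte 0x21 would be looked up as 0xF021; with only a (1, 0) subtable it would 
be looked up as 0x21.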

The PdfFontFactory::CreateFont method does not handle this case, since it 
requires both a font descriptor and an encoding to create a TrueType font. I 
would like to try implementing this myself, but I'm not sure where to start. 
Obviously I first need to get to the cmap table somehow, but I have no idea 
how. In the attached PDF, each text block's font dictionary has these entries:

BaseFont=KAIXMV+Calibri-Bold
FirstChar=33
FontDescriptor dictionary
LastChar=59
Subtype=TrueType
ToUnicode dictionary
Type=Font
Widths array

The ToUnicode dictionary has these entries:
Filter=FlateDecode
Length reference

The cmap doesn't seem to be there, and the PDF ISO document doesn't provide 
any useful details. Does anyone have any hints on this?