Re: [Podofo-users] Text extraction for TrueType fonts without encoding entry

Filip Djumic Wed, 31 Jul 2013 10:27:37 -0700

Thank you for your reply.

If I understood correctly, I need to use the currently unused FT_Library*
parameter of the PdfFontFactory::CreateFont function to access the FreeType
API for that font.
The FreeType API should then give me all the data needed to handle the
encoding and extract the text for this font.
Is this a correct outline of how it should be done?
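
To make the lookup rule from the spec excerpt quoted below concrete, here is a
minimal sketch of how I read it. The helper names (MakeSymbolCode,
CodesForLookup) are hypothetical and not part of PoDoFo or FreeType, and the
sketch assumes the caller already knows whether the font has a (3, 0) subtable
and which of the four ranges its codes fall in.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: prepend the high byte of the (3, 0) range
// (0x00, 0xF0, 0xF1 or 0xF2) to one byte taken from the PDF string.
std::uint16_t MakeSymbolCode(std::uint8_t highByte, std::uint8_t stringByte)
{
    return static_cast<std::uint16_t>((highByte << 8) | stringByte);
}

// Hypothetical helper: convert the bytes of a PDF string into the codes
// that would be looked up in the font's "cmap" subtable.
//   hasSymbolSubtable == true  : (3, 0) case, two-byte codes are formed
//   hasSymbolSubtable == false : (1, 0) case, single bytes are used as-is
std::vector<std::uint16_t> CodesForLookup(const std::string& pdfString,
                                          bool hasSymbolSubtable,
                                          std::uint8_t highByte)
{
    std::vector<std::uint16_t> codes;
    codes.reserve(pdfString.size());
    for (unsigned char b : pdfString) {
        if (hasSymbolSubtable) {
            codes.push_back(MakeSymbolCode(highByte, b));
        } else {
            codes.push_back(static_cast<std::uint16_t>(b));
        }
    }
    return codes;
}
```

As far as I can tell from the FreeType headers, the subtables are exposed as
face->charmaps (each with platform_id and encoding_id), one can be activated
with FT_Set_Charmap, and FT_Get_Char_Index then maps a code to a glyph index.
Note that this only yields glyph indices, so it covers the lookup step rather
than the full path to extracted text.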


F.


On Thu, Jul 18, 2013 at 3:08 AM, Leonard Rosenthol <lrose...@adobe.com> wrote:

> You need to dig into the font data/format itself.  Since you have access
> to FreeType, you should be able to use its public APIs to get what you
> need.
>
> Leonard
>
> From: Filip Djumic <theprop...@gmail.com>
> Date: Wednesday, July 17, 2013 9:02 PM
> To: "podofo-users@lists.sourceforge.net" <
> podofo-users@lists.sourceforge.net>
> Subject: [Podofo-users] Text extraction for TrueType fonts without
> encoding entry
>
> I'm trying to extract plain text with podofo from a PDF that uses a
> TrueType font; I attached a sample document. The font dictionary has no
> Encoding entry. Here is an excerpt from Adobe's PDF ISO document about this
> case:
> *
> A TrueType font program’s built-in encoding maps directly from character
> codes to glyph descriptions by means of an internal data structure called
> a “cmap” (not to be confused with the CMap described in 9.7.5, "CMaps").
> ...
> A “cmap” table may contain one or more subtables that represent multiple
> encodings intended for use on different platforms (such as Mac OS and
> Windows). Each subtable shall be identified by the two numbers, such as
> (3, 1), that represent a combination of a platform ID and a
> platform-specific encoding ID, respectively. *
> ...
> *When the font has no Encoding entry, or the font descriptor’s Symbolic
> flag is set (in which case the Encoding entry is ignored), this shall
> occur:
> • If the font contains a (3, 0) subtable, the range of character codes
>   shall be one of these: 0x0000 - 0x00FF, 0xF000 - 0xF0FF,
>   0xF100 - 0xF1FF, or 0xF200 - 0xF2FF. Depending on the range of codes,
>   each byte from the string shall be prepended with the high byte of the
>   range, to form a two-byte character, which shall be used to select the
>   associated glyph description from the subtable.
> • Otherwise, if the font contains a (1, 0) subtable, single bytes from the
>   string shall be used to look up the associated glyph descriptions from
>   the subtable.*
>
> The PdfFontFactory::CreateFont method does not handle this case, since it
> requires both a font descriptor and an encoding to create a TrueType font.
> I would like to try implementing this myself, but I'm not sure where to
> start. Obviously I first need to get to the cmap table somehow, but I have
> no idea how. In the attached PDF, each text block's font dictionary has
> these entries:
>
> BaseFont=KAIXMV+Calibri-Bold
> FirstChar=33
> FontDescriptor dictionary
> LastChar=59
> Subtype=TrueType
> ToUnicode dictionary
> Type=Font
> Widths array
>
> ToUnicode dictionary has these entries:
> Filter=FlateDecode
> Length reference
>
> The cmap doesn't seem to be there, and the PDF ISO document doesn't provide
> any useful details. Does anyone have any hints on this?
>
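
For reference, the "cmap" table is part of the TrueType font program itself
rather than the PDF font dictionary, and its header is simple enough to walk
by hand. The following is a minimal sketch that assumes the raw bytes of the
"cmap" table have already been extracted from the font program (that step is
not shown). CmapRecord, ListCmapRecords and PickSubtable are hypothetical
names chosen for illustration, not PoDoFo or FreeType API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One encoding record from the "cmap" table header.
struct CmapRecord {
    std::uint16_t platformId;
    std::uint16_t encodingId;
    std::uint32_t offset;  // subtable offset from the start of the table
};

// TrueType data is big-endian.
static std::uint16_t ReadU16(const std::vector<std::uint8_t>& d, std::size_t pos)
{
    return static_cast<std::uint16_t>((d[pos] << 8) | d[pos + 1]);
}

static std::uint32_t ReadU32(const std::vector<std::uint8_t>& d, std::size_t pos)
{
    return (static_cast<std::uint32_t>(ReadU16(d, pos)) << 16) | ReadU16(d, pos + 2);
}

// Header layout: uint16 version, uint16 numTables, then numTables records
// of 8 bytes each (platform ID, encoding ID, 32-bit offset).
std::vector<CmapRecord> ListCmapRecords(const std::vector<std::uint8_t>& cmap)
{
    std::vector<CmapRecord> records;
    if (cmap.size() < 4) {
        return records;
    }
    const std::size_t count = ReadU16(cmap, 2);
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t pos = 4 + i * 8;
        if (pos + 8 > cmap.size()) {
            break;  // truncated table: stop instead of reading past the end
        }
        records.push_back(CmapRecord{ReadU16(cmap, pos),
                                     ReadU16(cmap, pos + 2),
                                     ReadU32(cmap, pos + 4)});
    }
    return records;
}

// Apply the order from the spec excerpt: prefer (3, 0), otherwise (1, 0).
// Returns nullptr when neither subtable is present.
const CmapRecord* PickSubtable(const std::vector<CmapRecord>& records)
{
    const CmapRecord* mac = nullptr;
    for (const CmapRecord& r : records) {
        if (r.platformId == 3 && r.encodingId == 0) {
            return &r;
        }
        if (r.platformId == 1 && r.encodingId == 0 && mac == nullptr) {
            mac = &r;
        }
    }
    return mac;
}
```

The sketch only identifies which subtable to use; decoding the subtable body
depends on its format and is left out here.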
