Re: [Podofo-users] Text extraction for TrueType fonts without encoding entry

Dominik Seichter Wed, 04 Sep 2013 23:13:13 -0700

See PdfFont::GetObject()->GetStream()->GetData() and similar methods.
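
Once the embedded font bytes are in memory, it can help to check which cmap subtables the font actually carries before handing the buffer to FreeType (FT_New_Memory_Face accepts such a buffer). In a PDF, a TrueType font program is normally stored in the FontFile2 stream referenced from the FontDescriptor. The following is only a minimal sketch under that assumption: it takes a plain byte vector, does not use PoDoFo or FreeType at all, and the helper names ReadU16, ReadU32 and ListCmapSubtables are made up for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// TrueType stores all multi-byte values in big-endian order.
static std::uint16_t ReadU16(const std::vector<std::uint8_t>& d, std::size_t off)
{
    return static_cast<std::uint16_t>((d[off] << 8) | d[off + 1]);
}

static std::uint32_t ReadU32(const std::vector<std::uint8_t>& d, std::size_t off)
{
    return (static_cast<std::uint32_t>(d[off]) << 24) |
           (static_cast<std::uint32_t>(d[off + 1]) << 16) |
           (static_cast<std::uint32_t>(d[off + 2]) << 8) |
            static_cast<std::uint32_t>(d[off + 3]);
}

// Hypothetical helper: returns the (platformID, encodingID) pair of every
// cmap subtable in a raw TrueType font buffer. Returns an empty vector if
// the buffer is too short or has no cmap table.
std::vector<std::pair<int, int>> ListCmapSubtables(const std::vector<std::uint8_t>& font)
{
    std::vector<std::pair<int, int>> result;
    if (font.size() < 12)                    // offset table is 12 bytes
        return result;

    const std::uint16_t numTables = ReadU16(font, 4);
    for (std::uint16_t i = 0; i < numTables; ++i) {
        const std::size_t rec = 12 + 16 * static_cast<std::size_t>(i);
        if (font.size() < rec + 16)          // truncated table directory
            return result;
        const bool isCmap = font[rec] == 'c' && font[rec + 1] == 'm' &&
                            font[rec + 2] == 'a' && font[rec + 3] == 'p';
        if (!isCmap)
            continue;

        // Table record layout: tag, checksum, offset, length (4 bytes each).
        const std::size_t cmap = ReadU32(font, rec + 8);
        if (cmap > font.size() || font.size() - cmap < 4)
            return result;

        // cmap header: version (2 bytes), numTables (2 bytes), followed by
        // 8-byte encoding records: platformID, encodingID, subtable offset.
        const std::uint16_t numSubtables = ReadU16(font, cmap + 2);
        for (std::uint16_t j = 0; j < numSubtables; ++j) {
            const std::size_t entry = cmap + 4 + 8 * static_cast<std::size_t>(j);
            if (font.size() < entry + 8)
                break;
            result.emplace_back(ReadU16(font, entry), ReadU16(font, entry + 2));
        }
        return result;
    }
    return result;
}
```

If the returned list contains (3, 0) or (1, 0), the rules quoted from the PDF specification further down in this thread say which subtable to use for a font without an Encoding entry.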


Cheers
On 05.09.2013 01:34, "Filip Djumic" <theprop...@gmail.com> wrote:

> I'm now trying to use the FreeType API to access the font's cmap table. To
> create a font face, I need either a font file name or a buffer containing
> the font data. How do I get hold of one of these?
> If I understood correctly, the font file is embedded in the PDF, but how do
> I get to it?
> In PoDoFo, only PdfFontMetricsFreetype seems to create new font faces, and
> it does so from the font file name or font data buffer passed as
> constructor arguments. Since I don't have the file name, I guess I need an
> in-memory buffer of the font data, but PdfFontMetricsFreetype gets that
> from the GetWin32Font function, which seems to deal only with fonts known
> to Windows. My font name is something like "TT1.1" and the BaseFont value
> is "KAIXMV+Calibri-Bold", so GetWin32Font doesn't return anything.
> Can anyone help me out with this? I'm completely stuck. I just can't
> figure out how to create a FreeType font face from the data in the PDF
> using PoDoFo.
>
> Filip
>
>
> On Wed, Jul 31, 2013 at 7:25 PM, Filip Djumic <theprop...@gmail.com> wrote:
>
>> Thank you for your reply.
>>
>> If I understood correctly, I need to use the currently unused FT_Library*
>> parameter of the PdfFontFactory::CreateFont function to access the FreeType
>> API for that font.
>> The FreeType API should then provide me with all the data needed to encode
>> and extract the text in this font in this case.
>> Is this a correct outline of how it should be done?
>>
>> F.
>>
>>
>> On Thu, Jul 18, 2013 at 3:08 AM, Leonard Rosenthol <lrose...@adobe.com> wrote:
>>
>>> You need to dig into the font data/format itself. Since you have access
>>> to FreeType, you should be able to use its public APIs to get what you
>>> need.
>>>
>>> Leonard
>>>
>>> From: Filip Djumic <theprop...@gmail.com>
>>> Date: Wednesday, July 17, 2013 9:02 PM
>>> To: "podofo-users@lists.sourceforge.net" <
>>> podofo-users@lists.sourceforge.net>
>>> Subject: [Podofo-users] Text extraction for TrueType fonts without
>>> encoding entry
>>>
>>> I'm trying to extract the plain text with PoDoFo from a PDF that uses
>>> a TrueType font; I attached a sample document. The font dictionary has no
>>> Encoding entry. Here is an excerpt from Adobe's PDF ISO document about this
>>> case:
>>> A TrueType font program’s built-in encoding maps directly from character
>>> codes to glyph descriptions by means of an internal data structure called
>>> a “cmap” (not to be confused with the CMap described in 9.7.5, "CMaps").
>>> ...
>>> A “cmap” table may contain one or more subtables that represent multiple
>>> encodings intended for use on different platforms (such as Mac OS and
>>> Windows). Each subtable shall be identified by the two numbers, such as
>>> (3, 1), that represent a combination of a platform ID and a
>>> platform-specific encoding ID, respectively.
>>> ...
>>> When the font has no Encoding entry, or the font descriptor’s Symbolic
>>> flag is set (in which case the Encoding entry is ignored), this shall
>>> occur:
>>> • If the font contains a (3, 0) subtable, the range of character codes
>>> shall be one of these: 0x0000 - 0x00FF, 0xF000 - 0xF0FF, 0xF100 - 0xF1FF,
>>> or 0xF200 - 0xF2FF. Depending on the range of codes, each byte from the
>>> string shall be prepended with the high byte of the range, to form a
>>> two-byte character, which shall be used to select the associated glyph
>>> description from the subtable.
>>> • Otherwise, if the font contains a (1, 0) subtable, single bytes from
>>> the string shall be used to look up the associated glyph descriptions
>>> from the subtable.
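
The (3, 0) rule in the excerpt above can be sketched as follows. This is an illustration only, not PoDoFo or FreeType code: the function names SymbolRangeHighByte and ToSymbolCode are hypothetical, and the sketch assumes the character codes of the (3, 0) subtable have already been collected into a vector. It also assumes that all codes fall into a single one of the four ranges, which real-world fonts may not always respect.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: given the character codes found in a font's (3, 0)
// cmap subtable, return the high byte of the range they occupy (0x00, 0xF0,
// 0xF1 or 0xF2). Returns -1 if the list is empty or the codes do not all
// lie within one of those four ranges.
int SymbolRangeHighByte(const std::vector<std::uint16_t>& codes)
{
    if (codes.empty())
        return -1;

    const int high = codes.front() >> 8;
    if (high != 0x00 && high != 0xF0 && high != 0xF1 && high != 0xF2)
        return -1;

    for (std::uint16_t code : codes) {
        if ((code >> 8) != high)
            return -1;                       // codes span more than one range
    }
    return high;
}

// Prepend the range's high byte to a single byte taken from the PDF string,
// giving the two-byte code to look up in the (3, 0) subtable.
std::uint16_t ToSymbolCode(std::uint8_t highByte, std::uint8_t stringByte)
{
    return static_cast<std::uint16_t>((highByte << 8) | stringByte);
}
```

For example, if the subtable's codes lie in 0xF000 - 0xF0FF, the string byte 0x21 (decimal 33, the FirstChar value mentioned in this message) would be looked up as 0xF021.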
>>>
>>> The PdfFontFactory::CreateFont method does not handle this case, since
>>> both a font descriptor and an encoding are required to create a TrueType
>>> font. I would like to try doing this myself, but I'm not sure where to
>>> start. Obviously I need to get to the cmap table somehow first, but I have
>>> no idea how. In the attached PDF, each text block's font dictionary has
>>> these entries:
>>>
>>> BaseFont=KAIXMV+Calibri-Bold
>>> FirstChar=33
>>> FontDescriptor dictionary
>>> LastChar=59
>>> Subtype=TrueType
>>> ToUnicode dictionary
>>> Type=Font
>>> Widths array
>>>
>>> ToUnicode dictionary has these entries:
>>> Filter=FlateDecode
>>> Length reference
>>>
>>> The cmap doesn't seem to be there, and the PDF ISO document doesn't provide
>>> any useful details. Does anyone have any hints on this?
>>>
>>
>>
>
>

_______________________________________________
Podofo-users mailing list
Podofo-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/podofo-users
