I'm trying to extract plain text with PoDoFo from a PDF that uses a TrueType font (sample document attached). The font dictionary has no Encoding entry. Here is an excerpt from Adobe's PDF ISO document about this case:

> A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions by means of an internal data structure called a “cmap” (not to be confused with the CMap described in 9.7.5, "CMaps"). ... A “cmap” table may contain one or more subtables that represent multiple encodings intended for use on different platforms (such as Mac OS and Windows). Each subtable shall be identified by the two numbers, such as (3, 1), that represent a combination of a platform ID and a platform-specific encoding ID, respectively.
>
> ...
>
> When the font has no Encoding entry, or the font descriptor’s Symbolic flag is set (in which case the Encoding entry is ignored), this shall occur:
>
> • If the font contains a (3, 0) subtable, the range of character codes shall be one of these: 0x0000 - 0x00FF, 0xF000 - 0xF0FF, 0xF100 - 0xF1FF, or 0xF200 - 0xF2FF. Depending on the range of codes, each byte from the string shall be prepended with the high byte of the range, to form a two-byte character, which shall be used to select the associated glyph description from the subtable.
>
> • Otherwise, if the font contains a (1, 0) subtable, single bytes from the string shall be used to look up the associated glyph descriptions from the subtable.
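To make the (3, 0) rule from the excerpt concrete, here is a minimal sketch of how a string byte could be turned into the two-byte code used for the subtable lookup. The function names IsSymbolRangeHighByte and MakeSymbolCode are hypothetical and are not part of PoDoFo; the sketch also assumes the high byte of the subtable's code range has already been determined by inspecting the subtable.

```cpp
#include <cstdint>

// Hypothetical helper: true if highByte is the high byte of one of the four
// code ranges the excerpt allows for a (3, 0) subtable
// (0x0000-0x00FF, 0xF000-0xF0FF, 0xF100-0xF1FF, 0xF200-0xF2FF).
bool IsSymbolRangeHighByte(std::uint8_t highByte)
{
    return highByte == 0x00 || highByte == 0xF0 ||
           highByte == 0xF1 || highByte == 0xF2;
}

// Hypothetical helper: prepend the range's high byte to one byte taken from
// the PDF string, giving the two-byte code to look up in the (3, 0) subtable.
std::uint16_t MakeSymbolCode(std::uint8_t rangeHighByte, std::uint8_t stringByte)
{
    return static_cast<std::uint16_t>((rangeHighByte << 8) | stringByte);
}
```

For example, if the subtable uses the range 0xF000 - 0xF0FF, the string byte 0x21 (decimal 33) would be looked up as 0xF021. For a (1, 0) subtable no such step is needed, since the single byte is used directly.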
The PdfFontFactory::CreateFont method does not handle this case, since it requires both a font descriptor and an encoding to create a TrueType font. I would like to try implementing this myself, but I'm not sure where to start. Obviously I first need to get at the "cmap" table somehow, but I have no idea how.

In the attached PDF, each text block's font dictionary has these entries:

  BaseFont = KAIXMV+Calibri-Bold
  FirstChar = 33
  FontDescriptor (dictionary)
  LastChar = 59
  Subtype = TrueType
  ToUnicode (dictionary)
  Type = Font
  Widths (array)

The ToUnicode dictionary has these entries:

  Filter = FlateDecode
  Length (reference)

The "cmap" doesn't seem to be there, and the PDF ISO document doesn't provide any useful details. Does anyone have any hints on this?
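One possible starting point, offered as a sketch rather than a PoDoFo recipe: the "cmap" table is part of the TrueType font program itself, which PDF embeds as the FontFile2 stream referenced from the font descriptor, so it would not appear among the font dictionary or ToUnicode entries. Assuming the decoded FontFile2 bytes are already in memory (how to obtain them through PoDoFo is not shown here), the code below walks the TrueType table directory to find the "cmap" table and then the requested (platform ID, encoding ID) subtable. FindCmapSubtable, ReadU16 and ReadU32 are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// TrueType stores all integers big-endian.
static std::uint16_t ReadU16(const std::vector<std::uint8_t>& d, std::size_t pos)
{
    return static_cast<std::uint16_t>((d[pos] << 8) | d[pos + 1]);
}

static std::uint32_t ReadU32(const std::vector<std::uint8_t>& d, std::size_t pos)
{
    return (static_cast<std::uint32_t>(d[pos]) << 24) |
           (static_cast<std::uint32_t>(d[pos + 1]) << 16) |
           (static_cast<std::uint32_t>(d[pos + 2]) << 8) |
            static_cast<std::uint32_t>(d[pos + 3]);
}

// Hypothetical helper: on success, stores the absolute offset of the
// (platformId, encodingId) subtable in outOffset and returns true.
bool FindCmapSubtable(const std::vector<std::uint8_t>& font,
                      std::uint16_t platformId, std::uint16_t encodingId,
                      std::size_t& outOffset)
{
    if (font.size() < 12)
        return false;
    // Font header: 4-byte version, then numTables at offset 4.
    const std::uint16_t numTables = ReadU16(font, 4);
    for (std::uint16_t i = 0; i < numTables; ++i) {
        // Table records are 16 bytes: tag, checksum, offset, length.
        const std::size_t rec = 12 + 16 * static_cast<std::size_t>(i);
        if (rec + 16 > font.size())
            return false;
        if (std::memcmp(&font[rec], "cmap", 4) != 0)
            continue;
        const std::size_t cmap = ReadU32(font, rec + 8);
        if (cmap + 4 > font.size())
            return false;
        // cmap header: version (2 bytes), number of encoding records (2 bytes).
        const std::uint16_t numSubtables = ReadU16(font, cmap + 2);
        for (std::uint16_t j = 0; j < numSubtables; ++j) {
            // Encoding records are 8 bytes: platform ID, encoding ID, offset.
            const std::size_t enc = cmap + 4 + 8 * static_cast<std::size_t>(j);
            if (enc + 8 > font.size())
                return false;
            if (ReadU16(font, enc) == platformId &&
                ReadU16(font, enc + 2) == encodingId) {
                // The subtable offset is relative to the start of the cmap table.
                const std::size_t sub = cmap + ReadU32(font, enc + 4);
                if (sub >= font.size())
                    return false;
                outOffset = sub;
                return true;
            }
        }
        return false; // cmap table found, but no matching subtable
    }
    return false; // no cmap table in the directory
}
```

Following the quoted rule, one would try FindCmapSubtable(font, 3, 0, offset) first and fall back to (1, 0). Decoding the subtable found at that offset depends on its format field and is not covered by this sketch.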
test page.pdf
Description: Adobe PDF document