Hi Peter,
> When I use 1.7.0 NO text is written. Instead the characters are replaced by outline glyphs using <svg:path>. The visual layout is effectively the same as the input PDF but there are no explicit characters.

Wow! They managed to implement it the way Adobe suggested!

> I guess that in 1.7.0 NO characters are transmitted to drawString and that everything is drawShape(), with the precomputed glyphs.

Yes, you have to trace where the g.draw(Shape) / g.fill(Shape) calls are coming from. Not an easy task, however, since paths are also drawn with the same methods. Maybe I'll download the new version and look deeper.

Andrey

From: [email protected] [mailto:[email protected]] On behalf of Peter Murray-Rust
Sent: Tuesday, 8 May 2012 00:59
To: Andrey Kuznetsov
Cc: [email protected]
Subject: Re: Extracting vector graphics from PDF

Thanks

On Mon, May 7, 2012 at 11:35 PM, Andrey Kuznetsov <[email protected]> wrote:

> Peter,
> the parser is like a platypus - it does not do much - it just parses the CFF font and creates some CFF-specific objects. As I already said, this is only half of the work - I have to implement a Type1 font writer to make it work.

OK

> Regarding the hack, I think that PdfBox already has it. You may get the encoding and font metrics from it.

I think that's probably true.

> What I really don't understand is - what exactly is not working?

The AWT commands are being captured by SVGGraphics2D (from Batik, but also pre-intercepted by a shell from me to capture any font stuff).

I have now run this twice, once with pdfbox-1.6.0 (from maven) and once with pdfbox-1.7.0-SNAPSHOT. The 1.6.0 run captures the characters (e.g. "T" "h" "e") and their coordinates as <svg:text>. When I display them the page is laid out exactly, except for the font, which is some default.

When I use 1.7.0 NO text is written. Instead the characters are replaced by outline glyphs using <svg:path>. The visual layout is effectively the same as the input PDF but there are no explicit characters.
> If Font is working on "normal" Graphics it should also work on your "hacked" graphics. So what is your problem???

My guess is that in 1.6.0 the characters are transmitted to the g.drawString() command without the Font having been transmitted. That would result in readable text without the correct font. Ideally I need the font for the metrics.

I guess that in 1.7.0 NO characters are transmitted to drawString and that everything is drawShape(), with the precomputed glyphs. However the system should know the characters at that stage as they are known to the 1.6.0 system! If I knew how to get them I could combine the information and that would be fine, as I could then create the glyph table.

I have tried another publisher - apart from the first 2 are these fonts any better?

COSDictionary{(COSName{BaseFont}:COSName{Arial-BoldMT})
COSDictionary{(COSName{BaseFont}:COSName{ArialMT})
COSDictionary{(COSName{BaseFont}:COSName{FEDNDC+AdvOTbdfd27ae.B})
COSDictionary{(COSName{BaseFont}:COSName{FEDNED+AdvOTb65e897d.B})
COSDictionary{(COSName{BaseFont}:COSName{FEDNEE+AdvOT1ef757c0})
COSDictionary{(COSName{BaseFont}:COSName{FEDNFF+AdvP7DB7})
COSDictionary{(COSName{BaseFont}:COSName{FEDNGG+AdvOT7d6df7ab.I})
COSDictionary{(COSName{BaseFont}:COSName{FEDNHG+AdvP414BFB})
COSDictionary{(COSName{BaseFont}:COSName{FEDNII+AdvOTc8fb9ce9})
COSDictionary{(COSName{BaseFont}:COSName{FEDOMF+AdvP4DD222})
COSDictionary{(COSName{BaseFont}:COSName{FEDPLG+AdvOT6f8dc4dc.I})
COSDictionary{(COSName{BaseFont}:COSName{FEEAKB+AdvP3EAA99})
COSDictionary{(COSName{BaseFont}:COSName{FEEALC+AdvP44E6F4})

f I assume that the PDF has transmitted

> Andrey
>
> From: [email protected] [mailto:[email protected]] On behalf of Peter Murray-Rust
> Sent: Monday, 7 May 2012 15:24
> To: Andrey Kuznetsov
> Cc: [email protected]
> Subject: Re: Extracting vector graphics from PDF
>
> On Mon, May 7, 2012 at 1:31 PM, Andrey Kuznetsov <[email protected]> wrote:
>
> > Peter,
> > The COS output is horribly formatted, so I read only the first line ;-)
>
> Sorry - that is what COSDictionary.toString() gave.
>
> > It uses a FontFile3 stream. A FontFile3 stream contains a font in either Compact Font Format (CFF) or OpenType Format (OTF), neither of which is supported by Java.
> > The font name is "FKAJPF+AdvOT3b30f6db.B", which means that it is a subset of the font named "AdvOT3b30f6db.B".
>
> I am ignorant about fonts so please correct any errors.
>
> > I don't know exactly how PdfBox handles CFF/OpenType fonts; probably they just search for a surrogate font (by name) or some kind of default font (since I have never seen such a horrible font name among system fonts).
>
> I have no idea where the font came from. It's probably created by the publisher or bought from a supplier.
>
> > I don't know if this is really useful for you.
>
> It's very useful! First, it explains why I had problems and gives me confidence in the process.
>
> > I also have no idea why the font name/style are not set. It may nevertheless be a valid font.
> > BTW, the only way to make Java understand CFF/OTF fonts is to convert them to Type1 fonts. I doubt that there is any free Java program which could do it.
>
> Thanks for the information.
>
> > (I managed to write a parser for CFF fonts, but still have to dig into the Type1 font format; however, my to-do list is really long and the Type1 format is not in first place ;-))
>
> What does the parser do?
>
> > Best Regards
>
> I shall probably create a hack of some kind. I can find a sans-serif and a serif which are "fairly close" and substitute them. How would I get a system COSDictionary I could substitute? I am mainly interested in:
> * the identity of the characters
> * the font metrics of the characters.
> In this way I can guess the words and the spaces between them.
>
> > Andrey
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
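
On tracing where the g.draw(Shape) / g.fill(Shape) calls come from: one possible approach, sketched here under stated assumptions, is to inspect the call stack inside the intercepting shell and tag each call by its caller. Everything in this sketch is hypothetical: FillTrace.fill(Object) merely stands in for an overridden Graphics2D.fill(Shape), GlyphPainter and PathPainter stand in for whatever classes in PdfBox actually issue the calls, and the marker string "Glyph" is a guess that would have to be replaced after looking at a real stack trace.

```java
import java.util.ArrayList;
import java.util.List;

public class FillTrace {
    // One entry per fill call: "glyph" if a font-related caller was found, else "path".
    public static final List<String> log = new ArrayList<>();

    // Assumption: class names containing this marker belong to glyph rendering.
    // The real marker must be found by inspecting an actual PdfBox stack trace.
    private static final String MARKER = "Glyph";

    // Stand-in for an overridden Graphics2D.fill(Shape) in the intercepting shell.
    public static void fill(Object shape) {
        boolean fromFontCode = false;
        for (StackTraceElement frame : Thread.currentThread().getStackTrace()) {
            if (frame.getClassName().contains(MARKER)) {
                fromFontCode = true;
                break;
            }
        }
        log.add(fromFontCode ? "glyph" : "path");
    }

    // Hypothetical caller that draws a character outline.
    static class GlyphPainter {
        static void paint() { fill("outline of T"); }
    }

    // Hypothetical caller that draws ordinary vector graphics.
    static class PathPainter {
        static void paint() { fill("rectangle"); }
    }

    public static void main(String[] args) {
        GlyphPainter.paint();
        PathPainter.paint();
        System.out.println(log);
    }
}
```

This only separates glyph outlines from ordinary paths; it does not recover which character each outline represents, which would still have to come from the font encoding.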

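On the subset font names discussed above (e.g. "FKAJPF+AdvOT3b30f6db.B" being a subset of "AdvOT3b30f6db.B"): the PDF convention is a tag of six uppercase letters followed by "+" in front of the original font name. A minimal sketch for splitting such names, assuming the BaseFont value has already been extracted as a plain string (the class and method names here are made up for illustration and are not part of PdfBox):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubsetFontName {
    // Subset tag: exactly six uppercase letters, then '+', then the original font name.
    private static final Pattern SUBSET = Pattern.compile("^([A-Z]{6})\\+(.+)$");

    // True if the BaseFont string carries a subset tag.
    public static boolean isSubset(String baseFont) {
        return SUBSET.matcher(baseFont).matches();
    }

    // Returns the name without the subset tag, or the input unchanged if there is no tag.
    public static String baseName(String baseFont) {
        Matcher m = SUBSET.matcher(baseFont);
        return m.matches() ? m.group(2) : baseFont;
    }

    public static void main(String[] args) {
        String[] names = {"Arial-BoldMT", "FKAJPF+AdvOT3b30f6db.B", "FEDNFF+AdvP7DB7"};
        for (String name : names) {
            System.out.println(name + " -> subset=" + isSubset(name) + ", base=" + baseName(name));
        }
    }
}
```

The stripped name could then be used as a key when choosing a substitute serif or sans-serif font, although names like "AdvOT3b30f6db.B" give little hint beyond a possible style suffix.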
