I will look at generating appropriate patches for the two separate changes I mentioned today. I have noticed one minor issue with text extraction after the char and word spacing fix: an extra space is added inside one word in one of the three PDFs I have been working with.
I have also made a third change relating to the identification of fonts embedded in a PDF, after the contained TTF could not be extracted: it failed to load properly when AWT was called because the "name" table was missing, which I assume indicates an invalid or corrupt PDF (but not knowing too much about TTF etc. I'm not 100% sure). I may need to build a specific sample PDF to submit, as the contents of the PDFs I'm working with cannot be circulated.

Tony

2009/9/7 Andreas Lehmkühler <[email protected]>

> Hi Tony,
>
> is it possible to provide us with a sample document to test your patch?
> As attachments aren't allowed on the list, you have to create a new
> issue on JIRA and attach your sample.
>
> Thanks in advance,
> Andreas Lehmkühler
>
> Tony Scerri wrote:
> > Not sure if this is a possible cause for issues others have reported.
> > I found that when creating images from PDFs I was getting a lot of
> > jumbled text, bits overlapping others etc., and generally it looked
> > wrong. It turns out, after much digging and tinkering, that the
> > FontManager was returning the wrong font even for standard fonts
> > available in most environments.
> >
> > The fix I put in was inside the iteration over the available AWT
> > fonts in the loadFonts method of FontManager. As the last line of
> > the for loop I added:
> >
> > envFonts.put(normalizeFontname(font.getPSName()),font);
> >
> > This puts in the PostScript name, which is quite often used inside
> > PDFs from what I have been seeing lately in my work. This now has a
> > much better chance of looking up the correct font. I now don't have
> > overlapped words etc. because the font's metrics are much closer to
> > what was expected.
> >
> > I think this problem may be more prevalent in PDFs where the text has
> > been fully justified. I have run into a subsequent issue that I am
> > still plodding my way through.
> > The issue is that I'm now left with large gaps in lines, in the
> > middle of words, because PDFBox isn't rendering the word spacing
> > correctly (it might also be the character spacing). This is all down
> > to the use of AWT font rendering, which as far as I can tell won't
> > allow for the kind of control required when rendering a whole string.
> > The alternative seems to be to render each character one by one with
> > the appropriate displacement between each glyph.
> >
> > Tony
> >
> > On Wed, Sep 2, 2009 at 6:47 AM, Andreas Lehmkühler (JIRA)
> > <[email protected]> wrote:
> >
> >> [ https://issues.apache.org/jira/browse/PDFBOX-302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> >>
> >> Andreas Lehmkühler resolved PDFBOX-302.
> >> ---------------------------------------
> >>
> >> Resolution: Fixed
> >> Fix Version/s: 0.8.0-incubator
> >>
> >> AFAIK there aren't any issues with this improvement, so I'll set
> >> this to resolved.
> >>
> >> For now there aren't any mappings missing. If we find some later,
> >> it'll be no problem to add them.
> >>
> >>> Improve font handling (was: layout print problem)
> >>> -------------------------------------------------
> >>>
> >>> Key: PDFBOX-302
> >>> URL: https://issues.apache.org/jira/browse/PDFBOX-302
> >>> Project: PDFBox
> >>> Issue Type: Improvement
> >>> Components: PDFReader
> >>> Reporter: Jukka Zitting
> >>> Assignee: Andreas Lehmkühler
> >>> Priority: Minor
> >>> Fix For: 0.8.0-incubator
> >>>
> >>> [imported from SourceForge]
> >>> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1787501
> >>> Originally submitted by gjniewenhuijse on 2007-09-04 00:24.
> >>> When I print the attached file, some things are not printed well.
> >>> - The gray box at the top
> >>> - and the fonts are printed bold and that's not right.
> >>> Is there any solution for now, or for later?
> >>> When I open and print this file with Adobe Reader, everything is
> >>> fine, but with PDFBox I've got a layout problem.
> >>> I used the newest PDFBox version (also tested the nightly build).
> >>> [attachment on SourceForge]
> >>> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1787501&file_id=244104
> >>> orarrp.pdf (application/pdf), 7871 bytes
> >>> pdf with print problem
> >> --
> >> This message is automatically generated by JIRA.
> >> -
> >> You can reply to this email to add a comment to the issue online.
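
For anyone who wants to try the PostScript-name lookup from the quoted message outside of PDFBox, here is a rough, self-contained sketch of the idea. It is not the actual FontManager code: the class PsNameFontIndex, the index and indexInstalledFonts methods, and the normalize helper (a simplified stand-in for normalizeFontname) are all made up for illustration, and the choice of which other names to register alongside the PostScript name is an assumption. Only the java.awt calls (getAllFonts, getFamily, getFontName, getPSName) are real API.

```java
import java.awt.Font;
import java.awt.GraphicsEnvironment;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class PsNameFontIndex {

    // Hypothetical stand-in for FontManager.normalizeFontname: drops a subset
    // prefix such as "ABCDEF+", removes whitespace and lower-cases the rest.
    public static String normalize(String name) {
        String withoutSubsetTag = name.replaceFirst("^[A-Z]{6}\\+", "");
        return withoutSubsetTag.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
    }

    // Registers every font under each name produced by the given extractors.
    // If two fonts normalize to the same key, the later one wins.
    public static <F> Map<String, F> index(Iterable<F> fonts,
                                           List<Function<F, String>> namers) {
        Map<String, F> byName = new HashMap<>();
        for (F font : fonts) {
            for (Function<F, String> namer : namers) {
                String name = namer.apply(font);
                if (name != null) {
                    byName.put(normalize(name), font);
                }
            }
        }
        return byName;
    }

    // Indexes the installed AWT fonts by family name, font face name and,
    // as in the quoted fix, by PostScript name.
    public static Map<String, Font> indexInstalledFonts() {
        Font[] all = GraphicsEnvironment.getLocalGraphicsEnvironment().getAllFonts();
        List<Function<Font, String>> namers = new ArrayList<>();
        namers.add(f -> f.getFamily());
        namers.add(f -> f.getFontName());
        namers.add(f -> f.getPSName()); // the extra registration from the fix
        return index(Arrays.asList(all), namers);
    }

    public static void main(String[] args) {
        Map<String, Font> fonts = indexInstalledFonts();
        // A PDF font dictionary often carries a PostScript name like this one.
        Font hit = fonts.get(normalize("ABCDEF+Arial-BoldMT"));
        System.out.println(hit == null ? "no match" : "matched " + hit.getPSName());
    }
}
```

Whether the lookup in main finds anything depends on the fonts installed on the machine, so treat the result as environment-specific. The generic index method is only there so the lookup logic can be exercised without any AWT fonts present.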
