Following on from this, I now have character spacing and word spacing
handled when writing images, and the output looks almost identical to the PDF
viewed in Adobe Reader (with respect to text rendering, including layout). It
was a bit of a desperate approach, but it shows the results can be achieved.
It appears to be a similar fix to the one suggested in Jira PDFBOX-508, but I
only needed to modify the PDFStreamEngine.java class. I changed the
processEncodedText method to simply process the text position of each
character found in the stream.

The only undesirable consequence would be performance, as this will
trigger one callback to processTextPosition for each character rather than
one per sequence. But given that this appears to be the only reliable way to
establish where each character should be placed, I'm not sure what the
alternative would be.
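
To illustrate what placing each character individually involves, here is a
minimal sketch. It is not the actual PDFStreamEngine change. It assumes the
displacement rule from the PDF reference, tx = (w0 * Tfs + Tc + Tw) * Th,
with word spacing applied only to the space character. The class name
GlyphAdvanceSketch, the method xOffsets and the glyph widths are made up for
illustration.

```java
import java.util.Arrays;

final class GlyphAdvanceSketch {

    /**
     * Returns the x offset at which each character of the text starts.
     * widths1000 holds the glyph widths in 1/1000 text space units (hypothetical input).
     */
    static double[] xOffsets(String text, double[] widths1000, double fontSize,
                             double charSpacing, double wordSpacing,
                             double horizontalScale) {
        if (widths1000.length != text.length()) {
            throw new IllegalArgumentException("one width per character is required");
        }
        double[] xs = new double[text.length()];
        double x = 0.0;
        for (int i = 0; i < text.length(); i++) {
            xs[i] = x; // this character is drawn at the current position
            double advance = widths1000[i] / 1000.0 * fontSize + charSpacing;
            if (text.charAt(i) == ' ') {
                advance += wordSpacing; // word spacing only applies to the space character
            }
            x += advance * horizontalScale;
        }
        return xs;
    }

    public static void main(String[] args) {
        double[] xs = xOffsets("a b", new double[] {500, 250, 500}, 10.0, 1.0, 2.0, 1.0);
        System.out.println(Arrays.toString(xs)); // prints [0.0, 6.0, 11.5]
    }
}
```

Each offset corresponds to one text position, which is why the number of
callbacks grows with the number of characters rather than the number of
strings.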

Like I said, I didn't modify anything else to get this going, and text
extraction wasn't affected when sorting by position for horizontal text. For
diagonal text going up from bottom left to top right things changed, but the
original wasn't perfect, and it came from text pieces in an embedded image
(EPS). What I got out after the change was the text being read from bottom to
top, going left to right, so a vertical read, with the characters coming out
in the right order by position in that orientation. That would be a different
problem to solve.

On Mon, Sep 7, 2009 at 4:40 PM, Tony Scerri <[email protected]> wrote:

> Not sure if this is a possible cause for issues others have reported. I
> found that when creating images from PDFs I was getting a lot of jumbled
> text, bits overlapping others etc., and generally it looked wrong. It turns
> out, after much digging and tinkering, that the FontManager was returning the
> wrong font even for standard fonts available in most environments.
>
> The fix I put in was inside the iteration over the available AWT fonts
> in the loadFonts method of FontManager. As the last line of the for loop I
> added:
>
>             envFonts.put(normalizeFontname(font.getPSName()),font);
>
> This puts in the PostScript name, which is quite often used inside PDFs
> from what I have been seeing lately in my work. This now has a much better
> chance of looking up the correct font. I no longer get overlapped words etc.
> because the font's metrics are a much better match for what was expected.
>
> I think this problem may be more prevalent in PDFs where the text has been
> fully justified. I have run into a subsequent issue that I'm still plodding
> my way through: I'm now left with large gaps in lines, in the middle of
> words, because PDFBox isn't rendering the word spacing correctly (it might
> also be character spacing). This is all down to the use of AWT rendering of
> fonts, which as far as I can tell won't allow for the kind of control
> required when rendering a whole string. The alternative seems to be to
> render each character one by one with the appropriate displacement
> between each glyph.
>
> Tony
>
>
> On Wed, Sep 2, 2009 at 6:47 AM, Andreas Lehmkühler (JIRA) <[email protected]
> > wrote:
>
>>
>>     [
>> https://issues.apache.org/jira/browse/PDFBOX-302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>>
>> Andreas Lehmkühler resolved PDFBOX-302.
>> ---------------------------------------
>>
>>       Resolution: Fixed
>>    Fix Version/s: 0.8.0-incubator
>>
>> AFAIK there aren't any issues with this improvement, so that I'll set this
>> to resolved.
>>
>> For now there aren't any mappings missing. If we find some later, it'll be
>> no problem to add them.
>>
>> > Improve font handling (was: layout print problem)
>> > -------------------------------------------------
>> >
>> >                 Key: PDFBOX-302
>> >                 URL: https://issues.apache.org/jira/browse/PDFBOX-302
>> >             Project: PDFBox
>> >          Issue Type: Improvement
>> >          Components: PDFReader
>> >            Reporter: Jukka Zitting
>> >            Assignee: Andreas Lehmkühler
>> >            Priority: Minor
>> >             Fix For: 0.8.0-incubator
>> >
>> >
>> > [imported from SourceForge]
>> >
>> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1787501
>> > Originally submitted by gjniewenhuijse on 2007-09-04 00:24.
>> > When i print the attached file, some things are not printed well.
>> > - The gray box at the top
>> > - and the fonts are printed bold and thats not right.
>> > Is there any solution for now, or for later?
>> > When i open and print this file with adobe reader, everything is fine,
>> but with pdfbox i've got a layout problem.
>> > I used the newest pdfbox version (also tested the nightly build)
>> > [attachment on SourceForge]
>> >
>> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1787501&file_id=244104
>> > orarrp.pdf (application/pdf), 7871 bytes
>> > pdf with print problem
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>
>
