Re: [iText-questions] PdfContentStreamProcessor not handling TJ operator correctly (maybe)

Dominika Tkaczyk Thu, 20 Sep 2012 04:25:50 -0700

> ok - I just committed code with a new getCharacterRenderInfos() method.  I
> have one test case as a sanity check, but definitely let me know if you
> have additional use cases that are failing, etc... (providing a test case
> helps immensely if that does happen).
>
> I have removed the character and word spacing getters for the time being
> (they really shouldn't be necessary now that we have per-glyph metrics).
>
> Let me know what you guys think,


Hello,

I'd like to thank you all for doing such a great job with iText, and in
particular for adding the getCharacterRenderInfos method. It will be very
helpful in my project, which is a system for extracting metadata and
content from scientific articles in PDF format. This is done by a sequence
of steps, the first of which is extracting individual characters along
with their positions on the page and their dimensions. Further steps
include, among others, grouping individual characters into larger objects
such as words, lines and zones, and determining the full reading order.
The implementation of the first step is based on iText, and until now I
have been using my own methods added to TextRenderInfo to obtain the
necessary information about the individual characters.

I have tested the getCharacterRenderInfos method on some example PDFs and
I have noticed two potential problems.

The first one concerns the TextRenderInfo constructor, in which the
textMatrix passed as an argument is multiplied by gs.ctm. As a result,
subTextMatrix is multiplied by gs.ctm again for every new TextRenderInfo
object created for an individual character. If I am not mistaken, this
shouldn't be happening. In some cases I observed incorrect character
coordinates caused by this.
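
To illustrate why a second multiplication by the CTM shifts coordinates,
here is a minimal, self-contained sketch. It does not use iText's Matrix
class; the class name CtmTwiceSketch, the helper multiply and the example
matrices are all made up for illustration. It only assumes the usual PDF
convention of row vectors multiplied by 3x3 matrices.

```java
// Hypothetical illustration only; this is not iText code.
final class CtmTwiceSketch {

    // Multiplies two row-major 3x3 matrices: result = a x b.
    static double[] multiply(double[] a, double[] b) {
        double[] r = new double[9];
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                double sum = 0;
                for (int k = 0; k < 3; k++) {
                    sum += a[i * 3 + k] * b[k * 3 + j];
                }
                r[i * 3 + j] = sum;
            }
        }
        return r;
    }

    public static void main(String[] args) {
        // Assumed example values: CTM scales by 2 and translates by (10, 20).
        double[] ctm = {2, 0, 0,   0, 2, 0,   10, 20, 1};
        // Assumed text matrix: pure translation by (5, 5).
        double[] textMatrix = {1, 0, 0,   0, 1, 0,   5, 5, 1};

        // Correct: text space to user space, CTM applied once.
        double[] once = multiply(textMatrix, ctm);
        // Suspected problem: the already-transformed matrix is
        // multiplied by the CTM a second time.
        double[] twice = multiply(once, ctm);

        // The third row holds the image of the text-space origin.
        System.out.println("CTM applied once:  " + once[6] + ", " + once[7]);
        System.out.println("CTM applied twice: " + twice[6] + ", " + twice[7]);
    }
}
```

With an identity CTM both results coincide, which may be why the effect
shows up only in some PDFs.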

The second problem concerns character width, which I compute like this:
charTri.getDescentLine().getLength(). This method relies on the
getStringWidth method, which adds character spacing to every character of
the string, including the last one. It seems to me that as a result I get
the distance from the beginning of the character to the beginning of the
next one, instead of the actual character width (this is easy to observe
when character spacing is large). I tried correcting the value returned
from getStringWidth, but that produced incorrect results, because
getStringWidth is also used in PdfContentStreamProcessor to compute the
distance from the beginning of one character to the beginning of the next.
Instead I experimented with getUnscaledBaselineWithOffset, and when I
changed this line:

return new LineSegment(new Vector(0, yOffset, 1),
        new Vector(getUnscaledWidth(), yOffset, 1));

to

return new LineSegment(new Vector(0, yOffset, 1),
        new Vector(getUnscaledWidth()
                - gs.characterSpacing * gs.horizontalScaling, yOffset, 1));

the results seem to be correct.
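
The difference between the two quantities can be sketched as follows. This
is a standalone illustration, not iText code: the class GlyphWidthSketch,
its method names and its parameters are hypothetical. It assumes the
horizontal displacement rule from the PDF specification,
tx = (w0 * Tfs + Tc) * Th, and ignores word spacing and TJ adjustments,
which would need the same treatment for space characters.

```java
// Hypothetical illustration only; names and parameters are not iText's.
final class GlyphWidthSketch {

    // Distance from the start of this glyph to the start of the next one.
    // glyphWidthUnits: glyph width in text space units (font width / 1000)
    // fontSize:        Tfs
    // charSpacing:     Tc
    // horizontalScale: Th as a fraction (1.0 means 100%)
    static double advance(double glyphWidthUnits, double fontSize,
                          double charSpacing, double horizontalScale) {
        return (glyphWidthUnits * fontSize + charSpacing) * horizontalScale;
    }

    // Actual glyph width: the advance minus the scaled character spacing,
    // which mirrors the subtraction proposed in the change above.
    static double glyphWidth(double glyphWidthUnits, double fontSize,
                             double charSpacing, double horizontalScale) {
        return advance(glyphWidthUnits, fontSize, charSpacing, horizontalScale)
                - charSpacing * horizontalScale;
    }

    public static void main(String[] args) {
        // Assumed example: a 500/1000 wide glyph at 10pt with Tc = 3.
        System.out.println("advance:     " + advance(0.5, 10, 3, 1.0));
        System.out.println("glyph width: " + glyphWidth(0.5, 10, 3, 1.0));
    }
}
```

When character spacing is zero the two values are equal, so the
discrepancy becomes visible only in documents that set a non-zero Tc.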

I would be very grateful if you could comment on these issues.

Best regards,
Dominika



