Re: [iText-questions] PdfContentStreamProcessor not handling TJ operator correctly (maybe)

Dominika Tkaczyk Sun, 23 Sep 2012 10:19:45 -0700

> I will be thrilled to have the tests - a week or two delay is well worth it
> to have test coverage.
>
> As a starting point, there is a unit test for TextRenderInfo that includes
> code for creating a PDF.  The LocationTextExtractionStrategyTest has quite a
> few different scenarios involving rotating text, etc...
>
> My biggest hangup here is that I just have no idea what the various
> use-cases are that involve this functionality, so I could create tests, but
> I'd have no idea if they were doing what you wanted or not.
>
> I suppose one simple test would be to check that the baseline now doesn't
> include the extra character space at the end - but what's the best way to
> test this?  I suppose we could set a really large character space value,
> render a word, then compute the length of the baseline.  But am I comparing
> it against just a constant (i.e. 12.724)??  How do I even know what that
> distance *should* be?  I'm sure there is a good answer (and when you tell
> me, I'll feel a bit foolish)!


Very sorry for the delay; I have also been quite busy these days.

I tested my sample files by simply extracting all the characters with
their coordinates and checking whether they match the character positions
in the image of the PDF file itself. I used our own tool, which displays
the image of the PDF file with the generated structure (lines, words and
characters) drawn on top of it. As I mentioned, we work only with
scientific articles, which rarely include rotated text and the like.

So one possibility would be to test returned values against coordinates
captured from the PDF images. I think I could use our tool to capture
those values, although it would require a small improvement of the tool
itself. Of course we would have to use a small tolerance value during the
comparison.
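
A minimal sketch of what such a comparison could look like, assuming the
extracted and the captured values are both available as simple lists. The
names CharPos and matchesReference are hypothetical helpers written for
this example and are not part of iText; the tolerance is in the same units
as the coordinates.

```java
import java.util.List;

final class CoordinateCheck {

    /** Hypothetical holder: one character and the start of its baseline. */
    static final class CharPos {
        final char ch;
        final double x;
        final double y;

        CharPos(char ch, double x, double y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Returns true if both lists contain the same characters in the same
     * order and every coordinate differs by at most the given tolerance.
     */
    static boolean matchesReference(List<CharPos> extracted,
                                    List<CharPos> reference,
                                    double tolerance) {
        if (extracted.size() != reference.size()) {
            return false;
        }
        for (int i = 0; i < extracted.size(); i++) {
            CharPos e = extracted.get(i);
            CharPos r = reference.get(i);
            if (e.ch != r.ch) {
                return false;
            }
            if (Math.abs(e.x - r.x) > tolerance
                    || Math.abs(e.y - r.y) > tolerance) {
                return false;
            }
        }
        return true;
    }
}
```

The reference list would hold the values captured from the PDF image, and
the extracted list whatever the text extraction returns for the same page.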

I have also been thinking of an alternative. What I observed in my test
cases was, for example, a huge overlap between neighboring characters, or
no space between different words (this happened when character spacing was
used to produce the spaces). So maybe we could generate small PDFs
containing exactly the same sentence in the same position but generated
differently (e.g. large character spacing used as spaces between words,
spaces written directly by the Tj operator, text matrix set after writing
each character, etc.), and then check whether the characters within a word
are close together but do not overlap much, whether neighboring words are
separated by a gap, etc. And if the files look the same when rendered, we
could also check that we get similar coordinates in every case. Of course
here too we would have to use some small tolerance values.
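
The checks themselves could be sketched roughly as follows, assuming
unrotated text on a single line, so that only the horizontal extent of each
character matters. Glyph, wellSpaced, separatedByGap and similar are
hypothetical names made up for this example, not iText API, and the
thresholds would have to be chosen relative to the font size used in the
sample PDFs.

```java
import java.util.List;

final class LayoutCheck {

    /** Hypothetical holder: horizontal extent of one extracted character. */
    static final class Glyph {
        final char ch;
        final double left;
        final double right;

        Glyph(char ch, double left, double right) {
            this.ch = ch;
            this.left = left;
            this.right = right;
        }
    }

    /**
     * Characters of one word: each character may overlap its predecessor
     * by at most maxOverlap and may be at most maxGap away from it.
     */
    static boolean wellSpaced(List<Glyph> word, double maxOverlap, double maxGap) {
        for (int i = 1; i < word.size(); i++) {
            // Negative gap means the two characters overlap.
            double gap = word.get(i).left - word.get(i - 1).right;
            if (gap < -maxOverlap || gap > maxGap) {
                return false;
            }
        }
        return true;
    }

    /** Two consecutive words: the space between them is at least minGap. */
    static boolean separatedByGap(List<Glyph> first, List<Glyph> second, double minGap) {
        Glyph last = first.get(first.size() - 1);
        Glyph next = second.get(0);
        return next.left - last.right >= minGap;
    }

    /** Two differently generated variants give the same extents within tolerance. */
    static boolean similar(List<Glyph> a, List<Glyph> b, double tolerance) {
        if (a.size() != b.size()) {
            return false;
        }
        for (int i = 0; i < a.size(); i++) {
            Glyph g = a.get(i);
            Glyph h = b.get(i);
            if (g.ch != h.ch
                    || Math.abs(g.left - h.left) > tolerance
                    || Math.abs(g.right - h.right) > tolerance) {
                return false;
            }
        }
        return true;
    }
}
```

Each sample PDF would be run through the same three checks, so a variant
that collapses the space between words or stacks characters on top of each
other fails regardless of how the text was written.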

In both cases, the bugs that occurred initially should be detected,
assuming that we use a few very different ways to render text in sample
PDFs.

Best regards,
Dominika


