Re: [iText-questions] PdfContentStreamProcessor not handling TJ operator correctly (maybe)

Kevin Day Fri, 07 Sep 2012 16:37:15 -0700

I'm not sure that leading and rise are going to be very helpful - let me know
how you plan to use them. I would have expected that getBaseline(),
getAscentLine() and getDescentLine() would be what you need.

Superscript and subscript - that's going to be quite tricky. You could
probably come up with some mechanism for intuiting superscript (maybe if the
vertical position of the glyph start point is between the baselines, and
if the font of the glyph is smaller than some percentage of the surrounding
text). It may be sufficient to just adjust the fuzziness that
LocationTextExtractionStrategy uses for determining whether two draw
operations are on the same line. Chunk.sameLine() right now does strict
integer comparison; perhaps you could make that a bit fuzzier, e.g. take the
smaller of the heights of the two chunks being compared and see if the
vertical distance is within that. There's going to be a bit of black art in
getting that right, but it should work. If you get something working, send
a patch.
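
To make the "smaller of the two heights" idea concrete, here is a minimal
sketch. It does not use iText at all: TextChunk, sameLineStrict() and
sameLineFuzzy() are hypothetical stand-ins invented for illustration, and the
strict version only mimics the idea of an integer comparison, not the
library's actual code. Treat the tolerance rule as a starting point to tune.

```java
// Sketch only: TextChunk is a hypothetical stand-in, not iText's Chunk class.
final class FuzzyLineCheck {

    /** Hypothetical chunk: baseline y-coordinate and glyph height in user space units. */
    static final class TextChunk {
        final String text;
        final float baselineY;
        final float height;

        TextChunk(String text, float baselineY, float height) {
            this.text = text;
            this.baselineY = baselineY;
            this.height = height;
        }
    }

    /** Strict check: baselines truncated to integers must be equal. */
    static boolean sameLineStrict(TextChunk a, TextChunk b) {
        return (int) a.baselineY == (int) b.baselineY;
    }

    /** Fuzzy check: vertical distance must be within the smaller of the two heights. */
    static boolean sameLineFuzzy(TextChunk a, TextChunk b) {
        float tolerance = Math.min(a.height, b.height);
        return Math.abs(a.baselineY - b.baselineY) <= tolerance;
    }

    public static void main(String[] args) {
        TextChunk body = new TextChunk("x", 700f, 12f);
        TextChunk raised = new TextChunk("2", 705f, 7f);   // smaller, shifted up
        TextChunk nextLine = new TextChunk("y", 686f, 12f); // one line further down
        System.out.println("strict body/raised:  " + sameLineStrict(body, raised));
        System.out.println("fuzzy body/raised:   " + sameLineFuzzy(body, raised));
        System.out.println("fuzzy body/nextLine: " + sameLineFuzzy(body, nextLine));
    }
}
```

With tight leading, a tolerance this generous could merge adjacent lines, so
some fraction of the smaller height may turn out to be a safer choice.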

The SimpleTextExtractionStrategy will never be able to properly handle
sub/super-scripts - it is too simple. As a bit of history,
SimpleTextExtractionStrategy was the first strategy I created as a proof of
concept. The idea was to prove that we could do text extraction at all.
LocationTextExtractionStrategy came next to provide page layout awareness.
For certain rendering situations, the unintelligent approach used by
SimpleTextExtractionStrategy is fine, but for real text extraction, use
LocationTextExtractionStrategy. The reason I made these pluggable was so
that folks could play with alternative strategies - and if someone finds one
that is better, they can contribute it and it can be easily rolled into the
library. I think that SimpleTextExtractionStrategy still has relevance
because it is easy to understand and helps developers see how things work.
LocationTextExtractionStrategy is pretty complicated.

Re: your question about how to handle word and character spacing - I haven't
read this part of the PDF spec (and I'm pretty slammed with other work right
now). I'd suggest reading the spec (the operators are Tc and Tw) and working
out some logic for how to use those values - then post back with what you
are thinking and I'll think about how to incorporate it. Naturally, if you
want to provide a patch, that would be best, but with some of this stuff
it's often helpful to bounce ideas around, and I'm definitely available for
that.
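
As a possible starting point for that discussion: the text-positioning rules
in ISO 32000-1 (section 9.4.4) give the horizontal displacement after a glyph
as tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th, where Tw only applies to the
single-byte character code 32. The sketch below just evaluates that formula.
The class name, method name and parameter names are made up for illustration
and are not part of iText.

```java
// Sketch only: evaluates the horizontal displacement formula from the PDF spec.
// GlyphAdvance and its parameters are hypothetical names, not an iText API.
final class GlyphAdvance {

    /**
     * @param glyphWidth1000    glyph width in thousandths of a text space unit
     * @param tjAdjust          number taken from a TJ array (0 if none applies)
     * @param fontSize          Tfs, the current font size
     * @param charSpacing       Tc, added after every glyph
     * @param wordSpacing       Tw, added only after a single-byte space (code 32)
     * @param horizScalePercent Tz value in percent (100 means no scaling)
     * @param isSingleByteSpace true if the glyph is the single-byte code 32
     */
    static float advance(float glyphWidth1000, float tjAdjust, float fontSize,
                         float charSpacing, float wordSpacing,
                         float horizScalePercent, boolean isSingleByteSpace) {
        float w0 = glyphWidth1000 / 1000f;
        float tw = isSingleByteSpace ? wordSpacing : 0f;
        float th = horizScalePercent / 100f;
        return ((w0 - tjAdjust / 1000f) * fontSize + charSpacing + tw) * th;
    }

    public static void main(String[] args) {
        // An ordinary glyph, then a space, both at font size 10 with Tc=1, Tw=2.
        System.out.println(advance(500f, 0f, 10f, 1f, 2f, 100f, false));
        System.out.println(advance(250f, 0f, 10f, 1f, 2f, 100f, true));
    }
}
```

An extraction strategy could compare the actual gap between two chunks with
the advance computed this way to decide whether a space character should be
inserted, though the right threshold would need experimenting.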

The iText SVN repo moved a few weeks ago - be sure you are using the new
one:

svn checkout svn://svn.code.sf.net/p/itext/code/trunk itext-code

Cheers,

--
View this message in context:
http://itext-general.2136553.n4.nabble.com/PdfContentStreamProcessor-not-handling-TJ-operator-correctly-maybe-tp4656117p4656226.html
Sent from the iText - General mailing list archive at Nabble.com.
