I'm not sure that leading and rise are going to be super helpful - let me know how you plan to use them. I would have expected that getBaseline(), getAscentLine() and getDescentLine() would be what you need.
Superscript and subscript - that's going to be quite tricky. You could probably come up with some mechanism for intuiting superscript (maybe if the vertical position of the glyph start point is between the baselines, and if the font of the glyph is smaller than some percentage of the surrounding text?). It may be sufficient to just adjust the fuzziness that LocationTextExtractionStrategy uses for determining whether two draw operations are on the same line or not. Chunk.sameLine() right now does a strict integer comparison; perhaps you could make that a bit fuzzier, e.g. take the smaller of the heights of the two chunks being compared and see if the vertical distance between them is within that. There's going to be a bit of black art in getting that right, but it should work. If you get something working, send a patch.

The SimpleTextExtractionStrategy will never be able to properly handle sub/superscripts - it is too simple. As a bit of history, SimpleTextExtractionStrategy was the first strategy I created, as a proof of concept. The idea was to prove that we could do text extraction at all. LocationTextExtractionStrategy came next to provide page layout awareness. For certain rendering situations, the unintelligent approach used by SimpleTextExtractionStrategy is fine, but for real text extraction, use LocationTextExtractionStrategy.

The reason I made these pluggable was so that folks could play with alternative strategies - and if someone finds one that is better, they can contribute it and it can be easily rolled into the library. I think that SimpleTextExtractionStrategy still has relevance because it is easy to understand and helps developers see how things work. LocationTextExtractionStrategy is pretty complicated.

Re: your question about how to handle word and character spacing - I haven't read this part of the PDF spec (and I'm pretty slammed with other work right now).
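
To make the "fuzzier sameLine" idea above concrete, here is a minimal, self-contained sketch. It is not the library's code: the Chunk class below is a hypothetical stand-in that only carries a baseline y coordinate and a glyph height, and sameLineStrict() merely imitates the strict integer comparison described above. The tolerance rule (the smaller of the two heights) is the one suggested in this message and would likely need tuning on real documents.

```java
class FuzzySameLine {

    /** Hypothetical stand-in for a text chunk (not the iText class). */
    public static final class Chunk {
        public final String text;
        public final float baselineY; // vertical position of the baseline, in user-space units
        public final float height;    // approximate glyph height (ascent line minus descent line)

        public Chunk(String text, float baselineY, float height) {
            this.text = text;
            this.baselineY = baselineY;
            this.height = height;
        }
    }

    /** Strict check: two chunks share a line only if their truncated baselines match. */
    public static boolean sameLineStrict(Chunk a, Chunk b) {
        return (int) a.baselineY == (int) b.baselineY;
    }

    /**
     * Fuzzy check: two chunks share a line if the vertical distance between
     * their baselines is within the smaller of the two chunk heights.
     */
    public static boolean sameLineFuzzy(Chunk a, Chunk b) {
        float tolerance = Math.min(a.height, b.height);
        return Math.abs(a.baselineY - b.baselineY) <= tolerance;
    }

    public static void main(String[] args) {
        Chunk body = new Chunk("x", 700.0f, 12.0f);     // 12pt body text
        Chunk sup = new Chunk("2", 704.5f, 7.0f);       // smaller, raised glyph
        Chunk nextLine = new Chunk("y", 685.6f, 12.0f); // body text one line down

        System.out.println(sameLineStrict(body, sup));     // false: 700 vs 704
        System.out.println(sameLineFuzzy(body, sup));      // true: 4.5 <= 7.0
        System.out.println(sameLineFuzzy(body, nextLine)); // false: 14.4 > 12.0
    }
}
```

Note that with tightly set text (line spacing smaller than the glyph height) this tolerance would start merging adjacent lines, which is part of the black art mentioned above.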
I'd suggest reading the spec (the operators are Tc and Tw) and noodling out some logic about how to use those values - then post back with what you are thinking and I'll think about how to incorporate it (naturally, if you want to provide a patch, that would be the best! But with some of this stuff, it's often helpful to bounce ideas, and I'm definitely available for that).

The iText SVN repo moved a few weeks ago - be sure you are using the new one:

svn checkout svn://svn.code.sf.net/p/itext/code/trunk itext-code

Cheers,
K

--
View this message in context: http://itext-general.2136553.n4.nabble.com/PdfContentStreamProcessor-not-handling-TJ-operator-correctly-maybe-tp4656117p4656226.html
Sent from the iText - General mailing list archive at Nabble.com.
