Re: [iText-questions] PdfContentStreamProcessor not handling TJ operator correctly (maybe)

Shujaat Wed, 12 Sep 2012 09:12:53 -0700

OK. I'm currently working on what should be called a
SimpleTableTextExtractionStrategy. As the name suggests, it is a text
extraction strategy for tables in a PDF. Owing to the magnitude and
complexity of work it would take to handle all kinds of tables, I have
restricted my work to a somewhat simpler subset with the following
conditions:


 

  - A page should not have more than one table.
  - The table has a simple M x N layout: no merged cells, half lines or
    broken lines.

 

After handling path operators such as re, l, m and h, I do some maths to get
the list of horizontal and vertical lines that form the table. From this
point onwards, I need the EXACT locations of all text chunks to find out
which cells they fall in. Once I have that, appending them to the table
cells is a simple process.
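
To make the cell lookup concrete, here is a minimal sketch in plain C# (no
iText types). It assumes the table lines have already been reduced to two
sorted lists of coordinates; the names CellLocator, FindBand and TryFindCell
are hypothetical and are not part of iTextSharp.

```csharp
using System;
using System.Collections.Generic;

public static class CellLocator
{
    // Returns the index i such that lines[i] <= value < lines[i + 1],
    // or -1 when value lies outside the grid.
    // Assumption: 'lines' is sorted in ascending order.
    public static int FindBand(IList<float> lines, float value)
    {
        for (int i = 0; i + 1 < lines.Count; i++)
        {
            if (value >= lines[i] && value < lines[i + 1])
                return i;
        }
        return -1;
    }

    // Maps a user-space point to (row, col). xs holds the X coordinates of
    // the vertical lines, ys the Y coordinates of the horizontal lines.
    // Rows are counted from the top of the table, so the band index along Y
    // is flipped (Y grows upward in default PDF user space).
    public static bool TryFindCell(IList<float> xs, IList<float> ys,
                                   float x, float y, out int row, out int col)
    {
        col = FindBand(xs, x);
        int band = FindBand(ys, y);
        row = band < 0 ? -1 : (ys.Count - 2) - band;
        return col >= 0 && band >= 0;
    }
}
```

With vertical lines at 0, 100, 200 and horizontal lines at 0, 50, 100, 150,
the point (150, 120) falls in the top row, second column. A linear scan is
enough for small tables; a binary search would be the obvious replacement
for large grids.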

 

About your question: I really do need to know the X,Y position in user space
of each character within a text chunk. For now, I have managed to compute
that from GetBaseLine(), word spacing, character spacing, scaling and other
factors, in a way that is similar to the GetStringWidth() function. I'm
basically splitting the text chunk on the SPACE character and then computing
the location of each "word". To make it work, I had to add another public
function to TextRenderInfo, like this:

 

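public float GetWidth(string str)
{
    // Width of str in user space: glyph widths (1/1000 units) scaled by font
    // size and horizontal scaling. Character and word spacing are not included.
    return ConvertWidthFromTextSpaceToUserSpace(gs.horizontalScaling *
        ((gs.font.GetWidth(str) / 1000f) * gs.fontSize));
}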

 

I have just found out (by checking another PDF file) that this method may
not be sufficient to get the exact location of text chunks when the PDF
adds distance between logical words using character spacing instead of word
spacing. Let me give an example: the PDF file shows the following string:
"1            Ch. 10". In the RenderText() method, I get two text chunks;
the first one is "1C" (no space in between) and the second one is "h. 10".
Since I'm splitting chunks on SPACE, the first chunk goes in as a single
word and is placed far away from the second one, like this:
"1C            h.10".

 

As I see it, an array of PointF within TextRenderInfo giving the actual
position of each character in user space would be very useful.
Alternatively, we could have a function that takes a char index and returns
that character's location in user space. I guess the second approach would
save some memory and unnecessary processing. An extension of this approach
could be a function that returns the bounding box of each character; that
would even solve the subscript/superscript problem.
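
As a rough illustration of the per-character idea, the sketch below computes
the horizontal start offset of every character in a chunk using the glyph
advance from the PDF specification, (w0 * Tfs + Tc + Tw) * Th, where word
spacing Tw applies only to the space character. It is plain C# with no iText
types: GlyphOffsets, Compute and the glyphWidth callback are hypothetical
names, the offsets are relative to the start of the chunk, and they would
still have to be mapped through the text matrix and CTM to reach user space.

```csharp
using System;

public static class GlyphOffsets
{
    // Returns text.Length + 1 offsets: offsets[i] is where character i
    // starts, and the last entry is the total advance of the chunk.
    // glyphWidth returns a character's width in 1/1000 text-space units
    // (hypothetical stand-in for a font width lookup).
    // horizontalScaling is a fraction (1.0 means 100%).
    public static float[] Compute(string text, Func<char, float> glyphWidth,
                                  float fontSize, float charSpacing,
                                  float wordSpacing, float horizontalScaling)
    {
        var offsets = new float[text.Length + 1];
        float x = 0f;
        for (int i = 0; i < text.Length; i++)
        {
            offsets[i] = x;
            float w0 = glyphWidth(text[i]) / 1000f;
            // Word spacing is added only after a space character.
            float tw = text[i] == ' ' ? wordSpacing : 0f;
            x += (w0 * fontSize + charSpacing + tw) * horizontalScaling;
        }
        offsets[text.Length] = x;
        return offsets;
    }
}
```

For the chunk "1C" with 500-unit glyphs, a font size of 10 and a character
spacing of 40, the "C" starts 45 units to the right of the "1" even though
the chunk contains no space, which is the situation described above. The
actual spacing values in that PDF are not known; these numbers are made up
for the example.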

 

Best,

Shujaat




