As usual, my ignorance is the problem.  I really appreciate you pointing out 
this mistake.  I have been confused about where 0,0 was in PDF for a while.  I 
was looking at it the other day with Acrobat and Enfocus PitStop and totally 
confused myself.  So I guess it depends on the tools you're using.

Again, Thanks!
Darren


On Monday, September 8, 2014 4:13 AM, mkl <m...@wir-sind-cool.org> wrote:
 


Darren,

FDnC Red wrote
> I'm attaching the sample PDF that I'm parsing, the Program.cs that I'm
> using, and an XML file which is the output of MuPDF's mudraw.exe which I'm
> using as "ground truth" data because the stroke_path matrix exactly
> matches where the lines are in the PDF.

I am predominantly working on the Java side, so I had to translate your
program to Java. Then I looked at its output, and that output matches the
MuPDF output exactly (looking at it from the correct angle):

FindPdfLines output:

Start X,Y= 19.96,538.9747 Length=716.89 Height=0.0
1.0    0.0    0.0
0.0    1.0    0.0
19.96    538.9747    1.0
Start X,Y= 19.96,399.63 Length=716.89 Height=0.0
1.0    0.0    0.0
0.0    1.0    0.0
19.96    399.63    1.0
Start X,Y= 19.96,268.3525 Length=716.89 Height=0.0
1.0    0.0    0.0
0.0    1.0    0.0
19.96    268.3525    1.0
Start X,Y= 19.96,141.3561 Length=716.89 Height=0.0
1.0    0.0    0.0
0.0    1.0    0.0
19.96    141.3561    1.0
Start X,Y= 184.01,538.4 Length=0.0 Height=509.96
1.0    0.0    0.0
0.0    1.0    0.0
184.01    538.4    1.0
Start X,Y= 368.6952,659.88 Length=0.0 Height=631.44
1.0    0.0    0.0
0.0    1.0    0.0
368.6952    659.88    1.0
Start X,Y= 561.25,538.4 Length=0.0 Height=509.96
1.0    0.0    0.0
0.0    1.0    0.0
561.25    538.4    1.0

MuPDF output:

<stroke_path linewidth="0.5" miterlimit="4" linecap="0,0,0" linejoin="0"
colorspace="DeviceCMYK" color="0 0 0 0.5" matrix="1 0 0 -1 19.96 199.025">
<moveto x="0" y="0"/>
<lineto x="716.89" y="0"/>
</stroke_path>
<stroke_path linewidth="0.5" miterlimit="4" linecap="0,0,0" linejoin="0"
colorspace="DeviceCMYK" color="0 0 0 0.5" matrix="1 0 0 -1 19.96 338.37">
<moveto x="0" y="0"/>
<lineto x="716.89" y="0"/>
</stroke_path>
<stroke_path linewidth="0.5" miterlimit="4" linecap="0,0,0" linejoin="0"
colorspace="DeviceCMYK" color="0 0 0 0.5" matrix="1 0 0 -1 19.96 469.647">
<moveto x="0" y="0"/>
<lineto x="716.89" y="0"/>
</stroke_path>
<stroke_path linewidth="0.5" miterlimit="4" linecap="0,0,0" linejoin="0"
colorspace="DeviceCMYK" color="0 0 0 0.5" matrix="1 0 0 -1 19.96 596.644">
<moveto x="0" y="0"/>
<lineto x="716.89" y="0"/>
</stroke_path>
<stroke_path linewidth="0.5" miterlimit="4" linecap="0,0,0" linejoin="0"
colorspace="DeviceCMYK" color="0 0 0 1" matrix="1 0 0 -1 184.01 199.6">
<moveto x="0" y="0"/>
<lineto x="0" y="-509.96"/>
</stroke_path>
<stroke_path linewidth="0.5" miterlimit="4" linecap="0,0,0" linejoin="0"
colorspace="DeviceCMYK" color="0 0 0 1" matrix="1 0 0 -1 368.695 78.12">
<moveto x="0" y="0"/>
<lineto x="0" y="-631.44"/>
</stroke_path>
<stroke_path linewidth="0.5" miterlimit="4" linecap="0,0,0" linejoin="0"
colorspace="DeviceCMYK" color="0 0 0 1" matrix="1 0 0 -1 561.25 199.6">
<moveto x="0" y="0"/>
<lineto x="0" y="-509.96"/>
</stroke_path>

Let's look at the first line to resolve the ostensible differences:

Start X,Y= 19.96,538.9747 Length=716.89 Height=0.0
1.0    0.0    0.0
0.0    1.0    0.0
19.96    538.9747    1.0

<stroke_path linewidth="0.5" miterlimit="4" linecap="0,0,0" linejoin="0"
colorspace="DeviceCMYK" color="0 0 0 0.5" matrix="1 0 0 -1 19.96 199.025">
<moveto x="0" y="0"/>
<lineto x="716.89" y="0"/>
</stroke_path>

The obvious difference is that FindPdfLines already applied the
transformation matrix while MuPDF has not yet applied it. After applying it
for the MuPDF data, the starting point is at (x,y) = (19.96, 199.025).

So the x coordinates already match, but the y coordinates seem to not match
at all.

But they only /seem/ to not match. As soon as one realizes that the outputs
are given in different coordinate systems, they do match! iTextSharp gives
you the coordinates in the native PDF default user space coordinates, i.e.
it uses the PDF page media box (in your file [0.0 0.0 756.0 738.0]) with
/0,0 being the lower left corner and 756,738 being the upper right/. MuPDF,
on the other hand, uses a different coordinate system more common in other
image formats with /0,0 being the upper left corner and 756,738 being the
lower right/.

To transform the coordinates of an individual point between these coordinate
systems, you keep the x coordinate and subtract the y coordinate from 738.

After doing that transformation (and allowing for minor differences due to
the lossy float arithmetic), the coordinates match:

FindPdfLines: 19.96,538.9747
MuPDF: 19.96,538.975 (=738-199.025)

The same is true for the other lines.
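
As a minimal sketch of that conversion (the class and method names here are
made up for illustration, and it assumes the media box starts at 0,0 as in
your file; for a media box with a different lower left corner you would have
to take that offset into account as well):

```java
import java.util.Locale;

public class FlipY {
    // Converts a y coordinate between the bottom-left-origin system
    // (PDF default user space) and the top-left-origin system.
    // The same formula works in both directions.
    public static double flipY(double y, double pageHeight) {
        return pageHeight - y;
    }

    public static void main(String[] args) {
        double pageHeight = 738.0; // height of the media box [0 0 756 738]
        // y values taken from the MuPDF matrices of the horizontal lines
        double[] muPdfY = {199.025, 338.37, 469.647, 596.644};
        for (double y : muPdfY) {
            System.out.printf(Locale.ROOT, "%.3f -> %.3f%n", y, flipY(y, pageHeight));
        }
    }
}
```

The x coordinate is left alone, so only y needs a helper like this.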

That being said, your code will work for very special documents only, because

1) You assume the code for lines to always be that identical sequence of
operations with differences only in the cm and l operands. In general there
can be other operations in-between (e.g. operations setting the color or
rendering mode, or even whole text blocks). Furthermore the operands of the
m operator need not be 0 0. And, of course, some of your operators are not
required, e.g. there need not be q, Q, or cm operators at all.

2) You process cm and Q only if they are preceded by operands according to
your assumed sequence of operators of a line. Thus, you only process some of
the transformation matrix concatenations and you only undo (restore state)
some transformation matrix changes.

3) By applying Math.Abs to the l operands, you throw away the information
whether the line goes left or right from the starting point, and whether it
goes up or down.

Thus, your code may serve as a proof of concept but not for general use.
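
To illustrate the three points above, here is a rough sketch of how the
bookkeeping could look when every q, Q, cm, m and l is handled independently
of what precedes it. This is not iText code: the class LineTracker and its
method names are made up, it only tracks the transformation matrix (a real
graphics state holds much more), it does not check whether a path is actually
stroked, and the operand values in main are merely assumed from the sample
output above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class LineTracker {
    // Current transformation matrix as the six PDF numbers [a b c d e f].
    private double[] ctm = {1, 0, 0, 1, 0, 0};
    // Matrices saved by q and restored by Q.
    private final Deque<double[]> saved = new ArrayDeque<>();
    // Current point, already transformed into default user space.
    private double[] current = null;
    // Collected segments as {x1, y1, x2, y2}.
    public final List<double[]> segments = new ArrayList<>();

    // q: save the graphics state (only the CTM in this sketch).
    public void saveState() { saved.push(ctm.clone()); }

    // Q: restore the most recently saved CTM, if there is one.
    public void restoreState() { if (!saved.isEmpty()) ctm = saved.pop(); }

    // cm: new CTM = operand matrix x old CTM.
    public void concat(double a, double b, double c, double d, double e, double f) {
        double[] n = ctm;
        ctm = new double[] {
            a * n[0] + b * n[2],
            a * n[1] + b * n[3],
            c * n[0] + d * n[2],
            c * n[1] + d * n[3],
            e * n[0] + f * n[2] + n[4],
            e * n[1] + f * n[3] + n[5]
        };
    }

    // Applies the current CTM to a point.
    private double[] transform(double x, double y) {
        return new double[] {
            ctm[0] * x + ctm[2] * y + ctm[4],
            ctm[1] * x + ctm[3] * y + ctm[5]
        };
    }

    // m: start a new subpath; the operands need not be 0 0.
    public void moveTo(double x, double y) { current = transform(x, y); }

    // l: record a segment from the current point, keeping the operand signs.
    public void lineTo(double x, double y) {
        if (current == null) return; // no current point; ignored in this sketch
        double[] end = transform(x, y);
        segments.add(new double[] {current[0], current[1], end[0], end[1]});
        current = end;
    }

    public static void main(String[] args) {
        LineTracker t = new LineTracker();
        t.saveState();                        // q
        t.concat(1, 0, 0, 1, 184.01, 538.4);  // 1 0 0 1 184.01 538.4 cm
        t.moveTo(0, 0);                       // 0 0 m
        t.lineTo(0, -509.96);                 // 0 -509.96 l (assumed sign)
        t.restoreState();                     // Q
        for (double[] s : t.segments) {
            System.out.println(s[0] + "," + s[1] + " -> " + s[2] + "," + s[3]);
        }
    }
}
```

Because both end points are kept, the direction of each line is preserved,
and length or height can still be derived from them afterwards if needed.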

Regards,   Michael




--
View this message in context: 
http://itext-general.2136553.n4.nabble.com/Detect-Lines-in-PDF-tp4660295p4660349.html
Sent from the iText - General mailing list archive at Nabble.com.

------------------------------------------------------------------------------
Want excitement?
Manually upgrade your production database.
When you want reliability, choose Perforce
Perforce version control. Predictably reliable.
http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk

_______________________________________________
iText-questions mailing list
iText-questions@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/itext-questions

iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a reference 
to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples: 
http://itextpdf.com/themes/keywords.php
