Coordinates of Images and Text

vishaal jatav Sun, 08 Sep 2013 07:43:40 -0700

Hi,

We have been scratching our heads over this for a long time and are still
unable to move ahead.

The task we want to perform is as follows. We want to remove all the images
from a PDF. The tricky part is that an image may have some text lying on top
of it (geometrically), or, more generally, the bounding rectangle of the image
may contain some text.

The most trivial way of doing that is:

1. Find all images in the document (assume a 1-page document) - Done.

2. Find the coordinates of these images - Done. Using
PrintImageLocations.java as a base, we could find the bottom-left coordinates
of each image, along with its height and width. Assuming this is correct, we
moved forward.

3. Find all the text in the document - Done.

4. Find the coordinates of this text - Done. Using
PrintTextLocations.java as a base.

5. Remove all the text fragments whose coordinates intersect with any of
the images in the page - Done.
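
Reduced to plain geometry, step 5 could be sketched as below. This is only a
minimal illustration, not PDFBox code: FragmentFilter and keepOutsideImages
are hypothetical names, and the sketch assumes that every rectangle has
already been brought into one common coordinate system.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: step 5 as a plain rectangle test.
class FragmentFilter {

    // Returns the text boxes that do not overlap any image box.
    // Assumption: all rectangles use the same origin, axis direction and units.
    static List<Rectangle2D> keepOutsideImages(List<Rectangle2D> textBoxes,
                                               List<Rectangle2D> imageBoxes) {
        List<Rectangle2D> kept = new ArrayList<Rectangle2D>();
        for (Rectangle2D text : textBoxes) {
            boolean overlapsImage = false;
            for (Rectangle2D image : imageBoxes) {
                // True when the interiors of the two rectangles overlap.
                if (text.intersects(image)) {
                    overlapsImage = true;
                    break;
                }
            }
            if (!overlapsImage) {
                kept.add(text);
            }
        }
        return kept;
    }
}
```

Note that Rectangle2D.intersects treats a rectangle with zero width or height
as empty, so a text fragment needs a non-zero box for this test to catch it.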

In theory, the above 5 steps should have given us the desired result.
However, they did not. We discovered several things that were causing
problems:

1. The coordinate systems for text and images are not the same. For text,
Y increases downwards, whereas for images it is supposed to increase
upwards. We believe this is the only difference.

2. The width and height of the image are being reported incorrectly (or so
we think). When we compare the coordinates of some of the text fragments on
the PDF with the reported height and width of the image, we find that the
reported widths and heights are smaller than they should be!
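
For the first observation, the conversion we have in mind is a simple flip of
the Y axis. The sketch below assumes an unrotated page whose origin sits at
the bottom-left corner, with the page height given in the same units as the
box. YFlip and toTopDown are hypothetical names, not PDFBox API.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
class YFlip {

    // Takes a box described by its bottom-left corner in a Y-up system and
    // returns the same box described by its top-left corner in a Y-down
    // system whose origin is the top-left corner of the page.
    // Assumption: no page rotation, pageHeight in the same units as the box.
    static Rectangle2D.Double toTopDown(Rectangle2D bottomUpBox, double pageHeight) {
        double topEdgeFromTop =
                pageHeight - (bottomUpBox.getY() + bottomUpBox.getHeight());
        return new Rectangle2D.Double(bottomUpBox.getX(), topEdgeFromTop,
                bottomUpBox.getWidth(), bottomUpBox.getHeight());
    }
}
```

Width and height are unchanged by the flip; only the Y value moves.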

Given these observations, the following questions came up:

1. We know that the Device space and the User space are different. And we
found that the Text space and the Image space are different too. What is the
relationship between the Text space and the Image space? Or, in other words,
how could I get the coordinates of all the PDXObjects in a PDF with respect to
just one coordinate system?

2. It looks like the PDF may store the original image with its original
dimensions (as of the time the PDF was created), and that the image is scaled
at rendering time, depending on the DPI. Is there a way PDFBox could give the
start coordinate (bottom-left) and the end coordinate (top-right) of the
rendered image at a given resolution? No heights, no widths, just the two
corner coordinates.

3. Are we barking up the wrong tree? I know this should have been the first
question. Is PDFBox capable of giving us what we want to achieve? I know the
answer to the previous question, but are there other APIs that could make our
task easier? Or, is the 5-step process that we have defined even correct in the
first place?
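
To make question 2 concrete: a PDF image is painted into the unit square, and
the current transformation matrix [a b c d e f] maps that square onto the
page. The sketch below derives the two corners from those six values, which
we assume are the ones PrintImageLocations.java reads from the graphics
state. It also assumes the default user space unit of 1/72 inch when
converting to pixels. ImageBounds, userSpaceBounds and toPixels are
hypothetical names, not PDFBox API.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
class ImageBounds {

    // Maps the unit square through the matrix [a b c d e f] and returns the
    // axis-aligned bounding box, so rotated or skewed images are covered too.
    // getMinX()/getMinY() is the bottom-left corner and getMaxX()/getMaxY()
    // the top-right corner in user space.
    static Rectangle2D userSpaceBounds(double a, double b, double c,
                                       double d, double e, double f) {
        AffineTransform ctm = new AffineTransform(a, b, c, d, e, f);
        Rectangle2D unitSquare = new Rectangle2D.Double(0, 0, 1, 1);
        return ctm.createTransformedShape(unitSquare).getBounds2D();
    }

    // Scales a user-space box to pixels at the given resolution.
    // Assumption: one user space unit is 1/72 inch (the PDF default).
    static Rectangle2D toPixels(Rectangle2D userBox, double dpi) {
        double scale = dpi / 72.0;
        return new Rectangle2D.Double(userBox.getX() * scale,
                userBox.getY() * scale,
                userBox.getWidth() * scale,
                userBox.getHeight() * scale);
    }
}
```

With a = 200, d = 100, e = 50, f = 300 and b = c = 0, for example, the
corners are (50, 300) and (250, 400), regardless of how many pixels the
stored image has.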

We are using PDFBox 1.8.2 in a Windows environment for this.

Thanks and Regards.
Vishaal Jatav.
