Hi,

We have been scratching our heads over this for a long time and are still
unable to move ahead.

The task we want to perform is as follows. We want to remove all the images
from a PDF. The tricky part is that some images have text drawn on top of them
(geometrically); that is, the bounding rectangle of an image may contain some
text. That text should be removed along with the image.

The most trivial way of doing that is:

1.       Find all images in the document (assume a 1-page document) - Done.

2.       Find the coordinates of these images - Done. Using
PrintImageLocations.java as a base, we can find the bottom-left coordinates
of each image, along with its height and width. We are moving forward on the
assumption that these values are correct.

3.       Find all the text in the document - Done.

4.       Find the coordinates of this text - Done. Using
PrintTextLocations.java as a base.

5.       Remove all the text fragments whose coordinates intersect with any of 
the images in the page - Done.
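
Step 5 can be sketched roughly as follows. This is only an illustration of the
intersection test, assuming that the text and image boxes have already been
expressed in the same coordinate system. OverlapFilter and
keepTextOutsideImages are hypothetical names, not part of PDFBox.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class OverlapFilter {

    /**
     * Returns the text boxes that do not intersect any image box.
     * Assumes both lists use the same coordinate system.
     */
    static List<Rectangle2D> keepTextOutsideImages(List<Rectangle2D> textBoxes,
                                                   List<Rectangle2D> imageBoxes) {
        List<Rectangle2D> kept = new ArrayList<Rectangle2D>();
        for (Rectangle2D text : textBoxes) {
            boolean overlaps = false;
            for (Rectangle2D image : imageBoxes) {
                // Rectangle2D.intersects is true only if the interiors overlap;
                // boxes that merely share an edge are not counted.
                if (text.intersects(image)) {
                    overlaps = true;
                    break;
                }
            }
            if (!overlaps) {
                kept.add(text);
            }
        }
        return kept;
    }
}
```

Note that this test drops a fragment even when it only partially overlaps an
image; whether that is the desired behaviour is a design choice.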

In theory, the above five steps should have given us the desired result.
However, that was not the case. We discovered several things that were causing
problems:

1.       The coordinate systems for text and images are not the same. For text,
Y increases downwards, whereas for images it increases upwards. We believe this
is the only difference.

2.       The width and height of the image appear to be reported incorrectly.
When we compare the coordinates of some text fragments on the PDF with the
reported height and width of the image, the reported values are smaller than
they should be.
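
The Y-axis difference in point 1 can be illustrated with a small conversion.
This is a sketch under several assumptions: the page is not rotated, the page
origin is at (0, 0), and the text Y value is the distance from the top of the
page to the top edge of the text box (if the reported value is a baseline
instead, the height term needs adjusting). YFlip and toBottomLeftOrigin are
hypothetical names, not part of PDFBox.

```java
// Hypothetical helper, not part of PDFBox.
final class YFlip {

    /**
     * Converts a box given with a top-left origin (Y grows downward) into
     * {x, y, width, height} with a bottom-left origin (Y grows upward).
     * topY is the distance from the top of the page to the top edge of the box.
     */
    static double[] toBottomLeftOrigin(double x, double topY, double width,
                                       double height, double pageHeight) {
        // The bottom edge of the box, measured up from the bottom of the page.
        double bottomY = pageHeight - topY - height;
        return new double[] { x, bottomY, width, height };
    }
}
```

For example, on a 792-unit-high page, a 12-unit-high box whose top edge is 100
units below the top of the page has its bottom edge at y = 680 when measured
from the bottom.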

Based on these observations, the following questions came up:

1.       We know that device space and user space are different, and we found
that the text space and the image space are different too. What is the
relationship between the text space and the image space? In other words, how
can we get the coordinates of all the PDXObjects in a PDF with respect to a
single coordinate system?

2.       It looks like the PDF may store the original image with its original
dimensions (as of the time the PDF was created) and apply scaling at render
time, depending on the DPI. Is there a way for PDFBox to give the start
coordinate (bottom-left) and the end coordinate (top-right) of the rendered
image at a given resolution? No heights, no widths, just the two corner
coordinates.

3.       Are we barking up the wrong tree? I know this should have been the first
question. Is PDFBox capable of giving us what we want to achieve? I know the 
answer to the previous question, but are there other APIs that could make our 
task easier? Or, is the 5-step process that we have defined even correct in the 
first place?
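
For reference on question 2: in the PDF imaging model an image is painted into
a unit square, and the current transformation matrix (CTM) in effect when the
image is drawn maps that square into user space. The sketch below computes the
bottom-left and top-right corners from the six CTM entries [a b c d e f]. It
assumes those entries have already been obtained somehow (how to read them in
PDFBox 1.8.2 is not shown here). ImageBounds and boundsFromCtm are hypothetical
names, not part of PDFBox.

```java
import java.awt.geom.AffineTransform;

// Hypothetical helper, not part of PDFBox.
final class ImageBounds {

    /**
     * Maps the unit square through the CTM [a b c d e f] and returns the
     * axis-aligned bounds {minX, minY, maxX, maxY} in user-space units.
     */
    static double[] boundsFromCtm(double a, double b, double c,
                                  double d, double e, double f) {
        // AffineTransform takes its entries in the same order as a PDF matrix.
        AffineTransform ctm = new AffineTransform(a, b, c, d, e, f);
        double[] corners = { 0, 0, 1, 0, 1, 1, 0, 1 }; // unit square, x/y pairs
        double[] mapped = new double[8];
        ctm.transform(corners, 0, mapped, 0, 4);

        double minX = mapped[0], maxX = mapped[0];
        double minY = mapped[1], maxY = mapped[1];
        for (int i = 2; i < mapped.length; i += 2) {
            minX = Math.min(minX, mapped[i]);
            maxX = Math.max(maxX, mapped[i]);
            minY = Math.min(minY, mapped[i + 1]);
            maxY = Math.max(maxY, mapped[i + 1]);
        }
        return new double[] { minX, minY, maxX, maxY };
    }
}
```

The result is in user-space units (1/72 inch by default), so it does not depend
on the pixel dimensions stored with the image; for a given rendering DPI the
values would be multiplied by dpi / 72.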

We are using PDFBox 1.8.2 in a Windows environment for this.

Thanks and Regards.
Vishaal Jatav.
