Hi,

We have been scratching our heads over this for a long time and are still unable to move ahead.
The task we want to perform is as follows. We want to remove all the images from a PDF. The tricky part is that an image may have text placed on top of it (geometrically), or the bounding rectangle of the image may contain some text. That text has to go as well.

The most straightforward way of doing that is:

1. Find all images in the document (assume a 1-page document) - Done.
2. Find the coordinates of these images - Done. Using PrintImageLocations.java as a base, we could find the bottom-left coordinates of each image, along with its height and width. Assuming this is correct, we are moving forward.
3. Find all the text in the document - Done.
4. Find the coordinates of this text - Done. Using PrintTextLocations.java as a base.
5. Remove all the text fragments whose coordinates intersect with any of the images on the page - Done.

In theory, these 5 steps should have given us the desired result. However, that was not the case. We discovered two things that were causing problems:

1. The coordinate systems for text and images are not the same. For text, Y increases downwards, whereas for images it is supposed to increase upwards. We believe this is the only difference.
2. The width and height of the image are being reported incorrectly (we think so). When we compare the coordinates of some of the text fragments on the PDF with the reported height and width of the image, the reported widths and heights are smaller than they should be!

Given this evidence, the following questions came up:

1. We know that device space and user space are different, and we found that the text space and the image space are different too. What is the relationship between the text space and the image space? In other words, how could I get the coordinates of all the PDXObjects in a PDF with respect to just one coordinate system?
2. It looks like the PDF may store the original image with its original dimensions (at the time of the PDF creation), and scale it while rendering, depending on the DPI. Is there a way PDFBox could give the start coordinate (bottom-left) and the end coordinate (top-right) of the rendered image at a given resolution? No heights, no widths, just the two corner coordinates.
3. Are we barking up the wrong tree? I know this should have been the first question. Is PDFBox capable of giving us what we want to achieve? I know the answer to the previous question, but are there other APIs that could make our task easier? Or, is the 5-step process that we have defined even correct in the first place?

We are using PDFBox 1.8.2 in a Windows environment for this.

Thanks and Regards.
Vishaal Jatav.
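
To make steps 2 to 5 concrete, here is a minimal sketch of the geometry we think we need, written without any PDFBox calls so that it can be run on its own. It rests on assumptions that may well be wrong: that an image is painted into the unit square (0,0)-(1,1) and placed on the page by the current transformation matrix [a b c d e f], that the text Y we get is the baseline measured from the top of the page, and that the page is unrotated with its origin at the bottom-left corner. The class and method names (ImageTextOverlapSketch, imageBounds, textBoundsYUp) are ours and purely illustrative.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

class ImageTextOverlapSketch {

    /**
     * Bounding box, in user space (Y increasing upwards), of an image painted
     * with the given matrix. Assumption: the image fills the unit square before
     * the matrix is applied, so its pixel size plays no role here.
     */
    public static Rectangle2D imageBounds(AffineTransform ctm) {
        double[] corners = {0, 0, 1, 0, 1, 1, 0, 1};
        double[] out = new double[8];
        ctm.transform(corners, 0, out, 0, 4);
        double minX = out[0], maxX = out[0], minY = out[1], maxY = out[1];
        for (int i = 1; i < 4; i++) {
            minX = Math.min(minX, out[2 * i]);
            maxX = Math.max(maxX, out[2 * i]);
            minY = Math.min(minY, out[2 * i + 1]);
            maxY = Math.max(maxY, out[2 * i + 1]);
        }
        return new Rectangle2D.Double(minX, minY, maxX - minX, maxY - minY);
    }

    /**
     * Converts a text fragment reported with Y measured from the top of the page
     * (Y increasing downwards) into a user-space rectangle (Y increasing upwards).
     * Assumption: yFromTop is the baseline, and the glyphs extend upwards by height.
     */
    public static Rectangle2D textBoundsYUp(double x, double yFromTop,
                                            double width, double height,
                                            double pageHeight) {
        return new Rectangle2D.Double(x, pageHeight - yFromTop, width, height);
    }

    public static void main(String[] args) {
        // PDF matrix [a b c d e f] maps directly onto the AffineTransform constructor.
        // Hypothetical values: a 200 x 100 point image with its corner at (50, 60).
        Rectangle2D image = imageBounds(new AffineTransform(200, 0, 0, 100, 50, 60));

        double pageHeight = 792; // hypothetical US Letter page, in points
        Rectangle2D overImage = textBoundsYUp(100, 700, 40, 10, pageHeight);
        Rectangle2D elsewhere = textBoundsYUp(100, 100, 40, 10, pageHeight);

        System.out.println(image.intersects(overImage)); // this fragment would be removed
        System.out.println(image.intersects(elsewhere)); // this fragment would be kept
    }
}
```

If this picture is right, the two corner coordinates asked for in question 2 are simply the minimum and maximum of the four transformed corners, and they do not depend on the stored pixel dimensions of the image. That would also explain why scaling the pixel width and height gives boxes that are too small.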

