How to use PDFBox to extract all text on a page that is NOT behind an image?

Orit Prince Mon, 15 Nov 2021 01:47:33 -0800

Hi
 
I want to extract only characters which are visible, i.e. not covered by an 
image. 
Here is a link to a one-page sample PDF:
https://drive.google.com/file/d/14qy_GPS3dzXI-meJiCKkvqwUb59Q1yWk/view?usp=sharing
 
It has some text that is covered by the image at the top right corner: "ANNUAL 
REPORT 2018".
All other characters are printed on top of the image.
 
I tried running the code from here:
https://stackoverflow.com/questions/66607663/how-to-use-pdfbox-to-extract-all-text-on-a-page-that-is-not-behind-an-image#
 
And the code from here:
https://stackoverflow.com/questions/69703154/differ-between-text-above-image-and-text-covered-by-image
 
With both approaches, I cannot get the string "ANNUAL REPORT 2018" to be 
detected as hidden (= covered), and the string "Destination2050" to be detected 
as visible (= on top of the image).
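
To make the intended check concrete, here is a minimal sketch of the rule in 
plain Java, without any PDFBox calls. It assumes that the bounding boxes of 
glyph runs and images have already been collected in paint order (for example 
with a PDFGraphicsStreamEngine subclass, as the linked answers do; that part is 
not shown). The class and method names (VisibleTextSketch, Painted, 
visibleText) and the coordinates are made up for illustration. The sketch 
ignores clipping, transparency, image masks and partial overlap.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class VisibleTextSketch {

    /** One painted item, stored in the order the page paints it (hypothetical type). */
    static final class Painted {
        final boolean isImage;
        final String text;      // text of the glyph run, null for images
        final Rectangle2D box;  // bounding box in page coordinates

        Painted(boolean isImage, String text, Rectangle2D box) {
            this.isImage = isImage;
            this.text = text;
            this.box = box;
        }
    }

    static Painted image(double x, double y, double w, double h) {
        return new Painted(true, null, new Rectangle2D.Double(x, y, w, h));
    }

    static Painted glyphs(String text, double x, double y, double w, double h) {
        return new Painted(false, text, new Rectangle2D.Double(x, y, w, h));
    }

    /**
     * Returns the text of every glyph run that is NOT fully covered by an
     * image painted after it. Text painted after an image stays visible.
     */
    static String visibleText(List<Painted> paintOrder) {
        StringBuilder visible = new StringBuilder();
        for (int i = 0; i < paintOrder.size(); i++) {
            Painted item = paintOrder.get(i);
            if (item.isImage) {
                continue;
            }
            boolean covered = false;
            // Only images painted later than the text can hide it.
            for (int j = i + 1; j < paintOrder.size(); j++) {
                Painted later = paintOrder.get(j);
                if (later.isImage && later.box.contains(item.box)) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                visible.append(item.text);
            }
        }
        return visible.toString();
    }

    public static void main(String[] args) {
        // Assumed layout, loosely modelled on the sample page described above.
        List<Painted> page = new ArrayList<>();
        page.add(glyphs("ANNUAL REPORT 2018", 400, 700, 150, 12)); // painted first
        page.add(image(0, 0, 600, 800));                           // painted over it
        page.add(glyphs("Destination2050", 50, 400, 120, 12));     // painted on top
        System.out.println(visibleText(page)); // prints "Destination2050"
    }
}
```

The key design point is that overlap alone is not enough: the check also has 
to compare paint order, because text drawn after the image lies on top of it 
and remains visible.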
 
Any help would be much appreciated!
Thanks
Orit


