Hi,

I'm using PDFBox in my client's project to:


- set crop/media boxes to automatically crop whitespace and/or unwanted content (the actual cut points are calculated with Ghostscript's bbox device and with text extraction from "suspect" unwanted areas)

- extract individual pages from foreign documents

- add overlays to existing documents (like a "COPY" stamp on an invoice PDF, or highlighting a particular area on a page), or "underlay" a page with another document [e.g. 'business paper']

- extract text from foreign documents (or parts of such documents) for full-text search

- "convert" images to PDF documents (in that case, one image per page)
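
To make the first item a bit more concrete, here is a minimal sketch of how the cut points might be derived from Ghostscript's bbox output. This is not my actual project code: the class name BBoxCrop, the padding value and the clamping to the media box are assumptions made up for illustration. The only thing taken as given is that the bbox device prints a "%%HiResBoundingBox: llx lly urx ury" line in PDF points.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: turns Ghostscript bbox output into a padded crop rectangle. */
final class BBoxCrop {

    // Matches e.g. "%%HiResBoundingBox: 36.500000 72.250000 559.000000 720.750000"
    private static final Pattern HIRES = Pattern.compile(
            "%%HiResBoundingBox:\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)");

    /** Returns {llx, lly, urx, ury} in PDF points, or null if no bbox line is found. */
    static double[] parse(String gsOutput) {
        Matcher m = HIRES.matcher(gsOutput);
        if (!m.find()) {
            return null;
        }
        double[] box = new double[4];
        for (int i = 0; i < 4; i++) {
            box[i] = Double.parseDouble(m.group(i + 1));
        }
        return box;
    }

    /** Grows the box by 'padding' on every side, clamped to the media box (assumed policy). */
    static double[] pad(double[] box, double padding, double[] mediaBox) {
        return new double[] {
            Math.max(mediaBox[0], box[0] - padding),
            Math.max(mediaBox[1], box[1] - padding),
            Math.min(mediaBox[2], box[2] + padding),
            Math.min(mediaBox[3], box[3] + padding)
        };
    }
}
```

The resulting rectangle could then be handed to PDFBox via PDPage.setCropBox; I left that call out so the sketch runs without any dependency.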



What I would like to do is "optimize" a document by removing everything that is not related to the currently "visible" (possibly cropped) area of the document, including metadata. I once asked about metadata removal on the mailing list (see http://mail-archives.apache.org/mod_mbox/pdfbox-dev/201307.mbox/%[email protected]%3E ), but since that is still "only" a nice-to-have for my project, I have yet to look further into how to "write back the [modified] PDmetadata stream" (and then supply a patch ;-] ).


Anyway, PDFBox has always been a very valuable tool for me. This survey is a perfect occasion to say THANK YOU to the busy community!


Best regards,

        -hannes erven
