Hi,
I'm using PDFBox in my client's project to:
- set crop/media boxes to automatically crop whitespace and/or unwanted
content (the actual cut points are calculated with Ghostscript's bbox
device and text extraction from "suspect" unwanted areas)
- extract individual pages from foreign documents
- add overlays to existing documents (like a stamp "COPY" on an invoice
PDF, highlighting a particular area on a page, or "underlay" a page with
another document [e.g. 'business paper'])
- extract text from foreign documents (or parts of such documents) for
full-text-search
- "convert" images to PDF documents (in that case, one image per page)
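To illustrate the first item, here is a rough sketch of how the bbox step could feed a crop box. It assumes Ghostscript is run with something like `gs -q -dBATCH -dNOPAUSE -sDEVICE=bbox input.pdf`, which reports one `%%HiResBoundingBox:` line per page on stderr. The names `BboxCrop`, `Box`, `parseAll` and `expandWithin` are made up for this example and are not part of PDFBox or Ghostscript; the text-extraction check of "suspect" areas is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a PDFBox or Ghostscript API.
final class BboxCrop {

    /** Rectangle in PDF points, given by its lower-left and upper-right corners. */
    static final class Box {
        final double llx, lly, urx, ury;

        Box(double llx, double lly, double urx, double ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }

        /** Grows the box by margin points on every side, clamped to the page box. */
        Box expandWithin(double margin, Box page) {
            return new Box(
                    Math.max(page.llx, llx - margin),
                    Math.max(page.lly, lly - margin),
                    Math.min(page.urx, urx + margin),
                    Math.min(page.ury, ury + margin));
        }
    }

    // Matches lines such as "%%HiResBoundingBox: 36.000000 36.000000 576.000000 756.000000".
    private static final Pattern HIRES = Pattern.compile(
            "^%%HiResBoundingBox:\\s+(-?[\\d.]+)\\s+(-?[\\d.]+)\\s+(-?[\\d.]+)\\s+(-?[\\d.]+)$");

    /** Returns one Box per HiResBoundingBox line, in output order (one per page). */
    static List<Box> parseAll(String gsOutput) {
        List<Box> boxes = new ArrayList<>();
        for (String line : gsOutput.split("\\R")) {
            Matcher m = HIRES.matcher(line.trim());
            if (m.matches()) {
                boxes.add(new Box(
                        Double.parseDouble(m.group(1)),
                        Double.parseDouble(m.group(2)),
                        Double.parseDouble(m.group(3)),
                        Double.parseDouble(m.group(4))));
            }
        }
        return boxes;
    }
}
```

The resulting corner values would then be turned into a PDRectangle and applied with PDPage.setCropBox; how the rectangle is constructed depends on the PDFBox version in use.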
What I would like to do is "optimize" a document by removing everything
that is not related to the currently "visible" (possibly cropped) area
of the document, including metadata. I once asked about metadata removal
on the mailing list (see
http://mail-archives.apache.org/mod_mbox/pdfbox-dev/201307.mbox/%[email protected]%3E
) but since that is still "only" a nice-to-have for my project, I have
yet to look further into how to "write back the [modified] PDMetadata
stream" (and then supply a patch ;-] ).
Anyway, PDFBox has always been a very valuable tool for me. This survey
is a perfect occasion to say THANK YOU to the busy community!
Best regards,
-hannes erven