Re: Having trouble removing elements from a PDF (part II)

Maruan Sahyoun Mon, 04 Feb 2013 08:10:50 -0800

Hi,

The typical approach is to create a new PDF from the source, copying over only
the elements you are interested in. PDFMergerUtility gives some hints on how to
do that. Now to the PDStream question. The contents of a PDF are expressed
using basic objects such as strings, numbers, arrays and so on. These can be
written out literally or placed inside a kind of byte array, a stream (PDStream
in PDFBox), which may be compressed. A stream is typically used for larger
"collections" of PDF objects, or where the object itself contains a large
amount of data, e.g. an image. So to handle a stream you need to get at its
contents and then decide whether you are interested in every object within it,
in which case you can simply copy the entire stream with all its contents, or
only in some of them.
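
To make the "get at the contents, then decide" step a bit more concrete, here
is a minimal sketch that uses only java.util.zip and no PDFBox calls at all.
It assumes the stream body is Flate-compressed (zlib) and that the decoded
content happens to have one operator per line, which real content streams do
not guarantee. The class name, method names and sample page content are made
up for illustration. A real implementation would walk the parsed tokens
rather than split on newlines.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of PDFBox: decode a Flate-compressed stream
// body, drop selected text-showing lines, and compress the result again.
final class ContentStreamFilterSketch {

    /** Compress raw bytes into zlib format, as used by /FlateDecode. */
    public static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompress zlib data; throws IllegalArgumentException on bad input. */
    public static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /**
     * Naive, line-based filter (an assumption of this sketch): skip every
     * line that shows exactly the unwanted string with the Tj operator.
     */
    public static String dropTextShowing(String content, String unwanted) {
        StringBuilder kept = new StringBuilder();
        for (String line : content.split("\n")) {
            boolean showsText = line.endsWith(" Tj");
            if (showsText && line.contains("(" + unwanted + ")")) {
                continue; // leave this element out of the new stream
            }
            kept.append(line).append('\n');
        }
        return kept.toString();
    }

    /** Decode, filter, and re-encode a compressed stream body. */
    public static byte[] rewriteStream(byte[] compressed, String unwanted) {
        // ISO-8859-1 maps bytes to chars one-to-one, so nothing is lost.
        String decoded = new String(inflate(compressed), StandardCharsets.ISO_8859_1);
        String filtered = dropTextShowing(decoded, unwanted);
        return deflate(filtered.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        // Made-up page content: one body text block and one footnote block.
        String page = "BT\n/F1 12 Tf\n72 700 Td\n(Body text) Tj\nET\n"
                    + "BT\n/F1 8 Tf\n72 40 Td\n(Footnote 1) Tj\nET\n";
        byte[] stored = deflate(page.getBytes(StandardCharsets.ISO_8859_1));
        byte[] rewritten = rewriteStream(stored, "Footnote 1");
        System.out.print(new String(inflate(rewritten), StandardCharsets.ISO_8859_1));
    }
}
```

If you want everything in the stream, skip the filter and copy the compressed
bytes as they are. The sketch only drops the text-showing line itself, so the
now-empty BT ... ET block around it stays in the output.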


This is a simplified description of the model, so you may want to read the
start of the PDF spec to get a clearer picture of how a PDF is organized.
Section 7.3 of the ISO 32000 spec describes the basic objects as well as
streams.

Kind regards

Maruan Sahyoun

On 04.02.2013 at 15:56, Dominic Jacobssen <[email protected]> wrote:

> Hi all,
> 
> (Sorry for the non-threaded follow-up mail; I didn't receive my own
> mail from the list, so I couldn't reply to it and preserve the
> thread).
> 
> After a few more hours of bashing against the problem, I've got the
> following code. The key insight (which had completely eluded me, I'm
> afraid) is that the contents of each page are inside the dictionary
> value corresponding to the "/Contents" key, and that this is a
> compressed stream. (I've also since worked out that
> PDPage.getContents() is a simpler way of getting hold of this).
> 
> I'm successfully finding all textual elements on the page, among which
> the footnote string is visible. So there is hope.
> 
> However, this code is currently "read only": since I'm calling
> PDStream.getStream(), and the page at:
> 
>    http://pdfbox.apache.org/userguide/index.html
> 
> says, "A stream of data, typically compressed. This is used for page
> content.", I presume that I can't modify the stream in-place, so I
> presume that I'd need to create a new stream and add items to it one
> by one.
> 
> Is this the right approach? Can I recreate the original page by
> creating a new page, copying across the metadata, then writing objects
> from the "source" page to the "destination" page's stream one by one?
> 
> Many thanks,
> 
> Dominic
> 
> -- Code snippet starts here --
> 
> PDFMergerUtility merger = new PDFMergerUtility();
> 
> for (PDFDocument inputDocument : inputDocuments ) {
>    byte[] buffer = inputDocument.getBytes();
> 
>    ByteArrayInputStream bais = new ByteArrayInputStream ( buffer);
>    PDDocument pdDocument = PDDocument.load ( bais );
> 
>    if (pdDocument.isEncrypted()) {
>       try {
>           DecryptionMaterial dm = new StandardDecryptionMaterial("foo");
>           pdDocument.openProtection(dm);
>           System.out.println("Successfully decrypted file!");
>       } catch (CryptographyException e) {
>           e.printStackTrace();
>       } catch (BadSecurityHandlerException e) {
>           e.printStackTrace();
>       }
>    }
> 
>    @SuppressWarnings("unchecked")
>    List<PDPage> allPages = pdDocument.getDocumentCatalog().getAllPages();
>    int nPages = pdDocument.getNumberOfPages();
>    for (PDPage onePage : allPages) {
>       PDStream contents = onePage.getContents();
>       COSStream cosStream = contents.getStream();
> 
>       // This allows me to loop over the page's string content ...
>       for (Object token : cosStream.getStreamTokens()) {
>           if (token instanceof COSString) {
>               COSString cosString = (COSString) token;
>               String s = cosString.getString();
>               System.out.println("COSString: [" + s + "]");
>           }
>       }
> 
>       // ... but as this is a stream, it's not obvious how to
>       // modify the stream. Do I need to create a new one?
>       COSDocument doc = pdDocument.getDocument();
>       ByteArrayOutputStream baos = new ByteArrayOutputStream();
>       COSWriter cosWriter = new COSWriter(baos);
>       try {
>           cosWriter.write(doc);
>       } catch (COSVisitorException e) {
>           e.printStackTrace();
>       }
>       byte[] bytes = baos.toByteArray();
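>       // Named docStream because "bais" is already declared in the
>       // enclosing loop; Java does not allow redeclaring it here.
>       ByteArrayInputStream docStream = new ByteArrayInputStream(bytes);
>       merger.addSource(docStream);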
>    }
> }
> 
> try {
>    merger.setDestinationFileName("cat.pdf");
>    merger.mergeDocuments();
> } catch (COSVisitorException e) {
>    e.printStackTrace();
> }
