Hi,

the typical approach is to create a new PDF from the source, copying over only the elements you are interested in. PDFMergerUtility gives you some hints on how to do that.

Now to the PDStream question. The contents of a PDF are expressed using basic objects such as strings, numbers, arrays …. These can be included literally or as part of a kind of byte array, a PDStream (which can be compressed ….). A stream is typically used for larger "collections" of PDF objects, or where the object itself contains a large amount of data, e.g. for images. So in order to handle a stream you need to get at its contents and then decide whether you need to pick out individual objects within it or can simply copy the entire stream with all of its contents.
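
To make the "get at the contents, then decide" step a bit more concrete, here is a minimal sketch in plain Java. It deliberately does not use PDFBox: the class StreamRewriteSketch and all of its methods are hypothetical names made up for illustration, and the tokenizer is a toy that assumes simple literal strings (no escapes, no nested parentheses) and only looks at the Tj operator. It decodes a Flate-compressed content stream, drops one text-showing operation, and writes the remaining tokens into a new compressed stream rather than modifying the old one in place.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical stand-in for a compressed page content stream; not PDFBox code. */
final class StreamRewriteSketch {

    /** Compresses raw bytes in zlib format, as a /FlateDecode stream stores them. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompresses a zlib/Flate stream back into its raw bytes. */
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not a Flate stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Toy tokenizer: "(...)" is one string token, everything else splits on whitespace. */
    static List<String> tokenize(String content) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < content.length()) {
            char c = content.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            int start = i;
            if (c == '(') {
                // Assumption: no escaped or nested parentheses inside the string.
                int end = content.indexOf(')', i);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated string");
                }
                i = end + 1;
            } else {
                while (i < content.length()
                        && !Character.isWhitespace(content.charAt(i))
                        && content.charAt(i) != '(') {
                    i++;
                }
            }
            tokens.add(content.substring(start, i));
        }
        return tokens;
    }

    /** Builds a NEW compressed stream without any "(...unwanted...) Tj" operation. */
    static byte[] rewrite(byte[] compressed, String unwanted) {
        String content = new String(inflate(compressed), StandardCharsets.ISO_8859_1);
        List<String> kept = new ArrayList<>();
        for (String token : tokenize(content)) {
            boolean showsUnwanted = token.equals("Tj")
                    && !kept.isEmpty()
                    && kept.get(kept.size() - 1).startsWith("(")
                    && kept.get(kept.size() - 1).contains(unwanted);
            if (showsUnwanted) {
                kept.remove(kept.size() - 1); // drop the string operand, skip the operator
            } else {
                kept.add(token);
            }
        }
        return deflate(String.join(" ", kept).getBytes(StandardCharsets.ISO_8859_1));
    }
}
```

Calling StreamRewriteSketch.rewrite(stream, "Footnote") on a compressed stream leaves every other token untouched. If you have no reason to look inside a stream (an image, say), the cheaper choice is the other branch described above: copy the stream's bytes across unchanged.
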
It's a simplified description of the model, but you may want to read the start of the PDF spec to make it clearer how a PDF is organized. Section 7.3 of the ISO 32000 spec describes the basic objects as well as streams.

Kind regards

Maruan Sahyoun

On 04.02.2013 at 15:56, Dominic Jacobssen <[email protected]> wrote:

> Hi all,
>
> (Sorry for the non-threaded follow-up mail; I didn't receive my own
> mail from the list, so I couldn't answer to it, thereby preserving the
> thread).
>
> After a few more hours of bashing against the problem, I've got the
> following code. The key insight (which had completely eluded me, I'm
> afraid) is that the contents of each page are inside the dictionary
> value corresponding to the "/Contents" key, and that this is a
> compressed stream. (I've also since worked out that
> PDPage.getContents() is a simpler way of getting hold of this).
>
> I'm successfully finding all textual elements on the page, among which
> the footnote string is visible. So there is hope.
>
> However, this code is currently "read only": since I'm calling
> PDStream.getStream(), and the page at:
>
> http://pdfbox.apache.org/userguide/index.html
>
> says, "A stream of data, typically compressed. This is used for page
> content.", I presume that I can't modify the stream in-place, so I
> presume that I'd need to create a new stream and add items to it one
> by one.
>
> Is this the right approach? Can I recreate the original page by
> creating a new page, copying across the metadata, then writing objects
> from the "source" page to the "destination" page's stream one by one?
>
> Many thanks,
>
> Dominic
>
> -- Code snippet starts here --
>
> PDFMergerUtility merger = new PDFMergerUtility();
>
> for (PDFDocument inputDocument : inputDocuments ) {
>     byte[] buffer = inputDocument.getBytes();
>
>     ByteArrayInputStream bais = new ByteArrayInputStream ( buffer);
>     PDDocument pdDocument = PDDocument.load ( bais );
>
>     if (pdDocument.isEncrypted()) {
>         try {
>             DecryptionMaterial dm = new StandardDecryptionMaterial("foo");
>             pdDocument.openProtection(dm);
>             System.out.println("Successfully decrypted file!");
>         } catch (CryptographyException e) {
>             e.printStackTrace();
>         } catch (BadSecurityHandlerException e) {
>             e.printStackTrace();
>         }
>     }
>
>     @SuppressWarnings("unchecked")
>     List<PDPage> allPages = pdDocument.getDocumentCatalog().getAllPages();
>     int nPages = pdDocument.getNumberOfPages();
>     for (PDPage onePage : allPages) {
>         PDStream contents = onePage.getContents();
>         COSStream cosStream = contents.getStream();
>
>         // This allows me to loop over the page's string content ...
>         for (Object token : cosStream.getStreamTokens()) {
>             if (token instanceof COSString) {
>                 COSString cosString = (COSString) token;
>                 String s = cosString.getString();
>                 System.out.println("COSString: [" + s + "]");
>             }
>         }
>
>         // ... but as this is a stream, it's not obvious how to
>         // modify the stream. Do I need to create a new one?
>         COSDocument doc = pdDocument.getDocument();
>         ByteArrayOutputStream baos = new ByteArrayOutputStream();
>         COSWriter cosWriter = new COSWriter(baos);
>         try {
>             cosWriter.write(doc);
>         } catch (COSVisitorException e) {
>             e.printStackTrace();
>         }
>         byte[] bytes = baos.toByteArray();
>         ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
>         merger.addSource(bais);
>     }
> }
>
> try {
>     merger.setDestinationFileName("cat.pdf");
>     merger.mergeDocuments();
> } catch (COSVisitorException e) {
>     e.printStackTrace();
> }

