Re: [iText-questions] PDF Javascript Stripper

Mark Storer Fri, 09 Oct 2009 10:58:12 -0700

It looks like you have all the bases covered.  The only way you might be more 
thorough would be to iterate through all the indirect objects by object number:


int numObjs = reader.getXrefSize();
for (int i = 0; i < numObjs; ++i) {
  // Fetching by object number means no indirect references to worry about.
  PdfObject curObj = reader.getPdfObject( i );
  if (curObj == null) {
    continue; // free or missing entry (object 0 is always free)
  }
  // if curObj is a dict, traverse it as a dict
  // if it's an array, traverse it as an array
}

You can then modify your traverse functions to ignore indirect references 
(because you either have already addressed them or will get to them later in 
the loop).  You don't need to worry about revisiting objects either, so you 
can ditch the "traversed" set.
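
To make the idea concrete, here is a toy model of the flat scan in plain Java 
collections.  None of this is iText: Ref, strip(), stripAll() and sample() are 
hypothetical names, a Map stands in for a dictionary, a List for an array, and 
the outer List for the object table.  The one thing it adds to the description 
above is that direct (inline) children still need recursion, since they have 
no object number of their own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these are NOT iText classes.
class FlatScanDemo {
    /** Stand-in for an indirect reference ("n 0 R"). */
    static final class Ref {
        final int num;
        Ref(int num) { this.num = num; }
    }

    /** Blanks every "JS" entry in value and its direct children; never follows a Ref. */
    @SuppressWarnings("unchecked")
    static boolean strip(Object value) {
        boolean found = false;
        if (value instanceof Map) {
            for (Map.Entry<String, Object> e : ((Map<String, Object>) value).entrySet()) {
                if ("JS".equals(e.getKey())) {
                    e.setValue("");              // nullify the script
                    found = true;
                } else {
                    found |= strip(e.getValue()); // direct child: recurse
                }
            }
        } else if (value instanceof List) {
            for (Object item : (List<Object>) value) {
                found |= strip(item);
            }
        }
        return found; // Ref, String, null: nothing to do here
    }

    /** Visits each slot of the object table exactly once, so no "traversed" set. */
    static boolean stripAll(List<Object> xref) {
        boolean found = false;
        for (Object obj : xref) {
            found |= strip(obj);
        }
        return found;
    }

    /** Two objects that point at each other: a field (1) and its widget (2). */
    static List<Object> sample() {
        Map<String, Object> action = new LinkedHashMap<>();
        action.put("S", "JavaScript");
        action.put("JS", "app.alert('hi')");

        Map<String, Object> field = new LinkedHashMap<>();   // object 1
        field.put("Kids", new ArrayList<Object>(Arrays.<Object>asList(new Ref(2))));

        Map<String, Object> widget = new LinkedHashMap<>();  // object 2
        widget.put("Parent", new Ref(1)); // cycle: 1 -> 2 -> 1
        widget.put("A", action);          // direct (inline) action dictionary

        return new ArrayList<Object>(Arrays.<Object>asList(null, field, widget));
    }

    public static void main(String[] args) {
        System.out.println("found JS: " + stripAll(sample()));
    }
}
```

The Parent/Kids cycle never causes a loop because strip() stops at every Ref; 
the object on the other side gets its own turn in stripAll().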

Following these suggestions (plus a little code cleanup), traversePdfDictionary 
would look like this:
    public static boolean traversePdfDictionary(final PdfDictionary dict,
            /*final Set traversed,*/ boolean containsJs) {
        /* The "traversed" bookkeeping is no longer needed:
        if (traversed.contains(dict)) {
            return containsJs;
        } else {
            traversed.add(dict);
        } */

        for (Object key : dict.getKeys()) {
            PdfObject data = (PdfObject) dict.get((PdfName) key);
            /* removed per above suggestion:
            if (data instanceof PRIndirectReference) {
                data = resolveReference((PRIndirectReference) data);
            } */

            if (PdfName.JS.equals( key )) {
                // you could use a single empty PdfString over and over here,
                // throughout the entire document
                dict.put(PdfName.JS, new PdfString(""));
                containsJs = true;
            }
            // Parents are ALWAYS indirect references, so the object-by-object
            // stepping will find them.  Nonetheless, here's some code cleanup
            // to give you ideas in the future:
            /* if (PdfName.PARENT.equals( key )) {
                if (data.isDictionary()) {
                    containsJs |= traversePdfDictionary((PdfDictionary) data, traversed, containsJs);
                } else if (data.isArray()) {
                    containsJs |= traversePdfArray((PdfArray) data, traversed, containsJs);
                }
            } */
        }
        return containsJs;
    }

Furthermore, the only reason I can think of to traverse arrays would be to 
stomp on the document level javascripts, but that can be done more directly:

reader.getCatalog().getAsDict( PdfName.NAMES ).remove( PdfName.JAVASCRIPT );

There's a potential NPE in there (/Names is optional), but I'll leave that as 
an exercise for the reader.
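
For what it's worth, the shape of that guard can be sketched with plain Java 
maps standing in for the catalog and the /Names dictionary.  This is a model, 
not iText code: NamesGuardDemo and removeDocumentScripts() are hypothetical 
names.  With iText the same idea would be to hold the result of the catalog 
lookup in a local variable and check it for null before calling remove().

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: a Map stands in for a PDF dictionary.
class NamesGuardDemo {
    /**
     * Removes the "JavaScript" entry from the catalog's "Names" dictionary.
     * Returns true if an entry was removed, false if there was nothing to do.
     */
    @SuppressWarnings("unchecked")
    static boolean removeDocumentScripts(Map<String, Object> catalog) {
        Object names = catalog.get("Names");
        if (!(names instanceof Map)) {
            return false; // "Names" is optional, so this is not an error
        }
        return ((Map<String, Object>) names).remove("JavaScript") != null;
    }

    public static void main(String[] args) {
        Map<String, Object> bare = new HashMap<>();      // catalog without "Names"
        System.out.println(removeDocumentScripts(bare)); // no exception thrown

        Map<String, Object> names = new HashMap<>();
        names.put("JavaScript", "name tree goes here");
        Map<String, Object> catalog = new HashMap<>();
        catalog.put("Names", names);
        System.out.println(removeDocumentScripts(catalog));
    }
}
```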

You could also remove any "AA" (additional actions) entries from pages and 
annotations.

Finally, you'll want to call reader.removeUnusedObjects() prior to 
stamper.close() (line 149-ish).  Even if you don't change anything else, the 
value of a /JS can be a stream reference.  With all these other suggestions, 
you're looking at Quite A Few orphaned objects loitering around your file.

We end up sticking around 100k of boilerplate script into every PDF form we 
generate, plus a fair amount of overhead from the objects wrapping all that 
script (a test I just ran shows the total savings at 75kb).  Without the call 
to removeUnusedObjects(), your program wouldn't noticeably change the size of 
the file when those scripts are in streams.
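
Here is a rough illustration of why the script streams linger.  It is a toy 
mark-and-sweep over a model object table in plain Java collections, not a 
description of how iText implements removeUnusedObjects().  Ref, mark(), 
sweep() and sample() are hypothetical names.  Blanking a /JS value that was a 
reference leaves the referenced object in the table with nothing pointing at 
it, until something walks the graph from the trailer and drops what it cannot 
reach.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these are NOT iText classes.
class OrphanSweepDemo {
    /** Stand-in for an indirect reference ("n 0 R"). */
    static final class Ref {
        final int num;
        Ref(int num) { this.num = num; }
    }

    /** Marks every object reachable from value, following references. */
    static void mark(Object value, List<Object> xref, boolean[] live) {
        if (value instanceof Ref) {
            int n = ((Ref) value).num;
            if (n <= 0 || n >= xref.size() || live[n]) {
                return; // out of range, or already visited (handles cycles)
            }
            live[n] = true;
            mark(xref.get(n), xref, live);
        } else if (value instanceof Map) {
            for (Object child : ((Map<?, ?>) value).values()) {
                mark(child, xref, live);
            }
        } else if (value instanceof List) {
            for (Object child : (List<?>) value) {
                mark(child, xref, live);
            }
        }
    }

    /** Nulls out every slot not reachable from the trailer; returns how many. */
    static int sweep(List<Object> xref, Object trailer) {
        boolean[] live = new boolean[xref.size()];
        mark(trailer, xref, live);
        int removed = 0;
        for (int i = 1; i < xref.size(); i++) {
            if (!live[i] && xref.get(i) != null) {
                xref.set(i, null);
                removed++;
            }
        }
        return removed;
    }

    /** Object 1: catalog with an inline action whose JS points at object 2. */
    static List<Object> sample() {
        Map<String, Object> action = new HashMap<>();
        action.put("S", "JavaScript");
        action.put("JS", new Ref(2));
        Map<String, Object> catalog = new HashMap<>();
        catalog.put("OpenAction", action);
        return new ArrayList<Object>(
                Arrays.<Object>asList(null, catalog, "app.alert('hi')"));
    }

    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        List<Object> xref = sample();
        Map<String, Object> trailer = new HashMap<>();
        trailer.put("Root", new Ref(1));

        Map<String, Object> catalog = (Map<String, Object>) xref.get(1);
        ((Map<String, Object>) catalog.get("OpenAction")).put("JS", "");
        System.out.println("orphans removed: " + sweep(xref, trailer));
    }
}
```

Unlike a flat scan by object number, this walk does follow references, which 
is why it needs the live[] markers to avoid looping.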

--Mark Storer 
  Senior Software Engineer 
  Cardiff.com

#include <disclaimer> 
typedef std::Disclaimer<Cardiff> DisCard; 



> -----Original Message-----
> From: Andrea Lombardoni [mailto:andrea.lombard...@oneoverzero.net]
> Sent: Friday, October 09, 2009 8:10 AM
> To: itext-questions@lists.sourceforge.net
> Subject: [iText-questions] PDF Javascript Stripper
> 
> 
> I just released a small project that uses iText: PDF 
> Javascript Stripper
> 
> It takes a PDF as input and tries to remove (better, nullify) 
> all the Javascript
> code inside it.
> 
> My company uses it to be sure that we do not produce/deliver 
> PDF documents with
> malicious content.
> 
> I hope it can be interesting to other people as well.
> You can download it here:
> 
> https://sourceforge.net/projects/pdfjavascriptst/
> 
> 
> 
> --------------------------------------------------------------
> ----------------
> Come build with us! The BlackBerry(R) Developer Conference in SF, CA
> is the only developer event you need to attend this year. 
> Jumpstart your
> developing skills, take BlackBerry mobile applications to 
> market and stay 
> ahead of the curve. Join us from November 9 - 12, 2009. Register now!
> http://p.sf.net/sfu/devconference
> _______________________________________________
> iText-questions mailing list
> iText-questions@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/itext-questions
> 
> Buy the iText book: http://www.1t3xt.com/docs/book.php
> Check the site with examples before you ask questions: 
http://www.1t3xt.info/examples/
You can also search the keywords list: http://1t3xt.info/tutorials/keywords/


