Re: PDF: finding a blank image

2009-07-14 Thread DrLeif
On Jul 13, 6:22 pm, Scott David Daniels scott.dani...@acm.org wrote:
 DrLeif wrote:
  I have about 6000 PDF files which have been produced using a scanner
  with more being produced each day.  The PDF files contain old paper
  records which have been taking up space.   The scanner is set to
  detect when there is information on the backside of the page (duplex
  scan).  The problem, of course, is that it's not always reliable and we
  wind up with a number of PDF files containing blank pages.

  What I would like to do is have Python detect blank pages in a PDF
  file and remove them.  Any suggestions?

 I'd check into ReportLab's commercial product, it may well be easily
 capable of that.  If no success, you might contact PJ at Groklaw, she
 has dealt with a _lot_ of PDFs (and knows people who deal with PDFs
 in bulk).

 --Scott David Daniels
 scott.dani...@acm.org

Thanks everyone for the quick reply.

I had considered using ReportLab; however, I was uncertain about its
ability to detect a blank page.

Scott, I'll drop an email to ReportLab and PJ.

Thanks again,
DrLeif
-- 
http://mail.python.org/mailman/listinfo/python-list


PDF: finding a blank image

2009-07-13 Thread DrLeif
I have about 6000 PDF files which have been produced using a scanner
with more being produced each day.  The PDF files contain old paper
records which have been taking up space.   The scanner is set to
detect when there is information on the backside of the page (duplex
scan).  The problem, of course, is that it's not always reliable and we
wind up with a number of PDF files containing blank pages.

What I would like to do is have Python detect blank pages in a PDF
file and remove them.  Any suggestions?


Thanks,
DrL


Re: PDF: finding a blank image

2009-07-13 Thread David Bolen
DrLeif l.lensg...@gmail.com writes:

 What I would like to do is have Python detect blank pages in a PDF
 file and remove them.  Any suggestions?

The odds are good that even a blank page is being rendered within
the PDF as having some small bits of data due to scanner resolution,
imperfections on the page, etc.  So I suspect you won't be able to just
look for a well-defined pattern in the resulting PDF or anything.

Unless you're using OCR, the odds are good that the scanner is
rendering the PDF as an embedded image.  What I'd probably do is
extract the image of the page, and then use image processing on it to
try to identify blank pages.  I haven't had the need to do this
myself, and tool availability would depend on platform, but for
example, I'd probably try ImageMagick's convert operation to turn the
PDF into images (like PNGs).  I think Gimp can also do a similar
conversion, but you'd probably have to script it yourself.
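
As a minimal sketch of that conversion step, assuming ImageMagick (with
Ghostscript for PDF input) is installed and its convert command is on
the PATH, something like the following could be driven from Python.
The function names, the 100 dpi density, and the output file pattern
are illustrative choices, not anything prescribed by the tools.

```python
import subprocess


def build_convert_command(pdf_path, out_pattern="page-%03d.png", dpi=100):
    """Build an ImageMagick command that renders each PDF page to an image.

    -density sets the rendering resolution and is given before the input
    file; the %03d in the output name yields one numbered file per page.
    """
    return ["convert", "-density", str(dpi), pdf_path, out_pattern]


def pdf_to_images(pdf_path, out_pattern="page-%03d.png", dpi=100):
    """Run the conversion; raises CalledProcessError if convert fails."""
    subprocess.run(build_convert_command(pdf_path, out_pattern, dpi), check=True)
```

A low density keeps the images small, which is probably enough for
judging blankness.  On ImageMagick 7 installations the command may be
named magick rather than convert.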

Once you have an image of a page, you could then use something like
OpenCV to process the page (perhaps a morphology operation to remove
small noise areas, then a threshold or non-zero counter to judge
blankness), or probably just something like PIL depending on
complexity of the processing needed.
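
A rough sketch of the simpler, threshold-only variant of that idea is
below.  It skips the morphology step and instead tolerates a small
fraction of dark pixels as noise.  The function names and the two
cutoff values (128 and 0.5%) are assumptions for illustration and would
need tuning against real scans; page_is_blank assumes the Pillow fork
of PIL is installed.

```python
def dark_pixel_fraction(pixels, threshold=128):
    """Return the fraction of grayscale values (0-255) below threshold."""
    total = 0
    dark = 0
    for value in pixels:
        total += 1
        if value < threshold:
            dark += 1
    return dark / total if total else 0.0


def is_blank(pixels, threshold=128, max_dark_fraction=0.005):
    """Treat a page as blank if only a tiny share of its pixels is dark."""
    return dark_pixel_fraction(pixels, threshold) <= max_dark_fraction


def page_is_blank(image_path, threshold=128, max_dark_fraction=0.005):
    """Load a page image with Pillow and apply the same test."""
    from PIL import Image  # imported here so the helpers above work without it

    with Image.open(image_path) as img:
        gray = img.convert("L")  # 8-bit grayscale, one byte per pixel
        return is_blank(gray.tobytes(), threshold, max_dark_fraction)
```

If specks and scanner streaks push blank pages over the tolerance, that
is the point at which a noise-removal pass (for example with OpenCV)
would be worth adding before the count.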

Once you identify a blank page, removing it could either be with pure
Python (there have been other posts recently about PDF libraries) or
with external tools (such as pdftk under Linux for example).
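
For the pdftk route, one possible sketch is to compute the page ranges
to keep and hand them to pdftk's cat operation.  The helper names are
hypothetical, pages are numbered from 1 as pdftk expects, and the page
count and the set of blank pages are assumed to come from an earlier
detection step.

```python
import subprocess


def keep_ranges(page_count, blank_pages):
    """Return pdftk-style ranges ("1", "3-4", ...) covering non-blank pages."""
    ranges = []
    start = None
    for page in range(1, page_count + 1):
        if page in blank_pages:
            if start is not None:
                ranges.append((start, page - 1))
                start = None
        elif start is None:
            start = page
    if start is not None:
        ranges.append((start, page_count))
    return [str(a) if a == b else f"{a}-{b}" for a, b in ranges]


def build_pdftk_command(src, dst, page_count, blank_pages):
    """Build a pdftk command that copies every non-blank page to dst."""
    ranges = keep_ranges(page_count, blank_pages)
    if not ranges:
        raise ValueError("every page is blank; nothing to keep")
    return ["pdftk", src, "cat", *ranges, "output", dst]


def remove_blank_pages(src, dst, page_count, blank_pages):
    """Run pdftk; raises CalledProcessError if it fails."""
    subprocess.run(build_pdftk_command(src, dst, page_count, blank_pages), check=True)
```

Writing to a new file rather than overwriting the scan leaves the
original available in case a page was misjudged as blank.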

-- David


Re: PDF: finding a blank image

2009-07-13 Thread Scott David Daniels

DrLeif wrote:

I have about 6000 PDF files which have been produced using a scanner
with more being produced each day.  The PDF files contain old paper
records which have been taking up space.   The scanner is set to
detect when there is information on the backside of the page (duplex
scan).  The problem, of course, is that it's not always reliable and we
wind up with a number of PDF files containing blank pages.

What I would like to do is have Python detect blank pages in a PDF
file and remove them.  Any suggestions?


I'd check into ReportLab's commercial product, it may well be easily
capable of that.  If no success, you might contact PJ at Groklaw, she
has dealt with a _lot_ of PDFs (and knows people who deal with PDFs
in bulk).

--Scott David Daniels
scott.dani...@acm.org