Re: PDF: finding a blank image
On Jul 13, 6:22 pm, Scott David Daniels scott.dani...@acm.org wrote:

DrLeif wrote:

I have about 6000 PDF files which have been produced using a scanner, with more being produced each day. The PDF files contain old paper records which have been taking up space. The scanner is set to detect when there is information on the back side of the page (duplex scan). The problem, of course, is that it's not always reliable, and we wind up with a number of PDF files containing blank pages. What I would like to do is have Python detect blank pages in a PDF file and remove them. Any suggestions?

I'd check into ReportLab's commercial product; it may well be easily capable of that. If no success, you might contact PJ at Groklaw; she has dealt with a _lot_ of PDFs (and knows people who deal with PDFs in bulk).

--Scott David Daniels scott.dani...@acm.org

Thanks, everyone, for the quick replies. I had considered using ReportLab but was uncertain about its ability to detect a blank page. Scott, I'll drop an email to ReportLab and PJ.

Thanks again, DrLeif

-- http://mail.python.org/mailman/listinfo/python-list
PDF: finding a blank image
I have about 6000 PDF files which have been produced using a scanner, with more being produced each day. The PDF files contain old paper records which have been taking up space. The scanner is set to detect when there is information on the back side of the page (duplex scan). The problem, of course, is that it's not always reliable, and we wind up with a number of PDF files containing blank pages. What I would like to do is have Python detect blank pages in a PDF file and remove them. Any suggestions?

Thanks, DrL
Re: PDF: finding a blank image
DrLeif l.lensg...@gmail.com writes:

What I would like to do is have python detect a blank pages in a PDF file and remove it. Any suggestions?

The odds are good that even a blank page is being rendered within the PDF as having some small bits of data due to scanner resolution, imperfections on the page, etc. So I suspect you won't be able to just look for a well-defined pattern in the resulting PDF.

Unless you're using OCR, the odds are good that the scanner is rendering the PDF as an embedded image. What I'd probably do is extract the image of the page, and then use image processing on it to try to identify blank pages.

I haven't had the need to do this myself, and tool availability would depend on platform, but for example, I'd probably try ImageMagick's convert operation to turn the PDF into images (like PNGs). I think Gimp can also do a similar conversion, but you'd probably have to script it yourself.

Once you have an image of a page, you could then use something like OpenCV to process the page (perhaps a morphology operation to remove small noise areas, then a threshold or non-zero counter to judge blankness), or probably just something like PIL, depending on the complexity of the processing needed.

Once you identify a blank page, removing it could be done either with pure Python (there have been other posts recently about PDF libraries) or with external tools (such as pdftk under Linux, for example).

-- David
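
To make the detection and removal steps above a little more concrete, here is a minimal sketch in plain Python. It is not from the original posts and makes several assumptions: each page has already been converted to a grayscale image and loaded as a list of rows of 0-255 values (for instance from PIL via Image.open(path).convert("L")), the names is_blank and pdftk_cat_args are made up for this illustration, and the thresholds are guesses that would need tuning against real scans. A simple "ignore dark pixels with no dark neighbour" rule stands in for the morphology operation mentioned above; it is a crude substitute, not OpenCV's implementation.

```python
def is_blank(pixels, dark_below=128, max_dark_fraction=0.001):
    """Judge whether a grayscale page is blank.

    pixels: list of rows, each a list of ints 0-255 (0 = black).
    dark_below and max_dark_fraction are assumed values to be tuned.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    if height == 0 or width == 0:
        return True  # an empty image is treated as blank

    def dark(y, x):
        # Out-of-range coordinates count as "not dark".
        return 0 <= y < height and 0 <= x < width and pixels[y][x] < dark_below

    significant = 0
    for y in range(height):
        for x in range(width):
            if not dark(y, x):
                continue
            # Skip isolated specks: count a dark pixel only if one of its
            # four direct neighbours is dark as well.
            if dark(y - 1, x) or dark(y + 1, x) or dark(y, x - 1) or dark(y, x + 1):
                significant += 1

    # Threshold on the fraction of non-speck dark pixels.
    return significant / (width * height) <= max_dark_fraction


def pdftk_cat_args(page_count, blank_pages):
    """Build 1-based page ranges for pdftk's cat operation that skip blank pages."""
    blank = set(blank_pages)
    ranges = []
    start = None
    # Run one step past the last page so the final range gets flushed.
    for page in range(1, page_count + 2):
        keep = page <= page_count and page not in blank
        if keep and start is None:
            start = page
        elif not keep and start is not None:
            end = page - 1
            ranges.append(str(start) if start == end else f"{start}-{end}")
            start = None
    return ranges
```

For a six-page file where pages 2 and 5 were judged blank, pdftk_cat_args(6, [2, 5]) gives the ranges 1, 3-4 and 6, which could be passed along as: pdftk in.pdf cat 1 3-4 6 output out.pdf (the file names here are placeholders).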
Re: PDF: finding a blank image
DrLeif wrote:

I have about 6000 PDF files which have been produced using a scanner, with more being produced each day. The PDF files contain old paper records which have been taking up space. The scanner is set to detect when there is information on the back side of the page (duplex scan). The problem, of course, is that it's not always reliable, and we wind up with a number of PDF files containing blank pages. What I would like to do is have Python detect blank pages in a PDF file and remove them. Any suggestions?

I'd check into ReportLab's commercial product; it may well be easily capable of that. If no success, you might contact PJ at Groklaw; she has dealt with a _lot_ of PDFs (and knows people who deal with PDFs in bulk).

--Scott David Daniels scott.dani...@acm.org