This may be a dumb thought, but a couple of years ago I built a game that tracked results on a map (an HTML canvas with the map set as the background and objects drawn on top of it) by counting the pixels of a certain color and expressing them as a percentage of all the pixels in the map. You could do something similar here: count the pixels that are darker than a chosen threshold (black or dark gray) and compare that count against the total number of pixels. That would be a pretty rough-and-ready approach, but it might be worth a shot. If the missing sections have a significantly different color from the rest of the image, that could be another metric to use.
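Here is a rough Python sketch of that pixel-counting idea, offered as a starting point and not a tested solution. It assumes the missing pieces show up as dark scanner background, which may not hold for your scans. The function names and the threshold of 40 are placeholders of mine, and the file helper assumes the Pillow library is installed.

```python
def dark_fraction(gray_values, threshold=40):
    """Return the share of pixels at or below `threshold`.

    `gray_values` is any iterable of 8-bit grayscale values
    (0 = black, 255 = white). The default threshold is an
    arbitrary guess and would need tuning against real scans.
    """
    total = 0
    dark = 0
    for value in gray_values:
        total += 1
        if value <= threshold:
            dark += 1
    return dark / total if total else 0.0


def dark_fraction_of_file(path, threshold=40):
    """Hypothetical helper: apply dark_fraction to one image file."""
    # Pillow is imported here so dark_fraction works without it.
    from PIL import Image

    with Image.open(path) as img:
        # Mode "L" is 8-bit grayscale, so tobytes() yields one byte per pixel.
        gray = img.convert("L")
        return dark_fraction(gray.tobytes(), threshold)
```

Running `dark_fraction_of_file` over each page and sorting by the result should at least push the worst pages to the top for a manual look. Heavy ink, illustrations, and dark creases will also count as dark pixels, so the number is only a rough signal.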
Best regards,

*Jason Bengtson, MLIS, MA*
Innovation Architect
*Houston Academy of Medicine - The Texas Medical Center Library*
1133 John Freeman Blvd
Houston, TX 77030
http://library.tmc.edu/
www.jasonbengtson.com

On Tue, Dec 1, 2015 at 2:07 PM, Christine Mayo <ma...@bc.edu> wrote:

> Hi all,
>
> I have an interesting assessment issue with some recently digitized
> newspapers that I wondered if anyone could shed some light on. We sent a
> batch of 19th century newspapers off to a vendor knowing they weren't in
> great shape, and now we have to decide whether the resultant images (TIFFs)
> are usable or we should be looking for alternative copies and/or microfilm.
>
> A lot of the images are in decent shape, but the first few pages of each
> issue are heavily creased and generally missing a smallish piece from the
> center of the page where the folds met. I'm looking for a way to
> programmatically identify how much text is missing/unusable for each page.
> We haven't run OCR yet, part of this assessment is to figure out whether we
> should bother sending these items out for OCR and METS/ALTO creation, but I
> suspect we could run a quick and dirty in-house OCR if that would help.
>
> We can go through the images by hand and try to measure and/or count, but
> if anyone's worked on something like this or has thoughts, I'd love to hear
> them!
>
> Thanks,
> Christine
>
> --
> Christine Mayo
> Digital Production Librarian
> Thomas P. O'Neill, Jr. Library
> Boston College
> 140 Commonwealth Avenue
> Chestnut Hill, MA 02467
> christine.m...@bc.edu
>