Hi everyone, I am about to embark on an exciting adventure into the land of optical character recognition, processing nearly 1,000 documents and extracting numbers from them. I am interested in any anecdotal wisdom regarding:
1. efficient scanning parameters: DPI, color / BW / grayscale
2. pre-processing steps one might do with imagemagick
3. any filtering that one might do to get ready for the OCR

I plan to use Google's new OCR project, ocropus, which currently uses the 'tesseract' engine. Naive attempts to OCR these documents are resulting in marginal accuracy, so any help is appreciated.

Vertical and horizontal lines on the original documents are confusing the OCR, so removing them might be a start. I have thought about extracting each 'cell' of data with imagemagick, and then running the resulting mini-images through the OCR... that might be a last resort though...

thanks!

--
Dylan Beaudette
Soils and Biogeochemistry Graduate Group
University of California at Davis
530.754.7341

_______________________________________________
vox-tech mailing list
vox-tech@lists.lugod.org
http://lists.lugod.org/mailman/listinfo/vox-tech
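
P.S. To make the line-removal idea a bit more concrete, here is a rough sketch of what I have in mind, in plain Python rather than imagemagick. It assumes the scan has already been thresholded to a binary grid (1 = ink, 0 = paper) and deskewed, and it simply erases any horizontal or vertical run of ink that is at least `min_run` pixels long. The function name, the `min_run` parameter, and the list-of-lists image format are all made up for illustration; this is not anything ocropus or tesseract provides, and I have not tried it on real scans.

```python
def remove_long_runs(image, min_run):
    """Return a copy of a binary image (1 = ink, 0 = paper) in which every
    horizontal or vertical run of ink at least min_run pixels long is erased.

    Hypothetical helper: assumes the page is already binarized and deskewed.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    erase = [[False] * width for _ in range(height)]

    # Mark long horizontal runs of ink, row by row.
    for y in range(height):
        x = 0
        while x < width:
            if image[y][x]:
                start = x
                while x < width and image[y][x]:
                    x += 1
                if x - start >= min_run:
                    for i in range(start, x):
                        erase[y][i] = True
            else:
                x += 1

    # Mark long vertical runs of ink, column by column.
    for x in range(width):
        y = 0
        while y < height:
            if image[y][x]:
                start = y
                while y < height and image[y][x]:
                    y += 1
                if y - start >= min_run:
                    for i in range(start, y):
                        erase[i][x] = True
            else:
                y += 1

    # Build the cleaned copy; short runs (the digits, hopefully) are kept.
    return [
        [0 if erase[y][x] else image[y][x] for x in range(width)]
        for y in range(height)
    ]
```

The threshold would need to be longer than the tallest or widest digit stroke and shorter than the shortest table rule, so it depends on the scanning DPI. Skewed scans would break the long runs into shorter pieces, which is one reason deskewing first seems necessary.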