Here's a quick list of commands for scanning a receipt at the recommended 300 DPI (adjust the options for your scanner), optionally cropping it with ImageMagick's convert (for some odd reason I failed to document these, doh! You can likely skip the convert commands), and then running Tesseract OCR to create a text file of whatever text it recognizes.
# First scan in the image. 300 DPI is recommended, and I think 450 DPI is
# the optimal DPI when attempting OCR.
# (scanimage writes PNM unless told otherwise, hence --format=tiff.)
$ scanimage --format=tiff > ./receipt.tif

# Or a more extravagant method:
$ scanimage --format=tiff --progress --custom-gamma=no --source Flatbed --resolution=300 --icc-profile=${HOME}/ICC/CanoScan9000F/CNSR0D.ICC > receipt.tif

# The commands below attempt to auto-crop the background from the receipt,
# but because the scanner's background is white, they fail to tell the
# background apart from white paper. They should work with a black
# background, after some adjustment. (e.g. use black paper from a hobby shop
# to provide a black background during scanning.)
# Note that the input file comes first and -fuzz must precede -trim to
# have any effect.
$ convert receipt.tif -fuzz 55% -trim +repage receipt-trim.tif
$ convert -verbose receipt.tif -border 10x10 -fuzz 75% -trim +repage receipt-trim.tif

# If I recall correctly, replacing "stdout" with an output base name (e.g.
# "receipt") creates receipt.txt in the current folder.
$ tesseract receipt.tif stdout

# This creates receipt.pdf with the OCR text included. The text inside the
# PDF is stored in compressed (binary) form and cannot be simply grepped!
$ tesseract receipt.tif receipt pdf

There are two possible end results:

1) A scanned image (e.g. receipt.tif) and a text file (e.g. receipt.txt) containing the possibly recognized text. If you archive data, this is probably your best method for preserving image detail and avoiding FUD and extravagant proprietary formats. Searching simple text files is extremely easy. Maintaining two separate files can be troublesome.

2) A scanned image (e.g. receipt.tif) imported into a PDF file containing the OCR text. As far as I know, only recent versions of Tesseract (3.03 and later, I believe) can write this PDF, and only when given the "pdf" config name; plain text remains the default output, and older versions output only a text file. Choose the PDF method if you like simplicity and care less about details. The downside: the image is significantly compressed further.
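To make the two choices concrete, here is a minimal sketch that wraps the Tesseract step in a shell function. The function name ocr_receipt and its "txt"/"pdf" modes are made up for illustration, and it assumes a Tesseract new enough to accept the "pdf" config name and an image file name that has an extension.

```shell
#!/bin/sh
# Sketch only: OCR one scanned receipt into either a sidecar text file
# (result 1) or a searchable PDF (result 2).
# "ocr_receipt" and the mode names are hypothetical, not part of Tesseract.
ocr_receipt() {
    image=$1
    mode=${2:-txt}
    base=${image%.*}    # strip the extension: receipt.tif -> receipt

    case $mode in
        txt) tesseract "$image" "$base" ;;      # writes receipt.txt
        pdf) tesseract "$image" "$base" pdf ;;  # writes receipt.pdf
        *)   echo "usage: ocr_receipt IMAGE [txt|pdf]" >&2
             return 2 ;;
    esac
}
```

Calling "ocr_receipt receipt.tif" should leave receipt.txt next to the image, while "ocr_receipt receipt.tif pdf" should produce receipt.pdf instead. Keeping the original TIFF in both cases preserves the full-resolution scan.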
I prefer the first solution, as it leaves me with a high-resolution TIFF/JPEG image, whereas creating the PDF file compresses the image drastically. On the flip side, the one PDF file includes both the image and the text rather than having to deal with two separate files (e.g. receipt.tif and receipt.txt).

The find incantation below searches PDF files containing OCR text or ordinary text. The file name is handed to sh as $1 instead of being pasted into the command string, so odd file names don't break the quoting.

# Search multiple PDF files for TEXT
$ find /tmp -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "TEXT"' sh {} \;

Last but not least, somebody actively maintains gscan2pdf (http://gscan2pdf.sourceforge.net/), a GUI front-end written in Perl that makes scanning to PDF simple and easy. I've installed and tried it, but I am extremely biased toward command line utilities over troublesome clicky front-ends.

--
Roger
http://rogerx.freeshell.org/
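P.S. A rough sketch of searching both kinds of OCR output in one go, for anyone keeping a mix of sidecar text files and PDFs. The function name search_receipts is made up; it assumes GNU grep (for -H and --label) and pdftotext from poppler-utils.

```shell
#!/bin/sh
# Sketch only: search OCR output under a directory for a fixed string.
# "search_receipts" is a hypothetical helper name.
search_receipts() {
    dir=$1
    text=$2

    # Sidecar text files can be grepped directly.
    find "$dir" -name '*.txt' -exec grep -H -F -- "$text" {} +

    # PDFs are converted to text on the fly. The file name and the search
    # string are passed to sh as $1 and $2 rather than pasted into the
    # command string.
    find "$dir" -name '*.pdf' -exec sh -c '
        pdftotext "$1" - | grep -H --label="$1" -F -- "$2"
    ' sh {} "$text" \;
}
```

For example, "search_receipts /tmp TOTAL" prints each matching line prefixed with the name of the file it came from.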
--
sane-devel mailing list: sane-devel@lists.alioth.debian.org