On 25 Dec 2008, Hugo Vanwoerkom wrote:
[snip]
>> The OCR is tesseract-ocr. These steps:
>>
>> 1. apt-get install tesseract-ocr
>> 2. apt-get install tesseract-eng
>> 3. use xsane to scan a page at 300 dpi and save as .tif
>> 4. but that will be depth 16 which tesseract can't handle so reduce the
>> depth: convert foo.tif -depth 8 foo.x1.tif
>> 5. run tesseract: tesseract foo.x1.tif foo -l eng
>> 6. text will show up as foo.txt.
>>
>> Works faultlessly with me: I have problems with single quotes and
>> dashes but he recognizes all words perfectly.
>>
[snip]

I agree that tesseract works remarkably well. However, I omit the
'convert' step because for me it gives an error:

"convert: Caution: quantization tables are too coarse for baseline JPEG.`JPEGLib'."

In any case the step seems to be unnecessary here. For me, xsane gives
an image of depth 24 (not 16) and tesseract seems to be happy with this.

I also omit "-l eng" since I didn't include any other languages when I
installed tesseract.

As suggested in the documentation, I put

    export TESSDATA_PREFIX="/usr/share/tesseract-ocr/"

in .bashrc (note the final /).

To make things work now I just do "tesseract foo.tif foo". I'm
impressed.

I mentioned ocrad a few posts ago here; that works too, but it makes
more errors than tesseract.

Anthony

--
Anthony Campbell - a...@acampbell.org.uk
Microsoft-free zone - Using Debian GNU/Linux
http://www.acampbell.org.uk (blog, book reviews, and sceptical articles)
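
The shortened workflow from the reply above (set TESSDATA_PREFIX, then run tesseract directly on the scanned .tif) can be sketched as a small shell helper. This is only a minimal sketch under stated assumptions: the function name ocr_page is hypothetical and not part of tesseract, the data path is the one quoted in the post and may differ on other installations, and the image is assumed to have been scanned already (for example with xsane at 300 dpi).

```shell
#!/bin/sh
# Data directory as quoted in the post (note the trailing slash).
# Assumption: only used as a default if TESSDATA_PREFIX is not already set.
: "${TESSDATA_PREFIX:=/usr/share/tesseract-ocr/}"
export TESSDATA_PREFIX

# ocr_page is a hypothetical helper name, not something tesseract provides.
# Usage: ocr_page foo.tif   (the text ends up in foo.txt)
ocr_page() {
    img=$1
    if [ ! -f "$img" ]; then
        echo "ocr_page: no such file: $img" >&2
        return 1
    fi
    base=${img%.*}            # strip the extension: foo.tif -> foo
    tesseract "$img" "$base"  # tesseract adds .txt to the output base itself
}
```

After sourcing the file, "ocr_page foo.tif" is equivalent to the "tesseract foo.tif foo" command from the post. If other language packs are installed, "-l eng" could be appended to the tesseract call, as in the quoted step 5.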