Can anyone back that up? IMHO Tesseract is the state of the art in OCR, but I am not sure that "Ocropus builds on Tesseract". Can you confirm that Vikram has a point?
Shashi

----- Original Message ----
From: Vikram Kumar <vikrambku...@gmail.com>
To: solr-user@lucene.apache.org; Shashi Kant <sk...@sloan.mit.edu>
Sent: Thursday, February 26, 2009 9:21:07 PM
Subject: Re: Use of scanned documents for text extraction and indexing

Tesseract is pure OCR. Ocropus builds on Tesseract.

Vikram

On Thu, Feb 26, 2009 at 12:11 PM, Shashi Kant <shashi_k...@yahoo.com> wrote:

> Another project worth investigating is Tesseract.
>
> http://code.google.com/p/tesseract-ocr/
>
> ----- Original Message ----
> From: Hannes Carl Meyer <m...@hcmeyer.com>
> To: solr-user@lucene.apache.org
> Sent: Thursday, February 26, 2009 11:35:14 AM
> Subject: Re: Use of scanned documents for text extraction and indexing
>
> Hi Sithu,
>
> there is a project called ocropus done by the DFKI, check the online demo
> here: http://demo.iupr.org/cgi-bin/main.cgi
>
> And also http://sites.google.com/site/ocropus/
>
> Regards
>
> Hannes
>
> m...@hcmeyer.com
> http://mimblog.de
>
> On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. <
> sithu.sudar...@fda.hhs.gov> wrote:
>
> > Hi All:
> >
> > Is there any study / research done on using scanned paper documents as
> > images (may be PDF), and then use some OCR or other technique for
> > extracting text, and the resultant index quality?
> >
> > Thanks in advance,
> > Sithu D Sudarsan
> >
> > sithu.sudar...@fda.hhs.gov
> > sdsudar...@ualr.edu