Re: Use of scanned documents for text extraction and indexing

Shashi Kant Thu, 26 Feb 2009 20:36:18 -0800

Can anyone back that up?

IMHO Tesseract is the state of the art in OCR, but I am not sure that "Ocropus 
builds on Tesseract".
Can you confirm that Vikram has a point?


Shashi




----- Original Message ----
From: Vikram Kumar <vikrambku...@gmail.com>
To: solr-user@lucene.apache.org; Shashi Kant <sk...@sloan.mit.edu>
Sent: Thursday, February 26, 2009 9:21:07 PM
Subject: Re: Use of scanned documents for text extraction and indexing

Tesseract is pure OCR. Ocropus builds on Tesseract.
Vikram

On Thu, Feb 26, 2009 at 12:11 PM, Shashi Kant <shashi_k...@yahoo.com> wrote:

> Another project worth investigating is Tesseract.
>
> http://code.google.com/p/tesseract-ocr/
>
> ----- Original Message ----
> From: Hannes Carl Meyer <m...@hcmeyer.com>
> To: solr-user@lucene.apache.org
> Sent: Thursday, February 26, 2009 11:35:14 AM
> Subject: Re: Use of scanned documents for text extraction and indexing
>
> Hi Sithu,
>
> there is a project called ocropus done by the DFKI, check the online demo
> here: http://demo.iupr.org/cgi-bin/main.cgi
>
> And also http://sites.google.com/site/ocropus/
>
> Regards
>
> Hannes
>
> m...@hcmeyer.com
> http://mimblog.de
>
> On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. <
> sithu.sudar...@fda.hhs.gov> wrote:
>
> >
> > Hi All:
> >
> > Is there any study / research done on using scanned paper documents as
> > images (maybe PDF), then using OCR or some other technique to
> > extract the text, and on the resulting index quality?
> >
> >
> > Thanks in advance,
> > Sithu D Sudarsan
> >
> > sithu.sudar...@fda.hhs.gov
> > sdsudar...@ualr.edu
> >
> >
> >
>
>
