Re: [CODE4LIB] best OCR package? [SEC=UNCLASSIFIED]

2009-02-04 Thread Dyer, Renata
Emmanuel,
I have used Microsoft Office Document Imaging, which works really well with
TIFF files. Most scanners, if not all, will scan to TIFF, which you can then
convert to text, RTF or Word files easily.
The other one I used was Pro Millennium, which is compatible with MS Word,
Excel, etc.
I would highly recommend both of them.

Renata

-----Original Message-----
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of 
Emmanuel Di Pretoro
Sent: Tuesday, 3 February 2009 7:54 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] best OCR package?

Hi,

This isn't a recommendation, since I have never tried it, but I've heard a lot
of good things about Tesseract. It is currently developed by Google, but I
don't know if they use it.

Some links:
 - http://code.google.com/p/tesseract-ocr/
 - http://en.wikipedia.org/wiki/Tesseract_%28software%29
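
For anyone who wants to try it, here is a minimal sketch of driving the
Tesseract command-line tool from Python. It assumes the tesseract binary is
installed and on the PATH, and that it is called in the usual form
"tesseract IMAGE OUTPUTBASE -l LANG", which writes OUTPUTBASE.txt. The helper
names and the file name page.tif are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def tesseract_command(image: str, output_base: str, lang: str = "eng") -> List[str]:
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", image, output_base, "-l", lang]


def ocr_to_text(image: str, output_base: str, lang: str = "eng") -> str:
    """Run Tesseract on one image and return the recognized text.

    Assumes the tesseract binary is on PATH; raises CalledProcessError
    if the tool exits with an error.
    """
    subprocess.run(tesseract_command(image, output_base, lang), check=True)
    # Tesseract writes its plain-text result to OUTPUTBASE.txt.
    return Path(output_base + ".txt").read_text(encoding="utf-8")


# Usage (hypothetical file name):
#   text = ocr_to_text("page.tif", "page")
```

Building the argument list separately keeps the command easy to inspect
before anything is actually run.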

Hope this helps,

Emmanuel Di Pretoro

2009/2/3 Alberto Accomazzi aaccoma...@cfa.harvard.edu

 Sorry if this is a bit off-topic, but I was wondering if any of you
 clever fellows have a recommendation for an OCR package, possibly with
 a native linux port.  I know about OCRopus but I have a feeling that
 commercial products still have a significant edge over public domain
 packages.  So what are you using and/or do you know what the big guys
 (google, IA, microsoft) are using?

 Thanks,
 -- Alberto


 --
 Dr. Alberto Accomazzi  aaccomazzi(at)cfa harvard edu
 Project Manager
 NASA Astrophysics Data System                ads.harvard.edu
 Harvard-Smithsonian Center for Astrophysics  www.cfa.harvard.edu
 60 Garden St, MS 67, Cambridge, MA 02138, USA




Re: [CODE4LIB] best OCR package?

2009-02-03 Thread Randy Stern
Abbyy Finereader and Nuance Omnipage are the two leading commercial OCR 
products. Both can achieve 98% + character accuracy on most book-like 
material scanned at 300 dpi.


- Randy Stern (who formerly worked in the OCR industry)

At 07:37 AM 2/3/2009 -0500, Nicole Engard wrote:
 I'm with Christian - I loved Abbyy FineReader when I used it at both
 my previous libraries.  It's very accurate and it's affordable if
 you're not using it for mass digitization :) but we never got the
 server contract because like Christian said - it is quite expensive.




Re: [CODE4LIB] best OCR package?

2009-02-03 Thread MJ Ray
Alberto Accomazzi aaccoma...@cfa.harvard.edu wrote:
 [...] I know about OCRopus but I have a feeling that 
 commercial products still have a significant edge over public domain 
 packages. [...]

OCRopus is released under the Apache License 2.0, which allows
commercial development.  It is not a public domain package.
Feel free to use it as a commercial product without fear.

Hope that helps,
-- 
MJ Ray (slef)
Webmaster for hire, statistician and online shop builder for a small
worker cooperative http://www.ttllp.co.uk/ http://mjr.towers.org.uk/
(Notice http://mjr.towers.org.uk/email.html) tel:+44-844-4437-237


Re: [CODE4LIB] best OCR package?

2009-02-03 Thread Walter Lewis

Randy Stern wrote:
 Abbyy Finereader and Nuance Omnipage are the two leading commercial
 OCR products. Both can achieve 98% + character accuracy on most
 book-like material scanned at 300 dpi.

At 07:37 AM 2/3/2009 -0500, Nicole Engard wrote:
 I'm with Christian - I loved Abbyy FineReader when I used it at both
 my previous libraries.  It's very accurate and it's affordable if
 you're not using it for mass digitization :) but we never got the
 server contract because like Christian said - it is quite expensive.

Abbyy's engine is actually quite affordable for mass digitization
efforts as well.  Indeed, if you look closely at the outputs from the
Internet Archive you'll see they use it extensively.  The desktop model
requires bodies to handle the inputs and outputs; the server version can
be built into a workflow.  Once you get past the time to set it up, the
cost per page is *very* low (from memory, ~1 to 2 cents per page).
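
To put that per-page figure in project terms, here is a rough sketch of the
arithmetic. The 1 to 2 cents range is the from-memory estimate above; the
collection size of one million pages is purely hypothetical, and setup and
staff time are not included.

```python
def ocr_cost_dollars(pages: int, cents_per_page: int) -> float:
    """Total OCR cost in dollars at a flat per-page rate given in cents."""
    return pages * cents_per_page / 100


pages = 1_000_000  # hypothetical collection size, not a real project
low = ocr_cost_dollars(pages, 1)   # at 1 cent per page
high = ocr_cost_dollars(pages, 2)  # at 2 cents per page
print(f"{pages:,} pages: ${low:,.0f} to ${high:,.0f}")
# prints "1,000,000 pages: $10,000 to $20,000"
```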


Walter Lewis


Re: [CODE4LIB] best OCR package?

2009-02-03 Thread Karen Coyle

Randy Stern wrote:
 Abbyy Finereader and Nuance Omnipage are the two leading commercial
 OCR products. Both can achieve 98% + character accuracy on most
 book-like material scanned at 300 dpi.


I know that 98% is impressive, but I always like to remember that with 
an average of 2000 characters per page that means 40 potential errors 
per book page. Just to give us some perspective on the level of cleanup 
that will be needed for books being digitized today.
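
The arithmetic behind that figure, as a small sketch. The 2000 characters per
page and the 98% accuracy are the numbers from this message; the other
accuracy levels in the loop are hypothetical comparison points, not
measurements of any product.

```python
def expected_errors(chars_per_page: int, accuracy_percent: float) -> float:
    """Expected number of misrecognized characters on one page."""
    return chars_per_page * (100 - accuracy_percent) / 100


# Figures quoted above: 2000 characters per page at 98% character accuracy.
print(expected_errors(2000, 98))  # 40.0

# Hypothetical higher accuracy levels, for comparison only.
for pct in (99, 99.5, 99.9):
    print(f"{pct}% -> {expected_errors(2000, pct):.1f} errors per page")
```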


kc

--
---
Karen Coyle / Digital Library Consultant
kco...@kcoyle.net http://www.kcoyle.net
ph.: 510-540-7596   skype: kcoylenet
fx.: 510-848-3913
mo.: 510-435-8234



Re: [CODE4LIB] best OCR package?

2009-02-03 Thread Gabriel Farrell
On Tue, Feb 03, 2009 at 10:09:54AM -0500, Walter Lewis wrote:
 If we had to correct it all: a) it would never get done and b) it would  
 be better than some of the originals which are rife with typographic 
 errors.

Hence the genius of Distributed Proofreaders [1] and reCAPTCHA [2].

[1] http://www.pgdp.net/c/
[2] http://recaptcha.net/learnmore.html


Re: [CODE4LIB] best OCR package?

2009-02-03 Thread Nicole Engard
I'm with Christian - I loved Abbyy FineReader when I used it at both
my previous libraries.  It's very accurate and it's affordable if
you're not using it for mass digitization :) but we never got the
server contract because like Christian said - it is quite expensive.

---

Nicole C. Engard
Open Source Evangelist, LibLime
(888) Koha ILS (564-2457) ext. 714
n...@liblime.com
AIM/Y!/Skype: nengard

http://liblime.com
http://blogs.liblime.com/open-sesame/






Re: [CODE4LIB] best OCR package?

2009-02-03 Thread Walter Lewis

Gabriel Farrell wrote:
 On Tue, Feb 03, 2009 at 10:09:54AM -0500, Walter Lewis wrote:
  If we had to correct it all: a) it would never get done and b) it would
  be better than some of the originals which are rife with typographic
  errors.

 Hence the genius of Distributed Proofreaders [1] and reCAPTCHA [2].

 [1] http://www.pgdp.net/c/
 [2] http://recaptcha.net/learnmore.html

I have tremendous respect for the genius behind these projects, but the
Victorian four-page village newspapers have enough text for your
average government report.  Put four together and you get a three-decker
novel. The folks at Distributed Proofreaders rarely sign up for the
labours of Hercules (and, according to my sources, he only hung in there
for twelve tasks).


Then you have to deal with the fact that OCRing some of the microfilm 
I've seen is probably not statistically different from invoking a random 
token generator ...


Walter


Re: [CODE4LIB] best OCR package?

2009-02-03 Thread Walter Lewis

Karen Coyle wrote:
 I know that 98% is impressive, but I always like to remember that with
 an average of 2000 characters per page that means 40 potential errors
 per book page. Just to give us some perspective on the level of
 cleanup that will be needed for books being digitized today.

The good news from the perspective of searching is that a reasonable 
percentage of those errors will affect terms that are either rarely used 
in searching or are repeated correctly in the vicinity. 

The bad news:  phrase search is compromised. Screen readers for the
visually impaired are compromised. Relevance that depends on term
clustering is compromised.
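
A rough sketch of why longer phrases suffer more than single terms. It makes
the simplifying assumption that each character is misrecognized independently
at the 98% accuracy quoted earlier in the thread; real OCR errors tend to
cluster (bad scans, odd fonts), so treat the numbers as illustrative only. The
function names and sample phrases are hypothetical.

```python
def p_all_correct(n_chars: int, char_accuracy: float = 0.98) -> float:
    """Probability that n_chars characters are all recognized correctly,
    assuming independent per-character errors."""
    return char_accuracy ** n_chars


def p_phrase_intact(phrase: str, char_accuracy: float = 0.98) -> float:
    """Probability that every non-space character of a phrase survives OCR."""
    n_chars = sum(1 for ch in phrase if not ch.isspace())
    return p_all_correct(n_chars, char_accuracy)


# Sample search strings of increasing length.
for phrase in ("library", "digital library", "village newspaper archive"):
    print(f"{phrase!r}: {p_phrase_intact(phrase):.2f}")
```

Under this model the chance of an exact match drops with every extra
character, which is why a phrase is hit harder than a single word.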


If we had to correct it all: a) it would never get done and b) it would 
be better than some of the originals which are rife with typographic errors.


Walter
 so still regrets the Swedish Chef OCR of most microfilm newspaper projects