Re: [CODE4LIB] "best" OCR package?
On February 2, Walter Lewis wrote:
> The "good" news from the perspective of searching is that a
> reasonable percentage of those errors will affect terms that are
> either rarely used in searching or are repeated correctly in the
> vicinity.

This is why OCR should be done by a search engine company (such as Google), which has statistics on what real people really search for and can improve the OCR process as it goes. Software development companies such as ABBYY or Omnipage never get that kind of feedback from actual users. They only represent a fraction of the entire feedback loop. None of my experience of scanning old Swedish and Danish books with ABBYY FineReader ever got back to ABBYY; they never asked for any of that feedback.

I have no idea to what degree Google Book Search does this right, but by controlling the entire scan-search loop they have one excuse less to fail.

--
Lars Aronsson (l...@aronsson.se)
Aronsson Datateknik - http://aronsson.se
Re: [CODE4LIB] "best" OCR package? [SEC=UNCLASSIFIED]
Emmanuel,

I have used Microsoft Office Document Imaging, which works really well with TIFF files. Most scanners, if not all, will scan to TIFF, which you can then easily convert into text, RTF or Word files. The other one I have used is Pro Millennium, which is compatible with MS Word, Excel, etc. I would highly recommend both of them.

Renata

-----Original Message-----
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Emmanuel Di Pretoro
Sent: Tuesday, 3 February 2009 7:54 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] "best" OCR package?

Hi,

This isn't a recommendation, since I have never tried it, but I've heard a lot of good things about Tesseract. It is currently developed by Google, but I don't know whether they use it.

Some links:
- http://code.google.com/p/tesseract-ocr/
- http://en.wikipedia.org/wiki/Tesseract_%28software%29

Hope this helps,

Emmanuel Di Pretoro

2009/2/3 Alberto Accomazzi
> Sorry if this is a bit off-topic, but I was wondering if any of you
> clever fellows have a recommendation for an OCR package, possibly with
> a native linux port. I know about OCRopus but I have a feeling that
> commercial products still have a significant edge over public domain
> packages. So what are you using and/or do you know what the big guys
> (google, IA, microsoft) are using?
>
> Thanks,
> -- Alberto
>
> --
> Dr. Alberto Accomazzi aaccomazzi(at)cfa harvard edu
> Project Manager
> NASA Astrophysics Data System ads.harvard.edu
> Harvard-Smithsonian Center for Astrophysics www.cfa.harvard.edu
> 60 Garden St, MS 67, Cambridge, MA 02138, USA
Re: [CODE4LIB] "best" OCR package?
Gabriel Farrell wrote:
> On Tue, Feb 03, 2009 at 10:09:54AM -0500, Walter Lewis wrote:
>> If we had to correct it all: a) it would never get done and b) it would
>> be better than some of the originals which are rife with typographic
>> errors.
>
> Hence the genius of Distributed Proofreaders [1] and reCAPTCHA [2].
>
> [1] http://www.pgdp.net/c/
> [2] http://recaptcha.net/learnmore.html

I have tremendous respect for the genius behind these projects, but the Victorian four-page village newspapers have enough text for your average government report. Put four together and you get a three-decker novel. The folks in the Distributed Proofreaders rarely sign up for the labours of Hercules (and, according to my sources, he only hung in there for twelve tasks).

Then you have to deal with the fact that OCRing some of the microfilm I've seen is probably not statistically different from invoking a random token generator ...

Walter
Re: [CODE4LIB] "best" OCR package?
On Tue, Feb 03, 2009 at 10:09:54AM -0500, Walter Lewis wrote:
> If we had to correct it all: a) it would never get done and b) it would
> be better than some of the originals which are rife with typographic
> errors.

Hence the genius of Distributed Proofreaders [1] and reCAPTCHA [2].

[1] http://www.pgdp.net/c/
[2] http://recaptcha.net/learnmore.html
Re: [CODE4LIB] "best" OCR package?
Karen Coyle wrote:
> I know that 98% is impressive, but I always like to remember that with
> an average of 2000 characters per page that means 40 potential errors
> per book page. Just to give us some perspective on the level of cleanup
> that will be needed for books being digitized today.

The "good" news from the perspective of searching is that a reasonable percentage of those errors will affect terms that are either rarely used in searching or are repeated correctly in the vicinity.

The bad news: phrase search is compromised. Screen readers for the visually impaired are compromised. Relevance that depends on term clustering is compromised.

If we had to correct it all: a) it would never get done and b) it would be better than some of the originals which are rife with typographic errors.

Walter
who still regrets the Swedish Chef OCR of most microfilm newspaper projects
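To put a rough number on "phrase search is compromised", here is a minimal sketch in Python. It rests on assumptions that are not in the thread: character errors are treated as independent, and the average word length of 5 characters is a hypothetical figure chosen only for illustration. Real OCR errors tend to cluster, so treat the result as a back-of-envelope estimate, not a measurement.

```python
# Back-of-envelope estimate: chance that a word or phrase survives OCR intact.
# Assumptions (not from the thread): errors are independent per character,
# and words average 5 characters (hypothetical value).

def prob_intact(char_accuracy: float, n_chars: int) -> float:
    """Probability that n_chars consecutive characters are all recognized correctly."""
    return char_accuracy ** n_chars

def prob_phrase_intact(char_accuracy: float, n_words: int, avg_word_len: int = 5) -> float:
    """Probability that every character in an n_words phrase is correct."""
    return prob_intact(char_accuracy, n_words * avg_word_len)

p_word = prob_phrase_intact(0.98, 1)
p_phrase = prob_phrase_intact(0.98, 3)
print(f"single word intact:  {p_word:.3f}")
print(f"3-word phrase intact: {p_phrase:.3f}")
```

Under these assumptions roughly one word in ten carries an error, and an exact three-word phrase match fails about a quarter of the time, which is consistent with single-term search holding up better than phrase search.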
Re: [CODE4LIB] "best" OCR package?
Randy Stern wrote:
> Abbyy Finereader and Nuance Omnipage are the two leading commercial OCR
> products. Both can achieve 98% + character accuracy on most book-like
> material scanned at 300 dpi.

I know that 98% is impressive, but I always like to remember that with an average of 2000 characters per page that means 40 potential errors per book page. Just to give us some perspective on the level of cleanup that will be needed for books being digitized today.

kc

--
Karen Coyle / Digital Library Consultant
kco...@kcoyle.net http://www.kcoyle.net
ph.: 510-540-7596 skype: kcoylenet
fx.: 510-848-3913
mo.: 510-435-8234
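The arithmetic above can be sketched in a few lines of Python. The 2000 characters per page and 98% accuracy come from the message; the function name is made up for illustration, and the 300-page book is a hypothetical example.

```python
# Expected OCR character errors at a given character accuracy.
# 2000 chars/page and 98% accuracy are from the message above;
# the 300-page book is a hypothetical example.

def expected_errors(chars_per_page: int, char_accuracy: float, pages: int = 1) -> float:
    """Expected number of misrecognized characters across `pages` pages."""
    return chars_per_page * (1.0 - char_accuracy) * pages

per_page = expected_errors(2000, 0.98)
per_book = expected_errors(2000, 0.98, pages=300)
print(round(per_page))   # errors on one page
print(round(per_book))   # errors in a hypothetical 300-page book
```

Even a seemingly small gain in accuracy matters here: at 99% the same page would carry about half as many expected errors.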
Re: [CODE4LIB] "best" OCR package?
Randy Stern wrote:
> Abbyy Finereader and Nuance Omnipage are the two leading commercial OCR
> products. Both can achieve 98% + character accuracy on most book-like
> material scanned at 300 dpi.
>
> At 07:37 AM 2/3/2009 -0500, Nicole Engard wrote:
>> I'm with Christian - I loved Abbyy FineReader when I used it at both my
>> previous libraries. It's very accurate and it's affordable if you're
>> not using it for mass digitization :) but we never got the server
>> contract because like Christian said - it is quite expensive.

Abbyy's engine is actually quite affordable for mass digitization efforts as well. Indeed, if you look closely at the outputs from the Internet Archive you'll see they use it extensively.

The desktop model requires bodies to handle the inputs and outputs; the server version can be built into a workflow. Once you get past the time to set it up, the cost per page is *very* low (from memory ~1 to 2 cents per page).

Walter Lewis
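For anyone budgeting on that basis, a minimal Python sketch of the per-page cost range follows. The 1 to 2 cents figure is the from-memory estimate in the message above, not a quoted price; the 100,000-page collection is a hypothetical example, and setup and staff time are not included.

```python
# Rough OCR licensing cost range for a collection, in dollars.
# The 1-2 cents/page range is the from-memory estimate in the message above;
# the page count used below is a hypothetical example.

def cost_range_dollars(pages: int, low_cents: int = 1, high_cents: int = 2) -> tuple[float, float]:
    """Return (low, high) total cost in dollars for OCRing `pages` pages."""
    return (pages * low_cents / 100, pages * high_cents / 100)

low, high = cost_range_dollars(100_000)
print(f"100,000 pages: ${low:,.0f} to ${high:,.0f}")
```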
Re: [CODE4LIB] "best" OCR package?
Abbyy Finereader and Nuance Omnipage are the two leading commercial OCR products. Both can achieve 98% + character accuracy on most book-like material scanned at 300 dpi.

- Randy Stern (who formerly worked in the OCR industry)

At 07:37 AM 2/3/2009 -0500, Nicole Engard wrote:
I'm with Christian - I loved Abbyy FineReader when I used it at both my previous libraries. It's very accurate and it's affordable if you're not using it for mass digitization :) but we never got the server contract because like Christian said - it is quite expensive.

---
Nicole C. Engard
Open Source Evangelist, LibLime
(888) Koha ILS (564-2457) ext. 714
n...@liblime.com
AIM/Y!/Skype: nengard
http://liblime.com
http://blogs.liblime.com/open-sesame/

On Tue, Feb 3, 2009 at 6:23 AM, MJ Ray wrote:
> Alberto Accomazzi wrote:
>> [...] I know about OCRopus but I have a feeling that
>> commercial products still have a significant edge over public domain
>> packages. [...]
>
> OCRopus is released under the Apache License 2.0, which allows
> commercial development. It is not a public domain package.
> Feel free to use it as a commercial product without fear.
>
> Hope that helps,
> --
> MJ Ray (slef)
> Webmaster for hire, statistician and online shop builder for a small
> worker cooperative http://www.ttllp.co.uk/ http://mjr.towers.org.uk/
> (Notice http://mjr.towers.org.uk/email.html) tel:+44-844-4437-237
Re: [CODE4LIB] "best" OCR package?
I'm with Christian - I loved Abbyy FineReader when I used it at both my previous libraries. It's very accurate and it's affordable if you're not using it for mass digitization :) but we never got the server contract because like Christian said - it is quite expensive.

---
Nicole C. Engard
Open Source Evangelist, LibLime
(888) Koha ILS (564-2457) ext. 714
n...@liblime.com
AIM/Y!/Skype: nengard
http://liblime.com
http://blogs.liblime.com/open-sesame/

On Tue, Feb 3, 2009 at 6:23 AM, MJ Ray wrote:
> Alberto Accomazzi wrote:
>> [...] I know about OCRopus but I have a feeling that
>> commercial products still have a significant edge over public domain
>> packages. [...]
>
> OCRopus is released under the Apache License 2.0, which allows
> commercial development. It is not a public domain package.
> Feel free to use it as a commercial product without fear.
>
> Hope that helps,
> --
> MJ Ray (slef)
> Webmaster for hire, statistician and online shop builder for a small
> worker cooperative http://www.ttllp.co.uk/ http://mjr.towers.org.uk/
> (Notice http://mjr.towers.org.uk/email.html) tel:+44-844-4437-237
Re: [CODE4LIB] "best" OCR package?
Alberto Accomazzi wrote:
> [...] I know about OCRopus but I have a feeling that
> commercial products still have a significant edge over public domain
> packages. [...]

OCRopus is released under the Apache License 2.0, which allows commercial development. It is not a public domain package. Feel free to use it as a commercial product without fear.

Hope that helps,
--
MJ Ray (slef)
Webmaster for hire, statistician and online shop builder for a small
worker cooperative http://www.ttllp.co.uk/ http://mjr.towers.org.uk/
(Notice http://mjr.towers.org.uk/email.html) tel:+44-844-4437-237
Re: [CODE4LIB] "best" OCR package?
Hello,

2009/2/3 Alberto Accomazzi
> Sorry if this is a bit off-topic, but I was wondering if any of you clever
> fellows have a recommendation for an OCR package, possibly with a native
> linux port. I know about OCRopus but I have a feeling that commercial
> products still have a significant edge over public domain packages. So what
> are you using and/or do you know what the big guys (google, IA, microsoft)
> are using?

We are using the Abbyy Finereader Engine [1], which also has a Linux port available. But it's quite expensive for mass digitization efforts, since it's licensed on a per-page basis.

Best,
Christian

[1] http://www.abbyy.com/sdk/
Re: [CODE4LIB] "best" OCR package?
Hi,

This isn't a recommendation, since I have never tried it, but I've heard a lot of good things about Tesseract. It is currently developed by Google, but I don't know whether they use it.

Some links:
- http://code.google.com/p/tesseract-ocr/
- http://en.wikipedia.org/wiki/Tesseract_%28software%29

Hope this helps,

Emmanuel Di Pretoro

2009/2/3 Alberto Accomazzi
> Sorry if this is a bit off-topic, but I was wondering if any of you clever
> fellows have a recommendation for an OCR package, possibly with a native
> linux port. I know about OCRopus but I have a feeling that commercial
> products still have a significant edge over public domain packages. So what
> are you using and/or do you know what the big guys (google, IA, microsoft)
> are using?
>
> Thanks,
> -- Alberto
>
> --
> Dr. Alberto Accomazzi aaccomazzi(at)cfa harvard edu
> Project Manager
> NASA Astrophysics Data System ads.harvard.edu
> Harvard-Smithsonian Center for Astrophysics www.cfa.harvard.edu
> 60 Garden St, MS 67, Cambridge, MA 02138, USA