On 06-Feb-16 16:05, Paul Koning wrote:
>> On Feb 6, 2016, at 2:28 PM, Tom Morris <tfmor...@gmail.com> wrote:
>>
>> ...
>> I think Tesseract is pretty close to the quality of ABBYY.  Google has 
>> trained it on a very large corpus and it's used for Google Books, Google 
>> Drive OCR, etc, so it gets a fair amount of attention.  Of course, a lot of 
>> the training effort has gone into training it for over 100 languages, which 
>> isn't really relevant to old computer documentation, but even for plain 
>> English, it's received lots of training attention.
> Is Tesseract open source?  
Yes, it's open source:  https://github.com/tesseract-ocr

> It sounds vaguely like the one I tried, but I'm not sure; I remember 
> something that felt more like a toolkit than like an application.
Yes, it's the engine.  There are various wrappers that provide more
polished interfaces.
> Google's OCR is pretty lousy in many cases.  Perhaps that's because they just 
> feed it stuff without ever looking at the result.  There are plenty of Google 
> books that have errors in the majority of the words.
The amazing thing about a talking dog is not how well it talks, but that
it talks at all.

For the volume of stuff they've scanned, it's pretty impressive.  If a
book is that bad, it's because no one looked at the output and retrained.
What Tom sent around earlier is fairly typical (in my limited
experience).  It would take someone a good hour or two to clean it up.

>       paul
>
>


_______________________________________________
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh
