Sven, the more you describe the situation, the more I realize that my needs are not the same as yours or those of others here. Has anybody, now that you are discussing a fork, tried to draw a map of what kinds of needs the user community has?
I'm scanning old books and newspapers, and I want to produce really good OCR that can be manually proofread with as little effort as possible. This means lots of old typefaces, lots of old spelling, lots of strange names, often different languages on the same page, often bad print quality, often complex page layout.

When I discover an error, I want to fix the OCR engine, continuously training it to become more and more accurate. If I find a new kind of upper-case "H", it would be insane to apply this new experience only to the interpretation of Swedish, since it will soon appear in texts in other languages. It would also be insane if I were the only one to benefit from such an improvement. It should go back into the engine, so all users can benefit.

The way language training is described in Tesseract, it clearly can't meet these needs. The software was never designed with these goals in mind, or it would look very different. Just one example: If I want to train "fraktur" (black letter), there's no easy way for me to generate a pattern page, because I don't have fraktur fonts installed on my computer. I never write fraktur, I only read it in old books.

The internal needs of Google Book Search should be very similar to mine, and if that's where the previous lead developer works, I can understand if he has abandoned Tesseract for some other design. I can also understand if Google wants to keep that new design to itself. It would most probably be based on statistics from the many millions of books that Google has already scanned.

Does anybody know of an open source OCR project that is based on statistics from scanned books? Could parts of the Tesseract software library be used to cut out letters from scanned pages, so some other software could group them statistically?

--
Lars Aronsson ([email protected])
Aronsson Datateknik - http://aronsson.se

