Is there a collection of links to documentation that isn't in the wiki? From reading the list archives, it appears there must be some, but I could use help finding them. The wiki is (quite understandably) focussed on development issues. I'm trying to learn how to tie the current tools together and use them to create character recognition and language models for the corpus I'm interested in.
Personal Introduction: I'm a technical person (a WAN engineer), have OCRopus 0.4.4 compiled and working on my Linux PC, and am fairly proficient in Tcl scripting. But I'm certainly not a software developer.

My interest: I hope to use OCRopus on a large volume (>5000 pages) of old (125-300 years old) German Lutheran theological literature. It's all in the old German Fraktur script, which is notoriously difficult for OCR. The only commercial tool worth noting for this OCR niche is from ABBYY. Their tool is terribly expensive, because there is a considerable per-page licensing cost for the Fraktur module, in addition to the initial cost of the software.

My project presents several challenges:

1) Tesseract does fairly well with Fraktur, but I'm aware of no language tools to go with it. No matter how good the character recognition gets, centuries-old scanned Fraktur documents will have a high error rate. Does anyone know of tools that could start with Tesseract's fairly good output and apply language models to it?

2) OCRopus appears to have a language model, and is trainable, but I gather from the archives that inserting Tesseract for the character recognition in OCRopus isn't realistic. So I will have to train OCRopus' character recognition engine. Does anyone already have a trained character recognition model for old German Fraktur documents?

3) Modern German language models (be they spelling, morphology, word frequency, etc.) won't work well on this corpus. So that, too, I will need to create. Does anyone have one already? Where do I find documentation on the nitty-gritty of how to create a language model?

4) Any and all suggestions greatly appreciated.

Thanks,
Dennis
