Is there a collection of links to documentation beyond what is in the
wiki? The list archives suggest that such documentation exists, but I
could use help finding it. The wiki is (quite understandably) focused
on development issues. I'm trying to learn how to tie the current
tools together and use them to create character recognition and
language models for the corpus I'm interested in.

Personal Introduction:

I'm a technical person (a WAN engineer), have 0.4.4 compiled and
working on my Linux PC, and am fairly proficient in Tcl scripting. But
I'm certainly not a software developer.

My interest: I hope to use OCRopus on a large volume (>5000 pages) of
old (125-300 years old) German Lutheran theological literature. It's
all in the old German Fraktur script, which is notoriously difficult
for OCR. The only commercial tool worth noting for this OCR niche is
from ABBYY. Their tool is terribly expensive, because there is a
considerable per-page licensing cost for the Fraktur module, in
addition to the initial cost of the software.

My project presents several challenges:

1) Tesseract does fairly well with Fraktur, but I'm aware of no
language tools to go with it. No matter how good the character
recognition gets, centuries-old scanned Fraktur documents will have a
high error rate. Does anyone know of tools that could start with
Tesseract's fairly good output and apply language models to it?

2) OCRopus appears to have a language model, and is trainable, but I
gather from the archives that substituting Tesseract for the character
recognition in OCRopus isn't realistic. So I will have to train
OCRopus's character recognition engine. Does anyone already have a
trained character recognition model for old German Fraktur documents?

3) Modern German language models (be they spelling, morphology, word
frequency, etc.) won't work well on this corpus, so I will need to
create those, too. Does anyone have one already? Where do I find
documentation on the nitty-gritty of how to create a language model?

4) Any and all suggestions would be greatly appreciated.

Thanks,
Dennis
