Re: Open source OCR (was [openhealth] Re: Hi folks..)

Tim Churches Mon, 19 Feb 2007 14:28:10 -0800

Karsten Hilbert wrote:
> On Tue, Feb 20, 2007 at 08:16:07AM +1100, Tim Churches wrote:
> 
>>> The data retrieved from step 4 will be computable data ! Not
>>> particularly well constrained, but not just image data either.
>> Ah, OK. Our problem is that many users only want to record data with a
>> pen, on paper. No typing, no computers. And then scan the paper forms in
>> and have their data appear, automagically, in the database just as if
>> they had typed it. Nearly every user over the age of 50 asks for that.
> I see.
> 
>> Mind you, these are mobile users, and paper forms and a pen are highly
>> portable, robust and never need to be plugged in, so they do have a bit
>> of a (ball) point.
>
> No problem, the availability of pen-and-paper is nearly unbeatable.
> 
> That would bring us back to constrained OCR. Which sort of
> pre-sets the range of output to expect from the OCR process.
> Always assuming users write down valid data ...


Yes, and that is what Teleform claims to be able to do, and if users are
well-behaved and write very neatly, then it does work. But if they
don't... more trouble than it is worth, in my experience.

Which leads me to think that something like Amazon's Mechanical Turk
(see http://en.wikipedia.org/wiki/Amazon_Mechanical_Turk ) is required.
Not actually MTurk, due to privacy concerns, but a pool of trusted or
semi-trusted humans who can do the data entry in a distributed fashion
from scanned forms. The fields on each form could even be teased apart
first by raster image manipulation, with each human transcriber/typist
sent only small parts of a set of randomly chosen forms, so as to
further mitigate privacy concerns. Either all of these bits of form
text (for double-entry data validation) or just a sample of them could
be sent to more than one person, with the results cross-checked and any
discrepancies resolved, to ensure accuracy. Hmmmm.
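
A rough sketch of how the distribution and cross-checking might look,
in Python. Everything here is an assumption made for illustration: the
function names, the idea that each cropped field image is identified by
a (form_id, field_name) pair, and the rule that two entries "agree" if
they match after trimming whitespace and ignoring case. The random
assignment only makes it unlikely, not impossible, that one person sees
every field of the same form.

```python
import random
from collections import defaultdict


def assign_snippets(snippets, transcribers, copies=2, seed=None):
    """Hand each field snippet to `copies` distinct, randomly chosen people.

    `snippets` is a list of hypothetical (form_id, field_name) identifiers,
    one per cropped field image. Returns {transcriber: [snippet, ...]}.
    """
    if copies > len(transcribers):
        raise ValueError("need at least as many transcribers as copies")
    rng = random.Random(seed)
    work = defaultdict(list)
    for snippet in snippets:
        # sample() picks without replacement, so nobody gets a snippet twice
        for person in rng.sample(transcribers, copies):
            work[person].append(snippet)
    for queue in work.values():
        # shuffle so fields from the same form do not arrive together
        rng.shuffle(queue)
    return dict(work)


def reconcile(entries):
    """Cross-check the typed values returned for each snippet.

    `entries` maps snippet -> list of transcriptions from different people.
    Returns (agreed, disputed): values everyone agreed on, and the raw
    entries that still need a human to resolve them.
    """
    agreed, disputed = {}, {}
    for snippet, values in entries.items():
        normalised = {v.strip().lower() for v in values}
        if len(values) >= 2 and len(normalised) == 1:
            agreed[snippet] = values[0].strip()
        else:
            disputed[snippet] = values
    return agreed, disputed
```

With copies=2 this amounts to double entry of every field; checking
only a sample would mean calling assign_snippets() with copies=2 on a
random subset and copies=1 on the rest.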

Tim C
