On Sun, Feb 24, 2013 at 12:20 AM, Nick White <[email protected]> wrote:

> On Fri, Feb 22, 2013 at 03:20:49PM +0000, Nick White wrote:
> > On Sun, Jun 03, 2012 at 10:27:23PM +0100, zdenko podobny wrote:
> > > it looks like it is ASCII-only oriented (at least in the report, non-ASCII
> > > characters are malformed...), ftk has only a binary distribution, so no
> > > fix can be expected...
> > >
> > > BTW: tools are at new place:
> > > http://code.google.com/p/isri-ocr-evaluation-tools ;
> > > report can be found at stephenvrice.com/images/AT-1995.pdf
> >
> > I finally got around to working with these tools a bit. It seems
> > that they do process unicode correctly (though I haven't tested
> > combined characters, and suspect that may not work). You're correct
> > the reports don't seem to output unicode properly, but that's
> > probably easily fixed.
>
> Right, I created a workaround to enable at least the 'accuracy' tool
> (which is the really important one) to work fine with UTF-8. It's a
> script called utf8toolwrap.sh; if you're interested, check it out;
> it's attached to this issue:
> https://code.google.com/p/isri-ocr-evaluation-tools/issues/detail?id=2
>
> It makes the 'accuracy' tool actually very useful; it shows how
> common various misrecognitions are - very useful for potential
> unicharambigs rules :)
>
> Nick
>
> P.S. It requires a Linux-ish environment, and the tools asc2uni and
> uni2asc from the isri toolkit to be available on the PATH.
>
Hi,

thanks for caring about this... Maybe it would make sense to fork these
tools ;-) Just in case nobody reacts to your patches. And we could save some
time by applying several patches from the issues ;-)

I did not have time to work more on this issue, but I (and maybe others in
the tesseract community ;-) ) need two advanced tools:

   - a tool for finding optimal settings: if there are image files with ground
   truth data, you can iterate over tesseract variables (or preprocess the image:
   resolution, denoising, blurring etc.) to find the optimal settings (in terms
   of speed, OCR result...)
   - a tool for measuring training quality, e.g. how many pages do I need for
   training to get a reasonable result? If I add another similar font, how does
   it affect the OCR result (I have had a bad experience with this)? Is there a
   problem with a specific symbol (is there a need to focus on it)?
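To make the first idea a bit more concrete, here is a minimal Python sketch of such a settings sweep. It is only an illustration under assumptions: `run_ocr` is a hypothetical callable (image, settings) -> text that you would have to wire to tesseract yourself, the setting names in the grid are whatever you choose to pass in, and the score is a plain edit-distance character accuracy standing in for whatever the ISRI 'accuracy' tool reports.

```python
import itertools


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def char_accuracy(truth, ocr):
    """1 - errors / len(truth); may be negative if the OCR text is far too long."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return 1.0 - edit_distance(truth, ocr) / len(truth)


def sweep(run_ocr, pages, grid):
    """Score every combination of settings in `grid` against ground truth.

    run_ocr: hypothetical callable (image, settings_dict) -> recognised text
    pages:   list of (image, ground_truth_text) pairs
    grid:    dict mapping a setting name to the list of values to try
    Returns a list of (mean_accuracy, settings_dict), best first.
    """
    names = sorted(grid)
    results = []
    for values in itertools.product(*(grid[n] for n in names)):
        settings = dict(zip(names, values))
        scores = [char_accuracy(truth, run_ocr(image, settings))
                  for image, truth in pages]
        results.append((sum(scores) / len(scores), settings))
    results.sort(key=lambda r: r[0], reverse=True)
    return results
```

Calling `sweep(my_ocr, pages, {"psm": [3, 6]})` would return the combinations ordered by mean accuracy; speed could be measured the same way by timing each `run_ocr` call.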

isri-ocr-evaluation-tools could be the heart of these ;-) We have several
"image generators", box editors and training scripts, but no "quality measure
tool". So if anybody has "free time" to make something useful, this would
be great.
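As a rough idea of what a per-symbol quality report could look like, the sketch below counts which ground-truth substrings were replaced by which OCR substrings. It uses Python's standard `difflib` for the alignment, which is an assumption of this example and not how the ISRI tools work; the function name `confusions` is made up for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusions(truth, ocr):
    """Count (truth_fragment, ocr_fragment) pairs where the two texts differ.

    An empty ocr_fragment means the text was dropped; an empty
    truth_fragment means the OCR inserted something spurious.
    """
    counts = Counter()
    # autojunk=False: do not let difflib ignore frequent characters
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            counts[(truth[i1:i2], ocr[j1:j2])] += 1
    return counts
```

Summing these counters over all test pages and printing `most_common()` would show which symbols fail most often, which is also the kind of information useful for unicharambigs rules.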

Zdenko

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
