Keep in mind that accuracy depends heavily on the right fonts being
included in the training set. I have no reason to believe that the
2.04 and 3.0 training sets are identical - perhaps someone could
enlighten us. In any case, I routinely come across certain pages
where recognition is terrible and where there is no doubt that the
cause is a missing font.

On Jul 26, 1:55 pm, Philip Pemberton <phil...@gmail.com> wrote:
> Hi,
> I'm working on cataloguing about 20 years of journals and magazines,
> down to article level where possible. My plan is to scan the Table of
> Contents pages from each issue, OCR with Tesseract, then use text
> processing software (a fancy way of saying "a Python script") to analyse
> the text, find the article titles, and add the data to a MySQL database.
>
> Tesseract 2.04 does pretty well for accuracy -- at worst, I get the
> occasional full-stop turning into a hyphen/dash. All pretty simple to
> fix. Problem is, Tess2.04 can't handle double-quotes -- instead it dies
> with this error:
>
> phil...@cheetah:~/$ tesseract elek0002.tif elek0002_tess2
> Tesseract Open Source OCR Engine
> tesseract: unicharset.cpp:76: const UNICHAR_ID
> UNICHARSET::unichar_to_id(const char*, int) const: Assertion
> `ids.contains(unichar_repr, length)' failed.
> Aborted
>
> If I use Tesseract 3 (the current SVN release), then I can OCR the page:
>
> phil...@cheetah:~/$ LD_LIBRARY_PATH=/tmp/tess/lib /tmp/tess/bin/tesseract elek0002.tif elek0002_tess3
> Tesseract Open Source OCR Engine with LibTiff
>
> But the error rate is FAR worse. The page numbers on the right-hand side
> of the page are completely gone, the first line is mush (random letters)
> and upper-case "M" gets OCR'd as "l\/l" (usually when the page contains
> a frequency, e.g. "89 MHz").
>
> The assertion failure seems to be a manifestation of Issue #265
> (http://code.google.com/p/tesseract-ocr/issues/detail?id=265), which is
> apparently "fixed in Tesseract 3". What I'd like is the recognition
> accuracy of 2.04, with the stability of 3.0 (or at least the bugfix for
> #265)...
>
> Is there any way to get the accuracy back where it was with 2.04 (or at
> least get the page numbers back)?
>
> I've uploaded my test images here:
>    http://www.philpem.me.uk/temp/tesseract/
>
> Both are greyscale TIFFs.
>
> ELEK0001.TIF is a "works fine" example that OCRs almost perfectly in
> Tess2.04 but has significant errors in Tess3.0-svn.
>
> ELEK0002.TIF crashes Tess2.04, works in Tess3.0-svn, but has a lot of
> errors (especially on the first line).
>
> When processed with Tess3.0, the page numbers (right-hand column) are
> omitted from the output .TXT file.
>
> Thanks,
> Phil.
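
For the post-processing step described in the quoted message (a Python script that cleans up the OCR text and pulls out article titles), here is a minimal sketch. It is only an illustration under stated assumptions: the names OCR_FIXES, fix_ocr_confusions, TOC_LINE and parse_toc_line are hypothetical, the only substitution taken from the report is "l\/l" for "M", and the line layout (title, optional dot leaders, page number at the end) is a guess about the Table of Contents pages, not something confirmed by the test images.

```python
import re

# Hypothetical fix-up table: OCR confusion -> intended text.
# Only the "l\/l" -> "M" case comes from the report in this thread;
# any other entries would have to be collected from real output.
OCR_FIXES = {
    r"l\/l": "M",
}


def fix_ocr_confusions(line):
    """Apply literal (non-regex) substitutions for known OCR confusions."""
    for wrong, right in OCR_FIXES.items():
        line = line.replace(wrong, right)
    return line


# Assumed layout: a title, optional dot/space leaders, then a page number
# as the last thing on the line.
TOC_LINE = re.compile(r"^(?P<title>.+?)[\s.]*\s(?P<page>\d+)$")


def parse_toc_line(line):
    """Return (title, page) for a contents line, or None if it does not fit."""
    line = fix_ocr_confusions(line.strip())
    match = TOC_LINE.match(line)
    if match is None:
        return None
    return match.group("title"), int(match.group("page"))
```

A line that comes back as None (for example, one where the page number was dropped by the OCR) can be logged for manual review instead of being inserted into the database.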
