That's great news, Nick! I can't wait to try it on the old Irish fonts!
-Galt
On Tuesday, July 3, 2012 9:44:27 AM UTC-7, Nick White wrote:
>
> On Fri, Jun 01, 2012 at 10:16:52AM +0100, Nick White wrote:
> > On Wed, May 23, 2012 at 05:39:00PM +0100, Nick White wrote:
Here is my pdfbuilder.rb diff.
This contains my fixes to use Tess3.01-specific hocr output
with crisp word-start boundaries,
as well as tolerate empty word or line in hocr output.
$ diff pdfbuilder.orig.rb pdfbuilder.rb
480c480
< ocr_words = ocr_line.search("//span[@class='ocrx_word']")
---
Wonderful news, Zdenko!
> Yesterday David Eger committed a patch that should fix tesseract-ocr hOCR output
> to follow the hOCR spec.
I wonder what he did?
> A. Spec conformity. As far as I understood this is fixed (no report about
> non conformity to hOCR spec).
Good.
> B. Usability in other tools. Th
Here's my pdf if anyone is interested:
http://folkplanet.com/seanchlo/gortoir/GortOir.pdf
Made with scanTailor, jbigenc, pdfbeads and Tess3.01.
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr
Thanks, Zdenko!
I found most of those same links too.
FYI here is Tess3.01 output:
Dul
fé
na
Gréine
.
.
.
.
3
In a nutshell, Tess 3.01 outputs this pattern for each word:
Dul
And judging by pdfbeads code, tess 3.00 did something like this for
each word:
Dul
>
> > Please create issue with description what is output and how it should be...
> > Until then I have been forced to make a little hack to pdfbeads to get it
> > to read the position
> > and word from ocr_word and ocrx_word respectively so that it can read
> > the Tess3.01 hocr input. It seems that
On May 21, 2:04 am, Nick White wrote:
> Hi Galt,
>
> I've been suffering a very similar problem with some of the text I'm
> training, which has several diacritics above and below glyphs. It
> isn't infrequent to find quite a few lines of garbage which are some
>
I should begin by saying that I am grateful and happy to have
a very nice searchable pdf of an old book thanks to Tess.
I found this on the web:
https://github.com/steelThread/mimeograph/commit/b29af3338e8f15b22392b4e313c8688d9950e13b
pdfbeads currently doesn't work with hOCR output generated
oops, swap two column labels:
>
> So, in 72 pages with 56 thousand characters,
> it only made roughly 20 to 30 errors,
> not counting all the chapter titles which
> should be upper case but end up in lower case.
>
> $ cat *.txt | wc
> 1601 9750 55776
> words lines chars
should have
I think I am guilty of exaggerating Tess' output quality a bit,
but after many terrible failures the output looked excellent,
and it is indeed very good if not actually perfect.
Going back over the output very carefully did turn up some
errors, many of which are puzzling. Still the overall quality
On May 17, 9:29 am, Falke wrote:
> On May 17, 5:50 am, Galt wrote:
>
> I am assuming you only used one type of font? I mean -- no font
> variation at all, right? I ran a test training recently, with very
> little data, but mixed italic with normal and bold, and very good
> r
SUCCESS AT LAST!
I have used this simple training text and the output is highly
accurate.
I am very happy to have succeeded at last. I only wish the
documentation
had warned more explicitly what is needed in training.
Here is what worked for me:
Start each line with a capital letter that you ne
FOLLOW UP THIS DOES NOT REALLY WORK,
it is only a misleading trick. If I remove the fuzzies
that had appeared as anti-aliasing effects
of using convert with -resize before -monochrome
then the fuzzies disappear and so does the beneficial effect on
quotes.
It does not appear that the 300 dpi was re
Tess 3.01
By trial and error, I seem to have found the following limitation:
No single line may contain all-caps (ignoring punctuation).
If it does, tess will blow up your model and give you incorrect upper
and lower case output. There is no warning.
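
Since tess gives no warning, one way to catch such lines before training is to scan the training text yourself. Here is a minimal sketch in Ruby, assuming the training text is available as plain UTF-8. The helper names all_caps_line? and all_caps_lines are made up for illustration and are not part of Tesseract or pdfbeads.

```ruby
# Hypothetical helpers (not part of Tesseract): flag training lines whose
# letters are all upper case, ignoring digits, spaces and punctuation.

# True when the line has at least one letter and every letter is upper case.
def all_caps_line?(line)
  letters = line.scan(/\p{L}/)
  return false if letters.empty?
  letters.all? { |ch| ch.match?(/\p{Lu}/) }
end

# Returns [line_number, line] pairs for every all-caps line in the text.
def all_caps_lines(text)
  text.each_line.with_index(1)
      .select { |line, _number| all_caps_line?(line) }
      .map { |line, number| [number, line.chomp] }
end

sample = "DUL FÉ NA GRÉINE.\nDul fé na Gréine\n"
all_caps_lines(sample).each do |number, line|
  puts "line #{number} is all caps: #{line}"
end
```

The check uses Unicode letter classes so that accented capitals such as É count as upper case. Whether rewriting the flagged lines actually avoids the casing problem is only what the trial and error above suggests, not something this script can confirm.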
At first, I was finding trouble just with th
Actually, this limitation seems to apply to all training lines:
Avoid any line of ALL CAPS or else you will get incorrect
upper and lower casing of your ocr output.
Even with -psm 6 it happens.
I am not completely sure why this happens,
but it would be nice to have a warning and
some documentation of the li
Right now I have been forced by this problem to use 300dpi
(instead of 600dpi which is what I actually scanned at).
Since the box finder never joins them as double-curlys
by itself, I have taken to defining the single curlys
left and right, and then will use either ambigs or
a post-processing step
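
The post-processing idea mentioned above could look something like the following Ruby sketch. It assumes the OCR text comes out with two adjacent single curly quotes wherever the page had one double curly quote. The method name join_curly_quotes is hypothetical and is not taken from any existing tool.

```ruby
# Assumption: OCR output contains pairs of single curly quotes where the
# scanned page had a double curly quote.
LEFT_SINGLE  = "\u2018"
RIGHT_SINGLE = "\u2019"
LEFT_DOUBLE  = "\u201C"
RIGHT_DOUBLE = "\u201D"

# Replace each adjacent pair of single curly quotes with the matching
# double curly quote. Lone single quotes are left untouched.
def join_curly_quotes(text)
  text.gsub(LEFT_SINGLE * 2, LEFT_DOUBLE)
      .gsub(RIGHT_SINGLE * 2, RIGHT_DOUBLE)
end

puts join_curly_quotes("\u2018\u2018Dul f\u00e9 na Gr\u00e9ine\u2019\u2019")
```

One caveat: a word ending in an apostrophe directly before a closing single quote would also be merged, so text with nested quotations may need a more careful rule.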
I found a message in the forum in which someone
recommends scanning Nepali text at 600dpi,
which implies that Tess is supposed to work on it.
Letters which had a small amount of space between
them at 600 dpi will sometimes lose that at 300dpi.
ALL CAPS TITLE AT TOP OF PAGE SCREWS UP TRAINING
If I have a scan training page with an ALL-CAPS title as the first
line,
it screws up the training, and I get incorrect upper/lower casing
of many letters.
If one looks at the boxes as identified by tess,
it will be like this:
All Caps Title
inste
I have found that there is a scale-dependency in the curly quotes
handling.
If I create 300 dpi versions of my scans, then Tess3.01 begins working
much better. That is a huge relief and makes tess usable.
I wish I could use the 600dpi scans. I have them.
Seems like this might be a little bug.
May
Tess3.01 has a lot of trouble recognizing my curly double quotes.
Unfortunately, my scans have lots of dialog with these in them.
My Irish font is one with diacriticals.
It has accents over vowels and dots over consonants.
In addition, the uppercase letters are just larger versions
of the lower ca