Re: Tess3.01 not recognizing my curly double quotes.

2012-07-23 Thread Galt
That's great news, Nick! I can't wait to try it on the old Irish fonts! -Galt On Tuesday, July 3, 2012 9:44:27 AM UTC-7, Nick White wrote: > > On Fri, Jun 01, 2012 at 10:16:52AM +0100, Nick White wrote: > > On Wed, May 23, 2012 at 05:39:00PM +0100, Nick White wrote:

Re: Tess3.01 hocr output not working with pdfbeads

2012-05-30 Thread Galt
Here is my pdfbuilder.rb diff. It contains my fixes to use the Tess3.01-specific hocr output with crisp word-start boundaries, and to tolerate an empty word or line in the hocr output.
$ diff pdfbuilder.orig.rb pdfbuilder.rb
480c480
< ocr_words = ocr_line.search("//span[@class='ocrx_word']")
---

Re: Tess3.01 hocr output not working with pdfbeads

2012-05-26 Thread Galt
Wonderful news, Zdenko! > Yesterday David Eger commit patch that should fix tesseract-ocr hOCR output > to follow hOCR spec. I wonder what he did? > A. Spec conformity. As far as I understood this is fixed (no report about > non conformity to hOCR spec). Good. > B. Usability in other tools. Th

Re: Tess3.01 hocr output not working with pdfbeads

2012-05-26 Thread Galt
Here's my pdf if anyone is interested: http://folkplanet.com/seanchlo/gortoir/GortOir.pdf Made with scanTailor, jbigenc, pdfbeads and Tess3.01. -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To post to this group, send email to tesseract-ocr

Re: Tess3.01 hocr output not working with pdfbeads

2012-05-23 Thread Galt
Thanks, Zdenko! I found most of those same links too. FYI here is Tess3.01 output: Dul fé na Gréine . . . . 3 In a nutshell, Tess 3.01 outputs this pattern for each word: Dul And judging by pdfbeads code, tess 3.00 did something like this for each word: Dul
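The markup in the preview above did not survive archiving, so the following is only a minimal sketch of the nested pattern as it is described elsewhere in this thread: the position is read from an outer ocr_word span and the text from an inner ocrx_word span. The sample fragment, its attribute layout, the bbox values, and the helper name extract_words are all assumptions for illustration, not Tesseract's or pdfbeads' actual code.

```python
import re

# Hypothetical hOCR fragment: outer ocr_word span carries the bounding box,
# inner ocrx_word span carries the text. Attribute layout is an assumption.
SAMPLE = (
    "<span class='ocr_word' title='bbox 10 20 60 45'>"
    "<span class='ocrx_word'>Dul</span></span> "
    "<span class='ocr_word' title='bbox 70 20 95 45'>"
    "<span class='ocrx_word'>f\u00e9</span></span> "
    "<span class='ocr_word' title='bbox 99 20 99 45'>"
    "<span class='ocrx_word'></span></span>"
)

WORD_RE = re.compile(
    r"<span class='ocr_word'[^>]*title='bbox (\d+) (\d+) (\d+) (\d+)'[^>]*>"
    r"\s*<span class='ocrx_word'[^>]*>(.*?)</span>\s*</span>",
    re.S,
)

def extract_words(hocr):
    """Return (text, (x0, y0, x1, y1)) pairs, skipping empty words."""
    words = []
    for x0, y0, x1, y1, text in WORD_RE.findall(hocr):
        text = text.strip()
        if text:  # tolerate empty words instead of failing on them
            words.append((text, (int(x0), int(y0), int(x1), int(y1))))
    return words

print(extract_words(SAMPLE))
```

A regular expression is used here only to keep the sketch dependency-free; a real consumer would more likely use an HTML parser, as pdfbeads does.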

Re: Tess3.01 hocr output not working with pdfbeads

2012-05-22 Thread Galt
> > > Please create issue with description what is output and how it should be... > > Until then I have been forced to make a little hack to pdfbeads to get it > > to read the position > > and word from ocr_word and ocrx_word respectively so that it can read > > the Tess3.01 hocr input. It seems that

Re: Tess3.01 not recognizing my curly double quotes.

2012-05-22 Thread Galt
On May 21, 2:04 am, Nick White wrote: > Hi Galt, > > I've been suffering a very similar problem with some of the text I'm > training, which has several diacritics above and below glyphs. It > isn't infrequent to find quite a few lines of garbage which are some >

Tess3.01 hocr output not working with pdfbeads

2012-05-21 Thread Galt
I should begin by saying that I am grateful and happy to have a very nice searchable pdf of an old book thanks to Tess. I found this on the web: https://github.com/steelThread/mimeograph/commit/b29af3338e8f15b22392b4e313c8688d9950e13b pdfbeads currently doesn't work with hOCR output generated

Re: What are the real requirements for training?

2012-05-19 Thread Galt
oops, swap two column labels: > > So, in 72 pages with 56 thousand characters, > it only made roughly 20 to 30 errors, > not counting all the chapter titles which > should be upper case but end up in lower case. > > $ cat *.txt | wc >    1601    9750   55776 >  words     lines    chars should have
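For reference, wc prints its three columns in the order lines, words, bytes, so the figures quoted above read as 1601 lines, 9750 words, and 55776 characters. The sketch below mirrors that column order and works out the error rate implied by "20 to 30 errors"; the function names are made up for illustration.

```python
def wc(text):
    """Return (lines, words, bytes) in the same column order as `wc`."""
    return text.count("\n"), len(text.split()), len(text.encode("utf-8"))

def error_rate_percent(errors, total_chars):
    """Character error rate as a percentage."""
    return 100.0 * errors / total_chars

# Figures quoted in the message above.
lines, words, chars = 1601, 9750, 55776

low = error_rate_percent(20, chars)
high = error_rate_percent(30, chars)
print(f"{low:.3f}% to {high:.3f}%")  # prints "0.036% to 0.054%"
```

Note that wc counts bytes by default, so for UTF-8 text with accented Irish letters the true character count is somewhat lower than the third column.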

Re: What are the real requirements for training?

2012-05-19 Thread Galt
I think I am guilty of exaggerating Tess' output quality a bit, but after many terrible failures the output looked excellent, and it is indeed very good if not actually perfect. Going back over the output very carefully did turn up some errors, many of which are puzzling. Still, the overall quality

Re: What are the real requirements for training?

2012-05-18 Thread Galt
On May 17, 9:29 am, Falke wrote: > On May 17, 5:50 am, Galt wrote: > > I am assuming you only used one type of font?  I mean -- no font > variation at all, right? I ran a test training recently, with very > little data, but mixed italic with normal and bold, and very good > r

Re: What are the real requirements for training?

2012-05-17 Thread Galt
SUCCESS AT LAST! I have used this simple training text and the output is highly accurate. I am very happy to have succeeded at last. I only wish the documentation had warned more explicitly what is needed in training. Here is what worked for me: Start each line with a capital letter that you ne

Re: Tess3.01 not recognizing my curly double quotes.

2012-05-16 Thread Galt
FOLLOW UP THIS DOES NOT REALLY WORK, it is only a misleading trick. If I remove the fuzzies that had appeared as anti-aliasing effects of using convert with -resize before -monochrome then the fuzzies disappear and so does the beneficial effect on quotes. It does not appear that the 300 dpi was re
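The downsampling step referred to here (ImageMagick's convert with -resize applied before -monochrome) might be scripted roughly as below. This is a sketch only: the 50% factor assumes going from 600 dpi to 300 dpi, and the file names and the helper name downsample_command are hypothetical. The function just builds the argument list; it does not run anything.

```python
def downsample_command(src, dst, percent=50):
    """Build a `convert` argv that resizes first, then thresholds to 1-bit.

    ImageMagick applies operators in the order given, so -resize runs
    before -monochrome, which is what leaves anti-aliasing "fuzzies".
    """
    return ["convert", src, "-resize", f"{percent}%", "-monochrome", dst]

# Hypothetical file names, for illustration only.
print(" ".join(downsample_command("page-600dpi.tif", "page-300dpi.tif")))
```

The list can be passed to subprocess.run if ImageMagick is installed. As the message itself says, any improvement seen this way came from the anti-aliasing fuzz rather than from the resolution change.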

What are the real requirements for training?

2012-05-16 Thread Galt
Tess 3.01 By trial and error, I seem to have found the following limitation: No single line may contain all-caps (ignoring punctuation). If it does, tess will blow up your model and give you incorrect upper and lower case output. There is no warning. At first, I was finding trouble just with th
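One way to guard against this before training is to scan the training text for lines whose letters are all upper case, ignoring punctuation, digits, and spaces. The following is a minimal sketch of such a check, not part of Tesseract; the function names are made up for illustration.

```python
def is_all_caps(line):
    """True if the line has at least one letter and every letter is upper case."""
    letters = [ch for ch in line if ch.isalpha()]
    return bool(letters) and all(ch.isupper() for ch in letters)

def find_all_caps_lines(text):
    """Return (line_number, line) for every all-caps line, numbering from 1."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), 1)
        if is_all_caps(line)
    ]

sample = "DUL F\u00c9 NA GR\u00c9INE\nDul f\u00e9 na Gr\u00e9ine\n. . . . 3\n"
for number, line in find_all_caps_lines(sample):
    print(f"line {number} is all caps: {line}")
```

Lines containing only punctuation or digits are not flagged, since they contain no letters at all.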

Re: ALL CAPS TITLE at the top of the page screws up training.

2012-05-16 Thread Galt
Actually, this limitation seems to apply to all training lines: Avoid any line of ALL CAPS or else you will get incorrect casing of your OCR output. Even with -psm 6 it happens. I am not completely sure why this happens, but it would be nice to have a warning and some documentation of the li

Re: Tess3.01 not recognizing my curly double quotes.

2012-05-14 Thread Galt
Right now I have been forced by this problem to use 300dpi (instead of 600dpi which is what I actually scanned at). Since the box finder never joins them as double-curlys by itself, I have taken to defining the single curlys left and right, and then will use either ambigs or a post-processing step
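The post-processing option mentioned above could be as simple as collapsing two adjacent single curly quotes into one double curly quote in the OCR text. This is a sketch under that assumption, not the author's actual script, and the function name is hypothetical.

```python
LEFT_SINGLE, RIGHT_SINGLE = "\u2018", "\u2019"
LEFT_DOUBLE, RIGHT_DOUBLE = "\u201c", "\u201d"

def join_curly_quotes(text):
    """Replace doubled single curly quotes with one double curly quote."""
    text = text.replace(LEFT_SINGLE * 2, LEFT_DOUBLE)
    return text.replace(RIGHT_SINGLE * 2, RIGHT_DOUBLE)

ocr_output = "\u2018\u2018Dia dhuit,\u2019\u2019 ar s\u00e9."
print(join_curly_quotes(ocr_output))
```

One caveat: a word-final apostrophe immediately followed by a closing single quote would also be merged, so the result may need a manual check on text that nests quotations.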

Re: Tess3.01 not recognizing my curly double quotes.

2012-05-14 Thread Galt
I found a message in the forum in which someone recommends scanning Nepali text at 600dpi, which implies that Tess is supposed to work on it. Letters which had a small amount of space between them at 600 dpi will sometimes lose that at 300dpi.

ALL CAPS TITLE at the top of the page screws up training.

2012-05-13 Thread Galt
ALL CAPS TITLE AT TOP OF PAGE SCREWS UP TRAINING If I have a scan training page with an ALL-CAPS title as the first line, it screws up the training, and I get incorrect upper/lower casing of many letters. If one looks at the boxes as identified by tess, it will be like this: All Caps Title inste

Re: Tess3.01 not recognizing my curly double quotes.

2012-05-13 Thread Galt
I have found that there is a scale-dependency in the curly quotes handling. If I create 300 dpi versions of my scans, then Tess3.01 begins working much better. That is a huge relief and makes tess usable. I wish I could use the 600dpi scans. I have them. Seems like this might be a little bug. May

Tess3.01 not recognizing my curly double quotes.

2012-05-12 Thread Galt
Tess3.01 has a lot of trouble recognizing my curly double quotes. Unfortunately, my scans have lots of dialog with these in them. My Irish font is one with diacriticals. It has accents over vowels and dots over consonants. In addition, the uppercase letters are just larger versions of the lower ca