Re: How to instruct tesseract not to use ligatures (i.e. don't use ﬁ, ﬂ... instead fi, fl...)

2013-04-29 Thread Sven Pedersen
You appear to be a fellow Ithacan! (I no longer live there, but remember it fondly.) Anyway, other common ligatures include ff, ffi, ffl, fb, fy, ft http://ilovetypography.com/2007/09/09/decline-and-fall-of-the-ligature/ Sven On Monday, April 29, 2013, Michael Sander wrote: > Yes, I'm doing some

Re: How to instruct tesseract not to use ligatures (i.e. don't use ﬁ, ﬂ... instead fi, fl...)

2013-04-29 Thread Michael Sander
Yes, I'm doing something similar in Python. Do you know of a list of ligatures so I can convert them to ASCII? I know fi and fl are the most popular, but there are probably many more. Michael Sander michael.san...@gmail.com 607-227-9859 On Mon, Apr 29, 2013 at 7:48 PM, Greg Dunkel wrote: >

Re: How to instruct tesseract not to use ligatures (i.e. don't use ﬁ, ﬂ... instead fi, fl...)

2013-04-29 Thread Greg Dunkel
I couldn't get the config to work on Ubuntu so I wrote a post-processing sed script to convert the ligatures to two characters. On Mon, Apr 29, 2013 at 3:45 AM, Michael Sander wrote: > How did you format your config file? I tried adding the following line and > it doesn't seem to work: > > tesse
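
The post-processing approach mentioned in this thread could look roughly like the sketch below, written in Python rather than sed. The table and the function name `expand_ligatures` are illustrative, not taken from anyone's actual script. It covers only the Latin ligatures that have their own Unicode code points (U+FB00 to U+FB06); as far as I know, fb, fy and ft have no precomposed code points, so they are not in the table.

```python
# Latin ligatures from the Unicode "Alphabetic Presentation Forms" block,
# mapped to their plain ASCII spellings.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature, spelled here with a modern "s"
    "\ufb06": "st",
}

# Translation table built once; each key is a single code point.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace each ligature character with its multi-letter ASCII form."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(expand_ligatures("\ufb01nd the \ufb02oor of the o\ufb03ce"))
```

Running this over Tesseract's text output leaves all other characters untouched, so it should be safe to apply even when no ligatures are present.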

Re: is there howto to optimise text extraction from non-document images?

2013-04-29 Thread Sven Pedersen
Search the forum for license plates; that has been the most popular non-document application. https://groups.google.com/forum/m/?fromgroups#!forum/tesseract-ocr Sven On Sunday, April 28, 2013, Jonathan Chetwynd wrote: > I have a number of webcam images of road signage, > > tesseract-ocr output is

Re: tesseract training - forms columning and handwritten text

2013-04-29 Thread Sven Pedersen
Tesseract is not good for handwritten text, but you could check out http://www.chronoscan.org which uses Tesseract OCR for part of the processing. There is a free handwriting recognizer at lipitk.sf.net but I'm not sure that'll work for what you want. --Sven On Mon, Apr 29, 2013 at 11:09 AM, wr

tesseract training - forms columning and handwritten text

2013-04-29 Thread luigi . daddario
Hi, have you experience with forms columning and handwritten text? This is the bad result: NOME i 15/1/80 i COGNOME MÉZL 915*675..6..*..- } NATO/AH. / / 55.555 A e}! / lpaov.

Re: How to instruct tesseract not to use ligatures (i.e. don't use ﬁ, ﬂ... instead fi, fl...)

2013-04-29 Thread Nick White
On Mon, Apr 29, 2013 at 07:00:47AM -0700, Michael Sander wrote: > On a related note, why is tesseract even generating these characters in the > first place given the fact that I chose English as the training data? They are English characters. They're ligatures, used in printed English a lot. Look

Re: Building tesseract 3.02.02 with leptonica 1.69

2013-04-29 Thread Nick White
Oh cool, I haven't actually used multi-page TIFFs before, it's nice that Tesseract handles them well, straight from ghostscript. Yes, at the moment I suppose you'll just have to make a little script or something to wrap the ghostscript and tesseract steps appropriately. I have used pdfimages for
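
A wrapper for the ghostscript and tesseract steps described here might be sketched as follows. This is only an illustration under assumptions: the function names are made up, the `tiffgray` device and 300 dpi resolution are arbitrary choices, and both `gs` and `tesseract` are assumed to be on the PATH.

```python
import subprocess


def ghostscript_cmd(pdf_path, tiff_path, dpi=300):
    """Build a gs command that renders every PDF page into one multi-page TIFF."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffgray",          # assumed device; tiffg4 etc. also exist
        f"-r{dpi}",                   # rendering resolution in dpi
        f"-sOutputFile={tiff_path}",
        pdf_path,
    ]


def tesseract_cmd(tiff_path, output_base, lang="eng"):
    """Build a tesseract command; tesseract appends .txt to output_base."""
    return ["tesseract", tiff_path, output_base, "-l", lang]


def ocr_pdf(pdf_path, output_base, dpi=300, lang="eng"):
    """Render the PDF to TIFF, OCR it, and return the expected text file name."""
    tiff_path = output_base + ".tif"
    subprocess.run(ghostscript_cmd(pdf_path, tiff_path, dpi), check=True)
    subprocess.run(tesseract_cmd(tiff_path, output_base, lang), check=True)
    return output_base + ".txt"
```

Calling `ocr_pdf("my-test.pdf", "my-test")` would leave the intermediate `my-test.tif` on disk; whether to keep or delete it is left open here.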

Re: How to instruct tesseract not to use ligatures (i.e. don't use ﬁ, ﬂ... instead fi, fl...)

2013-04-29 Thread Michael Sander
Still not working. I tried attaching the config, but it won't let me because it's binary. I made a workaround by converting all instances of ﬁ into fi in the output, but obviously it would be better to strip the Unicode first in tesseract. On a related note, why is tesseract even generating th

Re: How to instruct tesseract not to use ligatures (i.e. don't use ﬁ, ﬂ... instead fi, fl...)

2013-04-29 Thread klo uo
Michael, for example add this line in your config file: tessedit_char_blacklist ﬁﬂ I don't know how Gmail will represent these characters, but make sure the file is in UTF-8, I guess. On Mon, Apr 29, 2013 at 9:45 AM, Michael Sander wrote: > How did you format your config file? I tried adding the
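
Since the encoding of the config file seems to be the sticking point, one way to rule out editor or mail-client mangling is to generate the file from code. The sketch below is an assumption-laden illustration: the helper name `write_blacklist_config` and the file name `nolig` are invented, and the default blacklist covers only the ﬁ and ﬂ ligatures (U+FB01, U+FB02).

```python
def write_blacklist_config(path, chars="\ufb01\ufb02"):
    """Write a one-line Tesseract config file that blacklists `chars`.

    The file is written as UTF-8 with a Unix newline, using the
    "variable value" layout that Tesseract config files follow.
    """
    line = f"tessedit_char_blacklist {chars}\n"
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(line)
    return line


if __name__ == "__main__":
    # "nolig" is a hypothetical config name chosen for this example.
    write_blacklist_config("nolig")
```

The resulting file would then be passed as the trailing config argument on the tesseract command line; where Tesseract looks for config files may depend on the installation, so that part is not shown.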

Re: Building tesseract 3.02.02 with leptonica 1.69

2013-04-29 Thread Steven McArdle
Thanks, Nick. I already have it set up for ghostscript as it gives better results than imagemagick. As the PDFs are mostly multi-page and ghostscript can generate multi-page TIFF from these, I can feed these directly into Tesseract. So I don't think pdfimages is an option as it spits out multip

Re: Building tesseract 3.02.02 with leptonica 1.69

2013-04-29 Thread TP
On Mon, Apr 29, 2013 at 4:10 AM, Steven McArdle wrote: > What do you mean by "it doesn't support straight PDF" ? > > Leptonica only supports PDF for relatively simple *output*. See "I/O libraries Leptonica is dependent on" [1] and "Image I/O" [2]. If you don't believe that, see src\environ.h [3] f

Re: Building tesseract 3.02.02 with leptonica 1.69

2013-04-29 Thread Nick White
On Mon, Apr 29, 2013 at 04:10:43AM -0700, Steven McArdle wrote: > What do you mean by "it doesn't support straight PDF" ? I mean it only accepts image files. So you need to extract the images from the PDF before getting Tesseract to process them. Now I think of it, the 'pdfimages' tool is better

Re: Building tesseract 3.02.02 with leptonica 1.69

2013-04-29 Thread Steven McArdle
What do you mean by "it doesn't support straight PDF" ? The PDF I have is a pure image PDF i.e. from a scanner with NO OCR, just the image layer. I can convert this to TIFF with good results using Ghostscript but I was hoping that Tesseract could handle image only PDF's Steve On Monday, Ap

Tesseract 3.02.02 compiled with Leptonica 1.69 doesn't process PDF's

2013-04-29 Thread Steven McArdle
Hi All After compiling Tesseract 3.02.02 with Leptonica 1.69 on Ubuntu 12.04 tesseract --version reports tesseract 3.02.02 (No Leptonica details) If I try to run tesseract on a PDF file I get Tesseract Open Source OCR Engine v3.02.02 with Leptonica Error in pixReadStream: Unknown format: no

Re: Building tesseract 3.02.02 with leptonica 1.69

2013-04-29 Thread Nick White
> ALSO, I thought tesseract built with leptonica could handle any of the formats > leptonica can handle, and that include PDF. Nope, it doesn't support straight PDF. Best is to rip the images out of the PDF first. If you have imagemagick, something like this will do that: convert my-test.pdf ou

Building tesseract 3.02.02 with leptonica 1.69

2013-04-29 Thread Steven McArdle
Hi All I have built Tesseract 3.02.02 with Leptonica 1.69 but I have some problems running tesseract --version reports tesseract 3.02.02 Notice it does not mention leptonica ? Secondly, if I try to use a PDF as input I get the following error $ tesseract my-test.pdf my-test Tesseract Open So

is there howto to optimise text extraction from non-document images?

2013-04-29 Thread Jonathan Chetwynd
I have a number of webcam images of road signage, tesseract-ocr output is highly variable, how to optimise? for example http://peepo.com/pics/ocr/road_signs.jpg outputs West End Barbican Exhibition -9 Halls and http://peepo.com/pics/ocr/when_red.png 'when red light shows stop here' outputs

Re: How to instruct tesseract not to use ligatures (i.e. don't use ﬁ, ﬂ... instead fi, fl...)

2013-04-29 Thread Michael Sander
How did you format your config file? I tried adding the following line and it doesn't seem to work: tessedit_char_blacklist ﬁ On Sunday, April 1, 2012 5:16:59 AM UTC-4, klo wrote: > > Thanks. I added it to my tesseract configuration file and it works great > > Cheers > > > On Saturday, March 31

Re: Include Tesseract in C++ code

2013-04-29 Thread TP
On Sun, Apr 28, 2013 at 2:16 PM, TedJ wrote: > But if anyone knows of another angle/translation/scale image correction > approach (or code), I'd love to hear about it. I.e. Image stabilization. > I would just use leptonica's pixRead() to read in an image, deskew with pixFindSkewAndDeskew [1] wh