Re: [tesseract-ocr] Remove certain characters while fine tuning (training) tesseract

2021-03-10 Thread Greg Dunkel
Would it be easier to remove these characters from the output using editing tools? On Tue, Mar 9, 2021, 2:30 AM Murtuza Dahodwala wrote: > Hello, > Currently, my OCR model detects certain characters like *₹ *& *|.* > Is it possible that I can remove these characters by correcting my lstm > bound

Re: [tesseract-ocr] Android app using Tesseract v4 for OCR

2019-03-31 Thread Greg Dunkel
Please post to list. I am not the only one who would be interested in such an app. On Sun, Mar 31, 2019, 10:34 AM Soumik Ranjan Dasgupta < srd1...@cse.jgec.ac.in> wrote: > Hi Rene, > Thank you for replying. Can you provide me with the name or link to the > app? > > On Sun, Mar 31, 2019 at 3:29 PM

Re: [tesseract-ocr] Re: German "Straße" is often "StraBe" (tesseract 4.0)

2018-05-24 Thread Greg Dunkel
A work-around could be easily implemented with a sed script. On Thu, May 24, 2018, 7:41 AM shree wrote: > Please try with script/Latin traineddata to see if you get better results. > > I have added your comment to issue at > https://github.com/tesseract-ocr/langdata/pull/54 > > > > On Thursday,
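The sed work-around mentioned above could look roughly like the following, written here in Python rather than sed. This is only a sketch: the function name `fix_eszett` is hypothetical, and it assumes that a capital "B" sitting directly between two lowercase letters is a misread "ß". It will not catch a misread at the end of a word (for example "FuB"), and a dictionary check would be safer for real text.

```python
import re

# Assumption: in German text a capital "B" between two lowercase letters
# is almost always a misrecognized "ß" (as in "StraBe" for "Straße").
_MISREAD_ESZETT = re.compile(r"(?<=[a-zäöü])B(?=[a-zäöü])")


def fix_eszett(text: str) -> str:
    """Replace a mid-word capital B with ß in OCR output."""
    return _MISREAD_ESZETT.sub("ß", text)
```

Words that start with a capital B, or that are written entirely in capitals, are left untouched because the pattern requires a lowercase letter on both sides.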

Re: [tesseract-ocr] Tesseract convert image to gibberish

2018-02-25 Thread Greg Dunkel
Probably the scan is at too low dpi. Also slightly skewed. On Sun, Feb 25, 2018 at 5:38 AM, Dusayanta Prasad wrote: > I am try to convert the below image using Tesseract in linux using the > following command: > > tesseract img.jpg out -l eng > > >

Re: [tesseract-ocr] tesseract multiply .png files to singular .txt file

2017-03-16 Thread Greg Dunkel
For a large number of files, it is better to do it a chunk at a time, catch any errors, then concatenate the chunk. On Thu, Mar 16, 2017 at 11:52 AM, ShreeDevi Kumar wrote: > Gui front-end for tesseract such as Vietocr and gimagereader will also allow > for batch processing of multiple files. >
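One way to read the chunk-at-a-time suggestion above is sketched below. The names `run_tesseract` and `ocr_in_chunks` and the default chunk size are assumptions made for illustration, not anything from the thread. The sketch calls the `tesseract` command line with `stdout` as the output base so the text comes back on standard output, records any file that fails instead of aborting, and joins the chunks at the end.

```python
import subprocess
from typing import Callable, Iterable, List, Tuple


def run_tesseract(image_path: str) -> str:
    """OCR one image with the tesseract CLI and return the recognized text."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


def ocr_in_chunks(
    image_paths: Iterable[str],
    chunk_size: int = 50,  # hypothetical default; tune to taste
    ocr_one: Callable[[str], str] = run_tesseract,
) -> Tuple[str, List[Tuple[str, str]]]:
    """OCR files chunk by chunk; return (combined text, list of failures)."""
    paths = list(image_paths)
    chunk_texts: List[str] = []
    failures: List[Tuple[str, str]] = []
    for start in range(0, len(paths), chunk_size):
        pages: List[str] = []
        for path in paths[start:start + chunk_size]:
            try:
                pages.append(ocr_one(path))
            except (subprocess.CalledProcessError, OSError) as exc:
                # Remember the failure and carry on with the next file.
                failures.append((path, str(exc)))
        chunk_texts.append("".join(pages))
    return "".join(chunk_texts), failures
```

Passing the OCR step in as `ocr_one` keeps the chunking logic separate from the external program, so a failed file can be retried later from the `failures` list without redoing the whole batch.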

Re: [tesseract-ocr] Improve OCR accuracy

2015-06-23 Thread Greg Dunkel
Scan at a higher resolution. When I went from 200 dpi to 600 dpi my accuracy went from 85% to 98%. On Mon, Jun 22, 2015 at 7:56 AM, Gunasekaran Velu wrote: > > > HI > > I have attached the image as well as Tesseract OCR result for attached > image screen shot. the below OCR some words are missi

Re: [tesseract-ocr] poor recognition of 'fi'

2015-06-08 Thread Greg Dunkel
Since 'fi' and other ligatures generally get OCRed to a separate character, I just run a post-ocr sed script to take care of them, in Linux. On Mon, Jun 8, 2015 at 12:22 PM, Rick Leir wrote: > This problem with ligatures or digraphs is appearing frequently, how can > I avoid it? I want simple ou
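A post-OCR pass of the kind described above might be sketched as follows, in Python instead of sed. The function name `expand_ligatures` is hypothetical, and the table assumes the ligatures arrive as the Unicode presentation forms U+FB00 to U+FB04; other ligature code points would need their own entries.

```python
# Map each Latin ligature code point to its plain-letter spelling.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# str.maketrans accepts single-character keys with multi-character values.
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace ligature characters in OCR output with separate letters."""
    return text.translate(_LIGATURE_TABLE)
```

An explicit table is used rather than a general Unicode normalization so that only the listed ligatures change and the rest of the text is left as tesseract produced it.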

Re: [tesseract-ocr] Many 'question mark' chars in recognized text

2014-10-16 Thread Greg Dunkel
Many OCR programs have trouble with ligatures. On Oct 16, 2014 11:21 AM, "Salvo Piazza" wrote: > Hi all, > I've written a little simple program to extract text from image with > tesseract 3.0.2 as: > > Tesseract instance = Tesseract.getInstance(); > instance.setDatapath(currentDir); > instance.se

Re: errors and the density in OCR

2013-12-15 Thread Greg Dunkel
Perhaps more noise. On Dec 15, 2013 9:58 AM, "saif alfarei" wrote: > Hi guys, > > > what could be the reasons behind the increase in error percentage as the > density increase in OCR process?. > > i already tested Tesseract OCR. > > kind regards, > Saif Farai

Re: Proposed new page for the wiki: PoorQuality

2013-12-14 Thread Greg Dunkel
I have scanned nearly 3,000 pages and fed them into tesseract. Some were very poor quality -- mimeographs from the '60s and other very poor quality faded originals. I found that paying attention to making sure the tifs being input to tesseract were as clean and noise free as possible, that the dp

Re: Different Results on Linux vs Windows

2013-11-04 Thread Greg Dunkel
#2. Even if the source code is the same, the object code -- the instructions the computer executes -- is different, as are the input-output libraries and other system calls. #3. I have only used tesseract on Solaris and Linux; I didn't notice much difference -- and it was a few years ago -

Re: How to remove graphic from scanned document before passing it to tesserract for OCRing?

2013-10-22 Thread Greg Dunkel
You can use photo editing software to do it manually. On Oct 22, 2013 1:45 PM, "Mitxi" wrote: > I'm working on OCR project but I don't know how to remove graphics from > the scanned document image before passing it to tesserract. > Attached files are some scanned documents which I want to remove grap

Re: pdf to text

2013-06-18 Thread Greg Dunkel
When I had this problem, under Linux, I used a tool that converts the pdf to tiffs and then applied tesseract to that. Worked fairly well, especially since it was Haitian Creole. /greg On Tue, Jun 18, 2013 at 6:43 AM, Subharup Chakraborti < subharup0...@gmail.com> wrote: > Hi > I can't able to

Re: How to instruct tesseract not to use ligatures (i.e. don't use ﬁ, ﬂ... instead fi, fl...)

2013-04-29 Thread Greg Dunkel
I couldn't get the config to work on Ubuntu so I wrote a post-processing sed script to convert the ligatures to two characters. On Mon, Apr 29, 2013 at 3:45 AM, Michael Sander wrote: > How did you format your config file? I tried adding the following line and > it doesn't seem to work: > > tesse

Re: Tiff support for tesseract 3.02 on Ubuntu 12.04

2013-02-04 Thread Greg Dunkel
I just scanned approximately 200 pages in Ubuntu 12.10 with no problems, using the 3.02 package from the repository. I had to use convert to improve the tiffs from my scanner, but I got very good results, with a very low error rate. Did nothing special. /greg On Sun, Feb 3, 2013 at 4:08 PM, Michael

Re: problems with grayed background

2012-11-28 Thread Greg Dunkel
Preprocess it through a graphics editor like GIMP to increase the contrast, which will drop out the grey background. On Wed, Nov 28, 2012 at 5:10 AM, sascha4j wrote: > i have trouble to ocr an image like in the attachment. > > only the word text is recognized. > > i tried several binarization algo