Re: How to stop/cancel OCR process?

2012-09-02 Thread patrickq
I think the crux of the problem is your attempt to stop the OCR thread at a random spot in its execution yet expect the state of the Tesseract instance to be consistent. You are right to want to delete the instance, otherwise you would have a memory leak, but it looks like you can't do that after

Re: How to stop/cancel OCR process?

2012-09-02 Thread patrickq
I don't think so: the C++ code in Tesseract will consume memory from the same heap as any other part of the app, so if you just kill the OCR thread nothing will automatically release that memory and you have just created a memory leak - a large one too, considering you are working on a large image.
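To illustrate the point about not killing the thread: a minimal sketch of cooperative cancellation, where the worker checks a flag between bounded units of work and releases the engine itself. FakeOcrEngine, RunOcrJob and the chunked loop are hypothetical stand-ins invented for this example, not Tesseract API; how (or whether) a given Tesseract version lets you interrupt recognition is a separate question.

```cpp
#include <atomic>
#include <cassert>
#include <memory>

// Hypothetical stand-in for an OCR engine that owns a large heap buffer.
// NOT the Tesseract API; it only counts live instances so a leak is visible.
struct FakeOcrEngine {
    static int live_instances;
    FakeOcrEngine() { ++live_instances; }
    ~FakeOcrEngine() { --live_instances; }
    void ProcessChunk() { /* one bounded unit of recognition work */ }
};
int FakeOcrEngine::live_instances = 0;

// Runs up to `chunks` units of work on the calling (worker) thread and checks
// the cancel flag between units. Returns the number of units completed.
int RunOcrJob(const std::atomic<bool>& cancel_requested, int chunks) {
    assert(chunks >= 0);
    std::unique_ptr<FakeOcrEngine> engine(new FakeOcrEngine());
    int done = 0;
    while (done < chunks && !cancel_requested.load()) {
        engine->ProcessChunk();
        ++done;
    }
    // The engine is destroyed here, by the thread that used it, at a point
    // where its state is consistent.
    return done;
}
```

The UI thread only sets the flag and waits for the worker to return; it never terminates the worker or deletes the engine from outside.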

Tess 3.02 English training set broken?

2012-02-05 Thread patrickq
I am running the latest Tess 3.02 with the new English training set and get the following crash at init with lang: actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES:Error:Assert failed:in file tessdatamanager.cpp, line 48 Has anyone seen this? Note: I am not using the cube version, just eng

Re: Problem with Tesseract 3.01 Accuracy

2011-12-13 Thread patrickq
I have had the opposite experience: Tess 3.01 beats 3.00 often - the reverse does happen but rarely. Note that Tess 3.01 will do WORSE if using Tess 3.00 trained data - is it possible you are not using the Tess 3.01 trained data? On Dec 13, 9:22 am, Alasdair 569...@googlemail.com wrote: For

Re: get image direction ?

2011-10-31 Thread patrickq
Yes, sure - there is a Tesseract API for that, it's called DetectOS. Note: this API will work only if there is enough text to make a determination. On Oct 31, 7:46 am, emre yemrecavuso...@gmail.com wrote: anyone who knows ? On Oct 27, 09:27, Yunus Emre Cavusoglu yemrecavuso...@gmail.com

Re: Odd behaviour with scan resolution

2011-10-27 Thread patrickq
We have experienced precisely the same behavior with business cards and found that for business cards optimal image size is around 1,024 x 768. Try that same size with your documents and see what happens - just remember to adjust your size to the size of your documents - if you have twice the

Re: Is there minimum of letters?

2011-10-24 Thread patrickq
The basic reason it helps Tesseract to repeat text is because Tesseract makes an initial assumption what kind of letters it is looking at: tall (digits, uppercase letters, tall lowercase) or lowercase letters. Only after it makes that assumption / guess will it try to match the letters against the

Re: Is there minimum of letters?

2011-10-24 Thread patrickq
What's PSM? Alternative spelling for PMS :-)? On Oct 24, 1:35 pm, Quan Nguyen nguyen...@gmail.com wrote: Try with PSM 8 or 10. On Oct 24, 9:09 am, Giuseppe Menga me...@polito.it wrote: That is interesting. I'm recognizing expiration dates from medicines, and I found convenient to

Re: quality of a word ?

2011-10-18 Thread patrickq
If you are referring to the confidence level values returned by Tesseract, these are expressed as costs, which means a higher value indicates lower confidence in the returned characters. In any case: why would you ever expect Tesseract to say it's 100% sure ever about anything (even when it happens to be

Re: Quality of OCR

2011-09-26 Thread patrickq
As Dmitri said, nothing can be said without your sample image. I'll just say that from our experience the Tesseract decisions on where spaces should appear are extremely poor. Within ScanBizCards we test and revisit not only every space but also any two letters where a space should be inserted. We

Re: problem with lower case l

2011-08-05 Thread patrickq
My experience has been that this mistake does occur - but far less frequently than what you suggest. IMHO it's not even frequent enough to make it a high priority issue, we have addressed many other more important issues in our project and that issue is still only on the to-do list. Patrick On

Re: JNI implementation of TessBaseAPI::GetRegions/TessBaseAPI::GetWords?

2011-08-04 Thread patrickq
Very cool Robert! The video for our own Android app with Tesseract is here: http://www.youtube.com/watch?v=jiPl_9rWoz0 Not as cool as yours though since we don't show anything in real time, we have a bit of work to do before we scan the whole card and make some adjustments and semantic analysis ...

Re: Memory management in Tesseract

2011-07-27 Thread patrickq
As Zdenko pointed out Tesseract does NOT release the input image - nor would it make any sense if it did as it would force the calling app to make a copy of the image buffer every time it called Tesseract if it needed to reuse it for other calls. Note also that all output parameters such as text,

Re: Memory management in Tesseract

2011-07-27 Thread patrickq
words ? On Jul 27, 11:04, patrickq patrick.questemb...@gmail.com wrote: As Zdenko pointed out Tesseract does NOT release the input image - nor would it make any sense if it did as it would force the calling app to make a copy of the image buffer every time it called Tesseract

Re: Tesseract on iPhone - which version?

2011-07-22 Thread patrickq
ScanBizCards actually uses Tesseract 3.01 - I believe the fears expressed by many on this forum about using non official versions of Tesseract are misplaced. We switched from 2.04 to 3.00 as soon as 3.0 was made available - and only benefited from it - then switched to 3.01 quickly - and again

Re: Teseract vs Abbyy

2011-07-03 Thread patrickq
The answer is (of course) it depends: 1. If you compare Tesseract and ABBYY on the same image, without applying preprocessing to it, ABBYY wins (because Tesseract's image processing is very rudimentary - at best). Of course if your test images are produced (for example) by a flatbed scanner, the lack

Re: Trouble with White- and Blacklists

2011-07-01 Thread patrickq
Yes, Tesseract blacklists and whitelists are useful almost exclusively in situations where you really don't have the blacklisted characters anywhere in the image (otherwise Tesseract will return the next best guess, no matter how poor) or vice-versa where you have only the whitelisted characters

Re: How to use osd?

2011-06-23 Thread patrickq
segmentation with orientation and script detection. (OSD) I used a copy of eng.traineddata as osd.traineddata HTH Warm regards, Dmitri Silaev www.CustomOCR.com On Wed, Jun 22, 2011 at 9:05 AM, ogorman ogor...@gmail.com wrote: On Jun 22, 6:48 am, patrickq patrick.questemb...@gmail.com

Re: How to use osd?

2011-06-22 Thread patrickq
I tested it via ScanBizCards and indeed OSD has no issues whatsoever getting it right - there is 10 times the amount of text it needs and the image is very sharp, it's guaranteed to get it right. I am not familiar with the command-line tools however so I can't help, I'll just say that it should be

Re: recognizing screenshots

2011-06-20 Thread patrickq
Tesseract is very poor when recognizing images with a mixture of non-text blobs and text, especially when the non-text elements are larger than the text. In these instances it is likely to completely ignore the text. I suggest you do your own layout analysis and process sub-images one by one -

Re: Page layout analysis module

2011-06-20 Thread patrickq
You can definitely get just layout analysis before text recognition - look at the FindLinesCreateBlockList() API and the BLOCK_LIST data structure. You can then iterate through that structure to look at blocks and rows within these blocks. Keep in mind that a sentence in the image could be broken

Re: Tesseract doesn't work with a very simple example

2011-06-17 Thread patrickq
I don't think you are doing anything wrong - I tested this with ScanBizCards (Tesseract 3.01) and I get "Very mpxe" (note the same mistake of x instead of l). I think this is yet another example of Tesseract's poor recognition whenever it has either too little information about height (as in this case,

Tesseract expecting escape sequence in black list?

2011-06-16 Thread patrickq
I pass a string as black list with these two characters: \ then t and it seems that Tesseract interprets this as the tab character and may return backslash in its OCR output (i.e. backslash is not excluded)! Or if I pass \ then x, it treats it as x (but ignores the backslash). I ended up passing \

Re: When Tesseract refers to dictionary

2011-05-11 Thread patrickq
The dictionary is used along with a list of character combinations considered to be ambiguous. This is a list that is part of the training set. For example, it includes an entry that says that the sequence rn can be mistaken for the letter 'm'. For each entry in that list there is an indication

Re: Binary Images using SetImage

2011-05-05 Thread patrickq
I don't have the answers to your questions but we pass a binary image to Tesseract like you do, with values set to either 0 or 255. Tesseract will threshold the image so we experimented with modifying Tesseract to short-circuit the thresholding for performance reasons - but then realized the

Re: Binary Images using SetImage

2011-05-05 Thread patrickq
- From: tesseract-ocr@googlegroups.com [mailto:tesseract-ocr@googlegroups.com] On Behalf Of patrickq Sent: Thursday, May 5, 2011 12:58 To: tesseract-ocr Subject: Re: Binary Images using SetImage I don't have the answers to your questions but we pass a binary image to Tesseract like

Re: new trained language on iphone

2011-05-02 Thread patrickq
I believe that you need to append / to the path - at least that's what we do (successfully) and I think it failed without it. Tesseract is apparently not recognizing anything not ending with / as a directory. Patrick On May 2, 8:46 am, srinivasan srini@gmail.com wrote: Hi jes, I'm also
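The trailing-slash fix described above can be applied defensively before the directory is handed to Tesseract. A minimal sketch; EnsureTrailingSlash is a hypothetical helper name, and which Tesseract call receives the path is not stated in the post, so that part is left out.

```cpp
#include <cassert>
#include <string>

// Returns `dir` with a trailing '/' appended if it does not already end
// with one. Hypothetical helper, not part of Tesseract.
std::string EnsureTrailingSlash(const std::string& dir) {
    assert(!dir.empty());
    if (dir.back() == '/') return dir;
    return dir + '/';
}
```

Calling it on a path that already ends with / returns the path unchanged, so it is safe to apply unconditionally.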

Format of blacklist string

2011-03-30 Thread patrickq
I am trying to provide a black list with UTF8 characters specified using their byte codes, as follows: // U+FB00 ff ef ac 80 LATIN SMALL LIGATURE FF // U+FB01 fi ef ac 81 LATIN SMALL LIGATURE FI
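For reference, the byte codes quoted above are the standard three-byte UTF-8 encodings of U+FB00 and U+FB01, and a C++ string holding them can be built as in the sketch below. EncodeUtf8ThreeByte and LigatureBlacklist are hypothetical helper names for this example; whether Tesseract honours such a blacklist string is a separate question that this code does not answer.

```cpp
#include <cassert>
#include <string>

// Encodes a code point in the range U+0800..U+FFFF as three UTF-8 bytes.
// (Surrogate code points are not rejected here; this is only a sketch.)
std::string EncodeUtf8ThreeByte(unsigned int code_point) {
    assert(code_point >= 0x0800 && code_point <= 0xFFFF);
    std::string out;
    out += static_cast<char>(0xE0 | (code_point >> 12));
    out += static_cast<char>(0x80 | ((code_point >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (code_point & 0x3F));
    return out;
}

// Builds a string containing the two ligatures from the post:
// U+FB00 (ef ac 80) followed by U+FB01 (ef ac 81).
std::string LigatureBlacklist() {
    return EncodeUtf8ThreeByte(0xFB00) + EncodeUtf8ThreeByte(0xFB01);
}
```

The same bytes can also be written directly as a literal, "\xEF\xAC\x80\xEF\xAC\x81".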

Re: Format of blacklist string

2011-03-30 Thread patrickq
of the characters is not in the training set in the first place. On Mar 30, 10:33 pm, patrickq patrick.questemb...@gmail.com wrote: I am trying to provide a black list with UTF8 characters specified using their byte codes, as follows: // U+FB00 ff ef ac 80 LATIN SMALL

Re: Tesseract 3.0 English Trained Data

2011-03-18 Thread patrickq
Why not simply use a blacklist to exclude these unichars? On Mar 18, 4:38 pm, mw1 man_...@yahoo.com wrote: In evaluating the Tesseract 3.0/2.0x, I find that the trained data eng.traineddata.gz from http://code.google.com/p/tesseract-ocr/downloads/detail?name=eng.trai... is much better

Re: Customising Tesseract for character recognition

2011-03-13 Thread patrickq
Tesseract 3.00 gets this text 100% correct, including the smudged numbers at the bottom. See: http://www.scanbizcards.com/plate1.jpg http://www.scanbizcards.com/plate2.jpg (scanning was done with ScanBizCards on an iPhone - if you try it yourself with the app on Android or iPhone, please disable

Re: Customising Tesseract for character recognition

2011-03-13 Thread patrickq
You expect way too much from Tesseract: it's not Tesseract's job to slice and dice the text according to various organizational requirements of applications - that's for the application to handle. You can get all the coordinates of all characters and easily determine which ones are in what you

Re: Trouble recognizing characters in images with different character size

2011-03-09 Thread patrickq
This is a known issue with Tesseract. One solution is to process the OCR results then detect the size discrepancy between the two parts of the line and then re-process each part as a separate image. In essence, doing that prevents Tesseract from drawing bad inferences. I think Tesseract 3.01

trouble using the blacklist

2011-03-09 Thread patrickq
Here is the sequence of calls we are using to get the complete information about text in the image: myTess->SetImage(grayScaleImageData, grayScaleWidth, grayScaleHeight, 1, grayScaleWidth); BLOCK_LIST* block_list = myTess->FindLinesCreateBlockList(); PAGE_RES* page_res_pass1 =

Re: OCR Sensitivity

2011-03-03 Thread patrickq
The answer lies within your own question! Since you expect only digits, simply accept these letters as the equivalent digit by replacing them. Patrick On Mar 3, 5:09 am, Richard rhe...@dial.pipex.com wrote: Hi, I am really new to Tesseract OCR 3.0 as a static DLL within a Windows environment
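The replacement suggested in the reply above could look something like the sketch below. The letter/digit pairs are illustrative assumptions (the post does not list them), and LetterToDigit and NormalizeDigitField are hypothetical names.

```cpp
#include <cassert>
#include <string>

// Maps letters that OCR commonly confuses with digits onto those digits.
// The pairs below are assumptions for illustration; tune them to your data.
char LetterToDigit(char c) {
    switch (c) {
        case 'O': case 'o': return '0';
        case 'I': case 'l': return '1';
        case 'Z': return '2';
        case 'S': return '5';
        case 'B': return '8';
        default:  return c;  // digits and everything else pass through
    }
}

// Applies the mapping to every character of a field known to hold digits.
std::string NormalizeDigitField(const std::string& ocr_text) {
    std::string out = ocr_text;
    for (char& c : out) c = LetterToDigit(c);
    assert(out.size() == ocr_text.size());  // one-to-one substitution only
    return out;
}
```

This only makes sense on fields where letters cannot legitimately occur; on mixed text it would corrupt real letters.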

Re: text rotated upside down or of 90°

2011-02-28 Thread patrickq
ScanBizCards (iPhone version) is using the Tesseract 3.0 orientation detection, works quite well - accurate in 95%+ of cases and the 5% failure cases are oftentimes because we scan business cards where there isn't a lot of text to go by + there is a lot of non-text confusing the detection.

Re: text rotated upside down or of 90°

2011-02-28 Thread patrickq
in the result by examining the value of the score that won. Patrick On Feb 28, 1:50 pm, Giuseppe Menga me...@polito.it wrote: Patrick just a hint of how to use the orientation functionality of Tesseract Giuseppe -Original Message- From: patrickq Sent: Monday, February 28, 2011 7:44 PM

Irrelevant letters in training sets

2010-11-18 Thread patrickq
I am working with the various international Latin training sets and am discovering that most of them have plenty of letters that are entirely illegal in that language. For example, the Latin letter S with caron is in the German set (Unicode U+0161, the caron looks like the bottom half of a circle

Setting black list

2010-11-09 Thread patrickq
This statement seems to be totally ignored by Tesseract: myTess->SetVariable("tessedit_char_blacklist", "A") and 'A' characters are still getting recognized. I could have sworn this worked for me before. I am using Tesseract 3.0 - has the name of the variable changed? Do I need to set another variable too? I am

No treatment for touching letters?

2010-08-12 Thread patrickq
See http://www.scanbizcards.com/touchingdigits.jpg Includes a tel number where 00 appears twice with no spacing, i.e. touching. Tesseract fails on both sets, returning: (65)81W6W instead of (65)8100 6002 ('00' -> W and '002' -> W) I have not seen Tesseract do well in hardly any situation where two

Can't get the user dictionary to work

2010-07-30 Thread patrickq
This is what I did: 1. Created a text file called eng.user-words, containing: Chest Chestnut Floor Vice 2. Placed the file in the tessdata folder (next to eng.traineddata) 3. Ran recognition on an image returning Chesf instead of Chest and Fioor instead of Floor. Both mistaken f and i look quite

Re: Ready Space Problem

2010-07-30 Thread patrickq
Tesseract's space handling needs a total rewrite - it's not me saying so, it's Ray Smith in a previous post in this forum. Specifically, after the digit '1' Tesseract appears to struggle more than usual with spaces, probably because it attaches too much importance to the width of the last

Re: Can't get the user dictionary to work

2010-07-30 Thread patrickq
which has been discussed in the last three days. Please look in the archives or check the emails you've received from the list for the last few days. --Sven On Fri, Jul 30, 2010 at 8:04 AM, patrickq patrick.questemb...@gmail.com wrote: This is what I did: 1. Created a text file called

Re: Accuracy worse on 3.0-svn than 2.04?

2010-07-27 Thread patrickq
Keep in mind that accuracy depends heavily on the right fonts being included in the training set. I have no reason to believe that the 2.04 and 3.0 training sets are identical - perhaps someone could enlighten us. In any case, I routinely come across certain pages where recognition is terrible

Re: problem getting character coordinates (using EANYCODE_CHAR and ETEXT_DESC)

2010-07-26 Thread patrickq
Looks like a simple case of Y inversion - try transforming whatever Y values you thought were right into (height - Y) where height is the height of the image. On Jul 27, 1:28 am, Andres andrej...@gmail.com wrote: Hello people, I'm trying to get the characters coordinates but I'm getting them
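The (height - Y) transform suggested above, written out for a whole character box. A minimal sketch: the Box struct and FlipY are hypothetical names for this example, and it assumes the incoming coordinates use a bottom-left origin while the display wants a top-left origin.

```cpp
#include <cassert>

// Character box in pixels. On input y is assumed to grow upward
// (bottom-left origin); on output y grows downward (top-left origin).
struct Box {
    int left;
    int bottom;
    int right;
    int top;
};

// Applies y' = image_height - y to both vertical edges. X is unchanged.
Box FlipY(const Box& b, int image_height) {
    assert(image_height > 0);
    Box flipped;
    flipped.left = b.left;
    flipped.right = b.right;
    flipped.top = image_height - b.top;
    flipped.bottom = image_height - b.bottom;
    return flipped;
}
```

After the flip, top is numerically smaller than bottom, which is what most drawing APIs with a top-left origin expect.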

Re: Tesseract Reading Issue

2010-07-19 Thread patrickq
, 8:30 am, Jimmy O'Regan jore...@gmail.com wrote: On 19 July 2010 13:20, patrickq patrick.questemb...@gmail.com wrote: This is a great example of a serious problem with Tesseract when analyzing any image with fonts of variable sizes such as a street sign, flyer, business card etc. What

Re: Tesseract Reading Issue

2010-07-19 Thread patrickq
impact on this if I tweak it? I am not really sure I understand the significance of the values passed for this option though. Thanks Austin -Original Message- From: patrickq Sent: Monday, July 19, 2010 9:00 AM To: tesseract-ocr Subject: Re: Tesseract Reading Issue Setting

Re: Tesseract Reading Issue

2010-07-19 Thread patrickq
words being output without spaces between them (needs to be lower), or if you get spaces between letters (needs to be higher). I am not really sure I understand the significance of the values passed for this option though. Thanks Austin -Original Message- From: patrickq Sent

Re: OSAL required

2010-07-13 Thread patrickq
Just my $0.02 opinion but I highly doubt that you are going to elicit a lot of helpful response by approaching an open source community with such secrecy (note the word open in open source ...). On Jul 13, 4:34 am, sai saikumar@gmail.com wrote: Hi, I want to port this engine in our specific

Re: Is it possible to get a confidence value for the tesseract OCR result?

2010-07-09 Thread patrickq
TesseractExtractResult() returns the confidence numbers for all characters returned. A high number means low confidence. Caveats: 1. The confidence numbers are the same for all letters in a word (even though Tesseract does compute confidence numbers for each letter, it just doesn't return them to
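Since a high number means low confidence, any filtering has to compare in the opposite direction from what one might first expect. A minimal sketch, assuming the text and per-character costs have already been copied into standard containers; MaskLowConfidence and the max_cost threshold are hypothetical, and the post does not state the scale of the cost values.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Replaces every character whose cost is above `max_cost` with '?'.
// Higher cost = lower confidence, so the test is "cost > threshold".
std::string MaskLowConfidence(const std::string& text,
                              const std::vector<float>& costs,
                              float max_cost) {
    assert(text.size() == costs.size());
    std::string out = text;
    for (std::size_t i = 0; i < out.size(); ++i) {
        if (costs[i] > max_cost) out[i] = '?';
    }
    return out;
}
```

Given the first caveat above (all letters in a word share one value), this will in practice mask whole words rather than single letters.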

Re: Tessnet2 - image processing problem

2010-06-30 Thread patrickq
I would highly question the assertion that www.free-ocr.com output is 100% accurate :-). In any case, it's expected that their output would be better: Tessnet2 is just an application wrapper around Tesseract while www.free-ocr.com includes their own image pre-processing prior to submitting it to

Re: Tessnet2 - image processing problem

2010-06-30 Thread patrickq
monochrome or grayscale images :) I'm using OpenCV, but I have no clue about what tesseract prefers... Cheers, Andres 2010/6/30 patrickq patrick.questemb...@gmail.com I would highly question the assertion that www.free-ocr.com output is 100% accurate :-). In any case, it's expected

Re: Detect numbers only

2010-06-29 Thread patrickq
I don't recommend the blacklist/whitelist approach because if you force Tess to recognize only digits, it will turn many letters into digits. We are using Tesseract 3.0 within our iPhone application http://www.scanbizcards.com and using that approach - there is a free version of the app if you

Re: Tesseract 3.0 for iPhone Compiling

2010-03-06 Thread patrickq
We have built the ScanBizCards iPhone application to scan business card images (see http://scanbizcards.com) using Tesseract 3.0 - you can install the free version from this page http://itunes.apple.com/us/app/scanbizcards-lite-business/id338143149?mt=8 For those of you interested in OCR on the

Re: Where can I download tesseract-ocr 3.0 source?

2010-02-16 Thread patrickq
What URL or instructions are you using? Those on http://code.google.com/p/tesseract-ocr/source/checkout work fine and get you the 3.0 sources. On Feb 16, 7:03 am, @hytgbn hyt...@gmail.com wrote: I want to download tesseract-ocr 3.0 but I cannot find it on download and source tab of google code

Re: Find out coordinates and bounding box of a word/phrase/paragraph

2010-02-05 Thread patrickq
Looks like I missed that one ... better late than never! Yes, the coordinates are returned in units of pixels. On Jan 7, 2:48 pm, jdevelop jdeve...@gmail.com wrote: On Jan 7, 9:22 pm, patrickq patrick.questemb...@gmail.com wrote: Yes, look up the definition of TesseractExtractResults

Re: Different results on subimages

2010-02-04 Thread patrickq
in the page. Cheers, Faisal On Wed, Feb 3, 2010 at 4:46 PM, patrickq patrick.questemb...@gmail.com wrote: Hi Francesco, Tesseract 3.0 actually recognizes all the digits in your sample image just great. I have processed your image using the ScanBizCards iPhone application (which uses Tesseract

Re: Parameters to Ignore rows under a minimum height

2010-01-22 Thread patrickq
input image. That said, I don't know how to suppress small rows efficiently. Andrei On Jan 17, 11:55 am, patrickq patrick.questemb...@gmail.com wrote: I am scanning images with large, clear text but on a grainy background and although I get the text fine, I also get myriads of irrelevant

Parameters to Ignore rows under a minimum height

2010-01-17 Thread patrickq
I am scanning images with large, clear text but on a grainy background and although I get the text fine, I also get myriads of irrelevant letters with a size of 3 or 5 pixels (way below a size at which anything could be recognized accurately). I could eliminate them based on size post-OCR but

Re: Find out coordinates and bounding box of a word/phrase/paragraph

2010-01-07 Thread patrickq
Yes, look up the definition of TesseractExtractResults: it returns the set of boxes for all characters it recognized, with blank characters (ascii 32) between words or lines (you have to map to a space or to a newline based on the X Y coordinates of the box before and after the delimiter). A word
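The space-or-newline decision described above can be sketched as follows. The heuristic is an assumption, not something the post spells out: treat the blank as a newline when the next box lies entirely below the previous one, or starts to the left of it. CharBox and ClassifyDelimiter are hypothetical names, and the code assumes a top-left origin (Y grows downward).

```cpp
#include <cassert>

// Character box in pixels, top-left origin (an assumption for this sketch).
struct CharBox {
    int left;
    int top;
    int right;
    int bottom;
};

// Decides whether the blank between `prev` and `next` is a space or a newline.
char ClassifyDelimiter(const CharBox& prev, const CharBox& next) {
    assert(prev.top <= prev.bottom && next.top <= next.bottom);
    const bool entirely_below = next.top >= prev.bottom;  // no vertical overlap
    const bool jumped_back_left = next.left < prev.left;  // carriage return
    return (entirely_below || jumped_back_left) ? '\n' : ' ';
}
```

Skewed or multi-column images will defeat a rule this simple, so treat it as a starting point.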

Re: Boxes have wrong coordinates

2009-11-30 Thread patrickq
Indeed - I am experiencing illegal box coordinates in about 40% of the images I scan (I am using Tesseract 3.0). You can find such an example on: http://www.scanbizcards.com/IMG_0735.JPG and the resulting boxes (after reducing image size by 2 along X and Y dimension) on:

Recognition confused between i (lowercase i) and l (lowercase l)

2009-11-19 Thread patrickq
I am getting a very high incidence of 'i' returned instead of 'l' and vice-versa, even with high quality images where every other letter is recognized fine. In appearance it feels like some image pre-processing done by Tesseract fuses the dot of the i with the bar of the i below it, making it

Re: Tesseract OCR produces non-existing spaces in the middle of the words: how to change spacing tolarance?

2009-11-13 Thread patrickq
I have had the same experience getting spaces in many spots where none should exist. Since I have no idea how to navigate the many Tess variables, my approach has been to test and remove such spaces myself post-scan, based on the width spacing of characters in the current word. Indeed italic or
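One way to do this kind of post-scan check is to ignore the spaces Tesseract reported and re-decide each one from the character boxes. This is a variant of the approach described above, not the exact one: the post does not give its rule, so the mean-width comparison, the Glyph struct, RebuildSpaces and min_gap_ratio are all assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Horizontal extent of one recognized non-blank character, in pixels.
struct Glyph {
    char ch;
    int left;
    int right;
};

// Rebuilds the text of one line from its glyphs, inserting a space only where
// the gap to the previous glyph is at least `min_gap_ratio` times the mean
// glyph width of the line.
std::string RebuildSpaces(const std::vector<Glyph>& glyphs, double min_gap_ratio) {
    assert(min_gap_ratio > 0.0);
    if (glyphs.empty()) return std::string();

    double total_width = 0.0;
    for (const Glyph& g : glyphs) total_width += g.right - g.left;
    const double mean_width = total_width / glyphs.size();

    std::string out(1, glyphs[0].ch);
    for (std::size_t i = 1; i < glyphs.size(); ++i) {
        const int gap = glyphs[i].left - glyphs[i - 1].right;
        if (gap >= min_gap_ratio * mean_width) out += ' ';
        out += glyphs[i].ch;
    }
    return out;
}
```

Italic text is where a fixed ratio is most likely to need adjusting, since slanted boxes overlap and shrink the measured gaps.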

Re: Encoding format with TesseractExtractResult?

2009-09-18 Thread patrickq
, or AllWordConfidences, or some combination. (In the 3.00 code on svn.) In some future version, it would be desirable to have an all-purpose function that extends TesseractExtractResult with more useful information. Ray. On Wed, Sep 16, 2009 at 12:43 PM, patrickq patrick.questemb...@gmail.com wrote