I think the crux of the problem is that you are trying to stop the OCR thread at a
random spot in its execution yet expect the state of the Tesseract instance
to be consistent. You are right to want to delete the instance, since otherwise
you would have a memory leak, but it looks like you can't do that after
I don't think so: the C++ code in Tesseract will consume memory from the
same heap as any other parts of the app so if you just kill the OCR thread
nothing will automatically release that memory and you just created a
memory leak - a large one too considering you are working on a large image.
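One way to avoid killing the thread outright is to let the OCR worker check a stop flag at points where the engine state is known to be consistent, and free the instance only after the worker has exited. A minimal sketch of that idea follows, assuming the work can be split into chunks; `OcrEngine` and `processChunk` are hypothetical stand-ins, not Tesseract classes or calls.

```cpp
#include <atomic>
#include <memory>
#include <thread>

// Hypothetical stand-in for the OCR engine; this is NOT a Tesseract class.
struct OcrEngine {
    int chunksDone = 0;
    void processChunk() { ++chunksDone; }  // pretend to recognize one strip of the image
};

// Runs the work chunk by chunk on a worker thread and checks the stop flag
// between chunks, so the engine is never abandoned in the middle of a call.
// Returns the number of chunks completed before the worker exited.
int runCancellableOcr(int totalChunks, std::atomic<bool>& stopRequested) {
    auto engine = std::make_unique<OcrEngine>();
    std::thread worker([&] {
        for (int i = 0; i < totalChunks; ++i) {
            if (stopRequested.load()) break;  // safe point: engine state is consistent
            engine->processChunk();
        }
    });
    worker.join();                 // wait until the worker has reached a safe point
    int done = engine->chunksDone;
    engine.reset();                // only now release the engine's memory
    return done;
}
```

Another thread would set `stopRequested` to `true` to ask for cancellation; the engine is deleted by the same code path whether the work finished or was cancelled, so nothing leaks.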
I am running the latest Tess 3.02 with the new English training set
and get the following crash at init with lang:
actual_tessdata_num_entries_ = TESSDATA_NUM_ENTRIES:Error:Assert
failed:in file tessdatamanager.cpp, line 48
Has anyone seen this?
Note: I am not using the cube version, just eng
I have had the opposite experience: Tess 3.01 beats 3.00 often - the
reverse does happen but rarely.
Note that Tess 3.01 will do WORSE if using Tess 3.00 trained data - is
it possible you are not using the Tess 3.01 trained data?
On Dec 13, 9:22 am, Alasdair 569...@googlemail.com wrote:
For
Yes, sure - there is a Tesseract API for that, it's called DetectOS
Note: this API will work only if there is enough text to make a
determination
On Oct 31, 7:46 am, emre yemrecavuso...@gmail.com wrote:
anyone who knows ?
On October 27, 09:27, Yunus Emre Cavusoglu yemrecavuso...@gmail.com
We have experienced precisely the same behavior with business cards
and found that for business cards optimal image size is around 1,024 x
768. Try that same size with your documents and see what happens -
just remember to adjust your size to the size of your documents - if
you have twice the
The basic reason it helps Tesseract to repeat text is because
Tesseract makes an initial assumption what kind of letters it is
looking at: tall (digits, uppercase letters, tall lowercase) or
lowercase letters. Only after it makes that assumption / guess will it
try to match the letters against the
What's PSM? Alternative spelling for PMS :-)?
On Oct 24, 1:35 pm, Quan Nguyen nguyen...@gmail.com wrote:
Try with PSM 8 or 10.
On Oct 24, 9:09 am, Giuseppe Menga me...@polito.it wrote:
That is interesting. I'm recognizing expiration dates from medicines, and I
found it convenient to
If you are referring to the confidence level values returned by
Tesseract, these are expressed as costs, which means a higher value
indicates lower confidence in the returned characters. In any case: why would
you ever expect Tesseract to say it's 100% sure ever about anything
(even when it happens to be
As Dmitri said, nothing can be said without your sample image. I'll
just say that, from our experience, Tesseract's decisions on where
spaces should appear are extremely poor. Within ScanBizCards we test
and revisit not only every space but also any two letters where a
space should be inserted. We
My experience has been that this mistake does occur - but far less
frequently than what you suggest. IMHO it's not even frequent enough
to make it a high priority issue, we have addressed many other more
important issues in our project and that issue is still only on the to-
do list.
Patrick
On
Very cool Robert! The video for our own Android app with Tesseract is
here: http://www.youtube.com/watch?v=jiPl_9rWoz0
Not as cool as yours though since we don't show anything in real time,
we have a bit of work to do before we scan the whole card and make
some adjustments and semantic analysis ...
As Zdenko pointed out Tesseract does NOT release the input image - nor
would it make any sense if it did as it would force the calling app to
make a copy of the image buffer every time it called Tesseract if it
needed to reuse it for other calls. Note also that all output
parameters such as text,
words ?
On July 27, 11:04, patrickq patrick.questemb...@gmail.com wrote:
As Zdenko pointed out Tesseract does NOT release the input image - nor
would it make any sense if it did as it would force the calling app to
make a copy of the image buffer every time it called Tesseract
ScanBizCards actually uses Tesseract 3.01 - I believe the fears
expressed by many on this forum about using non official versions of
Tesseract are misplaced. We switched from 2.04 to 3.00 as soon as 3.0
was made available - and only benefited from it - then switched to
3.01 quickly - and again
The answer is (of course) it depends:
1. If you compare Tesseract and ABBYY on the same image, without applying
preprocessing to it, ABBYY wins (because Tesseract's image processing
is very rudimentary - at best). Of course if your test images are
produced (for example) by a flatbed scanner, the lack
Yes, Tesseract blacklists and whitelists are useful almost
exclusively in situations where you really don't have the blacklisted
characters anywhere in the image (otherwise Tesseract will return the
next best guess, no matter how poor) or vice-versa where you have only
the whitelisted characters
segmentation with orientation
and script detection. (OSD)
I used a copy of eng.traineddata as osd.traineddata
HTH
Warm regards,
Dmitri Silaev www.CustomOCR.com
On Wed, Jun 22, 2011 at 9:05 AM, ogorman ogor...@gmail.com wrote:
On Jun 22, 6:48 am, patrickq patrick.questemb...@gmail.com
I tested it via ScanBizCards and indeed OSD has no issues whatsoever
getting it right - there is 10 times the amount of text it needs and
the image is very sharp, it's guaranteed to get it right. I am not
familiar with the command-line tools however so I can't help, I'll
just say that it should be
Tesseract is very poor at recognizing images with a mixture of non-text
blobs and text, especially when the non-text elements are larger
than the text. In these instances it is likely to completely ignore
the text. I suggest you do your own layout analysis and process
sub-images one by one -
You can definitely get just layout analysis before text recognition -
look at the FindLinesCreateBlockList() API and the BLOCK_LIST data
structure. You can then iterate through that structure to look at
blocks and rows within these blocks. Keep in mind that a sentence in
the image could be broken
I don't think you are doing anything wrong - I tested this with
ScanBizCards (Tesseract 3.01) and I get Very mpxe (note the same
mistake of x instead of l). I think this is yet another example of
Tesseract's poor recognition whenever it has either too little
information about height (as this case,
I pass a string as black list with these two characters: \ then t
and it seems that Tesseract interprets this as the tab character and
may return backslash in its OCR output (i.e. backslash is not
excluded)! Or if I pass \ then x, it treats it as x (but ignores the
backslash). I ended up passing \
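If the blacklist string is written as a C++ literal, it is worth ruling out the compiler's own escape processing before blaming Tesseract. The sketch below only shows the C++ side; it makes no claim about how Tesseract parses the string afterwards, and the helper names are made up for illustration.

```cpp
#include <string>

// "\t" in C++ source is ONE character (a tab); the compiler consumes the backslash.
inline std::string tabLiteral() { return "\t"; }

// "\\t" is TWO characters: a backslash followed by the letter t.
inline std::string backslashThenT() { return "\\t"; }

// A raw string literal keeps the backslash exactly as written.
inline std::string rawBackslashThenT() { return R"(\t)"; }
```

Printing the length of the string just before it is handed to the OCR engine is a quick way to confirm which of these forms is actually being passed.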
The dictionary is used along with a list of character combinations
considered to be ambiguous. This is a list that is part of the
training set. For example, it includes an entry that says that the
sequence rn can be mistaken for the letter 'm'. For each entry in
that list there is an indication
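To make the idea concrete, here is a rough sketch of how a dictionary plus an ambiguity list could be combined. This is not Tesseract's actual algorithm: the function name and the "try one substitution at a time" strategy are assumptions, and the per-entry indication mentioned above is left out.

```cpp
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// One ambiguity entry: the sequence `first` may really be `second` (e.g. "rn" -> "m").
using Ambiguity = std::pair<std::string, std::string>;

// Returns `word` if the dictionary accepts it; otherwise tries each single
// ambiguity substitution and returns the first variant found in the dictionary.
// Falls back to the original word when nothing matches.
std::string correctWithAmbiguities(const std::string& word,
                                   const std::unordered_set<std::string>& dictionary,
                                   const std::vector<Ambiguity>& ambiguities) {
    if (dictionary.count(word)) return word;
    for (const Ambiguity& amb : ambiguities) {
        for (std::string::size_type pos = word.find(amb.first);
             pos != std::string::npos;
             pos = word.find(amb.first, pos + 1)) {
            std::string candidate = word;
            candidate.replace(pos, amb.first.size(), amb.second);
            if (dictionary.count(candidate)) return candidate;
        }
    }
    return word;
}
```

With a dictionary containing "modern" and the entry ("rn", "m"), the misread "rnodern" would come back as "modern", while a word with no dictionary match is returned untouched.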
I don't have the answers to your questions but we pass a binary image
to Tesseract like you do, with values set to either 0 or 255.
Tesseract will threshold the image so we experimented with modifying
Tesseract to short-circuit the thresholding for performance reasons -
but then realized the
-
From: tesseract-ocr@googlegroups.com [mailto:tesseract-ocr@googlegroups.com]
On behalf of patrickq
Sent: Thursday, May 5, 2011 12:58
To: tesseract-ocr
Subject: Re: Binary Images using SetImage
I don't have the answers to your questions but we pass a binary image
to Tesseract like
I believe that you need to append / to the path - at least that's
what we do (successfully) and I think it failed without it. Tesseract
is apparently not recognizing anything not ending with / as a
directory.
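A small helper along these lines keeps the trailing slash from being forgotten. The function name is made up for illustration, and it assumes '/' as the separator, as in the post above.

```cpp
#include <string>

// Appends '/' to a directory path unless it already ends with one.
// An empty path is returned unchanged so it does not turn into the root "/".
std::string ensureTrailingSlash(const std::string& path) {
    if (path.empty() || path.back() == '/') return path;
    return path + '/';
}
```

The result can then be used wherever the data path is handed to Tesseract.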
Patrick
On May 2, 8:46 am, srinivasan srini@gmail.com wrote:
Hi jes,
I'm also
I am trying to provide a black list with UTF8 characters specified
using their byte codes, as follows:
// U+FB00 ff ef ac 80 LATIN SMALL LIGATURE FF
// U+FB01 fi ef ac 81 LATIN SMALL LIGATURE FI
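The byte codes listed above can be double-checked with a few lines of code. The sketch below is a generic UTF-8 encoder (the function name is an assumption, and it does not reject surrogates or out-of-range values); it produces ef ac 80 for U+FB00 and ef ac 81 for U+FB01.

```cpp
#include <cstdint>
#include <string>

// Encodes a Unicode code point (up to U+10FFFF) as a UTF-8 byte string.
// No validation is done for surrogates or values above U+10FFFF.
std::string encodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Concatenating the encoded strings for each unwanted character gives a blacklist value in UTF-8 without typing the bytes by hand.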
of the characters is not in the training
set in the first place.
On Mar 30, 10:33 pm, patrickq patrick.questemb...@gmail.com wrote:
I am trying to provide a black list with UTF8 characters specified
using their byte codes, as follows:
// U+FB00 ff ef ac 80 LATIN SMALL
Why not simply use a blacklist to exclude these unichars?
On Mar 18, 4:38 pm, mw1 man_...@yahoo.com wrote:
In evaluating the Tesseract 3.0/2.0x, I find that the trained data
eng.traineddata.gz from
http://code.google.com/p/tesseract-ocr/downloads/detail?name=eng.trai...
is much better
Tesseract 3.00 gets this text 100% correct, including the smudged
numbers at the bottom. See:
http://www.scanbizcards.com/plate1.jpg
http://www.scanbizcards.com/plate2.jpg
(scanning was done with ScanBizCards on an iPhone - if you try it
yourself with the app on Android or iPhone, please disable
You expect way too much from Tesseract: it's not Tesseract's job to
slice and dice the text according to various organizational
requirements of applications - that's for the application to handle.
You can get all the coordinates of all characters and easily determine
which ones are in what you
This is a known issue with Tesseract. One solution is to process the
OCR results then detect the size discrepancy between the two parts of
the line and then re-process each part as a separate image. In
essence, doing that prevents Tesseract from drawing bad inferences.
I think Tesseract 3.01
Here is the sequence of calls we are using to get the complete
information about text in the image:
myTess->SetImage(grayScaleImageData, grayScaleWidth, grayScaleHeight,
1, grayScaleWidth);
BLOCK_LIST* block_list = myTess->FindLinesCreateBlockList();
PAGE_RES* page_res_pass1 =
The answer lies within your own question! Since you expect only
digits, simply accept these letters as the equivalent digit by
replacing them.
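A minimal sketch of that replacement step, for fields known to contain only digits. The function name and the mapping table are illustrative assumptions, not a list taken from Tesseract; adjust the pairs to the confusions you actually see.

```cpp
#include <string>

// Maps letters commonly confused with digits to those digits; every other
// character passes through unchanged. The table is an illustrative assumption.
std::string lettersToDigits(std::string text) {
    for (char& c : text) {
        switch (c) {
            case 'O': case 'o': c = '0'; break;
            case 'I': case 'l': c = '1'; break;
            case 'Z':           c = '2'; break;
            case 'S':           c = '5'; break;
            case 'B':           c = '8'; break;
            default: break;
        }
    }
    return text;
}
```

This only makes sense when the whole field is expected to be numeric; applied to mixed text it would corrupt real letters.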
Patrick
On Mar 3, 5:09 am, Richard rhe...@dial.pipex.com wrote:
Hi,
I am really new to Tesseract OCR 3.0 as a static DLL within a Windows
environment
ScanBizCards (iPhone version) is using the Tesseract 3.0 orientation
detection, works quite well - accurate in 95%+ of cases and the 5%
failure cases are oftentimes because we scan business cards where
there isn't a lot of text to go by + there is a lot of non-text
confusing the detection.
in the result by examining the
value of the score that won.
Patrick
On Feb 28, 1:50 pm, Giuseppe Menga me...@polito.it wrote:
Patrick
just a hint of how to use the orientation functionality of Tesseract
Giuseppe
-Original Message-
From: patrickq
Sent: Monday, February 28, 2011 7:44 PM
I am working with the various international Latin training sets and am
discovering that most of them have plenty of letters that are entirely
illegal in that language. For example, the Latin letter S with caron
is in the German set (Unicode u0161, the caron looks like the bottom
half of a circle
This statement seems to be totally ignored by Tesseract:
myTess->SetVariable("tessedit_char_blacklist", "A")
and 'A' is still getting recognized. I could have sworn this worked
for me before. I am using Tesseract 3.0 - has the name of the variable
changed? Do I need to set another variable too? I am
See http://www.scanbizcards.com/touchingdigits.jpg
Includes a tel number where OO appear twice with no spacing, i.e.
touching. Tesseract fails on both sets, returning:
(65)81W6W instead of (65)8100 6002
(00 -> W and 002 -> W)
I have not seen Tesseract do well in hardly any situation where two
This is what I did:
1. Created a text file called eng.user-words, containing:
Chest
Chestnut
Floor
Vice
2. Placed the file in the tessdata folder (next to eng.traineddata)
3. Ran recognition on an image returning Chesf instead of Chest
and Fioor instead of Floor. Both mistaken f and i look quite
Tesseract's space handling needs a total rewrite - it's not me saying
so, it's Ray Smith in a previous post in this forum.
Specifically, after the digit '1' Tesseract appears to struggle more
than usual with spaces, probably because it attaches too much
importance to the width of the last
which has been discussed in the last three days.
Please look in the archives or check the emails you've received from
the list for the last few days.
--Sven
On Fri, Jul 30, 2010 at 8:04 AM, patrickq patrick.questemb...@gmail.com
wrote:
This is what I did:
1. Created a text file called
Keep in mind that accuracy depends heavily on the right fonts being
included in the training set. I have no reason to believe that the
2.04 and 3.0 training sets are identical - perhaps someone could
enlighten us. In any case, I routinely come across certain pages
where recognition is terrible
Looks like a simple case of Y inversion - try transforming whatever Y
values you thought were right into (height - Y) where height is the
height of the image.
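The transformation can be sketched as below. The struct and function names are assumptions; the sketch follows the (height - Y) rule from the post, which treats Y values as edge coordinates (for pixel row indices the formula would be height - 1 - Y).

```cpp
// A box described by its X range and its Y range (yMin <= yMax).
struct Box { int left, yMin, right, yMax; };

// Flips a box between a bottom-left origin and a top-left origin.
// The Y range is mirrored, so the old yMax becomes the new yMin.
Box flipY(const Box& b, int imageHeight) {
    return Box{b.left, imageHeight - b.yMax, b.right, imageHeight - b.yMin};
}
```

Applying `flipY` twice gives back the original box, which is a handy sanity check when it is unclear which convention a set of coordinates uses.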
On Jul 27, 1:28 am, Andres andrej...@gmail.com wrote:
Hello people,
I'm trying to get the characters coordinates but I'm getting them
, 8:30 am, Jimmy O'Regan jore...@gmail.com wrote:
On 19 July 2010 13:20, patrickq patrick.questemb...@gmail.com wrote:
This is a great example of a serious problem with Tesseract when
analyzing any image with fonts of variable sizes such as a street
sign, flyer, business card etc. What
impact on this if I
tweak it?
I am not really sure I understand the significance of the values passed for
this option though.
Thanks
Austin
-Original Message-
From: patrickq
Sent: Monday, July 19, 2010 9:00 AM
To: tesseract-ocr
Subject: Re: Tesseract Reading Issue
Setting
words
being output without spaces between them (needs to be lower), or if
you get spaces between letters (needs to be higher).
I am not really sure I understand the significance of the values passed
for
this option though.
Thanks
Austin
-Original Message- From: patrickq
Sent
Just my $0.02 opinion but I highly doubt that you are going to elicit
a lot of helpful response by approaching an open source community with
such secrecy (note the word open in open source ...).
On Jul 13, 4:34 am, sai saikumar@gmail.com wrote:
Hi, I want to port this engine in our specific
TesseractExtractResult() returns the confidence numbers for all
characters returned. A high number means low confidence. Caveats:
1. The confidence numbers are the same for all letters in a word (even
though Tesseract does compute confidence numbers for each letter, it
just doesn't return them to
I would highly question the assertion that www.free-ocr.com output is
100% accurate :-). In any case, it's expected that their output would
be better: Tessnet2 is just an application wrapper around Tesseract
while www.free-ocr.com includes their own image pre-processing prior
to submitting it to
monochrome or grayscale images :) I'm using
OpenCV, but I have no clue about what Tesseract prefers...
Cheers,
Andres
2010/6/30 patrickq patrick.questemb...@gmail.com
I would highly question the assertion that www.free-ocr.com output is
100% accurate :-). In any case, it's expected
I don't recommend the blacklist/whitelist approach because if you
force Tess to recognize only digits, it will turn many letters into
digits. We are using Tesseract 3.0 within our iPhone application
http://www.scanbizcards.com and using that approach - there is a free
version of the app if you
We have built the ScanBizCards iPhone application to scan business
card images (see http://scanbizcards.com) using Tesseract 3.0 - you
can install the free version from this page
http://itunes.apple.com/us/app/scanbizcards-lite-business/id338143149?mt=8
For those of you interested in OCR on the
What URL or instructions are you using? Those on
http://code.google.com/p/tesseract-ocr/source/checkout work fine and
get you the 3.0 sources.
On Feb 16, 7:03 am, @hytgbn hyt...@gmail.com wrote:
I want to download tesseract-ocr 3.0
but I cannot find it on download and source tab of google code
Looks like I missed that one ... better late than never! Yes, the
coordinates are returned in units of pixels.
On Jan 7, 2:48 pm, jdevelop jdeve...@gmail.com wrote:
On Jan 7, 9:22 pm, patrickq patrick.questemb...@gmail.com wrote:
Yes, look up the definition of TesseractExtractResults
in
the page.
Cheers,
Faisal
On Wed, Feb 3, 2010 at 4:46 PM, patrickq patrick.questemb...@gmail.com wrote:
Hi Francesco,
Tesseract 3.0 actually recognizes all the digits in your sample image
just great. I have processed your image using the ScanBizCards iPhone
application (which uses Tesseract
input image.
That said, I don't know how to suppress small rows efficiently.
Andrei
On Jan 17, 11:55 am, patrickq patrick.questemb...@gmail.com wrote:
I am scanning images with large, clear text but on a grainy background
and although I get the text fine, I also get myriads of irrelevant
I am scanning images with large, clear text but on a grainy background
and although I get the text fine, I also get myriads of irrelevant
letters with a size of 3 or 5 pixels (way below a size at which
anything could be recognized accurately). I could eliminate them based
on size post-OCR but
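For reference, the post-OCR size filter mentioned above could look like the sketch below. The struct, the function name and the idea of using box height alone are assumptions; the threshold has to be tuned to the images.

```cpp
#include <vector>

// A recognized character with its bounding box (top-left origin assumed).
struct CharBox { char ch; int left, top, right, bottom; };

// Keeps only characters whose box is at least minHeight pixels tall.
std::vector<CharBox> dropTinyBoxes(const std::vector<CharBox>& boxes, int minHeight) {
    std::vector<CharBox> kept;
    for (const CharBox& b : boxes) {
        if (b.bottom - b.top >= minHeight) kept.push_back(b);
    }
    return kept;
}
```

Note that a plain height test would also throw away small punctuation such as periods and commas, so a real filter would probably compare each box against its neighbors as well.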
Yes, look up the definition of TesseractExtractResults: it returns the
set of boxes for all characters it recognized, with blank characters
(ascii 32) between words or lines (you have to map to a space or to a
newline based on the X Y coordinates of the box before and after the
delimiter). A word
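The space-versus-newline decision could be sketched like this. It is a heuristic under stated assumptions (top-left origin, left-to-right text, made-up names), not what Tesseract itself does.

```cpp
// Bounding box of a recognized character (top-left origin assumed).
struct Box { int left, top, right, bottom; };

// Decides whether the blank between two character boxes is a space or a
// newline: if the next box starts below the previous one, or jumps back to
// the left of it, the blank is treated as a line break.
char classifyDelimiter(const Box& before, const Box& after) {
    bool movedDown = after.top >= before.bottom;
    bool movedBackLeft = after.left < before.left;
    return (movedDown || movedBackLeft) ? '\n' : ' ';
}
```

Here `before` and `after` are the boxes of the characters on either side of the blank (ascii 32) entry in the results.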
Indeed - I am experiencing illegal box coordinates in about 40% of the
images I scan (I am using Tesseract 3.0). You can find such an example
on:
http://www.scanbizcards.com/IMG_0735.JPG
and the resulting boxes (after reducing image size by 2 along X and Y
dimension) on:
I am getting a very high incidence of 'i' returned instead of 'l' and
vice-versa, even with high quality images where every other letter is
recognized fine. In appearance it feels like some image pre-processing
done by Tesseract fuses the dot of the i with the bar of the i below
it, making it
I have had the same experience getting spaces in many spots where none
should exist. Since I have no idea how to navigate the many Tess
variables, my approach has been to test and remove such spaces myself
post-scan, based on the width spacing of characters in the current
word. Indeed italic or
, or AllWordConfidences, or some combination. (In the 3.00 code on
svn.) In some future version, it would be desirable to have an all-purpose
function that extends TesseractExtractResult with more useful information.
Ray.
On Wed, Sep 16, 2009 at 12:43 PM, patrickq
patrick.questemb...@gmail.com wrote