Re: Tesseract 3.01 Released

2011-10-30 Thread Cong Nguyen
Thanks man!

On Sunday, October 30, 2011, zdenko podobny  wrote:
> Hello all,
> Tesseract 3.01 was released and you can find it in download section [1]
> or on the Project page in section "Featured".
> Windows installer was built on Windows XP SP3 with VC++ 2008 Express, so
> maybe you will need Microsoft Visual C++ 2008 SP1 Redistributable Package
> (x86) [2]. Tesseract.exe is a static build from the VS2008 solution, so no
> additional libraries are needed.
> Language data files created for 3.00 can be used with 3.01.
> Language data files created with tesseract 3.01 will not work with
> tesseract 3.00.
>
> Tesseract release notes - V3.01
>
> Thread-safety! Moved all critical globals and statics to members of the
> appropriate class. Tesseract is now thread-safe (multiple instances can be
> used in parallel in multiple threads), with the minor exception that some
> control parameters are still global and affect all threads.
> Added Cube, a new recognizer for Arabic. Cube can also be used in
> combination with normal Tesseract for other languages with an improvement
> in accuracy at the cost of (much) lower speed. There is no training module
> for Cube yet.
> OcrEngineMode in Init replaces AccuracyVSpeed to control cube.
> Greatly improved segmentation search with consequent accuracy and speed
> improvements, especially for Chinese.
> Added PageIterator and ResultIterator as cleaner ways to get the full
> results out of Tesseract, that are not currently provided by any of
> the TessBaseAPI::Get* methods. All other methods, the ETEXT_STRUCT in
> particular, are deprecated and will be deleted in the future.
> ApplyBoxes totally rewritten to make training easier. It can now cope
> with touching/overlapping training characters, and a new boxfile format
> allows word boxes instead of character boxes, BUT to use that you have to
> have already bootstrapped the language with character boxes. "Cyclic
> dependency" on traineddata.
> Auto orientation and script detection added to page layout analysis.
> Deleted lots of dead code.
> Fixxht module replaced with scalable data-driven module.
> Output font characteristics accuracy improved.
> Removed the double conversion at each classification.
> Upgraded oldest structs to be classes and deprecated PBLOB.
> Removed non-deterministic baseline fit.
> Added fixed length dawgs for Chinese.
> Handling of vertical text improved.
> Handling of leader dots improved.
> Table detection greatly improved.
> Fixed a couple of memory leaks.
> Fixed font labels on output text. (Not perfect, but a lot better than
> before.)
> Cleanup and more bug fixes.
> Special treatments for Hindi.
> Support for building in VS2010 with Microsoft Windows SDK for Windows 7
> (thanks to Michael Lutz)
>
> [1] http://code.google.com/p/tesseract-ocr/downloads/list
> [2] http://www.microsoft.com/download/en/details.aspx?id=5582&WT.mc_id=MSCOM_EN_US_DLC_DETAILS_121LSUS007998

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to
tesseract-ocr+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en


New version of tesseractdotnetwrapper

2011-07-05 Thread Cong Nguyen
Dear all,

A new version of tesseractdotnetwrapper was released on July 4, 2011.

It is based on tesseract-ocr v3.01 r590.

Here are the links:
http://code.google.com/p/tesseractdotnet/
http://tesseractdotnet.googlecode.com/files/tesseractdotnetwrapper_r590.zip


Change log and notes:
- libleptxxx.dll is replaced by linking its static library directly into
  the tesseract project,
- use the ROI and UseROI properties for recognition in a region of interest
  (see Simple3.cs in the IPoVnOCRer project),
- added CreateBinaryPix() and CreateGreyPix() in PixConverter (PixFromImage
  may be deprecated),
- created a new document layout structure (adapted to tesseract-engine v3.01
  r590): DocumentLayout >> Block >> Paragraph >> TextLine >> Word >>
  Character/Symbol,
- the tessdata path can be set,
- recognition with other OcrEngineMode values is possible,
- layout analysis only is possible,
- recognition can run in parallel (see Simple3.cs in the IPoVnOCRer project),
- IPoVn.IPCore is used to load/save/crop/invert/binarize images in a generic
  image format,
- to test tesseract.dll in .net 4/vs2010, please look at Simple1.cs only; the
  other *.cs files may require compiling the IPoVn.IPCore project,
- you can define your own processing flow:
  0. Do some pre-processing first (in my case, adaptive thresholding was
     performed).
  1. AnalyseLayout() -> get blobs (block/paragraph/textline/word).
  2. Do some image processing for each blob.
  3. Recognize each ROI/blob with the corresponding
     OcrEngineMode/PageSegmentMode.
  4. Do some post-processing (VietOCR is an example).

IPoVnOCRer project:
- Simple1.cs: uses tesseract.dll only
  - example of recognition and layout analysis.
- Simple2.cs: tesseract.dll + IPoVn.IPCore + IPoVnSystem
  - example of recognition and layout analysis after performing adaptive
    thresholding.
- Simple3.cs: tesseract.dll + IPoVn.IPCore + IPoVnSystem
  - example of recognition in ROIs.


Hope that it is helpful.
Cong.



RE: Is there a way to specify language data (*including hard path*) from the command line?

2011-05-21 Thread Cong Nguyen
If you have to change the code to resolve the problem, you can refer to
http://code.google.com/p/tesseractdotnet/source/browse/trunk/dotnetwrapper/TesseractEngineWrapper/tessedit.cpp

Search for "_BUILDASDLL" in that file for more details.

Hope it helps!

Cong.

-Original Message-
From: tesseract-ocr@googlegroups.com [mailto:tesseract-ocr@googlegroups.com]
On Behalf Of Daniel
Sent: Sunday, May 22, 2011 2:35 AM
To: tesseract-ocr
Subject: Re: Is there a way to specify language data (*including hard path*)
from the command line?

Thanks for the replies.

As I said before, I was reluctant to modify the TESSDATA_PREFIX
environment variable, since it could hypothetically interfere with
another installation of Tesseract the user might have. After thinking
it over, I realized that the simplest solution was just to ignore the
whole problem. If the variable existed, Tesseract would try to use the
pre-installed language files. Otherwise, it would use the language
files in the directory it was run from.

This was still less than ideal since I wanted to do some custom
training, which would be overridden if another installation existed.
Eventually, I just tweaked the code to ignore TESSDATA_PREFIX
altogether and assume the language data was stored in a fixed location
relative to the engine. Peachy!
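
The lookup order discussed in this thread can be modelled roughly as below. This is only a sketch of the rules as they are described in these messages, not Tesseract's actual code; the function name `resolve_tessdata_dir` and its arguments are made up for illustration.

```python
def resolve_tessdata_dir(env, exe_dir, data_sub_dir="tessdata/"):
    """Return the directory that would be searched for language files.

    env          -- mapping of environment variables (for example os.environ)
    exe_dir      -- directory of the executable or DLL, or None if unknown
    data_sub_dir -- value of the "m_data_sub_dir" parameter
    """
    prefix = env.get("TESSDATA_PREFIX")
    if prefix is None:
        # No environment variable: use the module's own location, or the
        # current working directory if that location cannot be obtained.
        prefix = exe_dir if exe_dir is not None else "./"
    # Both parts are expected to end with a path separator.
    if not prefix.endswith(("/", "\\")):
        prefix += "/"
    return prefix + data_sub_dir
```

With the variable set, the pre-installed files win; without it, the files next to the engine are used, which is the behaviour Daniel relied on before patching the code.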

On May 21, 12:43 pm, Dmitri Silaev  wrote:
> Maybe you don't need all these details but at least this can be useful
> for other forum users. The rules are a bit complex.
>
> As Quan said, Tesseract first looks if the TESSDATA_PREFIX environment
> variable exists. If it does, Tess then appends to it the string stored
> in the "m_data_sub_dir" param. The resulting dir name is where Tess
> looks for lang files.
>
> Now the hard part:
> - If the environment variable does not exist, under Windows Tess
> checks if it's run via a DLL or linked statically into an EXE.
> Whichever is in effect, its location is taken as a base directory. The
> name of the DLL being checked is stored in the "tessedit_module_name"
> config param, default is "tessdll.dll".
> - If for some reason Tess cannot obtain executable's file name, as a
> base directory it takes the current working directory (namely "./").
> - By default "m_data_sub_dir" is "tessdata/" but it can be altered via
> a config file (you can specify it in the command line).
> - Both the environment variable and "m_data_sub_dir" should contain
> trailing "/".
> - Windows installer automatically creates TESSDATA_PREFIX and sets it
> to "\Tesseract-OCR\".
>
> So, if you don't want to deal with environment variables, you can stick to
> a config file and set "m_data_sub_dir" to point to any directory you
> like using a *relative* path. Well, almost to any. It should be on the
> same drive.
>
> Warm regards,
> Dmitri Silaev
> www.CustomOCR.com
>
> On Fri, May 20, 2011 at 2:47 PM, Daniel  wrote:
> > I'm attempting to integrate Tesseract 3 with another stand-alone app,
> > but I'm running into a problem: Tesseract always looks for the
> > language files in "\Program Files (x86)\Tesseract-OCR\tessdata"; I
> > need to store the language files in a different location (a subfolder
> > of my app's installation folder.)
>
> > I'm assuming Tesseract is getting this folder from the registry, so I
> > could just change the installation path listed, but (a) I don't want
> > to break user's possible other installations, and (b) I tried that and
> > it (inexplicably) didn't work.
>
> > Is there a way to specify the hard path from the command line, or do I
> > have to modify the code?
>




RE: tips for improving Tesseract accuracy and speed...

2011-03-30 Thread Cong Nguyen
Please refer to "OPTIMIZING SPEED FOR ADAPTIVE LOCAL THRESHOLDING ALGORITHM
USING DYNAMIC PROGRAMMING".
Its complexity is O(n), where n is the number of pixels.

-Original Message-
From: tesseract-ocr@googlegroups.com [mailto:tesseract-ocr@googlegroups.com]
On Behalf Of Max Cantor
Sent: Thursday, March 31, 2011 7:28 AM
To: tesseract-ocr@googlegroups.com
Cc: tesseract-ocr@googlegroups.com
Subject: Re: tips for improving Tesseract accuracy and speed...

Yes. I've had great experience with sauvola binarize from leptonica. Gamer
works too but is much much slower

On Mar 31, 2011, at 0:02, cong nguyenba  wrote:

> I have another approach for you here: try applying binarization with an
> adaptive threshold! For speedup, delve into the engine by following the
> adaptive classification in the source code! I think that is enough to
> meet your expectations!
> 
> On Wednesday, March 30, 2011, Dmitri Silaev  wrote:
>> P.S.: If you're still sure that reasonable downscaling of your images
>> sacrifices the accuracy, please share one or two of your *unprocessed*
>> images to investigate further.
>> 
>> And I'd suggest to keep up with the latest revisions of Tesseract. The
>> API changes significantly, but Tess is definitely being improved in
>> the sense of stability, new capabilities and also code efficiency,
>> which explicitly may lead to improved performance which you are
>> looking for.
>> 
>> Warm regards,
>> Dmitri Silaev
>>
>> On Tue, Mar 29, 2011 at 8:17 AM, Andres  wrote:
>>> ...required.
>>>
>>> Hello people,
>>>
>>> I have been developing a licence plate recognition system for a long
>>> time and I still have to improve the use of Tesseract to make it usable.
>>>
>>> My first concern is about speed:
>>> After extracting the licence plate image, I get an image like this:
>>>
>>> https://docs.google.com/leaf?id=0BxkuvS_LuBAzNmRkODhkYTUtNjcyYS00Nzg5LWE0ZDItNWM4YjRkYzhjYTFh&hl=en&authkey=CP-6tsgP
>>>
>>> As you may see, there are only 6 characters (tess is recognizing more
>>> because there are some blemishes over there, but I get rid of them with
>>> some postprocessing of the layout of the recognized chars)
>>>
>>> On an Intel i7 720 (good power, but using a single thread) the tesseract
>>> part is taking something like 230 ms. This is too much time for what I
>>> need.
>>>
>>> The image is 500 x 117 pixels. I noted that when I reduce the size of
>>> this image the detection time is reduced in proportion with the image
>>> area, which makes good sense. But the accuracy of the OCR is poor when
>>> the character height is below 90 pixels.
>>>
>>> So, I assume that there is a problem with the way I trained tesseract.
>>>
>>> Because the characters in the plates are assorted (3 alphanumeric, 3
>>> numeric) I trained it with just a single image with all the letters in
>>> the alphabet. I saw that you suggest large training but I imagine that
>>> that doesn't apply here where the characters are not organized in words.
>>> Am I correct with this?
>>>
>>> So, for you to see, this is the image with which I trained Tesseract:
>>>
>>> https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0BxkuvS_LuBAzODc1YjIxNWUtNzIxMS00Yjg3LTljMDctNDkyZGIxZWM4YWVm&hl=en&authkey=CMXwo-AL
>>>
>>> In this image the characters are about 55 pixels high.
>>>
>>> Then, for frequent_word_list and words_list I included a single entry
>>> for each character, I mean, something starting with this:
>>>
>>> A
>>> B
>>> C
>>> D
>>> ...
>>>
>>> Do you see something to be improved in what I did? Should I perhaps use
>>> a training image with more letters, with more combinations? Will that
>>> help somehow?
>>>
>>> Should I include in the same image a copy of the same character set but
>>> with a smaller size? In that way, will I be able to pass Tesseract
>>> smaller images and get more speed without sacrificing detection quality?
>>>
>>>
>>> On the other hand, I found some strange behavior of Tesseract about
>>> which I would like to know a little more:
>>> In my preprocessing I tried Otsu thresholding
>>> (http://en.wikipedia.org/wiki/Otsu%27s_method) and visually I got much
>>> better results, but surprisingly for Tesseract it was worse. It decreased
>>> the thickness of the strokes of the chars, and the chars I used to train
>>> Tesseract were bolder. So, does Tesseract match the "boldness" of the
>>> characters? Should I train Tesseract with different levels of boldness?
>>>
>>> I'm using Tesseract 2.04 for this. Do you think that some of these
>>> issues will go better by using Tess 3.0?
>>>
>>>
>>> Thanks,
>>>
>>> Andres

RE: image binarization

2011-03-02 Thread Cong Nguyen
Please be careful with the Otsu algorithm, because it uses only one threshold
value for the whole image.

No method is best for all cases :).

You should run both the Otsu algorithm and an adaptive threshold algorithm
and compare the results.

For the adaptive threshold algorithm, you can build on the integral image
(popularized by Paul Viola et al.) to increase performance.
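
As a rough illustration of the integral image idea, here is a minimal pure-Python sketch of a local-mean adaptive threshold. The function names, the window `radius` and the `offset` parameter are my own choices for the example, not something taken from a particular library.

```python
def integral_image(img):
    """Summed-area table with an extra leading row and column of zeros."""
    h, w = len(img), len(img[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def adaptive_threshold(img, radius=1, offset=0):
    """Mark a pixel as foreground (1) when it is darker than the mean of its
    (2*radius+1) x (2*radius+1) neighbourhood minus `offset`."""
    h, w = len(img), len(img[0])
    sat = integral_image(img)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        for x in range(w):
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            # Window sum from four table lookups, independent of window size.
            total = sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]
            count = (y1 - y0) * (x1 - x0)
            # Same test as img[y][x] < total / count - offset, without division.
            if img[y][x] * count < total - offset * count:
                out[y][x] = 1
    return out
```

Each window sum costs four lookups whatever the window size, so the whole image is processed in time proportional to the number of pixels.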

 

Cong.

 

From: tesseract-ocr@googlegroups.com [mailto:tesseract-ocr@googlegroups.com]
On Behalf Of Saurabh Gandhi
Sent: Wednesday, March 02, 2011 3:34 PM
To: tesseract-ocr@googlegroups.com
Cc: Bikash Bag
Subject: Re: image binarization

 

You can use Otsu's binarisation algorithm:

http://www.sas.bg/code-snippets/image-binarization-the-otsu-method.html

--
Regards,
Saurabh Gandhi





On Wed, Mar 2, 2011 at 1:45 PM, Bikash Bag  wrote:

Hi, I am new to OCR, can anyone please tell me a good image binarization
algorithm. 

regards,
bikash


 





RE: text rotated upside down or of 90°

2011-02-28 Thread Cong Nguyen
Dear Giuseppe,

Could you post some samples to analyze?

If you are afraid that tesseract page layout doesn't work on rotated images,
you can go step by step as follows:

1. First, call FindLinesCreateBlockList (have a look at the TessBaseAPI
class); you should get a BLOCK_LIST.

2. Now check the BLOCK_LIST. Each BLOCK in it has these members (only
member fields are shown):

    ...
    ROW_LIST rows;        //< rows in block
    ...
    FCOORD skew_;         //< Direction of true horizontal.
    ICOORD median_size_;  //< Median size of blobs.

  And here is the ROW class:

    inT32 kerning;        //inter char gap
    inT32 spacing;        //inter word gap
    TBOX bound_box;       //bounding box
    float xheight;        //height of line
    float ascrise;        //size of ascenders
    float descdrop;       //-size of descenders
    WERD_LIST words;      //words
    QSPLINE baseline;     //baseline spline
    ...

  A page contains block(s), and a block contains row(s).

3. Try to visualize whatever you need to get an overview of how the
segmentation/detection step worked...

Also, if you want to understand how tesseract works, please read the
papers in the doc folder; they were published by Ray.

Hope it's helpful to you!

Cong.




RE: Wrappers for tessearct3.01?

2011-02-24 Thread Cong Nguyen
Dear devTess,

I do not plan to implement a delegate event at the engine level, so you
should handle this at a higher level in your own code.

Example:
 
 string result = _ocrProcessor.Apply(bitmap);
 List detectedWords = _ocrProcessor.RetriveResultDetail();
  // you can raise event/do some things here if you want...

Thanks,
Cong.

-Original Message-
From: tesseract-ocr@googlegroups.com [mailto:tesseract-ocr@googlegroups.com]
On Behalf Of devTess
Sent: Thursday, February 24, 2011 9:03 PM
To: tesseract-ocr
Subject: Re: Wrappers for tessearct3.01?

HI Cong Nguyen,
Exactly what I need.
Would you be implementing a delegate event for the monitor class, similar
to tessnet2?

Finally, someone did it. Thanks
J.

On Feb 21, 10:44 am, Cong Nguyen  wrote:
> Dear devTess,
>
> I have just implemented a simple .net wrapper:
>
> http://code.google.com/p/tesseractdotnet/w/list
>
> Hope it's helpful!
> Cong.





tesseract engine 3 .net wrapper v1.0 beta

2011-02-23 Thread Cong Nguyen
Dear all,

Here is link to the project: http://code.google.com/p/tesseractdotnet/

This version supports:
- extracting the locations of detected characters
- running in parallel on a set of images

This version runs slower than the original, because I have to save
the image being processed to a memory stream in TIFF format.

Here is application's screenshot:
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576837564838753458

Cong.




RE: How to extract the images of each word from the whole image page?

2011-02-22 Thread Cong Nguyen
If you used tesseract 2.04, you should have a look at tessdll\tessdll.cpp.

From tesseract 3.x on, ocrshell.h/.cpp have been removed, so you need to
work backward from the older code.

I am trying to do the same in my project and hope to release it soon.

 

Cong.

 

From: tesseract-ocr@googlegroups.com [mailto:tesseract-ocr@googlegroups.com]
On Behalf Of longyi
Sent: Sunday, February 20, 2011 5:54 AM
To: tesseract-ocr@googlegroups.com
Subject: How to extract the images of each word from the whole image page?

 

Hey everyone,
I am trying to use tesseract to extract the image of each word from a
scanned image (say, a Chinese article).
I have looked into the code for several days, but I am still unable to find
a way to do that.
It seems like the engine tries different combinations of adjacent connected
components to recognize a single word.
Does anyone have a suggestion on that? I really need your help, thanks!

It seems like my previous post failed, so I am trying again.





RE: Image pre-processing for good OCR results

2011-02-22 Thread Cong Nguyen
Dear Jon,

 

To begin the analysis, I also tried to detect lines and corners, but the
results were not good. I think this is because the images are low contrast.

Please try to analyze with some line profiles of the data:

ROI-left-profile:
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576706091073985362

ROI-top-profile:
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576706094761082706

ROI-right-profile:
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576706102033630978

ROI-bottom-profile:
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576706106389606898

After ROI detection, you may need to align the image.

My solution for this step is:

- Detect all lines (Hough transform approach), then keep only the lines
  whose slopes are close to horizontal.
- Estimate the base slope as the mean slope.
- Align the image.

Here are the detected lines:
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576709473940745778
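
The slope-estimation part of the alignment steps above could look roughly like this. It is a sketch only: it assumes the Hough step has already produced line segments as `(x1, y1, x2, y2)` tuples, and the function name and the 15-degree tolerance are arbitrary choices for the example.

```python
import math


def estimate_skew_degrees(lines, max_angle=15.0):
    """Mean angle, in degrees, of the near-horizontal segments in `lines`.

    lines     -- iterable of (x1, y1, x2, y2) segments
    max_angle -- segments steeper than this are ignored
    Returns 0.0 when no segment qualifies.
    """
    angles = []
    for x1, y1, x2, y2 in lines:
        if x1 == x2 and y1 == y2:
            continue  # degenerate segment, no direction
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
        # Fold the direction so a segment and its reverse give the same angle.
        if angle > 90:
            angle -= 180
        elif angle <= -90:
            angle += 180
        if abs(angle) <= max_angle:
            angles.append(angle)
    return sum(angles) / len(angles) if angles else 0.0
```

Rotating the image by the negative of the returned angle would then align the text lines with the horizontal axis.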

 

Hope it's helpful to you!

 

Good luck,

Cong.

 




RE: Image pre-processing for good OCR results

2011-02-22 Thread Cong Nguyen
Dear Andres,

 

The recognition results I showed were obtained with my simple tesseract
engine 3.01 .net wrapper (link here:
http://code.google.com/p/tesseractdotnet/).

For ROI detection I cropped the ROI manually; after that I used my company's
software to filter.

As for filtering, you can analyze a control set to find a feasible way to
estimate the parameters.

 

Thanks,

Cong.

 

 

From: tesseract-ocr@googlegroups.com [mailto:tesseract-ocr@googlegroups.com]
On Behalf Of Andres
Sent: Wednesday, February 23, 2011 4:02 AM
To: tesseract-ocr@googlegroups.com
Subject: Re: Image pre-processing for good OCR results

 

Hello,

A few comments from my side, sorry for being disordered, but I have not much
time right now.

In OpenCV you can use thresholding with the Otsu algorithm, it's not
documented in the documentation of the threshold function, but the parameter
is CV_THRESH_OTSU.

Otsu thresholding calibrates its threshold from a histogram computed
beforehand:
http://en.wikipedia.org/wiki/Otsu%27s_method
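
For readers who want to see what that histogram step amounts to, here is a minimal pure-Python sketch of Otsu's method for 8-bit grey values. It is an illustration written for this note, not the OpenCV implementation.

```python
def otsu_threshold(pixels):
    """Return the threshold t in 0..255 that maximizes the between-class
    variance, where class 0 is pixels <= t and class 1 is pixels > t."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(256))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in class 0
    sum0 = 0  # sum of grey values in class 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # class 0 still empty
        w1 = total - w0
        if w1 == 0:
            break  # class 1 empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

The result is a single global threshold, so uneven lighting across the image is not compensated for.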

I tried it in my project (a licence plate recognition system) and visually
I got much better results, but surprisingly for Tesseract it was worse. It
changed the thickness of the strokes of the letters, and when I trained
Tesseract the letters were bolder than the results of the Otsu threshold, so
perhaps that is the explanation for my problem. So, perhaps it would be a
good solution for you.

If you want to make some rapid tests with OpenCV for preprocessing you can
use this:

http://code.google.com/p/cvpreprocessor/

It's not a complete tool but it helps.

I think that your system is close to mine in certain aspects. I was thinking
in doing some skeletonization or something like that for the fonts and then
training Tesseract with these modified letters. Then doing the same process
with the acquired images and executing Tesseract. I didn't try that yet.

Skeletonization:
http://homepages.inf.ed.ac.uk/rbf/HIPR2/skeleton.htm

In accordance with what Tom Morris said, you have some constraints in text
layout. Tesseract gives you the coordinates of each character. You can work
with that. Perhaps you will need some grouping algorithm like k-means to
gather some statistics: http://en.wikipedia.org/wiki/Kmeans
OpenCV has an implementation of k-means; ask me for a snippet if you
need it.
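
To make the grouping idea concrete without depending on OpenCV, here is a small one-dimensional k-means sketch in plain Python. It assumes you group characters by a single coordinate (for example the vertical centre of each character box) and that you know the number of text lines `k` in advance; the function name and the evenly spread starting centres are choices made for this example.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster numbers into k groups. Returns (centers, labels)."""
    lo, hi = min(values), max(values)
    # Deterministic start: centres spread evenly over the value range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest centre.
        labels = [min(range(k), key=lambda c: abs(v - centers[c]))
                  for v in values]
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, labels
```

For example, `kmeans_1d([10, 12, 11, 50, 52, 51], 2)` separates the values around 11 from the values around 51, which corresponds to two text lines.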

Question to Cong Nguyen: Is the program that you used here available on
the web, or is it something that you have for your projects?
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576366764338605922

Cheers,

Andres
www.visiondepatentes.com.ar




2011/2/22 Tom Morris 

On Feb 20, 9:02 pm, Jon Andersen  wrote:

> My project at http://RecordAGrave.com is about recording headstones from
> graves and posting the text and images on the Net so that people can
> research their family history.  I would appreciate some advice on how to
> pre-process these headstone images to get the best results from Tesseract
> OCR.  I have thousands of 1-2 MB jpg images of headstones to process.

Post-image capture is too late for one of the most important
enhancements, namely high contrast lighting.  It's not really an issue
with stones that have the carving painted or are otherwise naturally
high contrast, but for many stones sharp oblique lighting is important
to get an image that's readable by humans, let alone OCR software.

Once you've got the best quality image capture you can manage, you'll
probably find that you need to use different image processing
pipelines for different types of stones and carving, so the first step
will be to categorize the stone and figure out which pipeline to run
it through (or run it through them all and compare the results).

In addition to image processing, you may also be able to improve
results by making use of the fact that the vocabulary and layout of
the text is much more constrained than free text.

It'll be interesting to see what kind of results you get.  I suspect
it's going to be a fairly challenging project for the general case,
but you may be able to pick off the low hanging fruit and gradually
expand the types of stones you can handle.

Tom



 


RE: Image pre-processing for good OCR results

2011-02-21 Thread Cong Nguyen
Dear Jon,

 

Try to analyze with some preprocessing steps as follows:

Step1: Detect ROI
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576366756516993234

Step2: Apply a low-pass FFT filter, with parameters:
- intensity threshold: 130
- FFT cutoff: 15%
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576366759922523650

Step3: Scale image with scale factor
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576366756371708834

Step4: Try to recognize using Tesseract/others
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576366764338605922

Step5: Post-processing as required

 

Good luck,

Cong.

 

From: tesseract-ocr@googlegroups.com [mailto:tesseract-ocr@googlegroups.com]
On Behalf Of Jon Andersen
Sent: Monday, February 21, 2011 10:46 PM
To: tesseract-ocr@googlegroups.com
Subject: Re: Image pre-processing for good OCR results

 

Whoops, sorry - links were broken for a bit.  I just fixed the image links,
they should work now.

 

Thanks!!

 

-Jon





Re: Numbers & Noise

2011-02-21 Thread Cong Nguyen
Dear Zvezdoslav Kunov,

I have some ideas for preprocessing:

1. Threshold the image; compare two simple methods:
- static threshold: keep the pixels with lower intensity
- adaptive threshold

2. Do connected component analysis
- filter objects/clusters based on their boundaries

3. Based on the median size of the object/cluster boundaries, calculate a
scale factor (depending on the trained character size) and scale the image

After that, I think we should get good results.
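
Steps 2 and 3 above can be sketched in plain Python as follows. This is only an illustration under my own assumptions: the input is an already thresholded image given as a 2-D list of 0/1 values, components are 4-connected, "boundary" is taken to mean the bounding box, and the names `component_boxes`, `scale_factor` and `min_height` are invented for the example.

```python
from collections import deque
from statistics import median


def component_boxes(binary):
    """Bounding boxes (x0, y0, x1, y1), inclusive, of the 4-connected
    foreground regions of a 2-D list of 0/1 values."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if not binary[sy][sx] or seen[sy][sx]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            seen[sy][sx] = True
            queue = deque([(sx, sy)])
            x0 = x1 = sx
            y0 = y1 = sy
            while queue:
                x, y = queue.popleft()
                x0, x1 = min(x0, x), max(x1, x)
                y0, y1 = min(y0, y), max(y1, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < w and 0 <= ny < h
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


def scale_factor(boxes, trained_height, min_height=2):
    """Factor that brings the median component height to trained_height.
    Components shorter than min_height are treated as noise and ignored."""
    heights = [y1 - y0 + 1 for _, y0, _, y1 in boxes
               if y1 - y0 + 1 >= min_height]
    if not heights:
        return 1.0
    return trained_height / median(heights)
```

The median is used instead of the mean so that a few large blobs or specks of noise do not dominate the estimated character height.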

Cong.

P/S: here are illustrations about the approach:
extracted ROI (I cropped manually :)):
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576335049023069730
scaled image: 
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576335048933180546
tesseract ocr recognition result for scaled image:
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576335050086804194
You can find simple application at: http://code.google.com/p/tesseractdotnet/




Re: Wrappers for tessearct3.01?

2011-02-21 Thread Cong Nguyen
I have just implemented a simple .net wrapper.

http://code.google.com/p/tesseractdotnet/w/list

Hope it's helpful!




Re: Wrappers for tessearct3.01?

2011-02-21 Thread Cong Nguyen
Sorry, but it's not spam!

I tried to send the message 3 times, but I cannot see my post! :)

So, I am trying to post once more.

I have just implemented a simple wrapper here:
http://code.google.com/p/tesseractdotnet/w/list
Hope it is helpful.

Cong.

On Feb 18, 11:53 pm, devTess  wrote:
> Hi, I am getting many questions for help on how to make a simple
> wrapper, where to find information of many parameter terms used in the
> api. I have asked them to address the questions here. The developers
> need feedbacks to refine the software.
>
> For clarification, I am struggling to understand myself.
>
> I hope someone who is able to make it work just post the simplest
> example on how to use the api beside following the tesseractMain.cpp
>
> I am sure the community would appreciate.
>
> I am trying to understand how to do the core part of the tessnet to
> work with version 3.
>
> The codes provided by Remi have shown how to pass the bitmap to native
> tessearct image struct Pix.
>
> the challenge now is after many trials, I still could not get pass
>
> the various ways to get the
>
> recog_all_words(page_res_, monitor, NULL, NULL, 1);
>
> works.
>
> Thank you.




Re: Wrappers for tessearct3.01?

2011-02-21 Thread Cong Nguyen
Dear devTess,

I have just implemented a simple .net wrapper:

http://code.google.com/p/tesseractdotnet/w/list

Hope it's helpful!
Cong.
