Re: How to detect inverted image in a picture

2011-03-16 Thread Dmitry Silaev
This parameter controls the maximum allowed length of a connected
component's contour (a blob outline, in Tesseract's terms).

I suspect Tess decided that the dominant color of the text was white,
and hence it could not construct all the blobs' outlines until you
raised the value of edges_maxedgelength (the outer outline of the
big white blob is very long). One of Tess's notable features is that
it can handle inverted text. To do so, though, it has to be able to get
all the outlines, and you've helped it achieve this.
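
For what it's worth, here is a minimal sketch of a workaround outside Tesseract. It assumes the page is already binarized into rows of 0/1 values, with 1 meaning a dark pixel. The function names are made up for illustration and are not part of Tesseract. If dark pixels dominate, the picture is probably light text on a dark background and can be inverted before OCR:

```python
def is_inverted(binary_rows):
    """Return True when dark pixels (value 1) outnumber light ones (value 0)."""
    dark = sum(sum(row) for row in binary_rows)
    total = sum(len(row) for row in binary_rows)
    return total > 0 and dark * 2 > total


def invert(binary_rows):
    """Swap dark and light pixels."""
    return [[1 - p for p in row] for row in binary_rows]


def normalize_polarity(binary_rows):
    """Return the image as dark-on-light, inverting only when needed."""
    return invert(binary_rows) if is_inverted(binary_rows) else binary_rows
```

This is only a heuristic: a page with a large dark photo or border could fool the pixel count.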

Warm regards,
Dmitry Silaev





On Wed, Mar 16, 2011 at 10:57 AM, Ice Head iceh...@gmail.com wrote:
 After several tries, I found a solution by changing the parameter
 edges_maxedgelength.
 Do you know what this parameter does?

 Thanx,

 Ice

 On Mar 9, 2:12 pm, Ice Head iceh...@gmail.com wrote:
 Hi,

 I'm using Tesseract 3.01 and it fails to read simple images like this
 one (see link below)

 https://docs.google.com/viewer?a=vpid=explorerchrome=truesrcid=0B1...

 Is there a way to read this kind of picture?

 --
 You received this message because you are subscribed to the Google Groups 
 tesseract-ocr group.
 To post to this group, send email to tesseract-ocr@googlegroups.com.
 To unsubscribe from this group, send email to 
 tesseract-ocr+unsubscr...@googlegroups.com.
 For more options, visit this group at 
 http://groups.google.com/group/tesseract-ocr?hl=en.






Re: Especial Characteres

2011-03-14 Thread Dmitry Silaev
Manuel,

I'm afraid just chaining command line tools won't help in this case.
I'm talking about programming.

And yes, I have solved many practical problems in layout
analysis and other fields of document image processing, and succeeded
in it ))

Warm regards,
Dmitry Silaev





On Mon, Mar 14, 2011 at 7:55 AM, manuel...@gmail.com
manuel...@gmail.com wrote:
 What would you recommend to use to split the columns?

 I think I will need to scan using tesseract column by column.
 So after that I will need to merge it to make correct rows.

 Can you point me in the right direction?
 What tools (unix compatible tools) can I use to tell Tesseract to scan a
 specific column?

 Later I will recompile to test, but first I need to find a way to scan
 these reports correctly to generate CSV files to import later to a database.
 If it works I will spend more time tuning Tesseract.

 Have you ever done this before? (scan reports using Tesseract or other tools
 to generate CSV files)

 Thanks



 On 13/03/2011, at 11:20, Dmitry Silaev wrote:

 Installing via ports can cause various errors. Try to compile Tesseract
 natively. I use revision 549 and, as I said, it works fine.

 Tables like yours are a challenge for simple layout processing
 algorithms because the text is sparsely located. Even a minimal skew,
 which is almost inevitable, could break all the logic. In such cases I
 prefer to devise custom segmentation logic specific to the document
 type being processed. This way I do not depend on Tesseract's
 segmentation; Tesseract is used as a raw classifier.
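
To give an idea of what such custom segmentation could look like, here is a rough sketch of one common approach, a vertical projection profile. It is an illustration only, not the code used in the thread. It assumes a binarized image given as rows of 0/1 values (1 = dark) with little skew; column_spans and min_gap are hypothetical names:

```python
def column_spans(binary_rows, min_gap=3):
    """Return (start, end) x-ranges of text columns; end is exclusive.

    A column ends where at least min_gap consecutive pixel columns
    contain no dark pixels at all.
    """
    if not binary_rows:
        return []
    width = len(binary_rows[0])
    # Vertical projection: number of dark pixels in every pixel column.
    profile = [sum(row[x] for row in binary_rows) for x in range(width)]
    spans, start, gap = [], None, 0
    for x, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # Close the column at the first empty pixel column.
                spans.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        spans.append((start, width - gap))
    return spans
```

Each span could then be cropped and passed to the recognizer separately. Note that skew narrows or closes the empty gaps, which is exactly why it can break this kind of logic.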

 Warm regards,
 Dmitry Silaev





 On Sun, Mar 13, 2011 at 4:47 PM, manuel...@gmail.com
 manuel...@gmail.com wrote:
 I'm using the latest version, tesseract @3.00_2+eng.
 I installed it using ports on Mac OS X.

 Another question, Dmitry, about this sample:
 why doesn't Tesseract recognize a complete row? The alignment is not
 perfect, but it is impossible to get an image 100% aligned.
 Tesseract is breaking columns onto new lines like this:

 1           test    productA
 2           test2
 productB

 Do you know how to fix it?

 Regards
 Manuel Pardo


 On 13/03/2011, at 08:32, Dmitry Silaev wrote:

 Manuel,

 The sample you provided definitely has insufficient resolution. You
 can only expect some part of the heading to be recognized, and this is
 what happened when I ran recognition on your image. But I
 didn't get any error or warning messages with my por.traineddata at
 all!

 However, all this was tested under Windows. I can probably try this
 under Ubuntu, but I don't know when I'll have enough time to reboot, set
 up a C++ compiler, build Tesseract and do some testing, sorry ))

 Are you sure you downloaded the latest stable version of Tesseract?

 Warm regards,
 Dmitry Silaev





 On Thu, Mar 10, 2011 at 9:32 PM, manuel...@gmail.com
 manuel...@gmail.com wrote:
 I just replaced por.traineddata with your por.traineddata file.
 After that I'm getting this error message:

 manuel$ tesseract input.tiff output -l por
 actual_tessdata_num_entries_ = TESSDATA_NUM_ENTRIES:Error:Assert 
 failed:in file tessdatamanager.cpp, line 55
 Segmentation fault

 I haven't succeeded. I'm using version 3 - MacOSX 10.6



 Attached Reported.tiff






 Regards
 Manuel Pardo

 On 04/03/2011, at 03:19, Dmitry Silaev wrote:

 Manuel,

 Is the error message generated by version 2.xx? Did you try to run
 version 3.xx with my por.traineddata file?
 I don't get it - have you succeeded or not?
 Please provide us with the image you are trying to recognize.

 Warm regards,
 Dmitry Silaev





 On Thu, Mar 3, 2011 at 5:34 PM, manuel...@gmail.com 
 manuel...@gmail.com wrote:
 Hi Dmitry,

 I just replaced my file with your por.traineddata,
 but I'm getting an error:

 manuel$ tesseract input.tiff output -l por
 actual_tessdata_num_entries_ = TESSDATA_NUM_ENTRIES:Error:Assert 
 failed:in file tessdatamanager.cpp, line 55
 Segmentation fault

 It seems worthwhile to convert old files from 2.0X to 3,
 because there isn't a Brazilian Portuguese for version 3, just
 Portuguese.
 At least the dictionary por.traineddata is working correctly in
 version 3.
 The special chars are being recognized by Tesseract 3.

 regards,
 Manuel Pardo




 On 03/03/2011, at 09:12, Dmitry Silaev wrote:

 Manuel,

 It's quite an interesting question, although it may seem to be an
 ordinary newbie one.

 I was always wondering whether 2.xx files can be used with version 3.xx.
 The wiki states that the files in the traineddata file are different
 from the list used prior to 3.00, and will most likely change,
 possibly dramatically, in future revisions.

 I have no time to investigate it in the code, so I decided to act
 rather than to think. After some tinkering with all those files I
 slipped the resulting por.traineddata into the Tesseract-based
 algorithm I'm currently working on, and - guess what? - it worked! ))

 I must say it was tested only with a couple of *very simple* images

Re: how to get the character in an image file which is in table format.

2011-03-14 Thread Dmitry Silaev
I suspect this paper is a sledgehammer to crack a nut. It's quite
universal and elaborate, and it may take a great deal of time to
implement and debug. Your images might require much simpler
methods.

I always say the same thing: send your sample images and the community
will try to help.

Warm regards,
Dmitry Silaev





On Mon, Mar 14, 2011 at 8:23 AM, David Hoffer dhoff...@gmail.com wrote:
 Hi Vicky,

 Can you tell me more about this paper?  It looks like this is not a
 free document so I can't just read it to see if it would solve the
 problem I have.

 My problem is that I have grey-scale image data (tif/jpg/etc) that
 contains text within a table format, i.e. cells on the page.  The
 documents were originally faxed and then converted to PDF, so the image
 quality varies from poor to good.  I don't want the table formatting.
 I'm looking for a way to remove the formatting and get to just the
 image text, which I want to convert to text using OCR, Tesseract or
 otherwise.

 My programming environment is Java but can shell out to other programs
 if I need to.

 Would the approach in the paper solve this problem space?  How
 practical is the software solution for a one man effort?

 Thanks,
 -Dave



 On Sun, Mar 13, 2011 at 10:18 AM, Vicky Budhiraja vicky.vi...@gmail.com 
 wrote:
 Hello,

 I used this paper (for pre-processing):
 "Parameter-Free Geometric Document Layout Analysis", by Lee and Ryu, IEEE
 Trans. Pattern Analysis and Machine Intelligence, Nov 2001, Volume 23,
 Issue 11, Pages 1240-1256

 Best Regards,
 Vicky



 -Original Message-
 From: tesseract-ocr@googlegroups.com [mailto:tesseract-ocr@googlegroups.com]
 On Behalf Of Daphne
 Sent: Friday, March 11, 2011 01:15
 To: tesseract-ocr
 Subject: how to get the character in an image file which is in table format.

 Hello,

 I have a scanned image file which contains a table. When I OCR it using
 tessnet it doesn't give the desired output.
 It is not reading the characters in the table. Instead it gives some
 numbers.

 How can I read the characters in a table-format image?




Re: Tesseract 3.00

2011-03-14 Thread Dmitry Silaev
Actually, there's more than just VietOCR. Check this:

http://en.wikipedia.org/wiki/Tesseract_(software)#User_interfaces

Warm regards,
Dmitry Silaev





On Mon, Mar 14, 2011 at 2:13 AM, Onion onionzwie...@gmail.com wrote:
 Ok, thanks. That will be too complicated for me to use. Will have to
 uninstall it.




Re: Tesseract 3.00

2011-03-14 Thread Dmitry Silaev
You don't need to bother using the *two together*. Tesseract is the basis
FreeOCR is built on, so these two are together already. FreeOCR's
graphical interface is quite user friendly. Just install and use it. I
don't know what else needs to be said ))

Warm regards,
Dmitry Silaev





On Mon, Mar 14, 2011 at 12:31 PM, Onion onionzwie...@gmail.com wrote:
 I have FreeOCR installed already. So somehow, this works with Tesseract? Can
 you explain in simpleton terms how I'd use the two together? Or is it too
 geeky?
 Thanks




Re: Customising Tesseract for character recognition

2011-03-14 Thread Dmitry Silaev
Ehmm... I don't get it. If you've succeeded in using iterators, you are
free to format the output programmatically in any way you want,
aren't you?
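
As an illustration of what formatting programmatically could mean here, the sketch below takes words with their bounding boxes (however they were obtained from the iterators) and groups them into rows to produce CSV text. The tuple layout, words_to_csv and row_tolerance are assumptions made for this example, not Tesseract API:

```python
import csv
import io


def words_to_csv(words, row_tolerance=10):
    """Group words into rows by vertical position and return CSV text.

    words: iterable of (text, left, top, right, bottom) tuples in pixels.
    A word joins the current row when its vertical centre lies within
    row_tolerance pixels of the centre of the word that started that row.
    """
    rows = []  # each entry: [reference_centre_y, [(left, text), ...]]
    by_height = sorted(words, key=lambda w: (w[2] + w[4]) / 2)
    for text, left, top, right, bottom in by_height:
        centre = (top + bottom) / 2
        if rows and abs(centre - rows[-1][0]) <= row_tolerance:
            rows[-1][1].append((left, text))
        else:
            rows.append([centre, [(left, text)]])
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for _, cells in rows:
        # Within a row, order the words from left to right.
        writer.writerow([text for _, text in sorted(cells)])
    return out.getvalue()
```

Every word becomes its own CSV field here; merging several words into one table cell would need the column positions as well.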

Warm regards,
Dmitry Silaev





On Mon, Mar 14, 2011 at 1:56 PM, Jose diox...@gmail.com wrote:
 *I only modify how the result is printed, nothing else... I grab all the
 info from the word and its bounding box. That is OK, right?




Re: how to get the character in an image file which is in table format.

2011-03-14 Thread Dmitry Silaev
Dave,

Yep, the quality is relatively poor, so don't expect high accuracy from Tess.

Do you need every table cell's contents? Or is getting the numbers
enough, so that in a later step you can restore the [predefined] item names?

Warm regards,
Dmitry Silaev





On Mon, Mar 14, 2011 at 4:19 PM, David Hoffer dhoff...@gmail.com wrote:
 Dmitry,

 That would be great, thanks for the offer. I'll attach two samples.

 These two are good examples of the range of quality.  What I need to
 do is extract cell data for processing.  I can generate these in any
 image format (TIFF, JPEG) if one should be preferred.

 Best regards,
 -Dave





Re: how to get the character in an image file which is in table format.

2011-03-14 Thread Dmitry Silaev
Dave,

In what format and resolution do you initially get your
images? With such poor quality, every conversion makes an image even
worse...

Warm regards,
Dmitry Silaev





On Mon, Mar 14, 2011 at 5:29 PM, David Hoffer dhoff...@gmail.com wrote:
 Dmitry,

 Would using a lossless format like TIFF be preferred?

 (I'm going to give this a try but some of these steps might be a bit
 more than I can handle...I'm not an image processing guru.)

 -Dave

 On Mon, Mar 14, 2011 at 5:23 PM, Dmitry Silaev daemons2...@gmail.com wrote:
 Ehmm, actually I thought a bit more and now I say no to deskewing. It
 can be detrimental to such poor quality images: they are almost
 binary ("almost" probably because of the JPEG compression algorithm) and
 low-res. As far as I can see, binary images are all you can get.

 Therefore we need to assume that the skew of an input image is always
 within some narrow range, and modify all our following steps to work in
 a skewed coordinate system.
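
One possible reading of "working in a skewed coordinate system" is to leave the pixels untouched and correct only the coordinates. The sketch below assumes a small skew modelled as a vertical shear, with positive angles meaning lines drift downwards to the right. The function names and the tolerance are hypothetical:

```python
import math


def to_deskewed(x, y, skew_degrees):
    """Map a pixel position in the skewed image to row-aligned coordinates.

    A text line is assumed to drift by tan(skew) pixels in y for every
    pixel travelled in x; the image itself is never resampled.
    """
    return x, y - x * math.tan(math.radians(skew_degrees))


def same_row(box_a, box_b, skew_degrees, tolerance=5.0):
    """Decide whether two boxes (left, top, right, bottom) share a text row."""
    def corrected_centre(box):
        left, top, right, bottom = box
        _, cy = to_deskewed((left + right) / 2, (top + bottom) / 2, skew_degrees)
        return cy
    return abs(corrected_centre(box_a) - corrected_centre(box_b)) <= tolerance
```

With a skew of about 1 degree, two cells 1000 pixels apart differ by roughly 17 pixels in y, so comparing raw coordinates would put them on different rows while the corrected ones agree.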

 Dmitry




  Attachments: hud1.jpeg (748K), hud2.jpeg (2046K)


Re: Especial Characteres

2011-03-14 Thread Dmitry Silaev
I doubt there's a GUI which can help with what you want. As for a
programmatic way of doing this, please refer to the following thread,
where I already tried to answer a similar question:
http://groups.google.com/group/tesseract-ocr/browse_thread/thread/6322a29f28ba49dc/f98699a9caf36dbc#f98699a9caf36dbc

If you see no clues in these posts then you need to send your sample
images; there's no other way to help you.

Warm regards,
Dmitry Silaev





On Mon, Mar 14, 2011 at 5:22 PM, manuel...@gmail.com
manuel...@gmail.com wrote:
 Thanks.

 I need a GUI that tells Tesseract to recognize just a specific column.
 I'm a Java and C++ developer. Can you point me in the right direction?


 Regards
 Manuel Pardo


Re: how to get the character in an image file which is in table format.

2011-03-14 Thread Dmitry Silaev
As far as I can see, your source data can be treated as a 1-bit (binary),
losslessly compressed image. So a lossless conversion to any image
format (it makes no difference which) will do no harm.

Warm regards,
Dmitry Silaev





On Tue, Mar 15, 2011 at 8:31 AM, David Hoffer dhoff...@gmail.com wrote:
 Dmitry,

 Originally the documents are PDFs with these images CCITTFax-encoded; I
 decoded them using iText.  At this point I have a BufferedImage which
 I can save in any format supported by Java.  I assume TIFF would be
 one of the best.

 Best regards,
 -Dave

 On Tue, Mar 15, 2011 at 7:52 AM, Dmitry Silaev daemons2...@gmail.com wrote:
 Dave,

 What is the format and resolution in which you initially get your
 images? For such poor quality every conversion makes an image even
 worse...

 Warm regards,
 Dmitry Silaev





 On Mon, Mar 14, 2011 at 5:29 PM, David Hoffer dhoff...@gmail.com wrote:
 Dmitry,

 Would using a loss-less format like TIFF be preferred?

 (I'm going to give this a try but some of these steps might be a bit
 more than I can handle...I'm not an image processing guru.)

 -Dave

 On Mon, Mar 14, 2011 at 5:23 PM, Dmitry Silaev daemons2...@gmail.com 
 wrote:
 Ehmm, actually I thought a bit more and now I say no to deskewing. It
 can be detrimental to such poor quality images - they are almost
 binary (almost probably because of the JPEG compression algo) and
 low-res. As far as I see, you only can have binary images.

 Therefore we need to assume a skew of an input image to be always
 within some narrow range and modify all our following steps to work in
 a skewed coordinate system.

 Dmitry

 On Mar 14, 4:19 pm, David Hoffer dhoff...@gmail.com wrote:
 Dmity,

 That would be great thanks for the offer, I'll attach two samples.

 These two are good examples of the range of quality.  What I need to
 do is extract cell data for processing.  I can generate these in any
 image format, tiff, jpeg if one should be preferred.

 Best regards,
 -Dave

 On Mon, Mar 14, 2011 at 11:07 AM, Dmitry Silaev daemons2...@gmail.com 
 wrote:
  I suspect, this paper is a sledgehammer for a nut. It's quite
  universal and elaborated. Usually it may take a great deal of time to
  implement and debug it. Your images might require much simplier
  methods.

  I always say the same thing: send your sample images and the community
  will try to help.

  Warm regards,
  Dmitry Silaev

  On Mon, Mar 14, 2011 at 8:23 AM, David Hoffer dhoff...@gmail.com 
  wrote:
  Hi Vicky,

  Can you tell me more about this paper?  It looks like this is not a
  free document so I can't just read it to see if it would solve the
  problem I have.

  My problem is that I have grey-scale image data (tif/jpg/etc) that
  contains text within a table format, i.e. cells on the page.  The
  documents where originally faxed then converted to PDF so the image
  quality varies from poor to good.  I don't want the table formatting,
  I'm looking for a way to remove the formatting and get to just the
  image text, I want to convert that to text using OCR, Tesseract or
  otherwise.

  My programming environment is Java but can shell out to other programs
  if I need to.

  Would the approach in the paper solve this problem space?  How
  practical is the software solution for a one man effort?

  Thanks,
  -Dave

  On Sun, Mar 13, 2011 at 10:18 AM, Vicky Budhiraja 
  vicky.vi...@gmail.com wrote:
  Hello,

  I used this paper (for pre-processing):
  Parameter-Free Geometric Document Layout Analysis, by Lee, Ryu 2001. 
  IEEE
  Tran. Patt. Analysis and Machine Int. Nov 2001 Volume 23 Issue 11 
  Pages 1240
  - 1256

  Best Regards,
  Vicky

  -Original Message-
  From: tesseract-ocr@googlegroups.com 
  [mailto:tesseract-ocr@googlegroups.com]
  On Behalf Of Daphne
  Sent: Friday, March 11, 2011 01:15
  To: tesseract-ocr
  Subject: how to get the character in an image file which is in table 
  format.

  Hello,

  I have a scanned image file which contains a table. When I OCR it using
  tessnet it doesn't give the desired output.
  It is not reading the characters in the table. Instead it gives some
  numbers.

  How can I read the characters in a table-format image?

  --
  You received this message because you are subscribed to the Google 
  Groups
  tesseract-ocr group.
  To post to this group, send email to tesseract-ocr@googlegroups.com.
  To unsubscribe from this group, send email to
  tesseract-ocr+unsubscr...@googlegroups.com.
  For more options, visit this group at
 http://groups.google.com/group/tesseract-ocr?hl=en.


Re: how to get the character in an image file which is in table format.

2011-03-13 Thread Dmitry Silaev
The first step in this technique is to threshold the image using a
manually selected threshold value. Within the text of the article this
step only deserved a line of code (pix1 = pixThresholdToBinary(pixs,
150)), but not a single word. However, the fact that such a convenient
threshold luckily exists is crucial for all the subsequent steps of the
method to work. I think your source images do not enjoy such good
separability conditions.

I think this article is more an example of what can be done with
Leptonica from a user's, not a developer's, point of view. It's like you
take one concrete image in Photoshop and try to achieve what you have
in your head. You try various filters, apply transformations, effects,
etc. However none of these can be applied automatically: every time
you need to choose parameters manually and make decisions specifically
for this very image.

IMHO this is the reason why the author chose morphology ("oh, great,
that worked!"). It's easier to use, in one function call, but in the
overwhelming majority of cases an algorithmic approach gives
much more precise results. In real situations morphology requires
you to do a great deal of cleaning after it has done its work, which
can be a lot more complex and less mathematically elegant than the
morphology algos themselves. Another reason why I try to stay away
from morphology is that it is really slow by its nature compared to
other methods, despite the recent emergence of some fast methods. By the
way, the article advertises a processing speed of 1 Mpix/sec, which
I think is relatively slow for the intended goal even for yesterday's
P4s.

The moral is: you can use this article as a guideline or maybe just
for several specific images. However it's not well suited for
automatic processing.

P.S.: This is my own opinion, and it does not necessarily coincide with
the views of other document image processing people.

Warm regards,
Dmitry Silaev





On Sun, Mar 13, 2011 at 12:52 AM, TP wing...@gmail.com wrote:
 How about this technique mentioned in the Leptonica documentation (it's
 even easier if you can use binary morphology): "Removing dark lines
 from a light pencil drawing" at
 http://tpgit.github.com/UnOfficialLeptDocs/leptonica/line-removal.html

  -- TP


Re: Especial Characteres

2011-03-13 Thread Dmitry Silaev
Manuel,

The sample you provided definitely has insufficient resolution. You
can only expect some part of the heading to be recognized, and that is
what happened when I ran recognition on your image. But I
didn't get any error or warning messages with my por.traineddata at
all!

However, all this was tested under Windows. I could probably try it
under Ubuntu, but I don't know when I'll have enough time to reboot, set
up a C++ compiler, build Tesseract and do some testing, sorry ))

Are you sure you downloaded the latest stable version of Tesseract?

Warm regards,
Dmitry Silaev





On Thu, Mar 10, 2011 at 9:32 PM, manuel...@gmail.com
manuel...@gmail.com wrote:
 I just replaced por.traineddata with your file por.traineddata.
 After that I'm getting this error message:

 manuel$ tesseract input.tiff output -l por
 actual_tessdata_num_entries_ = TESSDATA_NUM_ENTRIES:Error:Assert failed:in file tessdatamanager.cpp, line 55
 Segmentation fault

 I haven't succeeded. I'm using version 3 - MacOSX 10.6



 Attached Reported.tiff






 Regards
 Manuel Pardo

 On 04/03/2011, at 03:19, Dmitry Silaev wrote:

 Manuel,

 Is the error message generated by version 2.xx? Did you try to run
 version 3.xx with my por.traineddata file?
 I don't get it - have you succeeded or not?
 Please provide us with the image you are trying to recognize.

 Warm regards,
 Dmitry Silaev





 On Thu, Mar 3, 2011 at 5:34 PM, manuel...@gmail.com manuel...@gmail.com 
 wrote:
 Hi Dmitry,

 I just replaced it with your file por.traineddata
 But I'm getting an error:

 manuel$ tesseract input.tiff output -l por
 actual_tessdata_num_entries_ = TESSDATA_NUM_ENTRIES:Error:Assert failed:in file tessdatamanager.cpp, line 55
 Segmentation fault

 It seems worthwhile to convert old files from 2.0X to 3, because
 there isn't a Brazilian Portuguese for version 3, just Portuguese.
 At least the dictionary por.traineddata is working correctly in version 3.
 The special chars are being recognized by Tesseract 3.

 regards,
 Manuel Pardo




 On 03/03/2011, at 09:12, Dmitry Silaev wrote:

 Manuel,

 It's quite an interesting question although it may seem to be an
 ordinary newbie-like one.

 I was always wondering if 2.xx files can be used with version 3.xx.
 The wiki states that the files in the traineddata file are different
 from the list used prior to 3.00, and will most likely change,
 possibly dramatically in future revisions.

 I have no time to investigate it in the code so I decided to act
 rather than to think. After some tinkering with all those files I
 slipped the resulting por.traineddata into the Tesseract algo I'm
 currently working on, and - guess what? - it worked! ))

 I must say it was tested only with a couple of *very simple* images
 and also it absolutely lacks any dictionary-related data. And my test
 images don't contain these specific Portuguese letters with
 diacritics. So in fact this file may perform poorly. Please test and
 report your results. The file is in the attachment.

 It was not difficult at all, but also not so straightforward, to make
 this training data file, so this process probably deserves a separate
 article and later I'd like to post it in my blog.

 Warm regards,
 Dmitry Silaev





 On Wed, Mar 2, 2011 at 8:40 PM, manuelfhp manuel...@gmail.com wrote:
 Hello list,
 I can't find a solution for special chars

 I installed tesseract 3 in my MacOSX 10.6
 It is running very well

 But I'm having problems with charset.
 I need tesseract working with Brazilian Portuguese. (ISO8859-1)

 I installed the Portuguese dictionary but it is not working with special
 chars like  Ç Ã É é   (ISO8859-1)
 Is there any solution ?

 There is an old dictionary specially for Brazilian Portuguese in version
 2.0.4. Is it possible to use it in version 3? How?



 por.traineddata


Re: Tesseract 3.00

2011-03-13 Thread Dmitry Silaev
Although the Tesseract team strives to make it more user-friendly, many
basic user questions are still hard to find an answer to...

Tesseract is a console application, it has no GUI. You should open a
Windows command line and type a command. Read more at
http://code.google.com/p/tesseract-ocr/wiki/ReadMe#Windows

Warm regards,
Dmitry Silaev





On Sun, Mar 13, 2011 at 11:36 PM, Onion onionzwie...@gmail.com wrote:
 I installed Tesseract 3.00 and the German and Czech languages as well as
 English.
 Now how do I run it? Are there directions somewhere?
 When I click Start  Tesseract OCR, a DOS screen flashes for a split second,
 then nothing happens.
 Thanks




Re: Customising Tesseract for character recognition

2011-03-13 Thread Dmitry Silaev
Jose,

I ran Tesseract revision 549 from the command line under Windows with
no special config and got segmentation that is almost correct.
What language file do you use? I used the following command line:

tesseract 3.tiff test3 -l eng

both without pageseg_mode (the -psm argument) and with it, and
the result was always satisfactory.

Let me know the details on your command line and OS.

Warm regards,
Dmitry Silaev





On Sun, Mar 13, 2011 at 11:18 PM, patrickq
patrick.questemb...@gmail.com wrote:
 You expect way too much from Tesseract: it's not Tesseract's job to
 slice and dice the text according to various organizational
 requirements of applications - that's for the application to handle.
 You can get the coordinates of all characters and easily determine
 which ones are in what you consider the first column and which are in
 the 2nd column. In ScanBizCards' case, considering our target material,
 we treat each line as a single number formed of two sequences - but if
 we wanted to treat the input as columns, it would take us a mere 20
 minutes of coding to organize the results that way. We actually don't
 even pay attention to where Tesseract thinks lines end and start, we
 figure that out ourselves based on coordinates. It's not hard.

 Patrick

 On Mar 13, 4:10 pm, Jose diox...@gmail.com wrote:
 Hi Patrick,

 Yes, the results are correct, but the format of the results is not! That's
 my trouble.




Re: how to get the character in an image file which is in table format.

2011-03-12 Thread Dmitry Silaev
Dave,

There are a number of methods you can use to remove straight lines or
borders, either individually or in combination. The simplest are: the
Hough line detector (http://en.wikipedia.org/wiki/Hough_transform), the
vertical/horizontal profile method (X and Y histograms of foreground
pixel counts: detect lines by the highest bin counts, or table cell
margins by the lowest), connected component analysis (detect nested
CCs; the outer ones serve as borders), and methods based on alignment
analysis. If your documents can be skewed, for some methods they need to
be deskewed first.
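
To make the profile method a bit more concrete, here is a minimal
sketch in plain Python. It assumes an already binarized and deskewed
image stored as a list of rows with 1 for foreground and 0 for
background. The function name find_ruling_lines and the min_fill value
are only illustrative, not part of Tesseract or Leptonica.

```python
def find_ruling_lines(binary, axis, min_fill=0.8):
    """Find table ruling lines with a projection profile.

    binary   -- list of rows, each a list of 0/1 pixels (1 = foreground)
    axis     -- 0 to look for horizontal lines (row profile),
                1 to look for vertical lines (column profile)
    min_fill -- assumed fraction of foreground pixels a row or column
                must reach to count as part of a line

    Returns a list of inclusive (start, end) index ranges, one per line.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    if axis == 0:
        counts = [sum(row) for row in binary]
        full = width
    else:
        counts = [sum(binary[y][x] for y in range(height)) for x in range(width)]
        full = height

    runs, start = [], None
    for i, count in enumerate(counts):
        is_line = full > 0 and count >= min_fill * full
        if is_line and start is None:
            start = i                      # a new line begins here
        elif not is_line and start is not None:
            runs.append((start, i - 1))    # the line ended on the previous index
            start = None
    if start is not None:
        runs.append((start, len(counts) - 1))
    return runs
```

Real faxes have broken and slightly tilted lines, so the threshold
would need tuning, and short lines that span only part of the page will
be missed by such a global profile.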

After you detect table borders, you can get bounding boxes of
individual cells and then pass them to Tesseract. I think for
Tesseract, small single-row portions of text that still allow it to
determine the baseline and x-height are often much easier to
recognize than full-sized pages, even with no tables in them. This is
because of Tesseract's native layout analysis. To disable it (or to avoid
it as much as possible) you would need to set pageseg_mode to
PSM_SINGLE_BLOCK, PSM_SINGLE_LINE, PSM_SINGLE_WORD, or even to
PSM_SINGLE_CHAR. In my experience, PSM_SINGLE_WORD and
PSM_SINGLE_CHAR work best as they bypass almost all of Tesseract's layout
analysis. Next come PSM_SINGLE_LINE and PSM_SINGLE_BLOCK. However for
PSM_SINGLE_WORD or PSM_SINGLE_CHAR you'd need to do your own
segmentation. I don't know if you are ready to dive into such serious
development.
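
As a rough illustration of the "bounding boxes of individual cells"
step, the sketch below turns detected ruling lines into cell
rectangles. It assumes a regular grid in which every line spans the
whole table, and that the lines are given as sorted, inclusive
(start, end) pixel ranges. The function name cell_boxes is
hypothetical.

```python
def cell_boxes(h_lines, v_lines):
    """Compute the interior box of every cell in a regular table grid.

    h_lines -- sorted (start, end) row ranges of horizontal ruling lines
    v_lines -- sorted (start, end) column ranges of vertical ruling lines

    Returns (left, top, right, bottom) tuples, all inclusive, in
    row-major order. Cells with no interior pixels are skipped.
    """
    boxes = []
    # Each pair of neighbouring horizontal lines bounds one table row.
    for (_, top_end), (bottom_start, _) in zip(h_lines, h_lines[1:]):
        # Each pair of neighbouring vertical lines bounds one column.
        for (_, left_end), (right_start, _) in zip(v_lines, v_lines[1:]):
            left, top = left_end + 1, top_end + 1
            right, bottom = right_start - 1, bottom_start - 1
            if left <= right and top <= bottom:
                boxes.append((left, top, right, bottom))
    return boxes
```

Each box can then be cropped from the page and recognized on its own
with one of the single-block or single-line page segmentation modes
mentioned above.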

HTH

Warm regards,
Dmitry Silaev





On Sat, Mar 12, 2011 at 7:39 AM, David Hoffer dhoff...@gmail.com wrote:
 Dmitry,

 Yeah, I was thinking too of preprocessing to remove all straight
 lines/borders but haven't found a good approach to this yet.  I can
 clean up the margins, headers, footers but I haven't found a good way
 to remove table row lines.  If you/others have any suggestions I would
 love to hear them.

 I will also experiment with the config file.

 Thanks much!
 -Dave


Re: how to get the character in an image file which is in table format.

2011-03-11 Thread Dmitry Silaev
Actually I think there's no fully user-friendly solution. Maybe you
can try the first of the two possible methods that I currently see.

So the first method is to devise a special config file and include it
in the command line for Tesseract. The following values need to be
within this config file:

tessedit_pageseg_mode 1 or 3 (I recommend 3)
textord_tabfind_find_tables T
textord_tablefind_recognize_tables T

You can play with the last param, trying the T or F values. I give no
guarantee that the whole method works; I only found some
clues by studying the code. I suspect the corresponding pieces of code
may not work perfectly, or there may be more parameters that
influence table recognition. Please try this yourself. It would be
nice if you shared your results with the community. Sample images are
also appreciated.
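
For anyone who wants to script this experiment, here is a small
sketch that builds the config text from the three values above and
assembles the command line. The helper names, the file name
"tableconfig" and the placement of the config name as the last
argument after the output base are assumptions of the sketch, so check
them against your Tesseract version.

```python
def table_config_text(pageseg_mode=3, recognize_tables=True):
    """Return the config file contents using the three parameters named above."""
    flag = "T" if recognize_tables else "F"
    lines = [
        "tessedit_pageseg_mode %d" % pageseg_mode,
        "textord_tabfind_find_tables T",
        "textord_tablefind_recognize_tables %s" % flag,
    ]
    return "\n".join(lines) + "\n"


def write_config(path, text):
    """Write the config text to a file that Tesseract can read."""
    with open(path, "w") as handle:
        handle.write(text)


def tesseract_command(image, output_base, config_name, lang="eng"):
    """Build the argument list; config_name is assumed to go last."""
    return ["tesseract", image, output_base, "-l", lang, config_name]


if __name__ == "__main__":
    write_config("tableconfig", table_config_text())
    print(" ".join(tesseract_command("page.tif", "page", "tableconfig")))
```

Running the same image once with recognize_tables=True and once with
False makes it easy to compare the two outputs side by side.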

The second method is to pre-process your images. You need to remove
lines and borders and pass the cleaned image to Tesseract. Many issues
can arise in this process, but I see no need to go into them now
unless you express some interest in it.

Warm regards,
Dmitry Silaev





On Fri, Mar 11, 2011 at 7:21 AM, David Hoffer dhoff...@gmail.com wrote:
 I have the same problem. I posted a message a few days ago titled
 "Working with FAX images with lines/borders".  If you find a solution
 please let me know.

 Thanks,
 -Dave

 On Thu, Mar 10, 2011 at 10:44 PM, Daphne flower.dap...@gmail.com wrote:
 Hello,

 I have a scanned image file which contains a table. When I OCR it using
 tessnet it doesn't give the desired output.
 It is not reading the characters in the table. Instead it gives some
 numbers.

 How can I read the characters in a table-format image?




Re: are there parameters to increase the chances for white space between words?

2011-03-11 Thread Dmitry Silaev
Try textord_words_min_minspace; its value is a fraction of the x-height.

Warm regards,
Dmitry Silaev





On Mon, Mar 7, 2011 at 8:28 PM, JMW white.j...@gmail.com wrote:
 I'm having some consistent problems with lack of white space between
 words, e.g. "Thisisyour statementthatshows theamount you owe
 foryour".  Are there tuning parameters that will help increase the
 chance of getting whitespace between words?




Re: noise output

2011-03-04 Thread Dmitry Silaev
Zdravko,

You should do text detection before passing images to Tesseract.
Text detection is the process of determining which image regions contain
text. Even if an image contains no text, Tesseract will treat
it as an image of text anyway.

Before recognition Tess applies a so-called binarization algorithm,
which converts an RGB image to a monochrome one (black for text and
white for background). For your sample image the Otsu binarization
used in Tesseract (http://en.wikipedia.org/wiki/Otsu%27s_method) would
certainly give a number of skewed vertical lines resembling
backslashes, and further recognition classifies them as such.
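
To show why an almost blank image still ends up with "ink" after
binarization, here is a textbook sketch of Otsu's method over a
grey-level histogram. It is not Tesseract's actual implementation, only
the general technique: the threshold that maximizes the between-class
variance is chosen, and such a threshold always exists even when the
two "classes" differ by only a few grey levels.

```python
def otsu_threshold(histogram):
    """Return the grey level t that best separates [0..t] from [t+1..].

    histogram -- list of pixel counts, one entry per grey level.
    The score used is the between-class variance up to a constant
    factor; the first level reaching the maximum wins.
    """
    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))
    best_threshold, best_score = 0, -1.0
    below_count, below_sum = 0, 0
    for level, count in enumerate(histogram):
        below_count += count
        below_sum += level * count
        above_count = total - below_count
        if below_count == 0 or above_count == 0:
            continue  # one class is empty, nothing to separate
        mean_below = below_sum / below_count
        mean_above = (weighted_total - below_sum) / above_count
        score = below_count * above_count * (mean_below - mean_above) ** 2
        if score > best_score:
            best_threshold, best_score = level, score
    return best_threshold
```

With a histogram that has all its pixels at grey levels 120 and 124,
the function still returns a threshold between them, so half of a
visually uniform image would be turned into foreground.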

textord_heavy_nr and some other variables control size-based noise
removal, but they work satisfactorily only when there's a significant
body of good text surrounded by some amount of noise. In your image
everything is noise, so it won't work.

Therefore you need to extend your pre-processing in order to feed Tess
only with images that actually contain text. Decisions can be made based
on contrast estimation, distinctive color distribution, etc.
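
A very crude version of such a contrast-based decision could look like
the sketch below. It takes a flat list of grey values (0 = black,
255 = white) and rejects regions whose contrast is too low or whose
share of dark pixels is implausible for text. The function name and all
three thresholds are made-up starting points that would have to be
tuned on real data.

```python
def looks_like_text(pixels, min_contrast=60, min_ink=0.01, max_ink=0.5):
    """Guess whether a grey-scale region is worth sending to OCR.

    pixels       -- flat list of grey values, 0 (black) to 255 (white)
    min_contrast -- assumed minimum spread between dark and light levels
    min_ink      -- assumed minimum fraction of dark pixels
    max_ink      -- assumed maximum fraction of dark pixels
    """
    values = sorted(pixels)
    if not values:
        return False
    # Use the 1st and 99th percentiles so single outlier pixels are ignored.
    dark = values[int(0.01 * (len(values) - 1))]
    light = values[int(0.99 * (len(values) - 1))]
    if light - dark < min_contrast:
        return False  # nearly uniform region, most likely blank or noise
    midpoint = (dark + light) / 2
    ink = sum(1 for value in values if value < midpoint) / len(values)
    return min_ink <= ink <= max_ink
```

This only filters out the obvious cases such as almost blank images;
photographs with strong contrast will pass it and need a real text
detector.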

HTH

Warm regards,
Dmitry Silaev





On Fri, Mar 4, 2011 at 5:25 PM, zdravco zdra...@gmail.com wrote:
 Hello,

 I am using tesseract in my project after some image pre-processing.
 There are some false negatives I was hoping tesseract would eliminate
 by producing no output. However, sometimes there is a strange output
 that I get from almost blank images.
 Here is the sample image:
 https://picasaweb.google.com/zdravco/TesseractTest#5580227257541654274

 When I run it with tesseract rev. 552 using English language I get:
   R \.

 Does anyone know if there are some options in tesseract that could
 eliminate this noise? Or maybe if I could improve my input image with
 some further pre-processing. I have also tried to recompile tesseract
 with textord_heavy_nr set to TRUE, but then the output is:
 an \\“ R \.

 Thanks,
 Zdravko




Fwd: noise output

2011-03-04 Thread Dmitry Silaev
There are tons of them. And I believe no ready recipe can be used
universally; this is very task-specific, especially for photographic
images. I also believe that to do good text detection your algo should to
some extent mimic human behavior, so it probably should be multi-stage,
gradually refining results at every stage. Don't count on getting a
working code snippet from the internet; most likely you'd have to
write the code yourself.

Here are some articles I picked out when I was self-studying this field of
document image processing. By now there might be newer ones,
but these can provide you with the basics. Apologies, I've no time to
provide you with direct references and author names - I only listed my
file system directory on this topic. You can Google for exact article
titles to find links.

1990 Scale-Space and Edge Detection Using Anisotropic Diffusion.pdf
1998 Edge detection and ridge detection with automatic scale
       selection.pdf
2001 Edge-Based Method for Text Detection from Complex Document
       Images.pdf
2001 TEXT EXTRACTION FROM GREY SCALE PAGE IMAGES BY SIMPLE EDGE
       DETECTORS.pdf
2002 Gaussian-Based Edge-Detection Methods - A Survey.pdf
2003 Fast Computation of Scale Normalised Gaussian Receptive
       Fields.pdf
2003 Real-time scale selection in hybrid multi-scale
       representations.pdf
2003 Recognition of text in 3-D scenes.pdf
2004 A method for ridge extraction.pdf
2004 A Review of Vessel Extraction Techniques and Algorithms.pdf
2004 Distinctive Image Features from Scale-Invariant Keypoints.pdf
2004 Scene Text Extraction in Natural Scene Images using
       Hierarchical Feature Combining and Verification.PDF
2004 Text Detection from Natural Scene Images - Towards a System
       for Visually Impaired Persons.PDF
2005 A novel approach for text detection in images using structural
       features.pdf
2005 Color Text Extraction from Camera-based Images - the Impact of
       the Choice of the Clustering Distance.PDF
2005 Improved Text-Detection Methods for a Camera-based Text
       Reading System for Blind Persons.PDF
2005 Text Extraction from Gray Scale Historical Document Images
       Using Adaptive Local Connectivity Map.pdf
2006 Multiscale Edge-Based Text Extraction from Complex Images.PDF
2006 Spatial and Color Spaces Combination for Natural Scene Text
       Extraction.PDF
2008 A double-threshold image binarization method based on edge
       detector.PDF

HTH

Warm regards,
Dmitry Silaev





On Sat, Mar 5, 2011 at 8:56 AM, Saurabh Gandhi saurabh...@gmail.com wrote:
 Hey,
 Any algorithm / whitepaper suggestions for text extraction, especially if
 the text is not over-lay text but a part of the image itself. Most
 algorithms I saw on the internet are compute intensive.

 --
 Regards,
 Saurabh Gandhi





Re: Especial Characteres

2011-03-03 Thread Dmitry Silaev
Sriranga,

Thanks for letting me know. You are the first one then, and I have
reinvented the wheel ))
However, an article might still be more useful than a verbose forum discussion...
Maybe you'd like to write it then?

Warm regards,
Dmitry Silaev





On Thu, Mar 3, 2011 at 3:55 PM, Sriranga(78yrsold)
withblessi...@gmail.com wrote:
 Dmitry,
 I successfully generated traineddata (Kannada) files from the old
 data files of 2.xx last year. There is a discussion by spohorsky in the forum
 on how to do it.
 sriranga(78)
 On Thu, Mar 3, 2011 at 5:42 PM, Dmitry Silaev daemons2...@gmail.com wrote:

 Manuel,

 It's quite an interesting question although it may seem to be an
 ordinary newbie-like one.

 I was always wondering if 2.xx files can be used with version 3.xx.
 The wiki states that the files in the traineddata file are different
 from the list used prior to 3.00, and will most likely change,
 possibly dramatically in future revisions.

 I have no time to investigate it in the code, so I decided to act
 rather than to think. After some tinkering with all those files I
 slipped the resulting por.traineddata into the Tesseract algo I'm
 currently working on, and - guess what? - it worked! ))

 I must say it was tested only with a couple of *very simple* images
 and also it absolutely lacks any dictionary-related data. And my test
 images don't contain these specific Portuguese letters with
 diacritics. So in fact this file may perform poorly. Please test and
 report your results. The file is in the attachment.

 It was not difficult at all, but also not so straightforward, to make
 this training data file, so this process probably deserves a separate
 article, which I'd like to post in my blog later.

 Warm regards,
 Dmitry Silaev





 On Wed, Mar 2, 2011 at 8:40 PM, manuelfhp manuel...@gmail.com wrote:
  Hello list,
  I can't find a solution for special chars.
 
  I installed tesseract 3 on my Mac OS X 10.6.
  It is running very well.
 
  But I'm having problems with the charset.
  I need tesseract working with Brazilian Portuguese (ISO8859-1).
 
  I installed the Portuguese dictionary, but it is not working with special
  chars like  Ç Ã É é   (ISO8859-1).
  Is there any solution?
 
  There is an old dictionary specifically for Brazilian Portuguese in version
  2.0.4. Is it possible to use it in version 3? How?
 
 



Re: Especial Characteres

2011-03-03 Thread Dmitry Silaev
Sriranga,

Actually, I don't understand why one needs to refer to the forum
discussion you've just mentioned, as I managed to build this
traineddata file without writing a single line of code and even
without a compiler such as Visual C++...

The value I can add is that any user inexperienced in programming
can make this traineddata file himself ))

Warm regards,
Dmitry Silaev





On Thu, Mar 3, 2011 at 5:08 PM, Sriranga(78yrsold)
withblessi...@gmail.com wrote:
 Dmitry,
 No, I am NOT the first to do it; the credit actually goes to spohor...@sjm.com,
 who helped me a great deal, including creating a vcproj for combined traineddata
 on Windows. I am very thankful to him for the help and guidance he has given me
 from time to time. Without his help I would not have succeeded in generating a
 traineddata file out of the old data files. All credit should go to Steve. Steve
 has already explained in detail how to do it in the forum discussions.
 -sriranga(78yrs)


Re: Especial Characteres

2011-03-03 Thread Dmitry Silaev
Manuel,

Is the error message generated by version 2.xx? Did you try to run
version 3.xx with my por.traineddata file?
I don't get it - have you succeeded or not?
Please provide us with the image you are trying to recognize.

Warm regards,
Dmitry Silaev





On Thu, Mar 3, 2011 at 5:34 PM, manuel...@gmail.com manuel...@gmail.com wrote:
 Hi Dmitry,

 I just replaced with your file por.traineddata
 But I'm getting an error:

 manuel$ tesseract input.tiff output -l por
 actual_tessdata_num_entries_ = TESSDATA_NUM_ENTRIES:Error:Assert failed:in 
 file tessdatamanager.cpp, line 55
 Segmentation fault

 It seems it would be interesting to convert old files from 2.0X to 3, because
 there isn't a Brazilian Portuguese for version 3, just Portuguese.
 At least the dictionary por.traineddata is working correctly in version 3.
 The special chars are being recognized by tesseract 3.

 regards,
 Manuel Pardo







Re: image binarization

2011-03-02 Thread Dmitry Silaev
Without any image samples, you can only get vague advice.
Provide the community with samples and you might get a satisfactory,
concrete response.

Warm regards,
Dmitry Silaev





On Wed, Mar 2, 2011 at 1:43 PM, Cong Nguyen congnguye...@gmail.com wrote:
 Please be careful with the Otsu algorithm, because it uses only one threshold
 value for the whole image.

 No method is best for all cases :).

 You should run both the Otsu algorithm and an adaptive threshold algorithm
 and compare the results.

 The adaptive threshold algorithm can be based on the integral image
 (popularized by Paul Viola et al.) to improve performance.
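
To illustrate the integral-image idea, here is a small sketch in plain Python. It is one simple variant (compare each pixel with the mean of its neighbourhood), not necessarily the algorithm the poster had in mind; the function names and the `radius` and `ratio` defaults are assumptions for illustration. The integral image lets the sum over any window be read with four lookups, so the cost per pixel does not depend on the window size.

```python
def integral_image(img):
    """Summed-area table of `img` (a list of rows of grey values).

    The result has one extra row and column of zeros, so that
    ii[y][x] is the sum of all pixels above and to the left of (x, y).
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def adaptive_threshold(img, radius=1, ratio=0.85):
    """Binarize `img`: 0 (ink) where a pixel is darker than `ratio` times
    the mean of its (2*radius+1)-sized neighbourhood, 255 otherwise."""
    h, w = len(img), len(img[0])
    ii = integral_image(img)
    out = [[255] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        for x in range(w):
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            area = (y1 - y0) * (x1 - x0)
            # Window sum from four corner lookups in the integral image.
            window_sum = ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
            # Same as img[y][x] < ratio * (window_sum / area), without dividing.
            if img[y][x] * area < window_sum * ratio:
                out[y][x] = 0
    return out
```

Because the threshold follows the local mean, this kind of method copes better with uneven lighting than a single global value, at the price of artifacts near sharp shadow edges.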



 Cong.



 From: tesseract-ocr@googlegroups.com [mailto:tesseract-ocr@googlegroups.com]
 On Behalf Of Saurabh Gandhi
 Sent: Wednesday, March 02, 2011 3:34 PM
 To: tesseract-ocr@googlegroups.com
 Cc: Bikash Bag
 Subject: Re: image binarization



 You can use Otsu's binarisation algorithm:

 http://www.sas.bg/code-snippets/image-binarization-the-otsu-method.html

 --
 Regards,
 Saurabh Gandhi



 On Wed, Mar 2, 2011 at 1:45 PM, Bikash Bag bikash...@gmail.com wrote:

 Hi, I am new to OCR, can anyone please tell me a good image binarization
 algorithm.

 regards,
 bikash




Re: Customising Tesseract for character recognition

2011-02-24 Thread Dmitry Silaev
I don't know if it's affordable for you, but imho decent results can
only be achieved if you do the segmentation yourself and then pass image
fragments to Tesseract on a word-by-word basis. Problems may appear
when you have words that are too short; however, as far as I can see,
that's not your case.

A long time ago, I started my project relying on Tess's segmentation
and struggled a lot with it, until I arrived at a word-by-word approach.
Eventually I even switched to character-wise recognition, which at
last produces decent results. This transition was mostly caused by the
specifics of the input images I'm working on (photos, usually of low
quality), but I think it is almost required for ideally scanned
images too.

There are some fruitful math ideas behind Tess's segmentation, but I
think the current implementation is not mature enough to be used
extensively in production.

Warm regards,
Dmitry Silaev





On Thu, Feb 24, 2011 at 1:05 PM, Jose diox...@gmail.com wrote:
 Hi, (as you know, Saurabh, because we talked in private the other day) I tried
 PSM_SINGLE_COLUMN and the accuracy drops dramatically... I can't afford
 to lose that accuracy. Is it possible to change the way the output is
 displayed? Looking at the code, it seems rather hard to change... Perhaps I
 could print the x,y position of each word found and then work out the
 horizontal/vertical layout myself? What are your thoughts? Regards




Re: Customising Tesseract for character recognition

2011-02-24 Thread Dmitry Silaev
Unfortunately, not only can the text output order suffer from Tess's
segmentation, but the extents of some text fragments can also be
identified incorrectly (say, one segmented row can span two
real rows, possibly only partially), and that in turn can lead to
*completely* irrelevant recognition results.

However, you can run as many tests as possible on your images to
check that this is not the case, and hope that segmentation
errors won't be destructive and will only introduce this kind of
disorder. Then you can certainly use your (x,y)-sort method and be
happy ))
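
A possible shape of such an (x,y)-sort, as a sketch only: it assumes the word texts and bounding boxes have already been obtained from Tesseract somehow (that part is not shown), that words in a column are roughly left-aligned, and that columns are separated by more than `column_gap` pixels. The tuple layout, the function name and the gap value are all made up for illustration.

```python
def column_order(words, column_gap=50):
    """Reorder recognized words column by column, top to bottom.

    `words` is a list of (text, left, top, right, bottom) tuples in image
    coordinates (hypothetical layout). Words whose left edges lie within
    `column_gap` pixels of the previous one are put in the same column.
    """
    columns = []
    for word in sorted(words, key=lambda w: w[1]):          # by left edge
        if columns and word[1] - columns[-1][-1][1] <= column_gap:
            columns[-1].append(word)
        else:
            columns.append([word])                          # start a new column
    ordered = []
    for col in columns:
        ordered.extend(sorted(col, key=lambda w: (w[2], w[1])))  # by top, then left
    return [w[0] for w in ordered]


# Words as they might come out row by row across two columns.
words = [
    ("A1", 10, 10, 60, 30), ("B1", 200, 12, 250, 32),
    ("A2", 12, 50, 70, 70), ("B2", 198, 52, 260, 72),
]
print(column_order(words))
```

This only repairs the output order. If the segmentation has merged two real rows into one, the recognized text itself is wrong and no amount of sorting will fix it.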

Warm regards,
Dmitry Silaev





On Thu, Feb 24, 2011 at 1:50 PM, Jose diox...@gmail.com wrote:
 Dmitry, the recognition works; the only problem is the way it is parsing it...
 :S I think segmenting the images myself would be too painful! I only
 want to change the order that is displayed, or get the bounding boxes so I can
 know the x and y of each recognized word and thereby organise the results
 better myself! Don't you think it's a good approach?
 Thank you very much for your help




Re: Customising Tesseract for character recognition

2011-02-24 Thread Dmitry Silaev
The best way to explain everything would be just to send your source
image examples, describe what information you want to get from them
and provide the community with the code snippets you use to interface
with Tess. And please be as detailed as possible.

Warm regards,
Dmitry Silaev





On Thu, Feb 24, 2011 at 2:17 PM, Jose diox...@gmail.com wrote:
 In my particular case, the first word of each column is in one font and the
 rest are in another, so instead of reading column by column it reads all the
 columns of the first row and then all the columns of the second row! My god,
 it is really hard to explain in English. I get an accurate result (90%), but
 what I get is the concatenation of column 1 and column 2! I'm trying my best
 to understand the OCR, but it's really hard for me as I don't have any OCR
 background. I don't see any other approach than printing where each word was
 read and trying to post-process all the results afterwards. Please correct me
 if I'm wrong or if you see improvements that can be made.
 Please excuse my bad English.

 regards,
 jose




Re: [Tesseract 3] English training text

2011-02-22 Thread Dmitry Silaev
Interesting. I have been wondering about Cube since its traces began to
appear in the source code, but have not had enough time to investigate it
thoroughly.

Zdenko, would you kindly share your other findings on Cube?

Regards,
Dmitry

On Tue, Feb 22, 2011 at 11:13 AM, zdenko podobny zde...@gmail.com wrote:
 I doubt that Google will release their (full) training set :-(
 Have a look in svn at the file eng.cube.size [1]. There you can see the names
 of the fonts that were used for training English in 3.01. As far as I
 understand, there is an (unpublished/unreleased) possibility to train language
 data directly on font files. Unfortunately there are no details on the Cube
 part of training.
 Zd.
 [1] 12,4Mb! http://code.google.com/p/tesseract-ocr/source/browse/trunk/tessdata/eng.cube.size
 On Wed, Feb 9, 2011 at 5:48 PM, Sly_bzh sl...@laposte.net wrote:

 I would like to train tesseract for English with some special fonts.
 Tesseract training documentation says that a text should be prepared
 and it must follow some important points (see

 http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Generate_Training_Images)

 Could someone provide to the community the content of a good and
 efficient text for english training ?

 Note: I think it could be useful to provide the texts that were
 used to build the training files that can be downloaded in the
 Download section
 (http://code.google.com/p/tesseract-ocr/downloads/list).
 What do you think about that?

 Thanks !




Re: problem in single word recognition

2011-02-22 Thread Dmitry Silaev
I may not have understood you fully, but here is the relevant excerpt from
baseapi.h:

"Each SetRectangle clears the recognition results so multiple rectangles
can be recognized with the same image."

Indeed, SetRectangle() calls ClearResults(), which "deletes the pageres and
clears the block list ready for a new page".

As of 3.01, reconize_a_word() is obsolete and has been refactored out.
Regards,
Dmitry

On Feb 18, 9:11 pm, Jacob George jg1...@gmail.com wrote:
 Hi,

 I am using Tesseract 3.0 for my project, which requires real-time text
 recognition from a web camera video stream.
 I only need to recognize a single word or character after finding lines. In
 my code the function findlines() is called first (after setting the image
 using SetImage), and from it the TBOX of the word to be recognized is
 found. Using the coordinates of the bounding box, the setrectangle
 function is called and the text is recognized.
 But I face a few problems:
 - If small words (such as are, is, be etc.) are taken, they are either
   recognized wrongly or not recognized at all. But if I give the bounding box
   of the same line, these words are recognized accurately. Why is that?
 - Does setrectangle clear all the results of findlines()?
 - I came across a function reconize_a_word to recognize a single word. Does
   reconize_a_word recognize the whole page and return only the text of the
   target box, or does it recognize only the word given in the target box?

 thanks in advance,

 jacob




Re: Adaptive Data

2011-02-21 Thread Dmitry Silaev
Hi Zvezdoslav,

Check out the code of the Classify::EndAdaptiveClassifier() and
Classify::InitAdaptiveClassifier() methods.
Also search for classify_use_pre_adapted_templates and
classify_save_adapted_templates

HTH

Regards,
Dmitry

On Feb 16, 4:50 pm, Zvezdoslav Kunov z.ku...@gmail.com wrote:
 Hi all, I'm using tesseract api v3.0.1 under Linux.

 Does anybody know a way to save/load the adaptive data that tesseract
 accumulates while running?




Re: Image pre-processing for good OCR results

2011-02-20 Thread Dmitry Silaev
Jon,

I don't know if it's intended, but all your image links report
"We're sorry. The page you tried to access is not available." So
nothing can be advised on your issue yet...

Warm regards,
Dmitry Silaev





On Mon, Feb 21, 2011 at 5:02 AM, Jon Andersen jande...@gmail.com wrote:
 Hi,
 My project at http://RecordAGrave.com is about recording headstones from
 graves and posting the text and images on the Net so that people can
 research their family history.  I would appreciate some advice on how to
 pre-process these headstone images to get the best results from Tesseract
 OCR.  I have thousands of 1-2 MB jpg images of headstones to process.
 Example images:
 http://freepages.genealogy.rootsweb.ancestry.com/~janderse/cemeteries/Star%20of%20David%20Memorial%20Gardens/Garden%20of%20Haifa%20-%20Raw/IMG_28215.jpg
 http://freepages.genealogy.rootsweb.ancestry.com/~janderse/cemeteries/Star%20of%20David%20Memorial%20Gardens/Garden%20of%20Haifa%20-%20Raw/IMG_28216.jpg
 http://freepages.genealogy.rootsweb.ancestry.com/~janderse/cemeteries/Star%20of%20David%20Memorial%20Gardens/Garden%20of%20Haifa%20-%20Raw/IMG_28217.jpg
 I am a software developer so I can script up pre-processing steps to prepare
 the input for Tesseract.
 Any advice on improving OCR accuracy through pre-processing steps?
 Thanks so much,

 -Jon




Re: Wrappers for tessearct3.01?

2011-02-15 Thread Dmitry Silaev
devTess,

I wouldn't ask questions like this, as Tess is in transition from the
old code base and new features are under heavy development. I don't have enough
time to investigate, but the prev_word_best_choice_ data member seems to be
related to the best segmentation search based on the language model.

Instead of rummaging in Tess's guts, I'd rather use the pretty convenient,
high-level interface provided by ResultIterator (see GetIterator() in
baseapi.h and then read all the comments in resultiterator.h and
pageiterator.h).

Warm regards,
Dmitry Silaev




On Wed, Feb 16, 2011 at 5:34 AM, devTess jim...@googlemail.com wrote:

 Question:
 where can I find out more about (see below)

 tesseract_->prev_word_best_choice_

 What is the purpose of doing that?
 Why is it not sufficient just to do

 page_res_ = new PAGE_RES(block_list_);

 Thank you.

 int TessBaseAPI::RecognizeText(ETEXT_DESC* monitor) {
   if (tesseract_ == NULL)
     return -1;
   if (page_res_ != NULL)
     delete page_res_;

   block_list_ = FindLinesCreateBlockList();

   tesseract_->SetBlackAndWhitelist();
   recognition_done_ = true;

   page_res_ = new PAGE_RES(block_list_,
                            tesseract_->prev_word_best_choice_);

   // Now run the main recognition.
   tesseract_->recog_all_words(page_res_, monitor, NULL, NULL, 1);

   return 0;
 }




Re: Provide/visualize baseline info?

2011-02-08 Thread Dmitry Silaev
Sriranga,

I'm glad you've succeeded.

Thanks to Zdenko for his guiding thought. I was aware of this Tess's
debugging capability but also strangely kept overlooking it.

I think the empty test1.txt is mostly a normal situation. I noticed this
fact also, but as I can remember there were times when it was filled with
recognized text. Probably it depends on actions you perform during the
ScrollView session or their specific sequence - I really have no time to
investigate.

Probably I'll publish more tutorials on what can be done using ScrollView.
But for my own needs this tool adds almost no value ((

*** I'm still seeking somebody's help regarding this topic's subject. ***

Warm regards,
Dmitry Silaev




2011/2/8 Sriranga(78yrsold) withblessi...@gmail.com

 Dmitry,

 Congratulations! I successfully installed it on WinXP and tried it using
 phototest.tif.
 1st command line: tesseract phototest.tif test1 segdemo inter   works well
 2nd command line: tesseract phototest.tif test1 matdemo inter   works well
 However, I observed that the output test1 is zero KB. Where did I make a
 mistake?
 I have just tested it with Kannada script as well as Khem script: it displayed
 correctly, and only the text file is 0 KB.

 It would be nice if screenshots of the Modes, Display and other menus were
 provided for the benefit of newbies/users; this would let them view their
 own language other than English. It really is a boon to newbies like me.

 It would be better to have an extract of your blog published in the wiki
 section by the owner, for the benefit of the forum's users.


 With Warmest Regards,
 -sriranga(78yrs)



Re: Wrappers for tessearct3.01?

2011-02-08 Thread Dmitry Silaev
devTess, be careful with the coffee, don't overdose ))

 Q1
 Init(datapath, language, OcrEngineMode);
 What is the normal setting of OcrEngineMode?
Currently OcrEngineMode = OEM_TESSERACT_ONLY would be sufficient for all cases.

 Q2: which of the following is USED in the normal running mode of
 tesseract.exe to recognize text
The variables you see tested within the code of Recognize()
(e.g. tesseract_->tessedit_resegment_from_boxes) usually get their values
from config files. Recognition usually runs with no config files at all, so
you can assume all these variables are false. That way you can trace the
control paths and figure out which procedures get called at the
recognition stage.
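
To make that reasoning concrete, here is a small Python sketch (not
Tesseract code) that mirrors only the branch structure of the Recognize()
excerpt quoted further down in this message. The function name
recognize_path and the returned labels are made up for illustration; the
flag names are the ones from the excerpt. With every flag left at false it
shows which two calls remain.

```python
def recognize_path(flags=None):
    """Return the calls a run would reach, given the config flags that are set.

    Models only the if/else chain quoted from Recognize(); it is an
    illustration and not part of Tesseract.
    """
    flags = flags or {}

    def on(name):
        # Flags that were never loaded from a config file count as false.
        return bool(flags.get(name, False))

    calls = []
    # Part ONE: how page_res_ gets built.
    if on("tessedit_resegment_from_line_boxes"):
        calls.append("ApplyBoxes(line boxes)")
    elif on("tessedit_resegment_from_boxes"):
        calls.append("ApplyBoxes(boxes)")
    else:
        calls.append("new PAGE_RES")
    if on("tessedit_make_boxes_from_boxes"):
        calls.append("CorrectClassifyWords")
        return calls  # the quoted code returns early here

    # Part TWO: what is done with page_res_.
    if on("interactive_mode"):
        calls.append("pgeditor_main")
    elif on("tessedit_train_from_boxes"):
        calls.append("ApplyBoxTraining")
    elif on("tessedit_ambigs_training"):
        calls.append("recog_training_segmented")
    else:
        calls.append("recog_all_words")
    return calls


print(recognize_path())
```

Passing e.g. {"tessedit_train_from_boxes": True} shows how a single config
variable redirects the run into the training branch.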

 Q3: which of the following is USED in the normal running mode of
 tesseract.exe to recognize text
I guess you meant "to train" here (a copy-paste slip). Training is a
2-stage process:

  1) Making box files. Requires two config files: batch.nochop and makebox

  2) Generation of .tr files. Needs nobatch and box.train

You can find the above configs inside the tessdata/configs and
tessdata/tessconfigs directories in Tess's distribution. Check these
files and you'll understand what usually happens while training. Plain
old step-by-step debugging is also of use ))
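
As a rough sketch of the two stages, the helpers below only build the
command lines. The config names are the ones listed above; the executable
name, the "image outputbase configs..." argument order and the file names
are assumptions for illustration, so check them against your own
installation before running anything.

```python
def box_stage_command(image, output_base):
    """Stage 1: ask Tess to write a box file for the training image."""
    return ["tesseract", image, output_base, "batch.nochop", "makebox"]


def tr_stage_command(image, output_base):
    """Stage 2: generate the .tr file from the image and its edited box file."""
    return ["tesseract", image, output_base, "nobatch", "box.train"]


# Hypothetical file names, for illustration only.
for argv in (box_stage_command("font.tif", "font"),
             tr_stage_command("font.tif", "font")):
    print(" ".join(argv))
```

Each list can be handed to subprocess.call() as is; the box file produced
by stage 1 is meant to be corrected by hand before stage 2 runs.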

Warm regards,
Dmitry Silaev




On Tue, Feb 8, 2011 at 6:44 PM, devTess jim...@googlemail.com wrote:

 Hi Dmitry, with the guidelines you provided, I prepared a strong
 cup of coffee and started reading the top part of baseapi.h

 Q1
 Init(datapath, language, OcrEngineMode);
 What is the normal setting of OcrEngineMode?

 I am trying to use the Recognize(ETEXT_DESC* monitor) method.
  There are two PARTS to the Recognize method

 Part ONE:
 Q2: which of the following is USED in the normal running mode of
 tesseract.exe to recognize text

   if (tesseract_->tessedit_resegment_from_line_boxes)
     page_res_ = tesseract_->ApplyBoxes(*input_file_, true, block_list_);
   else if (tesseract_->tessedit_resegment_from_boxes)
     page_res_ = tesseract_->ApplyBoxes(*input_file_, false, block_list_);
   else
     page_res_ = new PAGE_RES(block_list_,
                              tesseract_->prev_word_best_choice_);  // My guess
   if (tesseract_->tessedit_make_boxes_from_boxes) {
     tesseract_->CorrectClassifyWords(page_res_);
     return 0;
   }

 Part TWO:
 Q3: which of the following is USED in the normal running mode of
 tesseract.exe to recognize text
   if (tesseract_->interactive_mode) {
     tesseract_->pgeditor_main(rect_width_, rect_height_, page_res_);
     // The page_res is invalid after an interactive session, so cleanup
     // in a way that lets us continue to the next page without crashing.
     delete page_res_;
     page_res_ = NULL;
     return -1;
   } else if (tesseract_->tessedit_train_from_boxes) {
     tesseract_->ApplyBoxTraining(*output_file_, page_res_);
   } else if (tesseract_->tessedit_ambigs_training) {
     FILE *training_output_file =
         tesseract_->init_recog_training(*input_file_);
     // OCR the page segmented into words by tesseract.
     tesseract_->recog_training_segmented(
         *input_file_, page_res_, monitor, training_output_file);
     fclose(training_output_file);
   } else {
     // Now run the main recognition.
     tesseract_->recog_all_words(page_res_, monitor, NULL, NULL, 0);  // My guess
   }




Re: Provide/visualize baseline info?

2011-02-06 Thread Dmitry Silaev
Here are brief instructions on how to set up the Tesseract interactive
debug environment (ScrollView) on Windows:

   1. Make sure you have the Java Runtime Environment installed
   2. Download my home-brewed installation suite, packed as a single
   archive, from http://www.4shared.com/get/Z4gnbJdP/tess_debug.html
   3. Unpack the installation suite
   4. Run cmd.exe
   5. Change the working directory to where you've unpacked the
   installation suite
   6. Follow the instructions in
   http://code.google.com/p/tesseract-ocr/wiki/ViewerDebugging to run
   Tesseract+ScrollView from the command line

To keep this post to a reasonable size here in Google Groups, I placed the
more verbose and overall nicer looking instructions in my blog at
http://rdaemons.blogspot.com/2011/02/tesseract-ocr-setting-up-interactive.html

Warm regards,
Dmitry Silaev


2011/2/6 Sriranga(78yrsold) withblessi...@gmail.com

 Dear Dmitry,
 Though it may or may not help me much, at least it will benefit the
 users of tesseract-ocr, for which the users of the forum/newbies will be
 thankful to you.
 With Warmest Regards,
 -sriranga(78yrs)

 On Sun, Feb 6, 2011 at 1:47 AM, Dmitry Silaev daemons2...@gmail.com wrote:

 Dear Sriranga,

 I've just managed to start Tess's interactive visualizer. I don't
 really know if it will help you much, but I can publish step-by-step
 instructions on how to make it work. At least these instructions may help
 some newbies in the Tess community. Most likely, I'll be able to publish
 them within the next 24 hours.

 However, it's not a workable solution for me. I am still in desperate need
 to know whether I can provide Tess with my own baseline info using some
 high-level structures and methods. Any information you may have on this
 subject is welcome.

 Warm regards,
 Dmitry Silaev




 2011/2/5 Sriranga(78yrsold) withblessi...@gmail.com

 I tried to install on WinXP but failed. An extract of the cmd session is
 reproduced below for further guidance, please.
 C:\>set JAVA_HOME=C:\jdk1.4

 C:\>.\build.bat all (win32)
 '.\build.bat' is not recognized as an internal or external command,
 operable program or batch file.

 C:\>
 C:\>j:

 J:\tesseract-ocr-3.01alpha-r527\java>.\build.bat all (win32)
 Piccolo Build System
 ---
 Building with classpath
 C:\jdk1.4\lib\tools.jar;.\lib\ant.jar;.\lib\junit.jar;
 Starting Ant...
 The system cannot find the path specified.

 J:\tesseract-ocr-3.01alpha-r527\java>
 J:\tesseract-ocr-3.01alpha-r527\java>

 Could you kindly tell me where I made a mistake?
 with warmest regards,
 -sriranga(78yrs)


 On Sat, Feb 5, 2011 at 7:28 PM, Sriranga(78yrsold) 
 withblessi...@gmail.com wrote:

 As per the wiki instructions on debug mode: "On Windows: The build process
 for building ScrollView.jar is not defined. Instead copy piccolo-1.2.jar
 and piccolox-1.2.jar to tesseract/java", which appears to be prescribed
 for tesseract 2.04.
 Will copying piccolo-1.2.jar and piccolox-1.2.jar to the tesseract/java
 folder of tesseract-3.01alpha (r527) work? For this purpose, does
 picolo.java1.2 (compiled source, 4.3 MB) have to be downloaded for WinXP?
 Kindly confirm, since I am not a programmer/developer.
 With Regards,
 -Sriranga(78yrs)


 2011/2/5 Zdenko Podobný zde...@gmail.com

  I am not sure if it helps you, but did you try debug mode (
 http://code.google.com/p/tesseract-ocr/wiki/ViewerDebugging)?

 Zd.


 On 05.02.2011 01:33, daemon-s wrote:

 Hi!

 I train Tess using separate images for every text line. Recognition is
 also run over single text line images. Recognition performs pretty
 well, however there are many errors that, I believe, are related to
 misdetected baselines; whether during training or recognition, I don't know.
 These include:

  (double quote) detected as n
 S detected as s (and vice versa)
 V detected as v (and vice versa)
 etc.

 Is there any (preferably high-level) way to provide Tess with baseline
 info? Or at least obtain baseline info from Tess in order to visualize
 it further for debugging?

 Thanks,
 Dmitry



Re: Tesseract Training

2011-01-24 Thread Dmitry Silaev
Dear Sochenda,

Glad you have succeeded in training for Khmer and thanks for your kind
words.

Could you please share with us the images and .box files you used for
training? Also some sample input images and respective recognition results
would be of much use.

Sriranga, I see your *training* process is doing pretty well. Most of your
problems are in the dictionary facility. However, I do not feel proficient
in this field. I know how it works (to be exact, how it *should* work) and I
understand the theoretical basis behind it, but I have avoided using it as
much as I could. When I was getting ready to use Tess in my project, I read
much of the tesseract-XXX groups and understood that the dictionary facility
is far from perfect; at least, it's not ready to use yet. Fortunately, my
project involves a lot of image processing, and the specifics of my task
imply block/line/letter segmentation, so I managed to stay away from most
of Tess's dubious parts and used it solely as a raw classifier. And I think
classification is what Tess does quite well.

Unfortunately, I think you will struggle a lot with various inconsistencies
and cryptic errors, but I still think it's worth it. You should report every
error to the team and wait until it's fixed, while trying to find your way
around it in the meantime. Or you can drop the dictionary facility and rely
completely on some home-brewed post-processing. If you choose this, your
problem turns into a small R&D project, so you will need to find appropriate
people to do the job.

Warm regards,
Dmitry Silaev




Re: Tesseract Training

2011-01-20 Thread Dmitry Silaev
Dear Sochenda,

Please provide us with every file you use to make up your traineddata.
We also need all the command lines with which you run Tess and the Tess
tools. Please be as detailed as possible.

An internship is a good opportunity for everyone here; I'd probably try to
apply as well, but I'm not much of a recent graduate anymore ((

Warm regards,
Dmitry Silaev




Re: Tesseract Training

2011-01-18 Thread Dmitry Silaev
Dear Sochenda,

In addition to what Sriranga said, I'd remind you that you will have to do
a lot of manual work:

In pyTesseractTrainer, check that no bounding box (BB) cuts through a
glyph; if one does, correct its coordinates manually.

Where BBs overlap, you should space out the participating glyphs in the
training image (see the attached picture for examples).

You should use manual spacing if the participating glyphs are dependent
characters (in your language, vowels) and the number of possible
combinations is practically uncountable. Then you would assign every glyph
its own code. Tess would treat these glyphs as separate characters, and
you would post-process the resulting code sequence to obtain a well-formed
dependent Unicode pair (or triplet).

If there are only a few such combinations, you can merge their BBs into
one that encompasses all the required glyphs and assign a single code to the
entire glyph combination. Then, during post-processing, you'll need to
replace this single code with a predefined dependent Unicode pair.
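
The overlap check and the BB merge can be sketched in a few lines of
Python. This assumes box-file lines of the form "label left bottom right
top page"; the Box type, the function names and the sample coordinates are
hypothetical, and "k001" simply stands for whatever single code you assign
to the merged combination.

```python
from collections import namedtuple

# One box-file entry; coordinates as they appear in the file.
Box = namedtuple("Box", "label left bottom right top")


def parse_box_line(line):
    """Parse 'label left bottom right top [page]' into a Box (page is ignored)."""
    fields = line.split()
    label = fields[0]
    left, bottom, right, top = (int(v) for v in fields[1:5])
    return Box(label, left, bottom, right, top)


def overlaps(a, b):
    """True if the two rectangles share any interior area."""
    return (a.left < b.right and b.left < a.right and
            a.bottom < b.top and b.bottom < a.top)


def merge(a, b, label):
    """Smallest box that encloses both a and b, carrying one new label."""
    return Box(label,
               min(a.left, b.left), min(a.bottom, b.bottom),
               max(a.right, b.right), max(a.top, b.top))


base = parse_box_line("a 10 10 20 30 0")
mark = parse_box_line("b 18 25 28 40 0")
if overlaps(base, mark):
    print(merge(base, mark, "k001"))
```

Boxes that merely touch along an edge are not reported as overlapping,
which seems the more useful behaviour for tightly set text.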

Hope I've managed to express myself clearly.

Warm regards,
Dmitry Silaev


attachment: figure01.GIF

Re: Tesseract Training

2011-01-17 Thread Dmitry Silaev
Dear Sochenda,

I've checked the Unicode table range you've sent and now I see what the
problem is. I agree that in such an algorithmic writing system (as opposed
to simpler positional systems like, say, Roman or Cyrillic) pre- and
post-processing stages are inevitable.

I'd suggest making special hand-crafted or generated training images. In
these images you would properly space out all the joint character
combinations, as well as the character components that can make up Khmer
characters. Then you would edit the resulting box files to assign codes
according to your coding system. This process should be repeated as many
times as required to reach a sample count of 15-20 for every glyph.
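
A quick way to see which glyphs still lack samples is to count labels over
all the box files. This is only a sketch: it assumes the label is the first
whitespace-separated field of each box-file line, and the names, sample
lines and codes (k001 and so on) are made up for illustration.

```python
from collections import Counter


def sample_counts(box_lines):
    """Count how many boxes carry each label (first field of a box-file line)."""
    counts = Counter()
    for line in box_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


def undersampled(counts, expected_labels, minimum=15):
    """Labels from expected_labels seen fewer than `minimum` times, sorted."""
    return sorted(label for label in expected_labels if counts[label] < minimum)


lines = ["k001 10 10 20 30 0"] * 15 + ["k002 30 10 40 30 0"] * 3
counts = sample_counts(lines)
print(undersampled(counts, ["k001", "k002", "k003"]))
```

Passing the full list of expected codes matters: a glyph that never made it
into any training image shows up with a count of zero instead of being
silently absent.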

At the recognition stage, if Tess is trained properly, overlapping bounding
boxes are not a problem for it. My experience shows that it is very
inventive in character segmentation even in cases of BB overlap. So I hope
you will have no severe difficulties with partially overhanging or
underlying glyphs.

Your post-processor should be able to decode the recognition output using an
algorithmic approach to form well-formed Unicode characters. You can also
use some Khmer bigram or trigram statistics for error correction. You may
want to play around with Tess's dictionary facility, but I doubt it would be
helpful in your case.
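
One very simple form of bigram-based correction is sketched below. It is an
illustration under stated assumptions, not a tested recipe: the confusion
table and the counts are invented, and Latin letters stand in for Khmer
because no real statistics appear in this thread. For every character with
known look-alikes, it picks the variant that forms the most frequent bigram
with the previous output character.

```python
def correct_with_bigrams(text, confusions, bigram_counts):
    """Resolve known confusions using bigram frequencies.

    confusions maps a character to the variants it may really be;
    bigram_counts maps two-character strings to corpus counts.
    Ties keep the character that the recognizer actually produced.
    """
    out = []
    for ch in text:
        variants = confusions.get(ch, [ch])
        prev = out[-1] if out else ""
        best = max(variants,
                   key=lambda v: (bigram_counts.get(prev + v, 0), v == ch))
        out.append(best)
    return "".join(out)


# Toy data: 'S' and 's' are easily confused; counts are invented.
confusions = {"S": ["S", "s"], "s": ["s", "S"]}
bigram_counts = {"as": 50, "aS": 1}
print(correct_with_bigrams("paSs", confusions, bigram_counts))
```

A greedy left-to-right pass like this can propagate an early mistake; with
real statistics, a trigram model or a search over whole words would be the
sturdier choice.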

Dmitry




Re: Tesseract Training

2011-01-16 Thread Dmitry Silaev
Dear Sochenda,

I'm not sure what the ultimate goal of your code assignment is, but the
formal answer to your question is yes. You can assign "k001" or "k002" to a
bounding box in a .box file. Moreover, you can assign any UTF-8 encoded
character sequence. In Tess version 3.0x (current) the only restriction is a
24-byte limit on the length of the entire character sequence. This also
allows you to use not only an abstract code like "k001" but a meaningful
character sequence from your real language (e.g. the well-known "fi"
ligature in some Latin fonts), which then relieves you of the pre- and
post-processing.
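
Since the limit counts bytes and not characters, it is worth checking
labels before training. A minimal check, assuming the limit is inclusive
(a label of exactly 24 bytes is accepted); the function names and sample
labels are made up for illustration:

```python
MAX_LABEL_BYTES = 24  # the limit mentioned above for Tess 3.0x


def label_byte_length(label):
    """Length of the label once encoded as UTF-8."""
    return len(label.encode("utf-8"))


def label_fits(label):
    """True if the label stays within the byte limit (assumed inclusive)."""
    return label_byte_length(label) <= MAX_LABEL_BYTES


for label in ("k001", "\ufb01", "\u1780" * 8, "\u1780" * 9):
    print(label_byte_length(label), label_fits(label))
```

Khmer letters take 3 bytes each in UTF-8, so a label built from real
characters runs out of room after 8 of them, while an ASCII code like
"k001" costs only 4 bytes.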

If you still prefer using abstract codes, then the pre-/post-processing can
be done without tinkering with Tess's code. Since both training and
recognition produce output files, you can develop a couple of
file-processing command-line utilities, which can then be used along with
calls to the Tesseract executable within shell scripts (or .bat files on
Windows).
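
A post-processing utility of that kind can be very small. The sketch below
assumes codes of the form "k" plus three digits, as in your example; the
code table is hypothetical (the Khmer sequences are arbitrary placeholders)
and would come from your own coding system.

```python
import io
import re

# Hypothetical code table: abstract code -> real character sequence.
CODE_TABLE = {
    "k001": "\u1780\u17d2\u1781",
    "k002": "\u1780\u17b6",
}

CODE_PATTERN = re.compile(r"k\d{3}")


def decode(text, table=CODE_TABLE):
    """Replace every known abstract code with its character sequence.

    Unknown codes are left untouched so that they are easy to spot.
    """
    return CODE_PATTERN.sub(lambda m: table.get(m.group(0), m.group(0)), text)


def run_filter(infile, outfile, table=CODE_TABLE):
    """Copy infile to outfile line by line, decoding codes on the way."""
    for line in infile:
        outfile.write(decode(line, table))


# Demo on in-memory streams; a real utility would pass sys.stdin and
# sys.stdout, or files opened with encoding="utf-8".
source = io.StringIO("k001 k002\nk999\n")
target = io.StringIO()
run_filter(source, target)
print(target.getvalue() == "\u1780\u17d2\u1781 \u1780\u17b6\nk999\n")
```

Wrapped in a script, this slots into a .bat file or shell script right
after the call to the Tesseract executable, reading the .txt file it
produced.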

For further details you should definitely study the TrainingTesseract3 and
ReadMe (section "Installation Notes - Tesseract 3.00") documents
thoroughly (
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 and
http://code.google.com/p/tesseract-ocr/wiki/ReadMe). These documents are
not easy to search, but they contain all the info you might need.

Warm regards,
Dmitry Silaev




On Sun, Jan 16, 2011 at 10:42 AM, KHEM Sochenda khemsoche...@gmail.com wrote:


 Dear Dmitry,

 Thank you very much for the comprehensive explanation.
 To go straight to the point: does it sound OK to assign a code like 'k001'
 or 'k002' to a glyph obtained from Tesseract's segmentation?

 For post-processing that touches the Tesseract code, could you please point
 out which files I should modify? Please also advise me whether the latest
 version of Tesseract will do.

 Thank you very much in advance for your time and response.

 Best Regards,

 Sochenda



Re: Tesseract Training

2011-01-14 Thread Dmitry Silaev
Chenda,

In fact, Tesseract doesn't care whether you train it on a real language's
letter, or which language that letter belongs to. Put simply, Tess only
saves the mapping from the feature sets obtained in training to Unicode
ids. This implies that during training you can assign virtually any
character code to virtually any glyph (to be exact, to a connected component
or to a set of connected components).

If your language's script comprises a reasonable number of joint
character combinations, then while training you can assign every such
combination a predefined Unicode id (some restrictions apply). Later, when
running recognition, you should do some post-processing to decode your
predefined ids into the real language's character sequences.

For good results, all this requires you to develop a training file
pre-processor (mapping: language char combinations -> provisional ids) and a
recognition result post-processor (mapping: provisional ids -> language char
sequences). I'm not sure, but this may also require correcting the character
property bit masks in the unicharset file (I don't know exactly how this
information is used by Tess, as I don't need it in my project).
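
The two mappings could look roughly like this. Using code points from the
Unicode Private Use Area as provisional ids is my own pick for this sketch;
nothing here has been tried with Tess, and the sample combinations are
arbitrary placeholders. It also assumes the combination list is non-empty
and free of duplicates.

```python
import re

PUA_START = 0xE000  # first code point of the Unicode Private Use Area


def build_tables(combinations):
    """Give every character combination its own one-character provisional id."""
    to_id = {combo: chr(PUA_START + i) for i, combo in enumerate(combinations)}
    from_id = {pid: combo for combo, pid in to_id.items()}
    return to_id, from_id


def encode(text, to_id):
    """Pre-processor: language char combinations -> provisional ids."""
    if not to_id:
        return text
    # Longest combinations first, so a short one never pre-empts a longer one.
    ordered = sorted(to_id, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(c) for c in ordered))
    return pattern.sub(lambda m: to_id[m.group(0)], text)


def decode(text, from_id):
    """Post-processor: provisional ids -> language char sequences."""
    return "".join(from_id.get(ch, ch) for ch in text)


to_id, from_id = build_tables(["\u1780\u17d2\u1781", "\u1780\u17d2"])
original = "\u1780\u17d2\u1781\u17b6"
encoded = encode(original, to_id)
print(decode(encoded, from_id) == original)
```

Keeping both tables generated from one list guarantees that decoding is the
exact inverse of encoding, which is what makes the round trip safe.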

Warm regards,
Dmitry Silaev




On Fri, Jan 14, 2011 at 10:25 AM, KHEM Sochenda khemsoche...@gmail.com wrote:

 Dear Tesseract Team,

 In the step of training a new language, we have to assign a Unicode value
 to each box.
 I would like to know what to do if a shape is composed of several Unicode
 characters.
 Is there any way to assign only an id to each box in Tesseract?

 Thank you very much in advance for your response.

 Best Regards,
 Chenda




Re: Can't get the user dictionary to work

2010-07-30 Thread Dmitry Silaev

 On the plus side, it turns out that there are functions buried in the
 code to serialise/deserialise the classifier state, so it might be
 useful to run a whole corpus of short images through tess in one
 batch, save the state, and load that at startup.


Could you please be more specific about your findings: which functions are
these, and what do they do? I think it might be of interest to many
subscribers...

Thanks,
Dmitry




Re: tesseract output correction gear

2010-07-13 Thread Dmitry Silaev
there was a minor bug which prevented display of magnified textline images
in the viewport after save
now it's fixed
eh, development version as i said

On Tue, Jul 13, 2010 at 3:36 PM, Jimmy O'Regan jore...@gmail.com wrote:

 On 13 July 2010 11:55, daemon-s daemons2...@gmail.com wrote:
  Please check out this little tool at
 http://www.c2scan.com/pokupedia.ru/index.php?r=site/ocr
 
  As an output it produces an XML file containing bounding box
  descriptions and character mappings. For the current demo image anyone
  can make corrections and save. These corrections will be visible to
  successive visitors as well.
 

 Crowd sourced OCR training? Cool :)

  It is going to be a part of my new project pokupedia.ru. Since this is
  a development version bugs and inconsistencies may exist. Currently
  works in Firefox only. Sometimes the web site may be down as I use my
  home computer as a web server.
 
  If this tool gains interest, I will probably make it available to the
  public domain. Until then your feedback is expected - bug reports and
  improvement proposals are welcome.
 

 This is something that Distributed Proofreaders might be interested
 in; you might get more testers if you mention it on the forums there.

  Regards,
  Dmitry
 
 
 



 --
 Leftmost jimregan, that's because deep inside you, you are evil.
 Leftmost Also not-so-deep inside you.



