Re: How to detect inverted image in a picture
This parameter controls the maximum allowed length of a connected component's contour (a blob outline, in Tesseract's terms). I suspect Tess decided that the dominant color of the text is white, and hence it could not construct all the blobs' outlines until you raised the value of edges_maxedgelength (the outer outline of the big white blob is very long). One of Tess's notable features is that it can handle inverted text, but for that it has to be able to get all the outlines, and you've helped it to achieve this. Warm regards, Dmitry Silaev

On Wed, Mar 16, 2011 at 10:57 AM, Ice Head iceh...@gmail.com wrote: After several tries, I found a solution by changing the parameter edges_maxedgelength. Do you know what this parameter does? Thanks, Ice

On Mar 9, 2:12 pm, Ice Head iceh...@gmail.com wrote: Hi, I'm using Tesseract 3.01 and failed to read simple images like this one (see link below): https://docs.google.com/viewer?a=vpid=explorerchrome=truesrcid=0B1... Is there a way to read this kind of picture?

-- You received this message because you are subscribed to the Google Groups tesseract-ocr group. To post to this group, send email to tesseract-ocr@googlegroups.com. To unsubscribe from this group, send email to tesseract-ocr+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.
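The polarity guess discussed in this thread can be illustrated with a small sketch. This is not Tesseract's code. It only assumes that the image is already binarized (1 = dark pixel, 0 = light pixel) and that text usually covers less area than its background, so a page where dark pixels dominate is treated as light-on-dark and flipped before OCR. The names is_inverted and normalize_polarity are made up for this example.

```python
def is_inverted(binary):
    """Guess polarity: True when dark pixels (1) outnumber light pixels (0).

    Assumption: text covers less area than background, so a mostly dark
    image is probably light text on a dark background.
    """
    total = sum(len(row) for row in binary)
    dark = sum(sum(row) for row in binary)
    return dark * 2 > total


def normalize_polarity(binary):
    """Return a copy of the image with dark text on a light background."""
    if is_inverted(binary):
        return [[1 - p for p in row] for row in binary]
    return [row[:] for row in binary]
```

Pre-inverting such an image yourself is one way to avoid depending on how long the outline of the big background blob is allowed to be; whether it helps on a given picture has to be checked by experiment.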
Re: Especial Characteres
Manuel, I'm afraid just chaining command-line tools won't help in this case. I'm talking about programming. And yes, I have solved many practical problems in layout analysis and other fields of document image processing, and succeeded at it )) Warm regards, Dmitry Silaev

On Mon, Mar 14, 2011 at 7:55 AM, manuel...@gmail.com wrote: What would you recommend for splitting the columns? I think I will need to scan with Tesseract column by column, and after that merge the results to make correct rows. Can you point me in a direction? What tools (Unix-compatible) can I use to tell Tesseract to scan a specific column? Later I will recompile to test, but first I need to find a way to scan these reports correctly, to generate CSV files to import later into a database. If it works I will spend more time tuning Tesseract. Have you ever done this before (scanning reports using Tesseract or other tools to generate CSV files)? Thanks

On 13/03/2011, at 11:20, Dmitry Silaev wrote: Running via ports can cause diverse errors. Try to compile Tesseract natively. I use revision 549 and, as I said, it works fine. Tables like yours present a challenge for simple layout processing algorithms because the text is sparsely located. A minimal skew, which is almost inevitable, could break all the logic. In such cases I prefer to devise custom-made segmentation logic specific to the document type being processed. This way I do not depend on Tesseract's segmentation: Tesseract is used as a raw classifier. Warm regards, Dmitry Silaev

On Sun, Mar 13, 2011 at 4:47 PM, manuel...@gmail.com wrote: I'm using the latest version, tesseract @3.00_2+eng, which I installed using ports on Mac OS X. Another question, Dmitry, about this sample: why doesn't Tesseract recognize a complete row? It's not perfectly aligned, but it is impossible to get an image 100% aligned. Tesseract is breaking columns into new lines, like: 1 test productA 2 test2 productB Do you know how to fix it? Regards, Manuel Pardo

On 13/03/2011, at 08:32, Dmitry Silaev wrote: Manuel, The sample you provided definitely has insufficient resolution. You can only expect some part of the heading to be recognized, and this is what happened when I ran the recognition on your image. But I didn't get any error or warning messages with my por.traineddata at all! However, all this was tested under Windows. I could probably try this under Ubuntu, but I don't know when I'll have enough time to reboot, set up a C++ compiler, build Tesseract and do some testing, sorry )) Are you sure you downloaded the latest stable version of Tesseract? Warm regards, Dmitry Silaev

On Thu, Mar 10, 2011 at 9:32 PM, manuel...@gmail.com wrote: I just replaced por.traineddata with your file por.traineddata. After that I'm getting this error message:
manuel$ tesseract input.tiff output -l por
actual_tessdata_num_entries_ = TESSDATA_NUM_ENTRIES:Error:Assert failed:in file tessdatamanager.cpp, line 55
Segmentation fault
I haven't succeeded. I'm using version 3 on Mac OS X 10.6. Attached: Reported.tiff Regards, Manuel Pardo

On 04/03/2011, at 03:19, Dmitry Silaev wrote: Manuel, Is the error message generated by version 2.xx? Did you try to run version 3.xx with my por.traineddata file? I don't get it: have you succeeded or not? Please provide us with the image you are trying to recognize. Warm regards, Dmitry Silaev

On Thu, Mar 3, 2011 at 5:34 PM, manuel...@gmail.com wrote: Hi Dmitry, I just replaced it with your file por.traineddata, but I'm getting an error:
manuel$ tesseract input.tiff output -l por
actual_tessdata_num_entries_ = TESSDATA_NUM_ENTRIES:Error:Assert failed:in file tessdatamanager.cpp, line 55
Segmentation fault
It seems worthwhile to convert old files from 2.0X to 3, because there isn't a Brazilian Portuguese for version 3, just Portuguese. At least the dictionary por.traineddata is working correctly in version 3. The special chars are being recognized by Tesseract 3. Regards, Manuel Pardo

On 03/03/2011, at 09:12, Dmitry Silaev wrote: Manuel, It's quite an interesting question, although it may seem to be an ordinary newbie-like one. I was always wondering if 2.xx files can be used with version 3.xx. The wiki states that the files in the traineddata file are different from the list used prior to 3.00, and will most likely change, possibly dramatically, in future revisions. I have no time to investigate it in the code, so I decided to act rather than think. After some tinkering with all those files I slipped the resulting por.traineddata into the Tesseract algo I'm currently working on, and - guess what? - it worked! )) I must say it was tested only with a couple of *very simple* images
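For the column-splitting question in this thread, here is a minimal sketch of one possible custom segmentation step. It is an illustration under stated assumptions, not Dmitry's actual method: the page is already binarized (1 = ink, 0 = paper) and has negligible skew, which, as noted above, is exactly the condition that real scans tend to violate. The function names and the min_gap parameter are hypothetical.

```python
def column_profile(binary):
    """Count ink pixels (1s) in every pixel column of the image."""
    return [sum(col) for col in zip(*binary)]


def find_columns(binary, min_gap=2):
    """Return (start, end) x-ranges of text columns, end exclusive.

    A run of at least min_gap empty pixel columns separates two text
    columns; shorter gaps (e.g. between letters) are bridged.
    """
    profile = column_profile(binary)
    columns, start, gap = [], None, 0
    for x, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                columns.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        columns.append((start, len(profile) - gap))
    return columns


def crop_columns(binary, min_gap=2):
    """Cut the image into one sub-image per detected text column."""
    return [[row[x0:x1] for row in binary]
            for x0, x1 in find_columns(binary, min_gap)]
```

Each cropped column could then be recognized separately and the results merged row by row into CSV; the same profile idea applied to pixel rows would give the row boundaries needed for that merge.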
Re: how to get the character in an image file which is in table format.
I suspect this paper is a sledgehammer to crack a nut. It's quite universal and elaborate, and it may take a great deal of time to implement and debug. Your images might require much simpler methods. I always say the same thing: send your sample images and the community will try to help. Warm regards, Dmitry Silaev

On Mon, Mar 14, 2011 at 8:23 AM, David Hoffer dhoff...@gmail.com wrote: Hi Vicky, Can you tell me more about this paper? It looks like it is not a free document, so I can't just read it to see if it would solve the problem I have. My problem is that I have grey-scale image data (tif/jpg/etc.) that contains text in a table format, i.e. cells on the page. The documents were originally faxed and then converted to PDF, so the image quality varies from poor to good. I don't want the table formatting; I'm looking for a way to remove the formatting and get to just the image text, which I want to convert to text using OCR, Tesseract or otherwise. My programming environment is Java, but I can shell out to other programs if I need to. Would the approach in the paper solve this problem space? How practical is the software solution for a one-man effort? Thanks, -Dave

On Sun, Mar 13, 2011 at 10:18 AM, Vicky Budhiraja vicky.vi...@gmail.com wrote: Hello, I used this paper (for pre-processing): Parameter-Free Geometric Document Layout Analysis, by Lee, Ryu 2001. IEEE Tran. Patt. Analysis and Machine Int. Nov 2001 Volume 23 Issue 11 Pages 1240 - 1256 Best Regards, Vicky

-Original Message- From: tesseract-ocr@googlegroups.com [mailto:tesseract-ocr@googlegroups.com] On Behalf Of Daphne Sent: Friday, March 11, 2011 01:15 To: tesseract-ocr Subject: how to get the character in an image file which is in table format.
Hello, I have a scanned image file which contains a table. When I OCR it using tessnet it doesn't give the desired output. It is not reading the characters in the table; instead it gives some numbers. How can I read the characters in a table-format image?
Re: Tesseract 3.00
Actually, there's more than just VietOCR. Check this: http://en.wikipedia.org/wiki/Tesseract_(software)#User_interfaces Warm regards, Dmitry Silaev

On Mon, Mar 14, 2011 at 2:13 AM, Onion onionzwie...@gmail.com wrote: Ok, thanks. That will be too complicated for me to use. Will have to uninstall it.
Re: Tesseract 3.00
You don't need to bother using *two together*. Tesseract is the basis FreeOCR is built on, so these two are together already. FreeOCR's graphical interface is quite user-friendly. Just install it and use it. I don't know what else needs to be said )) Warm regards, Dmitry Silaev

On Mon, Mar 14, 2011 at 12:31 PM, Onion onionzwie...@gmail.com wrote: I have FreeOCR installed already. So somehow, this works with Tesseract? Can you explain in simpleton terms how I'd use the two together? Or is it too geeky? Thanks
Re: Customising Tesseract for character recognition
Ehmm... I don't get it. If you've succeeded in using iterators, you are free to format the output programmatically in any way you want, aren't you? Warm regards, Dmitry Silaev

On Mon, Mar 14, 2011 at 1:56 PM, Jose diox...@gmail.com wrote: *I only modify how the result is printed! Nothing else... I grab all the info from the word and its bounding box! That is OK, right?
Re: how to get the character in an image file which is in table format.
Dave, Yep, the quality is relatively poor, so don't expect high accuracy from Tess. Do you need every table cell's contents? Or is getting the numbers enough, so that in a next step you can restore the [predefined] item names? Warm regards, Dmitry Silaev

On Mon, Mar 14, 2011 at 4:19 PM, David Hoffer dhoff...@gmail.com wrote: Dmitry, That would be great, thanks for the offer. I'll attach two samples. These two are good examples of the range of quality. What I need to do is extract cell data for processing. I can generate these in any image format (TIFF, JPEG) if one should be preferred. Best regards, -Dave
Re: how to get the character in an image file which is in table format.
Dave, What is the format and resolution in which you initially get your images? With such poor quality, every conversion makes an image even worse... Warm regards, Dmitry Silaev

On Mon, Mar 14, 2011 at 5:29 PM, David Hoffer dhoff...@gmail.com wrote: Dmitry, Would using a lossless format like TIFF be preferred? (I'm going to give this a try, but some of these steps might be a bit more than I can handle... I'm not an image processing guru.) -Dave

On Mon, Mar 14, 2011 at 5:23 PM, Dmitry Silaev daemons2...@gmail.com wrote: Ehmm, actually I thought a bit more and now I say no to deskewing. It can be detrimental to such poor-quality images: they are almost binary ("almost" probably because of the JPEG compression algo) and low-res. As far as I can see, you can only have binary images. Therefore we need to assume that the skew of an input image is always within some narrow range, and modify all our subsequent steps to work in a skewed coordinate system. Dmitry

Attachments: hud1.jpeg (748K), hud2.jpeg (2046K)
Re: Especial Characteres
I doubt there's a GUI which can help with what you want. As for a programmatic way of doing this, please refer to the following thread, where I already tried to answer a similar question: http://groups.google.com/group/tesseract-ocr/browse_thread/thread/6322a29f28ba49dc/f98699a9caf36dbc#f98699a9caf36dbc If you see no clues in those posts, then you need to send your sample images; there's no other way to help you. Warm regards, Dmitry Silaev

On Mon, Mar 14, 2011 at 5:22 PM, manuel...@gmail.com wrote: Thanks. I need a GUI that tells Tesseract to recognize just a specific column. I'm a Java and C++ developer. Can you point me in a direction? Regards, Manuel Pardo
Re: how to get the character in an image file which is in table format.
As far as I can see, your source data can be regarded as a 1-bit (binary), losslessly compressed image. So a lossless conversion to any image format (it makes no difference which) will do no harm. Warm regards, Dmitry Silaev

On Tue, Mar 15, 2011 at 8:31 AM, David Hoffer dhoff...@gmail.com wrote: Dmitry, Originally the documents are PDFs with these images CCITTFax-encoded; I decoded them using iText. At this point I have a BufferedImage which I can save in any format supported by Java. I assume TIFF would be one of the best. Best regards, -Dave
Re: how to get the character in an image file which is in table format.
The first step in this technique is to threshold the image using a manually selected threshold value. Within the text of the article this step only deserved a line of code (pix1 = pixThresholdToBinary(pixs, 150)), but not a single word. However, the fact that such a convenient threshold happens to exist is crucial for all the subsequent steps of the method to work. I think your source images do not enjoy such good separability conditions.

I think this article is more an example of what can be done with Leptonica from a user's, not a developer's, point of view. It's like taking one concrete image into Photoshop and trying to achieve what you have in your head. You try various filters, apply transformations, effects, etc. However, none of these can be applied automatically: every time you need to choose parameters manually and make decisions specifically for this very image. IMHO this is the reason why the author chose morphology ("oh, great! that worked!"). It's easier to use, in one function call, but in the overwhelming majority of cases an algorithmic approach gives much more precise results. In real situations morphology requires you to do a great deal of cleaning up after it has done its work, which can be a lot more complex and not as mathematically elegant as the morphology algos themselves. Another reason why I try to stay away from morphology is that it is slow by nature compared to other methods, despite the recent emergence of some fast methods. By the way, the article advertises a processing speed of 1 Mpix/sec, which I think is relatively slow for the intended goal, even for yesterday's P4s.

The moral is: you can use this article as a guideline, or maybe just for several specific images. However, it's not well suited for automatic processing.

P.S.: This is my own opinion, and it does not necessarily coincide with the views of other document image processing people.
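To make the point about the hand-picked threshold concrete, here is a small sketch in plain Python rather than Leptonica. threshold_to_binary imitates what a call like pixThresholdToBinary(pixs, 150) does as far as I understand its convention (pixels darker than the threshold become foreground); otsu_threshold is Otsu's method, a standard automatic alternative that is not mentioned in the thread and is swapped in here purely for comparison. Both assume an 8-bit grayscale image given as a list of rows.

```python
def threshold_to_binary(gray, thresh):
    """Pixels darker than thresh become ink (1), the rest paper (0)."""
    return [[1 if p < thresh else 0 for p in row] for row in gray]


def otsu_threshold(gray):
    """Otsu's method: choose the threshold that maximizes the
    between-class variance of the gray-level histogram."""
    hist = [0] * 256
    for row in gray:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(256):
        w0 += hist[t]              # pixels with value <= t
        sum0 += t * hist[t]
        w1 = total - w0            # pixels with value > t
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            # t + 1 so that "p < threshold" keeps values <= t as ink
            best_t, best_var = t + 1, var
    return best_t
```

An automatic threshold removes the manual choice, but it does not create separability: if line and text gray levels overlap, as they may in faxed pages, no single global value will split them cleanly.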
Warm regards,
Dmitry Silaev

On Sun, Mar 13, 2011 at 12:52 AM, TP wing...@gmail.com wrote:
How about this technique mentioned in the Leptonica documentation (it's even easier if you can use binary morphology): Removing dark lines from a light pencil drawing at http://tpgit.github.com/UnOfficialLeptDocs/leptonica/line-removal.html .
-- TP

On Sat, Mar 12, 2011 at 12:57 PM, Dmitry Silaev daemons2...@gmail.com wrote:
Dave, There are a number of methods you can use to remove straight lines or borders, either individually or in combination. The simplest are: the Hough line detector (http://en.wikipedia.org/wiki/Hough_transform), the vertical/horizontal profile method (X and Y histograms of foreground pixel counts - detect lines by the highest bin counts or table cell margins by the lowest bin counts), connected component analysis (detect nested CCs - the outer ones serve as borders), and methods based on alignment analysis. If your documents can be skewed, for some methods they need to be deskewed first. After you detect table borders, you can get the bounding boxes of individual cells and then pass them to Tesseract. I think that for Tesseract, small single-row portions of text, as long as they still allow the baseline and x-height to be determined, are often much easier to recognize than full-sized pages, even ones with no tables in them. This is because of Tesseract's native layout analysis. To disable it (or to avoid it as much as possible) you would need to set pageseg_mode to PSM_SINGLE_BLOCK, PSM_SINGLE_LINE, PSM_SINGLE_WORD, or even PSM_SINGLE_CHAR. In my experience, PSM_SINGLE_WORD and PSM_SINGLE_CHAR work best, as they bypass almost all of Tesseract's layout analysis, followed by PSM_SINGLE_LINE and PSM_SINGLE_BLOCK. However, for PSM_SINGLE_WORD or PSM_SINGLE_CHAR you'd need to do your own segmentation. I don't know if you are ready to dive into such serious development.
HTH Warm regards, Dmitry Silaev On Sat, Mar 12, 2011 at 7:39 AM, David Hoffer dhoff...@gmail.com wrote: Dmitry, Yeah, I was thinking too of preprocessing to remove all straight lines/borders but haven't found a good approach to this yet. I can clean up the margins, headers, footers but I haven't found a good way to remove table row lines. if you/others have any suggestions I would love to hear them. I will also experiment with the config file. Thanks much! -Dave On Sat, Mar 12, 2011 at 7:24 AM, Dmitry Silaev daemons2...@gmail.com wrote: Actually I think there's no fully user-friendly solution. Maybe you can try to use the first of the two possible methods currently seen to me. So the first method is to devise a special config file and include it in the command line for Tesseract. The following values need to be within this config file: tessedit_pageseg_mode 1 or 3 (I recommend 3) textord_tabfind_find_tables T textord_tablefind_recognize_tables T You can play with the last param trying the T or F values. Actually I give no guarantee for the whole method to work, only I found out some clues by studying the code. I suspect corresponding pieces
Re: Especial Characteres
Manuel, The sample you provided definitely has insufficient resolution. You can only expect some part of the heading to be recognized, and this is what happened when I ran recognition on your image. But I didn't get any error or warning messages with my por.traineddata at all! However, all this was tested under Windows. I could probably try this under Ubuntu, but I don't know when I'll have enough time to reboot, set up a C++ compiler, build Tesseract and do some testing, sorry )) Are you sure you downloaded the latest stable version of Tesseract?

Warm regards,
Dmitry Silaev

On Thu, Mar 10, 2011 at 9:32 PM, manuel...@gmail.com manuel...@gmail.com wrote:
I just replaced por.traineddata with your file por.traineddata. After that I'm getting this error message:
manuel$ tesseract input.tiff output -l por
actual_tessdata_num_entries_ = TESSDATA_NUM_ENTRIES:Error:Assert failed:in file tessdatamanager.cpp, line 55
Segmentation fault
I haven't succeeded. I'm using version 3 - MacOSX 10.6
Attached Reported.tiff
Regards
Manuel Pardo

On 04/03/2011, at 03:19, Dmitry Silaev wrote:
Manuel, Is the error message generated by version 2.xx? Did you try to run version 3.xx with my por.traineddata file? I don't get it - have you succeeded or not? Please provide us with the image you are trying to recognize.
Warm regards,
Dmitry Silaev

On Thu, Mar 3, 2011 at 5:34 PM, manuel...@gmail.com manuel...@gmail.com wrote:
Hi Dmitry, I just replaced it with your file por.traineddata, but I'm getting an error:
manuel$ tesseract input.tiff output -l por
actual_tessdata_num_entries_ = TESSDATA_NUM_ENTRIES:Error:Assert failed:in file tessdatamanager.cpp, line 55
Segmentation fault
It seems it would be interesting to convert old files from 2.0X to 3, because there isn't a Brazilian Portuguese for version 3, just Portuguese. At least the dictionary por.traineddata is working correctly in version 3. The special chars are being recognized by tesseract 3.
regards,
Manuel Pardo

On 03/03/2011, at 09:12, Dmitry Silaev wrote:
Manuel, It's quite an interesting question, although it may seem to be an ordinary newbie-like one. I was always wondering if 2.xx files can be used with version 3.xx. The wiki states that the files in the traineddata file are different from the list used prior to 3.00, and will most likely change, possibly dramatically in future revisions. I have no time to investigate it in the code, so I decided to act rather than to think. After some tinkering with all those files I slipped the resulting por.traineddata into the Tesseract-based algo I'm currently working on, and - guess what? - it worked! )) I must say it was tested only with a couple of *very simple* images, and it completely lacks any dictionary-related data. Also, my test images don't contain these specific Portuguese letters with diacritics. So in fact this file may perform poorly. Please test and report your results. The file is in the attachment. Making this training data file was not difficult at all, but also not so straightforward, so this process probably deserves a separate article, and later I'd like to post it in my blog.
Warm regards,
Dmitry Silaev

On Wed, Mar 2, 2011 at 8:40 PM, manuelfhp manuel...@gmail.com wrote:
Hello list, I can't find a solution for special chars. I installed tesseract 3 on my MacOSX 10.6. It is running very well, but I'm having problems with the charset. I need tesseract working with Brazilian Portuguese (ISO8859-1). I installed the Portuguese dictionary, but it is not working with special chars like Ç Ã É é (ISO8859-1). Is there any solution? There is an old dictionary specially for Brazilian Portuguese in version 2.0.4. Is it possible to use it in version 3? How?
Re: Tesseract 3.00
Although the Tesseract team is working to make it more user-friendly, answers to many basic user questions are still hard to find... Tesseract is a console application; it has no GUI. You should open a Windows command line and type a command. Read more at http://code.google.com/p/tesseract-ocr/wiki/ReadMe#Windows

Warm regards,
Dmitry Silaev

On Sun, Mar 13, 2011 at 11:36 PM, Onion onionzwie...@gmail.com wrote:
I installed Tesseract 3.00 and the German and Czech languages as well as English. Now how do I run it? Are there directions somewhere? When I click Start Tesseract OCR, a DOS screen flashes for a split second, then nothing happens. Thanks
Re: Customising Tesseract for character recognition
Jose, I ran Tesseract revision 549 from the command line under Windows with no special config and got a segmentation which is almost correct. What language file do you use? I used the following command line:
tesseract 3.tiff test3 -l eng
both without pageseg_mode (the -psm argument) and with it, and the result was always satisfactory. Let me know the details of your command line and OS.

Warm regards,
Dmitry Silaev

On Sun, Mar 13, 2011 at 11:18 PM, patrickq patrick.questemb...@gmail.com wrote:
You expect way too much from Tesseract: it's not Tesseract's job to slice and dice the text according to the various organizational requirements of applications - that's for the application to handle. You can get the coordinates of all characters and easily determine which ones are in what you consider the first column and which are in the 2nd column. In ScanBizCards' case, considering our target material, we treat each line as a single number formed of two sequences - but if we wanted to treat the input as columns, it would take us a mere 20 minutes of coding to organize the results that way. We actually don't even pay attention to where Tesseract thinks lines end and start; we figure that out ourselves based on coordinates. It's not hard.
Patrick

On Mar 13, 4:10 pm, Jose diox...@gmail.com wrote:
Hi Patrick, yes, the results are correct! But the format of the results is not! That's my trouble
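Patrick's suggestion above, organizing recognized items into columns by their coordinates, can be sketched roughly as follows. This is only a minimal illustration, not ScanBizCards' code: the (text, left, top) tuples, the split_x column boundary and the line_tol tolerance are hypothetical stand-ins for whatever coordinates your Tesseract wrapper reports.

```python
def rows_from_boxes(boxes, split_x, line_tol=10):
    """Group (text, left, top) items into rows, then split each row in two columns.

    boxes    -- iterable of (text, left, top) tuples in pixel coordinates
    split_x  -- x position separating column 1 from column 2 (assumed known)
    line_tol -- an item joins the current row if its top is within this many
                pixels of the top of the first item placed in that row
    """
    rows = []  # each entry: [top of first item, [(left, text), ...]]
    for text, left, top in sorted(boxes, key=lambda b: (b[2], b[1])):
        if rows and abs(top - rows[-1][0]) <= line_tol:
            rows[-1][1].append((left, text))
        else:
            rows.append([top, [(left, text)]])

    result = []
    for _, items in rows:
        items.sort()  # left-to-right reading order within the row
        col1 = " ".join(t for x, t in items if x < split_x)
        col2 = " ".join(t for x, t in items if x >= split_x)
        result.append((col1, col2))
    return result
```

With real data, split_x would typically be estimated from the gap between the two columns rather than hard-coded.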
Re: how to get the character in an image file which is in table format.
Dave, There are a number of methods you can use to remove straight lines or borders, either individually or in combination. The simplest are: the Hough line detector (http://en.wikipedia.org/wiki/Hough_transform), the vertical/horizontal profile method (X and Y histograms of foreground pixel counts - detect lines by the highest bin counts or table cell margins by the lowest bin counts), connected component analysis (detect nested CCs - the outer ones serve as borders), and methods based on alignment analysis. If your documents can be skewed, for some methods they need to be deskewed first. After you detect table borders, you can get the bounding boxes of individual cells and then pass them to Tesseract.

I think that for Tesseract, small single-row portions of text, as long as they still allow the baseline and x-height to be determined, are often much easier to recognize than full-sized pages, even ones with no tables in them. This is because of Tesseract's native layout analysis. To disable it (or to avoid it as much as possible) you would need to set pageseg_mode to PSM_SINGLE_BLOCK, PSM_SINGLE_LINE, PSM_SINGLE_WORD, or even PSM_SINGLE_CHAR. In my experience, PSM_SINGLE_WORD and PSM_SINGLE_CHAR work best, as they bypass almost all of Tesseract's layout analysis, followed by PSM_SINGLE_LINE and PSM_SINGLE_BLOCK. However, for PSM_SINGLE_WORD or PSM_SINGLE_CHAR you'd need to do your own segmentation. I don't know if you are ready to dive into such serious development.

HTH
Warm regards,
Dmitry Silaev

On Sat, Mar 12, 2011 at 7:39 AM, David Hoffer dhoff...@gmail.com wrote:
Dmitry, Yeah, I was also thinking of preprocessing to remove all straight lines/borders but haven't found a good approach to this yet. I can clean up the margins, headers and footers, but I haven't found a good way to remove table row lines. If you/others have any suggestions I would love to hear them. I will also experiment with the config file. Thanks much!
-Dave

On Sat, Mar 12, 2011 at 7:24 AM, Dmitry Silaev daemons2...@gmail.com wrote:
Actually I think there's no fully user-friendly solution. Maybe you can try the first of the two methods I can currently see. The first method is to devise a special config file and include it in the command line for Tesseract. The following values need to be in this config file:
tessedit_pageseg_mode 1 or 3 (I recommend 3)
textord_tabfind_find_tables T
textord_tablefind_recognize_tables T
You can play with the last param, trying the T or F values. I give no guarantee that the whole method works; I only found some clues by studying the code. I suspect the corresponding pieces of code may not work perfectly, or there may be more parameters that influence table recognition. Please try this yourself. It would be nice if you shared your results with the community. Sample images are also appreciated. The second method is to pre-process your images. You need to remove lines and borders and pass the cleaned image to Tesseract. Many issues can arise in this process, but I won't go into them now unless you express some interest.
Warm regards,
Dmitry Silaev

On Fri, Mar 11, 2011 at 7:21 AM, David Hoffer dhoff...@gmail.com wrote:
I have the same problem. I posted a message a few days ago titled Working with FAX images with lines/borders. If you find a solution please let me know.
Thanks,
-Dave

On Thu, Mar 10, 2011 at 10:44 PM, Daphne flower.dap...@gmail.com wrote:
Hello, I have a scanned image file which contains a table. When I OCR it using tessnet it doesn't give the desired output. It is not reading the characters in the table. Instead it gives some numbers. How to read the characters in a table format image
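The vertical/horizontal profile method mentioned in this thread can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a tested production routine: the page is taken to be already binarized and deskewed, it is represented as a list of rows of 0/1 values (1 = foreground), and min_fill is a hypothetical tuning value, not a Tesseract or Leptonica parameter.

```python
def find_rule_lines(bitmap, min_fill=0.8):
    """Return (first, last) index ranges of rows that are mostly foreground.

    A row counts as part of a ruling line when at least min_fill of its
    pixels are foreground, i.e. its profile bin is close to the maximum.
    """
    width = len(bitmap[0])
    runs, start = [], None
    for y, row in enumerate(bitmap):
        is_line = sum(row) >= min_fill * width
        if is_line and start is None:
            start = y
        elif not is_line and start is not None:
            runs.append((start, y - 1))
            start = None
    if start is not None:
        runs.append((start, len(bitmap) - 1))
    return runs


def cell_bands(runs, size):
    """Return the (first, last) ranges between detected lines (cell interiors)."""
    bands, prev = [], 0
    for first, last in runs:
        if first > prev:
            bands.append((prev, first - 1))
        prev = last + 1
    if prev < size:
        bands.append((prev, size - 1))
    return bands


def transpose(bitmap):
    """Swap rows and columns so the same code finds vertical lines."""
    return [list(col) for col in zip(*bitmap)]
```

Horizontal lines come from find_rule_lines(bitmap), vertical ones from find_rule_lines(transpose(bitmap)). Pairing every row band with every column band gives candidate cell boxes that could then be cropped and passed to Tesseract one at a time. Even a small skew spreads a line over many rows, which is why deskewing comes first.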
Re: how to get the character in an image file which is in table format.
Actually I think there's no fully user-friendly solution. Maybe you can try the first of the two methods I can currently see.

The first method is to devise a special config file and include it in the command line for Tesseract. The following values need to be in this config file:
tessedit_pageseg_mode 1 or 3 (I recommend 3)
textord_tabfind_find_tables T
textord_tablefind_recognize_tables T
You can play with the last param, trying the T or F values. I give no guarantee that the whole method works; I only found some clues by studying the code. I suspect the corresponding pieces of code may not work perfectly, or there may be more parameters that influence table recognition. Please try this yourself. It would be nice if you shared your results with the community. Sample images are also appreciated.

The second method is to pre-process your images. You need to remove lines and borders and pass the cleaned image to Tesseract. Many issues can arise in this process, but I won't go into them now unless you express some interest.

Warm regards,
Dmitry Silaev

On Fri, Mar 11, 2011 at 7:21 AM, David Hoffer dhoff...@gmail.com wrote:
I have the same problem. I posted a message a few days ago titled Working with FAX images with lines/borders. If you find a solution please let me know.
Thanks,
-Dave

On Thu, Mar 10, 2011 at 10:44 PM, Daphne flower.dap...@gmail.com wrote:
Hello, I have a scanned image file which contains a table. When I OCR it using tessnet it doesn't give the desired output. It is not reading the characters in the table. Instead it gives some numbers. How to read the characters in a table format image
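The config-file method described in this message could be prepared along these lines. It is a sketch only: the three variable names and values come from the message itself, while the config name "tables", the image and output names, and the helper functions are hypothetical. Where Tesseract looks for config files (for example a configs directory under tessdata) depends on the version, so treat the location as an assumption to check against your install.

```python
def table_config(pageseg_mode=3, recognize_tables=True):
    """Return config file text: one 'name value' pair per line."""
    flag = "T" if recognize_tables else "F"
    lines = [
        "tessedit_pageseg_mode %d" % pageseg_mode,
        "textord_tabfind_find_tables T",
        "textord_tablefind_recognize_tables %s" % flag,
    ]
    return "\n".join(lines) + "\n"


def command_line(image, output_base, config_name):
    """Build the argument list; config names follow the output base."""
    return ["tesseract", image, output_base, config_name]


if __name__ == "__main__":
    # Hypothetical names, printed rather than executed.
    print(table_config())
    print(" ".join(command_line("page.tif", "page_out", "tables")))
```

Switching recognize_tables between True and False reproduces the "try T or F" experiment suggested above without editing the file by hand.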
Re: are there parameters to increase the chances for white space between words?
Try textord_words_min_minspace (expressed as a fraction of the x-height).

Warm regards,
Dmitry Silaev

On Mon, Mar 7, 2011 at 8:28 PM, JMW white.j...@gmail.com wrote:
I'm having some consistent problems with a lack of white space between words, e.g. Thisisyour statementthatshows theamount you owe foryour. Are there tuning parameters that will help increase the chance of getting whitespace between words?
Re: noise output
Zdravko, You should do text detection before passing images to Tesseract. Text detection is the process of determining which image regions contain text. Even if an image contains no text, Tesseract will still treat it as an image of text. Before recognition Tess applies a so-called binarization algorithm, which converts an RGB image to a monochrome one (black for text and white for background). For your sample image, the Otsu binarization used in Tesseract (http://en.wikipedia.org/wiki/Otsu%27s_method) would certainly give a number of skewed vertical lines resembling backslashes, and the subsequent recognition classifies them as such. textord_heavy_nr and some other variables control size-based noise removal, but they work satisfactorily only when there's a significant body of good text surrounded by some amount of noise. In your image everything is noise, so it won't work. Therefore you need to extend your pre-processing so that you feed Tess only images that actually contain text. Decisions can be made based on contrast estimation, distinctive color distribution, etc.

HTH
Warm regards,
Dmitry Silaev

On Fri, Mar 4, 2011 at 5:25 PM, zdravco zdra...@gmail.com wrote:
Hello, I am using tesseract in my project after some image pre-processing. There are some false negatives I was hoping tesseract would eliminate by producing no output. However, sometimes there is a strange output that I get from almost blank images. Here is the sample image: https://picasaweb.google.com/zdravco/TesseractTest#5580227257541654274 When I run it with tesseract rev. 552 using English language I get: R \. Does anyone know if there are some options in tesseract that could eliminate this noise? Or maybe I could improve my input image with some further pre-processing. I have also tried to recompile tesseract with textord_heavy_nr set to TRUE, but then the output is: an \\“ R \.
Thanks,
Zdravko
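To make the binarization step concrete, here is a minimal pure-Python sketch of Otsu's method together with one possible "is there anything here at all?" check based on contrast. It is an illustration only, not Tesseract's implementation; looks_blank and its min_contrast cut-off are hypothetical and would need tuning on real data. Pixels are assumed to be 8-bit grey values in a flat list.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0-255) that maximizes between-class variance.

    Pixels <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the lower class
    sum0 = 0  # sum of grey values in the lower class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def looks_blank(pixels, min_contrast=40):
    """Hypothetical pre-check: True if the two Otsu classes barely differ."""
    t = otsu_threshold(pixels)
    dark = [p for p in pixels if p <= t]
    light = [p for p in pixels if p > t]
    if not dark or not light:
        return True
    return sum(light) / len(light) - sum(dark) / len(dark) < min_contrast
```

The point of the check is that Otsu always returns some threshold, even for an almost uniform image, so faint background texture gets split into "text" and "background". Rejecting regions whose two classes are too close in brightness is one simple form of the contrast estimation mentioned above.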
Fwd: noise output
There are tons of them. And I believe no ready recipe can be used universally; this is very task-specific, especially for photographic images. I also believe that to do good text detection your algo should to some extent mimic human behavior, so it should probably be multi-stage, gradually refining the results at every stage. Don't count on getting a working code snippet from the internet; most likely you'd have to write the code yourself.

Below are some articles I picked out when I was studying this field of document image processing on my own. By now there may be newer ones, but these can give you a basis. Apologies, I have no time to provide direct references and author names - I've only listed my file system directory on this topic. You can Google the exact article titles to find links.

1990 Scale-Space and Edge Detection Using Anisotropic Diffusion.pdf
1998 Edge detection and ridge detection with automatic scale selection.pdf
2001 Edge-Based Method for Text Detection from Complex Document Images.pdf
2001 TEXT EXTRACTION FROM GREY SCALE PAGE IMAGES BY SIMPLE EDGE DETECTORS.pdf
2002 Gaussian-Based Edge-Detection Methods - A Survey.pdf
2003 Fast Computation of Scale Normalised Gaussian Receptive Fields.pdf
2003 Real-time scale selection in hybrid multi-scale representations.pdf
2003 Recognition of text in 3-D scenes.pdf
2004 A method for ridge extraction.pdf
2004 A Review of Vessel Extraction Techniques and Algorithms.pdf
2004 Distinctive Image Features from Scale-Invariant Keypoints.pdf
2004 Scene Text Extraction in Natural Scene Images using Hierarchical Feature Combining and Verification.PDF
2004 Text Detection from Natural Scene Images - Towards a System for Visually Impaired Persons.PDF
2005 A novel approach for text detection in images using structural features.pdf
2005 Color Text Extraction from Camera-based Images - the Impact of the Choice of the Clustering Distance.PDF
2005 Improved Text-Detection Methods for a Camera-based Text Reading System for Blind Persons.PDF
2005 Text Extraction from Gray Scale Historical Document Images Using Adaptive Local Connectivity Map.pdf
2006 Multiscale Edge-Based Text Extraction from Complex Images.PDF
2006 Spatial and Color Spaces Combination for Natural Scene Text Extraction.PDF
2008 A double-threshold image binarization method based on edge detector.PDF

HTH
Warm regards,
Dmitry Silaev

On Sat, Mar 5, 2011 at 8:56 AM, Saurabh Gandhi saurabh...@gmail.com wrote:
Hey, Any algorithm / whitepaper suggestions for text extraction, especially if the text is not overlay text but a part of the image itself? Most algorithms I saw on the internet are compute-intensive.
-- Regards, Saurabh Gandhi

On Sat, Mar 5, 2011 at 11:20 AM, Dmitry Silaev daemons2...@gmail.com wrote:
Zdravko, You should do text detection before passing images to Tesseract. Text detection is the process of determining which image regions contain text. Even if an image contains no text, Tesseract will still treat it as an image of text. Before recognition Tess applies a so-called binarization algorithm, which converts an RGB image to a monochrome one (black for text and white for background). For your sample image, the Otsu binarization used in Tesseract (http://en.wikipedia.org/wiki/Otsu%27s_method) would certainly give a number of skewed vertical lines resembling backslashes, and the subsequent recognition classifies them as such. textord_heavy_nr and some other variables control size-based noise removal, but they work satisfactorily only when there's a significant body of good text surrounded by some amount of noise. In your image everything is noise, so it won't work. Therefore you need to extend your pre-processing so that you feed Tess only images that actually contain text. Decisions can be made based on contrast estimation, distinctive color distribution, etc.
HTH
Warm regards,
Dmitry Silaev

On Fri, Mar 4, 2011 at 5:25 PM, zdravco zdra...@gmail.com wrote:
Hello, I am using tesseract in my project after some image pre-processing. There are some false negatives I was hoping tesseract would eliminate by producing no output. However, sometimes there is a strange output that I get from almost blank images. Here is the sample image: https://picasaweb.google.com/zdravco/TesseractTest#5580227257541654274 When I run it with tesseract rev. 552 using English language I get: R \. Does anyone know if there are some options in tesseract that could eliminate this noise? Or maybe I could improve my input image with some further pre-processing. I have also tried to recompile tesseract with textord_heavy_nr set to TRUE, but then the output is: an \\“ R \.
Thanks,
Zdravko
Re: Especial Characteres
Sriranga, Thanks for letting me know. You are the first one then, and I reinvented the bicycle )) However, an article might still be of use instead of a verbose forum discussion... Maybe you'd like to write it then?

Warm regards,
Dmitry Silaev

On Thu, Mar 3, 2011 at 3:55 PM, Sriranga(78yrsold) withblessi...@gmail.com wrote:
Dmitry, I generated traineddata (Kannada) files successfully from the old datafiles of 2.xx last year. There is a discussion by spohorsky in the forum on how to do it.
sriranga(78) ♫

On Thu, Mar 3, 2011 at 5:42 PM, Dmitry Silaev daemons2...@gmail.com wrote:
Manuel, It's quite an interesting question, although it may seem to be an ordinary newbie-like one. I was always wondering if 2.xx files can be used with version 3.xx. The wiki states that the files in the traineddata file are different from the list used prior to 3.00, and will most likely change, possibly dramatically in future revisions. I have no time to investigate it in the code, so I decided to act rather than to think. After some tinkering with all those files I slipped the resulting por.traineddata into the Tesseract-based algo I'm currently working on, and - guess what? - it worked! )) I must say it was tested only with a couple of *very simple* images, and it completely lacks any dictionary-related data. Also, my test images don't contain these specific Portuguese letters with diacritics. So in fact this file may perform poorly. Please test and report your results. The file is in the attachment. Making this training data file was not difficult at all, but also not so straightforward, so this process probably deserves a separate article, and later I'd like to post it in my blog.
Warm regards,
Dmitry Silaev

On Wed, Mar 2, 2011 at 8:40 PM, manuelfhp manuel...@gmail.com wrote:
Hello list, I can't find a solution for special chars. I installed tesseract 3 on my MacOSX 10.6. It is running very well, but I'm having problems with the charset. I need tesseract working with Brazilian Portuguese (ISO8859-1). I installed the Portuguese dictionary, but it is not working with special chars like Ç Ã É é (ISO8859-1). Is there any solution? There is an old dictionary specially for Brazilian Portuguese in version 2.0.4. Is it possible to use it in version 3? How?
Re: Especial Characteres
Sriranga,

Actually, I don't understand why one needs to refer to the forum discussion you've just mentioned, as I managed to build this traineddata file without writing a single line of code and even without a compiler such as Visual C++... The value I can add is that any user inexperienced in programming can make this traineddata file himself ))

Warm regards, Dmitry Silaev

On Thu, Mar 3, 2011 at 5:08 PM, Sriranga(78yrsold) withblessi...@gmail.com wrote:

Dmitry, no, I was NOT the first to invent this. The credit actually goes to spohor...@sjm.com, who helped me a great deal, including creating a vcproj for the combined traineddata for Windows. I am very thankful to him for the help and guidance he has given from time to time. Without his help I would not have succeeded in generating a traineddata file out of the old data files. All credit should go to Steve. Steve has already explained in detail how to do it in the forum discussion. -sriranga(78yrs)

On Thu, Mar 3, 2011 at 6:36 PM, Dmitry Silaev daemons2...@gmail.com wrote:

Sriranga, thanks for letting me know. You are the first one then, and I reinvented the bicycle )) However, an article might still be of use instead of a verbose forum discussion... Maybe you'd like to write it then? Warm regards, Dmitry Silaev

On Thu, Mar 3, 2011 at 3:55 PM, Sriranga(78yrsold) withblessi...@gmail.com wrote:

Dmitry, I successfully generated traineddata (Kannada) files from the old 2.xx data files last year. There is a discussion by spohorsky in the forum on how to do it. sriranga(78)

On Thu, Mar 3, 2011 at 5:42 PM, Dmitry Silaev daemons2...@gmail.com wrote:

Manuel, it's quite an interesting question, although it may seem to be an ordinary newbie-like one. I was always wondering if 2.xx files can be used with version 3.xx. The wiki states that the files in the traineddata file are different from the list used prior to 3.00, and will most likely change, possibly dramatically, in future revisions. I have no time to investigate it in the code, so I decided to act rather than to think. After some tinkering with all those files I slipped the resulting por.traineddata into the Tesseract-based algorithm I'm currently working on, and - guess what? - it worked! )) I must say it was tested only with a couple of *very simple* images, and it absolutely lacks any dictionary-related data. My test images also don't contain these specific Portuguese letters with diacritics. So in fact this file may perform poorly. Please test and report your results. The file is in the attachment. Making this training data file was not difficult at all, but also not so straightforward, so this process probably deserves a separate article, and later I'd like to post it in my blog. Warm regards, Dmitry Silaev

On Wed, Mar 2, 2011 at 8:40 PM, manuelfhp manuel...@gmail.com wrote:

Hello list, I can't find a solution for special chars. I installed tesseract 3 on my MacOSX 10.6. It is running very well, but I'm having problems with the charset. I need tesseract working with Brazilian Portuguese (ISO8859-1). I installed the Portuguese dictionary, but it is not working with special chars like Ç Ã É é (ISO8859-1). Is there any solution? There is an old dictionary specifically for Brazilian Portuguese in version 2.0.4. Is it possible to use it in version 3? How?
Re: Especial Characteres
Manuel,

Is the error message generated by version 2.xx? Did you try to run version 3.xx with my por.traineddata file? I don't get it - have you succeeded or not? Please provide us with the image you are trying to recognize.

Warm regards, Dmitry Silaev

On Thu, Mar 3, 2011 at 5:34 PM, manuel...@gmail.com manuel...@gmail.com wrote:

Hi Dmitry, I just replaced the file with your por.traineddata, but I'm getting an error:

    manuel$ tesseract input.tiff output -l por
    actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES:Error:Assert failed:in file tessdatamanager.cpp, line 55
    Segmentation fault

It seems it would be interesting to convert old files from 2.0X to 3, because there isn't a Brazilian Portuguese for version 3, just Portuguese. At least the dictionary por.traineddata is working correctly in version 3. The special chars are being recognized by tesseract 3. regards, Manuel Pardo

On 03/03/2011, at 09:12, Dmitry Silaev wrote:

Manuel, it's quite an interesting question, although it may seem to be an ordinary newbie-like one. I was always wondering if 2.xx files can be used with version 3.xx. The wiki states that the files in the traineddata file are different from the list used prior to 3.00, and will most likely change, possibly dramatically, in future revisions. I have no time to investigate it in the code, so I decided to act rather than to think. After some tinkering with all those files I slipped the resulting por.traineddata into the Tesseract-based algorithm I'm currently working on, and - guess what? - it worked! )) I must say it was tested only with a couple of *very simple* images, and it absolutely lacks any dictionary-related data. My test images also don't contain these specific Portuguese letters with diacritics. So in fact this file may perform poorly. Please test and report your results. The file is in the attachment. Making this training data file was not difficult at all, but also not so straightforward, so this process probably deserves a separate article, and later I'd like to post it in my blog. Warm regards, Dmitry Silaev

On Wed, Mar 2, 2011 at 8:40 PM, manuelfhp manuel...@gmail.com wrote:

Hello list, I can't find a solution for special chars. I installed tesseract 3 on my MacOSX 10.6. It is running very well, but I'm having problems with the charset. I need tesseract working with Brazilian Portuguese (ISO8859-1). I installed the Portuguese dictionary, but it is not working with special chars like Ç Ã É é (ISO8859-1). Is there any solution? There is an old dictionary specifically for Brazilian Portuguese in version 2.0.4. Is it possible to use it in version 3? How?

Attachment: por.traineddata
Re: image binarization
Without any image samples, you can only get vague advice. Provide the community with samples and you might get a satisfactory, concrete response.

Warm regards, Dmitry Silaev

On Wed, Mar 2, 2011 at 1:43 PM, Cong Nguyen congnguye...@gmail.com wrote:

Please be careful with the Otsu algorithm, because it uses only one threshold value for the whole image. No method is best for all cases :) You should compare the results of the Otsu algorithm and an adaptive threshold algorithm. For the adaptive threshold algorithm, you can build on the integral image (known from Paul Viola et al.) to increase performance. Cong.

From: tesseract-ocr@googlegroups.com [mailto:tesseract-ocr@googlegroups.com] On Behalf Of Saurabh Gandhi Sent: Wednesday, March 02, 2011 3:34 PM To: tesseract-ocr@googlegroups.com Cc: Bikash Bag Subject: Re: image binarization

You can use Otsu's binarisation algorithm: http://www.sas.bg/code-snippets/image-binarization-the-otsu-method.html -- Regards, Saurabh Gandhi

On Wed, Mar 2, 2011 at 1:45 PM, Bikash Bag bikash...@gmail.com wrote:

Hi, I am new to OCR. Can anyone please tell me a good image binarization algorithm? regards, bikash
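The two options mentioned in this thread, a single global Otsu threshold and an adaptive threshold built on an integral image, can be compared with something like the sketch below. It is only a minimal illustration under stated assumptions, not Tesseract's own thresholder and not the code behind the link above: the image is assumed to be 8-bit grayscale stored row by row, and the names OtsuThreshold, AdaptiveThreshold, radius and offset are made up for this example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Global Otsu threshold for an 8-bit grayscale histogram. Pixels with
// value <= t go to one class and pixels > t to the other; the function
// returns the first t that maximizes the between-class variance.
int OtsuThreshold(const std::array<std::uint64_t, 256>& hist) {
  std::uint64_t total = 0;
  double sum_all = 0.0;
  for (int v = 0; v < 256; ++v) {
    total += hist[v];
    sum_all += static_cast<double>(v) * static_cast<double>(hist[v]);
  }
  assert(total > 0);  // an empty histogram has no meaningful threshold

  std::uint64_t count_low = 0;  // pixels with value <= t
  double sum_low = 0.0;         // sum of their gray values
  double best_score = -1.0;
  int best_t = 0;
  for (int t = 0; t < 256; ++t) {
    count_low += hist[t];
    sum_low += static_cast<double>(t) * static_cast<double>(hist[t]);
    if (count_low == 0) continue;   // low class is still empty
    if (count_low == total) break;  // high class would be empty
    const double n_low = static_cast<double>(count_low);
    const double n_high = static_cast<double>(total - count_low);
    const double diff = sum_low / n_low - (sum_all - sum_low) / n_high;
    const double score = n_low * n_high * diff * diff;  // variance * total^2
    if (score > best_score) {
      best_score = score;
      best_t = t;
    }
  }
  return best_t;
}

// Adaptive (local mean) threshold. A pixel is marked 1 when it is darker
// than the mean of the surrounding (2*radius+1)^2 window by more than
// `offset` gray levels. The integral image makes each window sum O(1).
std::vector<std::uint8_t> AdaptiveThreshold(
    const std::vector<std::uint8_t>& gray, int width, int height,
    int radius, int offset) {
  assert(width > 0 && height > 0 && radius >= 0);
  assert(gray.size() == static_cast<std::size_t>(width) * height);
  const std::size_t stride = static_cast<std::size_t>(width) + 1;
  // integral[(y + 1) * stride + (x + 1)] holds the sum of gray over
  // rows 0..y and columns 0..x; row 0 and column 0 stay zero.
  std::vector<std::int64_t> integral(stride * (height + 1), 0);
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      integral[(y + 1) * stride + (x + 1)] =
          gray[static_cast<std::size_t>(y) * width + x] +
          integral[y * stride + (x + 1)] +
          integral[(y + 1) * stride + x] - integral[y * stride + x];
    }
  }

  std::vector<std::uint8_t> out(gray.size(), 0);
  for (int y = 0; y < height; ++y) {
    const std::size_t y0 = std::max(0, y - radius);
    const std::size_t y1 = std::min(height - 1, y + radius) + 1;  // exclusive
    for (int x = 0; x < width; ++x) {
      const std::size_t x0 = std::max(0, x - radius);
      const std::size_t x1 = std::min(width - 1, x + radius) + 1;  // exclusive
      const std::int64_t area =
          static_cast<std::int64_t>((x1 - x0) * (y1 - y0));
      const std::int64_t sum =
          integral[y1 * stride + x1] - integral[y0 * stride + x1] -
          integral[y1 * stride + x0] + integral[y0 * stride + x0];
      const std::size_t i = static_cast<std::size_t>(y) * width + x;
      // Same test as "gray < mean - offset", rewritten without division.
      if ((gray[i] + offset) * area < sum) out[i] = 1;
    }
  }
  return out;
}
```

Running both on the same samples and feeding each result to Tesseract is probably the quickest way to see which one suits a given kind of image; the window radius and offset of the adaptive variant usually need tuning per document type.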
Re: Customising Tesseract for character recognition
I don't know if it's affordable for you, but imho decent results can only be achieved if you do the segmentation yourself and then pass image fragments to Tesseract on a word-by-word basis. Problems may appear when the words are too short; however, as far as I can see, that's not your case.

A long time ago, I started my project relying on Tess's segmentation and struggled a lot with it, until I came to a word-by-word approach. Finally, I even switched to character-wise recognition, which at last produces decent results. This transition was mostly caused by the specifics of the input images I'm working on (photos, usually of low quality), but I think it is almost required for ideally scanned images too.

There are some fruitful math ideas behind Tess's segmentation, but I think the current implementation is not mature enough to be used extensively in production.

Warm regards, Dmitry Silaev

On Thu, Feb 24, 2011 at 1:05 PM, Jose diox...@gmail.com wrote:

Hi, (as you know, Saurabh, because we talked in private the other day) I tried PSM_SINGLE_COLUMN and the accuracy drops dramatically... I can't afford to lose that accuracy. Is it possible to change the way the output is displayed? Looking at the code, it seems rather hard to change... Perhaps I could print the x,y position of each word found and then work out the horizontal/vertical layout? What are your thoughts? regards
Re: Customising Tesseract for character recognition
Unfortunately, not only the text output order can suffer from Tess's segmentation: the extents of some text fragments can also be identified incorrectly (say, one segmented row can span two real rows, perhaps only partially), and that in turn can lead to *completely* irrelevant recognition results. However, you can run as many tests as possible on your images to check that this is probably not the case, and hope that segmentation errors won't be destructive and will only introduce this kind of disorder. Then you can certainly use your (x,y)-sort method and be happy ))

Warm regards, Dmitry Silaev

On Thu, Feb 24, 2011 at 1:50 PM, Jose diox...@gmail.com wrote:

Dmitry, the recognition works; the only thing is the way it is parsing it... :S I think segmenting the images would be too painful! I only want to change the order in which the output is displayed, or get the bounding boxes, so I could know the x and y of each recognized word and thereby organise the results better myself! Don't you think it's a good approach? Thank you very much for your help
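The (x,y)-sort discussed in this exchange could look roughly like the sketch below. It assumes each recognized word has already been collected together with its bounding box; WordBox, CenterY, SortReadingOrder and row_tolerance are hypothetical names for illustration, not part of Tesseract's API. Words are first ordered by vertical center, then swept into rows, and each row is ordered left to right.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for one recognized word and its bounding box
// (pixel coordinates, origin at the top-left corner of the image).
struct WordBox {
  std::string text;
  int left, top, right, bottom;
};

// Vertical center of a word box.
static int CenterY(const WordBox& w) { return (w.top + w.bottom) / 2; }

// Returns the words ordered top-to-bottom, then left-to-right. A word
// joins the current row while its vertical center is less than
// row_tolerance pixels below the center of the first word of that row.
std::vector<WordBox> SortReadingOrder(std::vector<WordBox> words,
                                      int row_tolerance) {
  assert(row_tolerance > 0);
  // Pass 1: order everything by vertical center.
  std::stable_sort(words.begin(), words.end(),
                   [](const WordBox& a, const WordBox& b) {
                     return CenterY(a) < CenterY(b);
                   });
  // Pass 2: sweep down the page, closing a row whenever the next word
  // lies too far below the row's first word, and sort each row by x.
  std::size_t row_start = 0;
  for (std::size_t i = 1; i <= words.size(); ++i) {
    const bool row_ends =
        i == words.size() ||
        CenterY(words[i]) - CenterY(words[row_start]) >= row_tolerance;
    if (row_ends) {
      std::sort(words.begin() + row_start, words.begin() + i,
                [](const WordBox& a, const WordBox& b) {
                  return a.left < b.left;
                });
      row_start = i;
    }
  }
  return words;
}
```

Grouping into rows before sorting by x avoids using a tolerance inside the comparator, which would not be a valid ordering for std::sort. As noted above, this only repairs the output order: if the segmentation merged two real rows into one, sorting the boxes afterwards will not recover the text.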
Re: Customising Tesseract for character recognition
The best way to explain everything would be just to send your source image examples, describe what information you want to get from them, and provide the community with the code snippets you use to interface with Tess. And please be as detailed as possible.

Warm regards, Dmitry Silaev

On Thu, Feb 24, 2011 at 2:17 PM, Jose diox...@gmail.com wrote:

In my particular case it is just a matter of the first word of each column being in one font and the rest in another, so instead of reading column by column it reads all the columns of the first row and then all the columns of the second row! My god, it is really hard to explain in English. I get an accurate result (90%), but I get column 1 and column 2 concatenated! I'm trying my best to understand the OCR, but it's really hard for me as I don't have any OCR background. I don't see any other approach than printing where each word was read and trying to postprocess all the results afterwards. Please correct me if I'm wrong or if you see some improvements that can be made. Please excuse my bad English. regards, jose
Re: [Tesseract 3] English training text
Interesting. I have been wondering about Cube since its traces began to appear in the source code, but I haven't had enough time to investigate it thoroughly. Zdenko, would you please kindly share your other findings on Cube?

Regards, Dmitry

On Tue, Feb 22, 2011 at 11:13 AM, zdenko podobny zde...@gmail.com wrote:

I doubt that Google will release their (full) training set :-( Have a look in svn at the file eng.cube.size [1]. There you can see the names of the fonts that were used for training English in 3.01. As far as I understand, there is an (unpublished/not released) possibility to train language data directly on font files. Unfortunately there are no details for the cube part of training. Zd.

[1] 12,4Mb! http://code.google.com/p/tesseract-ocr/source/browse/trunk/tessdata/eng.cube.size

On Wed, Feb 9, 2011 at 5:48 PM, Sly_bzh sl...@laposte.net wrote:

I would like to train tesseract for English with some special fonts. The Tesseract training documentation says that a text should be prepared and that it must follow some important points (see http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Generate_Training_Images). Could someone provide the community with the content of a good and efficient text for English training? Note: I think it could be useful to provide the texts that have been used to build the training files that can be downloaded in the Download section (http://code.google.com/p/tesseract-ocr/downloads/list). What do you think about that? Thanks!
Re: problem in single word recognition
I may not have understood you fully, but here is a self-explanatory excerpt from baseapi.h: "Each SetRectangle clears the recognition results so multiple rectangles can be recognized with the same image." Indeed, SetRectangle() calls ClearResults(), which "deletes the pageres and clears the block list ready for a new page". As of 3.01, reconize_a_word() is obsolete and was refactored out.

Regards, Dmitry

On Feb 18, 9:11 pm, Jacob George jg1...@gmail.com wrote:

Hi, I am using Tesseract 3.0 for my project, which requires real-time text recognition from the video stream of a web camera. I only need to recognize a single word or character after finding lines. In my code the function findlines() is called first (after setting the image using SetImage), and with it the TBOX of the word to be recognized is found. Using the coordinates of the bounding box, the setrectangle function is called and the text is recognized. But I face a few problems:

. If small words (such as are, is, be etc.) are taken, they are either recognized wrongly or not recognized at all. But if I give the bounding box of the same line, these words are recognized accurately. Why is that?
. Does setrectangle clear all the results of findlines()?
. I came across a function reconize_a_word to recognize a single word. Does reconize_a_word recognize the whole page and return only the text of the target box, or does it recognize only the word given in the target box?

thanks in advance, jacob
Re: Adaptive Data
Hi Zvezdoslav,

Check out the code of the Classify::EndAdaptiveClassifier() and Classify::InitAdaptiveClassifier() methods. Also search for classify_use_pre_adapted_templates and classify_save_adapted_templates.

HTH

Regards, Dmitry

On Feb 16, 4:50 pm, Zvezdoslav Kunov z.ku...@gmail.com wrote:

Hi all, I'm using the tesseract api v3.0.1 under Linux. Does anybody know a way to save/load the adaptive data that tesseract accumulates while running?
Re: Image pre-processing for good OCR results
Jon,

I don't know if it's intended, but all your links to images report "We're sorry. The page you tried to access is not available." So nothing can be advised on your issue...

Warm regards, Dmitry Silaev

On Mon, Feb 21, 2011 at 5:02 AM, Jon Andersen jande...@gmail.com wrote:

Hi, my project at http://RecordAGrave.com is about recording headstones from graves and posting the text and images on the Net so that people can research their family history. I would appreciate some advice on how to pre-process these headstone images to get the best results from Tesseract OCR. I have thousands of 1-2 MB jpg images of headstones to process.

Example images:
http://freepages.genealogy.rootsweb.ancestry.com/~janderse/cemeteries/Star%20of%20David%20Memorial%20Gardens/Garden%20of%20Haifa%20-%20Raw/IMG_28215.jpg
http://freepages.genealogy.rootsweb.ancestry.com/~janderse/cemeteries/Star%20of%20David%20Memorial%20Gardens/Garden%20of%20Haifa%20-%20Raw/IMG_28216.jpg
http://freepages.genealogy.rootsweb.ancestry.com/~janderse/cemeteries/Star%20of%20David%20Memorial%20Gardens/Garden%20of%20Haifa%20-%20Raw/IMG_28217.jpg

I am a software developer, so I can script up pre-processing steps to prepare the input for Tesseract. Any advice on improving OCR accuracy through pre-processing steps? Thanks so much, -Jon
Re: Wrappers for tessearct3.01?
devTess,

I'd not ask questions like this, as Tess is undergoing a transition from the old code base and is under heavy development of new features. I don't have enough time to investigate, but the prev_word_best_choice_ data member seems to be related to the best segmentation search based on the language model. Instead of rummaging in Tess's guts, I'd rather use the pretty convenient, high-level interface provided by ResultIterator (see GetIterator() in baseapi.h and then read all the comments in resultiterator.h and pageiterator.h).

Warm regards, Dmitry Silaev

On Wed, Feb 16, 2011 at 5:34 AM, devTess jim...@googlemail.com wrote:

Question: where can I find out more about (see below) tesseract_->prev_word_best_choice_? What is the purpose of doing that? Why is it not sufficient just to do page_res_ = new PAGE_RES(block_list_); ? Thank you.

    int TessBaseAPI::RecognizeText(ETEXT_DESC* monitor) {
      if (tesseract_ == NULL) return -1;
      if (page_res_ != NULL) delete page_res_;
      block_list_ = FindLinesCreateBlockList();
      tesseract_->SetBlackAndWhitelist();
      recognition_done_ = true;
      page_res_ = new PAGE_RES(block_list_, tesseract_->prev_word_best_choice_);
      // Now run the main recognition.
      tesseract_->recog_all_words(page_res_, monitor, NULL, NULL, 1);
      return 0;
    }
Re: Provide/visualize baseline info?
Sriranga,

I'm glad you've succeeded. Thanks to Zdenko for his guiding thought. I was aware of this debugging capability of Tess but strangely kept overlooking it.

I think the empty test1.txt is mostly a normal situation. I noticed this too, but as far as I remember there were times when it was filled with recognized text. Probably it depends on the actions you perform during the ScrollView session or on their specific sequence - I really have no time to investigate.

Probably I'll publish more tutorials on what can be done using ScrollView. But for my own needs this tool adds almost no value ((

*** I'm still seeking somebody's help regarding this topic's subject. ***

Warm regards, Dmitry Silaev

2011/2/8 Sriranga(78yrsold) withblessi...@gmail.com

Dmitry, congratulations!! I successfully installed it on WinXP and tried it using phototest.tif.

1st command line: tesseract phototest.tif test1 segdemo inter - works well
2nd command line: tesseract phototest.tif test1 matdemo inter - works well

However, I observed that the output test1 is zero KB - where did I make a mistake? I have just now tested it with Kannada script as well as Khem script - it displayed correctly and only the text file contains 0 KB. It would be nice if screenshots of Modes, Display and the others were reproduced for the benefit of newbies/users - this would enable them to view their own language other than English. It really is a boon to newbies like me. It would be better to have the extract of your blog published in the wiki section by the owner, for the benefit of the users of the forum. With Warmest Regards, -sriranga(78yrs)

2011/2/6 Dmitry Silaev daemons2...@gmail.com

Here are the brief instructions on how to set up the Tesseract interactive debug environment (ScrollView) on Windows:

1. Make sure you have the Java Runtime Environment installed
2. Download my home-brewed single archived installation suite from http://www.4shared.com/get/Z4gnbJdP/tess_debug.html
3. Unpack the installation suite
4. Run cmd.exe
5. Change the working directory to where you've unpacked the installation suite
6. Follow the instructions in http://code.google.com/p/tesseract-ocr/wiki/ViewerDebugging to run Tesseract+ScrollView from the command line

To keep the forum post size reasonable here in Google Groups, I placed the more verbose and overall nicer looking instructions in my blog at http://rdaemons.blogspot.com/2011/02/tesseract-ocr-setting-up-interactive.html

Warm regards, Dmitry Silaev

2011/2/6 Sriranga(78yrsold) withblessi...@gmail.com

Dear Dmitry, though it may or may not help me much, at least it will benefit the users of tesseract-ocr - for which the users of the forum/newbies shall be thankful to you. With Warmest Regards, -sriranga(78yrs)

On Sun, Feb 6, 2011 at 1:47 AM, Dmitry Silaev daemons2...@gmail.com wrote:

Dear Sriranga, I've just managed to start Tess's interactive visualizer. I don't really know if it might help you much, but I can publish step-by-step instructions on how to make it work. At least these instructions may help some of the Tess community newbies. Most likely, I'll be able to publish this within the next 24 hours. However, it's not a workable solution for me. I am still in desperate need to know if I can provide Tess with my own baseline info using some high-level structures and methods. Or whatever information you may have on this subject. Warm regards, Dmitry Silaev

2011/2/5 Sriranga(78yrsold) withblessi...@gmail.com

Tried to install on WinXP but failed. An extract of the cmd session is reproduced below for further guidance, please.

    C:\>set JAVA_HOME=C:\jdk1.4
    C:\>.\build.bat all (win32)
    '.\build.bat' is not recognized as an internal or external command, operable program or batch file.
    C:\>
    C:\>j:
    J:\tesseract-ocr-3.01alpha-r527\java>.\build.bat all (win32)
    Piccolo Build System
    ---
    Building with classpath C:\jdk1.4\lib\tools.jar;.\lib\ant.jar;.\lib\junit.jar;
    Starting Ant...
    The system cannot find the path specified.
    J:\tesseract-ocr-3.01alpha-r527\java>
    J:\tesseract-ocr-3.01alpha-r527\java>

Could you kindly let me know where I made a mistake? with warmest regards, -sriranga(78yrs)

On Sat, Feb 5, 2011 at 7:28 PM, Sriranga(78yrsold) withblessi...@gmail.com wrote:

As per the wiki instructions on debug mode: "On Windows: The build process for building ScrollView.jar is not defined. Instead copy piccolo-1.2.jar and piccolox-1.2.jar to tesseract/java" - which appears to be prescribed for tesseract 2.04. Will copying piccolo-1.2.jar and piccolox-1.2.jar to the tesseract/java folder of tesseract-3.01alpha (r527) work? For this purpose, does picolo.java1.2 (compiled source, 4.3MB) have to be downloaded for WinXP? Kindly confirm, since I am not a programmer/developer. With Regards, -Sriranga(78yrs)

2011/2/5 Zdenko Podobný zde...@gmail.com

I am not sure if it helps you, but did you try debug mode ( http://code.google.com
Re: Wrappers for tessearct3.01?
devTess be careful with coffee, don't overdose ))

Q1: Init(datapath, language, OcrEngineMode); What is the normal setting of OcrEngineMode?

Currently OcrEngineMode = OEM_TESSERACT_ONLY would be sufficient for all cases.

Q2: which of the following is USED in the normal running mode of tesseract.exe to recognize text?

The values of the variables you can see within the code of Recognize() (e.g. tesseract_->tessedit_resegment_from_boxes) are often loaded from config files. Usually recognition runs with no config files at all, so you can assume all these variables to be false. In that way you can examine the control paths and figure out what procedures get called at the recognition stage.

Q3: which of the following is USED in the normal running mode of tesseract.exe to recognize text?

You meant to train (copy-paste). Training is a 2-stage process:

1) Making box files. Requires two config files: batch.nochop and makebox
2) Generation of .tr files. Needs nobatch and box.train

You can find the above configs inside the tessdata/configs and tessdata/tessconfigs directories in Tess's distribution. Check these files and you'll understand what usually happens while training. Plain old step-by-step debugging is also of use ))

Warm regards,
Dmitry Silaev

On Tue, Feb 8, 2011 at 6:44 PM, devTess jim...@googlemail.com wrote:

Hi Dimitry,

with the guidelines provided by you, I prepared a strong cup of coffee and started reading the top part of baseapi.h

Q1: Init(datapath, language, OcrEngineMode); What is the normal setting of OcrEngineMode?

I try to use the Recognize(ETEXT_DESC* monitor) method. There are two PARTS to the Recognize method.

Part ONE:

Q2: which of the following is USED in the normal running mode of tesseract.exe to recognize text?

    if (tesseract_->tessedit_resegment_from_line_boxes)
      page_res_ = tesseract_->ApplyBoxes(*input_file_, true, block_list_);
    else if (tesseract_->tessedit_resegment_from_boxes)
      page_res_ = tesseract_->ApplyBoxes(*input_file_, false, block_list_);
    else
      page_res_ = new PAGE_RES(block_list_, tesseract_->prev_word_best_choice_);  // My guess
    if (tesseract_->tessedit_make_boxes_from_boxes) {
      tesseract_->CorrectClassifyWords(page_res_);
      return 0;
    }

Part TWO:

Q3: which of the following is USED in the normal running mode of tesseract.exe to recognize text?

    if (tesseract_->interactive_mode) {
      tesseract_->pgeditor_main(rect_width_, rect_height_, page_res_);
      // The page_res is invalid after an interactive session, so cleanup
      // in a way that lets us continue to the next page without crashing.
      delete page_res_;
      page_res_ = NULL;
      return -1;
    } else if (tesseract_->tessedit_train_from_boxes) {
      tesseract_->ApplyBoxTraining(*output_file_, page_res_);
    } else if (tesseract_->tessedit_ambigs_training) {
      FILE *training_output_file = tesseract_->init_recog_training(*input_file_);
      // OCR the page segmented into words by tesseract.
      tesseract_->recog_training_segmented(
          *input_file_, page_res_, monitor, training_output_file);
      fclose(training_output_file);
    } else {
      // Now run the main recognition.
      tesseract_->recog_all_words(page_res_, monitor, NULL, NULL, 0);  // My guess
    }

--
You received this message because you are subscribed to the Google Groups tesseract-ocr group.
To post to this group, send email to tesseract-ocr@googlegroups.com.
To unsubscribe from this group, send email to tesseract-ocr+unsubscr...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.
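The two training stages named in the reply above correspond to two runs of the Tesseract executable, each with its own pair of config names. A minimal Python sketch that only builds the argument lists, assuming the executable is on the PATH as `tesseract` and takes its arguments in the order image, output base, configs; the helper name `training_commands` and the file names are made up for illustration:

```python
def training_commands(image, base):
    """Return the argument lists for the two training stages.

    image and base are hypothetical names chosen by the caller; the
    config names are the ones quoted in the message above.
    """
    # Stage 1: produce a box file for the training image.
    make_boxes = ["tesseract", image, base, "batch.nochop", "makebox"]
    # Stage 2: produce the .tr file from the image plus its box file.
    make_tr = ["tesseract", image, base, "nobatch", "box.train"]
    return [make_boxes, make_tr]


if __name__ == "__main__":
    for argv in training_commands("khm.font.exp0.tif", "khm.font.exp0"):
        print(" ".join(argv))
```

Each list can be handed to `subprocess.run` as is; the box file produced by the first stage normally needs manual correction before the second stage is run.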
Re: Provide/visualize baseline info?
Here are the brief instructions on how to set up the Tesseract interactive debug environment (ScrollView) on Windows:

1. Make sure you have Java Runtime Environment installed
2. Download my home-brewed single archived installation suite from http://www.4shared.com/get/Z4gnbJdP/tess_debug.html
3. Unpack the installation suite
4. Run cmd.exe
5. Change working directory to where you've unpacked the installation suite
6. Follow the instructions in http://code.google.com/p/tesseract-ocr/wiki/ViewerDebugging to run Tesseract+ScrollView from the command line

To keep the forum post size reasonable here in Google Groups, I placed the more verbose and overall nicer looking instructions in my blog at http://rdaemons.blogspot.com/2011/02/tesseract-ocr-setting-up-interactive.html

Warm regards,
Dmitry Silaev

2011/2/6 Sriranga(78yrsold) withblessi...@gmail.com

Dear dmitry,

Though it may or may not help me much, at least it will benefit users of tesseract-ocr, for which users of the forum/newbies shall be thankful to you.

With Warmest Regards,
-sriranga(78yrs)

On Sun, Feb 6, 2011 at 1:47 AM, Dmitry Silaev daemons2...@gmail.com wrote:

Dear Sriranga,

I've just managed to start Tess's interactive visualizer. I don't really know if it might help you much, but I can publish the step-by-step instructions on how to make it work. At least these instructions may help some of the Tess community newbies. Most likely, I'll be able to publish this within the next 24 hours.

However it's not a workable solution for me. I am still in desperate need to know if I can provide Tess with my own baseline info using some high-level structures and methods. Or whatever information you may have on this subject.

Warm regards,
Dmitry Silaev

2011/2/5 Sriranga(78yrsold) withblessi...@gmail.com

Tried to install in WinXP but failed. An extract of the cmd session is reproduced below for further guidance please.
    C:\>set JAVA_HOME=C:\jdk1.4
    C:\>.\build.bat all (win32)
    '.\build.bat' is not recognized as an internal or external command,
    operable program or batch file.
    C:\>
    C:\>j:
    J:\tesseract-ocr-3.01alpha-r527\java>.\build.bat all (win32)
    Piccolo Build System
    ---
    Building with classpath C:\jdk1.4\lib\tools.jar;.\lib\ant.jar;.\lib\junit.jar;
    Starting Ant...
    The system cannot find the path specified.
    J:\tesseract-ocr-3.01alpha-r527\java>
    J:\tesseract-ocr-3.01alpha-r527\java>

May I kindly be told where I made a mistake?

with warmest regards,
-sriranga(78yrs)

On Sat, Feb 5, 2011 at 7:28 PM, Sriranga(78yrsold) withblessi...@gmail.com wrote:

As per the wiki instructions on debug mode, on Windows: "The build process for building ScrollView.jar is not defined. Instead copy piccolo-1.2.jar and piccolox-1.2.jar to tesseract/java", which appears to be prescribed for tesseract 2.04. Will copying piccolo-1.2.jar and piccolox-1.2.jar to the tesseract/java folder of tesseract-3.01alpha (r527) work? For this purpose, does picolo.java1.2 (compiled source, 4.3MB) have to be downloaded for WinXP? Kindly confirm, since I am not a programmer/developer.

With Regards,
-Sriranga(78yrs)

2011/2/5 Zdenko Podobný zde...@gmail.com

I am not sure if it helps you, but did you try debug mode (http://code.google.com/p/tesseract-ocr/wiki/ViewerDebugging)?

Zd.

On 05.02.2011 01:33, daemon-s wrote:

Hi!

I train Tess using separate images for every text line. Recognition is also run over single text line images. Recognition performs pretty well, however there are many errors that, I believe, are related to misdetected baselines, during training or recognition - I don't know. These include:

(double quote) detected as n
S detected as s (and vice versa)
V detected as v (and vice versa)
etc.

Is there any (preferably high-level) way to provide Tess with baseline info? Or at least obtain baseline info from Tess in order to visualize it further for debugging?
Thanks,
Dmitry
Re: Tesseract Training
Dear Sochenda,

Glad you have succeeded in training for Khmer and thanks for your kind words. Could you please share with us the images and .box files you used for training? Also some sample input images and respective recognition results would be of much use.

Sriranga,

I see your *training* process is doing pretty well. Most of your problems are in the dictionary facility. However I do not feel proficient in this field. I mean I know how it works (to be exact, how it *should* work), I understand the theoretical basis behind it, but I avoided using it as much as I could. When I was getting ready to start using Tess in my project, I read much of the tesseract-XXX groups and I understood that the dictionary facility is far from being perfect, at least it's not ready to use yet. Fortunately my project involves much image processing and the specifics of my task imply block/line/letter segmentation, so I managed to keep off most of Tess's dubious parts and used it solely as a raw classifier. And I think classification is what Tess does quite well.

Unfortunately I think you will have to struggle a lot with various inconsistencies and cryptic errors, but anyway I think it's worth it. You should report every error to the team and wait until it's fixed, at the same time trying to find your way around it. Or you can leave the dictionary facility and rely completely on some home-brewed post-processing. If you choose this, your problem turns into a small R&D project, so you need to find appropriate people to do this job.

Warm regards,
Dmitry Silaev
Re: Tesseract Training
Dear Sochenda,

Please provide us with every file you use to make up your traineddata. We also need all the command lines with which you run Tess and the Tess tools. Please be as detailed as possible.

Internship is a good opportunity for everyone here; I'd probably try to apply also but I'm not much of a recent graduate already ((

Warm regards,
Dmitry Silaev
Re: Tesseract Training
Dear Sochenda,

In addition to what Sriranga said, I'd remind you that you should do a lot of manual work:

In pyTesseractTrainer check that no bounding box intersects a glyph; if any does, correct its BB coordinates manually.

In cases of BB overlap you should space out the participating glyphs in the training image (see the attached picture for examples). You should use manual spacing if the participating glyphs are dependent characters (in your language, vowels) and the number of possible combinations is practically uncountable. Then you would assign every glyph its own code. Tess would consider these glyphs as separate characters and you should post-process the resulting code sequence to obtain a well-formed dependent Unicode pair (or triplet).

If there can be only a few such combinations, you can merge these BBs into one that encompasses all the required glyphs and assign a single code to the entire glyph combination. Then during the post-processing you'll need to replace this single code with a predefined dependent Unicode pair.

Hope I've managed to express myself clearly.

Warm regards,
Dmitry Silaev

attachment: figure01.GIF
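Merging several bounding boxes into one, as suggested above for the few-combinations case, amounts to taking the union rectangle and giving it a single label. A minimal Python sketch, assuming the Tesseract 3 box-file layout of label, left, bottom, right, top, page with the origin at the bottom-left corner of the image; the function name `merge_boxes`, the code `k001` and the coordinates are made up for illustration:

```python
def merge_boxes(boxes, code, page=0):
    """Merge several box-file entries into one line labelled `code`.

    Each box is a (left, bottom, right, top) tuple in box-file
    coordinates (origin assumed at the bottom-left image corner).
    """
    lefts, bottoms, rights, tops = zip(*boxes)
    # The union rectangle covers every participating glyph.
    merged = (min(lefts), min(bottoms), max(rights), max(tops))
    return "%s %d %d %d %d %d" % ((code,) + merged + (page,))


if __name__ == "__main__":
    # Two overlapping glyph boxes (made-up coordinates) become one box.
    print(merge_boxes([(10, 20, 30, 60), (25, 5, 40, 35)], "k001"))
```

The resulting line replaces the original lines of the merged glyphs in the .box file; the post-processor later expands `k001` back into the real character sequence.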
Re: Tesseract Training
Dear Sochenda,

I've checked the Unicode table range you've sent and now I see what the problem is. I'd agree that in such an algorithmic writing system (contrasted with simpler positional systems like, say, Roman or Cyrillic) the stages of pre-/post-processing are inevitable.

I'd suggest making special hand-crafted or generated training images. In these images you would properly space out all the joint character combinations as well as the character components that can make up Khmer characters. Then you would edit the resulting box files to assign codes according to your coding system. This process should be repeated as many times as required to achieve a sample count of 15-20 for every glyph.

At the recognition stage, if Tess is trained properly, overlapping bounding boxes are not a problem for it. My experience shows that it is very inventive in character segmentation even in cases of BB overlap. So I hope you will have no severe difficulties with partially overhanging or underlying glyphs.

Your post-processor should be able to decode the recognition output using an algorithmic approach to form good Unicode characters. You can also use some Khmer bigram or trigram statistics to do error correction. Probably you'd want to play around with Tess's dictionary facility but I doubt it would be helpful in your case.

Dmitry
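The target of 15-20 samples per glyph mentioned above is easy to check mechanically once the box files have been edited. A minimal Python sketch, assuming each box-file line starts with the glyph label followed by whitespace-separated coordinates; the function name `undersampled` and the threshold default are illustrative choices, not part of any Tesseract tool:

```python
from collections import Counter


def undersampled(box_lines, minimum=15):
    """Return {label: count} for labels seen fewer than `minimum` times.

    box_lines is an iterable of box-file lines; the label is taken to be
    the first whitespace-separated field of each non-empty line.
    """
    counts = Counter(line.split()[0] for line in box_lines if line.strip())
    return {label: n for label, n in counts.items() if n < minimum}


if __name__ == "__main__":
    lines = ["k001 1 2 3 4 0"] * 15 + ["k002 1 2 3 4 0"] * 3
    print(undersampled(lines))
```

Running it over the concatenated lines of all training box files lists the glyphs that still need more training images.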
Re: Tesseract Training
Dear Sochenda,

I'm not sure what the ultimate goal of your code assignment is, but the formal answer to your question is Yes. You can assign k001 or k002 to a bounding box in a .box file. Moreover, you can assign any UTF-8 encoded character sequence. In Tess version 3.0x (current) the only restriction is a 24 byte limit for the entire char sequence length. This also allows you to use not only an abstract code like k001 but a meaningful character sequence from your real language (e.g. the well-known fi ligature in some Latin fonts), which then relieves you from using the pre- and post-processing.

If you still prefer using abstract codes, then pre-/post-processing can be done without tinkering with Tess's code. Since training as well as recognition result in the generation of output files, you can develop a couple of file-processing command-line utilities which can then be used along with calls to the Tesseract executable within shell scripts (or .bat files in Windows).

For further details you definitely should study thoroughly the TrainingTesseract3 and ReadMe (section Installation Notes - Tesseract 3.00) documents (http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 and http://code.google.com/p/tesseract-ocr/wiki/ReadMe). These documents are not easy to search but they contain all the info you might need.

Warm regards,
Dmitry Silaev

On Sun, Jan 16, 2011 at 10:42 AM, KHEM Sochenda khemsoche...@gmail.com wrote:

Dear Dmitry,

Thank you very much for a comprehensive explanation.

Let's say, to be direct, does it sound OK to assign a code like 'k001' or 'k002' to the glyph obtained from Tesseract segmentation?

For post-processing, if it means touching the Tesseract code, could you please point out which files I should modify? Please advise me whether the latest version of Tesseract will do fine.

Thank you very much in advance for your time and response back.
Best Regards,
Sochenda

On Sat, Jan 15, 2011 at 3:05 AM, Dmitry Silaev daemons2...@gmail.com wrote:

Chenda,

In fact Tesseract doesn't care whether you do training for a real language's letter and which language this letter belongs to. Put simply, Tess only saves the mapping of feature sets obtained from training to Unicode ids. This implies that during training you can assign virtually any character code to virtually any glyph (to be exact, to a connected component or to a set of connected components).

If your language script is composed of a reasonable number of joint character combinations, then while training you can assign every such combination a predefined Unicode id (some restrictions apply). Later, when running recognition, you should do some post-processing to decode your predefined ids into the real language's character sequences.

For good results all this requires you to develop a training file pre-processor (mapping: language char combinations -> provisional ids) and a recognition result post-processor (mapping: provisional ids -> language char sequences). I'm not sure but this also may require correcting character property bit masks in the unicharset file (I don't know exactly how this information is used by Tess as I don't need it in my project).

Warm regards,
Dmitry Silaev

On Fri, Jan 14, 2011 at 10:25 AM, KHEM Sochenda khemsoche...@gmail.com wrote:

Dear Tesseract Team,

In the step of training a new language, we have to assign a Unicode value to each box. What if a shape is composed of several Unicode characters? Is there any way to assign only an id to each box in Tesseract?

Thank you very much in advance for your response.

Best Regards,
Chenda
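The 24 byte limit quoted in this thread applies to the UTF-8 encoding of a label, not to its character count, so multi-byte scripts reach it sooner than ASCII codes do. A minimal Python sketch for checking labels before training, assuming that a label of exactly 24 bytes is still acceptable (the thread does not say whether the limit is inclusive); the function name `oversized_labels` is made up for illustration:

```python
MAX_LABEL_BYTES = 24  # limit quoted in the thread for Tess 3.0x


def oversized_labels(labels, limit=MAX_LABEL_BYTES):
    """Return the labels whose UTF-8 encoding is longer than `limit` bytes."""
    return [label for label in labels if len(label.encode("utf-8")) > limit]


if __name__ == "__main__":
    # A Khmer letter takes 3 bytes in UTF-8, so 9 of them make 27 bytes.
    print(oversized_labels(["k001", "fi", "\u1780" * 9]))
```

Abstract codes such as `k001` stay far below the limit, which is one practical argument for them when a single box would otherwise stand for a long character sequence.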
Re: Tesseract Training
Chenda,

In fact Tesseract doesn't care whether you do training for a real language's letter and which language this letter belongs to. Put simply, Tess only saves the mapping of feature sets obtained from training to Unicode ids. This implies that during training you can assign virtually any character code to virtually any glyph (to be exact, to a connected component or to a set of connected components).

If your language script is composed of a reasonable number of joint character combinations, then while training you can assign every such combination a predefined Unicode id (some restrictions apply). Later, when running recognition, you should do some post-processing to decode your predefined ids into the real language's character sequences.

For good results all this requires you to develop a training file pre-processor (mapping: language char combinations -> provisional ids) and a recognition result post-processor (mapping: provisional ids -> language char sequences). I'm not sure but this also may require correcting character property bit masks in the unicharset file (I don't know exactly how this information is used by Tess as I don't need it in my project).

Warm regards,
Dmitry Silaev

On Fri, Jan 14, 2011 at 10:25 AM, KHEM Sochenda khemsoche...@gmail.com wrote:

Dear Tesseract Team,

In the step of training a new language, we have to assign a Unicode value to each box. What if a shape is composed of several Unicode characters? Is there any way to assign only an id to each box in Tesseract?

Thank you very much in advance for your response.

Best Regards,
Chenda
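The pre-processor and post-processor described above are two directions of one lookup table. A minimal Python sketch, assuming the provisional ids are ASCII codes that never occur in the real text and that recognition output contains them verbatim; the table entries, the codes `k001`/`k002` and the function names are made up for illustration and are not a real Khmer coding system:

```python
import re

# Hypothetical code table: provisional id -> real character sequence.
CODE_TO_TEXT = {
    "k001": "\u1780\u17d2\u1781",  # KA + COENG + KHA (illustrative cluster)
    "k002": "\u1780\u17b6",        # KA + vowel sign AA (illustrative)
}
TEXT_TO_CODE = {text: code for code, text in CODE_TO_TEXT.items()}


def encode(text):
    """Pre-processor: replace known sequences with provisional ids."""
    # Longest sequences first, so a long combination is not split up
    # by a shorter one that happens to be its prefix.
    for seq in sorted(TEXT_TO_CODE, key=len, reverse=True):
        text = text.replace(seq, TEXT_TO_CODE[seq])
    return text


def decode(text):
    """Post-processor: replace provisional ids with real sequences."""
    pattern = re.compile("|".join(map(re.escape, CODE_TO_TEXT)))
    return pattern.sub(lambda m: CODE_TO_TEXT[m.group(0)], text)


if __name__ == "__main__":
    sample = "\u1780\u17b6\u1780\u17d2\u1781"
    print(encode(sample))
    print(decode(encode(sample)) == sample)
```

`encode` would be applied to the labels written into the .box files and `decode` to the text files produced by recognition, so Tesseract itself needs no changes. If the real text can contain the literal string `k001`, a different code alphabet has to be chosen.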
Re: Can't get the user dictionary to work
On the plus side, it turns out that there are functions buried in the code to serialise/deserialise the classifier state, so it might be useful to run a whole corpus of short images through tess in one batch, save the state, and load that at startup.

Could you please be more specific about your findings: which functions, and what do they do? I think it might be of interest to many subscribers...

Thanks,
Dmitry
Re: tesseract output correction gear
there was a minor bug which prevented display of magnified textline images in the viewport after save
now it's fixed
eh, development version as i said

On Tue, Jul 13, 2010 at 3:36 PM, Jimmy O'Regan jore...@gmail.com wrote:

On 13 July 2010 11:55, daemon-s daemons2...@gmail.com wrote:

Please check out this little tool at http://www.c2scan.com/pokupedia.ru/index.php?r=site/ocr

As an output it produces an XML file containing bounding box descriptions and character mappings. For the current demo image anyone can make corrections and save. These corrections will be visible to successive visitors as well.

Crowd sourced OCR training? Cool :)

It is going to be a part of my new project pokupedia.ru. Since this is a development version, bugs and inconsistencies may exist. Currently it works in Firefox only. Sometimes the web site may be down as I use my home computer as a web server.

If this tool gains interest, I will probably release it into the public domain. Until then your feedback is expected - bug reports and improvement proposals are welcome.

This is something that Distributed Proofreaders might be interested in; you might get more testers if you mention it on the forums there.

Regards,
Dmitry

--
Leftmost jimregan, that's because deep inside you, you are evil.
Leftmost Also not-so-deep inside you.