What would you recommend for splitting the columns? I think I will need to run Tesseract column by column and then merge the results to rebuild the correct rows.
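A minimal sketch of that column-by-column idea, in Python, under several assumptions. The function names (`ocr_region`, `merge_columns`, `rows_to_csv`) are hypothetical. `ocr_region` assumes ImageMagick's `convert` and the `tesseract` command are both on the PATH, and it only uses the `tesseract image outputbase -l lang` form already shown in this thread. `merge_columns` assumes a custom segmentation step has already produced, for each column, the text of each line together with its vertical pixel position; the tolerance of 10 pixels is an arbitrary placeholder for the expected skew.

```python
import csv
import io
import os
import subprocess
import tempfile


def ocr_region(image_path, x, y, width, height, lang="por"):
    """Crop one rectangle with ImageMagick and OCR it with the tesseract CLI.

    Assumes `convert` and `tesseract` are installed and on the PATH.
    """
    with tempfile.TemporaryDirectory() as tmp:
        crop = os.path.join(tmp, "crop.tif")
        base = os.path.join(tmp, "out")
        geometry = f"{width}x{height}+{x}+{y}"
        subprocess.run(
            ["convert", image_path, "-crop", geometry, "+repage", crop],
            check=True,
        )
        # tesseract writes its result to <base>.txt
        subprocess.run(["tesseract", crop, base, "-l", lang], check=True)
        with open(base + ".txt", encoding="utf-8") as fh:
            return fh.read().strip()


def merge_columns(columns, tolerance=10):
    """Merge per-column OCR results into table rows.

    `columns` holds one list per column, each a list of (y, text) pairs.
    Items whose y is within `tolerance` pixels of the first item of the
    current row are placed in that row; missing cells stay empty.
    If a column has two items in the same band, the later one wins.
    """
    items = sorted(
        (y, col_idx, text)
        for col_idx, col in enumerate(columns)
        for y, text in col
    )
    rows = []  # each entry is [anchor_y, cells]
    for y, col_idx, text in items:
        if rows and abs(y - rows[-1][0]) <= tolerance:
            cells = rows[-1][1]
        else:
            cells = [""] * len(columns)
            rows.append([y, cells])
        cells[col_idx] = text
    return [cells for _, cells in rows]


def rows_to_csv(rows):
    """Serialize rows to CSV text, ready to be imported into a database."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical positions for a slightly skewed two-row report.
    codes = [(100, "00001"), (130, "00002")]
    names = [(102, "test"), (131, "test2")]
    products = [(99, "productA"), (134, "productB")]
    print(rows_to_csv(merge_columns([codes, names, products])), end="")
```

Grouping by vertical position rather than by line index is a deliberate choice here: a column with an empty cell then leaves a blank field instead of shifting every following row up by one.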
Can you point me in the right direction? What Unix-compatible tools can I use to make Tesseract scan a specific column? Later I will recompile and test, but first I need a way to scan these reports correctly and generate CSV files that I can import into a database. If it works, I will spend more time tuning Tesseract. Have you ever done this before (scanning reports with Tesseract or other tools to generate CSV files)?

Thanks

On 13/03/2011, at 11:20, Dmitry Silaev wrote:

> Running via ports can cause diverse errors. Try to compile Tesseract
> natively. I use revision 549 and as I said it works fine.
>
> Such tables as you have present a challenge for simple layout
> processing algorithms, due to sparsely located text. A minimal skew
> which is almost inevitable could break all the logic. In such cases I
> prefer to devise a custom made segmentation logic specific to the
> document type being processed. In this way I do not depend on
> Tesseract's segmentation - Tesseract is being used as a raw
> classifier.
>
> Warm regards,
> Dmitry Silaev
>
> On Sun, Mar 13, 2011 at 4:47 PM, manuel...@gmail.com
> <manuel...@gmail.com> wrote:
>> I'm using the latest version tesseract @3.00_2+eng
>> I installed it using ports on MacOSX
>>
>> Another question, Dmitry, about this sample:
>> why doesn't tesseract recognize a complete row? It's not
>> perfectly aligned, but it is impossible to get an image 100% aligned.
>> Tesseract is breaking columns onto new lines like:
>>
>> 00001 test productA
>> 00002 test2
>> productB
>>
>> Do you know how to fix it?
>>
>> Regards
>> Manuel Pardo
>>
>> On 13/03/2011, at 08:32, Dmitry Silaev wrote:
>>
>>> Manuel,
>>>
>>> The sample you provided definitely has insufficient resolution. You
>>> may only expect some part of the heading to be recognized. So this is
>>> what happened when I ran the recognition of your image. But I
>>> haven't got any error or warning messages with my "por.traineddata" at
>>> all!
>>>
>>> However all this was tested under Windows. Probably I can try this
>>> under Ubuntu, but I don't know when I will have enough time to reboot, set
>>> up a C++ compiler, build Tesseract and do some testing, sorry ))
>>>
>>> Are you sure you downloaded the latest stable version of Tesseract?
>>>
>>> Warm regards,
>>> Dmitry Silaev
>>>
>>> On Thu, Mar 10, 2011 at 9:32 PM, manuel...@gmail.com
>>> <manuel...@gmail.com> wrote:
>>>> I just replaced por.traineddata with your file por.traineddata.
>>>> After that I'm getting this error message:
>>>>
>>>>>> manuel$ tesseract input.tiff output -l por
>>>>>> actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES:Error:Assert
>>>>>> failed:in file tessdatamanager.cpp, line 55
>>>>>> Segmentation fault
>>>>
>>>> I haven't succeeded. I'm using version 3 - MacOSX 10.6
>>>>
>>>> Attached Reported.tiff
>>>>
>>>> Regards
>>>> Manuel Pardo
>>>>
>>>> On 04/03/2011, at 03:19, Dmitry Silaev wrote:
>>>>
>>>>> Manuel,
>>>>>
>>>>> Is the error message generated by version 2.xx? Did you try to run
>>>>> version 3.xx with my "por.traineddata" file?
>>>>> I don't get it - have you succeeded or not?
>>>>> Please provide us with the image you are trying to recognize.
>>>>>
>>>>> Warm regards,
>>>>> Dmitry Silaev
>>>>>
>>>>> On Thu, Mar 3, 2011 at 5:34 PM, manuel...@gmail.com <manuel...@gmail.com>
>>>>> wrote:
>>>>>> Hi Dmitry,
>>>>>>
>>>>>> I just replaced the file with your por.traineddata
>>>>>> But I'm getting an error:
>>>>>>
>>>>>> manuel$ tesseract input.tiff output -l por
>>>>>> actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES:Error:Assert
>>>>>> failed:in file tessdatamanager.cpp, line 55
>>>>>> Segmentation fault
>>>>>>
>>>>>> It seems it would be interesting to convert old files from 2.0X to 3, because
>>>>>> there isn't a Brazilian Portuguese for version 3, just "portuguese".
>>>>>> At least the dictionary por.traineddata is working correctly in version
>>>>>> 3.
>>>>>> The special chars are being recognized by tesseract 3.
>>>>>>
>>>>>> regards,
>>>>>> Manuel Pardo
>>>>>>
>>>>>> On 03/03/2011, at 09:12, Dmitry Silaev wrote:
>>>>>>
>>>>>>> Manuel,
>>>>>>>
>>>>>>> It's quite an interesting question although it may seem to be an
>>>>>>> ordinary newbie-like one.
>>>>>>>
>>>>>>> I was always wondering if 2.xx files can be used with version 3.xx.
>>>>>>> The wiki states that "the files in the traineddata file are different
>>>>>>> from the list used prior to 3.00, and will most likely change,
>>>>>>> possibly dramatically in future revisions."
>>>>>>>
>>>>>>> I have no time to investigate it in the code so I decided to act
>>>>>>> rather than to think. After some tinkering with all those files I
>>>>>>> slipped the resulting "por.traineddata" into the Tesseract algo I'm
>>>>>>> currently working on, and - guess what? - it worked! ))
>>>>>>>
>>>>>>> I must say it was tested only with a couple of *very simple* images
>>>>>>> and also it absolutely lacks any dictionary-related data. And my test
>>>>>>> images don't contain these specific Portuguese letters with
>>>>>>> diacritics. So in fact this file may perform poorly. Please test and
>>>>>>> report your results. The file is in the attachment.
>>>>>>>
>>>>>>> It was not difficult at all but also not so straight-forward to make
>>>>>>> this training data file, so probably this process deserves a separate
>>>>>>> article and later I'd like to post it in my blog.
>>>>>>>
>>>>>>> Warm regards,
>>>>>>> Dmitry Silaev
>>>>>>>
>>>>>>> On Wed, Mar 2, 2011 at 8:40 PM, manuelfhp <manuel...@gmail.com> wrote:
>>>>>>>> Hello list,
>>>>>>>> I can't find a solution for special chars
>>>>>>>>
>>>>>>>> I installed tesseract 3 on my MacOSX 10.6
>>>>>>>> It is running very well
>>>>>>>>
>>>>>>>> But I'm having problems with charset.
>>>>>>>> I need tesseract working with Brazilian Portuguese. (ISO8859-1)
>>>>>>>>
>>>>>>>> I installed the Portuguese dictionary but it is not working with special
>>>>>>>> chars like Ç Ã É é .... (ISO8859-1)
>>>>>>>> Is there any solution?
>>>>>>>>
>>>>>>>> There is an old dictionary specifically for Brazilian Portuguese in version
>>>>>>>> 2.0.4. Is it possible to use it in version 3? How?
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>> tesseract-ocr+unsubscr...@googlegroups.com.
>>>>>>>> For more options, visit this group at
>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>>>>
>>>>>>> <por.traineddata>