Manuel,

I'm afraid just chaining command line tools won't help in this case.
I'm talking about programming.

And yes, I have solved many practical problems in layout analysis
and other areas of document image processing, and done so
successfully ))
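
A minimal Python sketch of what such document-specific programming could
look like (an illustration only, not the code actually used). It assumes
the page is already binarized into rows of 0/1 pixels. The helper names
find_column_spans, merge_cells_to_rows and rows_to_csv are hypothetical,
and the thresholds are illustrative. Cropping each column and passing it
to Tesseract is left out, so the cell texts are assumed inputs.

```python
import csv
import io


def find_column_spans(binary, min_gap=3):
    """Return (start, end) x-ranges (end exclusive) of text columns.

    binary is a list of pixel rows, each a list of 0 (background) / 1 (ink).
    A run of at least min_gap empty pixel columns separates two text columns.
    """
    width = len(binary[0])
    # Vertical projection profile: ink pixels counted per pixel column.
    profile = [sum(row[x] for row in binary) for x in range(width)]
    spans, start, gap = [], None, 0
    for x, ink in enumerate(profile):
        if ink:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The last inked pixel column was at x - gap.
                spans.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        spans.append((start, width - gap))
    return spans


def merge_cells_to_rows(cells, y_tolerance=5):
    """Group (column_index, y_center, text) cells into table rows.

    A cell joins the current row when its y_center is within y_tolerance
    of the first cell of that row, which absorbs a small skew.
    """
    rows = []
    for col, y, text in sorted(cells, key=lambda c: c[1]):
        if rows and abs(y - rows[-1]["y"]) <= y_tolerance:
            rows[-1]["cells"][col] = text
        else:
            rows.append({"y": y, "cells": {col: text}})
    return rows


def rows_to_csv(rows, n_columns):
    """Serialize merged rows to CSV text, leaving missing cells empty."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow([row["cells"].get(i, "") for i in range(n_columns)])
    return out.getvalue()
```

With the sample rows from this thread, cells such as (0, 130, "00002"),
(1, 131, "test2") and (2, 134, "productB") end up in one CSV line even
though their vertical positions differ by a few pixels. The tolerance has
to be tuned per document type, which is exactly why chained command line
tools alone are not enough.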

Warm regards,
Dmitry Silaev





On Mon, Mar 14, 2011 at 7:55 AM, manuel...@gmail.com
<manuel...@gmail.com> wrote:
> What would you recommend for splitting the columns?
>
> I think I will need to run tesseract column by column and then
> merge the results to rebuild the correct rows.
>
> Can you point me in the right direction?
> What Unix-compatible tools can I use to tell tesseract to scan a
> specific column?
>
> Later I will recompile to test, but first I need to find a way to scan
> these reports correctly and generate CSV files to import into a database.
> If it works I will spend more time tuning tesseract.
>
> Have you ever done this before (scanning reports with tesseract or other
> tools to generate CSV files)?
>
> Thanks
>
>
>
> On 13/03/2011, at 11:20, Dmitry Silaev wrote:
>
>> Installing via ports can cause various errors. Try compiling Tesseract
>> natively. I use revision 549 and, as I said, it works fine.
>>
>> Tables like yours are a challenge for simple layout
>> processing algorithms because the text is sparsely located. Even a
>> minimal skew, which is almost inevitable, can break all the logic. In
>> such cases I prefer to devise custom segmentation logic specific to
>> the document type being processed. That way I do not depend on
>> Tesseract's segmentation; Tesseract is used only as a raw
>> classifier.
>>
>> Warm regards,
>> Dmitry Silaev
>>
>>
>>
>>
>>
>> On Sun, Mar 13, 2011 at 4:47 PM, manuel...@gmail.com
>> <manuel...@gmail.com> wrote:
>>> I'm using the latest version, tesseract @3.00_2+eng.
>>> I installed it using ports on Mac OS X.
>>>
>>> Another question, Dmitry, about this sample:
>>> why doesn't tesseract recognize a complete row? The alignment is not
>>> perfect, but it is impossible to get an image 100% aligned.
>>> Tesseract is breaking columns onto new lines, like this:
>>>
>>> 00001           test    productA
>>> 00002           test2
>>> productB
>>>
>>> Do you know how to fix it?
>>>
>>> Regards
>>> Manuel Pardo
>>>
>>>
>>> On 13/03/2011, at 08:32, Dmitry Silaev wrote:
>>>
>>>> Manuel,
>>>>
>>>> The sample you provided definitely has insufficient resolution. You
>>>> can only expect some part of the heading to be recognized, and this is
>>>> what happened when I ran recognition on your image. But I
>>>> didn't get any error or warning messages with my "por.traineddata" at
>>>> all!
>>>>
>>>> However, all this was tested under Windows. I could probably try it
>>>> under Ubuntu, but I don't know when I will have enough time to reboot, set
>>>> up a C++ compiler, build Tesseract and do some testing, sorry ))
>>>>
>>>> Are you sure you downloaded the latest stable version of Tesseract?
>>>>
>>>> Warm regards,
>>>> Dmitry Silaev
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Mar 10, 2011 at 9:32 PM, manuel...@gmail.com
>>>> <manuel...@gmail.com> wrote:
>>>>> I just replaced my por.traineddata with your por.traineddata file.
>>>>> After that I'm getting this error message:
>>>>>
>>>>>>> manuel$ tesseract input.tiff output -l por
>>>>>>> actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES:Error:Assert failed:in file tessdatamanager.cpp, line 55
>>>>>>> Segmentation fault
>>>>>
>>>>> I haven't succeeded. I'm using version 3 - MacOSX 10.6
>>>>>
>>>>>
>>>>>
>>>>> Attached Reported.tiff
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Regards
>>>>> Manuel Pardo
>>>>>
>>>>> On 04/03/2011, at 03:19, Dmitry Silaev wrote:
>>>>>
>>>>>> Manuel,
>>>>>>
>>>>>> Is the error message generated by version 2.xx? Did you try to run
>>>>>> version 3.xx with my "por.traineddata" file?
>>>>>> I don't get it - have you succeeded or not?
>>>>>> Please provide us with the image you are trying to recognize.
>>>>>>
>>>>>> Warm regards,
>>>>>> Dmitry Silaev
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 3, 2011 at 5:34 PM, manuel...@gmail.com 
>>>>>> <manuel...@gmail.com> wrote:
>>>>>>> Hi Dmitry,
>>>>>>>
>>>>>>> I just replaced my file with your por.traineddata,
>>>>>>> but I'm getting an error:
>>>>>>>
>>>>>>> manuel$ tesseract input.tiff output -l por
>>>>>>> actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES:Error:Assert failed:in file tessdatamanager.cpp, line 55
>>>>>>> Segmentation fault
>>>>>>>
>>>>>>> It seems worthwhile to convert old files from 2.0X to 3,
>>>>>>> because there is no Brazilian Portuguese for version 3, just
>>>>>>> "Portuguese".
>>>>>>> At least the dictionary por.traineddata is working correctly in
>>>>>>> version 3.
>>>>>>> The special chars are being recognized by tesseract 3.
>>>>>>>
>>>>>>> regards,
>>>>>>> Manuel Pardo
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 03/03/2011, at 09:12, Dmitry Silaev wrote:
>>>>>>>
>>>>>>>> Manuel,
>>>>>>>>
>>>>>>>> It's quite an interesting question although it may seem to be an
>>>>>>>> ordinary newbie-like one.
>>>>>>>>
>>>>>>>> I was always wondering if 2.xx files can be used with version 3.xx.
>>>>>>>> The wiki states that "the files in the traineddata file are different
>>>>>>>> from the list used prior to 3.00, and will most likely change,
>>>>>>>> possibly dramatically in future revisions."
>>>>>>>>
>>>>>>>> I have no time to investigate it in the code, so I decided to act
>>>>>>>> rather than think. After some tinkering with all those files I
>>>>>>>> slipped the resulting "por.traineddata" into the Tesseract-based
>>>>>>>> algorithm I'm currently working on, and - guess what? - it worked! ))
>>>>>>>>
>>>>>>>> I must say it was tested only with a couple of *very simple* images
>>>>>>>> and also it absolutely lacks any dictionary-related data. And my test
>>>>>>>> images don't contain these specific Portuguese letters with
>>>>>>>> diacritics. So in fact this file may perform poorly. Please test and
>>>>>>>> report your results. The file is in the attachment.
>>>>>>>>
>>>>>>>> It was not difficult at all, but also not so straightforward to make
>>>>>>>> this training data file, so this process probably deserves a separate
>>>>>>>> article, and later I'd like to post it on my blog.
>>>>>>>>
>>>>>>>> Warm regards,
>>>>>>>> Dmitry Silaev
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Mar 2, 2011 at 8:40 PM, manuelfhp <manuel...@gmail.com> wrote:
>>>>>>>>> Hello list,
>>>>>>>>> I can't find a solution for special chars.
>>>>>>>>>
>>>>>>>>> I installed tesseract 3 on my Mac OS X 10.6.
>>>>>>>>> It is running very well.
>>>>>>>>>
>>>>>>>>> But I'm having problems with the charset.
>>>>>>>>> I need tesseract to work with Brazilian Portuguese (ISO8859-1).
>>>>>>>>>
>>>>>>>>> I installed the Portuguese dictionary, but it is not working with special
>>>>>>>>> chars like  Ç Ã É é ....  (ISO8859-1)
>>>>>>>>> Is there any solution?
>>>>>>>>>
>>>>>>>>> There is an old dictionary specifically for Brazilian Portuguese in version
>>>>>>>>> 2.0.4. Is it possible to use it in version 3? How?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>>>>>>>>> To unsubscribe from this group, send email to 
>>>>>>>>> tesseract-ocr+unsubscr...@googlegroups.com.
>>>>>>>>> For more options, visit this group at 
>>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> <por.traineddata>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>>
>>
>
>
>

