Re: Especial Characteres

manuel...@gmail.com Sun, 13 Mar 2011 22:10:11 -0700

What would you recommend for splitting the columns?

I think I will need to run tesseract column by column,
and after that merge the results to rebuild the correct rows.


Can you point me in the right direction?
What tools (Unix-compatible) can I use to make tesseract scan a 
specific column?
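
A rough sketch of one possible approach, not a tested recipe: crop each column out of the page with ImageMagick's `convert -crop`, then run tesseract on each crop with the same invocation used elsewhere in this thread. The column bounds, the page height and the `colN` file names below are made-up placeholders, and the script only builds and prints the commands rather than running them.

```python
# Hypothetical column bounds in pixels: (left edge, width).
# Real values depend on the scanned report and must be measured.
COLUMNS = [(0, 200), (200, 300), (500, 400)]


def build_commands(image, page_height, columns, lang="por"):
    """Return one (crop_cmd, ocr_cmd) pair per column, without running anything."""
    commands = []
    for index, (left, width) in enumerate(columns):
        cropped = f"col{index}.tiff"  # hypothetical intermediate file name
        # ImageMagick geometry is WIDTHxHEIGHT+X+Y; +repage resets the canvas offset.
        crop_cmd = ["convert", image, "-crop",
                    f"{width}x{page_height}+{left}+0", "+repage", cropped]
        # Same form as in the thread: tesseract <image> <output base> -l <lang>
        ocr_cmd = ["tesseract", cropped, f"col{index}", "-l", lang]
        commands.append((crop_cmd, ocr_cmd))
    return commands


if __name__ == "__main__":
    for crop_cmd, ocr_cmd in build_commands("input.tiff", 1000, COLUMNS):
        print(" ".join(crop_cmd))
        print(" ".join(ocr_cmd))
```

Each tesseract run would then leave one text file per column (`col0.txt`, `col1.txt`, ...), assuming the crops are reasonably free of skew.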

Later I will recompile to test, but first I need to find a way to scan 
these reports correctly to generate CSV files to import later into a database.
If it works I will spend more time tuning tesseract.
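
For the merge step, a minimal sketch, assuming each column has already been recognized into its own list of text lines and that the n-th line of every column belongs to the same row. The function names are illustrative only, and the sample values are the ones from the broken output quoted further down in this thread.

```python
import csv
import io
from itertools import zip_longest


def merge_columns(columns):
    """Pair the n-th line of every column into one row; pad short columns with ''."""
    return [list(row) for row in zip_longest(*columns, fillvalue="")]


def to_csv(rows):
    """Serialize rows to CSV text using the standard csv module."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    codes = ["00001", "00002"]
    names = ["test", "test2"]
    products = ["productA", "productB"]
    print(to_csv(merge_columns([codes, names, products])), end="")
```

This only works if no column has empty cells: a blank cell would make that column one line shorter and shift every row below it, so a real report would probably need the rows matched by vertical position instead.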

Have you ever done this before (scanned reports using tesseract or other tools to 
generate CSV files)?

Thanks



On 13/03/2011, at 11:20, Dmitry Silaev wrote:

> Running via ports can cause various errors. Try compiling Tesseract
> natively. I use revision 549 and, as I said, it works fine.
> 
> Such tables as you have present a challenge for simple layout
> processing algorithms, due to sparsely located text. A minimal skew
> which is almost inevitable could break all the logic. In such cases I
> prefer to devise a custom made segmentation logic specific to the
> document type being processed. In this way I do not depend on
> Tesseract's segmentation - Tesseract is being used as a raw
> classifier.
> 
> Warm regards,
> Dmitry Silaev
> 
> 
> 
> 
> 
> On Sun, Mar 13, 2011 at 4:47 PM, manuel...@gmail.com
> <manuel...@gmail.com> wrote:
>> I'm using the latest version, tesseract @3.00_2+eng.
>> I installed it using ports on Mac OS X.
>> 
>> Another question, Dmitry, about this sample:
>> why doesn't tesseract recognize a complete row? The alignment is not 
>> perfect, but it is impossible to get an image 100% aligned.
>> Tesseract is breaking columns onto new lines, like this:
>> 
>> 00001           test    productA
>> 00002           test2
>> productB
>> 
>> Do you know how to fix it?
>> 
>> Regards
>> Manuel Pardo
>> 
>> 
>> On 13/03/2011, at 08:32, Dmitry Silaev wrote:
>> 
>>> Manuel,
>>> 
>>> The sample you provided definitely has insufficient resolution. You
>>> can only expect some part of the heading to be recognized, and that is
>>> what happened when I ran recognition on your image. But I
>>> didn't get any error or warning messages with my "por.traineddata" at
>>> all!
>>> 
>>> However, all this was tested under Windows. I could probably try this
>>> under Ubuntu, but I don't know when I will have enough time to reboot, set
>>> up a C++ compiler, build Tesseract and do some testing, sorry ))
>>> 
>>> Are you sure you downloaded the latest stable version of Tesseract?
>>> 
>>> Warm regards,
>>> Dmitry Silaev
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Mar 10, 2011 at 9:32 PM, manuel...@gmail.com
>>> <manuel...@gmail.com> wrote:
>>>> I just replaced por.traineddata with your por.traineddata file.
>>>> After that I'm getting this error message:
>>>> 
>>>>>> manuel$ tesseract input.tiff output -l por
>>>>>> actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES:Error:Assert 
>>>>>> failed:in file tessdatamanager.cpp, line 55
>>>>>> Segmentation fault
>>>> 
>>>> I haven't succeeded. I'm using version 3 - MacOSX 10.6
>>>> 
>>>> 
>>>> 
>>>> Attached Reported.tiff
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Regards
>>>> Manuel Pardo
>>>> 
>>>> On 04/03/2011, at 03:19, Dmitry Silaev wrote:
>>>> 
>>>>> Manuel,
>>>>> 
>>>>> Is the error message generated by version 2.xx? Did you try to run
>>>>> version 3.xx with my "por.traineddata" file?
>>>>> I don't get it - have you succeeded or not?
>>>>> Please provide us with the image you are trying to recognize.
>>>>> 
>>>>> Warm regards,
>>>>> Dmitry Silaev
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, Mar 3, 2011 at 5:34 PM, manuel...@gmail.com <manuel...@gmail.com> 
>>>>> wrote:
>>>>>> Hi Dmitry,
>>>>>> 
>>>>>> I just replaced my por.traineddata with your file.
>>>>>> But I'm getting an error:
>>>>>> 
>>>>>> manuel$ tesseract input.tiff output -l por
>>>>>> actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES:Error:Assert 
>>>>>> failed:in file tessdatamanager.cpp, line 55
>>>>>> Segmentation fault
>>>>>> 
>>>>>> It seems worthwhile to convert old files from 2.0X to 3, because 
>>>>>> there is no Brazilian Portuguese for version 3, just "portuguese".
>>>>>> At least the dictionary por.traineddata is working correctly in version 
>>>>>> 3.
>>>>>> The special chars are being recognized by tesseract 3.
>>>>>> 
>>>>>> regards,
>>>>>> Manuel Pardo
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 03/03/2011, at 09:12, Dmitry Silaev wrote:
>>>>>> 
>>>>>>> Manuel,
>>>>>>> 
>>>>>>> It's quite an interesting question although it may seem to be an
>>>>>>> ordinary newbie-like one.
>>>>>>> 
>>>>>>> I was always wondering if 2.xx files can be used with version 3.xx.
>>>>>>> The wiki states that "the files in the traineddata file are different
>>>>>>> from the list used prior to 3.00, and will most likely change,
>>>>>>> possibly dramatically in future revisions."
>>>>>>> 
>>>>>>> I have no time to investigate it in the code so I decided to act
>>>>>>> rather than to think. After some tinkering with all those files I
>>>>>>> slipped the resulting "por.traineddata" into the Tesseract algo I'm
>>>>>>> currently working on, and - guess what? - it worked! ))
>>>>>>> 
>>>>>>> I must say it was tested only with a couple of *very simple* images
>>>>>>> and also it absolutely lacks any dictionary-related data. And my test
>>>>>>> images don't contain these specific Portuguese letters with
>>>>>>> diacritics. So in fact this file may perform poorly. Please test and
>>>>>>> report your results. The file is in the attachment.
>>>>>>> 
>>>>>>> It was not difficult at all, but also not so straightforward, to make
>>>>>>> this training data file, so this process probably deserves a separate
>>>>>>> article and later I'd like to post it on my blog.
>>>>>>> 
>>>>>>> Warm regards,
>>>>>>> Dmitry Silaev
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Mar 2, 2011 at 8:40 PM, manuelfhp <manuel...@gmail.com> wrote:
>>>>>>>> Hello list,
>>>>>>>> I can't find a solution for special chars.
>>>>>>>> 
>>>>>>>> I installed tesseract 3 on my Mac OS X 10.6.
>>>>>>>> It is running very well.
>>>>>>>> 
>>>>>>>> But I'm having problems with the charset.
>>>>>>>> I need tesseract working with Brazilian Portuguese (ISO8859-1).
>>>>>>>> 
>>>>>>>> I installed the Portuguese dictionary, but it is not working with special
>>>>>>>> chars like  Ç Ã É é ....  (ISO8859-1)
>>>>>>>> Is there any solution?
>>>>>>>> 
>>>>>>>> There is an old dictionary specifically for Brazilian Portuguese in version
>>>>>>>> 2.0.4. Is it possible to use it in version 3? How?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>>>>>>>> To unsubscribe from this group, send email to 
>>>>>>>> tesseract-ocr+unsubscr...@googlegroups.com.
>>>>>>>> For more options, visit this group at 
>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> <por.traineddata>
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>> 
>> 
> 
> 
