Re: [tesseract-ocr] Re: Inconsistencies in detection and extraction of text using tesseract

Sundara Ganesh Mon, 17 Jun 2024 21:50:51 -0700

Hello Ger Hobbelt,

Your meta question is very reasonable.  However, reality is very different, 
IMO.


For example, many banks and brokerage firms don't retain personal financial 
account statements/documents for more than 5 years or so.  However, you may 
have printed copies of the same, received by mail at that time.  We should 
be able to OCR them as accurately as possible.
The same is true for OCR'ing scanned receipts for personal accounting.

I would be very interested in OCR'ing my 10-year-old financial documents 
and statements.
Tesseract is great and far better than the other engines I've tried, but 
it is certainly nowhere near perfect: it still needs human intervention and 
special handling of inputs, based on human verification of every output.

Sundar
On Monday, June 3, 2024 at 1:55:41 PM UTC-7 [email protected] wrote:

> Re image size, etc.: see
> - https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ 
> -- whose report / chart suggests it's beneficial to rescale any input 
> image to produce a text size of about 30px vertical.
> - https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html -- which 
> also links to the report above, plus it has some table-related info.
> Both explain *why* resizing, etc. are often beneficial to OCR confidence 
> numbers & quality.
>
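The 30px guideline above can be turned into a scale factor before resizing. A minimal sketch, assuming the typical text height has already been measured in pixels (for example from word bounding boxes); the function names and the 30px default are illustrative and not part of tesseract:

```python
def scale_for_text_height(measured_px: float, target_px: float = 30.0) -> float:
    """Return the factor that brings text of measured_px height to target_px."""
    if measured_px <= 0:
        raise ValueError("measured_px must be positive")
    return target_px / measured_px


def scaled_size(width: int, height: int, factor: float) -> tuple[int, int]:
    """Return the (width, height) after scaling, at least 1 px on each side."""
    return max(1, round(width * factor)), max(1, round(height * factor))


factor = scale_for_text_height(12)          # text measured at 12 px tall
new_size = scaled_size(800, 600, factor)    # size to hand to a resize call
```

With OpenCV, this size could be passed as `cv2.resize(img, new_size, interpolation=cv2.INTER_CUBIC)` when enlarging; `cv2.INTER_AREA` is generally the better fit when shrinking.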
> Re your last question about the first column in your reconstructed table: 
> https://groups.google.com/g/tesseract-ocr/c/B2-EVXPLovQ/m/lP0zQVApAAAJ -- 
> your reconstruction of the first column would be part of the 
> [3]PostProcessing phase, as tesseract is book/paper/word focused, so it 
> will only reconstruct words from character sequences.
> AFAIK the latest release doesn't have an advanced table reconstruction 
> module like you need. See also the end of the  
> https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html document for 
> more info / links.
>
>
>
> Quick meta-question though as I am quite surprised that people feed 
> financial data into any kind of (fundamentally statistical and thus 
> noise-injecting) OCR process (not just tesseract but any and all of them 
> out there): wouldn't it be more business-smart to scrape financial 
> performance reports like these or even better: get the direct data export 
> from the SAP software at that company, so that you forego the entire 
> machine-render-text-to-image + image-to-text OCR risky and costly process 
> altogether? 
> That financial performance stuff is usually reported in PDF/A format for 
> obvious reasons (chamber of commerce, stock exchange, investors, those 
> kinds of folks who all like their data as *virginal* as can be) and when 
> you grab that output you're one straight text extract away from success, 
> instead of wrangling a risky OCR process chain, which, by definition, 
> cannot deliver a 100% accurate reconstruction all the time.
> As this clearly is corporate financial data you're processing (and we can 
> thus safely assume this reported data will be fed into follow-up processes 
> where the actual numbers are of some import), I would expect nobody 
> involved will appreciate the implicit risk factors introduced by injecting 
> an inherently noisy statistical filter in the number crunching process, 
> which opens one to the forever clear and present risk of random number 
> value inaccuracies due to the nature of any neural net's output?
>
> You're certainly not the only one attempting to apply OCR to financial 
> data around here (the mailing list is brimming with it), but when I see 
> annual / quarterly corporate performance reports being processed like that, 
> I start to worry a wee bit more than usual. Not for tesseract (it does its 
> job just fine), but for the one who came up with the idea to plonk such 
> data into an image file and feed it to any kind of OCR machinery. Sounds 
> like an already previously failed due diligence execution to me, where the 
> question should have been asked: can we get this data in any type of text 
> format straight from the source, as it is already company- and 
> machine-produced? txt, csv, pdf, excel, anything? At what cost?
> Or can't you get the text data (why?! if you get the page images, it's 
> published material, correct?) and do you intend to use tesseract / your OCR 
> process as an *assistive process* where the OCR output is reviewed / vetted 
> by a human before deemed of sufficient quality for further use?
>
>
>
>
>
>
> Met vriendelijke groeten / Best regards,
>
> Ger Hobbelt
>
> --------------------------------------------------
> web:    http://www.hobbelt.com/
>         http://www.hebbut.net/
> mail:   [email protected]
> mobile: +31-6-11 120 978
> --------------------------------------------------
>
>
> On Mon, Jun 3, 2024 at 3:51 PM Saanvi Bhagat <[email protected]> wrote:
>
>> Thank you so much for your help!! Using interpolation improved my results 
>> to a great extent. I would like one more suggestion from you. I have 
>> extracted the text from the table in the image. Now I am trying to save it 
>> in a CSV. For that, I am using the coordinates of the detected text and 
>> reconstructing the table structure. 
>> I am providing the input image and a screenshot of the resulting output 
>> in the CSV file. As can be seen in the output_in_csv image, the facts 
>> and figures are being saved correctly; however, the first column is badly 
>> broken. A new column is being generated for each word. That might be 
>> because tesseract detects the text word by word and hence creates a new 
>> column for each word. Could you please suggest a way to improve my 
>> results (mainly the first column)?
>> The main issues are repetition in the column values and a new column 
>> being created for each word rather than just 1 column.
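One way to avoid a new column per word is to merge words on the same line whenever the horizontal gap between them is small, and to start a new cell only at a wide gap. A minimal sketch, assuming word boxes shaped like the entries returned by `pytesseract.image_to_data` (text, left, top, width); the helper name `words_to_rows` and both thresholds are hypothetical and would need tuning per image:

```python
import csv
import io


def words_to_rows(words, row_tol=10, cell_gap=25):
    """Group word boxes into table rows, then merge nearby words into cells.

    words: iterable of dicts with keys "text", "left", "top", "width".
    row_tol: max difference in "top" for words to share a row (assumed value).
    cell_gap: horizontal gap in px that starts a new cell (assumed value).
    """
    rows = []
    for word in sorted(words, key=lambda w: w["top"]):
        # Same row if the word's top is within row_tol of the row's first word.
        if rows and abs(word["top"] - rows[-1][0]["top"]) <= row_tol:
            rows[-1].append(word)
        else:
            rows.append([word])

    table = []
    for row in rows:
        row.sort(key=lambda w: w["left"])
        cells, prev_right = [], None
        for word in row:
            if prev_right is not None and word["left"] - prev_right < cell_gap:
                cells[-1] += " " + word["text"]   # small gap: same cell
            else:
                cells.append(word["text"])        # wide gap: new cell
            prev_right = word["left"] + word["width"]
        table.append(cells)
    return table


sample = [
    {"text": "Net", "left": 10, "top": 100, "width": 30},
    {"text": "income", "left": 45, "top": 102, "width": 60},
    {"text": "-22.5", "left": 300, "top": 101, "width": 40},
    {"text": "Revenue", "left": 10, "top": 140, "width": 70},
    {"text": "310.0", "left": 300, "top": 141, "width": 40},
]
table = words_to_rows(sample)

buffer = io.StringIO()
csv.writer(buffer).writerows(table)
csv_text = buffer.getvalue()
```

A fixed gap threshold is crude; if the table has ruling lines, column boundaries derived from the detected lines would likely be more robust than a pixel gap.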
>> On Saturday, June 1, 2024 at 11:21:17 AM UTC+5:30 [email protected] 
>> wrote:
>>
>>> Try resizing the image to increase its size, using interpolation with 
>>> INTER_AREA or INTER_CUBIC; the bigger the image, the better tesseract 
>>> performs. PSM 6 is the right setting.
>>>
>>> On Saturday 1 June 2024 at 00:19:32 UTC+12 [email protected] wrote:
>>>
>>>>
>>>> In order to improve the results, I have implemented Canny edge 
>>>> detection and the Hough Line Transform on the images. Then I fed the 
>>>> binarized image to the tesseract model.
>>>>
>>>> text = pytesseract.image_to_string(cropped_frame, lang='eng', config='--psm 6 --oem 3')
>>>>
>>>> The results have improved a bit, but are still far from perfect. The 
>>>> negative signs are being omitted, and some of them are being misread 
>>>> as ~. Similarly, some decimal points are also being omitted: 22.5 was 
>>>> extracted as 225.
>>>> On Friday, May 31, 2024 at 1:07:01 PM UTC+5:30 [email protected] 
>>>> wrote:
>>>>
>>>>> It's hard to give an opinion without seeing how you set up tesseract. 
>>>>> What PSM did you specify, etc.?
>>>>>
>>>>> On Friday 31 May 2024 at 02:34:36 UTC+12 [email protected] wrote:
>>>>>
>>>>>> I have provided the image from which I am trying to extract text 
>>>>>> using tesseract ocr (input.jpeg). Along with that, I have also 
>>>>>> provided the result, i.e. the text extracted from the image. As can be 
>>>>>> observed from the images, the extracted text is not very accurate: 
>>>>>> negative signs have been omitted, and some undesired characters also 
>>>>>> appear in the extracted text. (I have marked some of the incorrect 
>>>>>> results with blue boxes.)
>>>>>>
>>>>>> I have tried to improve the results by preprocessing and by changing 
>>>>>> the parameters of the model. I have tried:
>>>>>>
>>>>>> 1. Binarizing the images
>>>>>>
>>>>>> 2. HDR processing of the images
>>>>>>
>>>>>> Even then, such inconsistencies remain.
>>>>>>
>>>>>> How can I improve the detection and extraction of text in tesseract? I 
>>>>>> have also tried paddleocr for the same task. Even then, symbols such as 
>>>>>> the euro sign and some negative signs are not being detected.
>>>>>>
>>>>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/2e1f6325-91e5-44ba-9eaa-b64e1b2a4401n%40googlegroups.com.
>>
>

