Re image size, etc., see:
- https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ --
the report / chart posted there suggests it is beneficial to rescale any
input image so that the text ends up about 30 px tall.
- https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html -- which also
links to the report above, plus it has some table-related info.
Both explain *why* resizing, etc. are often beneficial to OCR confidence
numbers & quality.
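As a minimal sketch of that rescaling advice (the 30 px target comes from the
linked chart; the measured glyph height and the OpenCV usage are assumptions
for illustration, not part of your pipeline):

```python
def ocr_scale_factor(text_height_px, target_px=30):
    """Scale factor that maps a measured glyph height to the ~30 px
    sweet spot suggested by the chart in the linked thread."""
    return target_px / float(text_height_px)

# Applied with OpenCV it would look something like:
#   import cv2
#   s = ocr_scale_factor(12)  # text measured at ~12 px tall (assumed)
#   up = cv2.resize(img, None, fx=s, fy=s,
#                   interpolation=cv2.INTER_CUBIC if s > 1 else cv2.INTER_AREA)
# (INTER_CUBIC for upscaling, INTER_AREA for downscaling, as already
# suggested earlier in this thread.)
```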

Re your last question about the first column in your reconstructed table:
https://groups.google.com/g/tesseract-ocr/c/B2-EVXPLovQ/m/lP0zQVApAAAJ --
reconstructing that first column would be part of the post-processing
phase: tesseract is book/paper/word focused, so it will only reconstruct
words from character sequences, not table cells or columns.
AFAIK the latest release doesn't have an advanced table reconstruction
module like you need. See also the end of the
https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html document for
more info / links.
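To give an idea of what that post-processing could look like: take the word
boxes you already have (e.g. from pytesseract's image_to_data) and cluster
them into rows by their top coordinate before emitting CSV rows -- words
whose tops fall within some tolerance belong to the same row, and each row
is then ordered left-to-right. The tolerance value and the (left, top, text)
tuple shape are assumptions for illustration:

```python
def rows_from_words(words, row_tol=10):
    """Group (left, top, text) word boxes into rows by similar 'top',
    then order each row left-to-right. This is the kind of coordinate
    clustering tesseract itself does not do for tables."""
    rows = []
    for w in sorted(words, key=lambda box: box[1]):  # sort by top
        if rows and abs(rows[-1][0][1] - w[1]) <= row_tol:
            rows[-1].append(w)   # same row: tops are close enough
        else:
            rows.append([w])     # new row starts here
    # sort each row by left coordinate, keep only the text
    return [[text for _, _, text in sorted(row)] for row in rows]
```

Within a row you would still need to decide which adjacent words belong to
the same cell (e.g. by the horizontal gap between boxes) -- that gap
threshold is table-specific, so experiment with your own data.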



Quick meta-question though, as I am quite surprised that people feed
financial data into any kind of (fundamentally statistical and thus
noise-injecting) OCR process -- not just tesseract, but any and all of them
out there. Wouldn't it be more business-smart to scrape financial
performance reports like these, or better yet: get a direct data export
from the SAP software at that company, so that you forego the entire
machine-render-text-to-image + image-to-text OCR process -- risky and
costly -- altogether?
That financial performance stuff is usually reported in PDF/A format for
obvious reasons (chamber of commerce, stock exchange, investors -- the
kinds of folks who all like their data as *virginal* as can be), and when
you grab that output you're one straight text extract away from success,
instead of wrangling a risky OCR process chain which, by definition,
cannot deliver a 100% accurate reconstruction all the time.
As this clearly is corporate financial data you're processing (so we can
safely assume the reported data will be fed into follow-up processes where
the actual numbers are of some import), I would expect nobody involved to
appreciate the implicit risk introduced by injecting an inherently noisy
statistical filter into the number-crunching process: it opens you up to
the forever clear and present risk of random number-value inaccuracies,
thanks to the nature of any neural net's output.

You're certainly not the only one attempting to apply OCR to financial data
around here (the mailing list is brimming with it), but when I see annual /
quarterly corporate performance reports being processed like that, I start
to worry a wee bit more than usual. Not about tesseract (it does its job
just fine), but about whoever came up with the idea to plonk such data into
an image file and feed it to any kind of OCR machinery. That sounds like an
already-failed due diligence exercise to me, where the question should have
been asked: can we get this data in some text format straight from the
source, since it is company- and machine-produced already? txt, csv, pdf,
excel, anything? At what cost?
Or can't you get the text data (why not?! if you can get the page images,
it's published material, correct?), and do you intend to use tesseract /
your OCR process as an *assistive process*, where the OCR output is
reviewed / vetted by a human before it is deemed of sufficient quality for
further use?






Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   [email protected]
mobile: +31-6-11 120 978
--------------------------------------------------


On Mon, Jun 3, 2024 at 3:51 PM Saanvi Bhagat <[email protected]>
wrote:

> Thank you so much for your help!! Using interpolation improved my results
> to a great extent. I would like one more suggestion from you. I have
> extracted the text from the table in the image. Now I am trying to save it
> in a CSV. For that, I am using the coordinates of the detected text and
> reconstructing the table structure.
> I am providing the input image and a screenshot of the resultant output
> in the CSV file. As can be seen in the output_in_csv image, the facts
> and figures are being saved correctly; however, the first column is
> garbled: a new column is being generated for each word. That might be
> because tesseract detects the text word by word and hence creates a new
> column for each word. Could you please suggest a way to optimize my
> results? (mainly the first column)
> The main issues are repetition in the column values and a new column
> being created for each word rather than just one column.
> On Saturday, June 1, 2024 at 11:21:17 AM UTC+5:30 [email protected]
> wrote:
>
>> Try resizing the image to increase its size, using interpolation with
>> INTER_AREA or INTER_CUBIC -- the bigger the image, the better tesseract
>> performs. PSM 6 is the right setting.
>>
>> On Saturday 1 June 2024 at 00:19:32 UTC+12 [email protected] wrote:
>>
>>>
>>> In order to improve the results, I have implemented canny edge detection
>>> and Hough Lines Transform on the images. Then I fed the binarized image to
>>> the tesseract model.
>>>
>>> text = pytesseract.image_to_string(cropped_frame, lang='eng',
>>>                                    config='--psm 6 --oem 3')
>>> The results have improved a bit, but are still far from perfect. The
>>> negative symbols are being omitted, and some of them are being misread
>>> as ~. Similarly, some decimal points are also being omitted: 22.5 was
>>> extracted as 225.
>>> On Friday, May 31, 2024 at 1:07:01 PM UTC+5:30 [email protected] wrote:
>>>
>>>> It's hard to give an opinion without seeing how you set up tesseract
>>>> -- what PSM did you specify, etc.?
>>>>
>>>> On Friday 31 May 2024 at 02:34:36 UTC+12 [email protected] wrote:
>>>>
>>>>> I have provided the image from which I am trying to extract text
>>>>> using tesseract OCR (input.jpeg). Along with that, I have also
>>>>> provided the extracted text from the image. As can be observed from
>>>>> the images, the extracted text is not very accurate: negative
>>>>> symbols have been omitted, and some undesired characters appear in
>>>>> the extracted text. (I have marked some of the incorrect results
>>>>> with blue boxes)
>>>>>
>>>>> I have tried to improve the results by preprocessing and bringing
>>>>> changes in the parameters of the model. I have tried:
>>>>>
>>>>> 1. Binarizing the images
>>>>>
>>>>> 2. HDR processing of the images
>>>>>
>>>>> Even then, such inconsistencies remain.
>>>>>
>>>>> How can I improve the detection and extraction of text in
>>>>> tesseract? I have also tried paddleocr for the same task; even then,
>>>>> symbols such as the euro sign and some negative signs are not being
>>>>> detected.
>>>>>
>>>> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/2e1f6325-91e5-44ba-9eaa-b64e1b2a4401n%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/2e1f6325-91e5-44ba-9eaa-b64e1b2a4401n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
