You might consider using the ruling lines to identify the columns in the table 
with OpenCV. There is an example here [1] of removing lines, but the same 
approach can also be used to find the line coordinates. With the coordinates, 
you could then extract the columns of numbers and work through them from 
there. Your sample image is challenging, but my sense is that Tesseract could do 
a lot if you can segment the table into individual numbers and make use of 
Tesseract's accuracy metrics (its per-word confidence values). You would 
probably want a lot of very consistent layouts to justify the effort.

Best,

art
---
1. https://stackoverflow.com/questions/60521925/how-to-detect-the-horizontal-and-vertical-lines-of-a-table-and-eliminate-the-noi

From: [email protected] <[email protected]> On Behalf 
Of Sean Pham
Sent: Thursday, October 3, 2024 1:17 PM
To: tesseract-ocr <[email protected]>
Subject: [tesseract-ocr] Table Extraction using Tesseract

Hello,

I am trying to use Tesseract OCR to read archaic climate data in the form of a 
table.  I am new to this technology and would appreciate any guidance.

What I have tried so far:
I am using packages like img2table to extract a table structure and then using 
Tesseract to identify text.  There are various issues with both processes and 
the result is not very accurate.

Questions
1) Is this solution using Tesseract feasible?
2) Are there any processes or technologies that may be beneficial for this use case?

Any insight / advice would be greatly appreciated!
--
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/09b8e0d0-cea5-4c86-95cf-f6b8e95fd0d4n%40googlegroups.com.

