In fact, I should say: extract the whole left part first and see how you get on.
If you continue to find issues, it's probably the borders of your table
interfering by being too close to the text, which is why I'm saying that
clearing the table away entirely might help.
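One way to picture "clearing the table away" is to whiten any long, straight run of dark pixels, since table borders are much longer than the strokes of a letter. The sketch below is only an illustration under assumptions that are not in the thread: the image is already binarized and held as a list of rows (0 is ink, 255 is background), and `remove_borders` and `min_run` are hypothetical names. A real pipeline would more likely do this with an imaging library.

```python
def _long_runs(values, min_run):
    """Yield (start, end) for each run of ink (0) at least min_run long."""
    start = None
    # The trailing 255 is a sentinel so a run touching the edge is closed.
    for i, v in enumerate(list(values) + [255]):
        if v == 0 and start is None:
            start = i
        elif v != 0 and start is not None:
            if i - start >= min_run:
                yield start, i
            start = None


def remove_borders(img, min_run):
    """Return a copy of img with long horizontal and vertical lines whitened.

    img: list of rows of pixel values, 0 = ink, 255 = background (assumed).
    min_run: shortest run, in pixels, that is treated as a table border.
    """
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]
    # Horizontal borders: scan each row of the original image.
    for y in range(height):
        for x0, x1 in _long_runs(img[y], min_run):
            for x in range(x0, x1):
                out[y][x] = 255
    # Vertical borders: scan each column of the original image.
    for x in range(width):
        column = [img[y][x] for y in range(height)]
        for y0, y1 in _long_runs(column, min_run):
            for y in range(y0, y1):
                out[y][x] = 255
    return out
```

`min_run` has to be longer than the longest stroke in the text and shorter than the shortest border, so it would need tuning per document.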
You might try to just remove the
Yes, you should think of it in those terms. That would remove the noise you
are seeing on the right-hand side of your result, as Tesseract likes to turn
shapes into text if it can ;)
Even if you add new rows on the left side, the x,y top-corner intervals
are still consistent enough to just keep g
I could; the only issue is that, depending on the number of people scheduled, the
box can grow, which would change all the x,y coords...
What can easily be done is to narrow down the scope of the OCR by only
getting the horizontal table part and omitting the rest. I'm guessing that
might also help?
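Narrowing the scope could look something like the sketch below: find the rows that contain any ink, then crop to that band before handing the image to the OCR step. This is an assumption-heavy illustration, not anything posted in the thread: the image is a grayscale list of rows, and `ink_row_range`, `crop` and `threshold` are hypothetical names. The box order (left, top, right, bottom) follows the same convention Pillow's `Image.crop` uses.

```python
def ink_row_range(img, threshold=128):
    """Return (top, bottom) covering every row that contains a dark pixel.

    bottom is exclusive. Returns None when the image has no ink at all.
    threshold is an assumed cut-off between ink and background.
    """
    rows = [y for y, row in enumerate(img) if any(p < threshold for p in row)]
    if not rows:
        return None
    return rows[0], rows[-1] + 1


def crop(img, box):
    """Cut box = (left, top, right, bottom) out of img; right/bottom exclusive."""
    left, top, right, bottom = box
    return [row[left:right] for row in img[top:bottom]]
```

Because the band is measured from the pixels rather than hard-coded, it keeps working when the box grows with the number of people scheduled, as long as nothing else on the page lands in the same rows.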
In my opinion, given you have a very fixed layout/template, this gives you
more control over how you perform the OCR. Rather than give Tesseract the
entire spreadsheet here, why not program a preprocessing stage where you
extract the text you want out cleanly into a new image (given you know all
(X,
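A preprocessing stage of that kind might be sketched as follows: compute one box per row from a known top-left corner and a fixed row height, cut each box out, and stack the pieces on a clean white canvas with a margin so no border touches the text. Everything named here (`row_boxes`, `compose`, the parameters, the list-of-rows image) is hypothetical and only meant to show the idea; the actual coordinates would come from the template.

```python
def row_boxes(x, y, width, row_height, n_rows):
    """Boxes (left, top, right, bottom) for n_rows rows at a fixed interval."""
    return [
        (x, y + i * row_height, x + width, y + (i + 1) * row_height)
        for i in range(n_rows)
    ]


def compose(img, boxes, margin=2, background=255):
    """Stack the boxed regions of img vertically on a fresh background.

    A margin of background pixels is placed around every region so that
    nothing from the original table borders ends up next to the text.
    """
    canvas_width = max(right - left for left, _, right, _ in boxes) + 2 * margin
    blank = [background] * canvas_width
    canvas = []
    for left, top, right, bottom in boxes:
        canvas.extend(blank[:] for _ in range(margin))
        for row in img[top:bottom]:
            cell = row[left:right]
            pad = canvas_width - margin - len(cell)
            canvas.append([background] * margin + cell + [background] * pad)
    canvas.extend(blank[:] for _ in range(margin))
    return canvas
```

For example, `compose(img, row_boxes(1, 0, 2, 2, 2), margin=1)` would cut two 2x2 regions out of `img` and return a 7-row, 4-column image that could then be passed to the OCR engine instead of the full page.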
Hey everyone,
I've got this PDF document which is a schedule. I'm trying to extract the
text from it via Tesseract but I'm not getting very good results.
I've tried a lot of different things. In my inexperienced opinion, the image
seems very high quality as I can zoom in a lot without seeing pixe