Re: [tesseract-ocr] Help OCR'in an image

2016-07-12 Thread Allistair
In fact I should say, extract the whole left bit first and see how you get on. If you continue to find issues, it's probably the borders of your table interfering by being too close to the text, which is why I am saying that clearing the table away entirely might help. You might try to just remove the
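
One way to sketch the "clear the table away" idea, assuming the crop has already been loaded as a grayscale grid of pixel values (0 = black, 255 = white) and that a border shows up as a row or column that is dark along almost its whole length. The function name `erase_ruling_lines` and both thresholds are hypothetical, and this simple fraction test is a stand-in for whatever line-removal method was meant in the thread:

```python
from typing import List

Pixels = List[List[int]]  # grayscale rows: 0 = black, 255 = white


def erase_ruling_lines(image: Pixels, dark: int = 128,
                       min_fraction: float = 0.8, white: int = 255) -> Pixels:
    """Return a copy with near-full-length dark rows and columns painted white.

    A row (or column) counts as a table border when at least `min_fraction`
    of its pixels are darker than `dark`. Text strokes are much shorter than
    the crop, so they stay below the threshold and are left alone.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [list(row) for row in image]  # do not modify the input

    # Decide which rows/columns are borders from the ORIGINAL pixels only.
    border_rows = [y for y in range(height)
                   if sum(p < dark for p in image[y]) >= min_fraction * width]
    border_cols = [x for x in range(width)
                   if sum(image[y][x] < dark for y in range(height))
                   >= min_fraction * height]

    for y in border_rows:
        cleaned[y] = [white] * width
    for x in border_cols:
        for y in range(height):
            cleaned[y][x] = white
    return cleaned
```

This only works on a crop where the borders really do span most of the width or height, for example a single cell or a single column; on the full page, short border segments would fall under the threshold.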

Re: [tesseract-ocr] Help OCR'in an image

2016-07-12 Thread Allistair
Yes, you should think of it in those terms. That would remove the noise you are seeing on the right-hand side of your result, as Tesseract likes to turn shapes into text if it can ;) Even if you add new rows into the left side, the x,y top corner intervals are still consistent enough to just keep g
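
As a rough illustration of using consistent intervals, the crop box for each row can be computed from the first row's top-left corner and a fixed row height, so extra rows only change the row count. This is a sketch that assumes uniform row heights; `row_boxes` and all the numbers in the usage note are made up for the example:

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def row_boxes(left: int, top: int, width: int,
              row_height: int, n_rows: int) -> List[Box]:
    """Build one crop box per table row, stepping down by a fixed row height."""
    return [
        (left, top + i * row_height, left + width, top + (i + 1) * row_height)
        for i in range(n_rows)
    ]
```

For example, `row_boxes(10, 20, 100, 30, 3)` gives three stacked boxes starting at y = 20, 50 and 80. If the schedule grows, only `n_rows` needs to change.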

Re: [tesseract-ocr] Help OCR'in an image

2016-07-12 Thread Raphael Budd
I could; the only issue is that, based on the number of people scheduled, the box can grow, which would change all the x,y coords. What can easily be done is to narrow down the scope of the OCR by only getting the horizontal table part and omitting the rest. I'm guessing that might also help?

Re: [tesseract-ocr] Help OCR'in an image

2016-07-12 Thread Allistair
In my opinion, given you have a very fixed layout/template, you have more control over how you perform the OCR. Rather than give Tesseract the entire spreadsheet here, why not program a preprocessing stage where you extract the text you want out cleanly into a new image (given you know all (X,
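
A minimal sketch of such a preprocessing stage, assuming the page is already available as a grayscale grid of pixel values and that the coordinates of each wanted region are known in advance. The names `crop` and `extract_regions` and the region labels are hypothetical; the OCR call itself is left out:

```python
from typing import Dict, List, Tuple

Pixels = List[List[int]]         # grayscale rows: 0 = black, 255 = white
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def crop(image: Pixels, box: Box) -> Pixels:
    """Copy the pixels inside `box` into a new, smaller image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def extract_regions(image: Pixels, regions: Dict[str, Box]) -> Dict[str, Pixels]:
    """Cut every named region out of the page so each can be OCR'd on its own."""
    return {name: crop(image, box) for name, box in regions.items()}
```

With Pillow, `Image.crop` accepts the same (left, upper, right, lower) tuple, so the region table carries over unchanged. Each cropped image can then be handed to Tesseract separately instead of the whole spreadsheet.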

[tesseract-ocr] Help OCR'in an image

2016-07-11 Thread Raphael Budd
Hey everyone, I've got this PDF document which is a schedule. I'm trying to extract the text from it via Tesseract but I'm not getting very good results. I've tried a lot of different things. In my inexperienced opinion the image seems very high quality, as I can zoom in a lot without seeing pixe