Yes, you should think of it in those terms. Restricting the OCR to the table
part would remove the noise you are seeing on the right-hand side of your
result, as Tesseract likes to turn shapes into text if it can ;)

Even if new rows are added on the left side, the intervals between the
top-left (x, y) corners are still consistent enough that you can just keep
going down the image creating rectangles of input. At some point those
rectangles will be entirely white: you can easily check whether a rectangle
contains only white pixels or anything non-white to detect where the rows end.
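
A minimal sketch of that white-rectangle check in Java, assuming the page is
already loaded as a BufferedImage. The names BlankCheck and isBlank and the
threshold of 245 are made up for illustration; the threshold will probably
need tuning for anti-aliased or scanned input.

```java
import java.awt.image.BufferedImage;

final class BlankCheck {
    // Assumed cut-off: a pixel counts as white if all channels are >= this.
    static final int WHITE_THRESHOLD = 245;

    /** Returns true if every pixel inside the rectangle is (near) white. */
    static boolean isBlank(BufferedImage img, int x, int y, int w, int h) {
        // Clamp the rectangle to the image so getRGB never goes out of bounds.
        int xEnd = Math.min(x + w, img.getWidth());
        int yEnd = Math.min(y + h, img.getHeight());
        for (int py = Math.max(y, 0); py < yEnd; py++) {
            for (int px = Math.max(x, 0); px < xEnd; px++) {
                int rgb = img.getRGB(px, py);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                if (r < WHITE_THRESHOLD || g < WHITE_THRESHOLD || b < WHITE_THRESHOLD) {
                    return false; // found ink, so this rectangle still holds a row
                }
            }
        }
        return true;
    }
}
```

Walking down the image one row-height at a time and stopping at the first
rectangle for which isBlank returns true would give you the end of the rows.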

The only thing I can see disrupting your template is the title
"Managers" - if you have variants where there could be zero or many of
these titles for different sections, then your x,y finding method will need
to be more complex. However, these sections look easy to spot, as you have
a chunk of white space between the bottommost border of the upper section
and the bold black header area of the next box.

   1. Extract left hand portion of the image with the boxes
   2. Identify a pixel column that provides structural table information
   (not where text would be encountered) - you have plenty of these due to the
   layout
   3. Apply logic to find section headers (pixelN and pixelN+1 are black)
   4. Apply logic to find rows (pixelN == grey)
   5. Find your rectangles of text based on fixed column widths and the
   previous row-finding logic

Something like that :)
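
Steps 2 to 5 could be sketched in Java roughly as below. This is only an
illustration under assumptions: the class TableScanner, its method names and
the luminance thresholds are invented here, the grey row separators are
assumed to be one pixel thick, and the column offsets and widths have to be
measured from your own template.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

final class TableScanner {
    // Assumed thresholds on average channel value (0 = black, 255 = white).
    static final int BLACK_MAX = 60;
    static final int WHITE_MIN = 230;

    static int luminance(int rgb) {
        return (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
    }

    static boolean isBlack(int rgb) { return luminance(rgb) <= BLACK_MAX; }

    static boolean isGrey(int rgb) {
        int l = luminance(rgb);
        return l > BLACK_MAX && l < WHITE_MIN;
    }

    /** Step 3: y where pixel y and y+1 in column x are black, once per black run. */
    static List<Integer> findSectionHeaders(BufferedImage img, int x) {
        List<Integer> tops = new ArrayList<>();
        for (int y = 0; y < img.getHeight() - 1; y++) {
            boolean startsRun = y == 0 || !isBlack(img.getRGB(x, y - 1));
            if (startsRun && isBlack(img.getRGB(x, y)) && isBlack(img.getRGB(x, y + 1))) {
                tops.add(y);
            }
        }
        return tops;
    }

    /** Step 4: y of each grey separator line in column x, once per grey run. */
    static List<Integer> findRowLines(BufferedImage img, int x) {
        List<Integer> lines = new ArrayList<>();
        for (int y = 0; y < img.getHeight(); y++) {
            boolean startsRun = y == 0 || !isGrey(img.getRGB(x, y - 1));
            if (startsRun && isGrey(img.getRGB(x, y))) {
                lines.add(y);
            }
        }
        return lines;
    }

    /** Step 5: one rectangle per column between each pair of consecutive row lines. */
    static List<Rectangle> cellRectangles(List<Integer> rowLines, int[] colX, int[] colWidth) {
        if (colX.length != colWidth.length) {
            throw new IllegalArgumentException("colX and colWidth must have the same length");
        }
        List<Rectangle> cells = new ArrayList<>();
        for (int i = 0; i + 1 < rowLines.size(); i++) {
            int top = rowLines.get(i) + 1;          // skip the separator itself
            int height = rowLines.get(i + 1) - top; // up to, not including, the next line
            if (height <= 0) continue;
            for (int c = 0; c < colX.length; c++) {
                cells.add(new Rectangle(colX[c], top, colWidth[c], height));
            }
        }
        return cells;
    }
}
```

Each resulting rectangle r can then be cropped with
img.getSubimage(r.x, r.y, r.width, r.height) and handed to Tesseract on its
own, or pasted into a clean new image first.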


[image: Inline images 1]

On 12 July 2016 at 13:41, Raphael Budd <woderpi...@gmail.com> wrote:

> I could; the only issue is that, depending on the number of people
> scheduled, the box can grow, which would change all the x,y coords...
>
> What can easily be done is to narrow down the scope of the OCR by only
> taking the horizontal table part and omitting the rest. I'm guessing that
> might also help?
>
>
> Thanks for the help by the way!
>
> On Tuesday, July 12, 2016 at 5:14:01 AM UTC-4, Allistair C wrote:
>>
>> In my opinion, your very fixed layout/template gives you more control
>> over how you perform the OCR. Rather than giving Tesseract the entire
>> spreadsheet, why not program a preprocessing stage where you extract the
>> text you want cleanly into a new image, given that you know all the
>> (X, Y, WIDTH, HEIGHT) rectangle locations for such an input image?
>>
>> On 11 July 2016 at 22:00, Raphael Budd <woder...@gmail.com> wrote:
>>
>>> Hey everyone,
>>>
>>> I've got this PDF document which is a schedule. I'm trying to extract
>>> the text from it via Tesseract but I'm not getting very good results.
>>>
>>> I've tried a lot of different things. In my inexperienced opinion the
>>> image seems very high quality, as I can zoom in a lot without seeing pixels.
>>> I've also tried converting the PDF to TIFF and adding a grayscale filter
>>> (all via Java).
>>>
>>> I've attached both the end result and the original PDF here, along with a
>>> sample of the output. Any help making the output better would be
>>> appreciated.
>>>
>>> The TIFF file is too big for the attachment; see this link:
>>> http://wltd.org/Daily%20schedule-14.tiff
>>>
>>> ---Begin text---
>>> 008 KIERA MCG 3:00 PM 11:00 PM TRWN 8.00 —
>>> 718 KYLE s 11:00 PM 7:00 AM MT 8.00 < —
>>> 686 JOSEPH e 11:00 PM 5:00 AM MT 6.00 — >
>>> 718 KYLE s 11:00 PM 7:00 AM MT 8.00 — >
>>> 656 CHANDLER A 1:00 PM 4:00 PM MB 3.00 —
>>> 720 TYLER D 11:00 PM 7:00 AM T|_ F 8.00 < —
>>> 720 TYLER D 11:00 PM 7:00 AM T|_ F 8.00 — >
>>> 052 SH ELLY L 5:30 AM 2:00 PM FLRIFFIMGR F 8.50 _:I
>>> Riley M 372 8:00 AM 4:00 PM FLR F 8.00 —
>>> ‘ Raphael B602 4:00 PM 12:00 AM FLRIMGR F 8.00 ‘ —:| I
>>> ‘ Kevin G 652 11:00 AM 7:00 PM g$Y$IWNIMNY$I F 8.00 ‘ I:-:| I
>>> Joseph C 191 8:00 AM 4:00 PM ADMIBKIMB F 8.00 -:—
>>> 2014 ROXANA T 11:00 AM 7:00 PM ADM F 8.00 _
>>>
>>> --END TEXT---
>>>
>>> As you can see, Tesseract becomes quite creative in its attempt at
>>> parsing this. Earlier in the document it even parsed the letter "N" as
>>> "|\|": creative, but useless for parsing!
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To post to this group, send email to tesser...@googlegroups.com.
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/f77f8dd8-f6d2-4f6b-b5fe-5510fac4f878%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/f77f8dd8-f6d2-4f6b-b5fe-5510fac4f878%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
