I'm afraid that's about the limit of what I can suggest - there are a great
many "engine settings" that can be tweaked to alter the OCR, but they are
not very well documented. Perhaps someone more familiar with these kinds of
mistakes can help. Did the scaling fix the M issue, even though it caused
the new one?
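
For what it's worth, two settings I do know of are the page segmentation
mode and the tessedit_char_whitelist variable, both of which can be passed
on the command line. A rough sketch in Java of building such a command; the
class name, file names and whitelist are made up for illustration, the
"-psm" spelling is the 3.x one (newer releases use "--psm"), and I have not
checked whether either setting actually helps with your sheet:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: only assembles the argument list for the tesseract CLI.
final class TesseractCommand {

    static List<String> build(String imagePath, String outputBase, String whitelist) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(imagePath);
        cmd.add(outputBase);
        // Page segmentation mode 6: assume a single uniform block of text.
        // Assumption: Tesseract 3.x flag spelling; 4.x and later use "--psm".
        cmd.add("-psm");
        cmd.add("6");
        // Restrict recognition to the characters expected on the sheet.
        cmd.add("-c");
        cmd.add("tessedit_char_whitelist=" + whitelist);
        return cmd;
    }

    public static void main(String[] args) {
        String allowed = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz:.";
        List<String> cmd = build("schedule.tiff", "schedule", allowed);
        System.out.println(String.join(" ", cmd));
        // To actually run it (needs tesseract on the PATH):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

A whitelist would at least rule out output such as "|\|" for "N", though the
engine may then substitute some other allowed character instead.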

In my opinion OCR is not, and should never be considered, perfect or
reliable, and today it generally needs a helping hand - you might be
expecting too much :)
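
As one example of a helping hand, the "11 :00" problem could be repaired
after the OCR step rather than inside it. A minimal sketch in Java, assuming
times always look like one or two digits, a colon, and two digits; the class
and method names are hypothetical:

```java
import java.util.regex.Pattern;

// Hypothetical post-processing step applied to each line of OCR output.
final class OcrTimeFixer {

    // One or two digits, optional stray whitespace around the colon, two digits.
    private static final Pattern SPLIT_TIME =
            Pattern.compile("\\b(\\d{1,2})\\s*:\\s*(\\d{2})\\b");

    // Rejoins times that the OCR split apart, e.g. "11 :00" becomes "11:00".
    static String fixTimes(String line) {
        return SPLIT_TIME.matcher(line).replaceAll("$1:$2");
    }

    public static void main(String[] args) {
        System.out.println(fixTimes("718 KYLE s 11 :00 PM 7:00 AM MT 8.00"));
    }
}
```

Times that were already read correctly pass through unchanged, so it should
be safe to run this over every line before parsing.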

One more suggestion - if you know the font being used for your sheet you
could do some dedicated training to generate a training file for Tesseract.

Cheers

On 14 July 2016 at 14:48, Raphael Budd <woderpi...@gmail.com> wrote:

> So I have added the scaling, and with the scaling it makes the mistake of
> somehow interpreting "11:00" as "11 :00", something my program doesn't take
> too kindly to.
>
> I'm not sure what else I can do to make it work as I feel I'm already
> spoon feeding the text to it, unless there is noise on the image or
> something I'm not aware of?
> Thanks
>
> On Tuesday, July 12, 2016 at 2:10:27 AM UTC-4, Raphael Budd wrote:
>
>> Hey everyone,
>>
>> I've got this PDF document which is a schedule. I'm trying to extract the
>> text from it via Tesseract but I'm not getting very good results.
>>
>> I've tried a lot of different things. In my inexperienced opinion the
>> image seems very high quality, as I can zoom in a lot without seeing
>> pixels. I've also tried converting the PDF to TIFF and adding a grayscale
>> filter (all via Java).
>>
>> I've attached both the end result and the original PDF here, along with a
>> sample of the output. Any help making the output better would be
>> appreciated.
>>
>> The TIFF file is too big to attach; see this link:
>> http://wltd.org/Daily%20schedule-14.tiff
>>
>> ---Begin text---
>> 008 KIERA MCG 3:00 PM 11:00 PM TRWN 8.00 —
>> 718 KYLE s 11:00 PM 7:00 AM MT 8.00 < —
>> 686 JOSEPH e 11:00 PM 5:00 AM MT 6.00 — >
>> 718 KYLE s 11:00 PM 7:00 AM MT 8.00 — >
>> 656 CHANDLER A 1:00 PM 4:00 PM MB 3.00 —
>> 720 TYLER D 11:00 PM 7:00 AM T|_ F 8.00 < —
>> 720 TYLER D 11:00 PM 7:00 AM T|_ F 8.00 — >
>> 052 SH ELLY L 5:30 AM 2:00 PM FLRIFFIMGR F 8.50 _:I
>> Riley M 372 8:00 AM 4:00 PM FLR F 8.00 —
>> ‘ Raphael B602 4:00 PM 12:00 AM FLRIMGR F 8.00 ‘ —:| I
>> ‘ Kevin G 652 11:00 AM 7:00 PM g$Y$IWNIMNY$I F 8.00 ‘ I:-:| I
>> Joseph C 191 8:00 AM 4:00 PM ADMIBKIMB F 8.00 -:—
>> 2014 ROXANA T 11:00 AM 7:00 PM ADM F 8.00 _
>>
>> --END TEXT---
>>
>> As you can see, Tesseract becomes quite creative in its attempt at
>> parsing this. Earlier in the document it even parsed the letter "N" as
>> "|\|", creative but useless for parsing!
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/86f2ff29-e666-4136-8fc1-43ef6a509e75%40googlegroups.com.
>
> For more options, visit https://groups.google.com/d/optout.
>

