Alright, so I did a bunch of testing and, weirdly enough, running tesseract 
from the console gives 100% accuracy on my preprocessed image, just not when 
I call it through the API in Java. I now suspect an old version of tesseract 
is screwing things up. If that's the case, hopefully there is a more 
up-to-date version of tess4j; otherwise I'll have to run tesseract from Java 
through cmd, which is possible but a pain in the ****.
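
For reference, here is a rough sketch of what the cmd route could look like 
from Java, using ProcessBuilder to run the same console command. It assumes 
the tesseract binary is on the PATH and is called as 
"tesseract <image> <outputbase> -l <lang>", which writes <outputbase>.txt. 
The class and method names (TesseractCli, buildCommand, recognize) are made 
up for illustration and are not part of tess4j; I have not run this against 
the schedule file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: shells out to the tesseract console program
// instead of going through the tess4j API.
final class TesseractCli {

    // Builds the argument list: tesseract <image> <outputBase> -l <lang>
    static List<String> buildCommand(Path image, Path outputBase, String lang) {
        return Arrays.asList(
                "tesseract", image.toString(), outputBase.toString(), "-l", lang);
    }

    // Runs tesseract and returns the recognized text.
    // Assumes the "tesseract" executable can be found on the PATH.
    static String recognize(Path image, Path outputBase, String lang)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, outputBase, lang))
                .inheritIO() // pass tesseract's own messages through to our console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        // tesseract appends ".txt" to the output base name
        Path textFile = outputBase.resolveSibling(outputBase.getFileName() + ".txt");
        return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
    }
}
```

Usage would be something like 
TesseractCli.recognize(Paths.get("schedule.tiff"), Paths.get("schedule"), "eng"), 
where the file names are placeholders.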

On Tuesday, July 12, 2016 at 2:10:27 AM UTC-4, Raphael Budd wrote:
>
> Hey everyone,
>
> I've got this pdf document which is a schedule. I'm trying to extract the 
> text from it via tesseract, but I'm not getting very good results.
>
> I've tried a lot of different things. In my inexperienced opinion the 
> image seems very high quality, as I can zoom in a lot without seeing pixels. 
> I've also tried converting the pdf->tiff and adding a grayscale filter (all 
> via java).
>
> I've attached both the end result and the original pdf here, along with a 
> sample of the output. Any help making the output better would be 
> appreciated. 
>
> The tiff file is too big for the attachment; see this link: 
> http://wltd.org/Daily%20schedule-14.tiff
>
> ---Begin text---
> 008 KIERA MCG 3:00 PM 11:00 PM TRWN 8.00 —
> 718 KYLE s 11:00 PM 7:00 AM MT 8.00 < —
> 686 JOSEPH e 11:00 PM 5:00 AM MT 6.00 — >
> 718 KYLE s 11:00 PM 7:00 AM MT 8.00 — >
> 656 CHANDLER A 1:00 PM 4:00 PM MB 3.00 —
> 720 TYLER D 11:00 PM 7:00 AM T|_ F 8.00 < —
> 720 TYLER D 11:00 PM 7:00 AM T|_ F 8.00 — >
> 052 SH ELLY L 5:30 AM 2:00 PM FLRIFFIMGR F 8.50 _:I
> Riley M 372 8:00 AM 4:00 PM FLR F 8.00 —
> ‘ Raphael B602 4:00 PM 12:00 AM FLRIMGR F 8.00 ‘ —:| I
> ‘ Kevin G 652 11:00 AM 7:00 PM g$Y$IWNIMNY$I F 8.00 ‘ I:-:| I
> Joseph C 191 8:00 AM 4:00 PM ADMIBKIMB F 8.00 -:—
> 2014 ROXANA T 11:00 AM 7:00 PM ADM F 8.00 _
>
> --END TEXT---
>
> As you can see, tesseract becomes quite creative in its attempt at 
> parsing this. Earlier in the document it even parsed the letter "N" as 
> "|\|": creative, but useless for parsing!
>
