Did you find out the versions being used?

Tess4J changelog suggests:

Recompile Tesseract 3.04.01 DLL against Leptonica 1.73

How does that compare with the version your CLI reports?
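If it helps, here is a rough sketch for grabbing the CLI's version banner from Java so you can put it next to the changelog entry above. It assumes a `tesseract` executable is on your PATH; the class and method names (`VersionCheck`, `versionCommand`, `run`) are made up for illustration and are not part of Tess4J.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class VersionCheck {

    // Builds the command that asks the CLI for its version banner.
    // Assumption: the executable accepts "--version".
    public static List<String> versionCommand(String exe) {
        return Arrays.asList(exe, "--version");
    }

    // Runs a command and returns everything it printed.
    // stderr is merged into stdout because some builds print the banner there.
    public static String run(List<String> command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        p.waitFor();
        return out.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            System.out.print(run(versionCommand("tesseract")));
        } catch (IOException e) {
            // Typically means the executable was not found on the PATH.
            System.out.println("Could not start tesseract: " + e.getMessage());
        }
    }
}
```

The banner normally lists the Leptonica version as well, which is the other half of what the changelog line talks about.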

Is any config file or option being injected anywhere? Are you passing the
same page segmentation mode (psm) in both cases, or leaving it on automatic?
(I would recommend choosing one and setting it explicitly.)

Cheers

On 15 July 2016 at 06:54, Raphael Budd <woderpi...@gmail.com> wrote:

> Update: I really don't understand the difference between these two
> installs. I am using the absolute latest version of Tess4J and it just does
> not work, whereas literally the SAME IMAGE works with the tesseract command
> line. 100% confirmed that this is the behaviour every time: I can set up the
> Java app and the console at the same time on the same image, and the console
> gets it completely right while Tess4J screws it up. I should add that they
> are using the same tessdata, as I copy-pasted it over.
>
> Anyway, I've had a long night of messing with this and that's enough for my
> poor soul.
>
> (thanks for helping me through this, by the way. Getting closer and closer
> to the goal)
>
>
> On Friday, July 15, 2016 at 12:48:21 AM UTC-4, Raphael Budd wrote:
>>
>> Alright, so I did a bunch of testing and found, weirdly enough, that
>> running tesseract via the console produces 100% accuracy with my
>> preprocessing, just not when I do it via an API call in Java. I now suspect
>> an old version of tesseract is screwing things up. If that is the case,
>> hopefully there is a more up-to-date version of Tess4J, or else it is going
>> to be really painful to do this from Java through cmd (possible, but a pain
>> in the ****).
>>
>> On Tuesday, July 12, 2016 at 2:10:27 AM UTC-4, Raphael Budd wrote:
>>>
>>> Hey everyone,
>>>
>>> I've got this PDF document which is a schedule. I'm trying to extract
>>> the text from it via tesseract, but I'm not getting very good results.
>>>
>>> I've tried a lot of different things. In my inexperienced opinion the
>>> image seems very high quality, as I can zoom in a lot without seeing pixels.
>>> I've also tried converting the PDF to TIFF and adding a grayscale filter
>>> (all via Java).
>>>
>>> I've attached both the end result and the original PDF here, along with a
>>> sample of the output. Any help making the output better would be
>>> appreciated.
>>>
>>> The TIFF file is too big to attach; see this link:
>>> http://wltd.org/Daily%20schedule-14.tiff
>>>
>>> ---Begin text---
>>> 008 KIERA MCG 3:00 PM 11:00 PM TRWN 8.00 —
>>> 718 KYLE s 11:00 PM 7:00 AM MT 8.00 < —
>>> 686 JOSEPH e 11:00 PM 5:00 AM MT 6.00 — >
>>> 718 KYLE s 11:00 PM 7:00 AM MT 8.00 — >
>>> 656 CHANDLER A 1:00 PM 4:00 PM MB 3.00 —
>>> 720 TYLER D 11:00 PM 7:00 AM T|_ F 8.00 < —
>>> 720 TYLER D 11:00 PM 7:00 AM T|_ F 8.00 — >
>>> 052 SH ELLY L 5:30 AM 2:00 PM FLRIFFIMGR F 8.50 _:I
>>> Riley M 372 8:00 AM 4:00 PM FLR F 8.00 —
>>> ‘ Raphael B602 4:00 PM 12:00 AM FLRIMGR F 8.00 ‘ —:| I
>>> ‘ Kevin G 652 11:00 AM 7:00 PM g$Y$IWNIMNY$I F 8.00 ‘ I:-:| I
>>> Joseph C 191 8:00 AM 4:00 PM ADMIBKIMB F 8.00 -:—
>>> 2014 ROXANA T 11:00 AM 7:00 PM ADM F 8.00 _
>>>
>>> --END TEXT---
>>>
>>> As you can see, tesseract becomes quite creative in its attempts at
>>> parsing this. Earlier in the document it even read the letter "N" as
>>> "|\|": creative, but useless for parsing!
>>>
>> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/9019638c-8c74-44f5-b887-3430a0f63d4a%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/9019638c-8c74-44f5-b887-3430a0f63d4a%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
> For more options, visit https://groups.google.com/d/optout.
>
