Re: [tesseract-ocr] Re: tessdata/eng.traineddata question

newbie Tue, 20 Jan 2015 12:00:25 -0800

I found that vip1200.jpg is recognized once it is scaled to a width of 
8654px and a height of 5748px, but most of the time I get either an 
"Invalid mem access" or an out-of-memory (heap) error before I am able to 
rescale to the optimal scale.
I need to come up with some other generic way to upscale and OCR images. 
Any ideas are appreciated.
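
One possible direction, sketched below in Java: instead of a fixed multiple, 
pick the scale factor from the text height, aiming for the 70-90px font size 
suggested further down in this thread, and cap it with a pixel budget so the 
upscaled image never outgrows the heap. This is only a sketch under 
assumptions: ScalePlanner, chooseScale, rescale, the 80px target and the 
maxPixels budget are hypothetical names and values, not part of Tess4J or 
Tesseract, and the text height of the input has to be estimated separately.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper; not part of Tess4J or Tesseract.
final class ScalePlanner {
    // Assumed target text height, the midpoint of the 70-90px range from this thread.
    static final double TARGET_TEXT_HEIGHT_PX = 80.0;

    /**
     * Returns the factor that brings text of the estimated height to the target
     * height, lowered if needed so the scaled image has at most maxPixels pixels.
     */
    static double chooseScale(int width, int height,
                              double estimatedTextHeightPx, long maxPixels) {
        if (width <= 0 || height <= 0 || estimatedTextHeightPx <= 0 || maxPixels <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        double wanted = TARGET_TEXT_HEIGHT_PX / estimatedTextHeightPx;
        // Pixel count grows with the square of the scale factor.
        double cap = Math.sqrt((double) maxPixels / ((double) width * height));
        return Math.min(wanted, cap);
    }

    /** Rescales with bicubic interpolation, keeping the aspect ratio. */
    static BufferedImage rescale(BufferedImage src, double scale) {
        int w = Math.max(1, (int) Math.round(src.getWidth() * scale));
        int h = Math.max(1, (int) Math.round(src.getHeight() * scale));
        // TYPE_INT_RGB drops any alpha channel, which is usually fine for OCR input.
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                               RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return dst;
    }
}
```

With a 67x29 image whose text is roughly 20px tall, chooseScale asks for a 
factor of 4 (80 / 20), far below the multiples I use now. The scaled image 
would then be passed to the OCR once, rather than being upscaled repeatedly.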


On Tuesday, January 20, 2015 at 11:38:54 AM UTC-5, newbie wrote:
>
> Thanks to everyone who has taken the time to respond.
>
> This is what I am trying to do now: I upscale the image, feed it to the 
> OCR, and check the result against a dictionary of words I have. If it does 
> not match, I upscale again and feed it to the OCR, repeating as needed. I 
> cannot upscale very far, as there are 3 problems.
>
> 1. The text I am looking for gets very blurred and the OCR fails.
> 2. I run out of memory while upscaling (I have already increased the heap 
> size to the max).
> 3. The process is time-consuming.
>
> My upscale multiple (how much I upscale the entire image) is also set 
> based on the max dimension of the original image, i.e. the larger of its 
> height and width in pixels (e.g. with height 29 and width 67, max 
> dimension = 67):
> // Boundary values (exactly 100 or 1000) now fall into a branch as well.
> if (maxDimension < 100)
>     scaledMultiple = 10;
> else if (maxDimension < 1000)
>     scaledMultiple = 50;
> else
>     scaledMultiple = 100;
>
> This works for most of the images I currently have, but fails for a few. I 
> will attach the failing ones (it needs to read VIP1200 in VIP1200R.png and 
> VIP1200R_cropped). I would appreciate it if any of you could tell me how I 
> can get this to work, or whether there is another way to go about this, as 
> my images vary drastically in size. (Of course, I have passed along the 
> suggestion of cropping the model number within a text box, as Allistair 
> suggested, and they are mulling it over, so I guess the idea was not well 
> received.)
>
> I do maintain the aspect ratio of the original image when I upscale, so 
> the text is not made more "oval". Maybe I should try that? Also, I am now 
> converting JPG files to PNG; do you know which format works best? 
> Thanks
>
> Appreciate it.
>
>
>
> On Sunday, January 18, 2015 at 1:59:28 PM UTC-5, Flash Thunder wrote:
>>
>> Oh, sorry for the double post... wrong key. I have to say that, for 
>> example, for captcha recognition, I resize images to 200% or even 300%; 
>> the same image not resized does not give any results. Not sure why. 
>> Probably because the font becomes more ... "oval".
>>
>> 2015-01-18 19:57 GMT+01:00 Marek FlashT Rucinski <przys...@gmail.com>:
>>
>>> Don't use the DPI metric, as it does not really matter to Tesseract. The 
>>> best results (in my experience) are obtained when the font size is 
>>> 70-90px (so it is a bit large for normal usage).
>>>
>>> 2015-01-15 1:58 GMT+01:00 Quan Nguyen <nguy...@gmail.com>:
>>>
>>>> You can use the combine_tessdata command 
>>>> <http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/combine_tessdata.1.html>
>>>> to unpack a traineddata file to examine its components.
>>>>
>>>> The eng.traineddata bundled with Tess4J is version 3.01. You may 
>>>> want to try 3.02 and see if it produces better results for you (check 
>>>> https://code.google.com/p/tesseract-ocr/downloads/list).
>>>>
>>>> On Monday, January 12, 2015 at 10:18:18 AM UTC-6, newbie wrote:
>>>>>
>>>>> Does anyone know whether tessdata/eng.traineddata (the final crunched 
>>>>> data) in tess4j comes with all of the files below included?
>>>>>
>>>>>
>>>>>    - tessdata/eng.config
>>>>>    - tessdata/eng.unicharset
>>>>>    - tessdata/eng.unicharambigs
>>>>>    - tessdata/eng.inttemp
>>>>>    - tessdata/eng.pffmtable
>>>>>    - tessdata/eng.normproto
>>>>>    - tessdata/eng.punc-dawg
>>>>>    - tessdata/eng.word-dawg
>>>>>    - tessdata/eng.number-dawg
>>>>>    - tessdata/eng.freq-dawg
>>>>>
>>>>> Also, is this enough to identify any of the normal fonts (images 
>>>>> attached)? Appreciate your help.
>>>>>
>>>>  -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/991f0517-29d9-440b-97e4-8e2616c30033%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/991f0517-29d9-440b-97e4-8e2616c30033%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>
>>

