Re: [tesseract-ocr] Re: Extraction of two different language text from single image using tesseract

Pankaj Gupta Wed, 19 Aug 2020 10:04:22 -0700

Hi Shree,

Thank you for your suggestion. The suggested method improves the 
pass percentage of the test cases, but the extraction of 
mixed-language text is still not consistent. Sometimes Tesseract 
extracts the characters correctly, but not always. 
For example, in one scenario it detects the English letters that 
come at the start of the text, but in the next text the English letters 
at the end of the text are not extracted properly.
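
One way to narrow down where the mixed-language output goes wrong is to look at per-word confidences rather than only the final string. The sketch below assumes the pytesseract wrapper and Pillow are installed; `low_confidence_words`, `inspect_image`, and the threshold of 60 are illustrative names and values, not anything Tesseract prescribes.

```python
def low_confidence_words(data, threshold=60.0):
    """Return (text, confidence) pairs whose confidence is below `threshold`.

    `data` is a dict of parallel lists under "text" and "conf", shaped like
    the result of pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    flagged = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # conf may arrive as str, int, or float
        # Rows with conf == -1 are layout entries (blocks, lines), not words.
        if text.strip() and 0 <= conf < threshold:
            flagged.append((text, conf))
    return flagged


def inspect_image(path, lang="jpn+eng", threshold=60.0):
    """Run OCR on `path` and list the words Tesseract was unsure about."""
    # Imported lazily so the helper above works without these packages.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(path), lang=lang, output_type=pytesseract.Output.DICT
    )
    return low_confidence_words(data, threshold)
```

Comparing the flagged words across the failing images may show whether the trailing English text is being misread or skipped entirely.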


One more problem we have identified: a few of the images contain 
numbers in superscript, and when we apply OCR these superscript 
numbers are not extracted.
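
A preprocessing step that may be worth trying for the small superscript digits is to enlarge the image before OCR, on the assumption that the glyphs are simply too small to be recognized. This is only a sketch using OpenCV and pytesseract; `upscaled_size`, `ocr_upscaled`, and the factor of 2 are illustrative choices, and it may not recover superscripts in every image.

```python
def upscaled_size(width, height, factor=2.0):
    """Return the (width, height) of an image enlarged by `factor`."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return (round(width * factor), round(height * factor))


def ocr_upscaled(path, lang="eng", factor=2.0):
    """Enlarge the image at `path`, convert it to grayscale, and run OCR."""
    # Imported lazily so upscaled_size() works without these packages.
    import cv2
    import pytesseract

    img = cv2.imread(path)
    if img is None:
        raise FileNotFoundError(path)
    height, width = img.shape[:2]
    # cv2.resize expects the target size as (width, height).
    big = cv2.resize(
        img, upscaled_size(width, height, factor), interpolation=cv2.INTER_CUBIC
    )
    gray = cv2.cvtColor(big, cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(gray, lang=lang)
```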

Please suggest.
On Wednesday, August 19, 2020 at 1:40:14 PM UTC+5:30 shree wrote:

> For multiple languages, the standard invocation is to join the language 
> codes with a + sign. 
>
> E.g. -l ara+eng or -l eng+jpn 
>
> Alternatively, you can also try the script traineddata files, e.g. Devanagari 
> includes eng+hin+san+mar+nep
>
> However, multi-language recognition takes more time and is not perfect.
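
From Python, the combined-language invocation quoted above might look like the following minimal sketch. It assumes the pytesseract wrapper and Pillow are installed and that the traineddata files for each code are available; `build_lang_arg` and `ocr_mixed` are illustrative helper names.

```python
def build_lang_arg(codes):
    """Join language codes with '+', as in `-l ara+eng`."""
    cleaned = [c.strip() for c in codes if c and c.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_mixed(path, codes):
    """Run OCR on `path` using all of the given language codes together."""
    # Imported lazily so build_lang_arg() works without these packages.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(path), lang=build_lang_arg(codes)
    )
```

For the Japanese and English sample this would be called as `ocr_mixed("sample.png", ["jpn", "eng"])`, where the file name is a placeholder.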
>
> On Wed, Aug 19, 2020, 13:20 Pankaj Gupta <pan...@gaurishiv.org> wrote:
>
>> Dear Team,
>>
>> Waiting for your suggestions.  Need your help.
>>
>> Thank you in advance.
>>
>> Regards,
>> Pankaj
>>
>> On Friday, August 14, 2020 at 12:45:05 AM UTC+5:30 Pankaj Gupta wrote:
>>
>>> Dear Team,
>>>
>>> My team and I are developing a tool that extracts text from given 
>>> images (each containing text in a single language) using Tesseract. The 
>>> tool is able to extract text in 14 different languages with an 
>>> accuracy greater than 95%.
>>>
>>> We now face a new challenge: some images 
>>> contain text in more than one language (Japanese - English or Arabic - 
>>> English). Due to copyright issues, I am not able to attach the original 
>>> image. A sample image is attached to this thread; it contains 
>>> text in Japanese and English and depicts the actual scenario. I request your 
>>> support in identifying a technique to extract the text accurately in both 
>>> languages.
>>>
>>> I am using Python 3+, OpenCV, and Tesseract for development.
>>>
>>> Thanks in advance.
>>>
>>> Regards,
>>> Pankaj Gupta
>>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/48b5420c-f1cb-4f0d-bfdc-5612ffcf5661n%40googlegroups.com.
