Re: [tesseract-ocr] approches used for language detection on images ...

2020-02-04 Thread Albretch Mueller
On 2/1/20, Zdenko Podobny  wrote:
> You did not provide any example image

 OK, this one would do. On this pdf file there are images of varying
quality and with text embedded in various ways. This would be the
typical text I would be dealing with:

 https://www.nysedregents.org/USHistoryGov/Archive/20020122exam.pdf

 another example of a textual file I work with would be:

 https://scholarworks.iu.edu/dspace/bitstream/handle/2022/18961/Notes and Texts.pdf

 on that file pdftohtml produces one background file per page, but
when you stratify the content (simply using hash signatures) you
realize most files are of the same kind: just blank background images,
files containing a single line (for example, underlining a title or
framing a blocked message), and full-page blank images with segments
of Greek text, ...

 I don't quite understand why poppler utils don't just underline a
word. Of course, you could easily write some code to figure out which
segments of text should be underlined, but understanding the obvious
tends to pay off in the long run.

> , neither what kind of tools you would
> like to use (open source or proprietary)

 the poppler's pdftohtml tools:

 https://poppler.freedesktop.org/

 are pretty good, but there is always some extra tweaking you need.
Authors write texts in whichever way they want, and this is a good
thing.

>4. I guess you will have problems with texts in mixed languages.

 Yes, I do, but a few heuristics included in metadata (extracted from
the names and/or headings of files) are of great help

 At the end of the day you can't fully automate such a process. You
will need a GUI and to let "users" eyeball the data . . .
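
 Such a filename heuristic could look like this; the keyword table is
entirely made up for illustration:

```python
# Made-up keyword table mapping tokens that show up in file names or
# headings to candidate tesseract language codes.
FILENAME_HINTS = {
    "greek": "ell",
    "latin": "lat",
    "ushistory": "eng",
    "deutsch": "deu",
}

def language_hints(filename):
    """Return candidate language codes suggested by a file name."""
    name = filename.lower()
    return sorted({code for key, code in FILENAME_HINTS.items() if key in name})

hints = language_hints("USHistoryGov-20020122exam.pdf")
# -> ["eng"]
```

 The hints only narrow the search; ambiguous or unhinted files are the
ones a "user" would have to eyeball.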

>5. If proprietary tools (and budget ;-) ) are not a problem, you can try
>Google Vision [6], Microsoft Cognitive Services [7] or Amazon
>Rekognition [8]. Dataturks made some tests of them [9]...

 I am trying to write up a set of bash scripts and code as part of a
pretty complete all-purpose library. Ideally the back-end text will be
formatted as ODT, since it is very easy to convert it to any other
format anyway.

 Do you know of such a library?

> [1] ... [9]

 Thank you,
 lbrtchx

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAFakBwj_b5uxQaP-%3DYv_1VP6%3DNG5B1OYjCOT2LLJAdKr%2BTX66A%40mail.gmail.com.


Re: [tesseract-ocr] approches used for language detection on images ...

2020-02-01 Thread Zdenko Podobny
You did not provide any example image, nor what kind of tools you would
like to use (open source or proprietary), so... Just some tips in
addition to Lorenzo's:

   1. You can try to use tesseract for text detection too, see [1]. Maybe
   just use RIL_TEXTLINE, RIL_PARA or RIL_BLOCK instead of RIL_WORD. Be aware
   of some limitations [2]. There are some rumors that the legacy models [3]
   (from tesseract 3.05) are better for this task than LSTM (which provides
   better OCR quality).
   2. If you do not know the script, you can try tesseract's "Orientation
   and script detection" feature. Tesseract tries to identify the Han, Latin,
   Katakana, Hiragana and Hangul scripts, so you can narrow down the
   possible languages of the text.
   3. If you do not want to use "brute force" as described by Lorenzo
   (tesseract provides the best OCR results when you specify the language of
   the text) and you know that the text is written in Latin script, you can
   use the Latin model from the script directory [4]. After OCR you can try
   to identify the language with other tools [5].
   4. I guess you will have problems with texts in mixed languages.
   5. If proprietary tools (and budget ;-) ) are not a problem, you can try
   Google Vision [6], Microsoft Cognitive Services [7] or Amazon
   Rekognition [8]. Dataturks made some tests of them [9]...

[1]
https://github.com/zdenop/SimpleTesseractPythonWrapper/blob/master/SimpleTesseractPythonWrapper.ipynb
[2] https://github.com/tesseract-ocr/tessdoc/blob/master/ImproveQuality.md
[3]  https://github.com/tesseract-ocr/tessdata
[4] https://github.com/tesseract-ocr/tessdata_best/tree/master/script
[5]
https://www.slideshare.net/shuyo/short-text-language-detection-with-infinitygram-12949447
[6] https://cloud.google.com/vision/docs/ocr
[7] https://azure.microsoft.com/en-us/services/cognitive-services/
[8] https://aws.amazon.com/rekognition/
[9] https://dataturks.com/blog/compare-image-text-recognition-apis.php
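
A sketch of how point 2 narrows the language search space once a script
is detected. The script-to-language table is an illustrative assumption
for the sketch, not tesseract's own list:

```python
# Illustrative mapping from a script name (as tesseract's OSD might
# report it) to candidate traineddata language codes; this table is an
# assumption, not something tesseract ships.
SCRIPT_TO_LANGS = {
    "Latin": ["eng", "deu", "fra", "spa", "ita"],
    "Han": ["chi_sim", "chi_tra"],
    "Hiragana": ["jpn"],
    "Katakana": ["jpn"],
    "Hangul": ["kor"],
}

def candidate_languages(script):
    """Narrow the set of languages worth trying once OSD names a script."""
    return SCRIPT_TO_LANGS.get(script, [])
```

With pytesseract, the script name itself could come out of
pytesseract.image_to_osd(); the remaining candidates can then be OCRed
one by one as in Lorenzo's brute-force suggestion.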

Zdenko



Re: [tesseract-ocr] approches used for language detection on images ...

2020-02-01 Thread Lorenzo Bolzani
You can try some machine learning based text detection, like this one for
example:

https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/
https://github.com/argman/EAST

It's not so easy to use because, as you can see in the images, you are
going to get multiple boxes, so you need a threshold-based aggregation
step to put the real text blocks together.
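
A sketch of such a threshold-based aggregation step, on plain
(x1, y1, x2, y2) tuples rather than EAST's actual output format:

```python
def merge_boxes(boxes, gap=10):
    """Greedily merge axis-aligned boxes (x1, y1, x2, y2) that come
    within `gap` pixels of each other, so many per-word detections
    collapse into larger text blocks."""
    boxes = [list(b) for b in boxes]
    merged = True
    while merged:          # repeat until a full pass makes no merge
        merged = False
        out = []
        while boxes:
            a = boxes.pop()
            for b in boxes:
                if (a[0] - gap <= b[2] and b[0] - gap <= a[2] and
                        a[1] - gap <= b[3] and b[1] - gap <= a[3]):
                    # a and b are close enough: fuse a into b
                    b[0], b[1] = min(a[0], b[0]), min(a[1], b[1])
                    b[2], b[3] = max(a[2], b[2]), max(a[3], b[3])
                    merged = True
                    break
            else:
                out.append(a)
        boxes = out
    return [tuple(b) for b in boxes]

blocks = merge_boxes([(0, 0, 10, 10), (12, 0, 20, 10), (100, 100, 110, 110)])
# the two nearby boxes fuse into (0, 0, 20, 10); the distant one survives
```

The gap threshold is the knob: roughly the expected inter-word spacing,
so words on one line fuse while separate paragraphs stay apart.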

If your text is simple and uniform, like the one from pdf or html
rendering, something like this may work too:

https://www.pyimagesearch.com/2017/07/17/credit-card-ocr-with-opencv-and-python/
https://github.com/qzane/text-detection/blob/master/TextDetect.py


About the language of the text: a brute-force approach could be to try
different languages with tesseract and see which one gives you the
highest confidence. Other than that, you might try a simple "character
detection" with a few key characters for each language and see where
you get the best score (for example with opencv template matching),
but I would expect a lot of errors if the text uses different fonts
and sizes.

If all the languages use the same alphabet, Latin for example, you can
use a generic one ("eng") and do a character-distribution analysis to
find the original language, then process it again with the correct
tesseract language:

https://data-science-blog.com/blog/2018/11/12/language-detecting-with-sklearn-by-determining-letter-frequencies/
https://appliedmachinelearning.blog/2017/04/30/language-identification-from-texts-using-bi-gram-model-pythonnltk/
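
A toy sketch of that letter-frequency idea; the profiles below are
made-up numbers for two languages, where real ones would be estimated
from large corpora:

```python
from collections import Counter

# Toy letter-frequency profiles (made-up values, illustration only).
PROFILES = {
    "eng": {"e": 0.13, "t": 0.09, "a": 0.08, "o": 0.075, "h": 0.06},
    "deu": {"e": 0.17, "n": 0.10, "i": 0.075, "s": 0.07, "r": 0.07},
}

def guess_language(text):
    """Pick the profile whose letter frequencies best match the text
    (smallest sum of absolute frequency differences)."""
    letters = [c for c in text.lower() if c.isalpha()]
    freq = Counter(letters)
    total = len(letters) or 1
    def score(profile):
        return sum(abs(freq[ch] / total - p) for ch, p in profile.items())
    return min(PROFILES, key=lambda lang: score(PROFILES[lang]))
```

On short OCR snippets this is noisy, which is why the linked posts move
up to bi-grams; but the shape of the approach is the same.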

Finally, for different alphabets, you could also train a very simple neural
network to do the classification (google "MNIST CNN"), the most complex
part being preparing the dataset.


Lorenzo









[tesseract-ocr] approches used for language detection on images ...

2020-02-01 Thread Albretch Mueller
 pdftohtml produces background images whose (x,y) positions are
specified in the page's markup. It creates images for the underlines
of text and also for blocked sections (with visible frames), foreign
language text, . . .

 programmatically scanning those background images to find lines
and boxes is easy, but how could you detect text (other than by
exclusion) and the language of that text?
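
 The "lines and boxes are easy" part can be sketched like this, on a
plain binary raster (a list of pixel rows, 1 = dark pixel) instead of a
real decoded image:

```python
def find_horizontal_rules(img, min_len=20):
    """Scan a binary raster for long horizontal runs of dark pixels --
    the underlines and frame edges pdftohtml renders as images.
    Returns (row, start_x, end_x) for each run of at least min_len."""
    rules = []
    for y, row in enumerate(img):
        run = 0
        for x, px in enumerate(row):
            if px:
                run += 1
            else:
                if run >= min_len:
                    rules.append((y, x - run, x - 1))
                run = 0
        if run >= min_len:  # run reaching the right edge
            rules.append((y, len(row) - run, len(row) - 1))
    return rules
```

 Anything left over after removing such runs (and their vertical
counterparts) is a candidate for being text, which is where the OCR
question starts.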

 I asked basically the same question on a gimpusers forum:

 
https://www.gimpusers.com/forums/gimp-user/21659-approches-used-for-language-detection-on-images

 they told me OCR folks should know best:

 lbrtchx
