OCR of source code with tesseract is a problem:

   - tesseract is not focused on preserving spaces/indentation - you have to
   reconstruct them yourself (e.g. by parsing the hOCR output)
   - tesseract is focused more on "real" text, while source code is more
   symbolic, with a lot of special characters, case-sensitive names etc. So I
   am quite sure you will need to correct the tesseract output manually.
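
A rough sketch of the indentation reconstruction, in Python. It assumes you
have already extracted word bounding boxes per text line (from the hOCR or
TSV output) into (left, right, text) tuples in pixels, and that the code was
rendered in a monospaced font. The function name and the input layout are
only illustrative, not part of tesseract:

```python
from statistics import median


def rebuild_indentation(lines):
    """Rebuild leading indentation from word bounding boxes.

    `lines` is a list of text lines; each line is a list of
    (left, right, text) tuples in pixels (hypothetical layout, filled
    in by whatever hOCR/TSV parser you use).
    """
    # Ignore empty OCR tokens.
    words = [w for line in lines for w in line if w[2]]
    if not words:
        return ""

    # Estimate the pixel width of one character (monospaced font assumed).
    char_width = median((right - left) / len(text) for left, right, text in words)
    # The leftmost word on the page is treated as column 0.
    margin = min(left for left, _, _ in words)

    out = []
    for line in lines:
        line = sorted((w for w in line if w[2]), key=lambda w: w[0])
        if not line:
            out.append("")
            continue
        if char_width > 0:
            indent = round((line[0][0] - margin) / char_width)
        else:
            indent = 0
        # Words inside a line are joined with a single space.
        out.append(" " * indent + " ".join(text for _, _, text in line))
    return "\n".join(out)
```

This only restores the leading whitespace; wider gaps inside a line are
collapsed to a single space, and misrecognized characters still have to be
fixed by hand.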


Zdenko


On Mon, 22 Nov 2021 at 6:54, J S <[email protected]> wrote:

> Hi all,
> I am trying to OCR some code written in Python. I've read the Tesseract docs
> many times and applied 3 preprocessing scripts with ImageMagick. The
> resulting image is attached.
> I then send it to Tesseract with ```--psm 4```, which seems to be the most
> suitable segmentation mode for what I am trying to do. The result is quite
> OK, but I don't get indentation and I think it could still be improved.
>
> I would be glad to have some advice on improving the result. Thanks a lot.
>
> Best,
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/c07b4f66-7e6e-4634-a4ee-b8a8db003f20n%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/c07b4f66-7e6e-4634-a4ee-b8a8db003f20n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
