Never mind: the config --oem 3 --psm 6 extracts the text really well, but with an image like the one below it merges two paragraphs into one. So I switched to --oem 3 --psm 4, which works great for that but skips a lot of text on the page. The problem I have now is that the images I read sometimes contain both kinds of text:
- Text read from left to right
- Text read from top to bottom
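
For reference, the two configurations described above could be passed to pytesseract roughly as follows. This is only a sketch for comparing the two modes side by side: `build_config`, `ocr_with_each_psm`, `demo` and the file name `page.png` are hypothetical names, and the OCR function is passed in as a parameter so the comparison logic can be tried without Tesseract installed.

```python
from typing import Callable, Dict

# Page segmentation modes mentioned in this thread (see `tesseract --help-psm`):
#   4 = assume a single column of text of variable sizes
#   6 = assume a single uniform block of text
PSM_SINGLE_COLUMN = 4
PSM_SINGLE_BLOCK = 6


def build_config(psm: int, oem: int = 3) -> str:
    """Build the config string that pytesseract forwards to the tesseract CLI."""
    return f"--oem {oem} --psm {psm}"


def ocr_with_each_psm(image, ocr: Callable[..., str], lang: str = "jpn") -> Dict[int, str]:
    """Run OCR once per PSM so the two outputs can be compared."""
    results = {}
    for psm in (PSM_SINGLE_COLUMN, PSM_SINGLE_BLOCK):
        results[psm] = ocr(image, lang=lang, config=build_config(psm)).strip()
    return results


def demo(image_path: str = "page.png") -> None:
    """Hypothetical usage; needs opencv-python, pytesseract and the jpn model."""
    import cv2
    import pytesseract

    img = cv2.imread(image_path)
    for psm, text in ocr_with_each_psm(img, pytesseract.image_to_string).items():
        print(f"--- psm {psm}: {len(text)} characters ---")
        print(text)
```

Comparing the character counts of the two outputs is a quick way to see how much text --psm 4 drops relative to --psm 6 on a given page.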
How can I detect which kind of text an image contains, so that I can switch between tessdata models? (If I remember correctly, jpn is used to read left-to-right text and jpn_vert is used to read top-to-bottom text.)

Thanks

[image: Screen Shot 2023-12-26 at 10.28.28.png]

On Monday, 25 December 2023 at 11:01:18 UTC+7, g...@hobbelt.com wrote:

> See also the discussion in the mailing list at
> https://groups.google.com/d/msgid/tesseract-ocr/f86e2d35-4c35-4643-835f-109994e46632n%40googlegroups.com?utm_medium=email&utm_source=footer
>
> Plus https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md,
> which is the most important documentation page that addresses all kinds of
> OCR result quality issues such as this.
>
> On Fri, 22 Dec 2023, 05:58 Hoang Pham Huy, <akiray...@gmail.com> wrote:
>
>> Currently I'm trying to read this image in Japanese for translation, but
>> the result is kind of odd. What should I do to improve it?
>>
>> I'm only using this code to extract text from the image, using the Japanese
>> tessdata_best <https://github.com/tesseract-ocr/tessdata_best> and some
>> others:
>>
>> ```
>> def extract_text_from_image(self, image_path):
>>     img = cv2.imread(image_path)
>>     text = pytesseract.image_to_string(img,
>>         lang='jpn+jpn_vert+jpn_ver5+eng+osd+equ')
>>     return text.strip()
>> ```
>>
>> [image: Screen Shot 2023-12-22 at 10.12.00.png]
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-oc...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/06f86c3c-4b4c-4a99-b2fa-50f38b13d54bn%40googlegroups.com
>> .

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/afbc1c77-a1c5-43a1-8130-86eec8e94ad0n%40googlegroups.com.
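
On the question above about switching between jpn and jpn_vert: one possible approach, which nobody in this thread has confirmed, is to OCR the same image (or a cropped region) once per model and keep the model with the higher mean word confidence. The sketch below assumes pytesseract's image_to_data called with Output.DICT, which returns parallel "conf" and "text" lists. Pairing jpn_vert with --psm 5 (single uniform block of vertically aligned text) is an assumption, and `mean_confidence`, `pick_language` and `demo` are hypothetical helper names.

```python
from functools import partial
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple


def mean_confidence(data: Dict[str, list]) -> float:
    """Average confidence over rows that hold recognised text.

    Rows with conf == -1 describe layout elements (blocks, lines), not words,
    so they are skipped, as are rows whose text is empty.
    """
    confs = [
        float(c)
        for c, t in zip(data["conf"], data["text"])
        if float(c) >= 0 and str(t).strip()
    ]
    return mean(confs) if confs else 0.0


def pick_language(
    image,
    ocr_data: Callable[..., Dict[str, list]],
    candidates: Sequence[Tuple[str, int]] = (("jpn", 6), ("jpn_vert", 5)),
) -> Tuple[str, float]:
    """Try each (lang, psm) pair and return the best language with its score."""
    best_lang, best_score = "", -1.0
    for lang, psm in candidates:
        data = ocr_data(image, lang=lang, config=f"--oem 3 --psm {psm}")
        score = mean_confidence(data)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang, best_score


def demo(image_path: str) -> None:
    """Hypothetical usage; needs opencv-python, pytesseract and both models."""
    import cv2
    import pytesseract

    ocr_data = partial(pytesseract.image_to_data, output_type=pytesseract.Output.DICT)
    lang, score = pick_language(cv2.imread(image_path), ocr_data)
    print(f"best model: {lang} (mean confidence {score:.1f})")
```

For pages that mix both directions, the same function could be applied to each cropped text region instead of the whole page. How the regions are found is left open here.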