Dave Meikle created TIKA-2357:
---------------------------------

             Summary: Allow Tesseract PSM up to 13
                 Key: TIKA-2357
                 URL: https://issues.apache.org/jira/browse/TIKA-2357
             Project: Tika
          Issue Type: Improvement
          Components: ocr
    Affects Versions: 1.14
            Reporter: Dave Meikle
            Priority: Minor
             Fix For: 1.15

From https://github.com/apache/tika/pull/177 by Rafael Ferreira

Extend support for increased PSM options up to 13 for modern versions of Tesseract.

{code}
$ tesseract --version
tesseract 3.05.00
 leptonica-1.74.1
  libjpeg 8d : libpng 1.6.29 : libtiff 4.0.7 : zlib 1.2.8

$ tesseract --help-psm
Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
{code}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
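
The range check this improvement asks for could be sketched roughly as below. This is only an illustration, not Tika's actual code: the class PageSegModeCheck and its methods are hypothetical names, and the sketch assumes the page segmentation mode arrives as a string and that the valid values are the integers 0 through 13 listed by `tesseract --help-psm` above.

```java
// Hypothetical helper (not part of Tika): validates a Tesseract page
// segmentation mode against the 0..13 range shown by `tesseract --help-psm`
// for Tesseract 3.05.00.
final class PageSegModeCheck {

    static final int MIN_PSM = 0;
    static final int MAX_PSM = 13; // assumed upper bound, taken from the listing above

    private PageSegModeCheck() {
        // static utility, no instances
    }

    /** Parses the mode, throwing IllegalArgumentException if it is not an integer in [0, 13]. */
    static int parse(String pageSegMode) {
        if (pageSegMode == null) {
            throw new IllegalArgumentException("Page segmentation mode must not be null");
        }
        final int mode;
        try {
            mode = Integer.parseInt(pageSegMode.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    "Page segmentation mode is not an integer: " + pageSegMode, e);
        }
        if (mode < MIN_PSM || mode > MAX_PSM) {
            throw new IllegalArgumentException(
                    "Page segmentation mode must be between " + MIN_PSM
                            + " and " + MAX_PSM + ", got " + mode);
        }
        return mode;
    }

    /** Returns true if the value is an integer in [0, 13], false otherwise. */
    static boolean isValid(String pageSegMode) {
        try {
            parse(pageSegMode);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

With this sketch, PageSegModeCheck.isValid("13") accepts the raw-line mode, while "14", "-1", and non-numeric input are rejected. Keeping the upper bound in a single constant means a later Tesseract release that adds modes would need only a one-line change.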