August Valera created TIKA-2696:
-----------------------------------

Summary: Support output of Tesseract OSD output for psm mode 0
Key: TIKA-2696
URL: https://issues.apache.org/jira/browse/TIKA-2696
Project: Tika
Issue Type: Improvement
Components: ocr
Reporter: August Valera

TIKA-2357 added support for additional Tesseract OCR page segmentation modes (PSM), including mode 0, {{Orientation and script detection (OSD) only}}. In this mode Tesseract does not perform OCR; it only outputs orientation and script information.

An example usage of mode 0:

{code:java}
$ tesseract infile.png outfile --psm 0 -l osd
{code}

In this mode, the usual {{outfile.txt}} is not created. Instead, as in other modes that run OSD in addition to extraction, the result is an {{outfile.osd}} file, like so:

{code:java}
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 212
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 13.73
Script: Latin
Script confidence: 4.78
{code}

However, {{TesseractOCRParser#parse(...)}} is [coded|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java#L437] to read only the contents of {{outfile.txt}} (or {{outfile.hocr}}) in all modes, so mode 0 outputs nothing regardless of input. This is consistent with Tika's goal of outputting extracted text, but it defeats the purpose of mode 0 for a user who expects OSD output.
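
For illustration only, here is a minimal sketch of how the {{key: value}} lines shown in the OSD output above could be read into a map. The class name {{OsdOutputSketch}} and the method {{parseOsd}} are hypothetical and are not part of Tika. The sketch assumes every field uses the {{key: value}} layout from the sample, and it leaves open how Tika would expose the result (for example, as metadata).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika: parses the text of an .osd file.
public class OsdOutputSketch {

    /**
     * Splits "key: value" lines into an insertion-ordered map.
     * Lines without a ": " separator (such as "Page 1" or the
     * resolution warnings) are skipped.
     */
    public static Map<String, String> parseOsd(String osdText) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : osdText.split("\\R")) {
            int sep = line.indexOf(": ");
            if (sep <= 0) {
                continue; // not a key/value line
            }
            String key = line.substring(0, sep).trim();
            String value = line.substring(sep + 2).trim();
            fields.put(key, value);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Sample text copied from the OSD output quoted in this issue.
        String sample = String.join("\n",
                "Page 1",
                "Warning. Invalid resolution 0 dpi. Using 70 instead.",
                "Estimating resolution as 212",
                "Page number: 0",
                "Orientation in degrees: 0",
                "Rotate: 0",
                "Orientation confidence: 13.73",
                "Script: Latin",
                "Script confidence: 4.78");

        Map<String, String> osd = parseOsd(sample);
        for (Map.Entry<String, String> e : osd.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Keeping the values as strings avoids committing to numeric types before it is decided how (or whether) the parser should surface them.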