[jira] [Created] (TIKA-2696) Support output of Tesseract OSD output for psm mode 0

August Valera (JIRA) Thu, 26 Jul 2018 16:52:28 -0700

August Valera created TIKA-2696:
-----------------------------------

             Summary: Support output of Tesseract OSD output for psm mode 0
                 Key: TIKA-2696
                 URL: https://issues.apache.org/jira/browse/TIKA-2696
             Project: Tika
          Issue Type: Improvement
          Components: ocr
            Reporter: August Valera



TIKA-2357 added support for additional page segmentation modes (PSMs) for 
Tesseract OCR, including mode 0, which is {{Orientation and script detection 
(OSD) only}}. In this mode Tesseract does not perform OCR; it only outputs 
orientation and script information.

An example usage of mode 0:
{code:bash}
$ tesseract infile.png outfile --psm 0 -l osd
{code}
In this mode, the usual {{outfile.txt}} is not created. Instead, and similar to 
other modes that run OSD in addition to extraction, the result is an 
{{outfile.osd}} file, like so:
{code:java}
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 212
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 13.73
Script: Latin
Script confidence: 4.78
{code}
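As a rough illustration of how such output could be consumed, here is a minimal sketch that turns the {{key: value}} lines above into a map. The class and method names ({{OsdOutputParser}}, {{parse}}) are hypothetical and not part of Tika. The sketch assumes every useful line contains a {{": "}} separator, and it skips lines without one, such as {{Page 1}} and the resolution messages.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not an existing Tika class.
public class OsdOutputParser {

    /**
     * Parses "key: value" lines from OSD output into an ordered map.
     * Lines without a ": " separator are ignored (assumption: they are
     * diagnostics, not OSD fields).
     */
    public static Map<String, String> parse(String osdText) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : osdText.split("\\R")) {
            int sep = line.indexOf(": ");
            if (sep <= 0) {
                continue; // not a key/value line
            }
            String key = line.substring(0, sep).trim();
            String value = line.substring(sep + 2).trim();
            fields.put(key, value);
        }
        return fields;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "Page 1",
                "Page number: 0",
                "Orientation in degrees: 0",
                "Rotate: 0",
                "Orientation confidence: 13.73",
                "Script: Latin",
                "Script confidence: 4.78");
        // Print each parsed field on its own line.
        parse(sample).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Values are kept as strings here so that the caller can decide how to treat confidences and angles.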
However, {{TesseractOCRParser#parse(...)}} is 
[coded|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java#L437]
 to only read the contents of {{outfile.txt}} (alternatively {{outfile.hocr}}) 
in all modes, so mode 0 outputs nothing regardless of input.

This is consistent with Tika's goal of outputting extracted text, but it 
defeats the intent of a user who selects mode 0 expecting OSD output.
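
One possible direction, shown only as a sketch: choose which Tesseract output file to read based on the PSM. The class {{TesseractOutputFiles}} and its {{resolve}} method are hypothetical and do not reflect the current {{TesseractOCRParser}} code. The sketch assumes that only {{outfile.osd}} is relevant in mode 0, and that the other modes keep reading {{.txt}} or {{.hocr}} as they do today.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not an existing Tika class.
public class TesseractOutputFiles {

    /**
     * Returns the file Tesseract is expected to have written for the given
     * output base name.
     * Assumption: PSM 0 produces only "<base>.osd"; any other mode produces
     * "<base>.hocr" when hOCR output is requested, else "<base>.txt".
     */
    public static Path resolve(Path outputBase, int psm, boolean hocr) {
        String extension;
        if (psm == 0) {
            extension = ".osd";
        } else if (hocr) {
            extension = ".hocr";
        } else {
            extension = ".txt";
        }
        // Append the extension to the last path element, keeping the parent.
        return outputBase.resolveSibling(outputBase.getFileName() + extension);
    }

    public static void main(String[] args) {
        Path base = Paths.get("outfile");
        System.out.println(resolve(base, 0, false)); // OSD-only mode
        System.out.println(resolve(base, 1, false)); // plain-text output
        System.out.println(resolve(base, 1, true));  // hOCR output
    }
}
```

Whether the OSD fields should be emitted as body text or as metadata is left open here; the sketch only covers locating the file.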



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
