[jira] [Commented] (TIKA-2696) Support output of Tesseract OSD output for psm mode 0

ASF GitHub Bot (Jira) Thu, 28 Mar 2024 23:13:43 -0700


    [ https://issues.apache.org/jira/browse/TIKA-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17832050#comment-17832050 ]


ASF GitHub Bot commented on TIKA-2696:
--------------------------------------

Tarik37 commented on PR #246:
URL: https://github.com/apache/tika/pull/246#issuecomment-2026729362

   Hello, I am currently using Tika server 2.9.1 and need the OSD output in 
my metadata, particularly the script value (Latin, Cyrillic, etc.). My 
questions are:
   Does my version of Tika server include this feature?
   If so, how do I configure Tika server to enable it?
   Thanks for your work (and apologies, English is not my native language).




> Support output of Tesseract OSD output for psm mode 0
> -----------------------------------------------------
>
>                 Key: TIKA-2696
>                 URL: https://issues.apache.org/jira/browse/TIKA-2696
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr
>            Reporter: August Valera
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 2.2.0
>
>
> TIKA-2357 added support for additional PSM (page segmentation modes) for 
> Tesseract OCR, including mode 0, which is {{Orientation and script detection 
> (OSD) only}}, meaning it does not perform OCR, just outputs orientation and 
> script information.
> An example usage of mode 0:
> {code:java}
> $ tesseract infile.png outfile --psm 0 -l osd
> {code}
> In this mode, the usual {{outfile.txt}} is not created. Instead, and similar 
> to other modes that run OSD in addition to extraction, the result is an 
> {{outfile.osd}} file, like so:
> {code:java}
> Page 1
> Warning. Invalid resolution 0 dpi. Using 70 instead.
> Estimating resolution as 212
> Page number: 0
> Orientation in degrees: 0
> Rotate: 0
> Orientation confidence: 13.73
> Script: Latin
> Script confidence: 4.78
> {code}
> However, {{TesseractOCRParser#parse(...)}} is 
> [coded|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java#L437]
>  to only read the contents of {{outfile.txt}} (alternatively 
> {{outfile.hocr}}) in all modes, so mode 0 outputs nothing regardless of input.
> This is consistent with Tika's goal of outputting extracted text, but it is 
> contrary to what a user who requests OSD output expects.
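
The "Key: value" layout of the OSD output quoted above lends itself to simple line-based parsing. The following is a minimal sketch in Java, not Tika's actual implementation: the class name OsdOutputParser and its parse method are hypothetical, and it assumes that every field of interest appears on its own line in the form "Key: value", as in the sample. Lines without that separator (such as "Page 1" or the resolution warning) are skipped.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; this is not part of Tika's API.
public class OsdOutputParser {

    /**
     * Collects "Key: value" pairs from OSD output lines.
     * Lines that do not contain the ": " separator are ignored.
     */
    public static Map<String, String> parse(List<String> lines) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : lines) {
            int sep = line.indexOf(": ");
            if (sep <= 0) {
                continue; // informational line, e.g. "Page 1"
            }
            String key = line.substring(0, sep).trim();
            String value = line.substring(sep + 2).trim();
            fields.put(key, value);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Sample taken from the issue description above.
        List<String> sample = Arrays.asList(
                "Page 1",
                "Warning. Invalid resolution 0 dpi. Using 70 instead.",
                "Estimating resolution as 212",
                "Page number: 0",
                "Orientation in degrees: 0",
                "Rotate: 0",
                "Orientation confidence: 13.73",
                "Script: Latin",
                "Script confidence: 4.78");

        Map<String, String> osd = parse(sample);
        System.out.println(osd.get("Script"));                 // prints "Latin"
        System.out.println(osd.get("Orientation in degrees")); // prints "0"
    }
}
```

A parser along these lines could feed the script and orientation values into metadata fields, which is the kind of output the comment above asks about. Which metadata keys Tika actually uses for this is not stated in this message.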



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
