[jira] [Commented] (TIKA-2624) Rendering PDFs for OCR with Tesseract uses different DPI than claimed

Tilman Hausherr (Jira) Tue, 22 Oct 2019 11:15:19 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957261#comment-16957261
 ]


Tilman Hausherr commented on TIKA-2624:
---------------------------------------

None from me. Regarding "This will have a memory and temporary disk space 
impact": be aware that rendering will be slower too, but 300dpi is a minimum 
for decent OCR.

> Rendering PDFs for OCR with Tesseract uses different DPI than claimed
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2624
>                 URL: https://issues.apache.org/jira/browse/TIKA-2624
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>            Reporter: Ewan Mellor
>            Assignee: Tim Allison
>            Priority: Major
>
> Tika has two properties in {{PDFParser.properties}} that control what happens 
> in AbstractPDF2XHTML when a PDF is rendered before being passed to Tesseract 
> for OCR.  These are {{ocrDPI}} (default 300) and {{ocrImageScale}} (default 
> 2.0).
> {{ocrDPI}} is passed to {{ImageIOUtil.writeImage}}, which only writes it into 
> the image metadata (i.e. it doesn't control scaling at all; it's just an 
> advertised metadata field).
> {{ocrImageScale}} is passed to PDFBox's {{PDFRenderer.renderImage}}, which 
> uses it to specify the scale for rendering.  This value is such that 1.0 == 
> 72dpi, and therefore Tika's default is to request 144dpi for rendering.
> This means that Tika is asking PDFBox to render at 144dpi, and then 
> advertising 300dpi in the image metadata.  This makes no sense to me, and is 
> surely going to confuse Tesseract.
> Instead of doing this, we should remove {{ocrImageScale}}, and use the same 
> DPI value in both places.
> We should keep the existing default DPI value, since Tesseract is trained at 
> 300dpi by default, so this will mean that all stages between PDFRenderer and 
> Tesseract are defaulting to 300dpi.
> This change will have the side effect that the temporary images between the 
> PDF rendering and Tesseract will be roughly 4x larger in pixel count (144dpi 
> to 300dpi).  This will 
> have a memory and temporary disk space impact, but I think that it's still 
> best to have the whole pipeline using 300dpi.  People who have memory 
> constraints will need to reduce ocrDPI and make the corresponding changes on 
> the Tesseract side.
>  
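
The arithmetic in the description above can be checked with a short sketch. 
This is plain Python rather than Tika or PDFBox code, and the function names 
are hypothetical, chosen only for illustration. The one fact taken from the 
issue is that a render scale of 1.0 corresponds to 72dpi.

```python
# Illustrative only: these helpers are not part of Tika or PDFBox.
# Assumption taken from the issue text: render scale 1.0 == 72 dpi.
DPI_AT_SCALE_ONE = 72.0


def scale_to_dpi(scale: float) -> float:
    """Effective rendering DPI for a given render scale."""
    return scale * DPI_AT_SCALE_ONE


def dpi_to_scale(dpi: float) -> float:
    """Render scale needed to actually render at the given DPI."""
    return dpi / DPI_AT_SCALE_ONE


def pixel_count_ratio(old_dpi: float, new_dpi: float) -> float:
    """Growth in pixel count; both width and height scale with DPI."""
    return (new_dpi / old_dpi) ** 2


rendered_dpi = scale_to_dpi(2.0)              # default ocrImageScale of 2.0
needed_scale = dpi_to_scale(300.0)            # scale that matches ocrDPI of 300
growth = pixel_count_ratio(rendered_dpi, 300.0)

print(f"default scale 2.0 renders at {rendered_dpi:.0f} dpi")
print(f"300 dpi needs a scale of about {needed_scale:.2f}")
print(f"pixel count grows by about {growth:.2f}x")
```

Under these assumptions the default scale of 2.0 gives 144dpi, matching 300dpi 
needs a scale a little above 4, and the pixel count of each temporary image 
grows by a bit more than 4x.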



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
