[ 
https://issues.apache.org/jira/browse/TIKA-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15766980#comment-15766980
 ] 

Bipul Kumar commented on TIKA-2190:
-----------------------------------

Thanks Tim. I know that option and am using it, but the issue with hOCR is that 
sometimes the y coordinates do not match for words on the same line. So the 
TXT format can be used as extra info instead of writing code to predict which 
words are on the same line.

Moreover, many users can simply use the TXT format with space info for simple 
and straightforward use cases instead of writing code to parse hOCR output. 
Simple and user friendly.
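
As a rough illustration of the simple use case above, here is a minimal 
sketch of recovering columns from space-preserved TXT output. The function 
name split_columns, the min_gap threshold, and the sample line are all 
hypothetical; real OCR output may need a different threshold.

```python
import re


def split_columns(line, min_gap=2):
    """Split one line of text into fields wherever a run of at least
    `min_gap` spaces occurs. Single spaces inside a field are kept."""
    pattern = r" {%d,}" % min_gap
    return [field for field in re.split(pattern, line.strip()) if field]


# Made-up line, standing in for TXT output with preserved spacing.
sample = "Blue pen      2   1.50"
print(split_columns(sample))
```

This only works when the wide gaps survive in the TXT output, which is what 
the requested option is meant to provide.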

> Add "preserve_interword_spaces" option of tesseract
> ---------------------------------------------------
>
>                 Key: TIKA-2190
>                 URL: https://issues.apache.org/jira/browse/TIKA-2190
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr
>            Reporter: Bipul Kumar
>            Assignee: Tim Allison
>             Fix For: 2.0, 1.15
>
>
> This option preserves the spaces in the TXT output type so that the layout 
> or context can be inferred during further parsing. 
> To enable: -c preserve_interword_spaces=1
> To disable: -c preserve_interword_spaces=0, or simply omit the option
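
For reference, a minimal sketch of passing this flag to the tesseract 
command line directly, assuming tesseract is installed and on the PATH. The 
helper names and file names are hypothetical, and this is not how Tika 
itself invokes tesseract.

```python
import subprocess


def build_tesseract_command(image_path, output_base, preserve_spaces=True):
    """Build the argument list: tesseract <image> <outputbase> [-c VAR=VALUE]."""
    cmd = ["tesseract", image_path, output_base]
    if preserve_spaces:
        # The option described in this issue; leaving it out disables it.
        cmd += ["-c", "preserve_interword_spaces=1"]
    return cmd


def run_ocr(image_path, output_base, preserve_spaces=True):
    """Run tesseract and return the path of the text file it writes.
    Raises CalledProcessError if tesseract exits with an error."""
    cmd = build_tesseract_command(image_path, output_base, preserve_spaces)
    subprocess.run(cmd, check=True)
    return output_base + ".txt"
```

Calling run_ocr("page.png", "out") would then write out.txt with the 
inter-word spacing kept.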



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
