[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529025#comment-17529025
 ] 

Tim Allison commented on TIKA-3571:
-----------------------------------

I'm now thinking that we need to add "page start" and "page end" parameters to 
the interface, as well as a "render it all" option.  I don't like this, but the 
PDFParser needs to be able to decide, after trying to extract the text, that it 
should run OCR on only a single page.  I don't want to render the full document 
if the user doesn't want the rendered images and OCR needs only one page.

The question is: is this too PDF-centric?  I don't think it's awful.  
Single-page formats should be fine, and for PPT/PPTX, page # = slide #.  
Thoughts?
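
A rough sketch of how a page-range request with a "render it all" option might 
look, in Java.  Everything here (PageRangeRequest, RENDER_ALL, singlePage, 
pagesToRender) is a made-up name for illustration, not Tika's actual API; page 
numbers are assumed to be 1-based and inclusive.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical request type for illustration only; not part of Tika.
final class PageRangeRequest {

    // "Render it all": an open-ended range that is clamped to the page count later.
    static final PageRangeRequest RENDER_ALL = new PageRangeRequest(1, Integer.MAX_VALUE);

    final int pageStart; // 1-based, inclusive (assumption)
    final int pageEnd;   // 1-based, inclusive (assumption)

    PageRangeRequest(int pageStart, int pageEnd) {
        if (pageStart < 1 || pageEnd < pageStart) {
            throw new IllegalArgumentException(
                    "invalid page range: " + pageStart + "-" + pageEnd);
        }
        this.pageStart = pageStart;
        this.pageEnd = pageEnd;
    }

    // Convenience for the "OCR only needs one page" case.
    static PageRangeRequest singlePage(int page) {
        return new PageRangeRequest(page, page);
    }

    // Resolve the request against a document with pageCount pages
    // (or slides, for PPT/PPTX). Pages past the end are dropped.
    List<Integer> pagesToRender(int pageCount) {
        List<Integer> pages = new ArrayList<>();
        int last = Math.min(pageEnd, pageCount);
        for (int p = pageStart; p <= last; p++) {
            pages.add(p);
        }
        return pages;
    }
}
```

With something like this, a parser could ask for PageRangeRequest.singlePage(n) 
when only one page needs OCR, and a single-page format would simply resolve 
RENDER_ALL to page 1.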

> Add an interface for rendering engines
> --------------------------------------
>
>                 Key: TIKA-3571
>                 URL: https://issues.apache.org/jira/browse/TIKA-3571
>             Project: Tika
>          Issue Type: Wish
>            Reporter: Tim Allison
>            Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
