[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529977#comment-17529977
 ] 

Nick Burch commented on TIKA-3571:
----------------------------------

Some formats support the concept of pages and we can pass that along (eg pdf, 
ppt). Some don't store page-related info in the file format, so we can't, no 
matter how much people might like us to (eg doc, rtf). Some don't have any 
real concept of a page / are only ever single page (eg jpg, mp3). Potentially 
there's also a category of formats which don't normally have a concept of a 
page until you try to print (eg xls, ods, CAD formats).

Paged formats are a bit of a special case, but in some systems also a common 
one!
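
As a rough illustration only, the categories above could be written down as 
an enum plus a lookup. Everything here is hypothetical: the names PageSupport, 
BY_EXTENSION and canReportPages are made up for this sketch and are not part 
of Tika, and real detection would go by media type rather than file extension.

```java
import java.util.Locale;
import java.util.Map;

final class PageSupportSketch {

    /** How a file format relates to the notion of a "page" (hypothetical). */
    enum PageSupport {
        STORED_IN_FORMAT,   // page info is in the file, eg pdf, ppt
        NOT_STORED,         // no page info in the file format, eg doc, rtf
        SINGLE_OR_NONE,     // no real concept of a page, eg jpg, mp3
        ONLY_WHEN_PRINTED   // pages only appear when printing, eg xls, ods
    }

    // Illustrative lookup keyed by file extension; an assumption, not a Tika API.
    static final Map<String, PageSupport> BY_EXTENSION = Map.of(
            "pdf", PageSupport.STORED_IN_FORMAT,
            "ppt", PageSupport.STORED_IN_FORMAT,
            "doc", PageSupport.NOT_STORED,
            "rtf", PageSupport.NOT_STORED,
            "jpg", PageSupport.SINGLE_OR_NONE,
            "mp3", PageSupport.SINGLE_OR_NONE,
            "xls", PageSupport.ONLY_WHEN_PRINTED,
            "ods", PageSupport.ONLY_WHEN_PRINTED);

    /** True only when page boundaries can be read from the file itself. */
    static boolean canReportPages(String extension) {
        PageSupport support = BY_EXTENSION.get(extension.toLowerCase(Locale.ROOT));
        return support == PageSupport.STORED_IN_FORMAT;
    }

    public static void main(String[] args) {
        for (String ext : new String[] {"pdf", "doc", "jpg", "xls"}) {
            System.out.println(ext + " -> " + canReportPages(ext));
        }
    }
}
```

A renderer interface could consult something like this to decide whether a 
per-page API makes sense for a given file, or whether it should fall back to 
treating the whole document as a single unit.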

> Add an interface for rendering engines
> --------------------------------------
>
>                 Key: TIKA-3571
>                 URL: https://issues.apache.org/jira/browse/TIKA-3571
>             Project: Tika
>          Issue Type: Wish
>            Reporter: Tim Allison
>            Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
