[ https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517641#comment-17517641 ]
Luís Filipe Nassif commented on TIKA-3571:
------------------------------------------

A possible implementation library, Apache licensed:
https://github.com/sbraconnier/jodconverter

> Add an interface for rendering engines
> --------------------------------------
>
>                 Key: TIKA-3571
>                 URL: https://issues.apache.org/jira/browse/TIKA-3571
>             Project: Tika
>          Issue Type: Wish
>            Reporter: Tim Allison
>            Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and
> certainly it might be useful to have alternatives for rendering files (e.g.
> this [Alfresco
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
> including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open
> an issue for discussion.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)