[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512428#comment-17512428
 ] 

Tim Allison commented on TIKA-3571:
---

I'm starting to hack on this now.  Given that images can be fairly large and 
given that there can be many pages/slides per document, it feels like we need 
to have the renderer write to local files. We'll need to do this anyways for 
ocr (at least with tesseract), etc., and wrappers around commandline renderers 
(e.g mutool) will naturally write to files.

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-03-26 Thread Jira


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512728#comment-17512728
 ] 

Luís Filipe Nassif commented on TIKA-3571:
--

If we are going to add a PDF renderer implementation using PDFBox, using a JRE 
>= 11 (or avoiding Oracle JRE 1.8), at least on Windows, should make a huge 
performance difference:

[https://github.com/sepinf-inc/IPED/issues/119#issuecomment-822670734]

I wonder if [~tilman] knows anything about java 8 low level font bottlenecks...

For office files, currently we are using LibreOffice by command line.

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-03-26 Thread Jira


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512729#comment-17512729
 ] 

Luís Filipe Nassif commented on TIKA-3571:
--

This JRE 8 bottleneck brought us a thread leak and then OOME:

https://github.com/sepinf-inc/IPED/issues/1032

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-03-26 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512730#comment-17512730
 ] 

Tilman Hausherr commented on TIKA-3571:
---

We don't use java font rendering, we have our own. I have observed some weird 
temporary slowness in my own multithreaded rendering regression tests but never 
investigated this.

PDFBox will always be one of the slowest rendering options available. I watched 
a presentation by Dr. Michael Vrhel (artifex) a few months ago who compared the 
rendering times, we or PDF.js were the slowest. I'm not really surprised, 
because the other ones tested were written in native code.

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-05 Thread Jira


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517641#comment-17517641
 ] 

Luís Filipe Nassif commented on TIKA-3571:
--

A possible implementation library, Apache licensed: 
https://github.com/sbraconnier/jodconverter

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517644#comment-17517644
 ] 

Tim Allison commented on TIKA-3571:
---

Oo...looks like everything has a target of png!

https://github.com/sbraconnier/jodconverter/wiki/Getting-Started

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517646#comment-17517646
 ] 

Tim Allison commented on TIKA-3571:
---

Hahahahahaha, they also have text and xhtml as output...  Maybe we can retire?

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517647#comment-17517647
 ] 

Tim Allison commented on TIKA-3571:
---

But seriously, this is embarrassing.  Do you have to have 
LibreOffice/OpenOffice installed or are the jars sufficient?

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-05 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517818#comment-17517818
 ] 

Nick Burch commented on TIKA-3571:
--

It has been a quite a while since I last used jodconverter, but the underlying 
OpenOffice would crash or infinite loop rather more often than you'd normally 
like. Docker and a restart watchdog ought to help with that though!

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-26 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528425#comment-17528425
 ] 

Tim Allison commented on TIKA-3571:
---

I've started work on this on our TIKA-3571 branch.  This is targeted for 2.5.0, 
not 2.4.0.

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529025#comment-17529025
 ] 

Tim Allison commented on TIKA-3571:
---

I'm now thinking that we need to add "page start" and "page end"  parameters to 
the interface as well as a "render it all" option.  I don't like this, but the 
need is that the PDFParser should be able to decide after trying to extract the 
text, that it needs to run OCR only on that one page.  I don't want to render 
the full document, if the user doesn't want the rendered images and OCR only 
needs one page.

The question is: is this too pdf centric?  I think it isn't awful.  If there 
are formats that are single paged, this should be ok.  PPT/PPTX page # = slide 
#.  Thoughts?

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529465#comment-17529465
 ] 

Tim Allison commented on TIKA-3571:
---

The other thing we need to account for is multiple renderings per page.  I'd 
rather not add this complexity from the beginning, but the API should be able 
to handle this.

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-29 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529972#comment-17529972
 ] 

Tim Allison commented on TIKA-3571:
---

And another bit of complexity: if someone wants to extract full color rendered 
images, but they also want to run OCR on a grayscale or otherwise manipulated 
image.

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-29 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529977#comment-17529977
 ] 

Nick Burch commented on TIKA-3571:
--

Some formats support the concept of pages and we can pass that along (eg pdf, 
ppt), some don't store page related info in the file format so we can't no 
matter how much people might like us to (eg doc, rtf), and some don't have any 
real concept of a page / are only ever single page (eg jpg, mp3). Potentially 
also the category of ones which don't normally have a concept of a page until 
you try to print (eg xls, ods, CAD formats)

Paged formats are a bit of a special case, but in some systems also a common 
one!

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-29 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530010#comment-17530010
 ] 

Tim Allison commented on TIKA-3571:
---

Thank you, [~nick]!

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-29 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530023#comment-17530023
 ] 

Tim Allison commented on TIKA-3571:
---

I'm now leaning towards a page+ output.  There can be multiple renderings per 
page (emf,wmf, etc).  Even for video or non-naturally-paged documents, when we 
go to render, we'll have one or more images. 

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-05-02 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530913#comment-17530913
 ] 

Tim Allison commented on TIKA-3571:
---

Planning for the future is complex. :P  Lol.  Video will be noisy if we go the 
page-based route.  Will think of something.

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-05-06 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533099#comment-17533099
 ] 

Hudson commented on TIKA-3571:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #545 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/545/])
TIKA-3571 -- rollback puppycrawl -- requires java > 8 (tallison: 
[https://github.com/apache/tika/commit/b4c1c033f2890b5e263ad395477336c9e4e90ee3])
* (edit) tika-parent/pom.xml


> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
> Fix For: 2.4.1
>
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-05-09 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533970#comment-17533970
 ] 

Hudson commented on TIKA-3571:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #551 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/551/])
TIKA-3571 -- cleanup (tallison: 
[https://github.com/apache/tika/commit/678898a9d2a0f56b7639c245f0cd9b1842ccddd2])
* (delete) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/renderer/pdf/PDFRenderingState.java
* (edit) tika-core/src/main/java/org/apache/tika/metadata/Rendering.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/renderer/pdf/pdfbox/TextOnlyPDFRenderer.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/renderer/pdf/pdfbox/PDFRenderingState.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/renderer/pdf/pdfbox/PDFBoxRenderer.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/ImageGraphicsEngine.java
* (delete) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/VectorGraphicsOnlyPDFRenderer.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/OCR2XHTML.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFMarkedContent2XHTML.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/renderer/pdf/pdfbox/VectorGraphicsOnlyPDFRenderer.java
* (edit) tika-core/src/main/java/org/apache/tika/metadata/PagedText.java
* (delete) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/renderer/pdf/PDDocumentRenderer.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFRenderingTest.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/renderer/pdf/pdfbox/NoTextPDFRenderer.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* (delete) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/renderer/pdf/MuPDFRenderer.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/renderer/pdf/pdfbox/PDDocumentRenderer.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/resources/test-documents/testPDF_rotated.pdf
* (add) tika-core/src/main/java/org/apache/tika/metadata/TikaPagedText.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
* (delete) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/TextOnlyPDFRenderer.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/resources/org/apache/tika/parser/pdf/tika-rendering-config.xml
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/renderer/pdf/mutool/MuPDFRenderer.java
* (edit) 
tika-core/src/main/java/org/apache/tika/renderer/PageBasedRenderResults.java
* (delete) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/renderer/pdf/PDFBoxRenderer.java
* (delete) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/NoTextPDFRenderer.java


> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish