[jira] [Created] (TIKA-2623) get embedded resources in doc files

2018-04-02 Thread Ohad R (JIRA)
Ohad R created TIKA-2623:


 Summary: get embedded resources in doc files
 Key: TIKA-2623
 URL: https://issues.apache.org/jira/browse/TIKA-2623
 Project: Tika
  Issue Type: Improvement
  Components: core, parser
Reporter: Ohad R


According to 
https://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-files-using-apache-tika
it is possible to recursively parse a document and save its sub-items (e.g. 
images) in a folder thanks to FileEmbeddedDocumentExtractor. However, that 
class is only accessible from within TikaCLI.

I think it should be visible to applications that use Tika (not only to 
the CLI).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2620) Set sys property to get better rendering speed by default

2018-04-02 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422231#comment-16422231
 ] 

Luis Filipe Nassif commented on TIKA-2620:
--

Hi [~tilman]. When printing PDFs to images before OCR, our default is to use 
300dpi. If the image is bigger than that, it will be scaled down at the end. 
Reading PDFBOX-4137, I understood images will be subsampled before being 
decoded and not when rendering, possibly saving lots of memory, or am I wrong?

Thanks

> Set sys property to get better rendering speed by default
> -
>
> Key: TIKA-2620
> URL: https://issues.apache.org/jira/browse/TIKA-2620
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Trivial
> Fix For: 1.18, 2.0.0
>
>
> After upgrading to PDFBox 2.0.9, we now get a logged warning:
> {noformat}
> INFO  To get higher rendering speed on JDK8 or later,
> INFO    use the option -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
> INFO    or call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider")
> {noformat}
> Unless there are objections, I'll add a static call to the PDFParser to 
> {{System.setProperty...}}.  
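
A minimal sketch of the static call proposed above, assuming it should not 
override a value that was already supplied with -D on the command line. The 
class name CmmPropertyDefaults and the helper applyIfUnset are hypothetical 
and are not taken from Tika's PDFParser; only the property key and value come 
from the PDFBox log message.

```java
public final class CmmPropertyDefaults {
    // Key and value quoted from the PDFBox 2.0.9 log message.
    static final String KEY = "sun.java2d.cmm";
    static final String KCMS = "sun.java2d.cmm.kcms.KcmsServiceProvider";

    private CmmPropertyDefaults() {
    }

    /** Sets the property only when it is unset; returns true if it was set here. */
    static boolean applyIfUnset() {
        if (System.getProperty(KEY) != null) {
            return false; // respect an explicit -Dsun.java2d.cmm=... choice
        }
        System.setProperty(KEY, KCMS);
        return true;
    }

    public static void main(String[] args) {
        boolean changed = applyIfUnset();
        System.out.println(KEY + "=" + System.getProperty(KEY)
                + " (set here: " + changed + ")");
    }
}
```

The property is presumably only consulted when Java2D first loads its colour 
management module, so a call like this would need to run early, for example 
from a static initializer in the parser class.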





[jira] [Commented] (TIKA-2620) Set sys property to get better rendering speed by default

2018-04-02 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422239#comment-16422239
 ] 

Tilman Hausherr commented on TIKA-2620:
---

The subsampling is when decoding, but this would influence rendering, 
obviously. The worst case would be a fine horizontal or vertical line that 
could be missing.



[jira] [Comment Edited] (TIKA-2620) Set sys property to get better rendering speed by default

2018-04-02 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422239#comment-16422239
 ] 

Tilman Hausherr edited comment on TIKA-2620 at 4/2/18 1:13 PM:
---

The subsampling is when decoding, but this would influence rendering, 
obviously. The worst case would be a fine horizontal or vertical line that 
could be missing. Yes, this saves memory because the decoded images are smaller.
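
A toy illustration of both points, assuming the simplest possible subsampling 
(keep every second pixel in each direction). This is not PDFBox's decoder; the 
class SubsampleDemo and its helpers are hypothetical and only model the effect 
on a small integer grid.

```java
import java.util.Arrays;

public final class SubsampleDemo {
    private SubsampleDemo() {
    }

    /** Keeps every step-th pixel in both directions (nearest-neighbour decimation). */
    static int[][] subsample(int[][] src, int step) {
        int h = (src.length + step - 1) / step;
        int w = (src[0].length + step - 1) / step;
        int[][] out = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                out[y][x] = src[y * step][x * step];
            }
        }
        return out;
    }

    /** True if any pixel is non-zero, i.e. some "ink" survived. */
    static boolean hasInk(int[][] img) {
        for (int[] row : img) {
            for (int p : row) {
                if (p != 0) {
                    return true;
                }
            }
        }
        return false;
    }

    /** An 8x8 blank image with a one-pixel-high horizontal line on the given row. */
    static int[][] imageWithLine(int row) {
        int[][] img = new int[8][8];
        Arrays.fill(img[row], 1);
        return img;
    }

    public static void main(String[] args) {
        int[][] small = subsample(imageWithLine(3), 2);
        System.out.println("pixels after: " + small.length * small[0].length);
        System.out.println("line survived: " + hasInk(small));
    }
}
```

With a step of 2 only rows 0, 2, 4 and 6 are kept, so a line on row 3 
disappears while the decoded grid shrinks from 64 to 16 pixels. A line on an 
even row would survive, which is why the loss is a worst case rather than the 
rule.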


was (Author: tilman):
The subsampling is when decoding, but this would influence rendering, 
obviously. The worst case would be a fine horizontal or vertical line that 
could be missing.



1.17 vs 1.18-SNAPSHOT reports finally available

2018-04-02 Thread Allison, Timothy B.
http://162.242.228.174/reports/reports_tika_1.17_vs_1.18-SNAPSHOT.zip

I haven't had a chance to look closely.

Timothy B. Allison, Ph.D.
Principal Artificial Intelligence Engineer
T835/Human Language Technology
The MITRE Corporation
7515 Colshire Drive, McLean, VA  22102
703-983-2473 (phone); 703-983-1379 (fax)




RE: 1.17 vs 1.18-SNAPSHOT reports finally available

2018-04-02 Thread Allison, Timothy B.
My quick take is that this basically looks good on exceptions, page counts, 
metadata counts, mime changes.

As for content:

1) lots more content in ppt probably because of the grouped text box fix
2) some more duplicate content in xls esp when embedded in ppt because we're 
now extracting labels (e.g. ZT52GNT25LXKSWMZRPQ2R56Y2CYG6HVS)
3) rfc822 content is hard to compare because we're now actually parsing and 
inlining the content where possible instead of treating as attachments so the 
alignment of embedded docs is somewhat challenging/hosed...I'll manually review 
a few more to see what I find.
4) we're doing better on a few handfuls of html files with better handling of 
"charset='unicode'"

Unless there are objections, I'll roll the RC tomorrow after reviewing a few 
more of the rfc822 files.

Cheers,

  Tim





RE: 1.17 vs 1.18-SNAPSHOT reports finally available

2018-04-02 Thread Allison, Timothy B.
Updated with common tokens comparison table by mime in contents/ directory

http://162.242.228.174/reports/reports_tika_1.17_vs_1.18-SNAPSHOT-b.zip




[jira] [Created] (TIKA-2624) Rendering PDFs for OCR with Tesseract uses different DPI than claimed

2018-04-02 Thread Ewan Mellor (JIRA)
Ewan Mellor created TIKA-2624:
-

 Summary: Rendering PDFs for OCR with Tesseract uses different DPI 
than claimed
 Key: TIKA-2624
 URL: https://issues.apache.org/jira/browse/TIKA-2624
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.17
Reporter: Ewan Mellor


Tika has two properties in `PDFParser.properties` that control what happens in 
AbstractPDF2XHTML when a PDF is rendered before being passed to Tesseract for 
OCR.  These are `ocrDPI` (default 300) and `ocrImageScale` (default 2.0).

`ocrDPI` is passed to `ImageIOUtil.writeImage`, which uses it as the metadata 
in the image (i.e. it doesn't control scaling at all, it's just an advertised 
metadata field).

`ocrImageScale` is passed to PDFBox's `PDFRenderer.renderImage`, which uses it 
to specify the scale for rendering.  This value is such that 1.0 == 72dpi, and 
therefore Tika's default is to request 144dpi for rendering.
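
The conversion described above can be written down as a small sketch. The 
class RenderScale and its method names are hypothetical; the only fact taken 
from the description is that a scale of 1.0 corresponds to 72dpi.

```java
public final class RenderScale {
    /** A scale of 1.0 corresponds to 72 dpi. */
    static final float DPI_AT_SCALE_ONE = 72f;

    private RenderScale() {
    }

    /** Resolution that results from rendering at the given scale. */
    static float scaleToDpi(float scale) {
        return scale * DPI_AT_SCALE_ONE;
    }

    /** Scale that would be needed to render at the given resolution. */
    static float dpiToScale(float dpi) {
        return dpi / DPI_AT_SCALE_ONE;
    }

    public static void main(String[] args) {
        System.out.println("scale 2.0 renders at " + scaleToDpi(2.0f) + " dpi");
        System.out.println("300 dpi needs scale " + dpiToScale(300f));
    }
}
```

Under this model the default ocrImageScale of 2.0 gives 144dpi, and matching 
the default ocrDPI of 300 would need a scale of roughly 4.17.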

This means that Tika is asking PDFBox to render at 144dpi, and then advertising 
300dpi in the image metadata.  This makes no sense to me, and is surely going 
to confuse Tesseract.

Instead of doing this, we should remove `ocrImageScale`, and use the same DPI 
value for rendering as we advertise in the image metadata.

We should keep the existing default DPI value, since Tesseract is trained at 
300dpi by default, so this will mean that all stages between PDFRenderer and 
Tesseract are defaulting to 300dpi.

This change will have the side-effect that the temporary images between the PDF 
rendering and Tesseract will be 4x larger (144dpi to 300dpi).  This will have a 
memory and temporary disk space impact, but I think that it's still best to 
have the whole pipeline using 300dpi.  People who have memory constraints will 
need to reduce ocrDPI and make the corresponding changes on the Tesseract side.
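
As a rough check of the size estimate, here is the pixel arithmetic for one 
page. The page size (US Letter, 612 x 792 points) is an assumption chosen for 
illustration, and the class PageRasterSize is hypothetical. Only raw pixel 
counts are compared; the size of a compressed temporary file will not scale 
exactly the same way.

```java
public final class PageRasterSize {
    // Assumed page size: US Letter, 8.5 x 11 inches at 72 points per inch.
    static final double WIDTH_PT = 612.0;
    static final double HEIGHT_PT = 792.0;

    private PageRasterSize() {
    }

    /** Number of pixels in the rendered page at the given resolution. */
    static long pixels(int dpi) {
        long w = Math.round(WIDTH_PT * dpi / 72.0);
        long h = Math.round(HEIGHT_PT * dpi / 72.0);
        return w * h;
    }

    /** How many times more pixels toDpi produces than fromDpi. */
    static double growth(int fromDpi, int toDpi) {
        return (double) pixels(toDpi) / pixels(fromDpi);
    }

    public static void main(String[] args) {
        System.out.println("144 dpi: " + pixels(144) + " pixels");
        System.out.println("300 dpi: " + pixels(300) + " pixels");
        System.out.printf("growth: %.2fx%n", growth(144, 300));
    }
}
```

The pixel count grows with the square of the DPI ratio, (300/144)^2, which is 
about 4.3 and so in line with the "4x larger" figure.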

 





[jira] [Updated] (TIKA-2624) Rendering PDFs for OCR with Tesseract uses different DPI than claimed

2018-04-02 Thread Ewan Mellor (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ewan Mellor updated TIKA-2624:
--
Description: 
Tika has two properties in {{PDFParser.properties}} that control what happens 
in AbstractPDF2XHTML when a PDF is rendered before being passed to Tesseract 
for OCR.  These are {{ocrDPI}} (default 300) and {{ocrImageScale}} (default 
2.0).

{{ocrDPI}} is passed to {{ImageIOUtil.writeImage}}, which uses it as the 
metadata in the image (i.e. it doesn't control scaling at all, it's just an 
advertised metadata field).

{{ocrImageScale}} is passed to PDFBox's {{PDFRenderer.renderImage}}, which uses 
it to specify the scale for rendering.  This value is such that 1.0 == 72dpi, 
and therefore Tika's default is to request 144dpi for rendering.

This means that Tika is asking PDFBox to render at 144dpi, and then advertising 
300dpi in the image metadata.  This makes no sense to me, and is surely going 
to confuse Tesseract.

Instead of doing this, we should remove {{ocrImageScale}}, and use the same DPI 
value in both places.

We should keep the existing default DPI value, since Tesseract is trained at 
300dpi by default, so this will mean that all stages between PDFRenderer and 
Tesseract are defaulting to 300dpi.

This change will have the side-effect that the temporary images between the PDF 
rendering and Tesseract will be 4x larger (144dpi to 300dpi).  This will have a 
memory and temporary disk space impact, but I think that it's still best to 
have the whole pipeline using 300dpi.  People who have memory constraints will 
need to reduce ocrDPI and make the corresponding changes on the Tesseract side.

 


 


> Rendering PDFs for OCR with Tesseract uses different DPI than claimed
> -
>
> Key: TIKA-2624
> URL: https://issues.apache.org/jira/browse/TIKA-2624
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.17
> Reporter: Ewan Mellor
> Priority: Major
>

[jira] [Commented] (TIKA-2620) Set sys property to get better rendering speed by default

2018-04-02 Thread Ewan Mellor (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422984#comment-16422984
 ] 

Ewan Mellor commented on TIKA-2620:
---

See TIKA-2624.  I think that the statement re 300 DPI from [~lfcnassif] is not 
quite correct and it's more complicated than it's meant to be.




[jira] [Assigned] (TIKA-2624) Rendering PDFs for OCR with Tesseract uses different DPI than claimed

2018-04-02 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reassigned TIKA-2624:
-

Assignee: Tim Allison



[jira] [Commented] (TIKA-2624) Rendering PDFs for OCR with Tesseract uses different DPI than claimed

2018-04-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16423005#comment-16423005
 ] 

ASF GitHub Bot commented on TIKA-2624:
--

ewanmellor opened a new pull request #232: Fix for TIKA-2624 contributed by 
ewanmellor.
URL: https://github.com/apache/tika/pull/232
 
 
   Change AbstractPDF2XHTML.doOCROnCurrentPage to use the same DPI value
   (PDFParserConfig.ocrDPI) for both the PDF rendering and the image metadata.
   
   Previously, the PDF was rendered using ocrImageScale (default 2.0 ==
   144dpi), and ocrDPI (default 300) was then written to the image metadata.
   Having these two things be independent makes no sense, and is surely going
   to confuse Tesseract when the image metadata does not match the data.
   
   This change means that ocrDPI drives both values, and ocrImageScale is
   removed.  This also switches from PDFRenderer.renderImage to
   PDFRenderer.renderImageWithDPI, but that's just a stub to make it clearer
   what's going on.
   
   This change will have the side-effect that the temporary images between the
   PDF rendering and Tesseract will be 4x larger (144dpi to 300dpi).  This will
   have a memory and temporary disk space impact, but it will ensure that the
   whole pipeline uses 300dpi by default.  People who have memory constraints
   will need to reduce ocrDPI and make the corresponding changes on the
   Tesseract side.
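
The difference between the old and new behaviour described in the pull request 
can be modelled in a few lines. This is a sketch only: the class OcrRenderPlan 
is hypothetical and does not reproduce the code in AbstractPDF2XHTML. It 
assumes, as stated in the issue, that a render scale of 1.0 equals 72dpi.

```java
public final class OcrRenderPlan {
    final float renderDpi;   // resolution the page is actually rasterised at
    final int metadataDpi;   // resolution written into the image metadata

    private OcrRenderPlan(float renderDpi, int metadataDpi) {
        this.renderDpi = renderDpi;
        this.metadataDpi = metadataDpi;
    }

    /** Old behaviour as described: scale and DPI are configured independently. */
    static OcrRenderPlan legacy(float ocrImageScale, int ocrDPI) {
        return new OcrRenderPlan(ocrImageScale * 72f, ocrDPI);
    }

    /** Proposed behaviour: a single DPI value drives both. */
    static OcrRenderPlan unified(int ocrDPI) {
        return new OcrRenderPlan(ocrDPI, ocrDPI);
    }

    /** True when the advertised DPI matches the rendered DPI. */
    boolean isConsistent() {
        return Math.abs(renderDpi - metadataDpi) < 0.5f;
    }

    public static void main(String[] args) {
        System.out.println("legacy defaults consistent:  " + legacy(2.0f, 300).isConsistent());
        System.out.println("unified defaults consistent: " + unified(300).isConsistent());
    }
}
```

With the old defaults the plan renders at 144dpi but advertises 300dpi, so it 
is inconsistent; with a single value the two can no longer drift apart.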


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (TIKA-2624) Rendering PDFs for OCR with Tesseract uses different DPI than claimed

2018-04-02 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16423026#comment-16423026
 ] 

Tim Allison commented on TIKA-2624:
---

Thank you for opening this and submitting a PR.  I ran into this on an image 
degradation study, where I (mis?)remembered that the first time I used dpi I 
was actually getting different images; in short, it felt like it worked.  
However, a year or so later, with a different version of PDFBox, I found that 
dpi didn't work at all, and I had to rely on the scale instead, as you 
helpfully explain above.  I wonder if what I experienced was a difference 
between PDFBox 1.8.x and 2.0.x?  Or did dpi actually work on TIFF on my first 
run but doesn't work on JPEG?  Or this could have just been faulty memory...




[jira] [Commented] (TIKA-2624) Rendering PDFs for OCR with Tesseract uses different DPI than claimed

2018-04-02 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16423041#comment-16423041
 ] 

Luis Filipe Nassif commented on TIKA-2624:
--

Wow, that is a major bug, not sure when it was introduced. Thank you, 
[~ewanmellor-2]!



[jira] [Commented] (TIKA-2624) Rendering PDFs for OCR with Tesseract uses different DPI than claimed

2018-04-02 Thread Ewan Mellor (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16423040#comment-16423040
 ] 

Ewan Mellor commented on TIKA-2624:
---

There were definitely changes between 1.8 and 2.0, e.g. PDFBOX-1963.  I think 
it's always been 1 == 72dpi though; I see that in their doc-comments dating 
back to 2014.  Of course, it could easily have been buggy back then.






[jira] [Commented] (TIKA-2624) Rendering PDFs for OCR with Tesseract uses different DPI than claimed

2018-04-02 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16423051#comment-16423051
 ] 

Tim Allison commented on TIKA-2624:
---

I'm happy to remove {{setOcrImageScale}} throughout in 2.0.0.  Are we ok with 
deprecating it in 1.18 and making it a no-op?  Should we log a warning about 
the no-op in 1.18, or leave the deprecation and documentation as the only 
warning?
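
One way the "deprecate, no-op and warn" option could look is sketched below. 
The class PdfOcrConfigSketch is hypothetical and stands in for the real 
configuration class; the use of java.util.logging and the wording of the 
warning are assumptions made only to keep the example self-contained.

```java
import java.util.logging.Logger;

public class PdfOcrConfigSketch {
    private static final Logger LOG =
            Logger.getLogger(PdfOcrConfigSketch.class.getName());

    private int ocrDPI = 300;

    public int getOcrDPI() {
        return ocrDPI;
    }

    public void setOcrDPI(int ocrDPI) {
        this.ocrDPI = ocrDPI;
    }

    /**
     * Kept only so that existing callers still compile.
     *
     * @deprecated no longer has any effect; use {@link #setOcrDPI(int)} instead.
     */
    @Deprecated
    public void setOcrImageScale(float ocrImageScale) {
        // No-op: the value is ignored, the caller is only told about it.
        LOG.warning("setOcrImageScale(" + ocrImageScale
                + ") is ignored; use setOcrDPI instead");
    }

    /** The scale handed to the renderer is derived from ocrDPI alone. */
    public float getRenderScale() {
        return ocrDPI / 72f;
    }
}
```

Existing callers keep compiling and see a deprecation warning at build time 
plus a log line at run time, while the rendering resolution is controlled by 
ocrDPI only.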



[jira] [Commented] (TIKA-2624) Rendering PDFs for OCR with Tesseract uses different DPI than claimed

2018-04-02 Thread Ewan Mellor (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16423108#comment-16423108
 ] 

Ewan Mellor commented on TIKA-2624:
---

[~talli...@mitre.org] I don't know what your release policy / schedule is.  If 
2.0.0 is coming soon, I wouldn't bother with 1.18, just treat this as a 
breaking change for 2.0.  Tesseract 4 is such a big change, anyone doing new 
OCR work right now should be on the bleeding edge (IMHO) and you can just keep 
1.x for stability.  It's really up to you though.

If you're not planning to release 2.0 for a while, then a logged warning in 
1.18 makes sense.

