[jira] [Commented] (TIKA-2518) tika app outputs warnings by default

2017-12-27 Thread Ewan Mellor (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304958#comment-16304958 ] Ewan Mellor commented on TIKA-2518: --- It looks like this is being fixed under TIKA-2490.

[jira] [Created] (TIKA-2581) testOCROutputsHOCR fails with Tesseract 4.0

2018-02-21 Thread Ewan Mellor (JIRA)
Ewan Mellor created TIKA-2581: - Summary: testOCROutputsHOCR fails with Tesseract 4.0 Key: TIKA-2581 URL: https://issues.apache.org/jira/browse/TIKA-2581 Project: Tika Issue Type: Bug Co

[jira] [Updated] (TIKA-2581) testOCROutputsHOCR fails with Tesseract 4.0

2018-02-21 Thread Ewan Mellor (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ewan Mellor updated TIKA-2581: -- Description: TesseractOCRParserTest.testOCROutputsHOCR fails with Tesseract 4.0. With 3.x, the output is

[jira] [Created] (TIKA-2582) Tesseract 4.0 includes a FF character by default, breaking parsers

2018-02-21 Thread Ewan Mellor (JIRA)
Ewan Mellor created TIKA-2582: - Summary: Tesseract 4.0 includes a FF character by default, breaking parsers Key: TIKA-2582 URL: https://issues.apache.org/jira/browse/TIKA-2582 Project: Tika Issu

[jira] [Updated] (TIKA-2582) Tesseract 4.0 includes a FF character by default, breaking parsers

2018-02-21 Thread Ewan Mellor (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ewan Mellor updated TIKA-2582: -- Description: Tesseract 4.0 includes a change to use form feed characters to separate pages by default in

[jira] [Created] (TIKA-2580) SafeContentHandler documentation is incorrect about replacement character

2018-02-21 Thread Ewan Mellor (JIRA)
Ewan Mellor created TIKA-2580: - Summary: SafeContentHandler documentation is incorrect about replacement character Key: TIKA-2580 URL: https://issues.apache.org/jira/browse/TIKA-2580 Project: Tika

[jira] [Created] (TIKA-2583) Tika readme should mention builds.apache.org

2018-02-21 Thread Ewan Mellor (JIRA)
Ewan Mellor created TIKA-2583: - Summary: Tika readme should mention builds.apache.org Key: TIKA-2583 URL: https://issues.apache.org/jira/browse/TIKA-2583 Project: Tika Issue Type: Bug C

[jira] [Created] (TIKA-2584) Tika should have a way to pass arbitrary Tesseract options

2018-02-21 Thread Ewan Mellor (JIRA)
Ewan Mellor created TIKA-2584: - Summary: Tika should have a way to pass arbitrary Tesseract options Key: TIKA-2584 URL: https://issues.apache.org/jira/browse/TIKA-2584 Project: Tika Issue Type: I

[jira] [Commented] (TIKA-2583) Tika readme should mention builds.apache.org

2018-02-21 Thread Ewan Mellor (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372074#comment-16372074 ] Ewan Mellor commented on TIKA-2583: --- I wasn't trying to tell users where to find builds,

[jira] [Created] (TIKA-2586) PDFParser documentation has incorrect DPI default

2018-02-21 Thread Ewan Mellor (JIRA)
Ewan Mellor created TIKA-2586: - Summary: PDFParser documentation has incorrect DPI default Key: TIKA-2586 URL: https://issues.apache.org/jira/browse/TIKA-2586 Project: Tika Issue Type: Improvemen

[jira] [Created] (TIKA-2613) Tesseract 4.0 has removed -psm, so Tika must update

2018-03-26 Thread Ewan Mellor (JIRA)
Ewan Mellor created TIKA-2613: - Summary: Tesseract 4.0 has removed -psm, so Tika must update Key: TIKA-2613 URL: https://issues.apache.org/jira/browse/TIKA-2613 Project: Tika Issue Type: Improvem

[jira] [Updated] (TIKA-2613) Tesseract 4.0 has removed -psm, so Tika must update

2018-03-26 Thread Ewan Mellor (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ewan Mellor updated TIKA-2613: -- Description: Tesseract 4.0 (currently in beta-1) has removed the -psm flag, in favor of --psm
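
To illustrate the flag change described in TIKA-2613, here is a minimal sketch of choosing the page-segmentation flag by Tesseract major version. It is not Tika's actual fix. The function names `psm_args` and `build_command`, the file names, and the assumption that releases before 4.0 are invoked with the single-dash spelling are all illustrative.

```python
from typing import List


def psm_args(major_version: int, page_seg_mode: int) -> List[str]:
    """Return the page-segmentation arguments for a Tesseract command line.

    Hypothetical helper: the caller is assumed to already know the
    installed Tesseract major version.
    """
    # Per the issue text, 4.0 drops the single-dash flag in favor of "--psm".
    # Assumption: older releases are invoked with the single-dash spelling.
    flag = "--psm" if major_version >= 4 else "-psm"
    return [flag, str(page_seg_mode)]


def build_command(major_version: int, image_path: str, output_base: str,
                  page_seg_mode: int = 1) -> List[str]:
    """Assemble an argument list suitable for subprocess.run (not executed here)."""
    return ["tesseract", image_path, output_base] + psm_args(major_version, page_seg_mode)


if __name__ == "__main__":
    print(build_command(4, "page.png", "out"))
    print(build_command(3, "page.png", "out"))
```

Building the command as an argument list rather than a single string avoids shell quoting problems if the image path contains spaces.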

[jira] [Commented] (TIKA-2620) Set sys property to get better rendering speed by default

2018-03-29 Thread Ewan Mellor (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419501#comment-16419501 ] Ewan Mellor commented on TIKA-2620: --- [https://bugs.openjdk.java.net/browse/JDK-8041125]

[jira] [Commented] (TIKA-2613) Tesseract 4.0 has removed -psm, so Tika must update

2018-03-29 Thread Ewan Mellor (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419752#comment-16419752 ] Ewan Mellor commented on TIKA-2613: --- Build failures were not this change; they were from 

[jira] [Commented] (TIKA-2582) Tesseract 4.0 includes a FF character by default, breaking parsers

2018-03-29 Thread Ewan Mellor (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419798#comment-16419798 ] Ewan Mellor commented on TIKA-2582: --- Build failures were not this change; they were from 

[jira] [Created] (TIKA-2624) Rendering PDFs for OCR with Tesseract uses different DPI than claimed

2018-04-02 Thread Ewan Mellor (JIRA)
Ewan Mellor created TIKA-2624: - Summary: Rendering PDFs for OCR with Tesseract uses different DPI than claimed Key: TIKA-2624 URL: https://issues.apache.org/jira/browse/TIKA-2624 Project: Tika I

[jira] [Updated] (TIKA-2624) Rendering PDFs for OCR with Tesseract uses different DPI than claimed

2018-04-02 Thread Ewan Mellor (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ewan Mellor updated TIKA-2624: -- Description: Tika has two properties in {{PDFParser.properties}} that control what happens in AbstractPD

[jira] [Commented] (TIKA-2620) Set sys property to get better rendering speed by default

2018-04-02 Thread Ewan Mellor (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422984#comment-16422984 ] Ewan Mellor commented on TIKA-2620: --- See TIKA-2624. I think that the statement re 300 DP

[jira] [Commented] (TIKA-2624) Rendering PDFs for OCR with Tesseract uses different DPI than claimed

2018-04-02 Thread Ewan Mellor (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16423040#comment-16423040 ] Ewan Mellor commented on TIKA-2624: --- There were definitely changes between 1.8 and 2.0, e

[jira] [Commented] (TIKA-2624) Rendering PDFs for OCR with Tesseract uses different DPI than claimed

2018-04-02 Thread Ewan Mellor (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16423108#comment-16423108 ] Ewan Mellor commented on TIKA-2624: --- [~talli...@mitre.org] I don't know what your release

[jira] [Created] (TIKA-2651) tika-translate jar contains duplicate classes from tika-core jar

2018-05-24 Thread Ewan Mellor (JIRA)
Ewan Mellor created TIKA-2651: - Summary: tika-translate jar contains duplicate classes from tika-core jar Key: TIKA-2651 URL: https://issues.apache.org/jira/browse/TIKA-2651 Project: Tika Issue