[jira] [Created] (TIKA-3272) Improve Rotation handling

2021-01-13 Thread Peter Kronenberg (Jira)
Peter Kronenberg created TIKA-3272: -- Summary: Improve Rotation handling Key: TIKA-3272 URL: https://issues.apache.org/jira/browse/TIKA-3272 Project: Tika Issue Type: Improvement

[jira] [Created] (TIKA-3273) Further metadat cleanup for TIka 2.0.0

2021-01-13 Thread Tim Allison (Jira)
Tim Allison created TIKA-3273: - Summary: Further metadat cleanup for TIka 2.0.0 Key: TIKA-3273 URL: https://issues.apache.org/jira/browse/TIKA-3273 Project: Tika Issue Type: Improvement

[jira] [Updated] (TIKA-3273) Further metadata cleanup for TIka 2.0.0

2021-01-13 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3273: -- Summary: Further metadata cleanup for TIka 2.0.0 (was: Further metadat cleanup for TIka 2.0.0) > Furth

[jira] [Created] (TIKA-3274) Tika 2.0.0 -- Move parser specific metadata out of tika-core to parser modules

2021-01-13 Thread Tim Allison (Jira)
Tim Allison created TIKA-3274: - Summary: Tika 2.0.0 -- Move parser specific metadata out of tika-core to parser modules Key: TIKA-3274 URL: https://issues.apache.org/jira/browse/TIKA-3274 Project: Tika

[jira] [Resolved] (TIKA-3273) Further metadata cleanup for TIka 2.0.0

2021-01-13 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3273. --- Fix Version/s: 2.0.0 Resolution: Fixed > Further metadata cleanup for TIka 2.0.0 >

[jira] [Commented] (TIKA-3270) Render non-text in PDFs for OCR

2021-01-13 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264228#comment-17264228 ] Luís Filipe Nassif commented on TIKA-3270: -- Is checking for missing ToUnicode map

Looking for PR code review for DWG parser changes

2021-01-13 Thread Nicholas DiPiazza
Looking for code review of: https://github.com/apache/tika/pull/395 This addresses TIKA-1735 and it also adds the ability for the dwg parser to utilize the LibreDWG library if it is configured. The DWG reading code is much too vast and complex to hope to port to Java. So similar to how we do tes

droste.zip

2021-01-13 Thread Tim Allison
All, Some corporate virus detectors are removing droste.zip (a zip quine). We should probably make the tests that rely on that and our gz quine optional, or test that the file exists? Cheers, Tim

[jira] [Commented] (TIKA-3273) Further metadata cleanup for TIka 2.0.0

2021-01-13 Thread Hudson (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264280#comment-17264280 ] Hudson commented on TIKA-3273: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #1

OCR testing

2021-01-13 Thread Peter Kronenberg
What is the difference between TesseractOCRParserTest in tika-parser-ocr-module and tika-parsers-classic-package?

Re: OCR testing

2021-01-13 Thread Tim Allison
If a unit test requires parsers outside of its own module, we've moved those tests to tika-parsers-classic-package. On Wed, Jan 13, 2021 at 1:32 PM Peter Kronenberg wrote: > What is the difference between TesseractOCRParserTest in > tika-parser-ocr-module and tika-parsers-classic-package? > >

Re: droste.zip

2021-01-13 Thread Tim Allison
I added assume statements. Will push shortly. On Wed, Jan 13, 2021 at 11:38 AM Tim Allison wrote: > All, > Some corporate virus detectors are removing droste.zip (a zip quine). > We should probably make the tests that rely on that and our gz quine > optional, or test that the file exists? > >
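For readers following this thread, the guard being discussed can be sketched roughly as below. This is an illustration only, not the change that was pushed: the class name `TestFileGuard`, the method `isAvailable`, and the default file location are hypothetical. In a JUnit test the boolean would typically be handed to an assumption (for example JUnit 4's `Assume.assumeTrue`), so that a missing file skips the test instead of failing it; the sketch sticks to the standard library so it runs on its own.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not taken from the Tika code base.
class TestFileGuard {

    /** Returns true when the test document is present as a readable regular file. */
    static boolean isAvailable(Path testFile) {
        return Files.isRegularFile(testFile) && Files.isReadable(testFile);
    }

    public static void main(String[] args) {
        // Assumed location; pass the real path of the quine as the first argument.
        Path quine = Paths.get(args.length > 0 ? args[0] : "droste.zip");
        if (!isAvailable(quine)) {
            // A virus scanner may have removed the file: skip rather than fail.
            System.out.println("skipping: " + quine + " not found");
            return;
        }
        System.out.println("running test against " + quine);
    }
}
```

Checking for the file at the start of each affected test keeps the rest of the suite usable on machines where the quine files have been quarantined.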

Re: Looking for PR code review for DWG parser changes

2021-01-13 Thread Tim Allison
Nicholas, I'm really grateful for your PR. Once I roll 2.0.0-ALPHA, I'll have time to take a look. I'm out a bit next week...so might not be until towards the end of next week. If there are other devs who want to take this, please do. Please don't take my lack of response as a failure of

[jira] [Commented] (TIKA-3258) Run OCR on PDFs with 'auto' mode as default in Tika 2.0.0

2021-01-13 Thread Hudson (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264425#comment-17264425 ] Hudson commented on TIKA-3258: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #1

[jira] [Resolved] (TIKA-3271) Change default image resize size in TesseractParser's pre-processing step

2021-01-13 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3271. --- Fix Version/s: 2.0.0 Resolution: Fixed > Change default image resize size in TesseractParser's

[jira] [Resolved] (TIKA-3258) Run OCR on PDFs with 'auto' mode as default in Tika 2.0.0

2021-01-13 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3258. --- Fix Version/s: 2.0.0 Assignee: Tim Allison Resolution: Fixed PDFs are now OCR'd with '

[jira] [Commented] (TIKA-3271) Change default image resize size in TesseractParser's pre-processing step

2021-01-13 Thread Hudson (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264449#comment-17264449 ] Hudson commented on TIKA-3271: -- UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk8 #

[jira] [Commented] (TIKA-3271) Change default image resize size in TesseractParser's pre-processing step

2021-01-13 Thread Hudson (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264466#comment-17264466 ] Hudson commented on TIKA-3271: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #1

Re: Looking for PR code review for DWG parser changes

2021-01-13 Thread Nicholas DiPiazza
Definitely take your time! No pressure from my end, and I appreciate all that you do for this project! On Wed, Jan 13, 2021 at 2:48 PM Tim Allison wrote: > Nicholas, > > I'm really grateful for your PR. Once I roll 2.0.0-ALPHA, I'll have time > to take a look. I'm out a bit next week...so mi

Fwd: Python dependency

2021-01-13 Thread Peter Kronenberg
Any thoughts on this? Wondering if I can totally remove the Python dependency, or whether we still need it? From: Peter Kronenberg Sent: Wednesday, January 13, 2021, 11:20 AM To: talli...@apache.org Subject: Python dependency So I see that there are other Python script

[VOTE] Release Apache Tika 2.0.0-ALPHA Candidate #1

2021-01-13 Thread Tim Allison
All, A candidate for the Tika 2.0.0-ALPHA release is available at: https://dist.apache.org/repos/dist/dev/tika/ The release candidate is a zip archive of the sources in: https://github.com/apache/tika/tree/2.0.0-ALPHA-rc1/ The SHA-512 checksum of the archive is ae018f4384d2cd63281422cc82ec7
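Anyone checking the candidate before voting may want to compare the SHA-512 checksum quoted in the message with one computed locally. The sketch below shows one way to do that with the JDK's `MessageDigest`. The class name `Sha512Check` and its method names are hypothetical, and both the archive path and the expected checksum are supplied by the caller; the snippet above is truncated, so neither is filled in here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for checking a downloaded release archive.
class Sha512Check {

    /** Streams the file through SHA-512 and returns the digest as lowercase hex. */
    static String sha512Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-512");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Compares the computed digest with the published one, ignoring case and padding. */
    static boolean matches(Path file, String expectedHex)
            throws IOException, NoSuchAlgorithmException {
        return sha512Hex(file).equalsIgnoreCase(expectedHex.trim());
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: Sha512Check <archive> <expected-sha512-hex>");
            return;
        }
        boolean ok = matches(Paths.get(args[0]), args[1]);
        System.out.println(ok ? "checksum OK" : "checksum MISMATCH");
    }
}
```

Reading the file in chunks keeps memory use flat regardless of archive size. A matching checksum only shows the download is intact; verifying the release signature is a separate step.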

Re: Python dependency

2021-01-13 Thread Tim Allison
IMHO, we should remove it entirely from the tesseract module. The advancedmedia module can handle finding it/configuring it/executing it. Or, longer term, as Nick proposed, we can have a centralized "common external commands" configuration somehow through TikaConfig...but that is for later. As I'

[jira] [Commented] (TIKA-3258) Run OCR on PDFs with 'auto' mode as default in Tika 2.0.0

2021-01-13 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264526#comment-17264526 ] Tim Allison commented on TIKA-3258: --- Updated the {{main}} branch and pushed this into 2.

[GitHub] [tika] PeterAlfredLee closed pull request #333: Adds github action CI builds on Ubuntu

2021-01-13 Thread GitBox
PeterAlfredLee closed pull request #333: URL: https://github.com/apache/tika/pull/333 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go t

[GitHub] [tika] PeterAlfredLee closed pull request #380: Add two tests for `OPCPackageDetector`

2021-01-13 Thread GitBox
PeterAlfredLee closed pull request #380: URL: https://github.com/apache/tika/pull/380

[GitHub] [tika] PeterAlfredLee closed pull request #382: Simplify some code in OPCPackageDetector#detect

2021-01-13 Thread GitBox
PeterAlfredLee closed pull request #382: URL: https://github.com/apache/tika/pull/382

[GitHub] [tika] PeterAlfredLee merged pull request #369: Use IOException instead of IOExceptionWithCause

2021-01-13 Thread GitBox
PeterAlfredLee merged pull request #369: URL: https://github.com/apache/tika/pull/369