Re: [VOTE] Release Apache Tika 2.0.0 Candidate #1
+1

On Fri, Jul 16, 2021 at 10:00 PM Tilman Hausherr wrote:
> +1
>
> Tilman
>
> On 14.07.2021 at 20:16, Tim Allison wrote:
> > All,
> >
> > A candidate for the Tika 2.0.0 release is available at:
> > https://dist.apache.org/repos/dist/dev/tika/2.0.0
> >
> > The release candidate is a zip archive of the sources in:
> > https://github.com/apache/tika/tree/2.0.0-rc1/
> >
> > The SHA-512 checksum of the archive is
> > 31d1f2e3deb54c398fa2d4bf00c434aad3f08387debf2a34dabe6d36747bcc49f2874cbd3abe7d1209670db8284ea540bca3b574ccd1d6b8f8675bdc3f704568.
> >
> > In addition, a staged maven repository is available here:
> > https://repository.apache.org/content/repositories/orgapachetika-1070
> >
> > Please vote on releasing this package as Apache Tika 2.0.0.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 2.0.0
> > [ ] -1 Do not release this package because...
> >
> > Here's my +1.
> >
> > Cheers,
> >
> > Tim
Re: [VOTE] Release Apache Tika 2.0.0 Candidate #1
+1

Tilman

On 14.07.2021 at 20:16, Tim Allison wrote:
> All,
>
> A candidate for the Tika 2.0.0 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/2.0.0
>
> The release candidate is a zip archive of the sources in:
> https://github.com/apache/tika/tree/2.0.0-rc1/
>
> The SHA-512 checksum of the archive is
> 31d1f2e3deb54c398fa2d4bf00c434aad3f08387debf2a34dabe6d36747bcc49f2874cbd3abe7d1209670db8284ea540bca3b574ccd1d6b8f8675bdc3f704568.
>
> In addition, a staged maven repository is available here:
> https://repository.apache.org/content/repositories/orgapachetika-1070
>
> Please vote on releasing this package as Apache Tika 2.0.0.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 2.0.0
> [ ] -1 Do not release this package because...
>
> Here's my +1.
>
> Cheers,
>
> Tim
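For anyone checking the candidate before voting, the SHA-512 comparison mentioned in the vote mail can be sketched in plain Java roughly as follows. This is only an illustration: the class name `Sha512Check` and its helper methods are made up for this note and are not part of Tika or its release tooling, and the archive path is whatever local file you downloaded.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Tika: compares a file's SHA-512 with a published value.
public class Sha512Check {

    // Render a digest as lowercase hex, the form used in the vote mail.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Stream the input through the digest so large archives are not loaded into memory.
    static String sha512(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-512");
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        return toHex(md.digest());
    }

    // True if the archive's digest equals the expected hex string (case-insensitive).
    static boolean matches(Path archive, String expectedHex)
            throws IOException, NoSuchAlgorithmException {
        try (InputStream in = Files.newInputStream(archive)) {
            return sha512(in).equalsIgnoreCase(expectedHex.trim());
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the downloaded archive, args[1]: checksum from the vote mail
        System.out.println(matches(Paths.get(args[0]), args[1]) ? "OK" : "MISMATCH");
    }
}
```

When pasting the checksum from the mail, drop the trailing period that ends the sentence, since it is not part of the hex string.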
Re: Fwd: [jira] [Resolved] (TIKA-3485) testBadJVMArgs fails on Windows
On 16.07.2021 at 21:47, Tim Allison wrote:
> I can respin 2.0.0-rc2 on Monday if this is a non-starter for 2.0.0-rc1.

I don't think this is needed; the two issues I fixed are for build tests only, and only on Windows. Seems I'm the only one here who builds on Windows.

Tilman

> Has anyone else had a chance to give 2.0.0-rc1 a spin?
>
> Thank you, Tilman.
>
> Cheers,
>
> Tim
>
> -- Forwarded message -
> From: Tilman Hausherr (Jira)
> Date: Fri, Jul 16, 2021 at 2:55 PM
> Subject: [jira] [Resolved] (TIKA-3485) testBadJVMArgs fails on Windows
> To:
>
> [ https://issues.apache.org/jira/browse/TIKA-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Tilman Hausherr resolved TIKA-3485.
> ---
> Resolution: Fixed
>
> testBadJVMArgs fails on Windows
> ---
> Key: TIKA-3485
> URL: https://issues.apache.org/jira/browse/TIKA-3485
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 2.0.0
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Priority: Minor
> Fix For: 2.0.1
>
> testBadJVMArgs fails on Windows because the exit value is -1 instead of 255, so I'll adjust this.
> (I mentioned this some time ago but can't remember where, and I remember that I looked at the logs that it does indeed fail because of the bad args)
>
> -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (TIKA-3361) Improve intelligence of OCRStrategy=AUTO
[ https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-3361.
---
Fix Version/s: 2.0.1
Resolution: Fixed

Thank you [~peterkronenberg] for your patience on this one. More remains to be done with PDFs and OCR'ing, but this looks great to me. Thank you.

> Improve intelligence of OCRStrategy=AUTO
> ---
>
> Key: TIKA-3361
> URL: https://issues.apache.org/jira/browse/TIKA-3361
> Project: Tika
> Issue Type: Improvement
> Reporter: Peter Kronenberg
> Priority: Major
> Fix For: 2.0.1
>
> Didn’t get a whole lot of feedback on the mailing list, so here’s my attempt at improving OCRStrategy=Auto
> Currently, this strategy performs the following test
> {code:java}
> if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10) {
>     doOCROnCurrentPage(AUTO);
> }
> {code}
> I added a way to change the new numbers involved: the threshold for the total characters per page (below which, we OCR the page), and the threshold for unmapped characters (above which we OCR the page)
> My main concern is with the unmapped characters. OCR adds a lot of overhead, which might not be necessary for simply a few unmapped characters
> I added a new config, *OCRStrategyAuto*, which is only used if OCRStrategy=AUTO. Its format is
> {code:java}
> ocrStrategyAuto = best|fast|m[%], n
> {code}
> ‘best’ and ‘fast’ are shortcuts. More later
> m, n – m is the threshold for the number of unmapped characters per page. It can also be specified as a percentage. So, m=20 means if your page has more than 20 unmapped characters, it will OCR. m=20% means if the unmapped characters are more than 20% of the total characters, then it will OCR.
> n is the threshold for the total number of characters on the page. n does not need to be specified and defaults to 10
> {code:java}
> 20
> {code}
> is equivalent to
> {code:java}
> 20, 10
> {code}
> *best* is shorthand for *20,10*
> {code:java}
> best
> {code}
> is equivalent to
> {code:java}
> 20, 10
> {code}
> *best* is the default and is equivalent to the current behavior
> *fast* is a shortcut for *10%, 10*, which will avoid OCR unless the number of unmapped characters is greater than 10%
> {code:java}
> fast
> {code}
> is equivalent to
> {code:java}
> 10%, 10
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
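To make the `best|fast|m[%], n` format from the issue description concrete, here is a minimal sketch of how such a value could be parsed and applied. It follows only the rules stated in the description; the class `OcrAutoThresholds` and its methods are hypothetical names chosen for this illustration and are not the code merged in pull request #447 (the real logic lives in `PDFParserConfig` and `AbstractPDF2XHTML`).

```java
import java.util.Locale;

// Hypothetical sketch of the ocrStrategyAuto value described in TIKA-3361; not Tika's code.
public class OcrAutoThresholds {
    final double unmappedThreshold;   // a count, or a fraction of total chars if isPercent
    final boolean isPercent;
    final int totalCharsThreshold;    // pages with fewer chars than this are OCR'ed

    OcrAutoThresholds(double unmappedThreshold, boolean isPercent, int totalCharsThreshold) {
        this.unmappedThreshold = unmappedThreshold;
        this.isPercent = isPercent;
        this.totalCharsThreshold = totalCharsThreshold;
    }

    // Parse "best", "fast", "m", "m%", "m, n" or "m%, n"; n defaults to 10.
    static OcrAutoThresholds parse(String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        if (v.equals("best")) {
            v = "20, 10";        // shortcut given in the issue
        } else if (v.equals("fast")) {
            v = "10%, 10";       // shortcut given in the issue
        }
        String[] parts = v.split(",");
        String m = parts[0].trim();
        boolean percent = m.endsWith("%");
        if (percent) {
            m = m.substring(0, m.length() - 1).trim();
        }
        double mValue = Double.parseDouble(m);
        int n = parts.length > 1 ? Integer.parseInt(parts[1].trim()) : 10;
        return new OcrAutoThresholds(percent ? mValue / 100.0 : mValue, percent, n);
    }

    // OCR when the page has too few characters or too many unmapped ones.
    boolean shouldOcr(int totalCharsPerPage, int unmappedCharsPerPage) {
        if (totalCharsPerPage < totalCharsThreshold) {
            return true;
        }
        if (isPercent) {
            return (double) unmappedCharsPerPage / totalCharsPerPage > unmappedThreshold;
        }
        return unmappedCharsPerPage > unmappedThreshold;
    }

    public static void main(String[] args) {
        OcrAutoThresholds fast = parse("fast");
        System.out.println(fast.shouldOcr(1000, 50));   // 5% unmapped, above n
        System.out.println(fast.shouldOcr(1000, 150));  // 15% unmapped
    }
}
```

With `fast`, a 1000-character page with 50 unmapped characters is left alone, while the same page under an absolute threshold such as `20` would be sent to OCR; that is the overhead trade-off the description is concerned with.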
[jira] [Commented] (TIKA-3361) Improve intelligence of OCRStrategy=AUTO
[ https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382345#comment-17382345 ]

Hudson commented on TIKA-3361:
---

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #285 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/285/])
TIKA-3361 Make ocrStrategy=Auto more intelligent (#447) (github: [https://github.com/apache/tika/commit/484a340a4643ed2335413ba4feddbe8d64f4e9d8])
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java

> Improve intelligence of OCRStrategy=AUTO
> ---
>
> Key: TIKA-3361
> URL: https://issues.apache.org/jira/browse/TIKA-3361
> Project: Tika
> Issue Type: Improvement
> Reporter: Peter Kronenberg
> Priority: Major

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3485) testBadJVMArgs fails on Windows
[ https://issues.apache.org/jira/browse/TIKA-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382344#comment-17382344 ] Hudson commented on TIKA-3485: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #285 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/285/]) TIKA-3485: expect -1 on windows (tilman: [https://github.com/apache/tika/commit/5a497b9b32fac32efed1b15f0d7c890a0e884617]) * (edit) tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaServerIntegrationTest.java > testBadJVMArgs fails on Windows > --- > > Key: TIKA-3485 > URL: https://issues.apache.org/jira/browse/TIKA-3485 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 2.0.0 >Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 2.0.1 > > > testBadJVMArgs fails on Windows because the exit value is -1 instead of 255, > so I'll adjust this. > (I mentioned this some time ago but can't remember where, and I remember that > I looked at the logs that it does indeed fail because of the bad args) -- This message was sent by Atlassian Jira (v8.3.4#803005)
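A note on why -1 and 255 can both show up for the same failure: on POSIX systems the parent only sees the low 8 bits of a child's exit status, so a child that exits with -1 is reported as 255, whereas Windows keeps the full 32-bit value and reports -1. Whether that is exactly what happens in `testBadJVMArgs` is an assumption here; the commit itself simply expects -1 on Windows. A platform-neutral comparison could look like the sketch below, where `ExitCodeSketch` and its methods are hypothetical names and not part of `TikaServerIntegrationTest`.

```java
// Hypothetical helper, not Tika code: compares exit values independent of platform width.
public class ExitCodeSketch {

    // Keep only the low 8 bits, which is what a POSIX parent process observes.
    static int toPosixStatus(int rawExitValue) {
        return rawExitValue & 0xFF;
    }

    // True if two exit values denote the same status once both are reduced to 8 bits.
    static boolean sameExit(int expected, int observed) {
        return toPosixStatus(expected) == toPosixStatus(observed);
    }

    public static void main(String[] args) {
        System.out.println(sameExit(-1, 255)); // value as seen on Linux
        System.out.println(sameExit(-1, -1));  // value as seen on Windows
    }
}
```

A test written this way would not need a separate branch per operating system, at the cost of treating any two values that agree in their low 8 bits as equal.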
Fwd: [jira] [Resolved] (TIKA-3485) testBadJVMArgs fails on Windows
I can respin 2.0.0-rc2 on Monday if this is a non-starter for 2.0.0-rc1. Has anyone else had a chance to give 2.0.0-rc1 a spin? Thank you, Tilman. Cheers, Tim -- Forwarded message - From: Tilman Hausherr (Jira) Date: Fri, Jul 16, 2021 at 2:55 PM Subject: [jira] [Resolved] (TIKA-3485) testBadJVMArgs fails on Windows To: [ https://issues.apache.org/jira/browse/TIKA-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-3485. --- Resolution: Fixed > testBadJVMArgs fails on Windows > --- > > Key: TIKA-3485 > URL: https://issues.apache.org/jira/browse/TIKA-3485 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 2.0.0 >Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 2.0.1 > > > testBadJVMArgs fails on Windows because the exit value is -1 instead of 255, > so I'll adjust this. > (I mentioned this some time ago but can't remember where, and I remember that > I looked at the logs that it does indeed fail because of the bad args) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3361) Improve intelligence of OCRStrategy=AUTO
[ https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382306#comment-17382306 ]

ASF GitHub Bot commented on TIKA-3361:
---

tballison merged pull request #447:
URL: https://github.com/apache/tika/pull/447

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Improve intelligence of OCRStrategy=AUTO
> ---
>
> Key: TIKA-3361
> URL: https://issues.apache.org/jira/browse/TIKA-3361
> Project: Tika
> Issue Type: Improvement
> Reporter: Peter Kronenberg
> Priority: Major

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [tika] tballison merged pull request #447: TIKA-3361 Make ocrStrategy=Auto more intelligent
tballison merged pull request #447: URL: https://github.com/apache/tika/pull/447 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (TIKA-3484) TikaPipesOpenSearchTest: java.lang.IllegalArgumentException: "basePath" directory does not exist
[ https://issues.apache.org/jira/browse/TIKA-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382302#comment-17382302 ]

Hudson commented on TIKA-3484:
---

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #284 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/284/])
TIKA-3484: escape second parameter so that it works on Windows 10 (tilman: [https://github.com/apache/tika/commit/29ec5a0c01c21977670f3d3224cf5c4e618ef32f])
* (edit) tika-integration-tests/tika-pipes-opensearch-integration-tests/src/test/java/org/apache/tika/pipes/opensearch/tests/TikaPipesOpenSearchTest.java

> TikaPipesOpenSearchTest: java.lang.IllegalArgumentException: "basePath" directory does not exist
> ---
>
> Key: TIKA-3484
> URL: https://issues.apache.org/jira/browse/TIKA-3484
> Project: Tika
> Issue Type: Bug
> Components: tika-pipes
> Affects Versions: 2.0.0
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Priority: Minor
> Fix For: 2.0.1
>
> I've been trying to build "main" on Windows 10, and got this:
> java.lang.RuntimeException: java.lang.IllegalArgumentException: "basePath" directory does not exist: X\YYJavatika-maintikatika-integration-teststika-pipes-opensearch-integration-teststargettest-files
>     at org.apache.tika.pipes.opensearch.tests.TikaPipesOpenSearchTest.runPipes(TikaPipesOpenSearchTest.java:129)
>     at org.apache.tika.pipes.opensearch.tests.TikaPipesOpenSearchTest.testFSToOpenSearch(TikaPipesOpenSearchTest.java:96)
> Caused by: java.lang.IllegalArgumentException: "basePath" directory does not exist: X\YYJavatika-maintikatika-integration-teststika-pipes-opensearch-integration-teststargettest-files
> The cause is that the file tika\tika-integration-tests\tika-pipes-opensearch-integration-tests\target\ta-opensearch.xml has two base paths that don't exist. It contains my path but without any "/" or "\".
> The root cause is that .replaceAll needs some escaping in the second parameter.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
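The root cause named in TIKA-3484 is a general property of `String.replaceAll`: its second argument is a replacement pattern in which a backslash escapes the next character and `$` introduces a group reference, so a Windows path passed in unescaped loses its separators. The sketch below reproduces that effect and shows two standard ways around it. The placeholder `{BASE_PATH}`, the sample path and the class name are invented for the illustration; the actual change is in the commit linked above.

```java
import java.util.regex.Matcher;

// Illustration only: placeholder name and paths are assumptions, not taken from the Tika test.
public class ReplaceAllEscaping {

    // Unescaped replacement: each backslash in the path is consumed as an escape character.
    static String naive(String template, String path) {
        return template.replaceAll("\\{BASE_PATH\\}", path);
    }

    // Matcher.quoteReplacement makes backslashes and dollar signs literal.
    static String escaped(String template, String path) {
        return template.replaceAll("\\{BASE_PATH\\}", Matcher.quoteReplacement(path));
    }

    // String.replace does no regex processing at all, which is enough for a fixed placeholder.
    static String literal(String template, String path) {
        return template.replace("{BASE_PATH}", path);
    }

    public static void main(String[] args) {
        String template = "<basePath>{BASE_PATH}</basePath>";
        String path = "C:\\work\\tika"; // the string C:\work\tika
        System.out.println(naive(template, path));   // separators are dropped
        System.out.println(escaped(template, path)); // path is preserved
        System.out.println(literal(template, path)); // path is preserved
    }
}
```

The first call prints the path with every backslash removed, which matches the shape of the broken `basePath` value in the stack trace.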
[jira] [Commented] (TIKA-3482) Improve handling of FetchException in pipes processor
[ https://issues.apache.org/jira/browse/TIKA-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382303#comment-17382303 ] Hudson commented on TIKA-3482: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #284 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/284/]) TIKA-3482 -- improve handling of fetch exceptions, add basic logging to tika-app -a (tallison: [https://github.com/apache/tika/commit/dd5f49fc5ac751a8aa67e29e4c4c6963ca8ea65e]) * (edit) tika-core/src/main/java/org/apache/tika/pipes/PipesResult.java * (edit) tika-core/src/main/java/org/apache/tika/pipes/async/AsyncProcessor.java * (add) tika-core/src/test/resources/test-documents/subdir/example.xml * (edit) tika-core/src/main/java/org/apache/tika/pipes/PipesClient.java * (edit) tika-core/src/test/java/org/apache/tika/pipes/pipesiterator/FileSystemPipesIteratorTest.java * (edit) tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java * (edit) tika-core/src/main/java/org/apache/tika/pipes/PipesServer.java > Improve handling of FetchException in pipes processor > - > > Key: TIKA-3482 > URL: https://issues.apache.org/jira/browse/TIKA-3482 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > > In the current implementation, if there's a fetch exception, that causes the > forked process to restart. We should transmit that exception back to the > forking process and not restart. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (TIKA-3485) testBadJVMArgs fails on Windows
[ https://issues.apache.org/jira/browse/TIKA-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-3485. --- Resolution: Fixed > testBadJVMArgs fails on Windows > --- > > Key: TIKA-3485 > URL: https://issues.apache.org/jira/browse/TIKA-3485 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 2.0.0 >Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 2.0.1 > > > testBadJVMArgs fails on Windows because the exit value is -1 instead of 255, > so I'll adjust this. > (I mentioned this some time ago but can't remember where, and I remember that > I looked at the logs that it does indeed fail because of the bad args) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (TIKA-3485) testBadJVMArgs fails on Windows
Tilman Hausherr created TIKA-3485: - Summary: testBadJVMArgs fails on Windows Key: TIKA-3485 URL: https://issues.apache.org/jira/browse/TIKA-3485 Project: Tika Issue Type: Bug Components: core Affects Versions: 2.0.0 Reporter: Tilman Hausherr Assignee: Tilman Hausherr Fix For: 2.0.1 testBadJVMArgs fails on Windows because the exit value is -1 instead of 255, so I'll adjust this. (I mentioned this some time ago but can't remember where, and I remember that I looked at the logs that it does indeed fail because of the bad args) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3361) Improve intelligence of OCRStrategy=AUTO
[ https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382256#comment-17382256 ]

Peter Kronenberg commented on TIKA-3361:
---

Finally got a chance to finish this Pull Request

> Improve intelligence of OCRStrategy=AUTO
> ---
>
> Key: TIKA-3361
> URL: https://issues.apache.org/jira/browse/TIKA-3361
> Project: Tika
> Issue Type: Improvement
> Reporter: Peter Kronenberg
> Priority: Major

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3361) Improve intelligence of OCRStrategy=AUTO
[ https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382254#comment-17382254 ]

ASF GitHub Bot commented on TIKA-3361:
---

peterkronenberg opened a new pull request #447:
URL: https://github.com/apache/tika/pull/447

Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! Your help is appreciated!

Before opening the pull request, please verify that
* there is an open issue on the [Tika issue tracker](https://issues.apache.org/jira/projects/TIKA) which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes.
* the issue ID (`TIKA-`)
  - is referenced in the title of the pull request
  - and placed in front of your commit messages surrounded by square brackets (`[TIKA-] Issue or pull request title`)
* commits are squashed into a single one (or few commits for larger changes)
* Tika is successfully built and unit tests pass by running `mvn clean test`
* there should be no conflicts when merging the pull request branch into the *recent* `main` branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled `main` branch.

We will be able to faster integrate your pull request if these conditions are met. If you have any questions how to fix your problem or about using Tika in general, please sign up for the [Tika mailing list](http://tika.apache.org/mail-lists.html). Thanks!

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Improve intelligence of OCRStrategy=AUTO
> ---
>
> Key: TIKA-3361
> URL: https://issues.apache.org/jira/browse/TIKA-3361
> Project: Tika
> Issue Type: Improvement
> Reporter: Peter Kronenberg
> Priority: Major

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [tika] peterkronenberg opened a new pull request #447: TIKA-3361 Make ocrStrategy=Auto more intelligent
peterkronenberg opened a new pull request #447: URL: https://github.com/apache/tika/pull/447 Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! Your help is appreciated! Before opening the pull request, please verify that * there is an open issue on the [Tika issue tracker](https://issues.apache.org/jira/projects/TIKA) which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes. * the issue ID (`TIKA-`) - is referenced in the title of the pull request - and placed in front of your commit messages surrounded by square brackets (`[TIKA-] Issue or pull request title`) * commits are squashed into a single one (or few commits for larger changes) * Tika is successfully built and unit tests pass by running `mvn clean test` * there should be no conflicts when merging the pull request branch into the *recent* `main` branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled `main` branch. We will be able to faster integrate your pull request if these conditions are met. If you have any questions how to fix your problem or about using Tika in general, please sign up for the [Tika mailing list](http://tika.apache.org/mail-lists.html). Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Resolved] (TIKA-3484) TikaPipesOpenSearchTest: java.lang.IllegalArgumentException: "basePath" directory does not exist
[ https://issues.apache.org/jira/browse/TIKA-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-3484. --- Resolution: Fixed > TikaPipesOpenSearchTest: java.lang.IllegalArgumentException: "basePath" > directory does not exist > > > Key: TIKA-3484 > URL: https://issues.apache.org/jira/browse/TIKA-3484 > Project: Tika > Issue Type: Bug > Components: tika-pipes >Affects Versions: 2.0.0 >Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 2.0.1 > > > I've been trying to build "main" on windows 10, and got this: > java.lang.RuntimeException: java.lang.IllegalArgumentException: "basePath" > directory does not exist: > X\YYJavatika-maintikatika-integration-teststika-pipes-opensearch-integration-teststargettest-files > at > org.apache.tika.pipes.opensearch.tests.TikaPipesOpenSearchTest.runPipes(TikaPipesOpenSearchTest.java:129) > at > org.apache.tika.pipes.opensearch.tests.TikaPipesOpenSearchTest.testFSToOpenSearch(TikaPipesOpenSearchTest.java:96) > Caused by: java.lang.IllegalArgumentException: "basePath" directory does not > exist: > X\YYJavatika-maintikatika-integration-teststika-pipes-opensearch-integration-teststargettest-files > The cause is that the file > tika\tika-integration-tests\tika-pipes-opensearch-integration-tests\target\ta-opensearch.xml > have two basepaths that doesn't exist. It contains my path but without any > "/" or "\". > The root cause is that .replaceAll needs some escaping in the second > parameter. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (TIKA-3484) TikaPipesOpenSearchTest: java.lang.IllegalArgumentException: "basePath" directory does not exist
Tilman Hausherr created TIKA-3484:
---
Summary: TikaPipesOpenSearchTest: java.lang.IllegalArgumentException: "basePath" directory does not exist
Key: TIKA-3484
URL: https://issues.apache.org/jira/browse/TIKA-3484
Project: Tika
Issue Type: Bug
Components: tika-pipes
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
Fix For: 2.0.1

I've been trying to build "main" on Windows 10, and got this:

java.lang.RuntimeException: java.lang.IllegalArgumentException: "basePath" directory does not exist: X\YYJavatika-maintikatika-integration-teststika-pipes-opensearch-integration-teststargettest-files
    at org.apache.tika.pipes.opensearch.tests.TikaPipesOpenSearchTest.runPipes(TikaPipesOpenSearchTest.java:129)
    at org.apache.tika.pipes.opensearch.tests.TikaPipesOpenSearchTest.testFSToOpenSearch(TikaPipesOpenSearchTest.java:96)
Caused by: java.lang.IllegalArgumentException: "basePath" directory does not exist: X\YYJavatika-maintikatika-integration-teststika-pipes-opensearch-integration-teststargettest-files

The cause is that the file tika\tika-integration-tests\tika-pipes-opensearch-integration-tests\target\ta-opensearch.xml has two base paths that don't exist. It contains my path but without any "/" or "\".

The root cause is that .replaceAll needs some escaping in the second parameter.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3483) Implement a network policy for Helm Chart
[ https://issues.apache.org/jira/browse/TIKA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382166#comment-17382166 ] ASF GitHub Bot commented on TIKA-3483: -- lewismc edited a comment on pull request #5: URL: https://github.com/apache/tika-helm/pull/5#issuecomment-881547144 Hi @bynare the NetworkPolicy looks to be fine thanks. It helps other developers understand the impact of this PR if we describe it. For example, > This pull request proposes to create a network policy to restrict traffic to pods within the same namespace that include the label `-client: true` e.g. `tika-client: true` Thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement a network policy for Helm Chart > - > > Key: TIKA-3483 > URL: https://issues.apache.org/jira/browse/TIKA-3483 > Project: Tika > Issue Type: Improvement > Components: helm >Reporter: Lewis John McGibbney >Priority: Major > Fix For: 2.0.0 > > > See https://github.com/apache/tika-helm/pull/5 for context -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [tika-helm] lewismc edited a comment on pull request #5: [TIKA-3483] Implement a network policy for Helm Chart
lewismc edited a comment on pull request #5:
URL: https://github.com/apache/tika-helm/pull/5#issuecomment-881547144

Hi @bynare the NetworkPolicy looks to be fine thanks. It helps other developers understand the impact of this PR if we describe it. For example,

> This pull request proposes to create a network policy to restrict traffic to pods within the same namespace that include the label `-client: true` e.g. `tika-client: true`

Thanks
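The rule described in the suggested PR text (traffic is allowed only from pods in the same namespace that carry the client label) can be modelled as a plain predicate. This is a rough sketch for readers, not the Kubernetes API and not the contents of templates/networkpolicy.yaml: the class, method, and parameter names are hypothetical, and the label key `tika-client` is the example given in the comment.

```java
import java.util.Map;

// Hypothetical model of the described rule; it does not call any Kubernetes API.
final class NetworkPolicySketch {

    // Returns true when the source pod is in the policy's namespace
    // and carries the label <labelKey>: "true".
    static boolean ingressAllowed(String sourceNamespace,
                                  Map<String, String> sourceLabels,
                                  String policyNamespace,
                                  String labelKey) {
        return policyNamespace.equals(sourceNamespace)
                && "true".equals(sourceLabels.get(labelKey));
    }

    public static void main(String[] args) {
        Map<String, String> client = Map.of("tika-client", "true");
        Map<String, String> other = Map.of("app", "web");

        // Same namespace, label present: allowed.
        System.out.println(ingressAllowed("docs", client, "docs", "tika-client"));   // true
        // Same namespace, label missing: blocked.
        System.out.println(ingressAllowed("docs", other, "docs", "tika-client"));    // false
        // Label present but different namespace: blocked.
        System.out.println(ingressAllowed("other", client, "docs", "tika-client"));  // false
    }
}
```

The namespace check reflects the "within the same namespace" wording of the comment; whether the chart expresses it through a pod selector alone or with an additional namespace selector is not stated in this thread.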
[jira] [Commented] (TIKA-3483) Implement a network policy for Helm Chart
[ https://issues.apache.org/jira/browse/TIKA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382165#comment-17382165 ]

ASF GitHub Bot commented on TIKA-3483:
--------------------------------------

lewismc commented on a change in pull request #5:
URL: https://github.com/apache/tika-helm/pull/5#discussion_r671359490

## File path: templates/networkpolicy.yaml ##
@@ -0,0 +1,23 @@
+{{- if .Values.networkPolicy.enabled }}

Review comment:
We need an Apache License v2 header.
[GitHub] [tika-helm] lewismc commented on a change in pull request #5: [TIKA-3483] Implement a network policy for Helm Chart
lewismc commented on a change in pull request #5:
URL: https://github.com/apache/tika-helm/pull/5#discussion_r671359490

## File path: templates/networkpolicy.yaml ##
@@ -0,0 +1,23 @@
+{{- if .Values.networkPolicy.enabled }}

Review comment:
We need an Apache License v2 header.
[jira] [Commented] (TIKA-3483) Implement a network policy for Helm Chart
[ https://issues.apache.org/jira/browse/TIKA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382163#comment-17382163 ]

ASF GitHub Bot commented on TIKA-3483:
--------------------------------------

lewismc commented on pull request #5:
URL: https://github.com/apache/tika-helm/pull/5#issuecomment-881547144

Hi @bynare the NetworkPolicy looks to be fine thanks. Can you provide some further context on the pull request for the other developers?

Thanks
[GitHub] [tika-helm] lewismc commented on pull request #5: [TIKA-3483] Implement a network policy for Helm Chart
lewismc commented on pull request #5:
URL: https://github.com/apache/tika-helm/pull/5#issuecomment-881547144

Hi @bynare the NetworkPolicy looks to be fine thanks. Can you provide some further context on the pull request for the other developers?

Thanks
[jira] [Updated] (TIKA-3483) Implement a network policy for Helm Chart
[ https://issues.apache.org/jira/browse/TIKA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated TIKA-3483:
---------------------------------------

Summary: Implement a network policy for Helm Chart (was: Implement a network policy)
[jira] [Created] (TIKA-3483) Implement a network policy
Lewis John McGibbney created TIKA-3483:
------------------------------------------

Summary: Implement a network policy
Key: TIKA-3483
URL: https://issues.apache.org/jira/browse/TIKA-3483
Project: Tika
Issue Type: Improvement
Components: helm
Reporter: Lewis John McGibbney
Fix For: 2.0.0

See https://github.com/apache/tika-helm/pull/5 for context