Re: [VOTE] Release Apache Tika 2.0.0 Candidate #1

2021-07-16 Thread Nicholas DiPiazza
+1

On Fri, Jul 16, 2021 at 10:00 PM Tilman Hausherr 
wrote:

> +1
>
> Tilman
>
> On 14.07.2021 at 20:16, Tim Allison wrote:
> > All,
> >    A candidate for the Tika 2.0.0 release is available at:
> >    https://dist.apache.org/repos/dist/dev/tika/2.0.0
> >
> >    The release candidate is a zip archive of the sources in:
> >    https://github.com/apache/tika/tree/2.0.0-rc1/
> >
> >    The SHA-512 checksum of the archive is
> >
> > 31d1f2e3deb54c398fa2d4bf00c434aad3f08387debf2a34dabe6d36747bcc49f2874cbd3abe7d1209670db8284ea540bca3b574ccd1d6b8f8675bdc3f704568.
> >
> >    In addition, a staged maven repository is available here:
> >
> > https://repository.apache.org/content/repositories/orgapachetika-1070
> >
> >    Please vote on releasing this package as Apache Tika 2.0.0.
> >    The vote is open for the next 72 hours and passes if a majority of at
> >    least three +1 Tika PMC votes are cast.
> >
> >    [ ] +1 Release this package as Apache Tika 2.0.0
> >    [ ] -1 Do not release this package because...
> >
> > Here's my +1.
> >
> > Cheers,
> >
> >    Tim
>
>
>
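
The pass condition quoted above can be read as a small predicate. What follows is a minimal sketch, assuming "majority" means that the binding +1 votes outnumber the binding -1 votes; the class and method names are hypothetical and not part of Tika or any ASF tooling.

```java
// Hypothetical helper for illustration only; not part of Tika or ASF tooling.
final class ReleaseVote {
    private ReleaseVote() {}

    /**
     * Returns true when at least three binding +1 votes were cast and the
     * +1 votes outnumber the -1 votes (the assumed reading of "majority").
     */
    static boolean passes(int bindingPlusOnes, int bindingMinusOnes) {
        if (bindingPlusOnes < 0 || bindingMinusOnes < 0) {
            throw new IllegalArgumentException("vote counts must be non-negative");
        }
        return bindingPlusOnes >= 3 && bindingPlusOnes > bindingMinusOnes;
    }
}
```

Under this reading, three +1 votes and no -1 votes pass, while two +1 votes do not, however few -1 votes there are.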


Re: [VOTE] Release Apache Tika 2.0.0 Candidate #1

2021-07-16 Thread Tilman Hausherr

+1

Tilman

On 14.07.2021 at 20:16, Tim Allison wrote:

All,
   A candidate for the Tika 2.0.0 release is available at:
   https://dist.apache.org/repos/dist/dev/tika/2.0.0

   The release candidate is a zip archive of the sources in:
   https://github.com/apache/tika/tree/2.0.0-rc1/

   The SHA-512 checksum of the archive is

31d1f2e3deb54c398fa2d4bf00c434aad3f08387debf2a34dabe6d36747bcc49f2874cbd3abe7d1209670db8284ea540bca3b574ccd1d6b8f8675bdc3f704568.

   In addition, a staged maven repository is available here:

https://repository.apache.org/content/repositories/orgapachetika-1070

   Please vote on releasing this package as Apache Tika 2.0.0.
   The vote is open for the next 72 hours and passes if a majority of at
   least three +1 Tika PMC votes are cast.

   [ ] +1 Release this package as Apache Tika 2.0.0
   [ ] -1 Do not release this package because...

Here's my +1.

Cheers,

   Tim
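
For anyone checking the candidate, the SHA-512 value quoted in the vote email can be compared against a locally computed digest. This is a minimal sketch using only the JDK's `MessageDigest`; the class name `Sha512Check` is hypothetical, and the path of the downloaded archive is whatever the reviewer saved it as.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of the Tika release tooling.
final class Sha512Check {
    private Sha512Check() {}

    private static MessageDigest newDigest() {
        try {
            return MessageDigest.getInstance("SHA-512");
        } catch (NoSuchAlgorithmException e) {
            // SHA-512 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    private static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /** Lower-case hex SHA-512 of an in-memory byte array. */
    static String sha512Hex(byte[] data) {
        return hex(newDigest().digest(data));
    }

    /** Lower-case hex SHA-512 of a file, streamed in 8 KiB chunks. */
    static String sha512Hex(Path file) {
        MessageDigest md = newDigest();
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hex(md.digest());
    }

    /** Compares a file's digest with the expected hex string, ignoring case. */
    static boolean matches(Path file, String expectedHex) {
        return sha512Hex(file).equalsIgnoreCase(expectedHex.trim());
    }
}
```

Note that the trailing period after the checksum in the email is sentence punctuation, so it should be dropped before passing the value to `matches`.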





Re: Fwd: [jira] [Resolved] (TIKA-3485) testBadJVMArgs fails on Windows

2021-07-16 Thread Tilman Hausherr

On 16.07.2021 at 21:47, Tim Allison wrote:

> I can respin 2.0.0-rc2 on Monday if this is a non-starter for 2.0.0-rc1.

I don't think this is needed; the two issues I fixed are for build tests
only, and only on Windows. It seems I'm the only one here who builds on
Windows.

Tilman

> Has anyone else had a chance to give 2.0.0-rc1 a spin?
>
> Thank you, Tilman.
>
> Cheers,
>
>    Tim

-- Forwarded message -
From: Tilman Hausherr (Jira) 
Date: Fri, Jul 16, 2021 at 2:55 PM
Subject: [jira] [Resolved] (TIKA-3485) testBadJVMArgs fails on Windows
To: 



  [ 
https://issues.apache.org/jira/browse/TIKA-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Tilman Hausherr resolved TIKA-3485.
---
 Resolution: Fixed


testBadJVMArgs fails on Windows
---

 Key: TIKA-3485
 URL: https://issues.apache.org/jira/browse/TIKA-3485
 Project: Tika
  Issue Type: Bug
  Components: core
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
Priority: Minor
 Fix For: 2.0.1


testBadJVMArgs fails on Windows because the exit value is -1 instead of 255, so 
I'll adjust this.
(I mentioned this some time ago but can't remember where. I remember checking 
the logs and confirming that it does indeed fail because of the bad args.)
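
The two values in the report, -1 and 255, share the same low 8 bits: POSIX systems truncate an exit status to one byte, while Windows reports the full 32-bit value. The sketch below shows one platform-neutral way a test could compare exit values. It is an assumption for illustration, with hypothetical names, and not the change that was committed for this issue.

```java
// Hypothetical helper for illustration; not the code committed for TIKA-3485.
final class ExitCodes {
    private ExitCodes() {}

    /** Reduces an exit value to its low 8 bits, the range POSIX reports. */
    static int lowByte(int exitValue) {
        return exitValue & 0xFF;
    }

    /** True if two exit values are indistinguishable after 8-bit truncation. */
    static boolean sameStatus(int expected, int actual) {
        return lowByte(expected) == lowByte(actual);
    }
}
```

With this comparison, `sameStatus(255, process.exitValue())` would hold on both platforms for a process that exits with -1.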



--
This message was sent by Atlassian Jira
(v8.3.4#803005)





[jira] [Resolved] (TIKA-3361) Improve intelligence of OCRStrategy=AUTO

2021-07-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3361.
---
Fix Version/s: 2.0.1
   Resolution: Fixed

Thank you [~peterkronenberg] for your patience on this one.  More remains to be 
done with PDFs and OCR'ing, but this looks great to me.  Thank you.

>  Improve intelligence of OCRStrategy=AUTO
> -
>
> Key: TIKA-3361
> URL: https://issues.apache.org/jira/browse/TIKA-3361
> Project: Tika
>  Issue Type: Improvement
>Reporter: Peter Kronenberg
>Priority: Major
> Fix For: 2.0.1
>
>
> Didn’t get a whole lot of feedback on the mailing list, so here’s my attempt 
> at improving OCRStrategy=Auto
> Currently, this strategy performs the following test
> {code:java}
> if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10) {
> doOCROnCurrentPage(AUTO);
> }
> {code}
> I added a way to change the numbers involved: the threshold for the total 
> characters per page (below which we OCR the page), and the threshold for 
> unmapped characters (above which we OCR the page).
> My main concern is with the unmapped characters. OCR adds a lot of overhead, 
> which might not be necessary for just a few unmapped characters.
> I added a new config, *OCRStrategyAuto*, which is only used if 
> OCRStrategy=AUTO. Its format is
> {code:java}
> ocrStrategyAuto = best|fast|m[%], n
> {code}
> ‘best’ and ‘fast’ are shortcuts. More later
> m, n – m is the threshold for the number of unmapped characters per page. It 
> can also be specified as a percentage. So, m=20 means if your page has more 
> than 20 unmapped characters, it will OCR. m=20% means if the unmapped 
> characters are more than 20% of the total characters, then it will OCR.
> n is the threshold for the total number of characters on the page. n does not 
> need to be specified and defaults to 10
> {code:java}
> 20
> {code}
> is equivalent to
> {code:java}
> 20, 10
> {code}
> *best* is shorthand for *20,10*
> {code:java}
> best
> {code}
> is equivalent to
> {code:java}
> 20, 10
> {code}
> *best* is the default and is equivalent to the current behavior
>  *fast* is a shortcut for *10%, 10*, which will avoid OCR unless the number 
> of unmapped characters is greater than 10%
> {code:java}
> fast
> {code}
> is equivalent to
> {code:java}
> 10%, 10
> {code}
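
The `best|fast|m[%], n` format described above can be sketched as a small parser plus the per-page decision. This is only an illustration under stated assumptions: the class `OcrAutoThresholds` and its members are hypothetical, the values behind `best` and `fast` are taken from the description as written, and the code merged into `PDFParserConfig` may differ.

```java
import java.util.Locale;

// Hypothetical illustration; not the actual Tika PDFParserConfig code.
final class OcrAutoThresholds {
    final double unmappedThreshold;   // absolute count, or a percentage if unmappedIsPercent
    final boolean unmappedIsPercent;
    final int totalCharsThreshold;

    OcrAutoThresholds(double unmappedThreshold, boolean unmappedIsPercent,
                      int totalCharsThreshold) {
        this.unmappedThreshold = unmappedThreshold;
        this.unmappedIsPercent = unmappedIsPercent;
        this.totalCharsThreshold = totalCharsThreshold;
    }

    /** Parses "best", "fast", "m", "m%", "m, n" or "m%, n"; n defaults to 10. */
    static OcrAutoThresholds parse(String spec) {
        String s = spec.trim().toLowerCase(Locale.ROOT);
        if (s.equals("best")) {
            s = "20, 10";      // shorthand as given in the issue description
        } else if (s.equals("fast")) {
            s = "10%, 10";     // shorthand as given in the issue description
        }
        String[] parts = s.split(",");
        if (parts.length == 0 || parts.length > 2) {
            throw new IllegalArgumentException("expected best|fast|m[%], n but got: " + spec);
        }
        String m = parts[0].trim();
        boolean percent = m.endsWith("%");
        if (percent) {
            m = m.substring(0, m.length() - 1).trim();
        }
        double unmapped = Double.parseDouble(m);
        int total = parts.length == 2 ? Integer.parseInt(parts[1].trim()) : 10;
        return new OcrAutoThresholds(unmapped, percent, total);
    }

    /** OCR when the page has too few characters or too many unmapped ones. */
    boolean shouldOcr(int totalCharsOnPage, int unmappedCharsOnPage) {
        if (totalCharsOnPage < totalCharsThreshold) {
            return true;
        }
        if (unmappedIsPercent) {
            // Compare unmapped/total against the percentage without dividing.
            return unmappedCharsOnPage * 100.0 > unmappedThreshold * totalCharsOnPage;
        }
        return unmappedCharsOnPage > unmappedThreshold;
    }
}
```

For example, `OcrAutoThresholds.parse("fast").shouldOcr(100, 11)` is true because 11 of 100 characters exceeds 10%, while the same page with 10 unmapped characters is left alone.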





[jira] [Commented] (TIKA-3361) Improve intelligence of OCRStrategy=AUTO

2021-07-16 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382345#comment-17382345
 ] 

Hudson commented on TIKA-3361:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #285 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/285/])
TIKA-3361 Make ocrStrategy=Auto more intelligent (#447) (github: 
[https://github.com/apache/tika/commit/484a340a4643ed2335413ba4feddbe8d64f4e9d8])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java




[jira] [Commented] (TIKA-3485) testBadJVMArgs fails on Windows

2021-07-16 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382344#comment-17382344
 ] 

Hudson commented on TIKA-3485:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #285 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/285/])
TIKA-3485: expect -1 on windows (tilman: 
[https://github.com/apache/tika/commit/5a497b9b32fac32efed1b15f0d7c890a0e884617])
* (edit) 
tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaServerIntegrationTest.java




Fwd: [jira] [Resolved] (TIKA-3485) testBadJVMArgs fails on Windows

2021-07-16 Thread Tim Allison
I can respin 2.0.0-rc2 on Monday if this is a non-starter for 2.0.0-rc1.

Has anyone else had a chance to give 2.0.0-rc1 a spin?

Thank you, Tilman.

Cheers,

  Tim

-- Forwarded message -
From: Tilman Hausherr (Jira) 
Date: Fri, Jul 16, 2021 at 2:55 PM
Subject: [jira] [Resolved] (TIKA-3485) testBadJVMArgs fails on Windows
To: 



 [ 
https://issues.apache.org/jira/browse/TIKA-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Tilman Hausherr resolved TIKA-3485.
---
Resolution: Fixed



[jira] [Commented] (TIKA-3361) Improve intelligence of OCRStrategy=AUTO

2021-07-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382306#comment-17382306
 ] 

ASF GitHub Bot commented on TIKA-3361:
--

tballison merged pull request #447:
URL: https://github.com/apache/tika/pull/447


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [tika] tballison merged pull request #447: TIKA-3361 Make ocrStrategy=Auto more intelligent

2021-07-16 Thread GitBox


tballison merged pull request #447:
URL: https://github.com/apache/tika/pull/447


   






[jira] [Commented] (TIKA-3484) TikaPipesOpenSearchTest: java.lang.IllegalArgumentException: "basePath" directory does not exist

2021-07-16 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382302#comment-17382302
 ] 

Hudson commented on TIKA-3484:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #284 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/284/])
TIKA-3484: escape second parameter so that it works on Windows 10 (tilman: 
[https://github.com/apache/tika/commit/29ec5a0c01c21977670f3d3224cf5c4e618ef32f])
* (edit) 
tika-integration-tests/tika-pipes-opensearch-integration-tests/src/test/java/org/apache/tika/pipes/opensearch/tests/TikaPipesOpenSearchTest.java


> TikaPipesOpenSearchTest: java.lang.IllegalArgumentException: "basePath" 
> directory does not exist
> 
>
> Key: TIKA-3484
> URL: https://issues.apache.org/jira/browse/TIKA-3484
> Project: Tika
>  Issue Type: Bug
>  Components: tika-pipes
>Affects Versions: 2.0.0
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 2.0.1
>
>
> I've been trying to build "main" on windows 10, and got this:
> java.lang.RuntimeException: java.lang.IllegalArgumentException: "basePath" 
> directory does not exist: 
> X\YYJavatika-maintikatika-integration-teststika-pipes-opensearch-integration-teststargettest-files
> at 
> org.apache.tika.pipes.opensearch.tests.TikaPipesOpenSearchTest.runPipes(TikaPipesOpenSearchTest.java:129)
> at 
> org.apache.tika.pipes.opensearch.tests.TikaPipesOpenSearchTest.testFSToOpenSearch(TikaPipesOpenSearchTest.java:96)
> Caused by: java.lang.IllegalArgumentException: "basePath" directory does not 
> exist: 
> X\YYJavatika-maintikatika-integration-teststika-pipes-opensearch-integration-teststargettest-files
> The cause is that the file 
> tika\tika-integration-tests\tika-pipes-opensearch-integration-tests\target\ta-opensearch.xml
>  has two basepaths that don't exist. It contains my path but without any 
> "/" or "\".
> The root cause is that .replaceAll needs some escaping in the second 
> parameter.
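
The symptom described above (a Windows path losing every backslash) is what `String.replaceAll` does when the replacement string is not escaped, because a backslash in the replacement escapes the character after it. The following sketch contrasts the unescaped call with two safe alternatives. The class name and the `{PATH}` placeholder are hypothetical, and the actual fix in `TikaPipesOpenSearchTest` may have been written differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the replaceAll pitfall; not the Tika test code.
final class PathTemplate {
    private PathTemplate() {}

    /** Unescaped replacement: backslashes in 'path' are consumed as escapes. */
    static String naive(String template, String placeholder, String path) {
        return template.replaceAll(Pattern.quote(placeholder), path);
    }

    /** Escaped replacement: quoteReplacement keeps '\' and '$' literal. */
    static String escaped(String template, String placeholder, String path) {
        return template.replaceAll(Pattern.quote(placeholder),
                Matcher.quoteReplacement(path));
    }

    /** String.replace treats both arguments literally, so no escaping is needed. */
    static String literal(String template, String placeholder, String path) {
        return template.replace(placeholder, path);
    }
}
```

When no regular expression is needed at all, `String.replace` is arguably the simpler choice, since it avoids both the pattern and the replacement escaping rules.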





[jira] [Commented] (TIKA-3482) Improve handling of FetchException in pipes processor

2021-07-16 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382303#comment-17382303
 ] 

Hudson commented on TIKA-3482:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #284 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/284/])
TIKA-3482 -- improve handling of fetch exceptions, add basic logging to 
tika-app -a (tallison: 
[https://github.com/apache/tika/commit/dd5f49fc5ac751a8aa67e29e4c4c6963ca8ea65e])
* (edit) tika-core/src/main/java/org/apache/tika/pipes/PipesResult.java
* (edit) tika-core/src/main/java/org/apache/tika/pipes/async/AsyncProcessor.java
* (add) tika-core/src/test/resources/test-documents/subdir/example.xml
* (edit) tika-core/src/main/java/org/apache/tika/pipes/PipesClient.java
* (edit) 
tika-core/src/test/java/org/apache/tika/pipes/pipesiterator/FileSystemPipesIteratorTest.java
* (edit) tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* (edit) tika-core/src/main/java/org/apache/tika/pipes/PipesServer.java


> Improve handling of FetchException in pipes processor
> -
>
> Key: TIKA-3482
> URL: https://issues.apache.org/jira/browse/TIKA-3482
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> In the current implementation, if there's a fetch exception, that causes the 
> forked process to restart.  We should transmit that exception back to the 
> forking process and not restart.
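
The idea in this issue, reporting a fetch failure as a result instead of letting it take down the forked process, can be sketched as follows. Everything here is hypothetical (the `Fetcher` interface, the `FetchOutcome` class and its status names); it does not reproduce the classes touched by the commit and only illustrates the design.

```java
import java.io.IOException;

// Hypothetical types for illustration; not the Tika pipes classes.
final class FetchOutcome {
    enum Status { OK, FETCH_EXCEPTION }

    /** Stand-in for whatever retrieves the bytes for a given key. */
    interface Fetcher {
        byte[] fetch(String key) throws IOException;
    }

    final Status status;
    final String message;

    private FetchOutcome(Status status, String message) {
        this.status = status;
        this.message = message;
    }

    /**
     * Runs the fetch inside the worker and converts an IOException into a
     * result the parent can read, so the worker keeps running afterwards.
     */
    static FetchOutcome tryFetch(Fetcher fetcher, String key) {
        try {
            fetcher.fetch(key);
            return new FetchOutcome(Status.OK, "");
        } catch (IOException e) {
            return new FetchOutcome(Status.FETCH_EXCEPTION, String.valueOf(e.getMessage()));
        }
    }
}
```

The parent process can then record the failure for that one key and move on to the next, rather than paying the cost of restarting the worker.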





[jira] [Resolved] (TIKA-3485) testBadJVMArgs fails on Windows

2021-07-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-3485.
---
Resolution: Fixed



[jira] [Created] (TIKA-3485) testBadJVMArgs fails on Windows

2021-07-16 Thread Tilman Hausherr (Jira)
Tilman Hausherr created TIKA-3485:
-

 Summary: testBadJVMArgs fails on Windows
 Key: TIKA-3485
 URL: https://issues.apache.org/jira/browse/TIKA-3485
 Project: Tika
  Issue Type: Bug
  Components: core
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
 Fix For: 2.0.1


testBadJVMArgs fails on Windows because the exit value is -1 instead of 255, so 
I'll adjust this.

(I mentioned this some time ago but can't remember where. I remember checking 
the logs and confirming that it does indeed fail because of the bad args.)





[jira] [Commented] (TIKA-3361) Improve intelligence of OCRStrategy=AUTO

2021-07-16 Thread Peter Kronenberg (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382256#comment-17382256
 ] 

Peter Kronenberg commented on TIKA-3361:


Finally got a chance to finish this Pull Request



[jira] [Commented] (TIKA-3361) Improve intelligence of OCRStrategy=AUTO

2021-07-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382254#comment-17382254
 ] 

ASF GitHub Bot commented on TIKA-3361:
--

peterkronenberg opened a new pull request #447:
URL: https://github.com/apache/tika/pull/447


   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch.
   
   We will be able to faster integrate your pull request if these conditions 
are met. If you have any questions how to fix your problem or about using Tika 
in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




[GitHub] [tika] peterkronenberg opened a new pull request #447: TIKA-3361 Make ocrStrategy=Auto more intelligent

2021-07-16 Thread GitBox


peterkronenberg opened a new pull request #447:
URL: https://github.com/apache/tika/pull/447


   
   




[jira] [Resolved] (TIKA-3484) TikaPipesOpenSearchTest: java.lang.IllegalArgumentException: "basePath" directory does not exist

2021-07-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-3484.
---
Resolution: Fixed



[jira] [Created] (TIKA-3484) TikaPipesOpenSearchTest: java.lang.IllegalArgumentException: "basePath" directory does not exist

2021-07-16 Thread Tilman Hausherr (Jira)
Tilman Hausherr created TIKA-3484:
-

 Summary: TikaPipesOpenSearchTest: 
java.lang.IllegalArgumentException: "basePath" directory does not exist
 Key: TIKA-3484
 URL: https://issues.apache.org/jira/browse/TIKA-3484
 Project: Tika
  Issue Type: Bug
  Components: tika-pipes
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
 Fix For: 2.0.1


I've been trying to build "main" on windows 10, and got this:

java.lang.RuntimeException: java.lang.IllegalArgumentException: "basePath" 
directory does not exist: 
X\YYJavatika-maintikatika-integration-teststika-pipes-opensearch-integration-teststargettest-files
at 
org.apache.tika.pipes.opensearch.tests.TikaPipesOpenSearchTest.runPipes(TikaPipesOpenSearchTest.java:129)
at 
org.apache.tika.pipes.opensearch.tests.TikaPipesOpenSearchTest.testFSToOpenSearch(TikaPipesOpenSearchTest.java:96)
Caused by: java.lang.IllegalArgumentException: "basePath" directory does not 
exist: 
X\YYJavatika-maintikatika-integration-teststika-pipes-opensearch-integration-teststargettest-files


The cause is that the file 
tika\tika-integration-tests\tika-pipes-opensearch-integration-tests\target\ta-opensearch.xml
 has two basepaths that don't exist. It contains my path but without any "/" 
or "\".

The root cause is that .replaceAll needs some escaping in the second parameter.





[jira] [Commented] (TIKA-3483) Implement a network policy for Helm Chart

2021-07-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382166#comment-17382166
 ] 

ASF GitHub Bot commented on TIKA-3483:
--

lewismc edited a comment on pull request #5:
URL: https://github.com/apache/tika-helm/pull/5#issuecomment-881547144


   Hi @bynare, the NetworkPolicy looks fine, thanks.
   It helps other developers understand the impact of this PR if we describe 
it. For example, 
   
   > This pull request proposes to create a network policy to restrict traffic 
to pods within the same namespace that include the label `-client: 
true` e.g. `tika-client: true`
   
   Thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement a network policy for Helm Chart
> -
>
> Key: TIKA-3483
> URL: https://issues.apache.org/jira/browse/TIKA-3483
> Project: Tika
>  Issue Type: Improvement
>  Components: helm
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 2.0.0
>
>
> See https://github.com/apache/tika-helm/pull/5 for context





[GitHub] [tika-helm] lewismc edited a comment on pull request #5: [TIKA-3483] Implement a network policy for Helm Chart

2021-07-16 Thread GitBox


lewismc edited a comment on pull request #5:
URL: https://github.com/apache/tika-helm/pull/5#issuecomment-881547144


   Hi @bynare, the NetworkPolicy looks fine, thanks.
   It helps other developers understand the impact of this PR if we describe 
it. For example, 
   
   > This pull request proposes to create a network policy to restrict traffic 
to pods within the same namespace that include the label `-client: 
true` e.g. `tika-client: true`
   
   Thanks






[jira] [Commented] (TIKA-3483) Implement a network policy for Helm Chart

2021-07-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382165#comment-17382165
 ] 

ASF GitHub Bot commented on TIKA-3483:
--

lewismc commented on a change in pull request #5:
URL: https://github.com/apache/tika-helm/pull/5#discussion_r671359490



##
File path: templates/networkpolicy.yaml
##
@@ -0,0 +1,23 @@
+{{- if .Values.networkPolicy.enabled }}

Review comment:
   We need an Apache License v2 header.






[GitHub] [tika-helm] lewismc commented on a change in pull request #5: [TIKA-3483] Implement a network policy for Helm Chart

2021-07-16 Thread GitBox


lewismc commented on a change in pull request #5:
URL: https://github.com/apache/tika-helm/pull/5#discussion_r671359490



##
File path: templates/networkpolicy.yaml
##
@@ -0,0 +1,23 @@
+{{- if .Values.networkPolicy.enabled }}

Review comment:
   We need an Apache License v2 header.








[jira] [Commented] (TIKA-3483) Implement a network policy for Helm Chart

2021-07-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382163#comment-17382163
 ] 

ASF GitHub Bot commented on TIKA-3483:
--

lewismc commented on pull request #5:
URL: https://github.com/apache/tika-helm/pull/5#issuecomment-881547144


   Hi @bynare, the NetworkPolicy looks fine, thanks.
   Can you provide some further context on the pull request for the other 
developers?
   Thanks




[GitHub] [tika-helm] lewismc commented on pull request #5: [TIKA-3483] Implement a network policy for Helm Chart

2021-07-16 Thread GitBox


lewismc commented on pull request #5:
URL: https://github.com/apache/tika-helm/pull/5#issuecomment-881547144


   Hi @bynare, the NetworkPolicy looks fine, thanks.
   Can you provide some further context on the pull request for the other 
developers?
   Thanks






[jira] [Updated] (TIKA-3483) Implement a network policy for Helm Chart

2021-07-16 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated TIKA-3483:
---
Summary: Implement a network policy for Helm Chart  (was: Implement a 
network policy)



[jira] [Created] (TIKA-3483) Implement a network policy

2021-07-16 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created TIKA-3483:
--

 Summary: Implement a network policy
 Key: TIKA-3483
 URL: https://issues.apache.org/jira/browse/TIKA-3483
 Project: Tika
  Issue Type: Improvement
  Components: helm
Reporter: Lewis John McGibbney
 Fix For: 2.0.0


See https://github.com/apache/tika-helm/pull/5 for context


