[jira] [Comment Edited] (TIKA-3361) Improve intelligence of OCRStrategy=AUTO

2021-05-20 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348793#comment-17348793
 ] 

Tim Allison edited comment on TIKA-3361 at 5/20/21, 8:50 PM:
-

How about the PR as is, but change the terms to "faster" and "better"? 

If I wait to future-proof this with the other things, it won't get in.  And I 
realize I'm the one holding this up. :(

Let's also add range checks on initialization.

I'm happy to do both of these if you're done with this. :D
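
For illustration, the range checks on initialization could look roughly like the sketch below. The class name AutoThresholds and the exact bounds (no negative values, a percentage no larger than 100) are assumptions made for this example; the thread does not specify them, and this is not the code in the PR.

```java
/**
 * Hypothetical holder for the two AUTO thresholds; not Tika's actual class.
 * All bounds checked here are assumptions for the sake of the example.
 */
final class AutoThresholds {
    final int unmappedThreshold;      // m: absolute count, or a percentage if unmappedIsPercent
    final boolean unmappedIsPercent;  // true when m was written with a trailing '%'
    final int totalCharsThreshold;    // n: pages with fewer total characters get OCR'd

    AutoThresholds(int unmappedThreshold, boolean unmappedIsPercent, int totalCharsThreshold) {
        // Reject values that cannot describe a meaningful threshold.
        if (unmappedThreshold < 0) {
            throw new IllegalArgumentException(
                    "unmapped threshold must be >= 0, got " + unmappedThreshold);
        }
        if (unmappedIsPercent && unmappedThreshold > 100) {
            throw new IllegalArgumentException(
                    "unmapped percentage must be <= 100, got " + unmappedThreshold);
        }
        if (totalCharsThreshold < 0) {
            throw new IllegalArgumentException(
                    "total chars threshold must be >= 0, got " + totalCharsThreshold);
        }
        this.unmappedThreshold = unmappedThreshold;
        this.unmappedIsPercent = unmappedIsPercent;
        this.totalCharsThreshold = totalCharsThreshold;
    }
}
```

Failing at construction time means a bad config value is reported once at startup rather than showing up later as odd per-page OCR decisions.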


was (Author: talli...@mitre.org):
How about the PR as is, but change the terms to "faster" and "better"? 

If we wait to future proof this with the other things, it won't get in.  And I 
realize I'm the one holding this up. :(

>  Improve intelligence of OCRStrategy=AUTO
> -
>
> Key: TIKA-3361
> URL: https://issues.apache.org/jira/browse/TIKA-3361
> Project: Tika
>  Issue Type: Improvement
>Reporter: Peter Kronenberg
>Priority: Major
>
> Didn’t get a whole lot of feedback on the mailing list, so here’s my attempt 
> at improving OCRStrategy=Auto
> Currently, this strategy performs the following test
> {code:java}
> if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10) {
>     doOCROnCurrentPage(AUTO);
> }
> {code}
> I added a way to change the two numbers involved: the threshold for the total 
> characters per page (below which we OCR the page), and the threshold for 
> unmapped characters (above which we OCR the page).
> My main concern is with the unmapped characters. OCR adds a lot of overhead, 
> which might not be necessary for just a few unmapped characters.
> I added a new config, *OCRStrategyAuto*, which is only used if 
> OCRStrategy=AUTO. Its format is
> {code:java}
> ocrStrategyAuto = best|fast|m[%], n
> {code}
> ‘best’ and ‘fast’ are shortcuts, described below.
> m, n – m is the threshold for the number of unmapped characters per page. It 
> can also be specified as a percentage. So, m=20 means if your page has more 
> than 20 unmapped characters, it will OCR. m=20% means if the unmapped 
> characters are more than 20% of the total characters, then it will OCR.
> n is the threshold for the total number of characters on the page. n does not 
> need to be specified and defaults to 10
> {code:java}
> 20
> {code}
> is equivalent to
> {code:java}
> 20, 10
> {code}
> *best* is shorthand for *20,10*
> {code:java}
> best
> {code}
> is equivalent to
> {code:java}
> 20, 10
> {code}
> *best* is the default and is equivalent to the current behavior
> *fast* is a shortcut for *10%, 10*, which will avoid OCR unless the number 
> of unmapped characters is greater than 10% of the total characters
> {code:java}
> fast
> {code}
> is equivalent to
> {code:java}
> 10%, 10
> {code}
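
As a rough illustration of the format described above, the sketch below parses an ocrStrategyAuto value and applies the two thresholds to a page. The class OcrStrategyAuto, its parse and shouldOcr methods, and the choice to treat a missing or empty value as best are assumptions made for this example, not the code in the PR. The shortcut values are the ones given in the description (best = 20, 10 and fast = 10%, 10).

```java
import java.util.Locale;

/** Hypothetical parser for "best|fast|m[%], n"; not Tika's actual implementation. */
final class OcrStrategyAuto {
    final int unmapped;               // m: absolute count, or a percentage if unmappedIsPercent
    final boolean unmappedIsPercent;  // true when m was written as "m%"
    final int totalChars;             // n: defaults to 10 when omitted

    private OcrStrategyAuto(int unmapped, boolean unmappedIsPercent, int totalChars) {
        this.unmapped = unmapped;
        this.unmappedIsPercent = unmappedIsPercent;
        this.totalChars = totalChars;
    }

    static OcrStrategyAuto parse(String value) {
        String s = value == null ? "" : value.trim().toLowerCase(Locale.ROOT);
        // Assumption: an absent value falls back to the default shortcut, "best".
        if (s.isEmpty() || s.equals("best")) {
            return new OcrStrategyAuto(20, false, 10);
        }
        if (s.equals("fast")) {
            return new OcrStrategyAuto(10, true, 10);
        }
        // Limit of -1 keeps trailing empty parts, so "20," fails instead of passing silently.
        String[] parts = s.split(",", -1);
        if (parts.length > 2) {
            throw new IllegalArgumentException("expected best|fast|m[%], n but got: " + value);
        }
        String m = parts[0].trim();
        boolean percent = m.endsWith("%");
        if (percent) {
            m = m.substring(0, m.length() - 1).trim();
        }
        // parseInt throws NumberFormatException (an IllegalArgumentException) on bad input.
        int mValue = Integer.parseInt(m);
        int nValue = parts.length == 2 ? Integer.parseInt(parts[1].trim()) : 10;
        return new OcrStrategyAuto(mValue, percent, nValue);
    }

    /** True if the page should be OCR'd under these thresholds. */
    boolean shouldOcr(int totalCharsPerPage, int unmappedUnicodeCharsPerPage) {
        if (totalCharsPerPage < totalChars) {
            return true; // too little extracted text on the page
        }
        if (unmappedIsPercent) {
            // unmapped / total > m / 100, rewritten without division
            return (long) unmappedUnicodeCharsPerPage * 100 > (long) unmapped * totalCharsPerPage;
        }
        return unmappedUnicodeCharsPerPage > unmapped;
    }
}
```

With these definitions, OcrStrategyAuto.parse("fast").shouldOcr(100, 11) is true, because 11 unmapped characters out of 100 exceed 10%, while the same page with exactly 10 unmapped characters is not OCR'd.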



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (TIKA-3361) Improve intelligence of OCRStrategy=AUTO

2021-04-20 Thread Peter Kronenberg (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325904#comment-17325904
 ] 

Peter Kronenberg edited comment on TIKA-3361 at 4/20/21, 3:25 PM:
--

Yes, theoretically, you're correct.  You could argue that Best should mean that 
if there is just 1 unmapped character, then we should use OCR.  But the idea is 
to still balance performance with accuracy and just do it a little more 
intelligently.  If the user wanted to just OCR everything, then he should 
just set OCR Strategy to OCR_Only.

The problem here is with the word Best.  It's not really Best.  It's just 
_Fast_ and _Better but not as Fast._  So perhaps we need to come up with a 
better keyword.




