[ https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348793#comment-17348793 ]
Tim Allison edited comment on TIKA-3361 at 5/20/21, 8:50 PM:
-------------------------------------------------------------

How about the PR as is, but change the terms to "faster" and "better"? If I wait to future proof this with the other things, it won't get in. And I realize I'm the one holding this up. :(

Let's also add range checks on initialization.

I'm happy to do both of these if you're done with this. :D


was (Author: talli...@mitre.org):
    How about the PR as is, but change the terms to "faster" and "better"? If we wait to future proof this with the other things, it won't get in. And I realize I'm the one holding this up. :(

> Improve intelligence of OCRStrategy=AUTO
> -----------------------------------------
>
>                 Key: TIKA-3361
>                 URL: https://issues.apache.org/jira/browse/TIKA-3361
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Peter Kronenberg
>            Priority: Major
>
> Didn’t get a whole lot of feedback on the mailing list, so here’s my attempt
> at improving OCRStrategy=AUTO.
> Currently, this strategy performs the following test:
> {code:java}
> if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10) {
>     doOCROnCurrentPage(AUTO);
> }
> {code}
> I added a way to change the two numbers involved: the threshold for the total
> characters per page (below which we OCR the page), and the threshold for
> unmapped characters (above which we OCR the page).
> My main concern is with the unmapped characters. OCR adds a lot of overhead,
> which might not be necessary for just a few unmapped characters.
> I added a new config, *OCRStrategyAuto*, which is only used if
> OCRStrategy=AUTO. Its format is
> {code:java}
> ocrStrategyAuto = best|fast|m[%], n
> {code}
> ‘best’ and ‘fast’ are shortcuts. More later.
> m, n: m is the threshold for the number of unmapped characters per page. It
> can also be specified as a percentage. So, m=20 means if your page has more
> than 20 unmapped characters, it will OCR.
> m=20% means if the unmapped characters are more than 20% of the total
> characters, then it will OCR.
> n is the threshold for the total number of characters on the page. n does not
> need to be specified and defaults to 10.
> {code:java}
> <param name="ocrStrategyAuto" type="string">20</param>
> {code}
> is equivalent to
> {code:java}
> <param name="ocrStrategyAuto" type="string">20, 10</param>
> {code}
> *best* is shorthand for *20, 10*
> {code:java}
> <param name="ocrStrategyAuto" type="string">best</param>
> {code}
> is equivalent to
> {code:java}
> <param name="ocrStrategyAuto" type="string">20, 10</param>
> {code}
> *best* is the default and is equivalent to the current behavior.
> *fast* is a shortcut for *10%, 10*, which will avoid OCR unless the unmapped
> characters are more than 10% of the total characters on the page.
> {code:java}
> <param name="ocrStrategyAuto" type="string">fast</param>
> {code}
> is equivalent to
> {code:java}
> <param name="ocrStrategyAuto" type="string">10%, 10</param>
> {code}
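For illustration only, here is a rough sketch of how an ocrStrategyAuto value in the format described above might be parsed and applied to a page. It is not the code from the PR: the class name OcrAutoThresholds and the methods parse and shouldOcr are hypothetical, the shortcut values are taken from the description as written, and the range checks are one guess at what "range checks on initialization" could look like.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not Tika's actual implementation.
final class OcrAutoThresholds {
    private final double unmappedThreshold;   // m: a count, or a percentage if unmappedIsPercent
    private final boolean unmappedIsPercent;  // true when m was written with a trailing '%'
    private final int totalCharsThreshold;    // n: OCR when the page has fewer characters than this

    private OcrAutoThresholds(double m, boolean percent, int n) {
        this.unmappedThreshold = m;
        this.unmappedIsPercent = percent;
        this.totalCharsThreshold = n;
    }

    // Parses "best", "fast", "m", "m%", "m, n" or "m%, n"; n defaults to 10.
    static OcrAutoThresholds parse(String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        if (v.equals("best")) {
            v = "20, 10";   // shortcut value as given in the description
        } else if (v.equals("fast")) {
            v = "10%, 10";  // shortcut value as given in the description
        }
        String[] parts = v.split(",");
        if (parts.length > 2) {
            throw new IllegalArgumentException("Expected 'm[%], n' but got: " + value);
        }
        String m = parts[0].trim();
        boolean percent = m.endsWith("%");
        if (percent) {
            m = m.substring(0, m.length() - 1).trim();
        }
        // NumberFormatException is an IllegalArgumentException, so bad numbers fail here.
        double mValue = Double.parseDouble(m);
        int n = parts.length == 2 ? Integer.parseInt(parts[1].trim()) : 10;

        // Range checks on initialization (assumed bounds).
        if (mValue < 0 || (percent && mValue > 100)) {
            throw new IllegalArgumentException("Unmapped threshold out of range: " + value);
        }
        if (n < 0) {
            throw new IllegalArgumentException("Total-character threshold out of range: " + value);
        }
        return new OcrAutoThresholds(mValue, percent, n);
    }

    // OCR the page if it has too few characters or too many unmapped ones.
    boolean shouldOcr(int totalCharsOnPage, int unmappedCharsOnPage) {
        if (totalCharsOnPage < totalCharsThreshold) {
            return true;
        }
        if (unmappedIsPercent) {
            // unmapped / total > m / 100, written without division
            return unmappedCharsOnPage * 100.0 > unmappedThreshold * totalCharsOnPage;
        }
        return unmappedCharsOnPage > unmappedThreshold;
    }
}
```

With this sketch, OcrAutoThresholds.parse("fast").shouldOcr(100, 11) returns true, since 11 unmapped characters are more than 10% of 100, while the same page under "best" is not sent to OCR because 11 does not exceed 20.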