Dávid Tóth created TIKA-3134:
--------------------------------

     Summary: totalCharsPerPage and unmappedUnicodeCharsPerPage configuration
         Key: TIKA-3134
         URL: https://issues.apache.org/jira/browse/TIKA-3134
     Project: Tika
  Issue Type: Improvement
    Reporter: Dávid Tóth

During PDF parsing, the decision to run OCR on a page is made in the endPage(PDPage page) method of the AbstractPDF2XHTML class, based on the values of totalCharsPerPage and unmappedUnicodeCharsPerPage. If either of these is less than 10 (a hardcoded number), the page is handled by OCR. Our improvement removes these hardcoded numbers; the thresholds are now configurable in the PDFParserConfig class.
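
To illustrate the idea, here is a minimal sketch of a per-page OCR decision with configurable thresholds. It is not the actual patch: OcrThresholds, OcrDecision, and all method names are hypothetical stand-ins rather than Tika's PDFParserConfig or AbstractPDF2XHTML API, and the comparison simply follows the wording of this issue (OCR when either count is below its threshold), which may differ from the real code.

```java
// Hypothetical stand-in for the two new settings; NOT Tika's PDFParserConfig.
final class OcrThresholds {
    // Default mirrors the previously hardcoded value mentioned in the issue.
    static final int DEFAULT_THRESHOLD = 10;

    private int totalCharsPerPageThreshold = DEFAULT_THRESHOLD;
    private int unmappedUnicodeCharsPerPageThreshold = DEFAULT_THRESHOLD;

    int getTotalCharsPerPageThreshold() {
        return totalCharsPerPageThreshold;
    }

    void setTotalCharsPerPageThreshold(int value) {
        totalCharsPerPageThreshold = requireNonNegative(value);
    }

    int getUnmappedUnicodeCharsPerPageThreshold() {
        return unmappedUnicodeCharsPerPageThreshold;
    }

    void setUnmappedUnicodeCharsPerPageThreshold(int value) {
        unmappedUnicodeCharsPerPageThreshold = requireNonNegative(value);
    }

    // A negative threshold could never trigger OCR, so reject it early.
    private static int requireNonNegative(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("threshold must be >= 0: " + value);
        }
        return value;
    }
}

// Hypothetical helper showing the kind of check made at the end of each page.
final class OcrDecision {
    private OcrDecision() {
    }

    // Assumption: follows the issue text, i.e. OCR when either count
    // is below its configured threshold.
    static boolean shouldOcr(int totalCharsPerPage,
                             int unmappedUnicodeCharsPerPage,
                             OcrThresholds thresholds) {
        return totalCharsPerPage < thresholds.getTotalCharsPerPageThreshold()
                || unmappedUnicodeCharsPerPage
                        < thresholds.getUnmappedUnicodeCharsPerPageThreshold();
    }
}
```

With the defaults, a page with 5 total characters would be sent to OCR and a page with 50 of each would not; raising the total-character threshold to 100 would send that second page to OCR as well.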