Dávid Tóth created TIKA-3134:
--------------------------------

             Summary: totalCharsPerPage and unmappedUnicodeCharsPerPage configuration
                 Key: TIKA-3134
                 URL: https://issues.apache.org/jira/browse/TIKA-3134
             Project: Tika
          Issue Type: Improvement
            Reporter: Dávid Tóth


During PDF parsing, the decision to run OCR on a page is made in the 
endPage(PDPage page) method of the AbstractPDF2XHTML class, based on the values 
of totalCharsPerPage and unmappedUnicodeCharsPerPage. If either of these is less 
than 10 (a hardcoded number), the page is handled by OCR. In our improvement we 
removed these hardcoded numbers; both thresholds are now configurable in the 
PDFParserConfig class.
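
To illustrate the idea, here is a minimal standalone sketch of replacing the 
hardcoded 10 with configurable thresholds. It is not the actual patch: the 
class OcrThresholds, its setters, and shouldOcr are hypothetical names chosen 
for this example, and the comparison simply follows the description above 
(either count below its threshold triggers OCR). The real fields and setters 
in PDFParserConfig may be named and wired differently.

```java
/**
 * Hypothetical holder for the two OCR thresholds described in this issue.
 * This is an illustration only, not Tika's PDFParserConfig.
 */
final class OcrThresholds {

    // 10 mirrors the previously hardcoded value, so default behaviour is unchanged.
    private int totalCharsThreshold = 10;
    private int unmappedUnicodeCharsThreshold = 10;

    void setTotalCharsThreshold(int threshold) {
        if (threshold < 0) {
            throw new IllegalArgumentException("threshold must be >= 0");
        }
        this.totalCharsThreshold = threshold;
    }

    void setUnmappedUnicodeCharsThreshold(int threshold) {
        if (threshold < 0) {
            throw new IllegalArgumentException("threshold must be >= 0");
        }
        this.unmappedUnicodeCharsThreshold = threshold;
    }

    /**
     * Returns true when either per-page count is below its configured
     * threshold, following the rule as stated in the issue text.
     */
    boolean shouldOcr(int totalCharsPerPage, int unmappedUnicodeCharsPerPage) {
        return totalCharsPerPage < totalCharsThreshold
                || unmappedUnicodeCharsPerPage < unmappedUnicodeCharsThreshold;
    }
}
```

With the defaults, shouldOcr(5, 50) is true and shouldOcr(50, 50) is false; 
after setTotalCharsThreshold(100), shouldOcr(50, 50) becomes true. Keeping 10 
as the default means existing users see no change unless they set a value.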



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
