[ https://issues.apache.org/jira/browse/TIKA-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419998#comment-16419998 ]
Tim Allison commented on TIKA-2582:
-----------------------------------

Y, all my fault. Sorry, and thank you!

> Tesseract 4.0 includes a FF character by default, breaking parsers
> ------------------------------------------------------------------
>
>                 Key: TIKA-2582
>                 URL: https://issues.apache.org/jira/browse/TIKA-2582
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>            Reporter: Ewan Mellor
>            Priority: Major
>             Fix For: 1.18, 2.0.0
>
>
> Tesseract 4.0 includes a change to use form feed characters to separate pages
> by default in its text output. Previous versions used no separator unless you
> specified the include_page_breaks option.
>
> This confuses any parser that is not expecting the FF.
> ODFParserTest.testOO2Metadata fails, because it is expecting the output of a
> blank image to be the empty string, but now the FF is there.
>
> I haven't seen any other failures, but I expect that user code will now see
> either FF or U+FFFD where they are not expecting it (SafeContentHandler
> replaces the FF with U+FFFD when converting to text to XML).
>
> We should set the appropriate Tesseract options to disable this behavior
> unless explicitly requested by user code, to avoid the change in behavior.
>
> For reference, the Tesseract change is as follows:
> {quote}commit 2cc531e6bf0288fc8a9ad1c123a252395f00bf56
> Merge: 3bb573ae aa6eb6bd
> Author: zdenop <zde...@gmail.com>
> Date: Tue Sep 19 08:41:08 2017 +0200
>
>     Merge pull request #1140 from stweil/pagebreak
>
>     Remove Tesseract parameter "include_page_breaks" and use FF by default
>
> commit aa6eb6bd466101a3b89880f87580471a7694359d
> Author: Stefan Weil <s...@weilnetz.de>
> Date: Mon Jun 12 19:42:45 2017 +0200
>
>     Remove Tesseract parameter "include_page_breaks" and use FF by default
>
>     Now Tesseract adds a page break (normally form feed) by default.
>     It is still possible to suppress page breaks by setting an empty
>     page_separator.
>
>     Signed-off-by: Stefan Weil <s...@weilnetz.de>
> {quote}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
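
For illustration only, here is a minimal Java sketch of the mitigation the issue proposes: pass an empty page_separator to Tesseract unless the caller asks for one, and strip any stray FF defensively. The class and method names (PageSeparatorSketch, buildCommand, stripFormFeeds) are hypothetical and are not Tika's actual code. The sketch assumes the tesseract binary accepts the `-c page_separator=` form with an empty value, which is what the commit message quoted above suggests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: illustrates the proposed workaround.
class PageSeparatorSketch {

    // Form feed (U+000C), the separator Tesseract 4.0 emits by default.
    static final char FORM_FEED = '\f';

    // Builds a tesseract argument list that always sets page_separator
    // explicitly. Passing "" is assumed to suppress the separator.
    static List<String> buildCommand(String imagePath, String outputBase,
                                     String pageSeparator) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(imagePath);
        cmd.add(outputBase);
        cmd.add("-c");
        cmd.add("page_separator=" + pageSeparator);
        return cmd;
    }

    // Defensive fallback: removes every form feed from OCR text, for cases
    // where the option cannot be set on the Tesseract side.
    static String stripFormFeeds(String ocrText) {
        return ocrText.replace(String.valueOf(FORM_FEED), "");
    }

    public static void main(String[] args) {
        // Default behaviour: empty separator, matching pre-4.0 output.
        System.out.println(buildCommand("blank.png", "out", ""));
        // A blank image that yields only FF becomes the empty string again.
        System.out.println(stripFormFeeds("\f").isEmpty());
    }
}
```

The argument list could be handed to java.lang.ProcessBuilder, which passes each element to the process as-is, so the empty value needs no shell quoting.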