Tika 0.8 line break removal is faulty (misses space when concatenating lines) for PDF file
------------------------------------------------------------------------------------------
Key: TIKA-583
URL: https://issues.apache.org/jira/browse/TIKA-583
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.8
Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4
Reporter: Dennis Adler

The included PDF (a legal filing from the web), when parsed by Tika 0.7, has the following as its first several lines of plain text:

------- start ---------------
IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON DIVISION ONE SERGEY SAVCHUK, ) ) No. 64269-3-I Appellant, ) v. ) ) UNPUBLISHED OPINION STEVEN G. JERDE and ) DARLYCE J. JERDE, husband and wife ) ) Respondents. ) _______________________________ ) FILED: November 1, 2010
--------------- end ---------------------

Tika 0.8 has this instead:

-------------- start ---------------------
IN THE COURT OF APPEALS OF THE STATE OF WASHINGTONDIVISION ONESERGEYSAVCHUK,))No. 64269-3-IAppellant,)v.))UNPUBLISHED OPINIONSTEVENG. JERDE and )DARLYCE J. JERDE, husband and wife))Respondents.)_______________________________ )FILED: November 1, 2010schindler, j
--------------- end ---------------------

Notice that, as part of the improved paragraph breaking for PDF files, the lines of the document's "header" were concatenated without spaces in between, creating run-on words (e.g. "WASHINGTONDIVISION" and "ONESERGEYSAVCHUK"). See the original PDF for more details and compare it to the extracted text.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
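The symptom described in the report can be illustrated with a small, self-contained sketch. This is not Tika's or PDFBox's actual code: the class `LineJoinSketch` and both methods are hypothetical, and the split of the header into two lines is an assumption inferred from the run-on word "WASHINGTONDIVISION". It only contrasts joining lines with no separator against joining them with a single space.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not taken from Tika or PDFBox.
class LineJoinSketch {

    // Joins lines with no separator, which produces run-on words.
    static String joinNaive(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line.trim());
        }
        return sb.toString();
    }

    // Joins lines with exactly one space between non-empty lines.
    static String joinWithSpaces(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines so no double spaces appear
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(trimmed);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed line split, inferred from the run-on word in the report.
        List<String> header = Arrays.asList(
            "IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON",
            "DIVISION ONE");
        System.out.println(joinNaive(header));      // ends in "WASHINGTONDIVISION ONE"
        System.out.println(joinWithSpaces(header)); // ends in "WASHINGTON DIVISION ONE"
    }
}
```

Whether a fix should insert a space or preserve the original line break for header-like text is a separate question that this sketch does not address.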