[ https://issues.apache.org/jira/browse/TIKA-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-583.
--------------------------------

    Resolution: Duplicate
      Assignee: Jukka Zitting

This is a duplicate of TIKA-548, fixed in trunk.

> Tika 0.8 line break removal is faulty (misses space when concatenating lines)
> for PDF file
> ------------------------------------------------------------------------------------------
>
>                 Key: TIKA-583
>                 URL: https://issues.apache.org/jira/browse/TIKA-583
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.8
>         Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4
>            Reporter: Dennis Adler
>            Assignee: Jukka Zitting
>         Attachments: Savchuk v. Jerde.pdf
>
>
> The included PDF (a legal filing from the web) when parsed by Tika 0.7 has
> the following as its first several lines of plain text:
> ------- start ---------------
> IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON
> DIVISION ONE
> SERGEY SAVCHUK, )
> ) No. 64269-3-I
> Appellant, )
> v. )
> ) UNPUBLISHED OPINION
> STEVEN G. JERDE and )
> DARLYCE J. JERDE, husband and wife )
> )
> Respondents. )
> _______________________________ ) FILED: November 1, 2010
> --------------- end ---------------------
> Tika 0.8 has this instead:
> -------------- start ---------------------
> IN THE COURT OF APPEALS OF THE STATE OF WASHINGTONDIVISION
> ONESERGEYSAVCHUK,))No. 64269-3-IAppellant,)v.))UNPUBLISHED OPINIONSTEVENG.
> JERDE and )DARLYCE J. JERDE, husband and
> wife))Respondents.)_______________________________ )FILED: November 1,
> 2010schindler, j
> --------------- end ---------------------
> Notice that as part of the improved paragraph breaking for PDF files, the
> "header" of the document had lines catenated together without spaces in
> between, creating run-on words (e.g. "WASHINGTONDIVISION" and
> "ONESERGEYSAVCHUK"). See the original PDF for more details and compare to the
> text.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
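
For illustration only, the run-on words described in the report are what one
would expect if extracted lines are concatenated with no separator. The sketch
below is not the Tika or PDFBox code and not the TIKA-548 fix; LineJoiner,
joinLinesNaive and joinLines are hypothetical names, and the input is the first
three lines of the Tika 0.7 output quoted above. It only contrasts a join that
drops the separator with one that inserts a single space.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Tika or PDFBox.
class LineJoiner {

    // Concatenates trimmed lines with no separator, which produces
    // run-on words at every former line break.
    public static String joinLinesNaive(List<String> lines) {
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            out.append(line.trim());
        }
        return out.toString();
    }

    // Concatenates trimmed, non-blank lines with exactly one space
    // between them, so words on adjacent lines stay separated.
    public static String joinLines(List<String> lines) {
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines rather than emit extra spaces
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(trimmed);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList(
                "IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON",
                "DIVISION ONE",
                "SERGEY SAVCHUK, )");
        System.out.println(joinLinesNaive(header));
        System.out.println(joinLines(header));
    }
}
```

The first call shows "WASHINGTONDIVISION" fused into one token, as in the
reported Tika 0.8 output; the second keeps the words apart. A real text
extractor would likely also need to handle hyphenation and column layout,
which this sketch ignores.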