[ https://issues.apache.org/jira/browse/TIKA-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-583.
--------------------------------

    Resolution: Duplicate
      Assignee: Jukka Zitting

This is a duplicate of TIKA-548, fixed in trunk.

> Tika 0.8 line break removal is faulty (misses space when concatenating lines) for PDF file
> ------------------------------------------------------------------------------------------
>
>                 Key: TIKA-583
>                 URL: https://issues.apache.org/jira/browse/TIKA-583
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.8
>         Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4 
>            Reporter: Dennis Adler
>            Assignee: Jukka Zitting
>         Attachments: Savchuk v. Jerde.pdf
>
>
> The attached PDF (a legal filing from the web), when parsed by Tika 0.7, 
> yields the following as its first several lines of plain text:
> ------- start ---------------
> IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON
> DIVISION ONE
>   SERGEY SAVCHUK, )
>  ) No. 64269-3-I
>  Appellant, )
>  v. )
>  ) UNPUBLISHED OPINION
>  STEVEN G. JERDE and )
>  DARLYCE J. JERDE, husband and wife )
> )
>  Respondents. )
>  _______________________________  ) FILED: November 1, 2010
> --------------- end ---------------------
> Tika 0.8 has this instead:
> -------------- start ---------------------
> IN THE COURT OF APPEALS OF THE STATE OF WASHINGTONDIVISION 
> ONESERGEYSAVCHUK,))No. 64269-3-IAppellant,)v.))UNPUBLISHED OPINIONSTEVENG. 
> JERDE and )DARLYCE J. JERDE, husband and 
> wife))Respondents.)_______________________________  )FILED: November 1, 
> 2010schindler, j
> --------------- end ---------------------
> Note that, as part of the improved paragraph breaking for PDF files, the 
> lines of the document "header" were concatenated without spaces in 
> between, creating run-on words (e.g. "WASHINGTONDIVISION" and 
> "ONESERGEYSAVCHUK"). Compare the original PDF with the extracted text for 
> more details.
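
For illustration only, here is a minimal sketch of the failure mode described in the report. It is not the Tika or PDFBox code; the class and method names (LineJoinSketch, joinWithoutSeparator, joinWithSpace) are hypothetical, and the sketch assumes the extracted lines are already available as plain strings. It contrasts joining lines back to back, which yields run-on words, with joining them with a single space.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not taken from Tika or PDFBox.
public class LineJoinSketch {

    // Faulty behavior: lines are appended back to back, so the last word
    // of one line fuses with the first word of the next.
    static String joinWithoutSeparator(List<String> lines) {
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            out.append(line.trim());
        }
        return out.toString();
    }

    // Expected behavior: a single space separates consecutive non-empty lines.
    static String joinWithSpace(List<String> lines) {
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines so no double spaces appear
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(trimmed);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // First two lines of the sample text quoted in the report.
        List<String> lines = Arrays.asList(
                "IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON",
                "DIVISION ONE");
        System.out.println(joinWithoutSeparator(lines));
        System.out.println(joinWithSpace(lines));
    }
}
```

The first call reproduces the "WASHINGTONDIVISION" run-on from the report; the second keeps the words apart.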

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.