[ 
https://issues.apache.org/jira/browse/TIKA-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983454#action_12983454
 ] 

Dennis Adler commented on TIKA-583:
-----------------------------------

Ken, I tried replacing the three PDFBox 1.3.1 JARs (fontbox, jempbox, pdfbox) on
my classpath with the 1.1.0 versions from Tika 0.7. Every PDF I tested failed
with a "null" error, so the old PDFBox code does not seem to work with Tika 0.8.

> Tika 0.8 line break removal is faulty (misses space when concatenating lines) 
> for PDF file
> ------------------------------------------------------------------------------------------
>
>                 Key: TIKA-583
>                 URL: https://issues.apache.org/jira/browse/TIKA-583
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.8
>         Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4 
>            Reporter: Dennis Adler
>         Attachments: Savchuk v. Jerde.pdf
>
>
> The included PDF (a legal filing from the web) when parsed by Tika 0.7 has 
> the following as its first several lines of plain text:
> ------- start ---------------
> IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON
> DIVISION ONE
>   SERGEY SAVCHUK, )
>  ) No. 64269-3-I
>  Appellant, )
>  v. )
>  ) UNPUBLISHED OPINION
>  STEVEN G. JERDE and )
>  DARLYCE J. JERDE, husband and wife )
> )
>  Respondents. )
>  _______________________________  ) FILED: November 1, 2010
> --------------- end ---------------------
> Tika 0.8 has this instead:
> -------------- start ---------------------
> IN THE COURT OF APPEALS OF THE STATE OF WASHINGTONDIVISION 
> ONESERGEYSAVCHUK,))No. 64269-3-IAppellant,)v.))UNPUBLISHED OPINIONSTEVENG. 
> JERDE and )DARLYCE J. JERDE, husband and 
> wife))Respondents.)_______________________________  )FILED: November 1, 
> 2010schindler, j
> --------------- end ---------------------
> Notice that as part of the improved paragraph breaking for PDF files, the 
> "header" of the document had lines catenated together without spaces in 
> between, creating run-on words (e.g. "WASHINGTONDIVISION" and 
> "ONESERGEYSAVCHUK"). See the original PDF for more details and compare to the 
> text.
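
For illustration only, here is a minimal Java sketch of the behavior the report
expects when line breaks are removed: adjacent lines are joined with a single
space so that words do not run together. This is not the Tika or PDFBox code;
the class name LineJoiner and the method joinLines are hypothetical, and the
sample lines are taken from the Tika 0.7 output quoted above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika or PDFBox.
class LineJoiner {

    // Joins extracted text lines into one string, putting exactly one space
    // between consecutive non-empty lines so no run-on words are created.
    static String joinLines(List<String> lines) {
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines entirely
            }
            if (out.length() > 0) {
                out.append(' '); // the separator missing from the 0.8 output
            }
            out.append(trimmed);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON",
                "DIVISION ONE",
                "  SERGEY SAVCHUK, )");
        System.out.println(joinLines(lines));
    }
}
```

With the three sample lines, the joined text keeps "WASHINGTON DIVISION" and
"ONE SERGEY" as separate words instead of "WASHINGTONDIVISION" and
"ONESERGEYSAVCHUK".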

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
