Tika 0.8 line break removal is faulty (misses space when concatenating lines) 
for PDF file
------------------------------------------------------------------------------------------

                 Key: TIKA-583
                 URL: https://issues.apache.org/jira/browse/TIKA-583
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.8
         Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4 
            Reporter: Dennis Adler


When parsed by Tika 0.7, the included PDF (a legal filing from the web) yields
the following as its first several lines of plain text:
------- start ---------------
IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON
DIVISION ONE
  SERGEY SAVCHUK, )
 ) No. 64269-3-I
 Appellant, )
 v. )
 ) UNPUBLISHED OPINION
 STEVEN G. JERDE and )
 DARLYCE J. JERDE, husband and wife )
)
 Respondents. )
 _______________________________  ) FILED: November 1, 2010
--------------- end ---------------------

Tika 0.8 has this instead:
-------------- start ---------------------
IN THE COURT OF APPEALS OF THE STATE OF WASHINGTONDIVISION 
ONESERGEYSAVCHUK,))No. 64269-3-IAppellant,)v.))UNPUBLISHED OPINIONSTEVENG. 
JERDE and )DARLYCE J. JERDE, husband and 
wife))Respondents.)_______________________________  )FILED: November 1, 
2010schindler, j
--------------- end ---------------------

Notice that, as part of the improved paragraph breaking for PDF files in 0.8,
the lines of the document "header" are concatenated with no space between
them, creating run-on words (e.g. "WASHINGTONDIVISION" and
"ONESERGEYSAVCHUK"). Compare the extracted text against the original PDF for
more details.
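
To illustrate the expected behavior, here is a minimal sketch of joining lines
so that a single space is inserted wherever neither side of the seam already
has whitespace. This is not the actual Tika or PDFBox code; the class
LineJoiner and its method joinLines are hypothetical names used only for this
example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika or PDFBox.
final class LineJoiner {

    /**
     * Joins lines into one string, inserting a single space at each seam
     * unless the text so far ends with, or the next line starts with,
     * whitespace. Empty lines are skipped.
     */
    static String joinLines(List<String> lines) {
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // nothing to append for an empty line
            }
            boolean needsSpace = out.length() > 0
                    && !Character.isWhitespace(out.charAt(out.length() - 1))
                    && !Character.isWhitespace(line.charAt(0));
            if (needsSpace) {
                out.append(' ');
            }
            out.append(line);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // First two header lines from the Tika 0.7 output quoted above.
        List<String> header = Arrays.asList(
                "IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON",
                "DIVISION ONE");
        System.out.println(joinLines(header));
    }
}
```

With this rule the first two header lines come out as "... WASHINGTON DIVISION
ONE" rather than "... WASHINGTONDIVISION ONE".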


