PDF content extracted as single line
------------------------------------

                 Key: TIKA-548
                 URL: https://issues.apache.org/jira/browse/TIKA-548
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.8
            Reporter: Staffan Olsson

Rev 1029510 introduces a regression in PDF content parsing, now present in 0.8 RC. Paragraphs from the PDF are no longer separated by newline. This is a problem both for reading and for indexing. See the attached test.

Note that it seems like PDFBox 1.3.1 extracts correctly, at least from command line. Here's from a sample file with a headline followed by a one word paragraph:

$> java -jar pdfbox-app-1.3.1.jar ExtractText -console docs/shortpdf.pdf
1 - untitled 3 - 2010-02-13 09:52 - Staffan Olsson
PDF Title For Short Document
veryshortpdfcontents

But Tika prints:

$> java -jar tika-app-0.9-20101110.175016-3.jar docs/shortpdf.pdf
...
<p>1 - untitled 3 - 2010-02-13 09:52 - Staffan OlssonPDF Title For Short Documentveryshortpdfcontents</p>
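
For illustration only, the difference between the two outputs above can be expressed as a small check. This is a sketch and not the attached test: the class ParagraphCheck and the method paragraphsSeparated are hypothetical names, and the two input strings are copied by hand from the command output in this report instead of being produced by calling PDFBox or Tika.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParagraphCheck {

    // Hypothetical helper: returns true when every expected paragraph
    // appears as a line of its own in the extracted text
    // (surrounding whitespace is ignored).
    public static boolean paragraphsSeparated(String extracted, List<String> paragraphs) {
        List<String> lines = Arrays.stream(extracted.split("\\R"))
                .map(String::trim)
                .collect(Collectors.toList());
        return lines.containsAll(paragraphs);
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList(
                "PDF Title For Short Document",
                "veryshortpdfcontents");

        // Text as printed by PDFBox 1.3.1 in the report (copied by hand).
        String pdfboxText = "1 - untitled 3 - 2010-02-13 09:52 - Staffan Olsson\n"
                + "PDF Title For Short Document\n"
                + "veryshortpdfcontents\n";

        // Text inside the <p> element printed by Tika in the report (copied by hand).
        String tikaText = "1 - untitled 3 - 2010-02-13 09:52 - Staffan Olsson"
                + "PDF Title For Short Documentveryshortpdfcontents";

        System.out.println(paragraphsSeparated(pdfboxText, expected)); // prints "true"
        System.out.println(paragraphsSeparated(tikaText, expected));   // prints "false"
    }
}
```

A regression test along these lines would presumably feed the text extracted from docs/shortpdf.pdf into such a check, so that it fails whenever the headline and the paragraph end up on the same line.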