[ https://issues.apache.org/jira/browse/TIKA-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933536#action_12933536 ]
Staffan Olsson commented on TIKA-548:
-------------------------------------

Verified to work with Solr. Thanks for the fix.

> PDF content extracted as single line
> ------------------------------------
>
>                 Key: TIKA-548
>                 URL: https://issues.apache.org/jira/browse/TIKA-548
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.8
>            Reporter: Staffan Olsson
>            Assignee: Jukka Zitting
>             Fix For: 0.9
>
>         Attachments: tika-PDF-content-regression-test.patch
>
>
> Rev 1029510 introduces a regression in PDF content parsing, now present in
> 0.8 RC. Paragraphs from the PDF are no longer separated by newline. This is a
> problem both for reading and for indexing. See the attached test.
> Note that it seems like PDFBox 1.3.1 extracts correctly, at least from
> command line. Here's from a sample file with a headline followed by a one
> word paragraph:
> $> java -jar pdfbox-app-1.3.1.jar ExtractText -console docs/shortpdf.pdf
> 1 - untitled 3 - 2010-02-13 09:52 - Staffan Olsson
> PDF Title For Short Document
> veryshortpdfcontents
> But Tika prints:
> $> java -jar tika-app-0.9-20101110.175016-3.jar docs/shortpdf.pdf
> ...
> <p>1 - untitled 3 - 2010-02-13 09:52 - Staffan OlssonPDF Title For Short Documentveryshortpdfcontents</p>

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.