[ https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe Schindler closed SOLR-5124.
-------------------------------
    Resolution: Duplicate

> Solr glues words when parsing PDFs under certain circumstances
> --------------------------------------------------------------
>
>                 Key: SOLR-5124
>                 URL: https://issues.apache.org/jira/browse/SOLR-5124
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>    Affects Versions: 4.4
>         Environment: Windows 7 (don't think this is relevant)
>            Reporter: Christoph Straßer
>            Priority: Minor
>              Labels: tika,text-extraction
>         Attachments: 01_alz_2009_folge11_2009_05_28.pdf, 02_PDF.png,
> 03_TikaOutput_GUI_MainContent.png, 03_TikaOutput_GUI_PlainText.png,
> 03_TikaOutput_GUI_StructuredText.png, 03_TikaOutput.png, 04_Solr.png
>
>
> For some kinds of PDF documents, Solr glues words together at line breaks
> under some circumstances (e.g. the last word of line 1 and the first word
> of line 2 are merged into one word).
> (Stand-alone) Tika extracts the text correctly. Attached you will find one
> sample PDF and screenshots of the Tika output and of the corrupted content
> indexed by Solr.
> (This issue does not occur with all PDF documents. I tried to recreate the
> issue with new Word documents that I converted into PDF in multiple ways,
> without success.) The attached PDF document has a really weird internal
> structure, but Tika seems to do its work right, even with this weird
> document.
> In our Solr indices we have a good number of these weird documents. This
> results in worse suggestions from the Suggester.