I am using Solr 3.3.0 with the bundled Jetty server. When I upload MS Word documents or PDF files, the extracted text is not formatted properly.
1. There are no line breaks between sentences. The text is extracted as a single line (one string).
2. Wherever there are boxes in Word documents, strange characters appear in their place.

How do I keep the formatting of the text just as it is in the document? For example, if there are 3 line breaks, how do I preserve them?

Also, "?" characters appear in the extracted text when I upload Word documents. Where is the issue?

Thanks