I am using Solr 3.3.0 with the bundled Jetty server. When I upload MS Word
documents or PDF files, the extracted text is not formatted properly.

1. There are no line breaks between sentences. The text is extracted as a
single line or string.

2. Wherever there are boxes in Word documents, some weird characters appear
in their place.

How do I keep the formatting of the text just as it is in the document? For
example, if there are 3 line breaks, how do I maintain them?

Also, ? characters appear in the text when uploading Word documents. Where is
the issue?

Thanks

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Issue-in-text-extraction-in-Solr-Tika-tp3267810p3267810.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
