Can you post some example docs that don't extract correctly? Or, better, open a Jira issue (or several) and attach the documents there?

Thanks,
Mike McCandless
http://blog.mikemccandless.com

On Fri, Aug 19, 2011 at 7:49 AM, nirnaydewan <nirnayde...@gmail.com> wrote:
> I am using Solr 3.3.0 with the attached Jetty server. When I upload MS Word
> documents or PDF files, the text is not formatted properly.
>
> 1. There are no line breaks between sentences. The text is extracted as a
> single line or string.
>
> 2. Wherever there are boxes in Word documents, some weird characters appear
> in their place.
>
> How do I keep the formatting of the text just as it is in the document? For
> example, if there are 3 line breaks, how do I maintain them?
>
> Also, "?" characters appear in the text when uploading Word documents. Where
> is the issue?
>
> Thanks
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Issue-in-text-extraction-in-Solr-Tika-tp3267810p3267810.html
> Sent from the Apache Tika - Development mailing list archive at Nabble.com.