I ran Tika to get the text:

    java -jar ./tika-app/target/tika-app-1.0-SNAPSHOT.jar -T 2011-01-23-7-22-09_sample.doc

And it produces this output for me:

    +9245114107060 (M) E-Mail: coolgaas.1...@rediffmail.com To enhance the organizational development by self development and motivation from the organizational atmosphere. Hence to involve myself as an effective personnel in this field with my skill, potential, talents with dedication. Working Experience DESIGNATION : Relationship Manager Computer Awareness Office Packages : MS-OFFICE. ACADEMIC CREDENTIALS Completed MBA in the year 2010 in MAKETING & RETAIL as major under RAI BUSINESS SCHOOL (67% Marks, overall). Completed GRADUATION in BSc with 50% marks under WBCHSE Extra Curriculum Activities In my Graduation level I was leading my College Cricket team PERSONAL DETAILS Declaration :

It looks like some text is missing. The Word doc starts with NAMITGOP SAHAD, but that is not in the output above (strangely, if I get the XHTML output instead, I do see that text); various other text seems to be missing too. Do you see that?

As for the formatting, the output seems to have retained some of it (I don't get only a single line). But how are you trying to highlight? Are you displaying the Tika-filtered text to the user? Can you try the XHTML output?

Mike McCandless

http://blog.mikemccandless.com

On Fri, Aug 19, 2011 at 3:32 PM, nirnaydewan <nirnayde...@gmail.com> wrote:
> Thanks for your suggestion, Mike. Attached is the MS Word file.
>
> What happens is that I get a single line of text, but I want it to be
> formatted as it is, so that I can display it with highlighting.
>
> Thanks
>
> http://lucene.472066.n3.nabble.com/file/n3269071/2011-01-23-7-22-09_sample.doc
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Issue-in-text-extraction-in-Solr-Tika-tp3267810p3269071.html
> Sent from the Apache Tika - Development mailing list archive at Nabble.com.
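
On the suggestion above to try the XHTML output: here is a minimal Python sketch of how paragraph structure could be recovered from XHTML and reused for a highlighted display. It assumes the XHTML wraps each paragraph in a <p> element in the standard XHTML namespace. SAMPLE_XHTML is a hypothetical stand-in, not the real output for the attached file, and paragraphs() and highlight() are illustrative helper names, not part of Tika or Solr.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumption: the XHTML uses the standard XHTML namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical sample, NOT the actual output for the attached .doc file.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title></head>
<body>
<p>Working Experience</p>
<p>DESIGNATION : Relationship Manager</p>
<p>Office Packages : MS-OFFICE.</p>
</body>
</html>"""


def paragraphs(xhtml):
    """Return the text of each non-empty <p> element, one entry per paragraph."""
    root = ET.fromstring(xhtml)
    result = []
    for p in root.iter(XHTML_NS + "p"):
        text = "".join(p.itertext()).strip()
        if text:
            result.append(text)
    return result


def highlight(text, term):
    """Escape text for HTML and wrap case-insensitive matches of term in <em>."""
    if not term:
        return escape(text)
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    parts = []
    pos = 0
    for match in pattern.finditer(text):
        # Escape the unmatched text before the hit, then wrap the hit itself.
        parts.append(escape(text[pos:match.start()]))
        parts.append("<em>" + escape(match.group(0)) + "</em>")
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)


if __name__ == "__main__":
    # One output line per source paragraph, so the line structure is kept.
    for line in paragraphs(SAMPLE_XHTML):
        print(highlight(line, "manager"))
```

Working from per-paragraph text rather than a single flattened string keeps one display line per source paragraph, which is the formatting the original poster wanted to preserve while highlighting.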