I ran Tika to get the text:

  > java -jar ./tika-app/target/tika-app-1.0-SNAPSHOT.jar -T 2011-01-23-7-22-09_sample.doc

And it produces this output for me:

+9245114107060 (M) E-Mail: coolgaas.1...@rediffmail.com
To enhance the organizational development by self development and
motivation from the organizational atmosphere. Hence to involve myself
as an effective personnel in this field with my skill, potential,
talents with dedication.
Working Experience
DESIGNATION : Relationship Manager
Computer Awareness
Office Packages                                       :        MS-OFFICE.
ACADEMIC CREDENTIALS
      Completed MBA in the year 2010 in MAKETING & RETAIL as major
under RAI  BUSINESS SCHOOL (67% Marks, overall).
        Completed GRADUATION in BSc with 50% marks under WBCHSE

Extra Curriculum Activities
In my Graduation level I was leading my College Cricket team
PERSONAL DETAILS
Declaration :


It looks like some text is missing.  The Word doc starts with
NAMITGOP SAHAD, but that isn't in the above text (strangely, if I get
the XHTML output instead, I do see it); various other text seems to
be missing too.  Do you see that?

As for the formatting, the output does seem to retain some of it
(I don't get just a single line).  But how are you trying to
highlight?  Are you displaying Tika's filtered text output to the
user?  Can you try the XHTML output?
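
If the XHTML output does contain the text you need, one way to keep the
paragraph breaks for display is to walk the XHTML and pull out each <p>
element separately.  Here is a minimal sketch of that idea using only the
JDK's XML parser.  The class name XhtmlParagraphs, the extract method, and
the sample XHTML string are all hypothetical (the sample is not real Tika
output for your document), and it assumes the XHTML is well-formed XML with
no DOCTYPE or HTML-only entities:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: splits XHTML into one string per non-empty <p>.
public class XhtmlParagraphs {

    // Returns the trimmed text of each non-empty <p> element, in document order.
    public static List<String> extract(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Namespace-unaware parsing lets us match "p" by its plain tag name.
        factory.setNamespaceAware(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        NodeList paragraphs = doc.getElementsByTagName("p");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < paragraphs.getLength(); i++) {
            String text = paragraphs.item(i).getTextContent().trim();
            if (!text.isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Made-up input for illustration only; substitute the real XHTML output.
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<p>Working Experience</p>"
                + "<p>DESIGNATION : Relationship Manager</p>"
                + "</body></html>";
        for (String paragraph : extract(sample)) {
            System.out.println(paragraph);
        }
    }
}
```

Each returned string could then be highlighted and rendered as its own
block, so the layout no longer depends on how the plain-text output
happens to wrap lines.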

Mike McCandless

http://blog.mikemccandless.com

On Fri, Aug 19, 2011 at 3:32 PM, nirnaydewan <nirnayde...@gmail.com> wrote:
> Thanks for your suggestion Mike.  Attached is the ms word file.
>
> What happens is that I get a single line of text, but I want it to keep its
> original formatting so that I can display it with highlighting.
>
>
> Thanks
>
> http://lucene.472066.n3.nabble.com/file/n3269071/2011-01-23-7-22-09_sample.doc
> 2011-01-23-7-22-09_sample.doc
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Issue-in-text-extraction-in-Solr-Tika-tp3267810p3269071.html
> Sent from the Apache Tika - Development mailing list archive at Nabble.com.
>
