OK, one correction: I ran the TikaCLI tool with the -T option, which extracts "main content only"; when I re-ran with the -t (lowercase) option, which outputs all plain text, it looks like all the text appears correctly (phew!).
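To make the difference between the two flags concrete, here is a rough Python sketch that builds the two TikaCLI command lines and can run them. The jar name (tika-app-0.9.jar), the input file name (sample.doc), and the helper names tika_command and extract_text are assumptions for illustration only; just the -t and -T flags come from TikaCLI itself.

```python
import subprocess
from pathlib import Path

# Assumed location of the Tika app jar; adjust to wherever yours lives.
TIKA_JAR = Path("tika-app-0.9.jar")


def tika_command(document, main_content_only=False, jar=TIKA_JAR):
    """Build the TikaCLI command line for plain-text extraction.

    -t prints all plain text; -T prints only what Tika treats as main content.
    """
    flag = "-T" if main_content_only else "-t"
    return ["java", "-jar", str(jar), flag, str(document)]


def extract_text(document, main_content_only=False, jar=TIKA_JAR):
    """Run TikaCLI (requires Java and the jar) and return its stdout."""
    result = subprocess.run(
        tika_command(document, main_content_only, jar),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    doc = "sample.doc"  # hypothetical input file
    # Only print the commands here; call extract_text(doc) to actually run them.
    print(" ".join(tika_command(doc)))
    print(" ".join(tika_command(doc, main_content_only=True)))
```

Comparing the output of extract_text(doc) against extract_text(doc, main_content_only=True) on the same file should show which text the "main content" mode leaves out.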
On moving to 0.9, that's your call -- I'm not sure what's changed since then, but presumably it is better than 0.8!

Displaying the equivalent of "-t" from the TikaCLI tool seems like a good approach? Especially because the XHTML output incorrectly breaks up the SAHAD from your document.

Mike McCandless

http://blog.mikemccandless.com

On Sat, Aug 20, 2011 at 1:07 AM, nirnaydewan <nirnayde...@gmail.com> wrote:
> First of all thanks again Mike for helping me out.
>
> Yes, i have seen that, some text do get stripped out sometimes. Any idea as
> to why this could be happening?
>
> I am using the bundled Solr 3.3.0 which comes with Tika 0.8. Should i move
> to 0.9? if so how?
>
> Also i am storing this text only which i am trying to display. If the xhtml
> produces the correct text, how do i store it instead?
>
>
> Thanks
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Issue-in-text-extraction-in-Solr-Tika-tp3267810p3269982.html
> Sent from the Apache Tika - Development mailing list archive at Nabble.com.