One thing I still don't like: with the XML (-x) or XHTML (-h) output, the
resulting filtered output incorrectly splits up a word. The doc has:

    NAMITGOP SAHAD

But in the XML/XHTML it looks like this:

    <p> <b>NAMITGOP</b> <b> SAHA</b> <b>D</b> </p>

I.e., SAHAD became SAHA and D, separated. I think this is a bug and I think I
know why it's happening... I'll open an issue.

Mike McCandless

http://blog.mikemccandless.com

On Sat, Aug 20, 2011 at 6:40 AM, Michael McCandless
<luc...@mikemccandless.com> wrote:
> OK one correction: I ran the TikaCLI tool with the -T option, which
> extracts "main content only"; when I re-ran with the -t (lowercase)
> option, which outputs all plain text, then it looks like all text
> appears correctly (phew!).
>
> On moving to 0.9, that's your call -- I'm not sure what's changed
> since then, but presumably it is better than 0.8!
>
> Displaying the equivalent of "-t" from the TikaCLI tool seems like a
> good approach? Especially because the XHTML output incorrectly breaks
> up the SAHAD from your document.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Sat, Aug 20, 2011 at 1:07 AM, nirnaydewan <nirnayde...@gmail.com> wrote:
>> First of all, thanks again Mike for helping me out.
>>
>> Yes, I have seen that; some text does get stripped out sometimes. Any idea
>> as to why this could be happening?
>>
>> I am using the bundled Solr 3.3.0, which comes with Tika 0.8. Should I move
>> to 0.9? If so, how?
>>
>> Also, I am storing only this text, which I am trying to display. If the
>> XHTML produces the correct text, how do I store it instead?
>>
>>
>> Thanks
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Issue-in-text-extraction-in-Solr-Tika-tp3267810p3269982.html
>> Sent from the Apache Tika - Development mailing list archive at Nabble.com.
>>
>
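
For anyone who wants to see the split concretely, here is a minimal sketch in Python (standard library only, no Tika involved) that parses the XHTML fragment quoted at the top of this message and lists the separate bold runs. The whitespace between the elements is an assumption taken from the fragment as quoted; the real TikaCLI output may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Fragment as quoted in the message; the inter-element whitespace is an
# assumption, since the exact layout of the real output is not shown.
fragment = "<p> <b>NAMITGOP</b> <b> SAHA</b> <b>D</b> </p>"

root = ET.fromstring(fragment)

# Each <b> element is one separate text run in the XHTML output.
runs = [b.text for b in root.iter("b")]
print(runs)

# Whitespace-normalised text of the paragraph, roughly what a consumer
# that collapses whitespace would end up with.
text = " ".join("".join(root.itertext()).split())
print(text)
```

With the fragment laid out this way, the trailing "D" is its own run, so any consumer that treats the whitespace between elements as a word break sees two tokens instead of the original "SAHAD".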