Ahhh.... what threw me off was the browser rendering, which turns that newline into a space, so I see "SAHA D".
Hmm, actually: the <p> element allows text in addition to child elements, right? So shouldn't any whitespace within the <p>...</p> be treated as significant (part of the content)? I need to go learn XML's whitespace rules :)

Mike McCandless

http://blog.mikemccandless.com

On Sat, Aug 20, 2011 at 8:39 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> From the XML point of view, it's not separated. It's just in two elements, but
> no whitespace in-between, according to parsing standards (see XML whitespace
> rules).
>
> Uwe
> --
> Uwe Schindler
> H.-H.-Meier-Allee 63, 28213 Bremen
> http://www.thetaphi.de
>
>
>
> Michael McCandless <luc...@mikemccandless.com> wrote:
>
> One thing I still don't like is with the XML (-x) or XHTML (-h)
> output, the result filtered output incorrectly splits up a word. The
> doc has:
>
> NAMITGOP SAHAD
>
> But in the XML/XHTML it looks like this:
>
> <p>
> <b>NAMITGOP</b>
> <b> SAHA</b>
> <b>D</b>
> </p>
>
> I.e., SAHAD became SAHA and D, separated.
>
> I think this is a bug and I think I know why it's happening... I'll
> open an issue.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Sat, Aug 20, 2011 at 6:40 AM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>> OK one correction: I ran the TikaCLI tool with the -T option, which
>> extracts "main content only"; when I re-ran with the -t (lowercase)
>> option, which outputs all plain text, then it looks like all text
>> appears correctly (phew!).
>>
>> On moving to 0.9, that's your call -- I'm not sure what's changed
>> since then, but presumably it is better than 0.8!
>>
>> Displaying the equivalent of "-t" from the TikaCLI tool seems like a
>> good approach? Especially because the XHTML output incorrectly breaks
>> up the SAHAD from your document.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Sat, Aug 20, 2011 at 1:07 AM, nirnaydewan <nirnayde...@gmail.com> wrote:
>>> First of all thanks again Mike for helping me out.
>>>
>>> Yes, I have seen that; some text does get stripped out sometimes. Any idea as
>>> to why this could be happening?
>>>
>>> I am using the bundled Solr 3.3.0 which comes with Tika 0.8. Should I move
>>> to 0.9? If so, how?
>>>
>>> Also, I am storing only this text, which I am trying to display. If the XHTML
>>> produces the correct text, how do I store it instead?
>>>
>>>
>>> Thanks
>>>
>>>
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/Issue-in-text-extraction-in-Solr-Tika-tp3267810p3269982.html
>>> Sent from the Apache Tika - Development mailing list archive at Nabble.com.
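
On the whitespace question raised at the top of this thread, a small check may help. The sketch below is not Tika code: it runs Python's standard xml.etree.ElementTree over the <p> fragment quoted above. The helper names text_content and tails, and the variant without newlines, are made up for illustration. It assumes the serialized output really does contain a newline between the <b> elements, as it appears in the quoted mail.

```python
import xml.etree.ElementTree as ET

# The fragment as quoted in this thread, assuming a newline really sits
# between each pair of adjacent elements in the serialized output.
WITH_NEWLINES = "<p>\n<b>NAMITGOP</b>\n<b> SAHA</b>\n<b>D</b>\n</p>"

# Hypothetical variant with no whitespace at all between the elements.
WITHOUT_NEWLINES = "<p><b>NAMITGOP</b><b> SAHA</b><b>D</b></p>"


def text_content(xml: str) -> str:
    """Concatenate every text node in document order, whitespace included."""
    return "".join(ET.fromstring(xml).itertext())


def tails(xml: str) -> list:
    """Return the character data that follows each <b> child inside <p>."""
    return [child.tail for child in ET.fromstring(xml)]


if __name__ == "__main__":
    print(repr(text_content(WITH_NEWLINES)))     # '\nNAMITGOP\n SAHA\nD\n'
    print(repr(text_content(WITHOUT_NEWLINES)))  # 'NAMITGOP SAHAD'
    print(tails(WITH_NEWLINES))                  # ['\n', '\n', '\n']
```

With this parser, the newline between </b> and <b> is reported as character data, so SAHA and D come back separated by it. What a consumer then does with that whitespace is an application-level decision; a browser, for instance, collapses it to a single space, which matches the "SAHA D" rendering described above.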