Does it really add this newline? That would be strange: if you look at XHTMLContentHandler, it does not add one. So the newline must come from somewhere else.
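One way to check how a newline inside a mixed-content element reaches a consumer is to feed a small document to a plain SAX parser and collect what it reports. The following is only a minimal sketch using the JDK's built-in parser, not Tika's code; the class name `WhitespaceProbe`, the method `collectText`, and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: shows what character data
// a non-validating SAX parser reports for a mixed-content element.
public class WhitespaceProbe {

    /** Returns all character data reported via characters() for the given XML. */
    public static String collectText(String xml) throws Exception {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Without a DTD the parser cannot classify whitespace as
                // ignorable, so newlines arrive here as ordinary text.
                out.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample: a newline sits directly before a child element.
        String xml = "<p>first line\n<b>second</b> line</p>";
        // Make the newline visible in the printed result.
        System.out.println(collectText(xml).replace("\n", "\\n"));
        // prints: first line\nsecond line
    }
}
```

In this sketch the newline before the child element is delivered as part of the text, so any newline that ends up in the serialized XML will be seen as content by a consumer built this way, wherever it was originally added.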
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Saturday, August 20, 2011 5:33 PM
> To: dev@tika.apache.org
> Subject: Re: Issue in text extraction in Solr / Tika
>
> On Sat, Aug 20, 2011 at 10:19 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> >> Hmm, actually: the <p> element allows text, in addition to child elements?
> >> So shouldn't any whitespace within the <p>...</p> be treated as
> >> significant (part of the content)?
> >
> > This is indeed very complicated. For mixed content elements, the
> > whitespace inside is preserved, but not next to child elements - very
> > stupid rules. If you once coded HTML you know this :-)
>
> Hmm... are you sure? :)
>
> Because, I've tried Firefox and Chrome and Safari, on the xml file, and all
> insert a space in rendering.
>
> Also, I tried Tika itself (feeding back the .xml it had created, to produce
> text) and it also inserts a space.
>
> I also tried JTidy and it inserts the space though it thinks it's parsing
> HTML so that may be an invalid test.
>
> Anyway... even if the strict XML white space rules state that this newline
> should not be counted as whitespace in the content, because so many tools
> seem not to do it correctly.... I think it's worth trying to fix Tika to not
> add this newline.
>
> Mike McCandless
>
> http://blog.mikemccandless.com