I keep forgetting about the parsechecker. I'll have to take a look and see what it kicks out.
And I've already changed Solr; I was just looking at what I could do with Nutch as well. Thanks.

On Tue, May 8, 2012 at 8:44 AM, Markus Jelsma <[email protected]> wrote:

> Hi
>
> Nutch should parse an HTML file with a .txt extension just as a normal
> HTML file, at least, here it does. What does your parserchecker say? In any
> case you must strip potential left-over HTML in your Solr analyzer, if left
> like this it's a bad XSS vulnerability.
>
> Cheers
>
>
> On Tue, 8 May 2012 08:34:58 -0400, Bai Shen <[email protected]>
> wrote:
>
>> Nutch ended up crawling some HTML files that had a TXT extension. Because
>> of this (I assume), it didn't strip out the HTML. So now I have weird
>> formatting on my results page.
>>
>> Is there a way to fix this on the Nutch side so it doesn't happen again?
>>
>

