Nutch ended up crawling some HTML files that had a .txt extension. Because of this (I assume), it treated them as plain text and didn't strip out the HTML, so now raw markup shows up as weird formatting on my results page.
Is there a way to fix this on the Nutch side so it doesn't happen again?