Re: HTML documents with TXT extension

Bai Shen Fri, 11 May 2012 11:06:31 -0700

I keep forgetting about the parsechecker.  I'll have to take a look and see
what it kicks out.


And I've already changed Solr; I was just looking at what I could do with
Nutch as well.

Thanks.

On Tue, May 8, 2012 at 8:44 AM, Markus Jelsma <[email protected]> wrote:

> Hi
>
> Nutch should parse an HTML file with a .txt extension just like a normal
> HTML file; at least, it does here. What does your parsechecker say? In any
> case, you must strip potential left-over HTML in your Solr analyzer. If left
> like this, it's a bad XSS vulnerability.
>
> Cheers
>
>
> On Tue, 8 May 2012 08:34:58 -0400, Bai Shen <[email protected]>
> wrote:
>
>> Nutch ended up crawling some HTML files that had a TXT extension.  Because
>> of this (I assume), it didn't strip out the HTML.  So now I have weird
>> formatting on my results page.
>>
>> Is there a way to fix this on the Nutch side so it doesn't happen again?
>>
>
>
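
For illustration, the advice above about stripping left-over HTML before it
reaches a results page can be sketched in a few lines of Python. This is only
a minimal sketch, not how Nutch or Solr do it internally: the names
strip_html and safe_for_results_page are hypothetical, and in Solr itself the
stripping is normally done in the field's analyzer (for example with
HTMLStripCharFilterFactory) rather than in application code.

```python
import html
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes; drops tags and the bodies of script/style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by removed markup.
        return " ".join(" ".join(self._chunks).split())


def strip_html(raw):
    """Hypothetical helper: return the visible text of `raw`, tags removed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return parser.text()


def safe_for_results_page(raw):
    """Hypothetical helper: strip markup, then escape what is left so a
    stray '<' in the text cannot be interpreted as HTML when displayed."""
    return html.escape(strip_html(raw))


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
    print(safe_for_results_page(page))
```

Escaping after stripping is deliberate: stripping alone fixes the formatting,
while the escape step is what addresses the XSS concern if any markup-like
text survives.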
