Hi,

I am new to Nutch and would like to use it to index some discussion
forums on the web. I have successfully used Nutch to crawl a site and build
the index, but each page typically contains all of the replies for the given
thread, so the whole thread ends up in the index as one document.

I would like to index each article and the individual replies as separate
documents, rather than all together in a single document. 

Since all of the content is on a single web page, would one way to do this
be to write an HtmlParseFilter that extracts the fields for each reply?

How might I then create multiple documents in the index as opposed to a
single doc?
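
To make the goal concrete, here is a rough sketch in plain Java (no Nutch
APIs at all) of the kind of splitting I have in mind. The
<div class="post"> markup, the ReplySplitter and Doc names, and the
"#post-N" key scheme are all made up for illustration; I do not know yet
how this would map onto Nutch's parse and index steps, and a regex this
simple would break on nested divs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReplySplitter {
    // Assumption: every post in the thread is wrapped in <div class="post">...</div>
    // and posts do not contain nested <div> elements.
    private static final Pattern POST =
        Pattern.compile("<div class=\"post\">(.*?)</div>", Pattern.DOTALL);

    /** One would-be index document: a unique key plus the text of a single post. */
    static final class Doc {
        final String key;
        final String text;

        Doc(String key, String text) {
            this.key = key;
            this.text = text;
        }
    }

    /** Splits one thread page into one Doc per post, keyed by page URL plus position. */
    static List<Doc> split(String pageUrl, String html) {
        List<Doc> docs = new ArrayList<>();
        Matcher m = POST.matcher(html);
        int n = 0;
        while (m.find()) {
            // Drop any inline tags inside the post and trim surrounding whitespace.
            String text = m.group(1).replaceAll("<[^>]+>", "").trim();
            // Hypothetical key scheme so each post gets its own identity.
            docs.add(new Doc(pageUrl + "#post-" + n, text));
            n++;
        }
        return docs;
    }

    public static void main(String[] args) {
        String html = "<div class=\"post\">Original article</div>\n"
                    + "<div class=\"post\">First <b>reply</b></div>";
        for (Doc d : split("http://example.com/thread/1", html)) {
            System.out.println(d.key + " -> " + d.text);
        }
    }
}
```

What I am after is the equivalent of each Doc above ending up as its own
entry in the Nutch index.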

Thanks for your help and insights.

Bowden
