Hi, I am new to Nutch and would like to use it to index some discussion forums on the web. I have successfully used Nutch to crawl a site and build the index, but each page typically contains all of the replies for a given thread.
I would like to index each article and its individual replies as separate documents, rather than all together in a single document. Since all of the content is on a single web page, is one way to do this to write an HtmlParseFilter that extracts the fields for each reply? How might I then create multiple documents in the index, as opposed to a single doc?

Thanks for your help and insights.

Bowden

