Hi folks, I'm reasonably familiar with older versions of Nutch, but I've been out of the loop for a while. I've done some googling and read the docs, but I still don't fully understand the current picture.
Would someone please summarise the state of play for web scraping with Nutch, e.g. extracting text that matches a specific CSS selector or sits at a particular XPath?

In the past this was effectively impossible: by the time a plugin got to run, Nutch had already thrown away the HTML and kept only the "plain text" content. So if I want to take the raw HTML and pass it on to some other task, whether Hadoop-based or elsewhere, what would I need to learn about? Is this still plugin-based, or do I just need to learn how to write my own Hadoop jobs that read the Nutch database? Presumably people do this, right?

There are many other web scraping systems out there, but I'd like to stick with Nutch if possible.

Alex

