Performing Web Scraping within the content of fetched html pages

Alex McLintock Thu, 14 Nov 2013 05:34:55 -0800

Hi Folks,

I'm reasonably familiar with older versions of Nutch, but have been out of
the loop for a bit. I've done some googling and read the docs, but have not
really understood everything yet.


Would someone please summarise the state of play if I want to do web
scraping with Nutch - e.g. to extract text that is matched by a specific
CSS selector, or is found at a particular XPath?
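
To make the question concrete, here is a minimal sketch of the kind of
extraction I mean, done outside Nutch with only the Python standard library.
Everything in it is an assumption for illustration: the sample page, the
"price" class and the extract_text helper are made up, and ElementTree only
handles well-formed markup and a small subset of XPath, so real fetched pages
would need a more tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page. Real fetched HTML is rarely well-formed XML,
# so this only illustrates the idea.
SAMPLE_PAGE = """<html><body>
<div class="nav">Home</div>
<div class="price">Price: <b>42</b> GBP</div>
<div class="price">Price: <b>7</b> GBP</div>
</body></html>"""


def extract_text(markup, path):
    """Return the text of every element matching an ElementTree path.

    `path` uses the limited XPath subset that ElementTree supports.
    """
    root = ET.fromstring(markup)
    # itertext() walks the element and its children, so nested tags
    # such as <b> are flattened into the surrounding text.
    return ["".join(el.itertext()).strip() for el in root.findall(path)]


if __name__ == "__main__":
    for text in extract_text(SAMPLE_PAGE, ".//div[@class='price']"):
        print(text)
```

The point is that the path has to be evaluated against the markup itself,
which is why I need access to the raw HTML rather than the extracted text.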

Now in the past this was totally impossible, because by the time you could
write a plugin, Nutch had already thrown away the HTML markup and just
left the "plain text" content.

So if I wanted to take that HTML and push it on to some other task -
whether Hadoop based or elsewhere - what would I need to learn about? Is
this still plugin based? Or do I just need to learn how to write my own
Hadoop jobs which read the Nutch database?

Presumably people do do this, right? There are many other web scraping
systems out there, but I'd like to stick with Nutch if possible.

Alex
