I'm a bit confused as to what you want to do, what skills you have
available, and how much you can code yourself. Presumably you have seen
the linkdb, and that there is code to read from it?
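
For instance, something along these lines should print every inlink the
crawl recorded for a given URL. This is an untested sketch against the
1.x API; the "crawl/linkdb" path and the example URL are placeholders
for your own:

import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.Inlink;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.crawl.LinkDbReader;
import org.apache.nutch.util.NutchConfiguration;

public class PrintInlinks {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // "crawl/linkdb" is a placeholder; point this at your own linkdb.
    LinkDbReader reader = new LinkDbReader(conf, new Path("crawl/linkdb"));
    // All links the crawl recorded as pointing at this URL.
    Inlinks inlinks = reader.getInlinks(new Text("http://example.com/"));
    if (inlinks != null) {
      for (Iterator<Inlink> it = inlinks.iterator(); it.hasNext();) {
        Inlink in = it.next();
        System.out.println(in.getFromUrl() + " (anchor: " + in.getAnchor() + ")");
      }
    }
    reader.close();
  }
}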

Have you looked at the readdb facility (bin/nutch readdb)? You probably
want to look at the class org.apache.nutch.crawl.CrawlDbReader.
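
If all you need is a flat list of every URL Nutch knows about, you can
shell out to "bin/nutch readdb <crawldb> -dump <outdir>" and parse the
dump, or read the underlying Hadoop data files directly. Here is an
untested sketch of the latter, assuming a local single-reducer crawl
(the part-00000 layout is an assumption and will differ on a
distributed crawl):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class DumpUrls {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // The crawldb is a MapFile: keys are URLs (Text), values are
    // CrawlDatum records with the fetch status. Path is a placeholder.
    Path data = new Path("crawl/crawldb/current/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    while (reader.next(url, datum)) {
      System.out.println(url);
    }
    reader.close();
  }
}

One caveat on the image URLs: if I remember right, the default
conf/regex-urlfilter.txt skips common image suffixes (gif, jpg, png and
so on), so image links will never reach the crawldb unless you relax
that filter.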


Alex



On 14 July 2010 21:34, Branden Root <[email protected]> wrote:
> Hello,
>
>
>        I'm new to Nutch, so pardon me if this question has been asked before
> (an archive search didn't show anything). I'm trying to use Nutch to crawl a
> website and then get a list of all URLs on the site, including image URLs. I
> just need the URLs themselves, not the page/image content or anything like
> that.
>
>        Right now I know how to run Nutch on the command line, and after
> crawling/indexing I can view the links/whatever file to see all the links,
> so that's a starting point. But I really want to be able to programmatically
> run a Nutch crawl (examples exist) and then programmatically retrieve those
> links (no examples I can find). Also, I want to include all image hrefs on
> the crawled site in the final link printout (if that is even possible).
>
>        Any help is greatly appreciated!
>
> Thanks,
> Branden Makana
>
>
>
