Maintaining source url data (father) during runtime

eyal edri Sun, 25 Nov 2007 03:35:28 -0800

Hello Nutch developers,

Is there a way to know, at runtime, which URL the page currently being
fetched originated from?
Let me give an example.


Let's say that the site 'www.site.com' was in the initial url.txt file that
Nutch started to crawl from.
The first page to fetch is 'www.site.com/index.htm'. That page contains
links to:

   - www.site.com/news/top.exe
   - www.site.com/help.html
   - www.page.com/home/about.html ( inside www.page.com/home/about.html,
   there is a link to www.cnn.com/news.htm )

Is there a way to tell, during fetching, that each one of these URLs
originated from www.site.com/index.htm?

[ 1 e.g. www.site.com/index.htm --> www.page.com/home/about.html ]
[ 2 e.g. www.site.com/index.htm --> www.cnn.com/news.htm ]
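
To make the requirement concrete, here is a minimal sketch in plain Java of the bookkeeping I have in mind. It is not Nutch code and uses no Nutch API; the class SourceTracker and its methods addSeed, recordLink, parent and origin are hypothetical names made up for this illustration. It assumes that something in the fetch/parse path could call recordLink for every outlink it finds.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Nutch: remembers on which page each URL was found.
public class SourceTracker {
    // child URL -> URL of the page where the link to it was first seen
    private final Map<String, String> parentOf = new HashMap<String, String>();
    // URLs taken from the initial url.txt; these never get a parent
    private final Set<String> seeds = new HashSet<String>();

    public void addSeed(String url) {
        seeds.add(url);
    }

    // Record that 'to' was discovered as an outlink of 'from'.
    // The first discovery wins; later links to the same URL are ignored.
    public void recordLink(String from, String to) {
        if (!seeds.contains(to) && !parentOf.containsKey(to)) {
            parentOf.put(to, from);
        }
    }

    // Direct parent of a URL, or null if it is a seed or unknown.
    public String parent(String url) {
        return parentOf.get(url);
    }

    // Walk the parent chain back to the URL that has no recorded parent.
    // The 'seen' set guards against looping forever on a cyclic chain.
    public String origin(String url) {
        Set<String> seen = new HashSet<String>();
        String current = url;
        while (parentOf.containsKey(current) && seen.add(current)) {
            current = parentOf.get(current);
        }
        return current;
    }
}
```

With www.site.com/index.htm added as a seed and the links above recorded, parent("www.cnn.com/news.htm") would give www.page.com/home/about.html, and origin("www.cnn.com/news.htm") would give www.site.com/index.htm. That is the information I would like to have available while fetching.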

I know that Nutch can build the linkdb, which contains inlink data, but
this is done only after the crawl is finished.
Is there a way to get this information during the fetching process?

Thanks,


-- 
Eyal Edri
