Re: Using nutch as a web crawler

2007-04-05 Thread Lourival Júnior
Nutch has a file called crawl-urlfilter.txt where you can set your site domain or site list, so nutch will only crawl the sites on that list. Download nutch and see it working; that is the best way for you :). Take a look: http://lucene.apache.org/nutch/tutorial8.html Regards, On 4/5/07, Meryl Silverburgh <[EMAIL PROTE
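
For readers who want a feel for how such a filter file restricts a crawl, here is a minimal Python sketch. It is not nutch's code. It assumes the rule format is one regular expression per line, prefixed with `+` (accept) or `-` (reject), that the first matching rule decides, and that a URL matching no rule is rejected. The domain `example.org` and the rules themselves are made up for illustration.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides.

    A URL that matches no rule is rejected (an assumption of this sketch).
    """
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# Hypothetical rules for an example domain (not taken from the thread).
EXAMPLE_RULES = r"""
# skip common binary and style file suffixes
-\.(gif|jpg|png|css|zip|exe)$
# accept hosts in example.org
+^http://([a-z0-9]*\.)*example\.org/
# skip everything else
-.
"""

rules = parse_rules(EXAMPLE_RULES)
```

With these rules, `accepts(rules, "http://www.example.org/index.html")` is true, while an image on the same host or any page on another host is rejected. Rule order matters, since the first match wins.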

Re: Using nutch as a web crawler

2007-04-04 Thread Meryl Silverburgh
Thanks. Can you please tell me how I can plug in my own handling when nutch sees a site, instead of building the search database for that site? On 4/3/07, Lourival Júnior <[EMAIL PROTECTED]> wrote: I am entirely certain that nutch is what you are looking for. Take a look at nutch's documenta

Re: Using nutch as a web crawler

2007-04-04 Thread zzp good
You don't need something as powerful as nutch. Your task is simple; you can just use NekoHtml to do it, with a little additional programming. On 4/4/07, Michael Wechner <[EMAIL PROTECTED]> wrote: Lourival Júnior wrote: > I am entirely certain that nutch is what you are looking for. Take a look at nutc

Re: Using nutch as a web crawler

2007-04-04 Thread Michael Wechner
Lourival Júnior wrote: I am entirely certain that nutch is what you are looking for. Take a look at nutch's documentation for more details and you will see :). An alternative is websphinx, but it's not really maintained anymore. HTH Michael On 4/3/07, Meryl Silverburgh <[EMAIL PROTECTE

Re: Using nutch as a web crawler

2007-04-03 Thread Lourival Júnior
I am entirely certain that nutch is what you are looking for. Take a look at nutch's documentation for more details and you will see :). On 4/3/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote: Hi, I would like to know if it is a good idea to use the nutch web crawler? Basically, this is w

Using nutch as a web crawler

2007-04-03 Thread Meryl Silverburgh
Hi, I would like to know if it is a good idea to use the nutch web crawler? Basically, this is what I need: 1. I have a list of web sites 2. I want the web crawler to go through each site and parse the anchors. If a link is in the same domain, repeat the same step, up to 3 levels deep. 3. For each link, write to a
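
The first two requirements above can be sketched in a few lines of Python using only the standard library. This is a rough illustration, not how nutch or NekoHtml does it. It treats "same domain" as "same host", counts links found on the start page as level 1, and takes the page fetcher as a parameter so the walk can be tried without network access. The names `AnchorParser`, `fetch_html` and `crawl_site` are invented for this sketch. What to do with each link is left to the caller, because the original message is cut off at that point.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class AnchorParser(HTMLParser):
    """Collect the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch_html(url):
    """Default fetcher: download a page and decode it leniently."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_site(start_url, max_depth=3, fetch=fetch_html):
    """Walk same-host links breadth-first, up to max_depth levels.

    Returns a list of (level, url) pairs in discovery order.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    found = []
    frontier = [start_url]
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for page in frontier:
            try:
                html = fetch(page)
            except OSError:
                continue  # unreachable page: skip it and carry on
            parser = AnchorParser()
            parser.feed(html)
            for href in parser.hrefs:
                # Resolve relative links and drop any #fragment.
                link = urljoin(page, href).split("#", 1)[0]
                if urlparse(link).netloc != host or link in seen:
                    continue
                seen.add(link)
                found.append((depth, link))
                next_frontier.append(link)
        frontier = next_frontier
    return found
```

Calling `crawl_site("http://example.org/")` would fetch real pages; passing a `fetch` function that reads from a dictionary of canned HTML is an easy way to check the level limit first. Running it over a list of sites is a plain loop over the start URLs.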