Re: Can Nutch crawler Impersonate user-agent?

2009-06-02 Thread Jake Jacobson
keep the default * at the end of the list. E.g.: BlurflDev,Blurfl,*
I copied this to my nutch-site.xml file, edited it with my user-agent string, and the magic worked. I would suggest adding this block to the nutch-site.xml file by default.
Jake Jacobson http://www.linkedin.c

Re: Can Nutch crawler Impersonate user-agent?

2009-06-01 Thread Jake Jacobson
hould not be of failure, but of succeeding at something that doesn't really matter. -- ANONYMOUS
On Mon, Jun 1, 2009 at 2:46 PM, David M. Cole wrote:
> At 2:23 PM -0400 6/1/09, Jake Jacobson wrote:
>> User-agent: imo-robot-intelink
>> Disallow: /App_Themes/

Re: Can Nutch crawler Impersonate user-agent?

2009-06-01 Thread David M. Cole
At 2:23 PM -0400 6/1/09, Jake Jacobson wrote:
> User-agent: imo-robot-intelink
> Disallow: /App_Themes/
> Disallow: /app_themes/
> Disallow: /Archive/
> Disallow: /archive/
> Disallow: /Bin/
> Disallow: /bin/
Jake: I think you need to add one more line after the last line:
Allow: /
\dmc

Can Nutch crawler Impersonate user-agent?

2009-06-01 Thread Jake Jacobson
Hi, I am testing out Nutch 1.0 and it doesn't seem to be able to crawl my website that has the following robots.txt file:
User-agent: imo-robot-intelink
Disallow: /App_Themes/
Disallow: /app_themes/
Disallow: /Archive/
Disallow: /archive/
Disallow: /Bin/
Disallow: /bin/
I have the
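For anyone wanting to check what a robots.txt like the one above permits before pointing a crawler at it, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not Nutch's own robots parser, so Nutch 1.0 may well behave differently; the test paths and the `SomeOtherBot` agent name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the post above, reproduced as-is.
ROBOTS_TXT = """\
User-agent: imo-robot-intelink
Disallow: /App_Themes/
Disallow: /app_themes/
Disallow: /Archive/
Disallow: /archive/
Disallow: /Bin/
Disallow: /bin/
"""

# Hypothetical paths, chosen only to exercise the rules.
PATHS = ["/index.html", "/App_Themes/site.css", "/archive/2009/"]


def build_parser(robots_txt):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def check(parser, agent, paths):
    """Map each path to True (fetch allowed) or False (disallowed)."""
    return {path: parser.can_fetch(agent, path) for path in paths}


parser = build_parser(ROBOTS_TXT)

# The agent named in the file: only the listed directories are blocked.
named = check(parser, "imo-robot-intelink", PATHS)

# An agent with no matching group (hypothetical name): the file has no
# "User-agent: *" group, so this parser applies no restrictions to it.
other = check(parser, "SomeOtherBot", PATHS)

print(named)
print(other)
```

With this particular parser, paths outside the Disallow lines come back as allowed even without an explicit `Allow: /` line. Whether the same holds for the crawler in question depends on its own robots.txt handling and on which agent names it is configured to look for.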

Re: User Agent

2005-12-09 Thread gekkokid
just make sure you have "Mozilla/5.0" at the front :)
----- Original Message -----
From: "Insurance Squared Inc."
Sent: Friday, December 09, 2005 2:55 PM
Subject: User Agent
What should I be using for a user agent in the crawler? We just tr

User Agent

2005-12-09 Thread Insurance Squared Inc.
What should I be using for a user agent in the crawler? We just tried crawling a government site and if we leave the user agent set to nutch, we get the crawl. When I change it, I'm getting blocked with an error about the user agent not being supported. It seems that I should be cha