keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
I copied this into my nutch-site.xml file, replaced the value with my
user-agent string, and the magic worked. I would suggest that this block
be added to the nutch-site.xml file by default.
Jake Jacobson
http://www.linkedin.c
hould not be of failure,
but of succeeding at something that doesn't really matter.
-- ANONYMOUS
On Mon, Jun 1, 2009 at 2:46 PM, David M. Cole wrote:
> At 2:23 PM -0400 6/1/09, Jake Jacobson wrote:
>>
>> User-agent: imo-robot-intelink
>> Disallow: /App_Themes/
>>
At 2:23 PM -0400 6/1/09, Jake Jacobson wrote:
User-agent: imo-robot-intelink
Disallow: /App_Themes/
Disallow: /app_themes/
Disallow: /Archive/
Disallow: /archive/
Disallow: /Bin/
Disallow: /bin/
Jake:
I think you need to add one more line after the last line:
Allow: /
\dmc
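
One way to check whether the extra Allow line matters is to feed the quoted rules to a robots.txt parser. The sketch below uses Python's standard-library urllib.robotparser, which is not the parser Nutch uses, so it only shows how one implementation of the exclusion convention reads this file. The example.com URLs and the "OtherBot" agent name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules quoted from the message above, with no "Allow: /" line.
ROBOTS_TXT = """\
User-agent: imo-robot-intelink
Disallow: /App_Themes/
Disallow: /app_themes/
Disallow: /Archive/
Disallow: /archive/
Disallow: /Bin/
Disallow: /bin/
"""


def is_allowed(agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under ROBOTS_TXT."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    checks = [
        ("imo-robot-intelink", "http://example.com/index.html"),
        ("imo-robot-intelink", "http://example.com/Bin/tool.dll"),
        ("OtherBot", "http://example.com/Bin/tool.dll"),  # hypothetical agent
    ]
    for agent, url in checks:
        print(agent, url, is_allowed(agent, url))
```

With this parser, a path that matches no Disallow line is treated as allowed even without "Allow: /", and an agent with no matching block is not restricted at all. If Nutch still refuses to crawl, it may be worth checking which agent names the crawler is configured to look for.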
Hi,
I am testing out Nutch 1.0 and it doesn't seem to be able to crawl my
website that has the following robots.txt file:
User-agent: imo-robot-intelink
Disallow: /App_Themes/
Disallow: /app_themes/
Disallow: /Archive/
Disallow: /archive/
Disallow: /Bin/
Disallow: /bin/
I have the
just make sure you have "Mozilla/5.0" at the front :)
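
As a rough illustration of that advice, here is a minimal Python sketch that builds a user-agent string starting with "Mozilla/5.0" and attaches it to a request. It is not Nutch configuration: the helper name build_request, the crawler name "MyCrawler" and the example.com URLs are assumptions for illustration only, and whether a given site accepts such a string depends on that site.

```python
import urllib.request


def build_request(url: str, crawler_name: str, info_url: str) -> urllib.request.Request:
    """Build a request whose User-Agent starts with "Mozilla/5.0".

    crawler_name and info_url are hypothetical values identifying the crawler.
    """
    agent = f"Mozilla/5.0 (compatible; {crawler_name}; +{info_url})"
    return urllib.request.Request(url, headers={"User-Agent": agent})


if __name__ == "__main__":
    req = build_request(
        "http://example.com/",          # placeholder target
        "MyCrawler",                    # hypothetical crawler name
        "http://example.com/bot.html",  # hypothetical info page
    )
    # Request stores header names capitalized, hence "User-agent".
    print(req.get_header("User-agent"))
```

Keeping the crawler's own name inside the string still lets site owners identify the crawler in their logs.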
- Original Message -
From: "Insurance Squared Inc." <[EMAIL PROTECTED]>
To:
Sent: Friday, December 09, 2005 2:55 PM
Subject: User Agent
What should I be using for a user agent in the crawler? We just tried
crawling a government site and if we leave the user agent set to nutch,
we get the crawl. When I change it, I'm getting blocked with an error
about the user agent not being supported. It seems that I should be
cha