Re: Mailing List nutch-agent Reports of Bots Submitting Forms

2006-05-30 Thread Doug Cutting

Ken Krugler wrote:

2. Are the Nutch Devs replying to the emails sent to this list? I could
understand if they are replying off-list, but to an outside observer such as
myself it appears as though webmasters are not getting many replies to their
inquiries.



I can speak for myself only .. I'm not tracking that list. What about 
others?


Folks who are running a nutch-based crawler that provides this email 
address as the contact address should subscribe to this list and respond 
to messages, especially those which may have been caused by their 
crawler.  Others are also encouraged to subscribe and help respond to 
messages here, as a bad reputation for the crawler affects the whole 
project.  This list is actually fairly low-volume.


This brings up an issue I've been thinking about. It might make sense to 
require everybody set the user-agent string, versus it having default 
values that point to Nutch.


The first time you run Nutch, it would display an error re the 
user-agent string not being set, but if the instructions for how to do 
this were explicit, this wouldn't be much of a hardship for anybody 
trying it out.


+1

That would be a better solution.

Doug


Re: Mailing List nutch-agent Reports of Bots Submitting Forms

2006-05-24 Thread Andrzej Bialecki

Jeremy Bensley wrote:
There are posts every three or four days to the nutch-agent regarding bots
submitting empty forms to websites. I don't think I've seen any regular devs
reply in-list to these issues, and am just wondering if these cases are
being analyzed.

1. Is there a known (resolved or current) bug regarding Nutch submitting
forms? I could find no bug listings in JIRA for this.  If it is known and
resolved, what versions of the bot exhibit this behavior?


Yes, there was a discussion on the list about this - I'm afraid this 
behavior is present in both 0.7.x and 0.8. I'm going to remove the 
offending code (or to put it as an option, turned off by default).




2. Are the Nutch Devs replying to the emails sent to this list? I could
understand if they are replying off-list, but to an outside observer such as
myself it appears as though webmasters are not getting many replies to their
inquiries.


I can speak for myself only .. I'm not tracking that list. What about 
others?




I don't mean to be alarmist, but I think it is in the community's best
interests to make sure that these kinds of complaints get resolved such that
nutch is a good 'citizen' and isn't blacklisted from searching sites.

Of course you are right, there is no ill will here on our part, just a long 
queue of issues to address ... but it seems we have to prioritize this one.

--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Mailing List nutch-agent Reports of Bots Submitting Forms

2006-05-24 Thread Ken Krugler

Jeremy Bensley wrote:

There are posts every three or four days to the nutch-agent regarding bots
submitting empty forms to websites. I don't think I've seen any regular devs
reply in-list to these issues, and am just wondering if these cases are
being analyzed.

1. Is there a known (resolved or current) bug regarding Nutch submitting
forms? I could find no bug listings in JIRA for this.  If it is known and
resolved, what versions of the bot exhibit this behavior?


Yes, there was a discussion on the list about this - I'm afraid this 
behavior is present in both 0.7.x and 0.8. I'm going to remove the 
offending code (or to put it as an option, turned off by default).


I think the biggest issue is following links for a form POST. This 
definitely seems wrong to me, and thus should never be done.


There's a separate issue re whether it's OK to follow form links that 
do a GET, since that's what the guy complained to us about recently. 
He agreed that his form should be doing a POST, since it triggers a 
massive build process, but he also said that no other crawler besides 
Nutch was following these links.


I could see making that a configurable option, where it was false by 
default. But we'd probably need to modify this setting to be 
domain-specific, ie some sites we crawl require us to follow these 
types of links to get at content, but in general we'd want to not 
follow them.
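
The domain-specific setting described above could be pictured as a global default of false plus a per-host allow list. This is only a rough sketch under that assumption: FormFollowConfig, allowHost, shouldFollowFormAction and the example.com/example.org hosts are hypothetical names for illustration, not existing Nutch classes or options.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: a global default plus per-host overrides.
final class FormFollowConfig {
    private final boolean followByDefault;
    private final Set<String> allowedHosts = new HashSet<>();

    FormFollowConfig(boolean followByDefault) {
        this.followByDefault = followByDefault;
    }

    /** Marks a host whose GET form actions may be followed regardless of the default. */
    void allowHost(String host) {
        allowedHosts.add(host.toLowerCase(Locale.ROOT));
    }

    /** Decides whether a GET form action URL should be treated as an outlink. */
    boolean shouldFollowFormAction(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // unparseable URL: stay conservative and skip it
        }
        if (host == null) {
            return false; // no host to match against: also skip
        }
        if (allowedHosts.contains(host.toLowerCase(Locale.ROOT))) {
            return true;  // per-host override wins
        }
        return followByDefault;
    }

    public static void main(String[] args) {
        FormFollowConfig cfg = new FormFollowConfig(false);
        cfg.allowHost("archive.example.org");
        System.out.println(cfg.shouldFollowFormAction("http://www.example.com/contact"));
        System.out.println(cfg.shouldFollowFormAction("http://archive.example.org/search?q=nutch"));
    }
}
```

Keeping the default at false means a newly encountered site is never probed through its forms unless the operator has opted in for that host.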



2. Are the Nutch Devs replying to the emails sent to this list? I could
understand if they are replying off-list, but to an outside observer such as
myself it appears as though webmasters are not getting many replies to their
inquiries.


I can speak for myself only .. I'm not tracking that list. What about others?


I did respond to John Masone at MacFixer.net, to get the URL to the 
form where Nutch was triggering a submit. So just FYI for testing the 
fix, it's:


http://www.macfixer.net/contact


I don't mean to be alarmist, but I think it is in the community's best
interests to make sure that these kinds of complaints get resolved such that
nutch is a good 'citizen' and isn't blacklisted from searching sites.

Of course you are right, there is no ill will here on our part, just 
a long queue of issues to address ... but it seems we have to 
prioritize this one.


This brings up an issue I've been thinking about. It might make sense 
to require everybody set the user-agent string, versus it having 
default values that point to Nutch.


The first time you run Nutch, it would display an error re the 
user-agent string not being set, but if the instructions for how to 
do this were explicit, this wouldn't be much of a hardship for 
anybody trying it out.
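
A startup check along the lines proposed here might look roughly like the following. It is a sketch only: AgentCheck and requireAgentName are hypothetical names, a plain Map stands in for the real configuration object, and the property key "http.agent.name" is assumed for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical startup check, not actual Nutch code.
final class AgentCheck {
    // Assumed property key, used here only for illustration.
    static final String AGENT_NAME_KEY = "http.agent.name";

    private AgentCheck() {}

    /** Returns the configured agent name, or fails with an explicit instruction. */
    static String requireAgentName(Map<String, String> conf) {
        String name = conf.get(AGENT_NAME_KEY);
        if (name == null || name.trim().isEmpty()) {
            throw new IllegalStateException(
                "No user-agent configured. Set '" + AGENT_NAME_KEY
                + "' to a name that identifies your crawler and its operator"
                + " before fetching.");
        }
        return name.trim();
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        try {
            requireAgentName(conf); // nothing set yet, so this fails
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        conf.put(AGENT_NAME_KEY, "ExampleBot");
        System.out.println(requireAgentName(conf));
    }
}
```

Failing before the first fetch, with the fix spelled out in the message, keeps the extra step cheap for someone trying the crawler for the first time.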


I could write up some quick text for the Wiki re what a good user 
agent string should contain, and what should be on the web page that 
it refers to, since we also went through that same process not too 
long ago.


-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
Find Code, Find Answers


Re: Mailing List nutch-agent Reports of Bots Submitting Forms

2006-05-24 Thread Andrzej Bialecki

Ken Krugler wrote:

Jeremy Bensley wrote:
There are posts every three or four days to the nutch-agent regarding bots
submitting empty forms to websites. I don't think I've seen any regular devs
reply in-list to these issues, and am just wondering if these cases are
being analyzed.

1. Is there a known (resolved or current) bug regarding Nutch submitting
forms? I could find no bug listings in JIRA for this.  If it is known and
resolved, what versions of the bot exhibit this behavior?


Yes, there was a discussion on the list about this - I'm afraid this 
behavior is present in both 0.7.x and 0.8. I'm going to remove the 
offending code (or to put it as an option, turned off by default).


I think the biggest issue is following links for a form POST. This 
definitely seems wrong to me, and thus should never be done.


I don't think this is happening anymore; there is an explicit check for 
POST method in DOMContentUtils that should prevent this. However, some 
horribly broken HTML may be fooling Neko or TagSoup, so that they lose 
the 'method' attribute (in which case it defaults to GET).
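
To illustrate why a lost 'method' attribute matters: HTML treats a form without a method as GET, so a check that only rejects an explicit POST lets such a form through. The following is a minimal sketch of that logic, not the actual DOMContentUtils code; FormLinkPolicy, effectiveMethod and isCandidateOutlink are hypothetical names.

```java
import java.util.Locale;

// Hypothetical helper illustrating the check; not the actual DOMContentUtils code.
final class FormLinkPolicy {

    private FormLinkPolicy() {}

    /** Normalizes a form's method attribute; missing or blank means GET, as in HTML. */
    static String effectiveMethod(String methodAttr) {
        if (methodAttr == null || methodAttr.trim().isEmpty()) {
            return "GET";
        }
        return methodAttr.trim().toUpperCase(Locale.ROOT);
    }

    /**
     * A form action is only a candidate outlink when the effective method is GET.
     * Anything else (POST, or an unrecognized value) is conservatively skipped.
     */
    static boolean isCandidateOutlink(String methodAttr) {
        return "GET".equals(effectiveMethod(methodAttr));
    }

    public static void main(String[] args) {
        System.out.println(isCandidateOutlink("post")); // false
        // The parser dropped the attribute, so the form now looks like a GET:
        System.out.println(isCandidateOutlink(null));   // true
    }
}
```

The second call shows the failure mode described above: once the parser loses the attribute, a POST form is indistinguishable from a GET form at this level.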




There's a separate issue re whether it's OK to follow form links that 
do a GET, since that's what the guy complained to us about recently. 
He agreed that his form should be doing a POST, since it triggers a 
massive build process, but he also said that no other crawler besides 
Nutch was following these links.


I could see making that a configurable option, where it was false by 
default. But we'd probably need to modify this setting to be 
domain-specific, ie some sites we crawl require us to follow these 
types of links to get at content, but in general we'd want to not 
follow them.


For now I modified the code to skip form action URLs, depending on a 
boolean option. I'll commit this in a moment.



This brings up an issue I've been thinking about. It might make sense 
to require everybody set the user-agent string, versus it having 
default values that point to Nutch.


The first time you run Nutch, it would display an error re the 
user-agent string not being set, but if the instructions for how to do 
this were explicit, this wouldn't be much of a hardship for anybody 
trying it out.


I could write up some quick text for the Wiki re what a good user 
agent string should contain, and what should be on the web page that 
it refers to, since we also went through that same process not too 
long ago.


I like this idea. I know that I've been guilty of this in the past, out 
of pure laziness ...


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com