Re: blocking site downloaders

2009-10-17 Thread James Holmes

Session variables are always stored in the server's memory. J2EE
session cookies simply expire in the client when the browser is
closed, rather than only at the time set on the session timeout.

Bots will often ignore either style of session, as they often don't
keep track of cookies.

mxAjax / CFAjax docs and other useful articles:
http://www.bifrost.com.au/blog/



2009/10/16 Azadi Saryev :
>
> 2) >> However, this may not work as the offending behavior is probably
> generated by a bot or desktop app that doesn't store session variables.
> if you use j2ee sessions then session vars are in-memory only and are
> stored in server's memory, not on client's computer...

~|
Want to reach the ColdFusion community with something they want? Let them know 
on the House of Fusion mailing lists
Archive: 
http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:327295
Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm
Unsubscribe: 
http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=11502.10531.4


Re: blocking site downloaders

2009-10-16 Thread Azadi Saryev

On 17/10/2009 09:08, Claude Schneegans wrote:
> robots.txt file may be useful for well-behaved robots like Google or 
> Yahoo, but bad bots disguise themselves
> as "Explorer" or "Mozilla" and they do not comply with robots.txt 
> directives.
>   

that's exactly why i also said "of course, not all robots obey these
rules - if this bot does not then you can consider banning it completely
using the options you outlined."

-- 

Azadi Saryev
Sabai-dee.com
http://www.sabai-dee.com/


Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:327292


Re: blocking site downloaders

2009-10-16 Thread Claude Schneegans

 >>in case it's a search spider, you can try adding appropriate robots.txt

A robots.txt file may be useful for well-behaved robots like Google or 
Yahoo, but bad bots disguise themselves
as "Explorer" or "Mozilla" and do not comply with robots.txt 
directives.
Most of them never read it, and some of them will even read it ... and 
visit the forbidden areas on purpose.
This is one of the tricks I use to detect them: watch for attempts to read pages 
listed in robots.txt.
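
The robots.txt trick above can be sketched roughly as follows. This is only an illustration in Python (chosen so it can be run as-is; the same logic ports to CFML), not the poster's actual code. The /trap/ path, the sample robots.txt, and the names disallowed_prefixes, is_trap_hit and record_request are all made up for the example, and the parser ignores User-agent groups for simplicity.

```python
def disallowed_prefixes(robots_txt: str) -> list[str]:
    """Collect the path prefixes named in Disallow lines."""
    prefixes = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()      # drop trailing comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            prefixes.append(value.strip())
    return prefixes


def is_trap_hit(path: str, prefixes: list[str]) -> bool:
    """True when the requested path falls under a disallowed prefix."""
    return any(path.startswith(p) for p in prefixes)


# Hypothetical robots.txt; /trap/ is a path no human visitor is ever sent to.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
Disallow: /private/   # members only
"""

PREFIXES = disallowed_prefixes(ROBOTS_TXT)
suspects = set()   # IPs that asked for a page robots.txt told them to avoid


def record_request(ip: str, path: str) -> None:
    """Call once per request; remembers clients that enter forbidden areas."""
    if is_trap_hit(path, PREFIXES):
        suspects.add(ip)
```

A hit on a disallowed path is only a hint, since a human can still type the URL by hand, so it is probably best combined with other tests before banning anyone.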

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:327291


Re: blocking site downloaders

2009-10-16 Thread Claude Schneegans

 >>(c) What the hell are they doing?

Some will search for copyrighted images and, if they find any, they 
will send you a lawyer's letter
with a bill of $1500 or so for using their image (e.g. Getty Images).
Most of them are simply looking for email addresses they can collect 
and sell to spammers.

 >>The current offending IP reverses to China.

From China, it could also be a government spider checking whether your site is 
promoting human rights,
and if so, I've got bad news: your site will be BANNED from China! ;-)
Plus email harvesters, of course.

 >>o Add a session variable that stores the last page view time down to 
the second.  However, this may not work as the offending behavior is 
probably generated by a bot or desktop app that doesn't store session 
variables.

Exactly. Better to store the IP address in an application variable (or even a server 
variable) and calculate a moving average of the time between page requests.
If the average drops below a certain minimum, ban the IP address for 
at least a couple of hours.
The bot will be discouraged.
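
For the "ban the IP for a couple of hours" part, a minimal sketch of an in-memory ban table with expiry might look like this. It is written in Python so it can be run and checked; in ColdFusion the dictionary would live in the application or server scope and access to it would need locking. The class name BanList and the two-hour default are assumptions, not anything from the thread.

```python
BAN_SECONDS = 2 * 60 * 60   # "a couple of hours"; an assumed value, tune to taste


class BanList:
    """Remembers banned IPs and lets each ban lapse after a fixed time."""

    def __init__(self, ban_seconds=BAN_SECONDS):
        self.ban_seconds = ban_seconds
        self._expires = {}   # ip -> timestamp at which the ban ends

    def ban(self, ip, now):
        """Start (or restart) a ban for this IP at time `now` (seconds)."""
        self._expires[ip] = now + self.ban_seconds

    def is_banned(self, ip, now):
        """True while the ban is running; expired entries are dropped."""
        expiry = self._expires.get(ip)
        if expiry is None:
            return False
        if now >= expiry:
            del self._expires[ip]   # ban is over, forget the IP
            return False
        return True
```

The current time is passed in rather than read inside the class, which keeps the logic easy to test; a request handler would pass something like time.time().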

 >>o Review my databased logs for the current IP's last twenty page 
views.  This may put an extra small hit on the server, but over all not 
as much as an extra 2200 page views in an hour every couple of weeks.

This can be a pretty heavy task for the server. Log files are pretty big 
files.

 >>If the requesting IP has requested more than twenty pages from the 
website in the current minute, I block the IP for a period of time, say, 
an hour or two.

This is what I do, but I use a moving average in a server variable. Much 
more efficient than reading log files, e.g.:
server.structPass[REMOTE_ADDR].durpass = server.structPass[REMOTE_ADDR].durpass*6/7 + thisLaps;
thisLaps holds the number of seconds between the last request and the current one. 
The formula gives an exponentially weighted running total over roughly 
7 consecutive requests (it settles at about 7 times the average interval). 
If this value goes below a certain amount, chances are the agent is a bot.

I use this test among others to detect bad bots.
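
Here is a rough, runnable sketch of the durpass formula above, in Python rather than CFML so it can be tested. The update rule is the one from the post; everything else (the PaceTracker class, the starting score, and the MIN_SCORE threshold) is an assumption for illustration. Because the rule multiplies by 6/7 and adds the whole gap, the score settles near 7 times the mean gap, so the threshold here is expressed on that scale.

```python
WINDOW = 7         # number of requests the score roughly spans, as in the post
MIN_SCORE = 7.0    # assumed threshold: about 1 second per request on average


class PaceTracker:
    """Per-IP decayed sum of the gaps between requests."""

    def __init__(self):
        self._last_seen = {}   # ip -> time of the previous request
        self._score = {}       # ip -> decayed sum of gaps ("durpass")

    def hit(self, ip, now):
        """Record a request at time `now`; True if the pace looks bot-like."""
        last = self._last_seen.get(ip)
        self._last_seen[ip] = now
        if last is None:
            # First sighting: start from a generous score (assumed value)
            # so that a couple of quick clicks do not trip the alarm.
            self._score[ip] = WINDOW * 60.0
            return False
        gap = now - last                                   # "thisLaps"
        score = self._score[ip] * (WINDOW - 1) / WINDOW + gap
        self._score[ip] = score
        return score < MIN_SCORE
```

With these assumed numbers a client requesting a page every 0.2 seconds is flagged after a few dozen requests, while one requesting a page every 30 seconds never is. Dividing the gap by WINDOW in the update would give a true average in seconds instead of the scaled total.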

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:327289


Re: blocking site downloaders

2009-10-16 Thread Mike Chabot

Is it a search engine robot? If so you could block it using
robots.txt. What is the Web browser they are using?

Do you have any customers in China? If not, you can block the entire
country at the firewall level.
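
If a firewall rule is not an option, the same idea can be approximated in the application by testing the client IP against a list of network ranges. The sketch below is Python using the standard ipaddress module; the ranges are placeholder documentation addresses, not real country allocations, which would have to come from a GeoIP or registry data source. The names BLOCKED_NETWORKS and is_blocked are made up for the example.

```python
import ipaddress

# Placeholder ranges taken from the reserved documentation blocks;
# a real list would be loaded from a GeoIP or registry data source.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]


def is_blocked(ip: str) -> bool:
    """True when the address falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETWORKS)
```

Checking in the application still costs a request per hit, so blocking at the firewall remains the cheaper option where it is available.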

-Mike Chabot

On Fri, Oct 16, 2009 at 10:12 AM, Michael Muller  wrote:
>
> Hey all,
>
> Every once in a while I'll notice in my logs that someone comes to one of my 
> sites and hits thousands of pages in a short span and then leaves.  This 
> annoys me for a few reasons:
>
> (a) It's an unnecessary tax on my server (and we all hate taxes)
>
> (b) It artificially inflates my page hits
>
> (c) What the hell are they doing? Scraping my pages and hosting them on some 
> site? The current offending IP reverses to China.
>
>
> So, to avoid this, I'm considering the following:
>
> o Add a session variable that stores the last page view time down to the 
> second.  However, this may not work as the offending behavior is probably 
> generated by a bot or desktop app that doesn't store session variables.
>
> o Review my databased logs for the current IP's last twenty page views.  This 
> may put an extra small hit on the server, but over all not as much as an 
> extra 2200 page views in an hour every couple of weeks.
>
> If the requesting IP has requested more than twenty pages from the website in 
> the current minute, I block the IP for a period of time, say, an hour or two.
>
> I have about a dozen sites running the same software, each with multiple 
> thousands of pages (community sites, with 21,000 messages and
>
>
> I've posed this question a couple of times before on this list and it hasn't 
> prompted any response.  I will try again, hoping someone will either tell me 
> I'm worrying too much, or that this is a smart idea.
>
> Thanks,
>
> Mik
>
>
>
> 
> Michael Muller
> office (413) 863-6455
> cell (413) 320-5336
> skype: michaelBmuller
> http://MontagueWebWorks.com
>
> Information is not knowledge
> Knowlege is not wisdom
>
> Eschew Obfuscation
>
>
> 

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:327271


Re: blocking site downloaders

2009-10-16 Thread Azadi Saryev

some thoughts on the issue:

1) it may very well be a spider from a chinese search engine indexing
your pages... i'll leave it up to you to decide if you want none of 1
billion chinese people to be able to find your site...
in case it's a search spider, you can try adding appropriate robots.txt
file and robots <meta> tags to prevent access to and indexing of pages you do
not want accessed/indexed. you can also add rel="nofollow" to any links
you do not want followed by robots. of course, not all robots obey these
rules - if this bot does not then you can consider banning it completely
using the options you outlined.
another impact of robots accessing your site is that your app will still
try and create session vars for their sessions, filling your server
memory with useless sessions. you can try to lessen this burden on your
server by setting a super-short session timeout for robots' sessions. iirc,
ben nadel had a good blog post on how to do this over on
http://www.bennadel.com/.

2) >> However, this may not work as the offending behavior is probably
generated by a bot or desktop app that doesn't store session variables.
if you use j2ee sessions then session vars are in-memory only and are
stored in server's memory, not on client's computer...

3) >> This may put an extra small hit on the server, but over all not as
much as an extra 2200 page views in an hour every couple of weeks.
i think you may be wrong here... 2200 page requests once every couple of
weeks seems like less of a tax on your server than parsing your logs on
every page request... but then it depends on how loaded your sites are...
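
on the super-short session timeout for robots from point 1, here is a rough sketch of how the timeout could be picked per request. it is written in python so it can be run and checked; in coldfusion the returned value would feed the application's session timeout setting. the marker strings, both timeout values and the cookie check are assumptions made for the example, not taken from the blog post mentioned above.

```python
BOT_TIMEOUT = 10            # seconds; assumed "super-short" value
NORMAL_TIMEOUT = 20 * 60    # seconds; assumed default for real visitors

# Illustrative substrings only; a bot can send any User-Agent it likes.
BOT_MARKERS = ("bot", "spider", "crawl", "slurp")


def session_timeout(user_agent: str, has_cookies: bool) -> int:
    """Pick a session timeout in seconds for the current request."""
    ua = user_agent.lower()
    if not ua or any(marker in ua for marker in BOT_MARKERS):
        return BOT_TIMEOUT
    if not has_cookies:
        # No cookie came back: either a first visit or a client that never
        # returns cookies, so keep this throwaway session short.
        return BOT_TIMEOUT
    return NORMAL_TIMEOUT
```

a bot that disguises itself as a browser and accepts cookies will slip past this, so it only reduces the memory burden, it does not block anything.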

just some quick thoughts that i hope may help you...

Azadi Saryev
Sabai-dee.com
http://www.sabai-dee.com/



Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:327270