Re: blocking site downloaders
Session variables are always stored in the server's memory. J2EE session
cookies simply expire on the client when the browser is closed, rather than
only at the time set by the session timeout. Bots will often ignore either
style of session anyway, as they often don't keep track of cookies.

mxAjax / CFAjax docs and other useful articles:
http://www.bifrost.com.au/blog/

2009/10/16 Azadi Saryev:
>
> 2)
>> However, this may not work as the offending behavior is probably
>> generated by a bot or desktop app that doesn't store session variables.
>
> if you use j2ee sessions then session vars are in-memory only and are
> stored in the server's memory, not on the client's computer...
Re: blocking site downloaders
On 17/10/2009 09:08, Claude Schneegans wrote:
> robots.txt file may be useful for well behaved robots like Google or
> Yahoo, but bad bots disguise themselves like "Explorer" or "Mozilla"
> and they do not comply with robots.txt directives.

that's exactly why i also said "of course, not all robots obey these rules -
if this bot does not then you can consider banning it completely using the
options you outlined."

--
Azadi Saryev
Sabai-dee.com
http://www.sabai-dee.com/
Re: blocking site downloaders
>> in case it's a search spider, you can try adding appropriate robots.txt

The robots.txt file may be useful for well-behaved robots like Google or
Yahoo, but bad bots disguise themselves as "Explorer" or "Mozilla" and do
not comply with robots.txt directives. Most of them never read it, and some
of them will even read it... and visit the forbidden areas on purpose. That
is one of the tricks I use to detect them: any attempt to read a page
referred to only in robots.txt marks the agent as a bad bot.
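A rough sketch of that trap, assuming a directory /no-robots/ that is listed
as Disallowed in robots.txt and never linked from any real page (the path and
struct name are made up for illustration); something like this could run in
Application.cfm or onRequestStart():

<!---
    robots.txt:
        User-agent: *
        Disallow: /no-robots/
--->
<cfif findNoCase("/no-robots/", cgi.script_name)>
    <!--- Only an agent that read robots.txt and ignored it gets here:
          remember its IP in a shared struct, then refuse the request --->
    <cflock scope="application" type="exclusive" timeout="5">
        <cfif NOT structKeyExists(application, "badBots")>
            <cfset application.badBots = structNew()>
        </cfif>
        <cfset application.badBots[cgi.remote_addr] = now()>
    </cflock>
    <cfheader statuscode="403" statustext="Forbidden">
    <cfabort>
</cfif>

Later requests can then check structKeyExists(application.badBots,
cgi.remote_addr) and refuse or throttle those agents.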
Re: blocking site downloaders
>> (c) What the hell are they doing?

Some will search for copyrighted images and, if they find some, they will
send you a lawyer letter with a bill of $1500 or so for using their image
(ie: Getty Images, ...). Most of them are simply looking for email addresses
they can collect and sell to spammers.

>> The current offending IP reverses to China.

From China, it could also be a government spider checking whether your site
promotes human rights, and if it does, I've got bad news: your site will be
BANNED from China! ;-) PLUS email fetchers, of course.

>> o Add a session variable that stores the last page view time down to the
>> second. However, this may not work as the offending behavior is probably
>> generated by a bot or desktop app that doesn't store session variables.

Exactly. Better to store the IP address in an application variable (or even a
server variable) and calculate a moving average of the time between page
requests. If the average gets lower than a certain minimum, ban the IP
address for at least a couple of hours. The bot will be discouraged.

>> o Review my databased logs for the current IP's last twenty page views.
>> This may put an extra small hit on the server, but over all not as much as
>> an extra 2200 page views in an hour every couple of weeks.

This can be a pretty heavy task for the server. Log files are pretty big
files.

>> If the requesting IP has requested more than twenty pages from the website
>> in the current minute, I block the IP for a period of time, say, an hour
>> or two.

This is what I do, but I use a moving average kept in a server variable.
Much more efficient than reading log files. ie:

server.structPass[REMOTE_ADDR].durpass = server.structPass[REMOTE_ADDR].durpass*6/7 + thisLaps;

In thisLaps I have the number of seconds between the last and the current
request. The formula gives a moving average over roughly the last 7
consecutive requests. If this average goes below a certain amount, there is a
good chance the agent is a bot. I use this test among others to detect bad
bots.
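A sketch of that approach, kept here in the application scope (a server
variable, as above, would let one struct cover every site on the box). The
struct name, the one-second floor and the two-hour ban are made up for
illustration, and the new lapse is divided by 7 so the running value stays in
plain seconds:

<cfset thisIP = cgi.remote_addr>
<cflock scope="application" type="exclusive" timeout="5">
    <cfif NOT structKeyExists(application, "ipStats")>
        <cfset application.ipStats = structNew()>
    </cfif>
    <cfif NOT structKeyExists(application.ipStats, thisIP)>
        <!--- First request from this IP: start with a comfortable average
              so nobody gets banned on page one --->
        <cfset application.ipStats[thisIP] = structNew()>
        <cfset application.ipStats[thisIP].lastHit = getTickCount()>
        <cfset application.ipStats[thisIP].avgLapse = 30>
        <cfset application.ipStats[thisIP].bannedUntil = createDateTime(1900, 1, 1, 0, 0, 0)>
    <cfelse>
        <cfset stats = application.ipStats[thisIP]>
        <cfset thisLapse = (getTickCount() - stats.lastHit) / 1000>
        <!--- Exponentially weighted moving average over roughly the last 7 requests --->
        <cfset stats.avgLapse = stats.avgLapse * 6 / 7 + thisLapse / 7>
        <cfset stats.lastHit = getTickCount()>
        <cfif stats.avgLapse LT 1>
            <!--- Averaging under one second between pages: ban the IP for two hours --->
            <cfset stats.bannedUntil = dateAdd("h", 2, now())>
        </cfif>
    </cfif>
    <cfset banned = application.ipStats[thisIP].bannedUntil GT now()>
</cflock>
<cfif banned>
    <cfheader statuscode="403" statustext="Forbidden">
    <cfabort>
</cfif>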
Re: blocking site downloaders
Is it a search engine robot? If so you could block it using robots.txt. What
Web browser are they using? Do you have any customers in China? If not, you
can block the entire country at the firewall level.

-Mike Chabot

On Fri, Oct 16, 2009 at 10:12 AM, Michael Muller wrote:
>
> Hey all,
>
> Every once in a while I'll notice in my logs that someone comes to one of my
> sites and hits thousands of pages in a short span and then leaves. This
> annoys me for a few reasons:
>
> (a) It's an unnecessary tax on my server (and we all hate taxes)
>
> (b) It artificially inflates my page hits
>
> (c) What the hell are they doing? Scraping my pages and hosting them on some
> site? The current offending IP reverses to China.
>
> So, to avoid this, I'm considering the following:
>
> o Add a session variable that stores the last page view time down to the
> second. However, this may not work as the offending behavior is probably
> generated by a bot or desktop app that doesn't store session variables.
>
> o Review my databased logs for the current IP's last twenty page views. This
> may put an extra small hit on the server, but over all not as much as an
> extra 2200 page views in an hour every couple of weeks.
>
> If the requesting IP has requested more than twenty pages from the website in
> the current minute, I block the IP for a period of time, say, an hour or two.
>
> I have about a dozen sites running the same software, each with multiple
> thousands of pages (community sites, with 21,000 messages and
>
> I've posed this question a couple of times before on this list and it hasn't
> prompted any response. I will try again, hoping someone will either tell me
> I'm worrying too much, or that this is a smart idea.
>
> Thanks,
>
> Mik
>
> Michael Muller
> office (413) 863-6455
> cell (413) 320-5336
> skype: michaelBmuller
> http://MontagueWebWorks.com
>
> Information is not knowledge
> Knowledge is not wisdom
>
> Eschew Obfuscation
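For what it's worth, the "more than twenty pages in the current minute" check
quoted above can also be done against an in-memory struct rather than the
database logs, which keeps the cost per request small. A minimal sketch (the
threshold, struct names and the two-hour block are only illustrative, and old
minute keys would need purging in a real setup):

<cfset thisIP = cgi.remote_addr>
<!--- Key the counter by IP plus the current clock minute --->
<cfset minuteKey = thisIP & "_" & dateFormat(now(), "yyyymmdd") & timeFormat(now(), "HHmm")>
<cflock scope="application" type="exclusive" timeout="5">
    <cfif NOT structKeyExists(application, "hitCounts")>
        <cfset application.hitCounts = structNew()>
        <cfset application.blockedIPs = structNew()>
    </cfif>
    <cfif NOT structKeyExists(application.hitCounts, minuteKey)>
        <cfset application.hitCounts[minuteKey] = 0>
    </cfif>
    <cfset application.hitCounts[minuteKey] = application.hitCounts[minuteKey] + 1>
    <cfif application.hitCounts[minuteKey] GT 20>
        <!--- More than 20 requests from this IP in the current minute: block it for two hours --->
        <cfset application.blockedIPs[thisIP] = dateAdd("h", 2, now())>
    </cfif>
    <cfset blocked = structKeyExists(application.blockedIPs, thisIP) AND application.blockedIPs[thisIP] GT now()>
</cflock>
<cfif blocked>
    <cfheader statuscode="403" statustext="Forbidden">
    <cfabort>
</cfif>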
Re: blocking site downloaders
some thoughts on the issue:

1) it may very well be a spider from a chinese search engine indexing your
pages... i'll leave it up to you to decide if you want none of 1 billion
chinese people to be able to find your site...

in case it's a search spider, you can try adding an appropriate robots.txt
file and robots <meta> tags to prevent access to and indexing of pages you do
not want accessed/indexed. you can also add rel="nofollow" to any links you
do not want followed by robots. of course, not all robots obey these rules -
if this bot does not, then you can consider banning it completely using the
options you outlined.

another impact of robots accessing your site is that your app will still try
to create session vars for their sessions, filling your server memory with
useless sessions. you can try to lessen this burden on your server by setting
a super-short session timeout for robots' sessions. iirc, ben nadel had a
good blog post on how to do this over on http://www.bennadel.com/ (see the
sketch after this message for the general idea).

2)
>> However, this may not work as the offending behavior is probably
>> generated by a bot or desktop app that doesn't store session variables.

if you use j2ee sessions then session vars are in-memory only and are stored
in the server's memory, not on the client's computer...

3)
>> This may put an extra small hit on the server, but over all not as much as
>> an extra 2200 page views in an hour every couple of weeks.

i think you may be wrong here... 2200 page requests once every couple of
weeks seems like less of a tax on your server than parsing your logs on every
page request... but then it depends on how loaded your sites are...

just some quick thoughts that i hope may help you...

Azadi Saryev
Sabai-dee.com
http://www.sabai-dee.com/

On 16/10/2009 22:12, Michael Muller wrote:
> Hey all,
>
> Every once in a while I'll notice in my logs that someone comes to one of my
> sites and hits thousands of pages in a short span and then leaves. This
> annoys me for a few reasons:
>
> (a) It's an unnecessary tax on my server (and we all hate taxes)
>
> (b) It artificially inflates my page hits
>
> (c) What the hell are they doing? Scraping my pages and hosting them on some
> site? The current offending IP reverses to China.
>
> So, to avoid this, I'm considering the following:
>
> o Add a session variable that stores the last page view time down to the
> second. However, this may not work as the offending behavior is probably
> generated by a bot or desktop app that doesn't store session variables.
>
> o Review my databased logs for the current IP's last twenty page views. This
> may put an extra small hit on the server, but over all not as much as an
> extra 2200 page views in an hour every couple of weeks.
>
> If the requesting IP has requested more than twenty pages from the website in
> the current minute, I block the IP for a period of time, say, an hour or two.
>
> I have about a dozen sites running the same software, each with multiple
> thousands of pages (community sites, with 21,000 messages and
>
> I've posed this question a couple of times before on this list and it hasn't
> prompted any response. I will try again, hoping someone will either tell me
> I'm worrying too much, or that this is a smart idea.
>
> Thanks,
>
> Mik
>
> Michael Muller
> office (413) 863-6455
> cell (413) 320-5336
> skype: michaelBmuller
> http://MontagueWebWorks.com
>
> Information is not knowledge
> Knowledge is not wisdom
>
> Eschew Obfuscation
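a very rough sketch of the robot-session-timeout idea mentioned in point 1
above (the user-agent pattern and the timeouts are only illustrative - see
ben nadel's blog for his actual approach); it sits in the Application.cfc
pseudo-constructor:

<cfcomponent output="false">
    <cfset this.name = "mySite">
    <cfset this.sessionManagement = true>
    <!--- normal timeout for real visitors --->
    <cfset this.sessionTimeout = createTimeSpan(0, 0, 20, 0)>
    <!--- if the user agent looks like a robot, keep its session for only a few seconds --->
    <cfif reFindNoCase("(bot|crawl|spider|slurp)", cgi.http_user_agent)>
        <cfset this.sessionTimeout = createTimeSpan(0, 0, 0, 2)>
    </cfif>

    <!--- ... onApplicationStart, onRequestStart, etc. ... --->
</cfcomponent>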