Re: Search Engine Bots Generating Strange Queries
Another reason not to use redirects for missing URIs is that you could mistakenly create what is called a "crawler trap". A crawler trap is a set of URLs that keep changing but keep producing the same content. The crawler gets stuck wasting its time downloading the same page, because it can't tell from the URL that the content is the same. Good crawlers have logic to prevent this problem, but your site could still be flagged as poorly structured, and commercial crawlers will avoid indexing your content.

--~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "CakePHP" group. To post to this group, send email to cake-php@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/cake-php?hl=en -~--~~~~--~~--~--~---
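One way a crawler might protect itself from the trap described above is to fingerprint the content rather than trust the URL. The following is only a minimal sketch in Python, not code from any real crawler; the class name, the example URLs, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


class DuplicateContentDetector:
    """Remembers page bodies by digest so repeated content can be spotted."""

    def __init__(self):
        # Maps a content digest to the first URL that produced it.
        self._first_url_by_digest = {}

    def check(self, url, body):
        """Return the earlier URL with identical content, or None if the body is new."""
        digest = hashlib.sha256(body).hexdigest()
        earlier = self._first_url_by_digest.get(digest)
        if earlier is None:
            self._first_url_by_digest[digest] = url
            return None
        return earlier


# Hypothetical URLs: the second one differs from the first but serves the same bytes.
detector = DuplicateContentDetector()
page = b"<html><body>Welcome</body></html>"
first = detector.check("http://example.com/a", page)
second = detector.check("http://example.com/a/a", page)
third = detector.check("http://example.com/other", b"<html>different</html>")
```

Note that the crawler still has to download each page before it can hash it, which is exactly the wasted time the message complains about; the hash only stops it from following the trap any deeper.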
Re: Search Engine Bots Generating Strange Queries
Most web crawlers won't read the content of a 404, because of the way servers send HTTP responses. When a crawler requests a page that is missing, it first receives the response headers, and it can read the response code, content type, and other information. The crawler can then stop the download after it has checked the response code, which reduces the bandwidth load on the server and the time the crawler spends on missing content.

If a redirect response is sent, the crawler must make another request to the server and will download the entire content of a page that does not reflect the source URL. The crawler will see a 200 response code on the new URI, download all the content, and increase the time and bandwidth spent crawling that domain.

But I understand what you're saying, Brendon, about it being a design choice. I'm just not sure that traversing the URL path improves the usability of the website for the people visiting it. Once they step up to an invalid URI they will be redirected somewhere else, which stops the traversal of the URL.

Here's CNN as an example.

http://edition.cnn.com/2008/POLITICS/11/06/middle.east.peace.deal/index.html
http://edition.cnn.com/2008/POLITICS/11/06/middle.east.peace.deal
http://edition.cnn.com/2008/POLITICS/11/06
http://edition.cnn.com/2008/POLITICS/11
http://edition.cnn.com/2008/POLITICS
http://edition.cnn.com/2008

While these links produce a 404 response and display HTML, a web crawler will not download the content after it has rejected the response code in the header of the HTTP response. So the most bandwidth load placed on the server is a few bytes per bad URI. This makes your domain crawler friendly, but a friendly crawler would not request phantom URIs.
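The "read the header, then stop" behaviour described in this message can be sketched with Python's standard library. This is an illustration under stated assumptions, not how any particular crawler is written: the handler class, the helper name fetch_status_only, and the local test server are all made up for the example. The client reads the status line and headers and then closes the connection without ever reading the body.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class MissingPageHandler(BaseHTTPRequestHandler):
    """Local test server: answers every GET with a 404 and a large HTML body."""

    def do_GET(self):
        body = b"<html>" + b"x" * 100_000 + b"</html>"
        self.send_response(404)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        try:
            self.wfile.write(body)
        except ConnectionError:
            pass  # the client hung up early, which is the point of the demo

    def log_message(self, *args):
        pass  # keep the example quiet


def fetch_status_only(host, port, path):
    """Request a URL, read the status line and headers, then drop the connection."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        response = conn.getresponse()  # parses the status line and headers
        return response.status, response.getheader("Content-Type")
    finally:
        conn.close()  # response.read() is never called, so the body is not consumed


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), MissingPageHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
status, content_type = fetch_status_only("127.0.0.1", server.server_port, "/controller/view")
server.shutdown()
server.server_close()
```

With a redirect the same client would have to issue a second request and, on the 200 that follows, read the whole page, which is the extra cost the message points out.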
Re: Search Engine Bots Generating Strange Queries
It may not index a 404, but it still checks the 404. For usability's sake I'd still prefer to redirect rather than send a 404. Although we were discussing bots, we have to keep the user in mind as well. I have personally traversed the URL path to see what may be found on some sites, and if Safari has the feature included out of the box, well... I'd rather present the user with something than nothing at all, and a 404 isn't my idea of proper degradation within the path. Either way, it's simply a matter of personal preference.

Google was not the first search engine to incorporate robots.txt, by the way... they were the first to incorporate rel="nofollow" and also, I think, the SiteMap.xml idea.

On Nov 6, 12:05 pm, Mathew <[EMAIL PROTECTED]> wrote:
> > I'd actually say using a permanent redirect (301, I believe) to your
> > root (or that controller's index), rather than to the 404 page might
> > be a better solution. If your users/visitors won't see it since
> > you're not linking to it, it isn't really a bad solution, and I doubt
> > you'd want any search engines indexing 404 errors in association with
> > your site/domain. If it was a hacker, I don't think I'd send them a
> > 404 message either, I'd just redirect them...if it was a Safari user,
>
> You should not redirect unless the content has been moved. Sending the
> wrong response codes to incorrect URIs makes it difficult for web
> crawl operators to correctly crawl your site. Should a web crawl
> operator come to the conclusion that your site provides incorrect
> response codes, then they might choose to crawl it aggressively since
> the server's responses cannot be trusted.
>
> Indexing bots will not index a 404 response code from the HTTP header.
> That response code tells the bots the URI points to no content. Bots
> will only index pages when the 404 error message is sent with an HTTP
> 200 response code and a text/html content-type in the header, which is
> incorrect and more of an error on the server side than a problem with
> the bot.
>
> If you send a 301/302 response code you are telling the bot that this
> URI is valid and has been moved, so the source URI and the redirected
> URI will continue to be processed by the bot. Whereas if you tell the
> bot 404, then the bot knows this URI is invalid, the source page that
> URI comes from is generating invalid URIs, and it can drop other URIs
> from that source.
>
> Sending a hacker a 301 or 302 does nothing to change their behavior,
> and provides them no more information than a 404.
>
> Blocking a remote computer from making too many invalid requests to
> your server does change the behavior of that remote computer. It stops
> it. Which is about all you can do at this point. A hacker will return
> with a different IP address, and attack. So, hackers are a completely
> different topic :)
Re: Search Engine Bots Generating Strange Queries
> I'd actually say using a permanent redirect (301, I believe) to your
> root (or that controller's index), rather than to the 404 page might
> be a better solution. If your users/visitors won't see it since
> you're not linking to it, it isn't really a bad solution, and I doubt
> you'd want any search engines indexing 404 errors in association with
> your site/domain. If it was a hacker, I don't think I'd send them a
> 404 message either, I'd just redirect them...if it was a Safari user,

You should not redirect unless the content has been moved. Sending the wrong response codes to incorrect URIs makes it difficult for web crawl operators to correctly crawl your site. Should a web crawl operator come to the conclusion that your site provides incorrect response codes, then they might choose to crawl it aggressively since the server's responses cannot be trusted.

Indexing bots will not index a 404 response code from the HTTP header. That response code tells the bots the URI points to no content. Bots will only index pages when the 404 error message is sent with an HTTP 200 response code and a text/html content-type in the header, which is incorrect and more of an error on the server side than a problem with the bot.

If you send a 301/302 response code you are telling the bot that this URI is valid and has been moved, so the source URI and the redirected URI will continue to be processed by the bot. Whereas if you tell the bot 404, then the bot knows this URI is invalid, the source page that URI comes from is generating invalid URIs, and it can drop other URIs from that source.

Sending a hacker a 301 or 302 does nothing to change their behavior, and provides them no more information than a 404.

Blocking a remote computer from making too many invalid requests to your server does change the behavior of that remote computer. It stops it. Which is about all you can do at this point. A hacker will return with a different IP address, and attack. So, hackers are a completely different topic :)
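The three cases in this message (index, follow, drop) can be written down as a small decision function. This is a deliberately simplified sketch of a hypothetical indexing bot, not the logic of any real search engine; the Action enum and the decide function are names invented for the example, and every status other than 200, 301 and 302 is lumped together with 404.

```python
from enum import Enum


class Action(Enum):
    INDEX = "index the content"
    FOLLOW = "request the redirect target"
    DROP = "discard the URI"


def decide(status, content_type=""):
    """Map a response code and content type to what the hypothetical bot does next."""
    if status == 200 and content_type.startswith("text/html"):
        return Action.INDEX   # the bot trusts the 200 and keeps the page
    if status in (301, 302):
        return Action.FOLLOW  # the URI is treated as valid but moved
    return Action.DROP        # 404 and everything else: nothing to index


ok_page = decide(200, "text/html; charset=utf-8")
soft_404 = decide(200, "text/html")  # an error page sent with the wrong code
moved = decide(301)
missing = decide(404, "text/html")
```

The soft_404 case is the server-side mistake described above: the bot has no way to tell an error page from real content when the header says 200, so the error page ends up in the index.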
Re: Search Engine Bots Generating Strange Queries
I'd actually say using a permanent redirect (301, I believe) to your root (or that controller's index), rather than to the 404 page, might be a better solution. If your users/visitors won't see it since you're not linking to it, it isn't really a bad solution, and I doubt you'd want any search engines indexing 404 errors in association with your site/domain. If it was a hacker, I don't think I'd send them a 404 message either, I'd just redirect them... if it was a Safari user, I'd rather give them a graceful degradation than a 404 just as well. That's just me though.

Standard incorrect addresses should still receive a 404. A 404 does serve a very important purpose.

On Nov 6, 9:00 am, Mathew <[EMAIL PROTECTED]> wrote:
> Hi Mike,
>
> If you're using Apache it has some features in the htaccess file that
> will allow you to disable access to your server for bots causing you
> trouble.
>
> In your Cake 404 display page keep track of the number of times a 404
> is generated per IP address, and if it exceeds a threshold log that IP
> address to a text file.
>
> Humans browsing a website will not generate many 404 messages, even if
> they have bad bookmarks, or follow old links from search engines. So
> an IP address requesting more than one hundred 404 errors is likely a
> problem bot. Each time a 404 page is displayed log the IP to a database
> with a counter. When the counter reaches your limit add that IP
> address to a text file.
>
> In your .htaccess you can load this text file of IP addresses and
> apply rules to those addresses. It's up to you if you wish to display
> a static access denied HTML page, or simply throw a connection
> refused.
>
> Sorry I don't remember the commands for the htaccess file.
Re: Search Engine Bots Generating Strange Queries
Hi Mike,

If you're using Apache it has some features in the htaccess file that will allow you to disable access to your server for bots causing you trouble.

In your Cake 404 display page keep track of the number of times a 404 is generated per IP address, and if it exceeds a threshold log that IP address to a text file.

Humans browsing a website will not generate many 404 messages, even if they have bad bookmarks, or follow old links from search engines. So an IP address requesting more than one hundred 404 errors is likely a problem bot. Each time a 404 page is displayed log the IP to a database with a counter. When the counter reaches your limit add that IP address to a text file.

In your .htaccess you can load this text file of IP addresses and apply rules to those addresses. It's up to you if you wish to display a static access denied HTML page, or simply throw a connection refused.

Sorry I don't remember the commands for the htaccess file.
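The counting scheme in this message could look roughly like the sketch below. It is written in Python only to show the logic, not as CakePHP code; the class name NotFoundTracker, the file name blocked_ips.txt, and the in-memory Counter (standing in for the database table the message mentions) are assumptions for the example. How Apache then consumes the text file is left out on purpose, since the message does not give the htaccess commands.

```python
import tempfile
from collections import Counter
from pathlib import Path


class NotFoundTracker:
    """Counts 404 responses per IP and appends heavy offenders to a text file."""

    def __init__(self, blocklist_path, threshold=100):
        self.blocklist_path = Path(blocklist_path)
        self.threshold = threshold
        self.counts = Counter()  # stands in for the database counter
        self.blocked = set()

    def record_404(self, ip):
        """Count one 404 for this IP; return True once the IP is on the block list."""
        self.counts[ip] += 1
        if self.counts[ip] > self.threshold and ip not in self.blocked:
            self.blocked.add(ip)
            # One address per line, written only the first time the limit is exceeded.
            with self.blocklist_path.open("a", encoding="utf-8") as handle:
                handle.write(ip + "\n")
        return ip in self.blocked


# Demo with a tiny threshold; the IP addresses are documentation-range examples.
with tempfile.TemporaryDirectory() as tmp:
    blocklist = Path(tmp) / "blocked_ips.txt"
    tracker = NotFoundTracker(blocklist, threshold=3)
    results = [tracker.record_404("203.0.113.7") for _ in range(5)]
    human = tracker.record_404("198.51.100.2")
    written = blocklist.read_text(encoding="utf-8")
```

The threshold of one hundred comes from the message; a real deployment would probably also want the counts to expire after some time so that a shared or reassigned IP address is not blocked forever.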
Re: Search Engine Bots Generating Strange Queries
Thank you Mathew - I log it every time before throwing the 404, and I figured whatever was creating these things would stop - but it continues. I'm so dadgum anal obsessive it just kills me - hard to ignore... It is not coming from any 'known' bot either...
Re: Search Engine Bots Generating Strange Queries
Great advice Mathew... Yes... I think that this is the way to go... point every /controller/action that doesn't mean anything without an extra id to a 404... once the crawler sees this 404 it would never try to fetch the same thing again.

Thanks.

On Sat, Nov 1, 2008 at 6:21 AM, Mathew <[EMAIL PROTECTED]> wrote:
> Hi Mike,
>
> Disallowing that in your robots.txt is a waste of time.
>
> The robots.txt file was started by Google, and is not an officially
> supported feature of all crawlers. So they don't have to follow it,
> and I can tell you this doesn't sound like the google bot anyway,
> because that bot doesn't generate phantom URIs.
>
> Web crawlers can extract URIs from many different sources, and they
> can generate URIs as they see fit. URIs can come from HTML, CSS, SWF,
> JavaScript, and form post/get actions. I've even seen crawlers submit
> post requests to generate more URIs to crawl.
>
> Crawlers will also clean URIs by removing ids, changing queries,
> faking cookies, and sometimes rotating their IP address.
>
> There are no rules about crawlers, no guidelines they have to follow,
> or limits on how long they will crawl or how aggressively they will
> request URIs from your server.
>
> You should modify your Routes to point to a 404 if they request paths
> that you don't want them to see.

--
Thanks & Regards,
Novice.
Re: Search Engine Bots Generating Strange Queries
Hi Mike,

Disallowing that in your robots.txt is a waste of time.

The robots.txt file was started by Google, and is not an officially supported feature of all crawlers. So they don't have to follow it, and I can tell you this doesn't sound like the google bot anyway, because that bot doesn't generate phantom URIs.

Web crawlers can extract URIs from many different sources, and they can generate URIs as they see fit. URIs can come from HTML, CSS, SWF, JavaScript, and form post/get actions. I've even seen crawlers submit post requests to generate more URIs to crawl.

Crawlers will also clean URIs by removing ids, changing queries, faking cookies, and sometimes rotating their IP address.

There are no rules about crawlers, no guidelines they have to follow, or limits on how long they will crawl or how aggressively they will request URIs from your server.

You should modify your Routes to point to a 404 if they request paths that you don't want them to see.
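The closing advice, answering 404 for any path you don't want crawled, comes down to checking the shape of the path before dispatching it. The sketch below shows that check in Python rather than in CakePHP's own routing syntax; the pattern, the function name status_for, and the rule that only a numeric id is valid are assumptions taken from the /controller/view/34 example elsewhere in the thread.

```python
import re

# Hypothetical rule: a "view" action is only valid with a numeric id after it.
VIEW_WITH_ID = re.compile(r"^/(?P<controller>[a-z_]+)/view/(?P<id>\d+)/?$")


def status_for(path):
    """Return the status code the site should send for a requested path."""
    return 200 if VIEW_WITH_ID.match(path) else 404


valid = status_for("/controller/view/34")
phantom = status_for("/controller/view")
phantom_slash = status_for("/controller/view/")
```

The point of sending 404 here rather than rendering a default page is that the bot gets a clear signal that the phantom URI does not exist.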
Re: Search Engine Bots Generating Strange Queries
So you're saying the search bots are just walking all my actions as if they are subdirs on a site? Not sure about this. Maybe I should disallow those specific requests with robots.txt? Any other cakers have an opinion on this?

If I disallow www.mydomain.com/controller/action/ won't the bots stop walking all the actions?
Re: Search Engine Bots Generating Strange Queries
I'm totally no expert on this, but I'd guess that the bots are simply trying to walk the tree. If "http://mysite.com/directory/subdirectory/subsubdirectory" is valid, then "http://mysite.com/directory/subdirectory", "http://mysite.com/directory" and "http://mysite.com" are probably also valid. The GOOG doesn't know that those directories don't actually exist. In "classic" web development patterns there should be an index.htm file in each of these directories, so it can't hurt to look for them.

BTW: Safari (and possibly other browsers as well) allows you to right-click on the title bar and offers the same kind of "URL shortening shortcuts" in a popup menu.

On 30 Oct 2008, at 15:02, MikeK wrote:
> In a general CMS app written in CakePHP I am noticing in my logs
> invalid queries being generated by various search engine bots
> including Google, Inktomi, and Yahoo.
>
> What I'm wondering is WHY?
>
> For example they are requesting
>
> http://mysite.com/controller/view instead of the correct
>
> http://mysite.com/controller/view/34 (ex: id 34)
>
> Nowhere on my site do I publish any links to /controller/view without
> an id parameter.
>
> This is driving me slightly nuts. Why would a bot request a URI it has
> never seen?
>
> My validation code that checks for valid requests logs these
> occurrences and every day I puzzle over my logs and examine the emitted
> web page source wondering where or why they are requesting these
> invalid URIs. I've been dumping $_SERVER and no clues there either.
> The referer is always '/'.
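The "walking the tree" guess above is easy to picture as code: take a known URL and keep chopping off the last path segment. This is a small Python sketch of that idea, not a description of what Google or any other bot actually runs; the function name parent_urls is made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url):
    """List every shorter URL obtained by removing trailing path segments."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    parents = []
    while segments:
        segments.pop()  # drop the last remaining segment
        path = "/" + "/".join(segments)
        # Query string and fragment are discarded on purpose.
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents


guesses = parent_urls("http://mysite.com/controller/view/34")
```

For the URL from the original question this produces /controller/view, /controller and the site root, which matches the request for /controller/view that shows up in the logs without ever being linked.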
Search Engine Bots Generating Strange Queries
In a general CMS app written in CakePHP I am noticing in my logs invalid queries being generated by various search engine bots including Google, Inktomi, and Yahoo.

What I'm wondering is WHY?

For example they are requesting

http://mysite.com/controller/view instead of the correct

http://mysite.com/controller/view/34 (ex: id 34)

Nowhere on my site do I publish any links to /controller/view without an id parameter.

This is driving me slightly nuts. Why would a bot request a URI it has never seen?

My validation code that checks for valid requests logs these occurrences and every day I puzzle over my logs and examine the emitted web page source wondering where or why they are requesting these invalid URIs. I've been dumping $_SERVER and no clues there either. The referer is always '/'.