Let us please be civil and as always, Caveat Lector.
========================================================================
Archives Available at:
http://www.mail-archive.com/[EMAIL PROTECTED]/ ctrl
========================================================================
To subscribe to Conspiracy Theory Research List[CTRL] send email: SUBSCRIBE CTRL [to:] [EMAIL PROTECTED]
To UNsubscribe to Conspiracy Theory Research List[CTRL] send email: SIGNOFF CTRL [to:] [EMAIL PROTECTED]
Om
--- Begin Message ---
-Caveat Lector-
Google is dying: Death by a billion cuts
by Daniel Brandt
September 10, 2004
On sites with more than a few thousand pages, Google is not indexing anywhere from ten percent to seventy percent of the pages it knows about.
_________________________
Bob> Google is dead already.
BigBro has no need to fear that small sites will teach history, or
hyperlink, annotate, and analyze news in real time.
http://www.google.com/search?q=site%3Awww.madcowprod.com%20seal
...21
http://www.google.com/search?q=site%3Awww.madcowprod.com%20mena
...16
http://www.google.com/search?q=site%3Awww.madcowprod.com%20cuba
...but only five pages for Cuba, a FAR more popular search term, making it
difficult for Cubans to learn Real Cuban History
http://www.google.com/search?q=site%3Awww.madcowprod.com%20porter%20goss
...only one hit for Porter Goss doesn't teach much Real Cuban History!
http://www.google.com/search?q=site%3Asitbot.net%20bush
...13?!
http://www.google.com/search?q=site%3Asitbot.net%20heroin
...FIVE hits on "heroin" in 70,000 page cia-drugs archives!
http://www.google.com/search?q=site%3Asitbot.net%20cocaine
...two!
http://www.google.com/search?q=site%3Asitbot.net%20cia
...18 hits for "CIA" found in 70,000 page cia-drugs archive!
http://www.google.com/search?q=site%3Awww.sitbot.net%20phoenix
...one hit but that page has nothing to do with Phoenix Program history!
http://www.google.com/search?q=site%3Awww.solari.com%20hamilton
...67 pages is about right for Hamilton (good guy)
http://www.google.com/search?q=site%3Awww.solari.com%20ervin
...but 185 Ervin (bad guy) hits sounds a bit unfair to Hamilton!
http://www.google.com/search?q=site%3Awww.sitbot.net%20roadsend
http://www.google.com/search?q=site%3Awww.sitbot.net%20recbo
...Hint: eric44 and webfairy do relatively well
http://www.google.com/search?q=site%3Awww.sitbot.net%20hopsicker
...uh-huh, "small world", I mean REALLY small
http://www.google.com/search?q=site%3Awww.sitbot.net%20saeed%20sheikh
http://www.google.com/search?q=site%3Awww.sitbot.net%20khalid%20sheikh%20mohammed
http://www.google.com/search?q=site%3Awww.sitbot.net%20goss
http://www.google.com/search?q=site%3Awww.sitbot.net%20mahmud%20isi
Dead.
-Bob
replace "phoenix" with your favorite unfound word, complain--
http://www.google.com/quality_form?q=site:www.sitbot.net+phoenix
Rocky Ward wrote:
Google is dying: Death by a billion cuts
by Daniel Brandt
September 10, 2004
On sites with more than a few thousand pages, Google is not indexing anywhere from ten percent to seventy percent of the pages it knows about. These pages show up in Google's main index as a listing of the URL, which means that the Googlebot is aware of the page. But they do not show up as an indexed page. When the page is listed but not indexed, the only way to find it in a search is if your search terms hit on words in the URL itself. Even if they do hit, these listed pages rank so poorly compared to indexed pages, that they are almost invisible. This is true even though the listed pages still retain their usual PageRank.
I have been complaining about this since April 2003, and it has become more visible in 2004. There is no method to Google's madness, which is another way of saying that this phenomenon is not characteristic of any particular type of site. It is happening across the entire landscape of large sites. I find it on www.johnkerry.com, on searchenginewatch.com, and dozens of other large sites I checked. Our own site, www.namebase.org, is a clean example of this, and I will use it to show how to do searches that expose this phenomenon.
You have to know what to look for and how to look for it...
First of all, a listing consists of the URL in place of the title on Google's search results pages, in blue, and below this in a smaller font there appears a "Similar pages" link in blue. That's all. An indexed page has a real title, almost always has a snippet in black, shows the URL and the size of the page in green, and then has "Cached" and "Similar pages" links in blue. (On NameBase we disallow Google's cache copy, so the "Cached" link is legitimately missing on all of our pages.) These two types of links are very different and immediately obvious. However, you should set your Google preferences to 100 links per page, because the listed links are buried much deeper in the results.
Before I explain how to isolate the listed links from the indexed links, there are two cases I know of where a listing is normal for Google. These are exceptions to the phenomenon that interests me in this essay. Neither is relevant to NameBase, but I have to mention them in case you want to examine other sites. The first exception is when a site has certain directories disallowed in their robots.txt file. Google will habitually list the URLs in the disallowed directory but not index them. (This itself is an invasion of privacy, because filenames can be very revealing -- but that's a rant for another day.)
The second exception is when there are ID numbers at the end of the URL, particularly if these numbers follow a question mark in the URL. Google avoids any URL that looks like it might be a problem. Sometimes this number is a session ID number from a shopping cart site. If Google followed these links, the crawler might end up grabbing thousands of duplicate pages, distinguished only by the session ID.
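When examining another site, the two exceptions above can be screened out before counting listed URLs. The following is only a rough sketch in Python: the helper names, the example.com URLs, and the sample robots.txt text are made up for illustration, and the "all digits after the question mark" test is a crude stand-in for whatever rule Google actually applies, which the essay does not specify.

```python
from urllib.parse import urlsplit, parse_qsl
from urllib.robotparser import RobotFileParser


def is_disallowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """First exception: the URL sits in a directory disallowed by robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


def has_numeric_id_query(url: str) -> bool:
    """Second exception (assumed heuristic): an ID number follows the '?'."""
    query = urlsplit(url).query
    if query.isdigit():
        # Bare number after the question mark, e.g. page.html?12345
        return True
    # key=value pairs where some value is all digits, e.g. ?sid=12345
    return any(value.isdigit() for _, value in parse_qsl(query))


# Hypothetical robots.txt content and URLs, for illustration only.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"

if __name__ == "__main__":
    for url in (
        "http://www.example.com/private/notes.html",
        "http://www.example.com/cart.php?sid=12345",
        "http://www.example.com/articles/index.html",
    ):
        expected_listing = is_disallowed(SAMPLE_ROBOTS, url) or has_numeric_id_query(url)
        print(url, "-> listing is normal" if expected_listing else "-> should be indexed")
```

A URL that passes both checks and still shows up only as a listing is a candidate for the phenomenon this essay describes.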
Now that you know what I'm not talking about, here is how you can investigate a site. First you have to find a word on the site that is present on nearly every page of the site. On some of the sites we looked at, the word "reserved" from the copyright notice (as in "All rights reserved") worked fairly well. On NameBase, we have "home page" at the bottom. The "site:" command is used in conjunction with "home page." By putting "home page" in quotes, the search is more accurate:
site:www.namebase.org "home page"
That search asks for all pages from www.namebase.org that include the phrase "home page." These will be indexed pages. If the page was merely listed, Google wouldn't be aware that this phrase is at the bottom of the page. Next you can request all pages that do not contain this phrase, by inserting a minus sign in front of the phrase:
site:www.namebase.org -"home page"
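For anyone who wants to run the same pair of searches against several sites, the two queries above can be assembled mechanically. This is a minimal sketch in Python; the function names are invented here, and the URL form simply mirrors the google.com/search?q=... links quoted earlier in this message.

```python
from urllib.parse import quote


def site_queries(host: str, phrase: str) -> tuple[str, str]:
    """Return (query for pages containing the phrase, query for pages without it)."""
    with_phrase = f'site:{host} "{phrase}"'
    without_phrase = f'site:{host} -"{phrase}"'
    return with_phrase, without_phrase


def search_url(query: str) -> str:
    """Percent-encode a query into a search URL like the ones quoted above."""
    return "http://www.google.com/search?q=" + quote(query, safe="")


if __name__ == "__main__":
    indexed_query, remainder_query = site_queries("www.namebase.org", "home page")
    print(indexed_query)
    print(search_url(indexed_query))
    print(remainder_query)
    print(search_url(remainder_query))
```

The phrase has to be one that appears on nearly every page of the site, as explained above; "home page" works for NameBase and "reserved" worked on some other sites.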
In the case of NameBase, this became a problem that I first noticed in April 2003. That was the month when Google underwent a massive upheaval, which I describe in my "Google is broken" essay. When that essay was written two months after the upheaval, it would have been speculative to claim that the listed URL phenomenon was a symptom of the 4-byte docID problem described in the essay. It was too soon. But sixteen months later, the URL listings are beginning to look very widespread and very suspicious. It's a major fault in Google's index, it is getting worse, and it is much more than a mere temporary glitch.
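To make the 4-byte docID argument concrete, here is the arithmetic as a small Python sketch. It assumes, as the hypothesis does, that each indexed page needs a unique unsigned ID of fixed width; the function names are invented for illustration and nothing here is taken from Google's actual code.

```python
def docid_capacity(num_bytes: int) -> int:
    """Number of distinct unsigned IDs that fit in num_bytes bytes."""
    return 2 ** (8 * num_bytes)


def free_docids(num_bytes: int, pages_indexed: int) -> int:
    """IDs still unassigned; zero once the ID space is exhausted."""
    return max(docid_capacity(num_bytes) - pages_indexed, 0)


if __name__ == "__main__":
    print(docid_capacity(4))  # 4294967296, i.e. roughly 4.29 billion
    print(docid_capacity(5))  # 1099511627776, for comparison
```

Under that assumption, a 4-byte ID tops out at about 4.29 billion pages, and any page beyond that count could not be given an ID of its own.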
Another curiosity emerged in August 2003, two months after my "Google is broken" essay. Google started showing supplemental results from an entirely separate index. If you run out of regular results you will often see the label "Supplemental Result" in green on the last page of available links. At that time Google briefly stated on their site that they "augment results for difficult queries by searching a supplemental collection of web pages." A representative from Google had little to add to this, but did concede that it is an entirely separate index, and then threw out a few words of spin. It sounded like a cover story. I believe that this new index was started due to a capacity problem in the main index and the need to develop new software.
Google is dying. It broke sixteen months ago and hasn't been fixed. It looks to me as if pages that have been noted by the crawler cannot be indexed until some other indexed page gives up its docID number. Now that Google is a public company, stockholders and analysts should require that Google give a full accounting of their indexing problems, and what they are doing to fix the situation. The SEC should get involved too, because this continuing decline in the quality of Google's main index is a significant risk factor that should have been mentioned in the prospectus.
*** [If you found this article of value, please peruse the SiaNews/FriendsOfLiberty Archives, and register (free) as a member, for exclusive benefits.]
"Get off of your ass and take your government back!" ~Rocky
"Knowledge makes a man unfit to be a slave." -Frederick Douglass
Please let us stay on topic and be civil.
-Home Page- www.cia-drugs.org
OM
www.ctrl.org DECLARATION & DISCLAIMER ========== CTRL is a discussion & informational exchange list. Proselytizing propagandic screeds are unwelcome. Substance, not soap-boxing, please! These are sordid matters and 'conspiracy theory', with its many half-truths, misdirections and outright frauds, is used politically by different groups with major and minor effects spread throughout the spectrum of time and thought. That being said, CTRL gives no endorsement to the validity of posts, and always suggests to readers: be wary of what you read. CTRL gives no credence to Holocaust denial and Nazis need not apply.
--- End Message ---