Hi everyone,
In an interesting new twist, I re-indexed the site and suddenly have only
the start URL in the db. I hadn't changed anything on the backend (only the
HTML side) since the last htdig run, so I'm puzzled. htdig indexes the first
page correctly, but then stops there without following any links.
At first I thought robots.txt was the problem, but many links that should
still be followed aren't appearing in the search results. Any ideas?
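To rule robots.txt out, here is a minimal Python sketch (not how htdig itself does the check) that feeds the Disallow rules from the log below into the standard library's urllib.robotparser and asks whether a URL may be fetched. The /some_section/ path is a made-up example, not a real page on the site.

```python
from urllib.robotparser import RobotFileParser

# Disallow rules copied from the robots.txt lines in the log below.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /ads/
Disallow: /archives/
Disallow: /css/
Disallow: /current/
Disallow: /global/
Disallow: /back_issues/
Disallow: /contests/
Disallow: /info/
Disallow: /onion_help/
Disallow: /print_edition/
Disallow: /site_index/
Disallow: /email_this_page/
"""


def is_allowed(url, agent="htdig"):
    """Return True if the rules above let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.theonion.com/",
        "http://www.theonion.com/archives/",
        "http://www.theonion.com/some_section/index.html",  # hypothetical path
    ):
        print(url, is_allowed(url))
```

The start URL itself is not covered by any of these rules, so robots.txt alone would only explain the empty crawl if every link on the page pointed into one of the disallowed directories.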
Here's the verbose output, including the server responses:
rundig -vvvvv
ht://dig Start Time: Fri Nov 8 13:42:08 2002
1:1:http://www.theonion.com/
New server: www.theonion.com, 80
- Persistent connections: enabled
- HEAD before GET: disabled
- Timeout: 30
- Connection space: 0
- Max Documents: -1
- TCP retries: 1
- TCP wait time: 5
- Accept-Language:
Trying to retrieve robots.txt file
Creating an HtHTTPBasic object
Making HTTP request on http://www.theonion.com/robots.txt
Try to get through to host www.theonion.com (port 80)
1 - Open of the connection ok
Assigned the remote host www.theonion.com
Assigned the port 80
Header line: HTTP/1.1 200 OK
Header line: Date: Fri, 08 Nov 2002 19:42:08 GMT
Header line: Server: Apache/1.3.22 (Unix) (Red-Hat/Linux) mod_python/2.7.6
Python/1.5.2 mod_ssl/2.8.5 OpenSSL/0.9.6b DAV/1.0.2 PHP/4.0.6 mod_perl/1.26
mod_throttle/3.1.2
Header line: Last-Modified: Tue, 05 Nov 2002 21:09:09 GMT
Header line: ETag: "97cf8-10d-3dc83375"
Discarded header line: ETag: "97cf8-10d-3dc83375"
Header line: Accept-Ranges: bytes
Discarded header line: Accept-Ranges: bytes
Header line: Content-Length: 269
Header line: Content-Type: text/plain
Retrieving document /robots.txt on host: www.theonion.com:80
Http version : HTTP/1.1
Server : HTTP/1.1
Status Code : 200
Reason : OK
Access Time : Fri, 08 Nov 2002 19:42:08 GMT
Modification Time : Tue, 05 Nov 2002 21:09:09 GMT
Content-type : text/plain
Persistent connection: would be accepted
Reading the body of the response
Connection stays up ... (Persistent connection)
Request time: 0 secs
Parsing robots.txt file using myname = htdig
Robots.txt line: User-Agent: *
Found 'user-agent' line: *
Robots.txt line: Disallow: /ads/
Found 'disallow' line: /ads/
Robots.txt line: Disallow: /archives/
Found 'disallow' line: /archives/
Robots.txt line: Disallow: /css/
Found 'disallow' line: /css/
Robots.txt line: Disallow: /current/
Found 'disallow' line: /current/
Robots.txt line: Disallow: /global/
Found 'disallow' line: /global/
Robots.txt line: Disallow: /back_issues/
Found 'disallow' line: /back_issues/
Robots.txt line: Disallow: /contests/
Found 'disallow' line: /contests/
Robots.txt line: Disallow: /info/
Found 'disallow' line: /info/
Robots.txt line: Disallow: /onion_help/
Found 'disallow' line: /onion_help/
Robots.txt line: Disallow: /print_edition/
Found 'disallow' line: /print_edition/
Robots.txt line: Disallow: /site_index/
Found 'disallow' line: /site_index/
Robots.txt line: Disallow: /email_this_page/
Found 'disallow' line: /email_this_page/
Pattern:
/ads/|/archives/|/css/|/current/|/global/|/back_issues/|/contests/|/info/|/onion_help/|/print_edition/|/site_index/|/email_this_page/
1 - Closing previous connection with the remote host
pushed
pick: www.theonion.com, # servers = 1
> www.theonion.com supports HTTP persistent connections (infinite)
0:2:0:http://www.theonion.com/: Creating an HtHTTPBasic object
Making HTTP request on http://www.theonion.com/
Try to get through to host www.theonion.com (port 80)
2 - Open of the connection ok
Assigned the remote host www.theonion.com
Assigned the port 80
Header line: HTTP/1.1 200 OK
Header line: Date: Fri, 08 Nov 2002 19:42:08 GMT
Header line: Server: Apache/1.3.22 (Unix) (Red-Hat/Linux) mod_python/2.7.6
Python/1.5.2 mod_ssl/2.8.5 OpenSSL/0.9.6b DAV/1.0.2 PHP/4.0.6 mod_perl/1.26
mod_throttle/3.1.2
Header line: Transfer-Encoding: chunked
Header line: Content-Type: text/html
No modification time returned: assuming now
Retrieving document / on host: www.theonion.com:80
Http version : HTTP/1.1
Server : HTTP/1.1
Status Code : 200
Reason : OK
Access Time : Fri, 08 Nov 2002 19:42:08 GMT
Modification Time : Fri, 08 Nov 2002 19:42:08 GMT
Content-type : text/html
Transfer-encoding : chunked
Persistent connection: would be accepted
Reading the body of the response
Initial chunk-size: 1953
Chunk-size: 0
Connection stays up ... (Persistent connection)
Request time: 0 secs
Tag: <HTML>, matched -1
Tag: <HEAD>, matched -1
Tag: <TITLE>, matched 0
word: The@1
word: Onion@2
word: America's@3
word part: America@3
word: Finest@4
word: News@5
word: Source™@6
word part: Source@6
word part: 153@6
Tag: </TITLE>, matched 1
title: The Onion | America's Finest News Source™
Tag: <META NAME="robots" content="index,follow">, matched 20
Tag: <META NAME="description" CONTENT="The Onion, America's Finest News
Reporting is an award-winning satirical publication founded in 1988 in
Madison, Wisconsin.">, matched 20
META Description: The Onion, America's Finest News Reporting is an
award-winning satirical publication founded in 1988 in Madison, Wisconsin.
meta description: The Onion, America's Finest News Reporting is an
award-winning satirical publication founded in 1988 in Madison, Wisconsin.
word: The@1
word: Onion@2
word: America's@3
word part: America@3
word: Finest@4
word: News@5
word: Reporting@6
word: award-winning@7
word part: award@7
word part: winning@7
word: satirical@8
word: publication@9
word: founded@10
word: 1988@11
word: Madison@12
word: Wisconsin.@13
Tag: <META NAME="keywords" CONTENT="The Onion, Onion, America's Finest News
Source, satire, political, humor, comedy, jokes, editorial, magazine,
newspaper, In the News, News in Brief, Top Story, What Do You Think,
Infographic, STATShot, The Onion in History, area man, area woman, wit, pop
culture, shattered nation, Attack, Middle East crisis, Iraq, Americans,
George W Bush, Congress, Bill Clinton, Al Gore, Ralph Nader, religion, God,
Christ, Pope, Starbucks, Star Wars, Death Star, special forces, Crypty,
Cryptosporidium, Eminem, Ted Nugent, Harry Potter, monkey, school, college,
drugs, marijuana, ferret, campus life, Taco Bell, Doritos, NASA, NASCAR,
Microsoft, Bill Gates, Books, Our Dumb Century, Dispatches from the Tenth
Circle, Onion calendars, Onion print edition, Onion mobile edition,
merchandise, Smoove B, love man, Jim Anchower, The Cruise, Jean Teasdale, A
Room of Jean's Own, Jeanketeer, Jackie Harvey, The Outside Scoop, Herbert
Kornfeld, h-dog, point-counterpoint, T. Herman Zweibel, The Mercantile
Onion, Advice, Reviews, Justify Your Existence, Savage Love, Red Meat,
Pathetic Geek Stories">, matched 20
word: The@1
word: Onion@2
word: Onion@3
word: America's@4
word part: America@4
word: Finest@5
word: News@6
word: Source@7
word: satire@8
word: political@9
word: humor@10
word: comedy@11
word: jokes@12
word: editorial@13
word: magazine@14
word: newspaper@15
word: the@16
word: News@17
word: News@18
word: Brief@19
word: Top@20
word: Story@21
word: What@22
word: You@23
word: Think@24
word: Infographic@25
word: STATShot@26
word: The@27
word: Onion@28
word: History@29
word: area@30
word: man@31
word: area@32
word: woman@33
word: wit@34
word: pop@35
word: culture@36
word: shattered@37
word: nation@38
word: Attack@39
word: Middle@40
word: East@41
word: crisis@42
word: Iraq@43
word: Americans@44
word: George@45
word: Bush@46
word: Congress@47
word: Bill@48
word: Clinton@49
word: Gore@50
word: Ralph@51
word: Nader@52
word: religion@53
word: God@54
word: Christ@55
word: Pope@56
word: Starbucks@57
word: Star@58
word: Wars@59
word: Death@60
word: Star@61
word: special@62
word: forces@63
word: Crypty@64
word: Cryptosporidium@65
word: Eminem@66
word: Ted@67
word: Nugent@68
word: Harry@69
word: Potter@70
word: monkey@71
word: school@72
word: college@73
word: drugs@74
word: marijuana@75
word: ferret@76
word: campus@77
word: life@78
word: Taco@79
word: Bell@80
word: Doritos@81
word: NASA@82
word: NASCAR@83
word: Microsoft@84
word: Bill@85
word: Gates@86
word: Books@87
word: Our@88
word: Dumb@89
word: Century@90
word: Dispatches@91
word: from@92
word: the@93
word: Tenth@94
word: Circle@95
word: Onion@96
word: calendars@97
word: Onion@98
word: print@99
word: edition@100
word: Onion@101
word: mobile@102
word: edition@103
word: merchandise@104
word: Smoove@105
word: love@106
word: man@107
word: Jim@108
word: Anchower@109
word: The@110
word: Cruise@111
word: Jean@112
word: Teasdale@113
word: Room@114
word: Jean's@115
word part: Jean@115
word: Own@116
word: Jeanketeer@117
word: Jackie@118
word: Harvey@119
word: The@120
word: Outside@121
word: Scoop@122
word: Herbert@123
word: Kornfeld@124
word: h-dog@125
word part: dog@125
word: point-counterpoint@126
word part: point@126
word part: counterpoint@126
word: Herman@127
word: Zweibel@128
word: The@129
word: Mercantile@130
word: Onion@131
word: Advice@132
word: Reviews@133
word: Justify@134
word: Your@135
word: Existence@136
word: Savage@137
word: Love@138
word: Red@139
word: Meat@140
word: Pathetic@141
word: Geek@142
word: Stories@143
Tag: <META NAME="copyright" CONTENT="(c) Copyright 2002 by Onion, Inc. All
rights reserved.">, matched 20
Tag: <META http-equiv="Content-Type" CONTENT="text/html;
charset=iso-8859-1">, matched 20
Tag: <META NAME="generator" CONTENT="Onion WebDesign">, matched 20
Tag: <LINK REL="stylesheet" TYPE="text/css" HREF="/css/main_mac.css">,
matched 26
href: http://www.theonion.com/css/main_mac.css ()
Rejected: Extension is invalid!
url rejected: (level 1)http://www.theonion.com/css/main_mac.css
Tag: </HEAD>, matched -1
Tag: <BASE TARGET="_parent">, matched 23
Tag: <BODY BGCOLOR="#FFFFFF" MARGINWIDTH=0 MARGINHEIGHT=0 LEFTMARGIN=0
TOPMARGIN=0>, matched -1
Tag: <script language="JavaScript" type="text/javascript"
src="http://66.216.104.232:80/servlet/ajrotator/79/0/viewJScript?pool=52&type=2137">, matched 29
Tag: </script>, matched 30
Tag: </BODY>, matched -1
Tag: </HTML>, matched -1
head:
size = 1953
pick: www.theonion.com, # servers = 1
> www.theonion.com supports HTTP persistent connections (infinite)
ht://dig End Time: Fri Nov 8 13:42:09 2002
2 - Closing previous connection with the remote host
ID: 2 URL: http://www.theonion.com/
Preamble text:
Postamble text:
Note: This message will be sent again if you do not change or
take away the notification of the above mentioned HTML page.
Find out more about the notification service at
http://www.htdig.org/meta.html
Cheers!
ht://Dig Notification Service
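Looking at the parser lines above, the only tags that carry a URL are the LINK to the stylesheet and the script src; no <A HREF> tags show up at all. As a cross-check outside htdig, a small Python sketch like the following lists every href/src found in a page. The SAMPLE string is a trimmed reconstruction of the tags in the log, not the real page source, and LinkLister and list_links are just helper names made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkLister(HTMLParser):
    """Collect (tag, absolute URL) pairs for every href/src attribute."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append((tag, urljoin(self.base_url, value)))


def list_links(html, base_url):
    """Return the (tag, URL) pairs found in `html`."""
    parser = LinkLister(base_url)
    parser.feed(html)
    return parser.links


# Trimmed reconstruction of the tags reported in the log (assumption).
SAMPLE = """<HTML><HEAD><TITLE>The Onion</TITLE>
<LINK REL="stylesheet" TYPE="text/css" HREF="/css/main_mac.css">
</HEAD><BASE TARGET="_parent">
<BODY>
<script language="JavaScript" type="text/javascript"
src="http://66.216.104.232:80/servlet/ajrotator/79/0/viewJScript?pool=52&type=2137">
</script>
</BODY></HTML>"""

if __name__ == "__main__":
    for tag, url in list_links(SAMPLE, "http://www.theonion.com/"):
        print(tag, url)
```

If the real page gives the same picture (a stylesheet and a script, but no anchors), then the crawler has nothing to follow from the start URL, whatever robots.txt says.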
_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>