I'm using last Sunday's 3.2b4 build, on FreeBSD 4.3-RELEASE. I have plenty of memory and disk space.
My indexing run terminates abruptly after indexing only a few pages. At first I thought I might have hit some limit; there were 38431 lines with links in them. I tried different values for max_doc_size, up to 4000000, which made no difference (I've put the relevant config excerpt at the end of this message). I then cut the page I was indexing down to just 50 links and it still crapped out, until I removed one particular link in the middle of the list. Yet that same link indexes correctly on its own. I'm puzzled! Below is the output from a run with -vv. The only difference I can see between this and a successful run is the line "+ size = 6614".

ht://dig Start Time: Wed Oct 17 12:41:53 2001
New server: www.citynews.com, 80
- Persistent connections: enabled
- HEAD before GET: disabled
- Timeout: 30
- Connection space: 0
- Max Documents: -1
- TCP retries: 1
- TCP wait time: 5
Trying to retrieve robots.txt file
Parsing robots.txt file using myname = htdig
Found 'user-agent' line: *
Found 'disallow' line: /cgi-bin/
Pattern: /cgi-bin/
pick: www.citynews.com, # servers = 1
0:2:0:http://www.citynews.com/adlist/: title: CityNews Free Photo Classifieds and Chat for US and World Cities
META Description: free classifieds,photo classified ads,community calendar,and chat rooms for north america and world cities
url rejected: (level 1)http://www.citynews.com/css/citynews.css
url rejected: (level 1)http://www.citynews.com/advertising.html
Rejected: item in exclude list
url rejected: (level 1)http://www.burstnet.com/ads/ad1847a-map.cgi
url rejected: (level 1)http://www.citynews.com/banners.html
pushing http://kc.citynews.com/5596.html
New server: kc.citynews.com, 80
- Persistent connections: enabled
- HEAD before GET: disabled
- Timeout: 30
- Connection space: 0
- Max Documents: -1
- TCP retries: 1
- TCP wait time: 5
Trying to retrieve robots.txt file
Parsing robots.txt file using myname = htdig
Found 'user-agent' line: *
Found 'disallow' line: /cgi-bin/
Pattern: /cgi-bin/
+ pushing http://london.citynews.com/17220.html
New server: london.citynews.com, 80
- Persistent connections: enabled
- HEAD before GET: disabled
- Timeout: 30
- Connection space: 0
- Max Documents: -1
- TCP retries: 1
- TCP wait time: 5
Trying to retrieve robots.txt file
Parsing robots.txt file using myname = htdig
Found 'user-agent' line: *
Found 'disallow' line: /cgi-bin/
Pattern: /cgi-bin/
+ size = 6614
pick: london.citynews.com, # servers = 3
1:4:1:http://london.citynews.com/17220.html: size = 3847
pick: kc.citynews.com, # servers = 3
2:3:1:http://kc.citynews.com/5596.html: title: Journal of Geocryology
Rejected: item in exclude list
url rejected: (level 1)http://www.burstnet.com/ads/ad1847a-map.cgi
url rejected: (level 1)http://www.citynews.com/banners.html
url rejected: (level 1)http://www.citynews.com/about.html
url rejected: (level 1)http://www.recommend-it.com/p.e?677339
Rejected: item in exclude list
url rejected: (level 1)http://kc.citynews.com/cgi-bin/pmail.cgi/5596/kc
url rejected: (level 1)http://kc.citynews.com/ads4.html
url rejected: (level 1)http://kc.citynews.com/
size = 3845
pick: www.citynews.com, # servers = 3
pick: london.citynews.com, # servers = 3
pick: kc.citynews.com, # servers = 3
pick: www.citynews.com, # servers = 3
htdig: Run complete
htdig: 3 servers seen:
htdig:     kc.citynews.com:80 1 document
htdig:     london.citynews.com:80 1 document
htdig:     www.citynews.com:80 1 document
HTTP statistics
===============
 Persistent connections    : Yes
 HEAD call before GET      : No
 Connections opened        : 6
 Connections closed        : 5
 Changes of server         : 2
 HTTP Requests             : 6
 HTTP KBytes requested     : 2.28223
 HTTP Average request time : 0.166667 secs
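For reference, here is a rough sketch of the attributes I changed in htdig.conf. The start_url matches the dig output above and max_doc_size is the largest value I tried; everything else is at or near the defaults, so treat the exact values as approximate rather than a verbatim copy of my file:

    # Approximate excerpt from my htdig.conf -- not the full file.
    # start_url is taken from the dig output above; max_doc_size is
    # the largest value I experimented with.
    start_url:      http://www.citynews.com/adlist/
    max_doc_size:   4000000
    # exclude_urls is also set (that's where the "Rejected: item in
    # exclude list" lines come from); patterns omitted here.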

