> From: Jeff Godin
> Sent: Sunday, January 25, 2004 2:35 PM
[...]
> Do you have information on what you used to do the extraction and sorting?
> A lot of tools have trouble just getting THAT part right. :)

I'm attaching the "extract_urls.pl" script that I used to find the URLs in
messages and to validate them by attempting an HTTP GET on each one. I'm
thinking now it might be good to change the program so that it also prints
the URLs whose GET attempt failed, but at the moment it only prints the URLs
that worked.

Also attached is a script that processes the output file from extract_urls.pl
(whose lines are in the form <host><url>), finds the unique host names, and
prints them sorted by their "reverse dot" form. I'm a little hazy on whether
IP addresses need to be sorted component by component, but I'm less concerned
about that at the moment.

[...]

> If your goal is just to obtain the whois info for a domain, no need
> to reinvent the wheel at this point. See below. If the goal is something
> else... please clarify. :)

I'm just experimenting at the moment, but the current goal is to find their
whois data and look for relationships. My guess is that I'm somewhat
duplicating the work of Spamhaus and others, but I'm using this as a
learning experience.

Basically, I'm looking for a method to automatically collect URLs/domains
that are hosted by spammers. I plan to look both at content and at spam
indications in the whois info. These domains would then be fed back into
blacklisting at the gateway, or possibly blacklisting at the SA level.

One idea I had was to concatenate the material fetched from the URLs
referenced in a given mail message, and wrap the current headers from that
message around the newly created message body (which contains content from
the web pages). This would be fed back into SA for scoring. If it scores as
spam, then that URL is likely from a spammer. Not efficient, but possibly
effective.
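A minimal sketch of that wrap-and-rescore idea, assuming the pages have
already been fetched (the function name and inputs are my own, not from the
attached scripts):

```perl
#!/usr/bin/perl
# Sketch: wrap a message's original headers around the concatenated
# contents of the web pages its URLs pointed at, producing a synthetic
# message that can be piped to SpamAssassin for scoring.
use strict;
use warnings;

# $headers: the raw header block of the original message (up to the
# first blank line). @pages: page bodies already fetched from the URLs.
sub wrap_pages_in_headers {
    my ($headers, @pages) = @_;
    $headers =~ s/\s+\z//;            # trim trailing whitespace
    my $body = join "\n\n", @pages;   # concatenate fetched page content
    return "$headers\n\n$body\n";     # headers, blank line, new body
}

my $msg = wrap_pages_in_headers(
    "From: test\@example.com\nSubject: test",
    "<html>BUY NOW</html>",
    "<html>cheap pills</html>",
);
print $msg;
```

The resulting string could then be piped to `spamc` or `spamassassin -e`
to see whether the page content scores as spam.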
I tried this by hand and it gave the desired result -- SA detected the web
page content as spam.

> > Regarding whois, I tried a few of the domains in the list and noticed
> > that whois turned up empty. Is there a database somewhere that relates
> > domain names to their registrar, or to a server that will reply with
> > their whois info?
>
> For generic top level domains, you have the registries. Then you have
> the registrars. As far as maintaining the contact info, those should
> be the only folk you need to deal with.
>
> EPP registries are simple. The registries hold all the whois data.
> Most registries are "thin", in that they just refer you to the
> registrar's whois server for information.

What's an EPP registry?

> Some top level domains won't have available/accurate whois info.

So there's no requirement that a domain have a valid, correct whois entry?
And they can have no whois info at all? But they have to register with
someone, right? And that registrar doesn't require whois info?

> The WHOIS-SERVERS.NET zone is an excellent resource. The zone contains
> CNAME entries to match a TLD with the whois server's A record for that
> TLD.
>
> If you point your whois client at TLD.WHOIS-SERVERS.NET, where TLD is the
> top level domain that you wish to query, you should get results.
>
> There are many whois clients and scripts out there. The geektools.com
> whois proxy is nice, and I believe you can download the proxy code
> itself. Many people like the bbwhois client, and it offers a nice web
> interface and a database-backed cache.
>
> IANA maintains a list of country code top level domains. This list
> includes the entity acting as registry, whois servers, etc.
> http://www.iana.org/cctld/cctld-whois.htm

Good, thanks.
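The TLD.WHOIS-SERVERS.NET trick can be scripted directly. A minimal sketch
(the helper names are mine; whois is just plain text over TCP port 43):

```perl
#!/usr/bin/perl
# Sketch: derive the whois server for a domain from the
# whois-servers.net zone and query it over the whois protocol (port 43).
use strict;
use warnings;
use IO::Socket::INET;

# "example.com" -> "com.whois-servers.net"
sub whois_server_for {
    my ($domain) = @_;
    my ($tld) = $domain =~ /\.([a-z0-9-]+)\z/i
        or die "no TLD in '$domain'";
    return lc($tld) . ".whois-servers.net";
}

# Send the domain name, read the reply until the server closes.
sub whois_query {
    my ($domain) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => whois_server_for($domain),
        PeerPort => 43,
        Proto    => 'tcp',
        Timeout  => 30,
    ) or die "connect failed: $!";
    print $sock "$domain\r\n";
    local $/;                  # slurp the whole reply
    my $reply = <$sock>;
    close $sock;
    return $reply;
}

print whois_server_for("example.com"), "\n";
```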
> A similar list of registries for gTLD domains is available here:
> http://www.icann.org/registries/listing.html
>
> I don't think I've answered most of your questions, but hopefully
> the information I've provided will be helpful.

Yup. Appreciate it.

> Please keep in mind that there are various restrictions placed
> on the data in whois, and placed on the access to the whois servers
> for each registry. Pay careful attention that your actions do not
> cause others harm, or cause yourself to be blacklisted, etc.
>
> Bulk automated whois queries for any reason place a load on the
> target whois server(s). Tread lightly, and use caching as you see
> fit.

Yeah, I'm not sure how to approach this. I was thinking of randomizing the
time to the next query over some largish interval. The trouble is, I don't
know what constraints the pickiest whois server might impose.
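One way to implement that randomized spacing (the interval bounds here are
arbitrary placeholders, not limits any registry has published):

```perl
#!/usr/bin/perl
# Sketch: sleep a random amount between queries so bulk whois lookups
# don't hit the server at a fixed, aggressive rate.
use strict;
use warnings;

# Pick a delay uniformly from [$min, $max) seconds.
sub jittered_delay {
    my ($min, $max) = @_;
    return $min + rand($max - $min);
}

# Usage between queries, with a largish window:
# for my $domain (@domains) {
#     lookup($domain);                    # hypothetical query routine
#     sleep int jittered_delay(30, 120);  # 30s-2min between queries
# }
my $d = jittered_delay(30, 120);
printf "next query in %.1f seconds\n", $d;
```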
extract_urls.pl
Description: Binary data
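The attachment itself doesn't survive in the archive, so here is a hedged
sketch of what an extract_urls.pl-style tool might look like. The regex and
function names are my assumptions, and the validation uses HTTP::Tiny (core
since Perl 5.14) rather than whatever the original script used:

```perl
#!/usr/bin/perl
# Sketch: pull http/https URLs out of message text and keep only
# those that answer an HTTP GET successfully.
use strict;
use warnings;
use HTTP::Tiny;

# Crude URL matcher -- an assumption, not the original script's regex.
# Returns the URLs in order of first appearance, de-duplicated.
sub extract_urls {
    my ($text) = @_;
    my %seen;
    return grep { !$seen{$_}++ }
           $text =~ m{(https?://[^\s<>"')]+)}gi;
}

# Try a GET; true only if the server responded with a success status.
sub url_is_live {
    my ($url) = @_;
    my $resp = HTTP::Tiny->new(timeout => 15)->get($url);
    return $resp->{success};
}

# Typical use (network access, so commented out here):
# for my $url (extract_urls($message_text)) {
#     print "$url\n" if url_is_live($url);
# }
my @urls = extract_urls('Visit http://example.com/buy now!');
print "$_\n" for @urls;
```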
find_unique_hosts.pl
Description: Binary data
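Likewise for find_unique_hosts.pl, a sketch of the reverse-dot uniquing and
sorting described in the message. This is my reconstruction, not the attached
code, and it sorts IP addresses as plain strings, matching the hedge above
about component-wise sorting:

```perl
#!/usr/bin/perl
# Sketch: keep unique host names and sort them by their "reverse dot"
# form, so hosts under the same domain (www.example.com,
# mail.example.com) end up adjacent in the output.
use strict;
use warnings;

# "www.example.com" -> "com.example.www"
sub reverse_dot {
    my ($host) = @_;
    return join '.', reverse split /\./, lc $host;
}

# De-duplicate (case-insensitively), then sort on the reversed form.
sub sorted_unique_hosts {
    my (@hosts) = @_;
    my %seen;
    my @uniq = grep { !$seen{lc $_}++ } @hosts;
    return sort { reverse_dot($a) cmp reverse_dot($b) } @uniq;
}

# Note: this treats IP addresses as strings too; a component-wise
# numeric sort for IPs would need a separate code path.
print "$_\n" for sorted_unique_hosts(
    'www.example.com', 'mail.example.com', 'example.net', 'www.example.com'
);
```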
