On May 15, 2007, at 9:04 PM, lazy wrote:

> Hi,
> I'm trying to extract the domain name from a URL. Let's say I call
> it full_domain and significant_domain (which is the homepage domain).
>
> E.g.: url=http://en.wikipedia.org/wiki/IPod,
> full_domain=en.wikipedia.org, significant_domain=wikipedia.org
>
> Using urlsplit (of the urlparse module), I will be able to get the
> full_domain, but I'm wondering how to get significant_domain. I will
> not be able to use something like counting the number of dots, etc.
>
> Some domains may be like foo.bar.co.in (where significant_domain=
> bar.co.in).
> I have a list of around 40M URLs. It's OK if I fail in a few (< 1%) cases,
> although I agree that how to measure this error rate is itself not clear;
> maybe just based on intuition.
>
> Does anybody have clues about existing URL parsers in Python to do this?
> Searching online couldn't help me much other than
> the urlparse/urllib module.
>
> Worst case is to try to build a table of domain
> categories (like .com, .co.il, etc.) and look for it in the suffix rather
> than counting dots, and just extract the part up to the preceding dot,
> but I'm afraid that if I do this, I might miss some domain category.
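For reference, the suffix-table fallback described in the quoted message might be sketched roughly as below. This is only an illustration, not a tested solution: the function name `significant_domain` and the tiny `KNOWN_SUFFIXES` table are made up for the example, the table is deliberately incomplete (which is exactly the weakness the poster is worried about), and it uses Python 3's `urllib.parse.urlsplit` rather than the older `urlparse` module.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately incomplete table of domain categories.
# A real table would need far more entries.
KNOWN_SUFFIXES = {"com", "org", "net", "in", "il", "co.in", "co.il", "co.uk"}


def significant_domain(url, suffixes=KNOWN_SUFFIXES):
    """Return the known suffix plus one preceding label, or None if no suffix matches."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Walk from the longest candidate suffix to the shortest, so that
    # "co.in" is matched before the shorter "in".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # Keep the one label immediately before the matched suffix.
            return ".".join(labels[i - 1:])
    return None


print(significant_domain("http://en.wikipedia.org/wiki/IPod"))
print(significant_domain("http://foo.bar.co.in/"))
```

Any host whose suffix is missing from the table comes back as `None`, which at least makes the misses countable over a large URL list.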
The best way I know to get an *authoritative* answer is to start with the full_domain and try a whois lookup. If it returns no records, drop everything up to and including the first dot and try again. Repeat until you get a good answer; that is the significant_domain.

hth,
Michael
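A minimal sketch of that loop, with the actual whois query left out: `has_record` is a hypothetical callable you would supply (for example, something wrapping a whois client) that returns True when a lookup for the given name finds a record. The function name `significant_domain_by_lookup` is also just for illustration, and stopping before the bare top-level label is an assumption on my part, not something whois requires.

```python
def significant_domain_by_lookup(full_domain, has_record):
    """Strip leading labels until has_record(candidate) is true.

    has_record is a caller-supplied (hypothetical) function that performs
    the whois lookup and returns True if a record exists.
    """
    labels = full_domain.lower().strip(".").split(".")
    # Assumption: never test the last label alone (e.g. "org"), since a
    # bare top-level domain is not a useful answer here.
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if has_record(candidate):
            return candidate
    return None


# Stand-in for real whois lookups, so the sketch runs offline.
fake_registry = {"wikipedia.org", "bar.co.in"}

print(significant_domain_by_lookup("en.wikipedia.org", fake_registry.__contains__))
print(significant_domain_by_lookup("foo.bar.co.in", fake_registry.__contains__))
```

With around 40M URLs you would presumably want to cache answers per candidate name, since most hosts will share a small number of registered domains and whois servers tend to rate-limit.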