Thanks. Hmm, the URL list is quite huge (40M), so I think doing a whois lookup for each one will take a lot of time. But yeah, that seems to be a good way. I will probably try it with a smaller set (10K) first and see how long it takes. If that turns out to be too slow, I guess I will just build a table of known domain suffixes (.com, .org, .co.il, etc.) and find the root domain (significant_domain) at least for those; I hope the majority of the URLs fall into this :)
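
A rough sketch of both ideas, just to make them concrete; it is not a tested solution. Assumptions: it uses urllib.parse.urlsplit (where Python 3 keeps the old urlparse.urlsplit), KNOWN_SUFFIXES is a tiny made-up sample rather than a real table, the function names are my own, and has_record is a hypothetical callable standing in for whatever actually performs the whois query.

```python
from urllib.parse import urlsplit  # Python 3 location of urlparse.urlsplit

# Illustrative sample only; a real table would need many more suffixes.
KNOWN_SUFFIXES = {"com", "org", "net", "co.il", "co.in", "co.uk"}


def full_domain(url):
    """Return the host part of a URL, lowercased and without any port."""
    return (urlsplit(url).hostname or "").lower()


def significant_domain_from_table(url, suffixes=KNOWN_SUFFIXES):
    """Table approach: longest known suffix plus the one label before it.

    Returns None when no suffix in the table matches.
    """
    labels = full_domain(url).split(".")
    # Start at 1 so that at least one label is left in front of the suffix.
    # Scanning left to right means the first hit is the longest suffix.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[i - 1:])
    return None


def significant_domain_from_lookup(url, has_record):
    """Lookup approach: try the full domain, then keep dropping the first label.

    has_record(domain) is a hypothetical callable that returns True when a
    whois query for that domain finds a record.
    """
    labels = full_domain(url).split(".")
    # Stop before the bare top-level label, which is never a homepage domain.
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if has_record(candidate):
            return candidate
    return None


if __name__ == "__main__":
    print(significant_domain_from_table("http://en.wikipedia.org/wiki/IPod"))
    print(significant_domain_from_table("http://foo.bar.co.in/"))
    # Fake lookup for demonstration; a real one would query a whois server.
    fake_whois = lambda domain: domain == "wikipedia.org"
    print(significant_domain_from_lookup("http://en.wikipedia.org/wiki/IPod", fake_whois))
```

The table version does no network traffic, so it should cope with 40M URLs easily, but it silently returns None for any suffix missing from the table. The lookup version costs up to one query per label of each host, so caching the answers per domain would probably be needed before running it on the full list.
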

On May 16, 12:32 am, Michael Bentley <[EMAIL PROTECTED]> wrote:
> On May 15, 2007, at 9:04 PM, lazy wrote:
>
> > Hi,
> > I'm trying to extract the domain name from a URL. Let's say I call
> > it full_domain and significant_domain (which is the homepage domain).
> >
> > Eg: url=http://en.wikipedia.org/wiki/IPod,
> > full_domain=en.wikipedia.org, significant_domain=wikipedia.org
> >
> > Using urlsplit (of the urlparse module), I will be able to get the
> > full_domain, but I'm wondering how to get significant_domain. I will
> > not be able to use something like counting the number of dots, etc.
> >
> > Some domains may be like foo.bar.co.in (where significant_domain=
> > bar.co.in).
> > I have a list of around 40M URLs. It's OK if I fail in a few (< 1%)
> > cases, although I agree that measuring this error rate itself is not
> > clear, maybe just based on intuition.
> >
> > Does anybody have clues about existing URL parsers in Python to do
> > this? Searching online couldn't help me much other than
> > the urlparse/urllib module.
> >
> > Worst case is to try to build a table of domain
> > categories (like .com, .co.il, etc.), look for one of them in the
> > suffix rather than counting dots, and just extract the part up to the
> > preceding dot, but I'm afraid that if I do this, I might miss some
> > domain category.
>
> The best way I know to get an *authoritative* answer is to start with
> the full_domain and try a whois lookup. If it returns no records,
> drop everything before the first dot and try again. Repeat until you
> get a good answer -- this is the significant_domain.
>
> hth,
> Michael

--
http://mail.python.org/mailman/listinfo/python-list