Hi, I'm trying to extract the domain name from a URL. Let's say I call the two forms full_domain and significant_domain (the latter being the homepage domain).
Example: url=http://en.wikipedia.org/wiki/IPod, full_domain=en.wikipedia.org, significant_domain=wikipedia.org

Using urlsplit (from the urlparse module), I can get full_domain, but I'm wondering how to get significant_domain. I can't simply count the number of dots, because some domains look like foo.bar.co.in (where significant_domain=bar.co.in).

I have a list of around 40M URLs. It's OK if I fail in a few (< 1%) cases, although I agree it's not clear how to measure this error rate; maybe just by intuition.

Does anybody know of an existing URL parser in Python that does this? Searching online didn't turn up much beyond the urlparse/urllib modules. The worst case is to build a table of domain categories (like .com, .co.il, etc.), look for a match at the end of the host name rather than counting dots, and keep that suffix plus the one label before it. But I'm afraid that if I do this, I might miss some domain categories.
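To make the table idea concrete, here is a rough sketch of what I have in mind. It assumes Python 3, where urlsplit lives in urllib.parse rather than urlparse. The names KNOWN_SUFFIXES, full_domain() and significant_domain() are made up for illustration, and the suffix table is deliberately tiny and incomplete, which is exactly the weakness I'm worried about.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately incomplete table of domain categories.
# A real table would need far more entries than this.
KNOWN_SUFFIXES = {"com", "org", "net", "co.in", "co.il", "co.uk"}


def full_domain(url):
    """Return the host part of the URL (lowercased, without any port)."""
    return urlsplit(url).hostname or ""


def significant_domain(url, suffixes=KNOWN_SUFFIXES):
    """Return the known suffix plus the one label before it.

    Falls back to the full domain when no suffix in the table matches.
    """
    host = full_domain(url)
    labels = host.split(".")
    # Try the longest candidate suffix first, so "co.in" is tried before "in".
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            # Keep the matched suffix and the label just before it.
            return ".".join(labels[i - 1:])
    return host  # unknown suffix: give up and return the full domain


if __name__ == "__main__":
    for url in ("http://en.wikipedia.org/wiki/IPod", "http://foo.bar.co.in/"):
        print(url, full_domain(url), significant_domain(url))
```

Any host whose suffix is missing from the table falls through to the full domain, so the error rate depends entirely on how complete the table is.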