Here's another hard case. This one might be a bug in urlparse:

    import urlparse
    s = 'ftp://administrator:[EMAIL PROTECTED]/originals/6 june 07/ebay/login/ebayisapi.html'
    urlparse.urlparse(s)

yields:

    (u'ftp', u'administrator:[EMAIL PROTECTED]', u'/originals/6 june 07/ebay/login/ebayisapi.html', '', '', '')

That second field is supposed to be the "hostport" (per the RFC usage
of the term; Python uses the term "netloc"), and the username/password
should have been parsed and moved to the "username" and "password"
fields of the object. So it looks like urlparse doesn't really
understand FTP URLs.

That's a real URL, from a search for phishing sites. There are lots of
hostile URLs out there, some of which can fool some parsers.

                                John Nagle

John Nagle wrote:
> [EMAIL PROTECTED] wrote:
>
>> Once you eliminate IPv6 addresses, parsing is simple. Is there a
>> colon? Then there is a port number. Does the left over have any
>> characters not in [0123456789.]? Then it is a name, not an IPv4
>> address.
>>
>> --Michael Dillon
>>
>
> You wish. Hex input of IP addresses is allowed:
>
>     http://0x525eedda
>
> and
>
>     http://0x52.0x5e.0xed.0xda
>
> are both "Python.org". Or just put
>
>     0x52.0x5e.0xed.0xda
>
> into the address bar of a browser. All these work in Firefox on
> Windows and are recognized as valid IP addresses.
>
> On the other hand,
>
>     0x52.com
>
> is a valid domain name, in use by PairNIC.
>
> But
>
>     http://test.0xda
>
> is handled by Firefox on Windows as a domain name. It doesn't
> resolve, but it's sent to DNS.
>
> So I think the question is whether every term between dots can be
> parsed as a decimal or hex number. If all terms can be parsed as a
> number, and there are no more than four of them, it's an IP address.
> Otherwise it's a domain name.
>
> There are phishing sites that pull stuff like this, and I'm parsing
> a long list of such sites. So I really do need to get the hard cases
> right.
>
> Is there any library function that correctly tests for an IP address
> vs. a domain name based on syntax, i.e.
> without looking it up in DNS?
>
> John Nagle
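
The rule proposed in the quoted message (every term between dots parses as a decimal or hex number, and there are no more than four terms) could be sketched roughly as below. This is only an illustration of that rule, written for Python 3. The names parse_term and looks_like_ip_address are made up for this sketch and are not library functions. It assumes the port and any user:password part have already been stripped off, and it does not check that the values are in range or handle any other numeric forms a browser might accept.

```python
HEX_DIGITS = "0123456789abcdef"
DEC_DIGITS = "0123456789"


def parse_term(term):
    """Return the value of a decimal or 0x-prefixed hex term, else None.

    Hypothetical helper for this sketch. Digits are checked by hand
    because int() on its own also accepts signs, underscores and
    surrounding whitespace.
    """
    if term[:2].lower() == "0x":
        digits, allowed, base = term[2:].lower(), HEX_DIGITS, 16
    else:
        digits, allowed, base = term, DEC_DIGITS, 10
    if not digits or any(c not in allowed for c in digits):
        return None
    return int(digits, base)


def looks_like_ip_address(host):
    """Apply the rule from the post: at most four dot-separated terms,
    each of which parses as a decimal or hex number. Purely syntactic,
    no DNS lookup and no range checking."""
    terms = host.split(".")
    if len(terms) > 4:
        return False
    return all(parse_term(t) is not None for t in terms)


if __name__ == "__main__":
    # The hosts mentioned in the thread, without the http:// prefix.
    for host in ("0x525eedda", "0x52.0x5e.0xed.0xda", "0x52.com", "test.0xda"):
        kind = "IP address" if looks_like_ip_address(host) else "domain name"
        print(host, "->", kind)
```

With this rule the first two hosts are classified as IP addresses and the last two as domain names, which matches the Firefox behavior described above. Whether it matches every browser and resolver is a separate question.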