Re: URL parsing for the hard cases

John Nagle Sun, 22 Jul 2007 19:12:10 -0700

[EMAIL PROTECTED] wrote:

> Once you eliminate IPv6 addresses, parsing is simple. Is there a
> colon? Then there is a port number. Does the left over have any
> characters not in [0123456789.]? Then it is a name, not an IPv4
> address.
> 
> --Michael Dillon
>


   You wish.  Hex input of IP addresses is allowed:

        http://0x525eedda

and

        http://0x52.0x5e.0xed.0xda

are both "Python.org".  Or just put

        0x52.0x5e.0xed.0xda

into the address bar of a browser.  All these work in Firefox on Windows and
are recognized as valid IP addresses.
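
As a quick check that the two hex spellings above name the same host, the
conversion can be sketched with Python's standard `ipaddress` module. The
variable names are only illustrative.

```python
import ipaddress

# Single 32-bit hex value: the whole address as one number.
single = ipaddress.IPv4Address(int("0x525eedda", 16))

# Dotted hex: each term is one octet; int(..., 16) accepts the 0x prefix.
dotted = ".".join(str(int(part, 16)) for part in "0x52.0x5e.0xed.0xda".split("."))

print(single)  # 82.94.237.218
print(dotted)  # 82.94.237.218
```

Both forms come out as the same dotted-decimal address.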

On the other hand,
        
        0x52.com

is a valid domain name, in use by PairNIC.

But

        http://test.0xda

is handled by Firefox on Windows as a domain name.  It doesn't resolve, but it's
sent to DNS.

So I think the question is whether every term between dots can be parsed as
a decimal or hex number.  If every term parses as a number, and there are
no more than four terms, it's an IP address.  Otherwise it's a domain name.
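
A minimal sketch of that rule in Python, purely syntactic and with no DNS
lookup. The function name `looks_like_ipv4` and the regex are assumptions made
for illustration, not an existing library call. It deliberately does no range
checking and gives no special treatment to octal (leading-zero) terms, so it
shows the rule as stated rather than a complete validator.

```python
import re

# A term counts as numeric if it is all decimal digits or 0x followed by hex digits.
_NUMERIC_TERM = re.compile(r"(?:0[xX][0-9a-fA-F]+|[0-9]+)\Z")

def looks_like_ipv4(host):
    """Return True if every dot-separated term is numeric and there are at most four."""
    terms = host.split(".")
    if len(terms) > 4:
        return False
    # An empty term (e.g. from a trailing dot) fails the match and is rejected.
    return all(_NUMERIC_TERM.match(term) for term in terms)

for host in ("0x525eedda", "0x52.0x5e.0xed.0xda", "0x52.com", "test.0xda"):
    print(host, looks_like_ipv4(host))
```

On the examples above, the two all-hex hosts are classified as IP addresses,
while `0x52.com` and `test.0xda` fall through to domain names.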

There are phishing sites that pull stuff like this, and I'm parsing a long list
of such sites.  So I really do need to get the hard cases right.

Is there any library function that correctly tests for an IP address vs. a
domain name based on syntax, i.e. without looking it up in DNS?

                                John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list
