At 01:56 -0500 2001.08.21, Stuart Johnston wrote:
>Does anyone have a simple filter for URL encoding that I can use?

Not simple, no.  :)

This is what I use in Slash, though.  YMMV.  It uses HTML::Entities and URI.

The important part for you is probably just the one regex with $URI::uric
and %URI::Escape::escapes.  We have other needs too: stripping out script:
stuff, stripping out the "authority" (which has been a problem in Slashdot
comments), removing certain characters, etc.  HTH.
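
If the escaping is all you need, a minimal sketch of just that one regex
might look like the following.  The name escape_url is made up for
illustration, and this assumes the URI distribution is installed; it only
covers byte values 0-255, since that is all %URI::Escape::escapes maps.

```perl
use strict;
use warnings;
use URI;          # provides $URI::uric (characters allowed in a URI)
use URI::Escape;  # provides %URI::Escape::escapes (char => "%XX")

# Hypothetical helper: percent-escape every character that is not an
# allowed URI character.  '#' is added to the allowed set so fragments
# survive, and '%' is already in $URI::uric, so existing escapes are
# left alone rather than double-encoded.
sub escape_url {
    my ($url) = @_;
    $url =~ s/([^$URI::uric#])/$URI::Escape::escapes{$1}/g;
    return $url;
}

print escape_url('http://example.com/a b?x=<1>#frag'), "\n";
# prints: http://example.com/a%20b?x=%3C1%3E#frag
```

Note that this does none of the cleanup or script: checking that the full
sub below does.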

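use HTML::Entities;     # for decode_entities()
use URI;                # for URI->new and $URI::uric
use URI::Escape;        # for %URI::Escape::escapes

sub fixurl {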
        my($url) = @_;

        # Remove quotes and whitespace (we will expect some at beginning and
        # end, probably)
        $url =~ s/["\s]//g;
        # any < or > char after the first char truncates the URL right there
        # (we will expect a trailing ">" probably)
        $url =~ s/^[<>]+//;
        $url =~ s/[<>].*//;
        # strip surrounding ' if exists
        $url =~ s/^'(.+?)'$/$1/g;
        # add '#' to allowed characters; escape anything not allowed.
        $url =~ s/([^$URI::uric#])/$URI::Escape::escapes{$1}/oge;

        if (1) {
                # Strip the authority, if any.
                # This prevents annoying browser-display-exploits
                # like "http:[EMAIL PROTECTED]".
                # In future we may set up a package global or a field like
                # getCurrentUser()->{state}{fixurlauth} that will allow
                # this behavior to be turned off -- it's wrapped in
                # "if (1)" to remind us of this...
                my $uri = URI->new($url);
                if ($uri && $uri->can('host') && $uri->can('authority')) {
                        # clean the host first (per RFC 1035), so the ":"
                        # before a port is not stripped along with it
                        my $host = $uri->host;
                        $host =~ tr/A-Za-z0-9.-//cd;
                        # don't need to print the port if we
                        # already have the correct port
                        $host .= ':' . $uri->port
                                if $uri->can('port')
                                && $uri->port != $uri->default_port;
                        $uri->authority($host);
                        $url = $uri->canonical->as_string;
                }
        }

        # we don't like SCRIPT at the beginning of a URL
        my $decoded_url = decode_entities($url);
        return $decoded_url =~ s|^\s*\w+script\b.*$||i ? undef : $url;
}
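
To make the authority-stripping step concrete, here is a small standalone
sketch of the same idea.  The name strip_userinfo and the example hosts are
made up for illustration; it assumes the URI module is installed and is not
the Slash code itself.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: rebuild the authority from the host (plus any
# non-default port), which drops a misleading "user@" prefix.
sub strip_userinfo {
    my ($url) = @_;
    my $uri = URI->new($url);
    # Schemes without a host (mailto:, for instance) are returned as-is.
    return $url unless $uri->can('host') && defined $uri->host;

    my $host = $uri->host;
    $host =~ tr/A-Za-z0-9.-//cd;    # keep only hostname characters
    $host .= ':' . $uri->port
        if $uri->can('port') && $uri->port != $uri->default_port;

    $uri->authority($host);
    return $uri->canonical->as_string;
}

print strip_userinfo('http://www.example.com@evil.example.org/x'), "\n";
# prints: http://evil.example.org/x
```

The link then displays the host the browser will actually visit, not
whatever was placed in front of the "@".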

-- 
Chris Nandor                      [EMAIL PROTECTED]    http://pudge.net/
Open Source Development Network    [EMAIL PROTECTED]     http://osdn.com/

