[EMAIL PROTECTED] wrote in news:[EMAIL PROTECTED]:
> According to RFC 2396[1] section 5.2:
>
> g) If the resulting buffer string still begins with one or more
> complete path segments of "..", then the reference is
> considered to be in error. Implementations may handle this
> error by retaining these components in the resolved path (i.e.,
> treating them as part of the final URI), by removing them from
> the resolved path (i.e., discarding relative levels above the
> root), or by avoiding traversal of the reference.
>
> If I read this right, it explicitly allows the urlparse.urljoin behavior
> ("handle this error by retaining these components in the resolved path").
>
Yes, the urljoin behaviour is explicitly allowed, but it is not the most
commonly implemented of the permitted behaviours. Both IE and Mozilla/Firefox
handle this error by stripping the spurious ".." segments from the front of
the path. Apache, and I hope other web servers, use the third permitted
method, i.e. rejecting requests for these invalid URLs.

The net effect is that on some sites a Python spider (e.g. webchecker.py)
will produce a large number of error messages for links which browsers
actually resolve successfully. (At least, that is how I first noticed this
particular problem.) Depending on your reasons for spidering a site, this
can be either a good thing or an annoyance.
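
To make the three permitted behaviours concrete, here is a rough sketch of
the path-merging step from RFC 2396 section 5.2 with a switch for how to
treat leftover ".." segments. It is only an illustration: merge_paths and
its on_excess parameter are names I made up, it is not how urlparse.urljoin
is implemented, and it assumes the base path is absolute (starts with "/").

```python
def merge_paths(base_path, ref_path, on_excess="keep"):
    """Merge a relative path into an absolute base path (simplified RFC 2396 5.2).

    on_excess is a hypothetical switch for step (g), i.e. what to do with
    ".." segments that would climb above the root:
      "keep"   - retain them in the result
      "strip"  - silently discard them
      "reject" - raise ValueError
    """
    if on_excess not in ("keep", "strip", "reject"):
        raise ValueError("unknown policy: %r" % (on_excess,))

    # Steps (a) and (b): drop the last segment of the base, append the reference.
    buffer = base_path[: base_path.rfind("/") + 1] + ref_path

    # Assumes an absolute path, so the first split element is the empty string.
    segments = buffer.split("/")[1:]

    out = []
    for i, seg in enumerate(segments):
        last = i == len(segments) - 1
        if seg == ".":
            # Steps (c) and (d): "." is dropped; a trailing "." leaves a "/".
            if last:
                out.append("")
        elif seg == ".." and out and out[-1] != "..":
            # Steps (e) and (f): "<segment>/.." cancels out.
            out.pop()
            if last:
                out.append("")
        else:
            # Ordinary segment, or a ".." with nothing left to cancel.
            out.append(seg)

    # Step (g): any ".." still present can only sit at the front of the path.
    excess = 0
    while excess < len(out) and out[excess] == "..":
        excess += 1

    if excess and on_excess == "reject":
        raise ValueError("reference climbs above the root: %r" % (ref_path,))
    if excess and on_excess == "strip":
        out = out[excess:]

    return "/" + "/".join(out)
```

With base path "/b/c/d" and reference "../../../g", the "keep" policy gives
"/../g" (the result shown for this case in the RFC 2396 examples), "strip"
gives "/g" (what the browsers mentioned above do), and "reject" raises
ValueError, roughly what a spider would see from a server that refuses the
request.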
_______________________________________________
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev