New submission from Mike Lissner:

Not sure if this is desired behavior, but it's making my code break, so I 
figured I'd get it filed.

I'm trying to crawl this website: 
https://www.appeals2.az.gov/ODSPlus/recentDecisions2.cfm

Unfortunately, most of the URLs in the HTML are relative, taking the form:

'../../some/path/to/some/pdf.pdf'

I'm using lxml's make_links_absolute() function, which calls urljoin and 
produces invalid URLs like:

https://www.appeals2.az.gov/../Decisions/CR20130096OPN.pdf

If you put that into Firefox or wget or whatever, it works, despite being 
invalid and making no sense. 

It works because those clients fix the problem, joining the invalid path 
and the URL into:

https://www.appeals2.az.gov/Decisions/CR20130096OPN.pdf

I know this would mean giving urljoin a workaround for bad HTML, but 
normalizing away the extra '..' segments seems to be what wget, Chrome, 
Firefox, etc. all do.
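For reference, the browser behavior corresponds to the "remove_dot_segments" 
algorithm in RFC 3986 section 5.2.4: '..' segments that would climb above the 
root are simply dropped. Here is a minimal sketch of a workaround in that 
spirit; urljoin_like_browser is a hypothetical helper name, not anything in 
the stdlib (newer Python 3 releases reportedly normalize this way already):

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

def urljoin_like_browser(base, rel):
    """Join base and rel, then drop any '..' segments that would
    climb above the root, the way browsers and wget do.
    Hypothetical helper, not part of the stdlib."""
    parts = urlsplit(urljoin(base, rel))
    out = []
    for seg in parts.path.split('/'):
        if seg == '..':
            # Pop the previous segment if there is one; otherwise the
            # '..' points above the root and is silently discarded.
            if len(out) > 1:
                out.pop()
        else:
            out.append(seg)
    return urlunsplit(
        (parts.scheme, parts.netloc, '/'.join(out), parts.query, parts.fragment)
    )

base = 'https://www.appeals2.az.gov/ODSPlus/recentDecisions2.cfm'
print(urljoin_like_browser(base, '../../Decisions/CR20130096OPN.pdf'))
# https://www.appeals2.az.gov/Decisions/CR20130096OPN.pdf
```

This post-processes whatever urljoin returns, so it gives the same sensible 
result whether or not the underlying urljoin keeps the stray '..'.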

I've never filed a Python bug before, but is this something we could consider?

----------
components: Library (Lib)
messages: 224500
nosy: Mike.Lissner
priority: normal
severity: normal
status: open
title: urljoin fails with messy relative URLs
type: behavior
versions: Python 2.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue22118>
_______________________________________