...actually, now that I'm reading RFCs again, things like http://www.example.com/a%a3do might be acceptable
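For reference, here is a quick stdlib-only check (plain `urllib.parse`, not Scrapy code) of whether a URL's percent-escapes decode as UTF-8 — which is what makes "%a3" suspicious while "%C3%A9" is fine:

```python
from urllib.parse import unquote_to_bytes

def is_utf8_percent_encoded(url: str) -> bool:
    """Return True if the percent-escapes in `url` decode as valid UTF-8."""
    try:
        unquote_to_bytes(url).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# %a3 is a lone latin-1 byte, not a valid UTF-8 sequence
print(is_utf8_percent_encoded("http://www.example.com/a%a3do"))                 # → False
# %C3%A9 is the UTF-8 encoding of e-acute
print(is_utf8_percent_encoded("http://www.example.org/r%C3%A9sum%C3%A9.html"))  # → True
```

Note this only says the escapes don't form UTF-8; per RFC 3986 the URI itself is still syntactically valid.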
If you check RFC 3987, "Internationalized Resource Identifiers (IRIs)", §3.2 "Converting URIs to IRIs" [1], there's a very similar example:

    Conversions from URIs to IRIs MUST NOT use any character encoding
    other than UTF-8 in steps 3 and 4, even if it might be possible to
    guess from the context that another character encoding than UTF-8
    was used in the URI.  For example, the URI
    "http://www.example.org/r%E9sum%E9.html" might with some guessing
    be interpreted to contain two e-acute characters encoded as
    iso-8859-1.  It must not be converted to an IRI containing these
    e-acute characters.  Otherwise, in the future the IRI will be
    mapped to "http://www.example.org/r%C3%A9sum%C3%A9.html", which is
    a different URI from "http://www.example.org/r%E9sum%E9.html".

[1] http://tools.ietf.org/html/rfc3987#page-14

On Wed, Mar 16, 2016 at 5:52 PM, Paul Tremberth <[email protected]> wrote:

> Hi orangain,
>
> The trouble is with non-ASCII characters within links.
>
> Your tests deal with already (UTF-8 encoded +) percent-escaped URL
> characters, e.g.
> https://ja.wikipedia.org/wiki/Wikipedia:%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6
> for the Unicode https://ja.wikipedia.org/wiki/Wikipedia:ウィキペディアについて
>
> These links appear like that in the source code, e.g.
> <a href="/wiki/Wikipedia:%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2%E3%82%92%E6%8E%A2%E6%A4%9C%E3%81%99%E3%82%8B%E3%81%AB%E3%81%AF"
>    title="Wikipedia:ウィキペディアを探検するには">Wikipedia:ウィキペディアを探検するには</a>
>
> so that's fine most of the time (correct percent-escaped, UTF-8 encoded
> chars).
>
> When dealing with non-ASCII Unicode characters in URLs (from link
> extraction, for example), we'd like to behave as browsers do when you
> type a URL in your address bar, which is roughly:
> - encode the (Unicode) path part of the URL to UTF-8, then
>   percent-escape that,
> - for the query part of the URL, use the document encoding (or, even
>   better, any form accept-charset from a previous response, if that is
>   possible to know), defaulting to UTF-8 if we don't know, before
>   percent-escaping that.
>
> Some tests that pass on Python 2 are skipped on Python 3 because they
> currently fail there (which https://github.com/scrapy/scrapy/pull/1664
> tried to re-enable).
>
> I believe some tests should not pass on Python 2 either (yet they
> currently do), like normalizing "http://www.example.com/a%a3do", which,
> to me, is not valid input -- or should be left untouched, since %a3 is
> not a valid UTF-8 percent-escaped sequence (it's latin-1, presumably).
>
> We're working on a new set of tests for these cases and hope to come up
> with proper handling of non-ASCII URLs on both Python 2 and Python 3.
> I'm not sure we won't break existing user code in corner cases, but
> doing things similarly to common browsers is the way to go IMO.
>
> Hope this clarifies the matter a bit.
>
> We'd love your input if you have tricky corner cases of URL mangling.
>
> Paul.
>
>
> On Wed, Mar 16, 2016 at 10:17 AM, Kota Kato <[email protected]> wrote:
>
>> Hello Scrapy developers,
>>
>> I'm really pleased with Python 3 support in the upcoming Scrapy 1.1
>> release. I'm thinking about introducing this great release in a blog
>> article and a book I'm now authoring.
>>
>> I have a question about a limitation in handling non-ASCII URLs. The
>> release notes for 1.1
>> (http://doc.scrapy.org/en/master/news.html#news-betapy3) say:
>>
>> > * Scrapy has problems handling non-ASCII URLs in Python 3
>>
>> This limitation seems big enough to make Japanese people like me
>> hesitate to use Scrapy 1.1 on Python 3. However, testing with simple
>> spiders that crawl non-ASCII URLs
>> (https://gist.github.com/orangain/3724b86a5dc5b2a279f9), I didn't have
>> any problems. So my question is:
>>
>> * What exactly does the limitation mean?
>>
>> More specifically:
>>
>> * In my understanding, "non-ASCII URLs" means URLs that contain
>>   percent-encoded non-ASCII characters. Is this right? Or does it mean
>>   URLs that contain non-ASCII characters without percent-encoding?
>> * What kinds of problems will occur?
>> * In which components will problems occur?
>> * Under what conditions will problems occur?
>>
>> I've explored the following issues, but I couldn't find a clear answer
>> to my question.
>>
>> HTML entity causes UnicodeEncodeError in LxmlLinkExtractor · Issue #998 · scrapy/scrapy
>> https://github.com/scrapy/scrapy/issues/998
>>
>> Speedup & fix URL parsing · Issue #1306 · scrapy/scrapy
>> https://github.com/scrapy/scrapy/issues/1306
>>
>> Exception in LxmlLinkExtractor.extract_links 'charmap' codec can't encode character · Issue #1403 · scrapy/scrapy
>> https://github.com/scrapy/scrapy/issues/1403
>>
>> Exception in LxmlLinkExtractor.extract_links 'ascii' codec can't encode character · Issue #1405 · scrapy/scrapy
>> https://github.com/scrapy/scrapy/issues/1405
>>
>> PY3: add back 3 URL normalization tests by redapple · Pull Request #1664 · scrapy/scrapy
>> https://github.com/scrapy/scrapy/pull/1664
>>
>> get_base_url fails for non-ascii URLs in Python 3 · Issue #1783 · scrapy/scrapy
>> https://github.com/scrapy/scrapy/issues/1783
>>
>> Best,
>>
>> orangain
>> --
>> [email protected]
>>
>> --
>> You received this message because you are subscribed to the Google Groups "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.
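The browser-like handling Paul describes upthread (UTF-8 for the path, document encoding for the query, then percent-escape) could be sketched like this. This is a rough stdlib illustration, not how Scrapy/w3lib actually implement it, and `page_encoding` is a name I'm making up here:

```python
from urllib.parse import quote, urlsplit, urlunsplit

def browser_like_escape(url: str, page_encoding: str = "utf-8") -> str:
    """Percent-escape a (possibly Unicode) URL roughly the way browsers do:
    the path is encoded as UTF-8, the query as the document's encoding.
    Keeping '%' in the safe sets is a simplification so that
    already-escaped sequences pass through untouched."""
    parts = urlsplit(url)
    path = quote(parts.path.encode("utf-8"), safe="/%")
    query = quote(parts.query.encode(page_encoding), safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

print(browser_like_escape("https://ja.wikipedia.org/wiki/ウィキペディア"))
# → https://ja.wikipedia.org/wiki/%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2
print(browser_like_escape("http://example.com/a?q=é", page_encoding="latin-1"))
# → http://example.com/a?q=%E9
```

Note how the same "é" comes out as "%C3%A9" in a path but "%E9" in a latin-1 page's query string, which is exactly the asymmetry Paul describes.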
