Hi Paul,

Thank you for the detailed answer. I understand the problem well and
feel relieved.
I think the term "non-ASCII URLs" is a bit ambiguous. For example,
kmike seems to refer to a percent-encoded URL as non-ASCII in
https://github.com/scrapy/scrapy/issues/1783. When I hear that Scrapy
has trouble with non-ASCII URLs, I think it might affect me. However,
when I hear that Scrapy has trouble with non-ASCII characters within
URLs, I think it might not affect me in most cases, because a sensible
webmaster will encode these URLs on the server side. If the limitation
still remains in the 1.1 GA release, it might be a good idea to use a
more detailed and focused wording in the release notes.

If I find a corner case, I will contribute to Scrapy or w3lib.

Thank you again,
orangain.

On Thu, Mar 17, 2016 at 2:54, Paul Tremberth <[email protected]> wrote:

> ...actually, now that I'm reading the RFCs again, things like
> http://www.example.com/a%a3do might be acceptable.
>
> If you check RFC 3987, "Internationalized Resource Identifiers (IRIs)",
> §3.2. "Converting URIs to IRIs" [1], there's a very similar example:
>
>     Conversions from URIs to IRIs MUST NOT use any character encoding
>     other than UTF-8 in steps 3 and 4, even if it might be possible to
>     guess from the context that another character encoding than UTF-8
>     was used in the URI.  For example, the URI
>     "http://www.example.org/r%E9sum%E9.html" might with some guessing
>     be interpreted to contain two e-acute characters encoded as
>     iso-8859-1.  It must not be converted to an IRI containing these
>     e-acute characters.  Otherwise, in the future the IRI will be
>     mapped to "http://www.example.org/r%C3%A9sum%C3%A9.html", which is
>     a different URI from "http://www.example.org/r%E9sum%E9.html".
>
> [1] http://tools.ietf.org/html/rfc3987#page-14
>
> On Wed, Mar 16, 2016 at 5:52 PM, Paul Tremberth <[email protected]>
> wrote:
>
>> Hi orangain,
>>
>> The trouble is with non-ASCII characters within links.
>>
>> Your tests deal with URLs whose characters are already (UTF-8 encoded
>> and) percent-escaped, e.g.
>> https://ja.wikipedia.org/wiki/Wikipedia:%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6
>> for the Unicode URL
>> https://ja.wikipedia.org/wiki/Wikipedia:ウィキペディアについて
>>
>> These links appear like that in the source code, e.g.
>> <a href="/wiki/Wikipedia:%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2%E3%82%92%E6%8E%A2%E6%A4%9C%E3%81%99%E3%82%8B%E3%81%AB%E3%81%AF"
>> title="Wikipedia:ウィキペディアを探検するには">Wikipedia:ウィキペディアを探検するには</a>
>>
>> so that's fine most of the time (correctly percent-escaped, UTF-8
>> encoded characters).
>>
>> When dealing with non-ASCII Unicode characters in URLs (from link
>> extraction, for example), we'd like to work as browsers do when you
>> type a URL in your address bar, which is roughly:
>> - encode the (Unicode) path part of the URL to UTF-8, then
>>   percent-escape that;
>> - for the query part of the URL, use the document encoding (or, even
>>   better, any form accept-charset from a previous response, if it is
>>   possible to know it), or UTF-8 as a default if we don't know,
>>   before percent-escaping that.
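A rough Python 3 sketch of that browser-like handling might look like
the following. This is not Scrapy's actual implementation (in practice
w3lib's safe_url_string() is the entry point); the function name and
the simplified "safe" character sets below are mine, for illustration
only:

    from urllib.parse import urlsplit, urlunsplit, quote

    def browser_like_url(url, page_encoding="utf-8"):
        # Sketch of browser-style URL normalization: UTF-8 for the
        # path, document encoding for the query.
        parts = urlsplit(url)
        # Path: encode to UTF-8, then percent-escape. Keeping '%' in
        # "safe" lets already-escaped sequences pass through untouched.
        path = quote(parts.path.encode("utf-8"), safe="/%:@")
        # Query: use the document encoding (or a known accept-charset),
        # falling back to UTF-8, then percent-escape.
        query = quote(parts.query.encode(page_encoding, "replace"),
                      safe="=&%")
        return urlunsplit((parts.scheme, parts.netloc, path, query,
                           parts.fragment))

    # browser_like_url("https://ja.wikipedia.org/wiki/Wikipedia:ウィキペディアについて")
    # gives the percent-escaped form quoted earlier in this thread.

The split treatment (always UTF-8 for the path, document encoding for
the query) roughly mirrors how browsers serialize form submissions,
which is why the page encoding only matters for the query string.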
>> Some Python 2-passing tests are skipped in Python 3 because they
>> currently fail (which https://github.com/scrapy/scrapy/pull/1664
>> tried to re-enable).
>>
>> I believe some tests should not pass in Python 2 either (yet they
>> currently do), like normalizing "http://www.example.com/a%a3do",
>> which, to me, is not valid input -- or it should be left untouched,
>> since %a3 is not a valid UTF-8 percent-escaped sequence (it's
>> presumably latin-1).
>>
>> We're working on a new set of tests for these cases and hope to come
>> up with proper handling for non-ASCII URLs in both Python 2 and
>> Python 3. I'm not sure we won't break existing user code in corner
>> cases, but doing things the way common browsers do is the way to go,
>> IMO.
>>
>> Hope this clarifies the matter a bit.
>>
>> We'd love your input if you have tricky corner cases of URL mangling.
>>
>> Paul.
>>
>> On Wed, Mar 16, 2016 at 10:17 AM, Kota Kato <[email protected]> wrote:
>>
>>> Hello Scrapy developers,
>>>
>>> I'm really pleased with the Python 3 support in the upcoming Scrapy
>>> 1.1 release. I'm thinking about introducing this great release in my
>>> blog and in a book I'm now authoring.
>>>
>>> I have a question about a limitation in handling non-ASCII URLs. The
>>> release notes for 1.1
>>> (http://doc.scrapy.org/en/master/news.html#news-betapy3) say:
>>>
>>> > * Scrapy has problems handling non-ASCII URLs in Python 3
>>>
>>> This limitation seems big enough to make Japanese people like me
>>> hesitate to use Scrapy 1.1 on Python 3. However, testing with simple
>>> spiders that crawl non-ASCII URLs
>>> (https://gist.github.com/orangain/3724b86a5dc5b2a279f9), I didn't
>>> have any problems. So my question is:
>>>
>>> * What does the limitation exactly mean?
>>>
>>> More specifically:
>>>
>>> * In my understanding, "non-ASCII URLs" means URLs that contain
>>>   percent-encoded non-ASCII characters. Is that right? Or does it
>>>   mean URLs that contain non-ASCII characters without
>>>   percent-encoding?
>>> * What kinds of problems will occur?
>>> * In which components will the problems occur?
>>> * Under what conditions will the problems occur?
>>>
>>> I've explored the following issues, but I couldn't find a clear
>>> answer to my question:
>>>
>>> HTML entity causes UnicodeEncodeError in LxmlLinkExtractor
>>> · Issue #998 · scrapy/scrapy
>>> https://github.com/scrapy/scrapy/issues/998
>>>
>>> Speedup & fix URL parsing · Issue #1306 · scrapy/scrapy
>>> https://github.com/scrapy/scrapy/issues/1306
>>>
>>> Exception in LxmlLinkExtractor.extract_links: 'charmap' codec can't
>>> encode character · Issue #1403 · scrapy/scrapy
>>> https://github.com/scrapy/scrapy/issues/1403
>>>
>>> Exception in LxmlLinkExtractor.extract_links: 'ascii' codec can't
>>> encode character · Issue #1405 · scrapy/scrapy
>>> https://github.com/scrapy/scrapy/issues/1405
>>>
>>> PY3: add back 3 URL normalization tests by redapple
>>> · Pull Request #1664 · scrapy/scrapy
>>> https://github.com/scrapy/scrapy/pull/1664
>>>
>>> get_base_url fails for non-ascii URLs in Python 3
>>> · Issue #1783 · scrapy/scrapy
>>> https://github.com/scrapy/scrapy/issues/1783
>>>
>>> Best,
>>>
>>> orangain
>>> --
>>> [email protected]
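As a quick check on Paul's point about "%a3" above (and the RFC 3987
example), a plain Python 3 session shows why such sequences cannot be
read back as UTF-8 percent-escapes:

    >>> bytes.fromhex("a3").decode("latin-1")   # '%a3' as latin-1
    '£'
    >>> bytes.fromhex("a3").decode("utf-8")     # but not valid UTF-8
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 0: invalid start byte
    >>> "é".encode("latin-1")   # '%E9' from the RFC example
    b'\xe9'
    >>> "é".encode("utf-8")     # UTF-8 'é' is two bytes: '%C3%A9'
    b'\xc3\xa9'

This is exactly why the RFC forbids guessing: re-encoding the guessed
"é" would produce "%C3%A9" and hence a different URI from the
original "%E9".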
