...actually, now that I'm reading RFCs again, things like http://www.example.com/a%a3do might be acceptable
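For reference, here is a quick stdlib-only check (plain `urllib.parse`, not Scrapy code) of whether a URL's percent-escapes decode as UTF-8 — which is what makes "%a3" suspicious while "%C3%A9" is fine:

```python
from urllib.parse import unquote_to_bytes

def is_utf8_percent_encoded(url: str) -> bool:
    """Return True if the percent-escapes in `url` decode as valid UTF-8."""
    try:
        unquote_to_bytes(url).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# %a3 is a lone latin-1 byte, not a valid UTF-8 sequence
print(is_utf8_percent_encoded("http://www.example.com/a%a3do"))                 # → False
# %C3%A9 is the UTF-8 encoding of e-acute
print(is_utf8_percent_encoded("http://www.example.org/r%C3%A9sum%C3%A9.html"))  # → True
```

Note this only says the escapes don't form UTF-8; per RFC 3986 the URI itself is still syntactically valid.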
If you check RFC 3987, "Internationalized Resource Identifiers (IRIs)", §3.2 "Converting URIs to IRIs" [1], there's a very similar example:

    Conversions from URIs to IRIs MUST NOT use any character encoding
    other than UTF-8 in steps 3 and 4, even if it might be possible to
    guess from the context that another character encoding than UTF-8
    was used in the URI.  For example, the URI
    "http://www.example.org/r%E9sum%E9.html" might with some guessing
    be interpreted to contain two e-acute characters encoded as
    iso-8859-1.  It must not be converted to an IRI containing these
    e-acute characters.  Otherwise, in the future the IRI will be
    mapped to "http://www.example.org/r%C3%A9sum%C3%A9.html", which is
    a different URI from "http://www.example.org/r%E9sum%E9.html".

[1] http://tools.ietf.org/html/rfc3987#page-14

On Wed, Mar 16, 2016 at 5:52 PM, Paul Tremberth <[email protected]> wrote:

> Hi orangain,
>
> The trouble is with non-ASCII characters within links.
>
> Your tests deal with already (UTF-8 encoded +) percent-escaped URL
> characters, e.g.
> https://ja.wikipedia.org/wiki/Wikipedia:%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6
> for the Unicode https://ja.wikipedia.org/wiki/Wikipedia:ウィキペディアについて
>
> These links appear like that in the source code, e.g.
> <a href="/wiki/Wikipedia:%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2%E3%82%92%E6%8E%A2%E6%A4%9C%E3%81%99%E3%82%8B%E3%81%AB%E3%81%AF"
>    title="Wikipedia:ウィキペディアを探検するには">Wikipedia:ウィキペディアを探検するには</a>
>
> so that's fine most of the time (correct percent-escaped, UTF-8 encoded
> chars).
>
> When dealing with non-ASCII Unicode characters in URLs (from link
> extraction, for example), we'd like to behave as browsers do when you
> type a URL in your address bar, which is roughly:
> - encode the (Unicode) path part of the URL to UTF-8, then
>   percent-escape that,
> - for the query part of the URL, use the document encoding (or, even
>   better, any form accept-charset from a previous response, if that is
>   possible to know), defaulting to UTF-8 if we don't know, before
>   percent-escaping that.
>
> Some tests that pass on Python 2 are skipped on Python 3 because they
> currently fail there (which https://github.com/scrapy/scrapy/pull/1664
> tried to re-enable).
>
> I believe some tests should not pass on Python 2 either (yet they
> currently do), like normalizing "http://www.example.com/a%a3do", which,
> to me, is not valid input -- or should be left untouched, since %a3 is
> not a valid UTF-8 percent-escaped sequence (it's latin-1, presumably).
>
> We're working on a new set of tests for these cases and hope to come up
> with proper handling of non-ASCII URLs on both Python 2 and Python 3.
> I'm not sure we won't break existing user code in corner cases, but
> doing things similarly to common browsers is the way to go IMO.
>
> Hope this clarifies the matter a bit.
>
> We'd love your input if you have tricky corner cases of URL mangling.
>
> Paul.
>
>
> On Wed, Mar 16, 2016 at 10:17 AM, Kota Kato <[email protected]> wrote:
>
>> Hello Scrapy developers,
>>
>> I'm really pleased with Python 3 support in the upcoming Scrapy 1.1
>> release. I'm thinking about introducing this great release in a blog
>> article and a book I'm now authoring.
>>
>> I have a question about a limitation in handling non-ASCII URLs. The
>> release notes for 1.1
>> (http://doc.scrapy.org/en/master/news.html#news-betapy3) say:
>>
>> > * Scrapy has problems handling non-ASCII URLs in Python 3
>>
>> This limitation seems big enough to make Japanese people like me
>> hesitate to use Scrapy 1.1 on Python 3. However, testing with simple
>> spiders that crawl non-ASCII URLs
>> (https://gist.github.com/orangain/3724b86a5dc5b2a279f9), I didn't have
>> any problems. So my question is:
>>
>> * What exactly does the limitation mean?
>>
>> More specifically:
>>
>> * In my understanding, "non-ASCII URLs" means URLs that contain
>>   percent-encoded non-ASCII characters. Is this right? Or does it mean
>>   URLs that contain non-ASCII characters without percent-encoding?
>> * What kinds of problems will occur?
>> * In which components will problems occur?
>> * Under what conditions will problems occur?
>>
>> I've explored the following issues, but I couldn't find a clear answer
>> to my question.
>>
>> HTML entity causes UnicodeEncodeError in LxmlLinkExtractor · Issue #998 · scrapy/scrapy
>> https://github.com/scrapy/scrapy/issues/998
>>
>> Speedup & fix URL parsing · Issue #1306 · scrapy/scrapy
>> https://github.com/scrapy/scrapy/issues/1306
>>
>> Exception in LxmlLinkExtractor.extract_links 'charmap' codec can't encode character · Issue #1403 · scrapy/scrapy
>> https://github.com/scrapy/scrapy/issues/1403
>>
>> Exception in LxmlLinkExtractor.extract_links 'ascii' codec can't encode character · Issue #1405 · scrapy/scrapy
>> https://github.com/scrapy/scrapy/issues/1405
>>
>> PY3: add back 3 URL normalization tests by redapple · Pull Request #1664 · scrapy/scrapy
>> https://github.com/scrapy/scrapy/pull/1664
>>
>> get_base_url fails for non-ascii URLs in Python 3 · Issue #1783 · scrapy/scrapy
>> https://github.com/scrapy/scrapy/issues/1783
>>
>> Best,
>>
>> orangain
>> --
>> [email protected]
>>
>> --
>> You received this message because you are subscribed to the Google Groups "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.
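The browser-like handling Paul describes upthread (UTF-8 for the path, document encoding for the query, then percent-escape) could be sketched like this. This is a rough stdlib illustration, not how Scrapy/w3lib actually implement it, and `page_encoding` is a name I'm making up here:

```python
from urllib.parse import quote, urlsplit, urlunsplit

def browser_like_escape(url: str, page_encoding: str = "utf-8") -> str:
    """Percent-escape a (possibly Unicode) URL roughly the way browsers do:
    the path is encoded as UTF-8, the query as the document's encoding.
    Keeping '%' in the safe sets is a simplification so that
    already-escaped sequences pass through untouched."""
    parts = urlsplit(url)
    path = quote(parts.path.encode("utf-8"), safe="/%")
    query = quote(parts.query.encode(page_encoding), safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

print(browser_like_escape("https://ja.wikipedia.org/wiki/ウィキペディア"))
# → https://ja.wikipedia.org/wiki/%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2
print(browser_like_escape("http://example.com/a?q=é", page_encoding="latin-1"))
# → http://example.com/a?q=%E9
```

Note how the same "é" comes out as "%C3%A9" in a path but "%E9" in a latin-1 page's query string, which is exactly the asymmetry Paul describes.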
