Hi Paul,

Thank you for the detailed answer. I understand the problem well and
feel relieved.
I think the term "non-ASCII URLs" is a bit ambiguous. For example,
kmike seems to refer to a percent-encoded URL as non-ASCII in
https://github.com/scrapy/scrapy/issues/1783. When I hear that Scrapy
has trouble with non-ASCII URLs, I think it might affect me. However,
when I hear that Scrapy has trouble with non-ASCII characters within
URLs, I think it might not affect me in most cases, because a sensible
webmaster will encode these URLs on the server side. If the limitation
still remains in the 1.1 GA release, it might be a good idea to use a
more detailed and focused wording in the release notes.

If I find a corner case, I will contribute to Scrapy or w3lib.

Thank you again,
orangain.

On Thu, Mar 17, 2016 at 2:54, Paul Tremberth <[email protected]> wrote:

> ...actually, now that I'm reading the RFCs again, things like
> http://www.example.com/a%a3do might be acceptable.
>
> If you check RFC 3987, "Internationalized Resource Identifiers (IRIs)",
> §3.2. "Converting URIs to IRIs" [1], there's a very similar example:
>
>     Conversions from URIs to IRIs MUST NOT use any character encoding
>     other than UTF-8 in steps 3 and 4, even if it might be possible to
>     guess from the context that another character encoding than UTF-8
>     was used in the URI.  For example, the URI
>     "http://www.example.org/r%E9sum%E9.html" might with some guessing
>     be interpreted to contain two e-acute characters encoded as
>     iso-8859-1.  It must not be converted to an IRI containing these
>     e-acute characters.  Otherwise, in the future the IRI will be
>     mapped to "http://www.example.org/r%C3%A9sum%C3%A9.html", which is
>     a different URI from "http://www.example.org/r%E9sum%E9.html".
>
> [1] http://tools.ietf.org/html/rfc3987#page-14
>
> On Wed, Mar 16, 2016 at 5:52 PM, Paul Tremberth <[email protected]>
> wrote:
>
>> Hi orangain,
>>
>> The trouble is with non-ASCII characters within links.
>>
>> Your tests deal with URLs whose characters are already (UTF-8 encoded
>> and) percent-escaped, e.g.
>> https://ja.wikipedia.org/wiki/Wikipedia:%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6
>> for the Unicode URL
>> https://ja.wikipedia.org/wiki/Wikipedia:ウィキペディアについて
>>
>> These links appear like that in the source code, e.g.
>> <a href="/wiki/Wikipedia:%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2%E3%82%92%E6%8E%A2%E6%A4%9C%E3%81%99%E3%82%8B%E3%81%AB%E3%81%AF"
>> title="Wikipedia:ウィキペディアを探検するには">Wikipedia:ウィキペディアを探検するには</a>
>>
>> so that's fine most of the time (correctly percent-escaped, UTF-8
>> encoded characters).
>>
>> When dealing with non-ASCII Unicode characters in URLs (from link
>> extraction, for example), we'd like to work as browsers do when you
>> type a URL in your address bar, which is roughly:
>> - encode the (Unicode) path part of the URL to UTF-8, then
>>   percent-escape that;
>> - for the query part of the URL, use the document encoding (or, even
>>   better, any form accept-charset from a previous response, if it is
>>   possible to know it), or UTF-8 as a default if we don't know,
>>   before percent-escaping that.
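A rough Python 3 sketch of that browser-like handling might look like
the following. This is not Scrapy's actual implementation (in practice
w3lib's safe_url_string() is the entry point); the function name and
the simplified "safe" character sets below are mine, for illustration
only:

    from urllib.parse import urlsplit, urlunsplit, quote

    def browser_like_url(url, page_encoding="utf-8"):
        # Sketch of browser-style URL normalization: UTF-8 for the
        # path, document encoding for the query.
        parts = urlsplit(url)
        # Path: encode to UTF-8, then percent-escape. Keeping '%' in
        # "safe" lets already-escaped sequences pass through untouched.
        path = quote(parts.path.encode("utf-8"), safe="/%:@")
        # Query: use the document encoding (or a known accept-charset),
        # falling back to UTF-8, then percent-escape.
        query = quote(parts.query.encode(page_encoding, "replace"),
                      safe="=&%")
        return urlunsplit((parts.scheme, parts.netloc, path, query,
                           parts.fragment))

    # browser_like_url("https://ja.wikipedia.org/wiki/Wikipedia:ウィキペディアについて")
    # gives the percent-escaped form quoted earlier in this thread.

The split treatment (always UTF-8 for the path, document encoding for
the query) roughly mirrors how browsers serialize form submissions,
which is why the page encoding only matters for the query string.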
>> Some Python 2-passing tests are skipped in Python 3 because they
>> currently fail (which https://github.com/scrapy/scrapy/pull/1664
>> tried to re-enable).
>>
>> I believe some tests should not pass in Python 2 either (yet they
>> currently do), like normalizing "http://www.example.com/a%a3do",
>> which, to me, is not valid input -- or it should be left untouched,
>> since %a3 is not a valid UTF-8 percent-escaped sequence (it's
>> presumably latin-1).
>>
>> We're working on a new set of tests for these cases and hope to come
>> up with proper handling for non-ASCII URLs in both Python 2 and
>> Python 3. I'm not sure we won't break existing user code in corner
>> cases, but doing things the way common browsers do is the way to go,
>> IMO.
>>
>> Hope this clarifies the matter a bit.
>>
>> We'd love your input if you have tricky corner cases of URL mangling.
>>
>> Paul.
>>
>> On Wed, Mar 16, 2016 at 10:17 AM, Kota Kato <[email protected]> wrote:
>>
>>> Hello Scrapy developers,
>>>
>>> I'm really pleased with the Python 3 support in the upcoming Scrapy
>>> 1.1 release. I'm thinking about introducing this great release in my
>>> blog and in a book I'm now authoring.
>>>
>>> I have a question about a limitation in handling non-ASCII URLs. The
>>> release notes for 1.1
>>> (http://doc.scrapy.org/en/master/news.html#news-betapy3) say:
>>>
>>> > * Scrapy has problems handling non-ASCII URLs in Python 3
>>>
>>> This limitation seems big enough to make Japanese people like me
>>> hesitate to use Scrapy 1.1 on Python 3. However, testing with simple
>>> spiders that crawl non-ASCII URLs
>>> (https://gist.github.com/orangain/3724b86a5dc5b2a279f9), I didn't
>>> have any problems. So my question is:
>>>
>>> * What does the limitation exactly mean?
>>>
>>> More specifically:
>>>
>>> * In my understanding, "non-ASCII URLs" means URLs that contain
>>>   percent-encoded non-ASCII characters. Is that right? Or does it
>>>   mean URLs that contain non-ASCII characters without
>>>   percent-encoding?
>>> * What kinds of problems will occur?
>>> * In which components will the problems occur?
>>> * Under what conditions will the problems occur?
>>>
>>> I've explored the following issues, but I couldn't find a clear
>>> answer to my question:
>>>
>>> HTML entity causes UnicodeEncodeError in LxmlLinkExtractor
>>> · Issue #998 · scrapy/scrapy
>>> https://github.com/scrapy/scrapy/issues/998
>>>
>>> Speedup & fix URL parsing · Issue #1306 · scrapy/scrapy
>>> https://github.com/scrapy/scrapy/issues/1306
>>>
>>> Exception in LxmlLinkExtractor.extract_links: 'charmap' codec can't
>>> encode character · Issue #1403 · scrapy/scrapy
>>> https://github.com/scrapy/scrapy/issues/1403
>>>
>>> Exception in LxmlLinkExtractor.extract_links: 'ascii' codec can't
>>> encode character · Issue #1405 · scrapy/scrapy
>>> https://github.com/scrapy/scrapy/issues/1405
>>>
>>> PY3: add back 3 URL normalization tests by redapple
>>> · Pull Request #1664 · scrapy/scrapy
>>> https://github.com/scrapy/scrapy/pull/1664
>>>
>>> get_base_url fails for non-ascii URLs in Python 3
>>> · Issue #1783 · scrapy/scrapy
>>> https://github.com/scrapy/scrapy/issues/1783
>>>
>>> Best,
>>>
>>> orangain
>>> --
>>> [email protected]
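As a quick check on Paul's point about "%a3" above (and the RFC 3987
example), a plain Python 3 session shows why such sequences cannot be
read back as UTF-8 percent-escapes:

    >>> bytes.fromhex("a3").decode("latin-1")   # '%a3' as latin-1
    '£'
    >>> bytes.fromhex("a3").decode("utf-8")     # but not valid UTF-8
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 0: invalid start byte
    >>> "é".encode("latin-1")   # '%E9' from the RFC example
    b'\xe9'
    >>> "é".encode("utf-8")     # UTF-8 'é' is two bytes: '%C3%A9'
    b'\xc3\xa9'

This is exactly why the RFC forbids guessing: re-encoding the guessed
"é" would produce "%C3%A9" and hence a different URI from the
original "%E9".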
