Hi orangain,

The trouble is with non-ASCII characters within links.

Your tests deal with URLs whose characters are already UTF-8 encoded and percent-escaped, e.g.
https://ja.wikipedia.org/wiki/Wikipedia:%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6
for the Unicode
https://ja.wikipedia.org/wiki/Wikipedia:ウィキペディアについて

These links appear like that in the HTML source, e.g.

<a href="/wiki/Wikipedia:%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2%E3%82%92%E6%8E%A2%E6%A4%9C%E3%81%99%E3%82%8B%E3%81%AB%E3%81%AF" title="Wikipedia:ウィキペディアを探検するには">Wikipedia:ウィキペディアを探検するには</a>

so that's fine most of the time (correctly percent-escaped, UTF-8 encoded characters).

When dealing with non-ASCII Unicode characters in URLs (when extracting links, for example), we'd like to do what browsers do when you type a URL in the address bar, which is roughly:
- encode the (Unicode) path part of the URL to UTF-8, then percent-escape that;
- for the query part, use the document encoding (or, even better, the accept-charset of a form from a previous response, if that can be known), defaulting to UTF-8 otherwise, before percent-escaping it.

A rough sketch of this follows the list.
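For illustration, here is a minimal Python 3 sketch of that browser-like behavior (illustrative only, not Scrapy's actual implementation; the safe_url helper and the exact "safe" character sets are assumptions made up for this example):

    from urllib.parse import quote, urlsplit, urlunsplit

    def safe_url(url, page_encoding='utf-8'):
        """Percent-escape a possibly non-ASCII URL, roughly as browsers do."""
        parts = urlsplit(url)
        # Path: encode to UTF-8, then percent-escape.
        # Keeping '%' in `safe` leaves already-escaped sequences alone
        # (at the cost of also leaving a stray literal '%' untouched).
        path = quote(parts.path.encode('utf-8'), safe='/:%')
        # Query: use the document encoding (or a form's accept-charset
        # from a previous response, if known), defaulting to UTF-8.
        query = quote(parts.query.encode(page_encoding), safe='=&%')
        return urlunsplit((parts.scheme, parts.netloc, path, query,
                           parts.fragment))

    >>> safe_url('https://ja.wikipedia.org/wiki/Wikipedia:ウィキペディアについて')
    'https://ja.wikipedia.org/wiki/Wikipedia:%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6'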
Some tests that pass in Python 2 are skipped in Python 3 because they currently fail there (which https://github.com/scrapy/scrapy/pull/1664 tried to re-enable).

I believe some of those tests should not pass in Python 2 either (yet they currently do), like normalizing "http://www.example.com/a%a3do", which, to me, is not valid input -- or should be left untouched -- since %a3 is not a valid UTF-8 percent-escaped sequence (it's presumably latin-1). The short illustration below shows the ambiguity.
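To make the %a3 point concrete, a quick Python 3 illustration (the decoding attempts are my own, chosen for the example):

    from urllib.parse import unquote_to_bytes

    unquote_to_bytes('a%a3do')     # b'a\xa3do'
    b'a\xa3do'.decode('utf-8')     # raises UnicodeDecodeError:
                                   #   0xa3 cannot start a UTF-8 sequence
    b'a\xa3do'.decode('latin-1')   # 'a£do' -- plausible, but only a guess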
We're working on a new set of tests for these cases and hope to come up with proper handling of non-ASCII URLs in both Python 2 and Python 3. I'm not sure we won't break existing user code in some corner cases, but doing things the way common browsers do is the way to go, IMO.

Hope this clarifies the matter a bit. We'd love your input if you have tricky corner cases of URL mangling.

Paul.

On Wed, Mar 16, 2016 at 10:17 AM, Kota Kato <[email protected]> wrote:
> Hello Scrapy developers,
>
> I'm really pleased with Python 3 support in the upcoming Scrapy 1.1
> release. I'm thinking about introducing this great release in my blog
> and in a book I'm currently writing.
>
> I have a question about a limitation in handling non-ASCII URLs. The
> release notes for 1.1 (http://doc.scrapy.org/en/master/news.html#news-betapy3)
> say:
>
> * Scrapy has problems handling non-ASCII URLs in Python 3
>
> This limitation seems big enough to make Japanese people like me
> hesitate to use Scrapy 1.1 on Python 3. However, testing with simple
> spiders that crawl non-ASCII URLs
> (https://gist.github.com/orangain/3724b86a5dc5b2a279f9), I didn't have
> any problem. So my question is:
>
> * What exactly does the limitation mean?
>
> More specifically:
>
> * In my understanding, "non-ASCII URLs" means URLs that contain
> percent-encoded non-ASCII characters. Is this right? Or does it mean
> URLs that contain non-ASCII characters without percent-encoding?
> * What kinds of problems will occur?
> * In which components will problems occur?
> * Under which conditions will problems occur?
>
> I've explored the following issues, but I couldn't find a clear answer
> to my question.
>
> HTML entity causes UnicodeEncodeError in LxmlLinkExtractor · Issue #998 ·
> scrapy/scrapy
> https://github.com/scrapy/scrapy/issues/998
>
> Speedup & fix URL parsing · Issue #1306 · scrapy/scrapy
> https://github.com/scrapy/scrapy/issues/1306
>
> Exception in LxmLinkExtractor.extract_links 'charmap' codec can't encode
> character · Issue #1403 · scrapy/scrapy
> https://github.com/scrapy/scrapy/issues/1403
>
> Exception in LxmLinkExtractor.extract_links 'ascii' codec can't encode
> character · Issue #1405 · scrapy/scrapy
> https://github.com/scrapy/scrapy/issues/1405
>
> PY3: add back 3 URL normalization tests by redapple · Pull Request #1664 ·
> scrapy/scrapy
> https://github.com/scrapy/scrapy/pull/1664
>
> get_base_url fails for non-ascii URLs in Python 3 · Issue #1783 ·
> scrapy/scrapy
> https://github.com/scrapy/scrapy/issues/1783
>
> Best,
>
> orangain
> --
> [email protected]
