Re: What is the encode format on this website?

Paul Tremberth Wed, 31 Aug 2016 06:16:34 -0700

Hello,

you can fetch this URL with scrapy shell and check the response encoding:


$ scrapy shell "http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=&q3=&q4="
2016-08-31 15:05:58 [scrapy] INFO: Scrapy 1.1.2 started (bot: scrapybot)
(...)
2016-08-31 15:05:59 [scrapy] DEBUG: Crawled (200) <GET http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=&q3=&q4=> (referer: None)
(...)
In [1]: response.encoding
Out[1]: 'gb18030'


Note: in the examples below, I'm using Python 3.5

You can also verify the encoding of the URL's query string with parse_qs [1] 
(in Python 3 you can pass the encoding):

In [2]: from urllib.parse import parse_qs

In [3]: parse_qs('q1=%C0%EF&q2=&q3=&q4=', encoding='gb18030')
Out[3]: {'q1': ['里']}
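
If you don't have the response at hand, a quick way to guess the encoding is to decode the percent-escaped bytes with a few candidate codecs and see which one yields sensible text. This is only a rough sketch using the standard library; the helper name try_decodings and the candidate list are my own choices for illustration, not something Scrapy provides:

```python
from urllib.parse import unquote_to_bytes

# Raw bytes behind the percent-escapes in the query string.
raw = unquote_to_bytes('%C0%EF')  # b'\xc0\xef'


def try_decodings(data, candidates=('utf-8', 'gbk', 'gb18030')):
    """Hypothetical helper: decode `data` with each candidate codec.

    Returns a dict mapping codec name to the decoded text,
    or to None when the bytes are not valid in that codec.
    """
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


decoded = try_decodings(raw)
for name, text in decoded.items():
    print(name, repr(text))
```

Here UTF-8 rejects the bytes, while gb18030 decodes them to 里. Several codecs can succeed on the same bytes, so treat the result as a hint and prefer what response.encoding reports.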



When building Request objects from URLs that contain Chinese characters, 
you'll need to either pass safe URL strings encoded with gb18030 (here I 
pass the same 里 four times; this is just an example, obviously):

In [4]: from w3lib.url import safe_url_string

In [5]: safe_url_string(
   ...:     'http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=里&q3=里&q4=里',
   ...:     encoding=response.encoding)
Out[5]: 'http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=%C0%EF&q3=%C0%EF&q4=%C0%EF'

In [6]: from scrapy import Request

In [7]: Request('http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=%C0%EF&q3=%C0%EF&q4=%C0%EF')
Out[7]: <GET http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=%C0%EF&q3=%C0%EF&q4=%C0%EF>




or pass the encoding parameter to the Request constructor. Otherwise, UTF-8 
is used to encode the query parameters before percent-escaping:

In [9]: Request('http://chengyu.t086.com/chaxun.php?q1=里&q2=里')
Out[9]: <GET http://chengyu.t086.com/chaxun.php?q1=%E9%87%8C&q2=%E9%87%8C>

In [10]: Request('http://chengyu.t086.com/chaxun.php?q1=里&q2=里', encoding='gb18030')
Out[10]: <GET http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=%C0%EF>
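
Outside of Scrapy, the same percent-escaping can be reproduced with the standard library, which is handy for checking what a query string should look like before sending it. A minimal sketch, assuming Python 3; the parameter values are just the example character used above:

```python
from urllib.parse import quote, urlencode, parse_qs

encoding = 'gb18030'

# Percent-escape a single value using the site's encoding instead of UTF-8.
escaped = quote('里', encoding=encoding)

# Build a whole query string; a list of pairs keeps the parameter order fixed.
params = [('q1', '里'), ('q2', '')]
query = urlencode(params, encoding=encoding)
url = 'http://chengyu.t086.com/chaxun.php?' + query

# Decode it back with the same encoding to confirm the round trip.
roundtrip = parse_qs(query, encoding=encoding)

print(escaped)
print(url)
print(roundtrip)
```

If the query decodes back to the characters you started with, the URL should match what the site's own search form produces.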


Hope it helps.

Regards,
Paul.

[1] https://docs.python.org/3/library/urllib.parse.html#urllib.parse.parse_qs



On Tuesday, August 30, 2016 at 11:33:27 AM UTC+2, 李哲 wrote:
>
> http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=&q3=&q4=
>
>
>
> I want to scrape this web, I should fill the &q1="some thing" or &q2= 
> "something", with the Chinese character which is encoded, 
>
> but the query content %C0%EF seems not utf - 8 encoding(or urlencoding, 
> both I have tried).  How can I know the encoding format? 
>
> What is the encoding format here ?  
>
>
>

