Re: What is the encode format on this website?

李哲 Wed, 31 Aug 2016 22:50:45 -0700

Thanks, I will read the docs carefully.

On Wednesday, August 31, 2016 at 9:15:47 PM UTC+8, Paul Tremberth wrote:
>
> Hello,
>
> you can fetch this URL with scrapy shell and check the response encoding:
>
> $ scrapy shell "http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=&q3=&q4="
> 2016-08-31 15:05:58 [scrapy] INFO: Scrapy 1.1.2 started (bot: scrapybot)
> (...)
> 2016-08-31 15:05:59 [scrapy] DEBUG: Crawled (200) <GET http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=&q3=&q4=> (referer: None)
> (...)
> In [1]: response.encoding
> Out[1]: 'gb18030'
>
>
> Note: in the examples below, I'm using Python 3.5
>
> You can also verify the encoding of the URL using parse_qs[1] (in Python 3, 
> you can pass the encoding):
>
> In [2]: from urllib.parse import parse_qs
>
> In [3]: parse_qs('q1=%C0%EF&q2=&q3=&q4=', encoding='gb18030')
> Out[3]: {'q1': ['里']}
>
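
The same check can be done in both directions with quote and unquote from urllib.parse. This is only a minimal sketch using the standard library; the variable names are illustrative and not part of the original example.

```python
from urllib.parse import quote, unquote

escaped = "%C0%EF"

# Decoding the escaped bytes as gb18030 yields the expected character.
as_gb18030 = unquote(escaped, encoding="gb18030")
print(as_gb18030)  # 里

# Decoding the same bytes as UTF-8 does not work: with errors="replace"
# the invalid bytes become U+FFFD replacement characters.
as_utf8 = unquote(escaped, encoding="utf-8", errors="replace")
print(repr(as_utf8))

# Encoding goes the other way: the same character produces different
# percent-escapes depending on the encoding.
gb_escaped = quote("里", encoding="gb18030")
utf8_escaped = quote("里", encoding="utf-8")
print(gb_escaped)    # %C0%EF
print(utf8_escaped)  # %E9%87%8C
```

If the gb18030 round trip reproduces the escapes seen in the site's URLs, that is a good hint the site expects that encoding.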
>
>
> When building Request objects from URLs that contain Chinese characters, 
> you'll need to either pass safe URL strings encoded with gb18030 (here 
> I pass the same 里 4 times; this is just an example, obviously):
>
> In [4]: from w3lib.url import safe_url_string
>
> In [5]: safe_url_string('http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=里&q3=里&q4=里',
>                         encoding=response.encoding)
> Out[5]: 'http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=%C0%EF&q3=%C0%EF&q4=%C0%EF'
>
> In [6]: from scrapy import Request
>
> In [7]: Request('http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=%C0%EF&q3=%C0%EF&q4=%C0%EF')
> Out[7]: <GET http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=%C0%EF&q3=%C0%EF&q4=%C0%EF>
>
>
>
>
> or pass the encoding parameter to the Request constructor; otherwise, 
> UTF-8 is used to encode query parameters before percent-escaping:
>
> In [9]: Request('http://chengyu.t086.com/chaxun.php?q1=里&q2=里')
> Out[9]: <GET http://chengyu.t086.com/chaxun.php?q1=%E9%87%8C&q2=%E9%87%8C>
>
> In [10]: Request('http://chengyu.t086.com/chaxun.php?q1=里&q2=里', encoding='gb18030')
> Out[10]: <GET http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=%C0%EF>
>
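
As a third option, the query string can be built with the standard library before the URL is handed to Request. This is a sketch only: BASE_URL and build_query_url are hypothetical names chosen for illustration, and it assumes the site expects gb18030 for every parameter.

```python
from urllib.parse import urlencode

# Search endpoint taken from the URL discussed in this thread.
BASE_URL = "http://chengyu.t086.com/chaxun.php"


def build_query_url(params, encoding="gb18030"):
    """Percent-escape the parameters with the given encoding and
    append them to BASE_URL. `params` is a list of (key, value) pairs."""
    return BASE_URL + "?" + urlencode(params, encoding=encoding)


url = build_query_url([("q1", "里"), ("q2", ""), ("q3", ""), ("q4", "")])
print(url)
# http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=&q3=&q4=
```

The resulting string contains only ASCII percent-escapes, so it can be used as the Request URL without any further encoding argument.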
>
> Hope it helps.
>
> Regards,
> Paul.
>
> [1] https://docs.python.org/3/library/urllib.parse.html#urllib.parse.parse_qs
>
>
>
> On Tuesday, August 30, 2016 at 11:33:27 AM UTC+2, 李哲 wrote:
>>
>> http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=&q3=&q4=
>>
>>
>>
>> I want to scrape this website. I need to fill in &q1="something" or 
>> &q2="something" with an encoded Chinese character, 
>>
>> but the query content %C0%EF does not seem to be UTF-8 (or plain URL 
>> encoding; I have tried both). How can I tell which encoding is used? 
>>
>> What is the encoding format here?
>>
>>
>>
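
When the encoding is unknown and the page cannot be fetched first, one rough approach is to decode the escaped bytes with a few likely encodings and look at which result is readable. The helper below is a hypothetical sketch, and the list of candidate encodings is an assumption, not something the site documents.

```python
from urllib.parse import unquote_to_bytes


def candidate_decodings(escaped, encodings=("utf-8", "gb18030", "big5")):
    """Return {encoding: decoded text, or None if decoding failed}."""
    raw = unquote_to_bytes(escaped)  # "%C0%EF" -> b"\xc0\xef"
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


decodings = candidate_decodings("%C0%EF")
for name, text in decodings.items():
    print(name, repr(text))
```

More than one encoding may decode without an error, so the output still has to be judged by eye; checking response.encoding in scrapy shell remains the more reliable test.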


