Hello, you can fetch this URL with scrapy shell and check the response encoding:

    $ scrapy shell "http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=&q3=&q4="
    2016-08-31 15:05:58 [scrapy] INFO: Scrapy 1.1.2 started (bot: scrapybot)
    (...
    2016-08-31 15:05:59 [scrapy] DEBUG: Crawled (200) <GET http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=&q3=&q4=> (referer: None)
    (...)
    In [1]: response.encoding
    Out[1]: 'gb18030'

Note: in the examples below, I'm using Python 3.5.

You can also verify the encoding of the URL's query string using parse_qs [1] (in Python 3 you can pass the encoding):

    In [2]: from urllib.parse import parse_qs

    In [3]: parse_qs('q1=%C0%EF&q2=&q3=&q4=', encoding='gb18030')
    Out[3]: {'q1': ['里']}

When you build Request objects from URLs that contain Chinese characters, you need to either pass safe URL strings that are already percent-encoded as gb18030 (here I use the same 里 for all 4 parameters, just as an example):

    In [4]: from w3lib.url import safe_url_string

    In [5]: safe_url_string('http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=里&q3=里&q4=里', encoding=response.encoding)
    Out[5]: 'http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=%C0%EF&q3=%C0%EF&q4=%C0%EF'

    In [6]: from scrapy import Request

    In [7]: Request('http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=%C0%EF&q3=%C0%EF&q4=%C0%EF')
    Out[7]: <GET http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=%C0%EF&q3=%C0%EF&q4=%C0%EF>

or pass the encoding parameter to the Request constructor. Otherwise, UTF-8 is used to encode the query parameters before percent-escaping:

    In [9]: Request('http://chengyu.t086.com/chaxun.php?q1=里&q2=里')
    Out[9]: <GET http://chengyu.t086.com/chaxun.php?q1=%E9%87%8C&q2=%E9%87%8C>

    In [10]: Request('http://chengyu.t086.com/chaxun.php?q1=里&q2=里', encoding='gb18030')
    Out[10]: <GET http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=%C0%EF>

Hope it helps.

Regards,
Paul.
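
P.S. If you want to check the encoding or build the query string without Scrapy, here is a minimal sketch using only the standard library (Python 3). The helper names `encode_query` and `decodes_as` and the constant `SITE_ENCODING` are mine, just for illustration; the encoding value is the one reported by response.encoding in the session above, so adjust it if your site reports something else.

```python
from urllib.parse import quote, unquote_to_bytes, urlencode

# Assumption: the site expects gb18030, as reported by response.encoding.
SITE_ENCODING = "gb18030"


def encode_query(params, encoding=SITE_ENCODING):
    """Percent-encode (key, value) pairs using the given text encoding."""
    return urlencode(params, encoding=encoding)


def decodes_as(percent_encoded, encoding):
    """Decode a percent-encoded value; return None if the bytes are
    not valid in the given encoding."""
    raw = unquote_to_bytes(percent_encoded)
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Build the same query string as in the original URL.
query = encode_query([("q1", "里"), ("q2", ""), ("q3", ""), ("q4", "")])
print(query)

# By default quote() uses UTF-8, which is not what this site sends.
print(quote("里"))

# The bytes C0 EF are not valid UTF-8, but they are valid gb18030.
print(decodes_as("%C0%EF", "utf-8"))
print(decodes_as("%C0%EF", "gb18030"))
```

Trying a few candidate encodings with `decodes_as` is only a rough check: several legacy encodings can accept the same bytes, so the encoding declared by the page itself remains the more reliable source.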
[1] https://docs.python.org/3/library/urllib.parse.html#urllib.parse.parse_qs

On Tuesday, August 30, 2016 at 11:33:27 AM UTC+2, 李哲 wrote:
> http://chengyu.t086.com/chaxun.php?q1=%C0%EF&q2=&q3=&q4=
>
> I want to scrape this website. I need to fill in &q1="something" or
> &q2="something" with an encoded Chinese character, but the query
> content %C0%EF does not seem to be UTF-8 encoded (or urlencoded, I have
> tried both). How can I know the encoding format?
>
> What is the encoding format here?
