Thank you, Paul. I can only find one 'Set-Cookie' in the headers. What happened?
Here is the debug information:

2014-07-09 17:45:17+0800 [AmazonSpider] DEBUG: Retrying <GET http://www.amazon.com> (failed 1 times): User timeout caused connection failure: Getting http://www.amazon.com took longer than 180 seconds..
2014-07-09 17:45:17+0800 [AmazonSpider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-07-09 17:45:24+0800 [AmazonSpider] DEBUG: Received cookies from: <200 http://www.amazon.com>
        Set-Cookie: skin=noskin; path=/; domain=.amazon.com
2014-07-09 17:45:24+0800 [AmazonSpider] DEBUG: Crawled (200) <GET http://www.amazon.com> (referer: None)

On Wednesday, July 2, 2014 at 10:00:34 PM UTC+8, Paul Tremberth wrote:
>
> A bit more detail, in case you noticed that the response.headers
> representation seems to be missing some Set-Cookie values.
> In fact, you can receive multiple Set-Cookie headers, so you need to use
> .getlist(headername) to get them all.
>
> Same example with Amazon.com and COOKIES_DEBUG enabled:
>
> $ scrapy shell "http://www.amazon.com" --set USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36" --set COOKIES_DEBUG=1
> 2014-07-02 15:56:32+0200 [scrapy] INFO: Scrapy 0.24.1 started (bot: scrapybot)
> 2014-07-02 15:56:33+0200 [default] DEBUG: Received cookies from: <200 http://www.amazon.com>
>         Set-Cookie: skin=noskin; path=/; domain=.amazon.com
>         Set-Cookie: x-wl-uid=1TzxsioAAJu0q37UxjzEKb4UNs0KLyIW8rCypLuAVZpMc8uplJgfLcrbX2StxWEpT59BoUyDBl5A=; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT
>         Set-Cookie: session-id-time=2082787201l; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT
>         Set-Cookie: session-id=182-4946683-0637966; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT
> 2014-07-02 15:56:33+0200 [default] DEBUG: Crawled (200) <GET http://www.amazon.com> (referer: None)
> [s] Available Scrapy objects:
> [s]   crawler    <scrapy.crawler.Crawler object at 0x7f1ce9d8fbd0>
> [s]   item       {}
> [s]   request    <GET http://www.amazon.com>
> [s]   response   <200 http://www.amazon.com>
> [s]   settings   <scrapy.settings.Settings object at 0x7f1cea430d50>
> [s]   spider     <Spider 'default' at 0x7f1ce94e5e50>
> [s] Useful shortcuts:
> [s]   shelp()            Shell help (print this help)
> [s]   fetch(req_or_url)  Fetch request (or URL) and update local objects
> [s]   view(response)     View response in a browser
>
> In [1]: response.headers
> Out[1]:
> {'Cache-Control': 'no-cache',
>  'Content-Type': 'text/html; charset=ISO-8859-1',
>  'Date': 'Wed, 02 Jul 2014 13:56:32 GMT',
>  'Expires': '-1',
>  'P3P': 'policyref="http://www.amazon.com/w3c/p3p.xml",CP="CAO DSP LAW CUR ADM IVAo IVDo CONo OTPo OUR DELi PUBi OTRi BUS PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA HEA PRE LOC GOV OTC "',
>  'Pragma': 'no-cache',
>  'Server': 'Server',
>  'Set-Cookie': 'session-id=182-4946683-0637966; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT',
>  'Vary': 'Accept-Encoding,User-Agent',
>  'X-Amz-Id-1': '1ZAYQZK49NGDTCJPSH1C',
>  'X-Amz-Id-2': 'puwShmgjkOwsTu9o4UP22PoJMqv9eeh0EOI52svdSdZ96b9VtkJbPKdwDHuojOay',
>  'X-Frame-Options': 'SAMEORIGIN'}
>
> In [2]: type(response.headers)
> Out[2]: scrapy.http.headers.Headers
>
> In [3]: response.headers.getlist("Set-Cookie")
> Out[3]:
> ['skin=noskin; path=/; domain=.amazon.com',
>  'x-wl-uid=1TzxsioAAJu0q37UxjzEKb4UNs0KLyIW8rCypLuAVZpMc8uplJgfLcrbX2StxWEpT59BoUyDBl5A=; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT',
>  'session-id-time=2082787201l; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT',
>  'session-id=182-4946683-0637966; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT']
>
> And the cookies Scrapy sends:
>
> In [4]: fetch('http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27')
> 2014-07-02 15:59:24+0200 [default] DEBUG: Sending cookies to: <GET http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27>
>         Cookie: session-id=182-4946683-0637966; session-id-time=2082787201l; x-wl-uid=1TzxsioAAJu0q37UxjzEKb4UNs0KLyIW8rCypLuAVZpMc8uplJgfLcrbX2StxWEpT59BoUyDBl5A=; skin=noskin
> 2014-07-02 15:59:25+0200 [default] DEBUG: Received cookies from: <200 http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27>
>         Set-Cookie: ubid-main=183-9706629-1828940; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT
>         Set-Cookie: session-id-time=2082787201l; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT
>         Set-Cookie: session-id=182-4946683-0637966; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT
> 2014-07-02 15:59:25+0200 [default] DEBUG: Crawled (200) <GET http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27> (referer: None)
> [s] Available Scrapy objects:
> [s]   crawler    <scrapy.crawler.Crawler object at 0x7f1ce9d8fbd0>
> [s]   item       {}
> [s]   request    <GET http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27>
> [s]   response   <200 http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27>
> [s]   settings   <scrapy.settings.Settings object at 0x7f1cea430d50>
> [s]   spider     <Spider 'default' at 0x7f1ce94e5e50>
> [s] Useful shortcuts:
> [s]   shelp()            Shell help (print this help)
> [s]   fetch(req_or_url)  Fetch request (or URL) and update local objects
> [s]   view(response)     View response in a browser
>
> In [5]: response.request.headers.getlist("Cookie")
> Out[5]: ['session-id=182-4946683-0637966; session-id-time=2082787201l; x-wl-uid=1TzxsioAAJu0q37UxjzEKb4UNs0KLyIW8rCypLuAVZpMc8uplJgfLcrbX2StxWEpT59BoUyDBl5A=; skin=noskin']
>
> In [6]: response.headers.getlist("Set-Cookie")
> Out[6]:
> ['ubid-main=183-9706629-1828940; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT',
>  'session-id-time=2082787201l; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT',
>  'session-id=182-4946683-0637966; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT']
>
> On Wednesday, July 2, 2014 3:51:00 PM UTC+2, Paul Tremberth wrote:
>>
>> You can get "Set-Cookie" headers from the responses:
>>
>> $ scrapy shell "http://www.amazon.com" --set USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
>> 2014-07-02 14:53:12+0200 [scrapy] INFO: Scrapy 0.24.1 started (bot: scrapybot)
>> 2014-07-02 14:53:12+0200 [scrapy] INFO: Optional features available: ssl, http11, boto
>> 2014-07-02 14:53:12+0200 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'}
>> 2014-07-02 14:53:12+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>> 2014-07-02 14:53:12+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
>> 2014-07-02 14:53:12+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
>> 2014-07-02 14:53:12+0200 [scrapy] INFO: Enabled item pipelines:
>> 2014-07-02 14:53:12+0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
>> 2014-07-02 14:53:12+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
>> 2014-07-02 14:53:12+0200 [default] INFO: Spider opened
>> 2014-07-02 14:53:13+0200 [default] DEBUG: Crawled (200) <GET http://www.amazon.com> (referer: None)
>> [s] Available Scrapy objects:
>> [s]   crawler    <scrapy.crawler.Crawler object at 0x7f9ff6894bd0>
>> [s]   item       {}
>> [s]   request    <GET http://www.amazon.com>
>> [s]   response   <200 http://www.amazon.com>
>> [s]   settings   <scrapy.settings.Settings object at 0x7f9ff6f35d50>
>> [s]   spider     <Spider 'default' at 0x7f9ff5feae50>
>> [s] Useful shortcuts:
>> [s]   shelp()            Shell help (print this help)
>> [s]   fetch(req_or_url)  Fetch request (or URL) and update local objects
>> [s]   view(response)     View response in a browser
>>
>> In [1]: response.headers
>> Out[1]:
>> {'Cache-Control': 'no-cache',
>>  'Content-Type': 'text/html; charset=ISO-8859-1',
>>  'Date': 'Wed, 02 Jul 2014 12:53:13 GMT',
>>  'Expires': '-1',
>>  'P3P': 'policyref="http://www.amazon.com/w3c/p3p.xml",CP="CAO DSP LAW CUR ADM IVAo IVDo CONo OTPo OUR DELi PUBi OTRi BUS PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA HEA PRE LOC GOV OTC "',
>>  'Pragma': 'no-cache',
>>  'Server': 'Server',
>>  'Set-Cookie': 'session-id=185-4345826-3198169; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT',
>>  'Vary': 'Accept-Encoding,User-Agent',
>>  'X-Amz-Id-1': '0HSR62FXE7WW8GGJ3003',
>>  'X-Amz-Id-2': 'TX9doI/wHzZDQLi61C/nIydE0Sv7wjkhNs30li5KMVSEWLqRqVSvL03WYmkTnASu',
>>  'X-Frame-Options': 'SAMEORIGIN'}
>>
>> And "Cookie" headers from response.request:
>>
>> In [2]: fetch('http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27')
>> 2014-07-02 15:47:23+0200 [default] DEBUG: Crawled (200) <GET http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27> (referer: None)
>> [s] Available Scrapy objects:
>> [s]   crawler    <scrapy.crawler.Crawler object at 0x7f9ff6894bd0>
>> [s]   item       {}
>> [s]   request    <GET http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27>
>> [s]   response   <200 http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27>
>> [s]   settings   <scrapy.settings.Settings object at 0x7f9ff6f35d50>
>> [s]   spider     <Spider 'default' at 0x7f9ff5feae50>
>> [s] Useful shortcuts:
>> [s]   shelp()            Shell help (print this help)
>> [s]   fetch(req_or_url)  Fetch request (or URL) and update local objects
>> [s]   view(response)     View response in a browser
>>
>> In [3]: response.headers
>> Out[3]:
>> {'Content-Type': 'text/html; charset=ISO-8859-1',
>>  'Date': 'Wed, 02 Jul 2014 13:47:22 GMT',
>>  'P3P': 'policyref="http://www.amazon.com/w3c/p3p.xml",CP="CAO DSP LAW CUR ADM IVAo IVDo CONo OTPo OUR DELi PUBi OTRi BUS PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA HEA PRE LOC GOV OTC "',
>>  'Server': 'Server',
>>  'Set-Cookie': 'session-id=185-4345826-3198169; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT',
>>  'Vary': 'Accept-Encoding,User-Agent',
>>  'X-Amz-Id-1': '0C0QXN1ZK555MP10HWB5',
>>  'X-Amz-Id-2': 'CcXo3odRFUSFkmnICLBbdhYKKmiygNJ/b7c3s74p2mWaRnqldFyDmhrdB9PPVK6O',
>>  'X-Frame-Options': 'SAMEORIGIN'}
>>
>> In [4]: response.request.headers
>> Out[4]:
>> {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
>>  'Accept-Encoding': 'gzip,deflate',
>>  'Accept-Language': 'en',
>>  'Cookie': 'session-id=185-4345826-3198169; session-id-time=2082787201l; x-wl-uid=1/kDeNun+YQYYmW1esQBg6XsiW68oMT1FJXDavoxODm1tzaDnaKf1KOMU+Jmni6iWQngWZhCnOjI=; skin=noskin',
>>  'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'}
>>
>> On Wednesday, July 2, 2014 7:18:31 AM UTC+2, Reggie wrote:
>>>
>>> I want to read cookies when I parse a response, but I can't find cookies
>>> in either response.meta or response.headers. How can I read them?
>>>
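
For anyone who wants the cookies as name/value pairs rather than raw header strings, here is a minimal sketch of one way to do it. It is not from the thread: the helper name parse_set_cookie_headers is hypothetical, it uses only the Python 3 standard library (http.cookies), and it assumes its input is the kind of list that response.headers.getlist("Set-Cookie") returns. The sample values are copied from the shell session quoted above. As far as I know, recent Scrapy versions return bytes from getlist(), so the sketch decodes them first; treat that as an assumption to check against your version.

```python
from http.cookies import SimpleCookie


def parse_set_cookie_headers(raw_headers):
    """Turn raw Set-Cookie header values into a {name: value} dict.

    Hypothetical helper: `raw_headers` is assumed to be a list such as
    the one returned by response.headers.getlist("Set-Cookie").
    """
    cookies = {}
    for raw in raw_headers:
        # Assumption: newer Scrapy versions hand back bytes, older ones str.
        if isinstance(raw, bytes):
            raw = raw.decode("latin-1")
        jar = SimpleCookie()
        jar.load(raw)  # path/domain/expires become attributes of the morsel
        for name, morsel in jar.items():
            cookies[name] = morsel.value
    return cookies


# Sample header values copied from the shell session quoted above.
sample = [
    "skin=noskin; path=/; domain=.amazon.com",
    "session-id-time=2082787201l; path=/; domain=.amazon.com; "
    "expires=Tue, 01-Jan-2036 08:00:01 GMT",
    "session-id=182-4946683-0637966; path=/; domain=.amazon.com; "
    "expires=Tue, 01-Jan-2036 08:00:01 GMT",
]

if __name__ == "__main__":
    for name, value in parse_set_cookie_headers(sample).items():
        print(name, "=", value)
```

Inside a spider callback you would presumably call it as parse_set_cookie_headers(response.headers.getlist("Set-Cookie")). Note that this only reads what one response set; it does not reflect the full cookie jar that CookiesMiddleware maintains.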
