Re: How could I read cookies in Spider.parse method?

Reggie Wed, 09 Jul 2014 02:53:32 -0700

Thank you, Paul.
I can only find one 'Set-Cookie' entry in the response headers. Why are the others missing?


Here is the debug output:

2014-07-09 17:45:17+0800 [AmazonSpider] DEBUG: Retrying <GET http://www.amazon.com> (failed 1 times): User timeout caused connection failure: Getting http://www.amazon.com took longer than 180 seconds..
2014-07-09 17:45:17+0800 [AmazonSpider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-07-09 17:45:24+0800 [AmazonSpider] DEBUG: Received cookies from: <200 http://www.amazon.com>
        Set-Cookie: skin=noskin; path=/; domain=.amazon.com
2014-07-09 17:45:24+0800 [AmazonSpider] DEBUG: Crawled (200) <GET http://www.amazon.com> (referer: None)
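
For reference, here is a minimal sketch of how the values returned by response.headers.getlist("Set-Cookie") (Paul's suggestion, quoted below) could be turned into a name/value dict inside a parse callback. The helper name cookies_from_set_cookie is hypothetical, and the sketch assumes the values may arrive either as str or as bytes depending on the Scrapy and Python version. It only keeps the leading name=value pair and ignores attributes such as path, domain and expires.

```python
def cookies_from_set_cookie(raw_values):
    """Map cookie name -> value from a list of raw Set-Cookie header values.

    Hypothetical helper: `raw_values` is expected to look like the list
    returned by response.headers.getlist("Set-Cookie").
    """
    cookies = {}
    for raw in raw_values:
        # Assumption: some Scrapy versions hand back bytes instead of str.
        if isinstance(raw, bytes):
            raw = raw.decode("latin-1")
        # The first "name=value" pair is the cookie; the rest are attributes.
        pair = raw.split(";", 1)[0].strip()
        name, sep, value = pair.partition("=")
        if sep:
            cookies[name] = value
    return cookies


# Inside a spider callback this would be used roughly as:
#     def parse(self, response):
#         cookies = cookies_from_set_cookie(
#             response.headers.getlist("Set-Cookie"))
sample = [
    "skin=noskin; path=/; domain=.amazon.com",
    "session-id=182-4946683-0637966; path=/; domain=.amazon.com; "
    "expires=Tue, 01-Jan-2036 08:00:01 GMT",
]
print(cookies_from_set_cookie(sample))
```

If the server really sends only one Set-Cookie header, as in the log above, the resulting dict will contain only that one cookie.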


On Wednesday, July 2, 2014 at 10:00:34 PM UTC+8, Paul Tremberth wrote:
>
> A bit more detail, in case you noticed that the response.headers 
> representation seems to be missing some Set-Cookie values.
> In fact, you can receive multiple Set-Cookie headers, so you need to use 
> .getlist(headername) to get them all:
>
> Same example with Amazon.com and COOKIES_DEBUG enabled
>
> $ scrapy shell "http://www.amazon.com" --set USER_AGENT="Mozilla/5.0 
> (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) 
> Chrome/35.0.1916.153 Safari/537.36" --set COOKIES_DEBUG=1
> 2014-07-02 15:56:32+0200 [scrapy] INFO: Scrapy 0.24.1 started (bot: 
> scrapybot)
> 2014-07-02 15:56:33+0200 [default] DEBUG: Received cookies from: <200 
> http://www.amazon.com>
> Set-Cookie: skin=noskin; path=/; domain=.amazon.com
> Set-Cookie: x-wl-uid=1TzxsioAAJu0q37UxjzEKb4UNs0KLyIW8rCypLuAVZpMc8uplJgfLcrbX2StxWEpT59BoUyDBl5A=; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT
> Set-Cookie: session-id-time=2082787201l; path=/; domain=.amazon.com; 
> expires=Tue, 01-Jan-2036 08:00:01 GMT
> Set-Cookie: session-id=182-4946683-0637966; path=/; domain=.amazon.com; 
> expires=Tue, 01-Jan-2036 08:00:01 GMT
> 2014-07-02 15:56:33+0200 [default] DEBUG: Crawled (200) <GET 
> http://www.amazon.com> (referer: None)
> [s] Available Scrapy objects:
> [s]   crawler    <scrapy.crawler.Crawler object at 0x7f1ce9d8fbd0>
> [s]   item       {}
> [s]   request    <GET http://www.amazon.com>
> [s]   response   <200 http://www.amazon.com>
> [s]   settings   <scrapy.settings.Settings object at 0x7f1cea430d50>
> [s]   spider     <Spider 'default' at 0x7f1ce94e5e50>
> [s] Useful shortcuts:
> [s]   shelp()           Shell help (print this help)
> [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
> [s]   view(response)    View response in a browser
>
> In [1]: response.headers
> Out[1]: 
> {'Cache-Control': 'no-cache',
>  'Content-Type': 'text/html; charset=ISO-8859-1',
>  'Date': 'Wed, 02 Jul 2014 13:56:32 GMT',
>  'Expires': '-1',
>  'P3P': 'policyref="http://www.amazon.com/w3c/p3p.xml",CP="CAO DSP LAW 
> CUR ADM IVAo IVDo CONo OTPo OUR DELi PUBi OTRi BUS PHY ONL UNI PUR FIN COM 
> NAV INT DEM CNT STA HEA PRE LOC GOV OTC "',
>  'Pragma': 'no-cache',
>  'Server': 'Server',
>  'Set-Cookie': 'session-id=182-4946683-0637966; path=/; domain=.amazon.com; 
> expires=Tue, 01-Jan-2036 08:00:01 GMT',
>  'Vary': 'Accept-Encoding,User-Agent',
>  'X-Amz-Id-1': '1ZAYQZK49NGDTCJPSH1C',
>  'X-Amz-Id-2': 
> 'puwShmgjkOwsTu9o4UP22PoJMqv9eeh0EOI52svdSdZ96b9VtkJbPKdwDHuojOay',
>  'X-Frame-Options': 'SAMEORIGIN'}
>
> In [2]: type(response.headers)
> Out[2]: scrapy.http.headers.Headers
>
> In [3]: response.headers.getlist("Set-Cookie")
> Out[3]: 
> ['skin=noskin; path=/; domain=.amazon.com',
>  'x-wl-uid=1TzxsioAAJu0q37UxjzEKb4UNs0KLyIW8rCypLuAVZpMc8uplJgfLcrbX2StxWEpT59BoUyDBl5A=; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT',
>  'session-id-time=2082787201l; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT',
>  'session-id=182-4946683-0637966; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT']
>
> In [4]: 
>
>
>
> And the cookies Scrapy sends:
>
> In [4]: fetch('http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27')
> 2014-07-02 15:59:24+0200 [default] DEBUG: Sending cookies to: <GET 
> http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27>
> Cookie: session-id=182-4946683-0637966; session-id-time=2082787201l; x-wl-uid=1TzxsioAAJu0q37UxjzEKb4UNs0KLyIW8rCypLuAVZpMc8uplJgfLcrbX2StxWEpT59BoUyDBl5A=; skin=noskin
> 2014-07-02 15:59:25+0200 [default] DEBUG: Received cookies from: <200 
> http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27>
> Set-Cookie: ubid-main=183-9706629-1828940; path=/; domain=.amazon.com; 
> expires=Tue, 01-Jan-2036 08:00:01 GMT
> Set-Cookie: session-id-time=2082787201l; path=/; domain=.amazon.com; 
> expires=Tue, 01-Jan-2036 08:00:01 GMT
> Set-Cookie: session-id=182-4946683-0637966; path=/; domain=.amazon.com; 
> expires=Tue, 01-Jan-2036 08:00:01 GMT
> 2014-07-02 15:59:25+0200 [default] DEBUG: Crawled (200) <GET 
> http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27> (referer: None)
> [s] Available Scrapy objects:
> [s]   crawler    <scrapy.crawler.Crawler object at 0x7f1ce9d8fbd0>
> [s]   item       {}
> [s]   request    <GET http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27>
> [s]   response   <200 http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27>
> [s]   settings   <scrapy.settings.Settings object at 0x7f1cea430d50>
> [s]   spider     <Spider 'default' at 0x7f1ce94e5e50>
> [s] Useful shortcuts:
> [s]   shelp()           Shell help (print this help)
> [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
> [s]   view(response)    View response in a browser
>
> In [5]: response.request.headers.getlist("Cookie")
> Out[5]: ['session-id=182-4946683-0637966; session-id-time=2082787201l; x-wl-uid=1TzxsioAAJu0q37UxjzEKb4UNs0KLyIW8rCypLuAVZpMc8uplJgfLcrbX2StxWEpT59BoUyDBl5A=; skin=noskin']
>
> In [6]: response.headers.getlist("Set-Cookie")
> Out[6]: 
> ['ubid-main=183-9706629-1828940; path=/; domain=.amazon.com; expires=Tue, 
> 01-Jan-2036 08:00:01 GMT',
>  'session-id-time=2082787201l; path=/; domain=.amazon.com; expires=Tue, 
> 01-Jan-2036 08:00:01 GMT',
>  'session-id=182-4946683-0637966; path=/; domain=.amazon.com; 
> expires=Tue, 01-Jan-2036 08:00:01 GMT']
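
The Cookie request header shown in Out[5] is a single string with all pairs joined by "; ", so it needs one more split than the Set-Cookie list. A minimal sketch of that split follows, assuming the input is what response.request.headers.getlist("Cookie") returns. The helper name cookies_from_cookie_header is hypothetical, and the sample string is shortened from the log above.

```python
def cookies_from_cookie_header(raw_values):
    """Map cookie name -> value from raw Cookie request header values.

    Hypothetical helper: `raw_values` is expected to look like the list
    returned by response.request.headers.getlist("Cookie").
    """
    cookies = {}
    for raw in raw_values:
        # Assumption: some Scrapy versions hand back bytes instead of str.
        if isinstance(raw, bytes):
            raw = raw.decode("latin-1")
        # A Cookie header carries several "name=value" pairs separated by ";".
        for pair in raw.split(";"):
            # Split on the first "=" only, since values may contain "=".
            name, sep, value = pair.strip().partition("=")
            if sep:
                cookies[name] = value
    return cookies


sent = ["session-id=182-4946683-0637966; session-id-time=2082787201l; skin=noskin"]
print(cookies_from_cookie_header(sent))
```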
>
> In [7]: 
>
>
>
> On Wednesday, July 2, 2014 3:51:00 PM UTC+2, Paul Tremberth wrote:
>>
>> You can get "Set-Cookie" headers from the responses
>>
>> $ scrapy shell "http://www.amazon.com" --set USER_AGENT="Mozilla/5.0 
>> (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) 
>> Chrome/35.0.1916.153 Safari/537.36"
>> 2014-07-02 14:53:12+0200 [scrapy] INFO: Scrapy 0.24.1 started (bot: 
>> scrapybot)
>> 2014-07-02 14:53:12+0200 [scrapy] INFO: Optional features available: ssl, 
>> http11, boto
>> 2014-07-02 14:53:12+0200 [scrapy] INFO: Overridden settings: 
>> {'LOGSTATS_INTERVAL': 0, 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) 
>> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'}
>> 2014-07-02 14:53:12+0200 [scrapy] INFO: Enabled extensions: 
>> TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>> 2014-07-02 14:53:12+0200 [scrapy] INFO: Enabled downloader middlewares: 
>> HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, 
>> RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, 
>> HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, 
>> ChunkedTransferMiddleware, DownloaderStats
>> 2014-07-02 14:53:12+0200 [scrapy] INFO: Enabled spider middlewares: 
>> HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, 
>> UrlLengthMiddleware, DepthMiddleware
>> 2014-07-02 14:53:12+0200 [scrapy] INFO: Enabled item pipelines: 
>> 2014-07-02 14:53:12+0200 [scrapy] DEBUG: Telnet console listening on 
>> 127.0.0.1:6023
>> 2014-07-02 14:53:12+0200 [scrapy] DEBUG: Web service listening on 
>> 127.0.0.1:6080
>> 2014-07-02 14:53:12+0200 [default] INFO: Spider opened
>> 2014-07-02 14:53:13+0200 [default] DEBUG: Crawled (200) <GET 
>> http://www.amazon.com> (referer: None)
>> [s] Available Scrapy objects:
>> [s]   crawler    <scrapy.crawler.Crawler object at 0x7f9ff6894bd0>
>> [s]   item       {}
>> [s]   request    <GET http://www.amazon.com>
>> [s]   response   <200 http://www.amazon.com>
>> [s]   settings   <scrapy.settings.Settings object at 0x7f9ff6f35d50>
>> [s]   spider     <Spider 'default' at 0x7f9ff5feae50>
>> [s] Useful shortcuts:
>> [s]   shelp()           Shell help (print this help)
>> [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
>> [s]   view(response)    View response in a browser
>>
>> In [1]: response.headers
>> Out[1]: 
>> {'Cache-Control': 'no-cache',
>>  'Content-Type': 'text/html; charset=ISO-8859-1',
>>  'Date': 'Wed, 02 Jul 2014 12:53:13 GMT',
>>  'Expires': '-1',
>>  'P3P': 'policyref="http://www.amazon.com/w3c/p3p.xml",CP="CAO DSP LAW 
>> CUR ADM IVAo IVDo CONo OTPo OUR DELi PUBi OTRi BUS PHY ONL UNI PUR FIN COM 
>> NAV INT DEM CNT STA HEA PRE LOC GOV OTC "',
>>  'Pragma': 'no-cache',
>>  'Server': 'Server',
>>  'Set-Cookie': 'session-id=185-4345826-3198169; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT',
>>  'Vary': 'Accept-Encoding,User-Agent',
>>  'X-Amz-Id-1': '0HSR62FXE7WW8GGJ3003',
>>  'X-Amz-Id-2': 
>> 'TX9doI/wHzZDQLi61C/nIydE0Sv7wjkhNs30li5KMVSEWLqRqVSvL03WYmkTnASu',
>>  'X-Frame-Options': 'SAMEORIGIN'}
>>
>> In [2]: 
>>
>>
>> And "Cookie" headers from response.request:
>>
>> In [2]: fetch('http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27')
>> 2014-07-02 15:47:23+0200 [default] DEBUG: Crawled (200) <GET 
>> http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27> (referer: None)
>> [s] Available Scrapy objects:
>> [s]   crawler    <scrapy.crawler.Crawler object at 0x7f9ff6894bd0>
>> [s]   item       {}
>> [s]   request    <GET 
>> http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27>
>> [s]   response   <200 
>> http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27>
>> [s]   settings   <scrapy.settings.Settings object at 0x7f9ff6f35d50>
>> [s]   spider     <Spider 'default' at 0x7f9ff5feae50>
>> [s] Useful shortcuts:
>> [s]   shelp()           Shell help (print this help)
>> [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
>> [s]   view(response)    View response in a browser
>>
>> In [3]: response.headers
>> Out[3]: 
>> {'Content-Type': 'text/html; charset=ISO-8859-1',
>>  'Date': 'Wed, 02 Jul 2014 13:47:22 GMT',
>>  'P3P': 'policyref="http://www.amazon.com/w3c/p3p.xml",CP="CAO DSP LAW 
>> CUR ADM IVAo IVDo CONo OTPo OUR DELi PUBi OTRi BUS PHY ONL UNI PUR FIN COM 
>> NAV INT DEM CNT STA HEA PRE LOC GOV OTC "',
>>  'Server': 'Server',
>>  'Set-Cookie': 'session-id=185-4345826-3198169; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT',
>>  'Vary': 'Accept-Encoding,User-Agent',
>>  'X-Amz-Id-1': '0C0QXN1ZK555MP10HWB5',
>>  'X-Amz-Id-2': 
>> 'CcXo3odRFUSFkmnICLBbdhYKKmiygNJ/b7c3s74p2mWaRnqldFyDmhrdB9PPVK6O',
>>  'X-Frame-Options': 'SAMEORIGIN'}
>>
>> In [4]: response.request.headers
>> Out[4]: 
>> {'Accept': 
>> 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
>>  'Accept-Encoding': 'gzip,deflate',
>>  'Accept-Language': 'en',
>>  'Cookie': 'session-id=185-4345826-3198169; session-id-time=2082787201l; x-wl-uid=1/kDeNun+YQYYmW1esQBg6XsiW68oMT1FJXDavoxODm1tzaDnaKf1KOMU+Jmni6iWQngWZhCnOjI=; skin=noskin',
>>  'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 
>> (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'}
>>
>>
>>
>>
>> On Wednesday, July 2, 2014 7:18:31 AM UTC+2, Reggie wrote:
>>>
>>> I want to read cookies when I parse a response, but I can't find them 
>>> in either response.meta or response.headers. How can I read cookies?
>>>
>>
