Re: How could I read cookies in Spider.parse method?

Paul Tremberth Wed, 02 Jul 2014 07:01:01 -0700

A bit more detail, in case you noticed that the response.headers 
representation seems to be missing some Set-Cookie values.
In fact, a response can carry multiple Set-Cookie headers, so you need to use 
.getlist(headername) to get them all:


Same example with Amazon.com and COOKIES_DEBUG enabled

$ scrapy shell "http://www.amazon.com" --set USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36" --set COOKIES_DEBUG=1
2014-07-02 15:56:32+0200 [scrapy] INFO: Scrapy 0.24.1 started (bot: 
scrapybot)
2014-07-02 15:56:33+0200 [default] DEBUG: Received cookies from: <200 
http://www.amazon.com>
Set-Cookie: skin=noskin; path=/; domain=.amazon.com
Set-Cookie: x-wl-uid=1TzxsioAAJu0q37UxjzEKb4UNs0KLyIW8rCypLuAVZpMc8uplJgfLcrbX2StxWEpT59BoUyDBl5A=; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT
Set-Cookie: session-id-time=2082787201l; path=/; domain=.amazon.com; 
expires=Tue, 01-Jan-2036 08:00:01 GMT
Set-Cookie: session-id=182-4946683-0637966; path=/; domain=.amazon.com; 
expires=Tue, 01-Jan-2036 08:00:01 GMT
2014-07-02 15:56:33+0200 [default] DEBUG: Crawled (200) <GET 
http://www.amazon.com> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f1ce9d8fbd0>
[s]   item       {}
[s]   request    <GET http://www.amazon.com>
[s]   response   <200 http://www.amazon.com>
[s]   settings   <scrapy.settings.Settings object at 0x7f1cea430d50>
[s]   spider     <Spider 'default' at 0x7f1ce94e5e50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: response.headers
Out[1]: 
{'Cache-Control': 'no-cache',
 'Content-Type': 'text/html; charset=ISO-8859-1',
 'Date': 'Wed, 02 Jul 2014 13:56:32 GMT',
 'Expires': '-1',
 'P3P': 'policyref="http://www.amazon.com/w3c/p3p.xml",CP="CAO DSP LAW CUR 
ADM IVAo IVDo CONo OTPo OUR DELi PUBi OTRi BUS PHY ONL UNI PUR FIN COM NAV 
INT DEM CNT STA HEA PRE LOC GOV OTC "',
 'Pragma': 'no-cache',
 'Server': 'Server',
 'Set-Cookie': 'session-id=182-4946683-0637966; path=/; domain=.amazon.com; 
expires=Tue, 01-Jan-2036 08:00:01 GMT',
 'Vary': 'Accept-Encoding,User-Agent',
 'X-Amz-Id-1': '1ZAYQZK49NGDTCJPSH1C',
 'X-Amz-Id-2': 
'puwShmgjkOwsTu9o4UP22PoJMqv9eeh0EOI52svdSdZ96b9VtkJbPKdwDHuojOay',
 'X-Frame-Options': 'SAMEORIGIN'}

In [2]: type(response.headers)
Out[2]: scrapy.http.headers.Headers

In [3]: response.headers.getlist("Set-Cookie")
Out[3]: 
['skin=noskin; path=/; domain=.amazon.com',
 'x-wl-uid=1TzxsioAAJu0q37UxjzEKb4UNs0KLyIW8rCypLuAVZpMc8uplJgfLcrbX2StxWEpT59BoUyDBl5A=; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT',
 'session-id-time=2082787201l; path=/; domain=.amazon.com; expires=Tue, 
01-Jan-2036 08:00:01 GMT',
 'session-id=182-4946683-0637966; path=/; domain=.amazon.com; expires=Tue, 
01-Jan-2036 08:00:01 GMT']

In [4]: 
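
To do the same thing inside a spider callback, the list returned by 
.getlist("Set-Cookie") can be reduced to a name/value mapping. The helper 
below is only a sketch: cookies_from_set_cookie is a hypothetical name, not 
part of Scrapy, and it keeps just the leading name=value pair of each header, 
ignoring attributes such as path, domain and expires. It accepts both str and 
bytes, since the element type of the list depends on the Scrapy/Python 
version in use.

```python
def cookies_from_set_cookie(header_values):
    """Map cookie name -> value from raw Set-Cookie header values.

    Hypothetical helper (not a Scrapy API). Only the leading
    "name=value" pair of each header is kept; attributes such as
    path, domain and expires are dropped.
    """
    cookies = {}
    for raw in header_values:
        # Header values may be bytes or str depending on the version.
        if isinstance(raw, bytes):
            raw = raw.decode("latin-1")
        # Everything before the first ";" is the cookie pair itself.
        pair = raw.split(";", 1)[0]
        # Split on the first "=" only, so values ending in "=" survive.
        name, sep, value = pair.partition("=")
        if sep:
            cookies[name.strip()] = value.strip()
    return cookies


# Sample values copied from the shell session above.
sample = [
    "skin=noskin; path=/; domain=.amazon.com",
    b"session-id=182-4946683-0637966; path=/; domain=.amazon.com",
]
received = cookies_from_set_cookie(sample)
```

In a spider this would be called as 
cookies_from_set_cookie(response.headers.getlist("Set-Cookie")) from within 
parse().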



And the cookies Scrapy sends:

In [4]: fetch('http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27')
2014-07-02 15:59:24+0200 [default] DEBUG: Sending cookies to: <GET 
http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27>
Cookie: session-id=182-4946683-0637966; session-id-time=2082787201l; x-wl-uid=1TzxsioAAJu0q37UxjzEKb4UNs0KLyIW8rCypLuAVZpMc8uplJgfLcrbX2StxWEpT59BoUyDBl5A=; skin=noskin
2014-07-02 15:59:25+0200 [default] DEBUG: Received cookies from: <200 
http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27>
Set-Cookie: ubid-main=183-9706629-1828940; path=/; domain=.amazon.com; 
expires=Tue, 01-Jan-2036 08:00:01 GMT
Set-Cookie: session-id-time=2082787201l; path=/; domain=.amazon.com; 
expires=Tue, 01-Jan-2036 08:00:01 GMT
Set-Cookie: session-id=182-4946683-0637966; path=/; domain=.amazon.com; 
expires=Tue, 01-Jan-2036 08:00:01 GMT
2014-07-02 15:59:25+0200 [default] DEBUG: Crawled (200) <GET 
http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f1ce9d8fbd0>
[s]   item       {}
[s]   request    <GET http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27>
[s]   response   <200 http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27>
[s]   settings   <scrapy.settings.Settings object at 0x7f1cea430d50>
[s]   spider     <Spider 'default' at 0x7f1ce94e5e50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [5]: response.request.headers.getlist("Cookie")
Out[5]: ['session-id=182-4946683-0637966; session-id-time=2082787201l; x-wl-uid=1TzxsioAAJu0q37UxjzEKb4UNs0KLyIW8rCypLuAVZpMc8uplJgfLcrbX2StxWEpT59BoUyDBl5A=; skin=noskin']

In [6]: response.headers.getlist("Set-Cookie")
Out[6]: 
['ubid-main=183-9706629-1828940; path=/; domain=.amazon.com; expires=Tue, 
01-Jan-2036 08:00:01 GMT',
 'session-id-time=2082787201l; path=/; domain=.amazon.com; expires=Tue, 
01-Jan-2036 08:00:01 GMT',
 'session-id=182-4946683-0637966; path=/; domain=.amazon.com; expires=Tue, 
01-Jan-2036 08:00:01 GMT']

In [7]: 
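
The "Cookie" request header packs every cookie that was sent into a single 
"name=value; name=value" string, so it needs a slightly different split than 
Set-Cookie. A minimal sketch, assuming that simple format; 
cookies_from_cookie_header is a hypothetical helper name, not something 
Scrapy provides:

```python
def cookies_from_cookie_header(header_values):
    """Map cookie name -> value from raw Cookie request header values.

    Hypothetical helper (not a Scrapy API). Each header value is
    expected to look like "a=1; b=2; c=3".
    """
    cookies = {}
    for raw in header_values:
        # Header values may be bytes or str depending on the version.
        if isinstance(raw, bytes):
            raw = raw.decode("latin-1")
        for pair in raw.split(";"):
            # Split on the first "=" only, so values ending in "=" survive.
            name, sep, value = pair.strip().partition("=")
            if sep:
                cookies[name] = value
    return cookies


# Sample value shortened from the Out[5] result above.
sent = cookies_from_cookie_header(
    ["session-id=182-4946683-0637966; session-id-time=2082787201l; skin=noskin"]
)
```

In a spider callback the input would be 
response.request.headers.getlist("Cookie").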



On Wednesday, July 2, 2014 3:51:00 PM UTC+2, Paul Tremberth wrote:
>
> You can get "Set-Cookie" headers from the responses
>
> $ scrapy shell "http://www.amazon.com" --set USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
> 2014-07-02 14:53:12+0200 [scrapy] INFO: Scrapy 0.24.1 started (bot: 
> scrapybot)
> 2014-07-02 14:53:12+0200 [scrapy] INFO: Optional features available: ssl, 
> http11, boto
> 2014-07-02 14:53:12+0200 [scrapy] INFO: Overridden settings: 
> {'LOGSTATS_INTERVAL': 0, 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) 
> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'}
> 2014-07-02 14:53:12+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, 
> CloseSpider, WebService, CoreStats, SpiderState
> 2014-07-02 14:53:12+0200 [scrapy] INFO: Enabled downloader middlewares: 
> HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, 
> RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, 
> HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, 
> ChunkedTransferMiddleware, DownloaderStats
> 2014-07-02 14:53:12+0200 [scrapy] INFO: Enabled spider middlewares: 
> HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, 
> UrlLengthMiddleware, DepthMiddleware
> 2014-07-02 14:53:12+0200 [scrapy] INFO: Enabled item pipelines: 
> 2014-07-02 14:53:12+0200 [scrapy] DEBUG: Telnet console listening on 
> 127.0.0.1:6023
> 2014-07-02 14:53:12+0200 [scrapy] DEBUG: Web service listening on 
> 127.0.0.1:6080
> 2014-07-02 14:53:12+0200 [default] INFO: Spider opened
> 2014-07-02 14:53:13+0200 [default] DEBUG: Crawled (200) <GET 
> http://www.amazon.com> (referer: None)
> [s] Available Scrapy objects:
> [s]   crawler    <scrapy.crawler.Crawler object at 0x7f9ff6894bd0>
> [s]   item       {}
> [s]   request    <GET http://www.amazon.com>
> [s]   response   <200 http://www.amazon.com>
> [s]   settings   <scrapy.settings.Settings object at 0x7f9ff6f35d50>
> [s]   spider     <Spider 'default' at 0x7f9ff5feae50>
> [s] Useful shortcuts:
> [s]   shelp()           Shell help (print this help)
> [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
> [s]   view(response)    View response in a browser
>
> In [1]: response.headers
> Out[1]: 
> {'Cache-Control': 'no-cache',
>  'Content-Type': 'text/html; charset=ISO-8859-1',
>  'Date': 'Wed, 02 Jul 2014 12:53:13 GMT',
>  'Expires': '-1',
>  'P3P': 'policyref="http://www.amazon.com/w3c/p3p.xml",CP="CAO DSP LAW 
> CUR ADM IVAo IVDo CONo OTPo OUR DELi PUBi OTRi BUS PHY ONL UNI PUR FIN COM 
> NAV INT DEM CNT STA HEA PRE LOC GOV OTC "',
>  'Pragma': 'no-cache',
>  'Server': 'Server',
>  'Set-Cookie': 'session-id=185-4345826-3198169; path=/; domain=.amazon.com; 
> expires=Tue, 01-Jan-2036 08:00:01 GMT',
>  'Vary': 'Accept-Encoding,User-Agent',
>  'X-Amz-Id-1': '0HSR62FXE7WW8GGJ3003',
>  'X-Amz-Id-2': 
> 'TX9doI/wHzZDQLi61C/nIydE0Sv7wjkhNs30li5KMVSEWLqRqVSvL03WYmkTnASu',
>  'X-Frame-Options': 'SAMEORIGIN'}
>
> In [2]: 
>
>
> And "Cookie" headers from response.request:
>
> In [2]: fetch('http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27')
> 2014-07-02 15:47:23+0200 [default] DEBUG: Crawled (200) <GET 
> http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27> (referer: None)
> [s] Available Scrapy objects:
> [s]   crawler    <scrapy.crawler.Crawler object at 0x7f9ff6894bd0>
> [s]   item       {}
> [s]   request    <GET http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27>
> [s]   response   <200 http://www.amazon.com/gp/goldbox/ref=cs_top_nav_gb27>
> [s]   settings   <scrapy.settings.Settings object at 0x7f9ff6f35d50>
> [s]   spider     <Spider 'default' at 0x7f9ff5feae50>
> [s] Useful shortcuts:
> [s]   shelp()           Shell help (print this help)
> [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
> [s]   view(response)    View response in a browser
>
> In [3]: response.headers
> Out[3]: 
> {'Content-Type': 'text/html; charset=ISO-8859-1',
>  'Date': 'Wed, 02 Jul 2014 13:47:22 GMT',
>  'P3P': 'policyref="http://www.amazon.com/w3c/p3p.xml",CP="CAO DSP LAW 
> CUR ADM IVAo IVDo CONo OTPo OUR DELi PUBi OTRi BUS PHY ONL UNI PUR FIN COM 
> NAV INT DEM CNT STA HEA PRE LOC GOV OTC "',
>  'Server': 'Server',
>  'Set-Cookie': 'session-id=185-4345826-3198169; path=/; domain=.amazon.com; 
> expires=Tue, 01-Jan-2036 08:00:01 GMT',
>  'Vary': 'Accept-Encoding,User-Agent',
>  'X-Amz-Id-1': '0C0QXN1ZK555MP10HWB5',
>  'X-Amz-Id-2': 
> 'CcXo3odRFUSFkmnICLBbdhYKKmiygNJ/b7c3s74p2mWaRnqldFyDmhrdB9PPVK6O',
>  'X-Frame-Options': 'SAMEORIGIN'}
>
> In [4]: response.request.headers
> Out[4]: 
> {'Accept': 
> 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
>  'Accept-Encoding': 'gzip,deflate',
>  'Accept-Language': 'en',
>  'Cookie': 'session-id=185-4345826-3198169; session-id-time=2082787201l; x-wl-uid=1/kDeNun+YQYYmW1esQBg6XsiW68oMT1FJXDavoxODm1tzaDnaKf1KOMU+Jmni6iWQngWZhCnOjI=; skin=noskin',
>  'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, 
> like Gecko) Chrome/35.0.1916.153 Safari/537.36'}
>
>
>
>
> On Wednesday, July 2, 2014 7:18:31 AM UTC+2, Reggie wrote:
>>
>> I want to read cookies when I parse a response, but I can't find them in 
>> either response.meta or response.headers. How can I read cookies?
>>
>
