Re: Is there a breadcrumb trail?

Michael Pastore Sat, 01 Mar 2014 05:03:20 -0800

True has been removed from my settings. Thanks again!

On Thursday, February 13, 2014 12:51:45 AM UTC-5, Nikolaos-Digenis 
Karagiannis wrote:
>
> Yes, Referer survived as a typo. You may want to skip the setting in 
> settings.py though
>
> https://scrapy.readthedocs.org/en/latest/topics/settings.html#std:setting-SPIDER_MIDDLEWARES_BASE
> Enabled by default.
> After seeing the above link you probably notice the bug in your settings. 
> Most people use integers for middleware sorting keys.
> However because True has a __cmp__ method it will be used for sorting:
>
> https://github.com/scrapy/scrapy/blob/c886d7459f0e259606255812102caf77e40aa7e7/scrapy/utils/conf.py#L15-L16
> In a Python 2 shell try:
> 1 == True
> sorted([2, True, '0',[]])
> This allows you to accidentally introduce such bugs, using types you 
> didn't mean to sort. And your "True" did just that: it moved the 
> RefererMiddleware to the bottom of the spider middleware stack.
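
A minimal sketch of the effect described above, assuming a simplified stand-in for the merge-and-sort step. The function `ordered_components` and the order values in `BASE` are hypothetical illustrations, not Scrapy's actual code or defaults. It relies only on the fact that `True == 1` in Python, so a boolean key sorts as the integer 1 among the other keys.

```python
# Illustrative names and order values only; not Scrapy's real defaults.
BASE = {
    "HttpErrorMiddleware": 50,
    "OffsiteMiddleware": 500,
    "RefererMiddleware": 700,
    "UrlLengthMiddleware": 800,
    "DepthMiddleware": 900,
}


def ordered_components(base, custom):
    """Merge two {name: order} dicts and return the names sorted by order.

    Hypothetical helper: like the code discussed in the thread, it does
    not check the type of the sorting keys.
    """
    merged = dict(base)
    merged.update(custom)  # project settings override the base orders
    return [name for name, _ in sorted(merged.items(), key=lambda kv: kv[1])]


# True compares equal to 1, so it sorts ahead of every other key here.
buggy = ordered_components(BASE, {"RefererMiddleware": True})
# An explicit integer keeps the middleware at its intended position.
fixed = ordered_components(BASE, {"RefererMiddleware": 700})

print(buggy)
print(fixed)
```

Comparing the two printed lists shows the middleware jumping to one end of the order when the key is `True`, and staying between its neighbours when the key is an integer.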
> On the other hand, because build_component_list() doesn't check the 
> types of the sorting keys you can use real numbers and theoretically have 
> infinite positions between middlewares.
>
> SPIDER_MIDDLEWARES = {
>
>     'project.downloadermiddlewares.keyoccupier.Above': 740,
>     'georgcantor.uncountability.InfiniteInfinities': 740.5,
>     'project.downloadermiddlewares.keyoccupier.Bellow': 741,
> }
>
> The documentation doesn't specify a type: "their values are the middleware 
> orders"
> You could even use classes with their own __cmp__ method and do some magic.
> Classifying this as a bug or feature remains an open discussion.
> On Thursday, 13 February 2014 01:14:44 UTC+2, Michael Pastore wrote:
>>
>> Nikolaos,
>>
>> Perfect! The Referer Middleware was just what I was looking for (I only 
>> needed to capture the referring url and not the entire breadcrumb trail).
>>
>> It took me a bit of reading through posts to figure out how to actually 
>> retrieve the referring url, and the basics are below:
>>
>> Add to your settings file:
>>
>> SPIDER_MIDDLEWARES = {
>>
>> 'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': True,
>> }
>>
>>
>> Then in your spider parser use the following to access the referring url:
>>
>> # btw: 'Referer' is the correct usage, 'Referrer' will not work
>> response.request.headers.get('Referer', None)
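
A hedged sketch of how the lookup above might be wrapped when building an item. The helper `referring_url` and the `SimpleNamespace` objects are hypothetical stand-ins so the snippet runs without Scrapy installed; in a real spider the `response` argument would be the one passed to the parse callback. Decoding bytes is an assumption: header values may arrive as bytes or text depending on the Scrapy version.

```python
from types import SimpleNamespace


def referring_url(response):
    """Return the Referer header of the request behind `response`, or None."""
    value = response.request.headers.get("Referer", None)
    # Assumption: the header value may be bytes; normalise it to text.
    if isinstance(value, bytes):
        value = value.decode("utf-8", errors="replace")
    return value


# Hypothetical stand-ins for a Scrapy request/response pair.
fake_request = SimpleNamespace(headers={"Referer": b"http://www.A.com"})
fake_response = SimpleNamespace(request=fake_request, url="http://www.B.com")

# Store the originating url alongside the page that was parsed.
item = {"url": fake_response.url, "referer": referring_url(fake_response)}
print(item)
```

For a start url there is no referring request, so the helper returns None and the item field can be left empty.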
>>
>> Thanks again!
>>
>> On Monday, February 10, 2014 3:00:10 PM UTC-5, Michael Pastore wrote:
>>>
>>> I am writing a crawling spider but for each url visited and parsed, the 
>>> saved item needs to include the originating url.  
>>>
>>> For example, let's say start_urls = ["http://www.A.com"] and the initial 
>>> list of urls to follow, as extracted by the SgmlLinkExtractor, is 
>>> ["http://www.B.com", "http://www.C.com"]. The spider engine would then 
>>> schedule a visit to www.B.com, then www.C.com. When the spider crawls to 
>>> www.B.com and the parse method extracts some data, I need the processed 
>>> item to include a field with the originating url, which in this case is 
>>> www.A.com.
>>>
>>> Like a breadcrumb trail, for each call to the parse method I need to 
>>> look back one step. Is there an existing way to get this information? 
>>>
>>> Much thanks
>>>
>>


