True has been removed from my settings. Thanks again!

On Thursday, February 13, 2014 12:51:45 AM UTC-5, Nikolaos-Digenis Karagiannis wrote:
>
> Yes, Referer survived as a typo. You may want to skip the setting in
> settings.py though:
> https://scrapy.readthedocs.org/en/latest/topics/settings.html#std:setting-SPIDER_MIDDLEWARES_BASE
> It is enabled by default.
>
> After seeing the above link you will probably notice the bug in your settings.
> Most people use integers for middleware sorting keys.
> However, because True has a __cmp__ method, it will be used for sorting:
> https://github.com/scrapy/scrapy/blob/c886d7459f0e259606255812102caf77e40aa7e7/scrapy/utils/conf.py#L15-L16
> In a Python shell try:
>
>     1 == True
>     sorted([2, True, '0', []])
>
> This allows you to accidentally introduce such bugs by using types you
> didn't mean to sort. And your "True" did just that: it moved the
> RefererMiddleware to the bottom of the spider middleware stack.
> On the other hand, because build_component_list() doesn't check the
> types of the sorting keys, you can use real numbers and theoretically have
> infinite positions between middlewares.
>
> SPIDER_MIDDLEWARES = {
>     'project.downloadermiddlewares.keyoccupier.Above': 740,
>     'georgcantor.uncountability.InfiniteInfinities': 740.5,
>     'project.downloadermiddlewares.keyoccupier.Bellow': 741,
> }
>
> The documentation doesn't specify a type: "their values are the middleware
> orders".
> You could even use classes with their own __cmp__ method and do some magic.
> Whether to classify this as a bug or a feature remains an open discussion.
>
> On Thursday, 13 February 2014 01:14:44 UTC+2, Michael Pastore wrote:
>>
>> Nikolaos,
>>
>> Perfect! The Referer Middleware was just what I was looking for (I only
>> needed to capture the referring url and not the entire breadcrumb trail).
>>
>> It took me a bit of reading through posts to figure out how to actually
>> retrieve the referring url, and the basics are below:
>>
>> Add to your settings file:
>>
>> SPIDER_MIDDLEWARES = {
>>     'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': True,
>> }
>>
>> Then in your spider parser use the following to access the referring url:
>>
>> response.request.headers.get('Referer', None)  # btw: 'Referer' is the correct usage, 'Referrer' will not work
>>
>> Thanks again!
>>
>> On Monday, February 10, 2014 3:00:10 PM UTC-5, Michael Pastore wrote:
>>>
>>> I am writing a crawling spider, but for each url visited and parsed, the
>>> saved item needs to include the originating url.
>>>
>>> For example, let's say that start_urls = ["http://www.A.com"] and
>>> the initial list of urls to follow that are extracted by the
>>> SgmlLinkExtractor is ["http://www.B.com", "http://www.C.com"]. The spider
>>> engine would then schedule a visit to www.B.com and then www.C.com. When
>>> the spider crawls to www.B.com and the parse method extracts some data,
>>> I need the processed item to include a field with the originating url,
>>> which in this case is www.A.com.
>>>
>>> Like a breadcrumb trail, for each call to the parse method I need to
>>> look back one step. Is there an existing way to get this information?
>>>
>>> Much thanks
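
For anyone reading this thread later, here is a rough sketch of the ordering pitfall described in the quoted reply. It does not call Scrapy at all; it only mimics the idea of sorting a middleware dict by its values. The helper name order_components and every middleware path other than RefererMiddleware are made up for illustration. Note that the shell example in the quote relies on Python 2; on Python 3, sorting a list that mixes numbers, strings and lists raises TypeError, although True still compares equal to 1.

```python
from operator import itemgetter


def order_components(components):
    """Return component paths sorted by their order value.

    Hypothetical stand-in for the sorting idea described in the thread;
    this is NOT Scrapy's actual build_component_list().
    """
    return [path for path, _ in sorted(components.items(), key=itemgetter(1))]


# bool is a subclass of int and True == 1, so True sorts as if the order were 1.
settings = {
    'project.middlewares.First': 500,   # hypothetical path
    'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': True,
    'project.middlewares.Second': 700,  # hypothetical path
}
ordered = order_components(settings)

# Float keys slot in between integer ones, as in the 740 / 740.5 / 741 example.
between = order_components({'a.Above': 740, 'c.Below': 741, 'b.Between': 740.5})

for path in ordered:
    print(path)
```

Using a plain integer for the order, or leaving the middleware out of the setting since it is enabled by default, avoids the surprise.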
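
For the retrieval step, a minimal sketch of a helper that a parse callback could use. The name referring_url is hypothetical, and the stub objects at the bottom only stand in for a Scrapy response so that the snippet runs without Scrapy installed. It assumes the header value may come back as bytes (which is what more recent Scrapy versions on Python 3 return), so it decodes in that case.

```python
from types import SimpleNamespace


def referring_url(response):
    """Return the Referer of the request that produced `response`, or None."""
    value = response.request.headers.get('Referer', None)
    if isinstance(value, bytes):
        # Assumption: header values may be bytes; decode them for storage in an item.
        value = value.decode('utf-8', errors='replace')
    return value


# Stand-in objects for demonstration only; in a spider, Scrapy passes the real response.
followed = SimpleNamespace(
    request=SimpleNamespace(headers={'Referer': b'http://www.A.com'})
)
start = SimpleNamespace(request=SimpleNamespace(headers={}))

print(referring_url(followed))
print(referring_url(start))
```

In a callback this might be used as item['referer'] = referring_url(response), where the field name is an assumption. Requests created from start_urls are not generated from a response, so expect None for those.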
