That seems to be the issue, m8 :). Thanks, you saved my day!
On Friday, March 21, 2014 at 12:14:23 UTC+1, Paul Tremberth wrote:
>
> Could it be that readlines() leaves the \n at the end?
> try with
>
>     self.agents = [a.strip() for a in f.readlines()]
>
> or similar
>
> On Friday, March 21, 2014 12:09:11 PM UTC+1, James Ford wrote:
>>
>> Sure,
>>
>> Below you will find a crawl of http://doc.scrapy.org with a depth of 1
>> and extraction of inlinks only.
>>
>> http://pastebin.com/wE292pQe
>>
>> As you can see from the stats, the status 200 count is only 13. This is
>> not the case if I put my agent list directly in my module or if I disable
>> my middleware.
>>
>> Thanks
>>
>> On Friday, March 21, 2014 at 10:55:26 UTC+1, Paul Tremberth wrote:
>>>
>>> Can you share logs?
>>>
>>> On Fri, Mar 21, 2014 at 10:53 AM, James Ford <simon...@gmail.com> wrote:
>>> > Hello,
>>> >
>>> > I'm having an odd issue with one of my projects.
>>> >
>>> > I have implemented a custom middleware that rotates the user agent for
>>> > each request.
>>> >
>>> > The middleware reads a file when it is initialized and puts the
>>> > contents of the file into a list (in memory).
>>> >
>>> > As far as I can tell this should work fine, but I am getting a large
>>> > number of 400 Bad Request responses in my crawls. The odd thing is that
>>> > it works fine if I just put the agents in a list directly instead of
>>> > reading them from the file.
>>> >
>>> > What can cause this error?
>>> > Here is my middleware:
>>> >
>>> > class UserAgentPool():
>>> >     def __init__(self):
>>> >         basepath = os.path.dirname(__file__)
>>> >         filepath = os.path.abspath(os.path.join(basepath, "agents.txt"))
>>> >         with open(filepath, 'r') as f:
>>> >             self.agents = f.readlines()
>>> >
>>> >     def rotate(self):
>>> >         log.msg("Rotating user agent", level=log.DEBUG)
>>> >         agent = self.agents.pop(0)
>>> >         log.msg("Agent popped %s" % agent, level=log.DEBUG)
>>> >         log.msg("[%s]" % ", ".join(map(str, self.agents)), level=log.DEBUG)
>>> >         self.agents.append(agent)
>>> >         return agent
>>> >
>>> > class UserAgentRotationMiddleware(object):
>>> >     def __init__(self):
>>> >         self.pool = UserAgentPool()
>>> >
>>> >     def process_request(self, request, spider):
>>> >         if getattr(spider, 'agent_rotation', None):
>>> >             agent = self.pool.rotate()
>>> >             request.headers.setdefault('User-Agent', agent)
>>> >             log.msg("Setting User-Agent to %s" % request.headers["User-Agent"])
>>> >
>>> > --
>>> > You received this message because you are subscribed to the Google Groups
>>> > "scrapy-users" group.
>>> > To unsubscribe from this group and stop receiving emails from it, send an
>>> > email to scrapy-users...@googlegroups.com.
>>> > To post to this group, send email to scrapy...@googlegroups.com.
>>> > Visit this group at http://groups.google.com/group/scrapy-users.
>>> > For more options, visit https://groups.google.com/d/optout.
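
For anyone finding this thread later, here is a minimal sketch of the fix discussed above. It shows that readlines() keeps the trailing "\n" on each line and that stripping it gives clean agent strings. It leaves out Scrapy and logging so it runs on its own. The `filepath` parameter, the `demo()` helper and the sample agent strings are assumptions made for illustration, not part of the original middleware. The idea that a header value ending in a newline is what triggered the 400 responses is the likely explanation in this thread, not something verified here.

```python
import os
import tempfile


class UserAgentPool:
    """Round-robin pool of user agent strings loaded from a text file."""

    def __init__(self, filepath):
        # `filepath` is a hypothetical parameter; the original code built the
        # path to agents.txt from __file__ instead.
        with open(filepath, "r") as f:
            # strip() removes the trailing "\n" that readlines() keeps;
            # blank lines are skipped so no empty header value is produced.
            self.agents = [line.strip() for line in f if line.strip()]

    def rotate(self):
        # Take the first agent and move it to the back of the list.
        agent = self.agents.pop(0)
        self.agents.append(agent)
        return agent


def demo():
    # Write a small agents file, including a blank line, to a temporary path.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("Mozilla/5.0 (X11; Linux x86_64)\n\nOpera/9.80 (Windows NT 6.1)\n")
        path = tmp.name
    try:
        with open(path) as f:
            raw = f.readlines()  # what the original middleware stored
        pool = UserAgentPool(path)
        return raw, [pool.rotate() for _ in range(3)]
    finally:
        os.remove(path)
```

Calling demo() returns the raw lines (each still ending in "\n") next to three rotated agents with no trailing whitespace. In the middleware itself, process_request could stay as it is and only the file-reading line would need to change.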