Hi Yung, I recommend that you check out this page: http://blog.scrapinghub.com/2015/04/22/frontera-the-brain-behind-the-crawls/
"last year we had 11 billion requests made on Scrapy Cloud alone" I believe Scraping Hub is the best source of information on scaling Scrapy crawls. Karen On Tuesday, July 21, 2015 at 7:22:41 PM UTC-5, Yung Bubu wrote: > > Hi Everyone, > > I am facing the same problematic as Dan in a project (need for a scalable, > generic crawler; possibly enhanced with probabilistic models). > > I was wondering if you might have some new inputs on the subject, since > the last response was more than 3 years ago. > > How has Scrapy evolved in general and relatively to "competitors" like > Nutch ? > > Your input would be much appreciated ! > > Thanks, > > Yung > > > Le vendredi 17 février 2012 10:31:00 UTC+1, Neverlast N a écrit : >> >> >> "Then, of course, you need to actually extract the data you are >> interested in from these web pages. I am not aware of anything that you >> can plug into scrapy for this (if anyone is, please let me know). There >> are a number of techniques for this, but so far I have not seen good >> open source implementations. If you're writing one, I'd be interested to >> know." >> >> My experience with Scrapy is the same. I want also to extract real estate >> data from 100's of websites and the best I came up with is to create a set >> of tools that will be able to create XPath by giving them some training >> data i.e. manually extracting data for a few pages and then the >> configuration will be automatically generated. >> >> This would probably reduce the configuration time to less than 5 >> minutes/site which means 416hours ... The good news is that this is $3/h >> job and not $25/h job i.e. it would cost you less than $1500 to create the >> configurations and you can massively paralelize to have it done in e.g. a >> week. The bad news is that you will need another $2000 probably to write >> custom scrappers for the 20% of the sites that can't be configured >> semi-automatically. If you choose that 80% is good enough - that's ok. >> >> Still you will have to write the configuration "wizard" software though :) >> >> Note: If it's a 5000 site project probably $10k is funny money. Hosting >> will be costing you ~$1k/month if you are crawling daily. Are you sure you >> NEED all those sites though? >> >> Cheers, >> Dimitris >> >> >> >> > Date: Wed, 15 Feb 2012 15:24:54 -0800 >> > Subject: Can scrapy handle 5000+ website crawl and provide structured >> data? >> > From: [email protected] >> > To: [email protected] >> > >> > Hi, >> > Im looking at crawling 5000 + websites and need a solution. They are >> > real estate listings, so the data is similar, but every site has its >> > own html code - they are all unique sites. No clean datafeed or api is >> > available. >> > >> > I am looking for a solution that is halfway intelligent, or I can >> > program intelligence into it. Something I can just load the root >> > domains into, it crawls, and will capture data between html tags and >> > present it in a somewhat orderly manner. I cannot write a unique >> > parser for every site. What I need is something that will capture >> > everything, then I will know that in say Field XYZ the price has been >> > stored (because the html code on every page of that site that had >> > price was <td id=price> 100 </td> ) for example. >> > >> > Is scrapy for me? >> > >> > Im hoping to load the captured data into some sort of DB, map the >> > fields to what I need (eg find the field that is price and call it >> > price) then that becomes the parser/clean data for that site until the >> > html changes. 
>>>
>>> Any ideas on how to do this with scrapy if possible?
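
Below is a minimal sketch of the "XPath from training data" idea Dimitris describes: give it one page plus a value you extracted by hand, and it proposes an XPath you can reuse across the site. It uses lxml (which Scrapy itself builds on); the function name `infer_xpath` and the id-preferring heuristic are hypothetical illustrations, not part of any existing tool mentioned in the thread.

```python
# Hypothetical sketch of wrapper induction from one labeled example.
# Not an existing library; names and heuristics are assumptions.
from lxml import html

def infer_xpath(page_html, example_value):
    """Return an XPath for the first element whose text matches
    example_value, preferring a stable id-based expression."""
    tree = html.fromstring(page_html)
    for element in tree.iter():
        if element.text and element.text.strip() == example_value:
            # If the element carries an id (like <td id="price">),
            # an id-based XPath survives layout changes better than
            # a positional one.
            if element.get("id"):
                return '//%s[@id="%s"]/text()' % (element.tag, element.get("id"))
            # Otherwise fall back to the absolute path lxml computes.
            return tree.getroottree().getpath(element) + "/text()"
    return None

page = '<html><body><table><tr><td id="price"> 100 </td></tr></table></body></html>'
print(infer_xpath(page, "100"))  # -> //td[@id="price"]/text()
```

A real "wizard" would run this over several training pages and keep only the XPaths that extract the right value on all of them.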
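And here is a rough sketch of the "capture everything, map fields later" approach from Dan's question, as a generic Scrapy spider that records every id/class-tagged value so a field like "price" can be identified afterwards. The spider name, start URL, and follow-everything link rule are placeholders, and it is written against today's Scrapy API; a real crawl over 5000+ domains would also need allowed-domain limits, deduplication, and politeness settings.

```python
# Hypothetical generic spider: capture all tagged values, map fields later.
import scrapy

class GenericListingSpider(scrapy.Spider):
    name = "generic_listings"
    # In a real run you would load the 5000+ root domains from a file.
    start_urls = ["http://example.com/listings"]

    def parse(self, response):
        # Capture every element carrying an id or class attribute, keyed
        # by that attribute: <td id=price> 100 </td> becomes {"price": "100"}.
        record = {}
        for sel in response.xpath("//*[@id or @class]"):
            key = sel.attrib.get("id") or sel.attrib.get("class")
            value = sel.xpath("normalize-space(text())").get()
            if key and value:
                record[key] = value
        if record:
            yield record
        # Follow in-site links so the whole site gets covered.
        for href in response.xpath("//a/@href").getall():
            yield response.follow(href, callback=self.parse)
```

The yielded dicts would go into a DB, where a human (or a heuristic) maps each site's keys to canonical fields - the mapping then serves as that site's "parser" until its HTML changes, as Dan describes.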
