Hi Karen, Frontera looks really great!
I will have a deep look at it; it could help me a lot! Many thanks,

Yung

On Wednesday, July 22, 2015 at 6:09:09 PM UTC+2, K Chenette wrote:
>
> Hi Yung,
> I recommend that you check out this page:
>
> http://blog.scrapinghub.com/2015/04/22/frontera-the-brain-behind-the-crawls/
>
> "last year we had 11 billion requests made on Scrapy Cloud alone"
>
> I believe Scrapinghub is the best source of information on scaling Scrapy crawls.
>
> Karen
>
> On Tuesday, July 21, 2015 at 7:22:41 PM UTC-5, Yung Bubu wrote:
>>
>> Hi everyone,
>>
>> I am facing the same problem as Dan in a project (the need for a scalable, generic crawler, possibly enhanced with probabilistic models).
>>
>> I was wondering if you might have some new input on the subject, since the last response was more than 3 years ago.
>>
>> How has Scrapy evolved, in general and relative to "competitors" like Nutch?
>>
>> Your input would be much appreciated!
>>
>> Thanks,
>>
>> Yung
>>
>> On Friday, February 17, 2012 at 10:31:00 AM UTC+1, Neverlast N wrote:
>>>
>>> "Then, of course, you need to actually extract the data you are interested in from these web pages. I am not aware of anything that you can plug into scrapy for this (if anyone is, please let me know). There are a number of techniques for this, but so far I have not seen good open source implementations. If you're writing one, I'd be interested to know."
>>>
>>> My experience with Scrapy is the same. I also want to extract real estate data from hundreds of websites, and the best approach I have come up with is a set of tools that generate XPaths from training data, i.e. you manually extract data from a few pages per site and the configuration is then generated automatically.
>>>
>>> This would probably reduce the configuration time to less than 5 minutes per site, which for 5,000 sites means about 416 hours (5,000 x 5 minutes). The good news is that this is a $3/h job, not a $25/h job, i.e. it would cost you less than $1,500 to create the configurations, and you can massively parallelize it to have it done in, say, a week. The bad news is that you will probably need another $2,000 to write custom scrapers for the ~20% of sites that can't be configured semi-automatically. If you decide that 80% coverage is good enough, that's fine.
>>>
>>> Still, you will have to write the configuration "wizard" software, though :)
>>>
>>> Note: if it's a 5,000-site project, $10k is probably funny money. Hosting will cost you ~$1k/month if you are crawling daily. Are you sure you NEED all those sites, though?
>>>
>>> Cheers,
>>> Dimitris
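As a rough illustration of the "XPaths from training data" idea Dimitris describes above: something along the following lines could derive a per-site XPath from a single hand-labelled example page. This is only a sketch, not an existing tool; the learn_xpath name and the prefer-an-id heuristic are assumptions, and it assumes lxml (which Scrapy already depends on) is available.

    from lxml import html

    def learn_xpath(page_source, example_value):
        """Return an XPath selecting the element whose text matches example_value."""
        tree = html.fromstring(page_source)
        for el in tree.iter():
            if el.text is not None and el.text.strip() == example_value:
                # Prefer an id-based anchor: ids tend to stay stable across
                # the other pages of the same site.
                if el.get("id"):
                    return '//%s[@id="%s"]/text()' % (el.tag, el.get("id"))
                # Otherwise fall back to the element's absolute path.
                return tree.getroottree().getpath(el) + "/text()"
        return None

    sample = '<html><body><table><tr><td id="price"> 100 </td></tr></table></body></html>'
    print(learn_xpath(sample, "100"))  # -> //td[@id="price"]/text()

Anchoring on an id when one exists is what makes the learned XPath likely to keep working on the rest of that site's pages; the absolute-path fallback is far more brittle.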
>>> > Date: Wed, 15 Feb 2012 15:24:54 -0800
>>> > Subject: Can scrapy handle 5000+ website crawl and provide structured data?
>>> > From: [email protected]
>>> > To: [email protected]
>>> >
>>> > Hi,
>>> > I'm looking at crawling 5,000+ websites and need a solution. They are real estate listings, so the data is similar, but every site has its own HTML; they are all unique sites. No clean data feed or API is available.
>>> >
>>> > I am looking for a solution that is halfway intelligent, or that I can program intelligence into. Something I can just load the root domains into, which then crawls, captures the data between HTML tags, and presents it in a somewhat orderly manner. I cannot write a unique parser for every site. What I need is something that captures everything, so that I will then know that, say, field XYZ holds the price (because the HTML on every page of that site that had a price was <td id=price> 100 </td>, for example).
>>> >
>>> > Is Scrapy for me?
>>> >
>>> > I'm hoping to load the captured data into some sort of DB, map the fields to what I need (e.g. find the field that is the price and call it "price"), and then that becomes the parser/clean data for that site until the HTML changes.
>>> >
>>> > Any ideas on how to do this with Scrapy, if possible?
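And, purely as a sketch of the "capture everything, map the fields later" approach Dan describes above: a spider along these lines would record the text of every element that carries an id, so that a column such as "price" can be identified and renamed afterwards. The spider name and start URL are placeholders, not anything from this thread.

    import scrapy

    class CaptureEverythingSpider(scrapy.Spider):
        name = "capture_everything"
        # Placeholder URL; in practice the 5,000 root domains would be loaded here.
        start_urls = ["http://www.example.com/listing/1"]

        def parse(self, response):
            item = {"url": response.url}
            # Record the text of every element that has an id attribute, so that
            # e.g. <td id="price"> 100 </td> ends up as item["price"] = "100".
            for el in response.xpath("//*[@id]"):
                key = el.xpath("@id").extract_first()
                value = " ".join(el.xpath("text()").extract()).strip()
                if key and value:
                    item[key] = value
            yield item

The resulting items can be dumped into a database as-is; deciding that the key "price" really is the price then becomes a one-off mapping step per site, which only needs redoing when that site's HTML changes.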
