Hi everyone, I am facing the same problem as Dan in one of my projects (the need for a scalable, generic crawler, possibly enhanced with probabilistic models).
I was wondering if you might have some new input on the subject, since the last response was more than three years ago. How has Scrapy evolved, both in general and relative to "competitors" like Nutch? Your input would be much appreciated!

Thanks,
Yung

On Friday, February 17, 2012 at 10:31:00 UTC+1, Neverlast N wrote:
>
> "Then, of course, you need to actually extract the data you are
> interested in from these web pages. I am not aware of anything that you
> can plug into scrapy for this (if anyone is, please let me know). There
> are a number of techniques for this, but so far I have not seen good
> open source implementations. If you're writing one, I'd be interested to
> know."
>
> My experience with Scrapy is the same. I also want to extract real estate
> data from hundreds of websites, and the best I came up with is a set of
> tools that can generate XPaths from training data, i.e. you manually
> extract the data from a few pages and the configuration is then generated
> automatically.
>
> This would probably reduce the configuration time to less than 5
> minutes/site, which for 5000 sites means about 416 hours... The good news
> is that this is a $3/h job and not a $25/h job, i.e. it would cost you
> less than $1,500 to create the configurations, and you can massively
> parallelize to have it done in, e.g., a week. The bad news is that you
> will probably need another $2,000 to write custom scrapers for the 20% of
> sites that can't be configured semi-automatically. If you decide that 80%
> is good enough, that's fine.
>
> You will still have to write the configuration "wizard" software, though :)
>
> Note: if it's a 5000-site project, $10k is probably funny money. Hosting
> will cost you ~$1k/month if you are crawling daily. Are you sure you NEED
> all those sites, though?
>
> Cheers,
> Dimitris
>
> > Date: Wed, 15 Feb 2012 15:24:54 -0800
> > Subject: Can scrapy handle 5000+ website crawl and provide structured data?
> > From: [email protected]
> > To: [email protected]
> >
> > Hi,
> > I'm looking at crawling 5000+ websites and need a solution. They are
> > real estate listings, so the data is similar, but every site has its
> > own HTML code; they are all unique sites. No clean data feed or API is
> > available.
> >
> > I am looking for a solution that is halfway intelligent, or that I can
> > program intelligence into. Something I can just load the root domains
> > into; it crawls, captures data between HTML tags, and presents it in a
> > somewhat orderly manner. I cannot write a unique parser for every site.
> > What I need is something that will capture everything, so that I then
> > know that, say, field XYZ holds the price (because the HTML on every
> > page of that site that had a price was <td id=price> 100 </td>, for
> > example).
> >
> > Is Scrapy for me?
> >
> > I'm hoping to load the captured data into some sort of DB, map the
> > fields to what I need (e.g. find the field that holds the price and
> > call it "price"), and then that becomes the parser/clean data for that
> > site until the HTML changes.
> >
> > Any ideas on how to do this with Scrapy, if possible?
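For concreteness, here is a minimal sketch of the XPath "wizard" idea Dimitris describes: given a value you extracted by hand from one training page, derive an XPath that can be reused on the site's other pages. It is built on lxml (the library underneath Scrapy's selectors); the function infer_xpath and the exact-substring matching are illustrative assumptions, not an existing tool.

```python
# Minimal sketch of XPath induction from training data: hand-label a
# value on one page, get back an XPath you can try on the site's other
# pages. Requires lxml; infer_xpath is a hypothetical helper, not part
# of Scrapy or lxml.
from lxml import html

def infer_xpath(page_source, training_value):
    """Return an XPath for the first element whose text contains the
    manually extracted training value, or None if nothing matches."""
    tree = html.fromstring(page_source)
    for element in tree.iter():
        if element.text and training_value in element.text:
            # getpath() yields a raw positional path such as
            # /html/body/table/tr/td
            return tree.getroottree().getpath(element)
    return None

page = "<html><body><table><tr><td id='price'> 100 </td></tr></table></body></html>"
print(infer_xpath(page, "100"))  # e.g. /html/body/table/tr/td
```

A real wizard would then generalize the raw path (for example, preferring @id/@class predicates over positional indexes) so the same expression survives small layout differences between pages of one site.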
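And for Dan's "capture everything, map the fields later" approach, a generic Scrapy CrawlSpider along these lines could feed a DB with provisionally named fields. The class name, the example domain, and the id/class capture heuristic are assumptions for illustration (and it presumes a reasonably recent Scrapy), not the one canonical way to do it:

```python
# Rough sketch of "capture everything, map later" as a generic spider:
# follow links within a site and yield the text of every element that
# carries an id or class, so <td id="price"> 100 </td> comes out keyed
# as "price". GenericListingSpider and the domain are hypothetical.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class GenericListingSpider(CrawlSpider):
    name = "generic_listings"
    # In a real run the 5000+ root domains would be loaded from a file.
    allowed_domains = ["example-realestate.com"]
    start_urls = ["http://example-realestate.com/"]
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        record = {"url": response.url}
        # The id or class attribute becomes a provisional field name,
        # to be mapped to real fields (price, address, ...) later.
        for el in response.xpath("//*[@id or @class]"):
            key = el.xpath("./@id").get() or el.xpath("./@class").get()
            text = " ".join(el.xpath("./text()").getall()).strip()
            if key and text:
                record[key] = text
        yield record
```

The yielded records could then go through an item pipeline into the DB, where the "field XYZ is actually the price" mapping is done once per site, exactly as Dan outlines, and stays valid until that site's HTML changes.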
