Hi Yung, I recommend that you check out this page: http://blog.scrapinghub.com/2015/04/22/frontera-the-brain-behind-the-crawls/
"last year we had 11 billion requests made on Scrapy Cloud alone" I believe Scraping Hub is the best source of information on scaling Scrapy crawls. Karen On Tuesday, July 21, 2015 at 7:22:41 PM UTC-5, Yung Bubu wrote: > > Hi Everyone, > > I am facing the same problematic as Dan in a project (need for a scalable, > generic crawler; possibly enhanced with probabilistic models). > > I was wondering if you might have some new inputs on the subject, since > the last response was more than 3 years ago. > > How has Scrapy evolved in general and relatively to "competitors" like > Nutch ? > > Your input would be much appreciated ! > > Thanks, > > Yung > > > Le vendredi 17 février 2012 10:31:00 UTC+1, Neverlast N a écrit : >> >> >> "Then, of course, you need to actually extract the data you are >> interested in from these web pages. I am not aware of anything that you >> can plug into scrapy for this (if anyone is, please let me know). There >> are a number of techniques for this, but so far I have not seen good >> open source implementations. If you're writing one, I'd be interested to >> know." >> >> My experience with Scrapy is the same. I want also to extract real estate >> data from 100's of websites and the best I came up with is to create a set >> of tools that will be able to create XPath by giving them some training >> data i.e. manually extracting data for a few pages and then the >> configuration will be automatically generated. >> >> This would probably reduce the configuration time to less than 5 >> minutes/site which means 416hours ... The good news is that this is $3/h >> job and not $25/h job i.e. it would cost you less than $1500 to create the >> configurations and you can massively paralelize to have it done in e.g. a >> week. The bad news is that you will need another $2000 probably to write >> custom scrappers for the 20% of the sites that can't be configured >> semi-automatically. If you choose that 80% is good enough - that's ok. >> >> Still you will have to write the configuration "wizard" software though :) >> >> Note: If it's a 5000 site project probably $10k is funny money. Hosting >> will be costing you ~$1k/month if you are crawling daily. Are you sure you >> NEED all those sites though? >> >> Cheers, >> Dimitris >> >> >> >> > Date: Wed, 15 Feb 2012 15:24:54 -0800 >> > Subject: Can scrapy handle 5000+ website crawl and provide structured >> data? >> > From: [email protected] >> > To: [email protected] >> > >> > Hi, >> > Im looking at crawling 5000 + websites and need a solution. They are >> > real estate listings, so the data is similar, but every site has its >> > own html code - they are all unique sites. No clean datafeed or api is >> > available. >> > >> > I am looking for a solution that is halfway intelligent, or I can >> > program intelligence into it. Something I can just load the root >> > domains into, it crawls, and will capture data between html tags and >> > present it in a somewhat orderly manner. I cannot write a unique >> > parser for every site. What I need is something that will capture >> > everything, then I will know that in say Field XYZ the price has been >> > stored (because the html code on every page of that site that had >> > price was <td id=price> 100 </td> ) for example. >> > >> > Is scrapy for me? >> > >> > Im hoping to load the captured data into some sort of DB, map the >> > fields to what I need (eg find the field that is price and call it >> > price) then that becomes the parser/clean data for that site until the >> > html changes. 
>>>
>>> Any ideas on how to do this with scrapy if possible?
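
Below is a minimal sketch of the "XPath from training data" idea Dimitris describes: give it one page plus a value you extracted by hand, and it proposes an XPath you can reuse across the site. It uses lxml (which Scrapy itself builds on); the function name `infer_xpath` and the id-preferring heuristic are hypothetical illustrations, not part of any existing tool mentioned in the thread.

```python
# Hypothetical sketch of wrapper induction from one labeled example.
# Not an existing library; names and heuristics are assumptions.
from lxml import html

def infer_xpath(page_html, example_value):
    """Return an XPath for the first element whose text matches
    example_value, preferring a stable id-based expression."""
    tree = html.fromstring(page_html)
    for element in tree.iter():
        if element.text and element.text.strip() == example_value:
            # If the element carries an id (like <td id="price">),
            # an id-based XPath survives layout changes better than
            # a positional one.
            if element.get("id"):
                return '//%s[@id="%s"]/text()' % (element.tag, element.get("id"))
            # Otherwise fall back to the absolute path lxml computes.
            return tree.getroottree().getpath(element) + "/text()"
    return None

page = '<html><body><table><tr><td id="price"> 100 </td></tr></table></body></html>'
print(infer_xpath(page, "100"))  # -> //td[@id="price"]/text()
```

A real "wizard" would run this over several training pages and keep only the XPaths that extract the right value on all of them.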
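And here is a rough sketch of the "capture everything, map fields later" approach from Dan's question, as a generic Scrapy spider that records every id/class-tagged value so a field like "price" can be identified afterwards. The spider name, start URL, and follow-everything link rule are placeholders, and it is written against today's Scrapy API; a real crawl over 5000+ domains would also need allowed-domain limits, deduplication, and politeness settings.

```python
# Hypothetical generic spider: capture all tagged values, map fields later.
import scrapy

class GenericListingSpider(scrapy.Spider):
    name = "generic_listings"
    # In a real run you would load the 5000+ root domains from a file.
    start_urls = ["http://example.com/listings"]

    def parse(self, response):
        # Capture every element carrying an id or class attribute, keyed
        # by that attribute: <td id=price> 100 </td> becomes {"price": "100"}.
        record = {}
        for sel in response.xpath("//*[@id or @class]"):
            key = sel.attrib.get("id") or sel.attrib.get("class")
            value = sel.xpath("normalize-space(text())").get()
            if key and value:
                record[key] = value
        if record:
            yield record
        # Follow in-site links so the whole site gets covered.
        for href in response.xpath("//a/@href").getall():
            yield response.follow(href, callback=self.parse)
```

The yielded dicts would go into a DB, where a human (or a heuristic) maps each site's keys to canonical fields - the mapping then serves as that site's "parser" until its HTML changes, as Dan describes.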
