Hi Karen, Frontera looks really great!
I will have a deep look at it; it could help me a lot! Many thanks,

Yung

On Wednesday, July 22, 2015 at 6:09:09 PM UTC+2, K Chenette wrote:
>
> Hi Yung,
> I recommend that you check out this page:
>
> http://blog.scrapinghub.com/2015/04/22/frontera-the-brain-behind-the-crawls/
>
> "last year we had 11 billion requests made on Scrapy Cloud alone"
>
> I believe Scrapinghub is the best source of information on scaling Scrapy crawls.
>
> Karen
>
> On Tuesday, July 21, 2015 at 7:22:41 PM UTC-5, Yung Bubu wrote:
>>
>> Hi everyone,
>>
>> I am facing the same problem as Dan in a project (the need for a scalable, generic crawler, possibly enhanced with probabilistic models).
>>
>> I was wondering if you might have some new input on the subject, since the last response was more than 3 years ago.
>>
>> How has Scrapy evolved, in general and relative to "competitors" like Nutch?
>>
>> Your input would be much appreciated!
>>
>> Thanks,
>>
>> Yung
>>
>> On Friday, February 17, 2012 at 10:31:00 AM UTC+1, Neverlast N wrote:
>>>
>>> "Then, of course, you need to actually extract the data you are interested in from these web pages. I am not aware of anything that you can plug into scrapy for this (if anyone is, please let me know). There are a number of techniques for this, but so far I have not seen good open source implementations. If you're writing one, I'd be interested to know."
>>>
>>> My experience with Scrapy is the same. I also want to extract real estate data from hundreds of websites, and the best approach I have come up with is a set of tools that generate XPaths from training data, i.e. you manually extract data from a few pages per site and the configuration is then generated automatically.
>>>
>>> This would probably reduce the configuration time to less than 5 minutes per site, which for 5,000 sites means about 416 hours (5,000 x 5 minutes). The good news is that this is a $3/h job, not a $25/h job, i.e. it would cost you less than $1,500 to create the configurations, and you can massively parallelize it to have it done in, say, a week. The bad news is that you will probably need another $2,000 to write custom scrapers for the ~20% of sites that can't be configured semi-automatically. If you decide that 80% coverage is good enough, that's fine.
>>>
>>> Still, you will have to write the configuration "wizard" software, though :)
>>>
>>> Note: if it's a 5,000-site project, $10k is probably funny money. Hosting will cost you ~$1k/month if you are crawling daily. Are you sure you NEED all those sites, though?
>>>
>>> Cheers,
>>> Dimitris
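As a rough illustration of the "XPaths from training data" idea Dimitris describes above: something along the following lines could derive a per-site XPath from a single hand-labelled example page. This is only a sketch, not an existing tool; the learn_xpath name and the prefer-an-id heuristic are assumptions, and it assumes lxml (which Scrapy already depends on) is available.

    from lxml import html

    def learn_xpath(page_source, example_value):
        """Return an XPath selecting the element whose text matches example_value."""
        tree = html.fromstring(page_source)
        for el in tree.iter():
            if el.text is not None and el.text.strip() == example_value:
                # Prefer an id-based anchor: ids tend to stay stable across
                # the other pages of the same site.
                if el.get("id"):
                    return '//%s[@id="%s"]/text()' % (el.tag, el.get("id"))
                # Otherwise fall back to the element's absolute path.
                return tree.getroottree().getpath(el) + "/text()"
        return None

    sample = '<html><body><table><tr><td id="price"> 100 </td></tr></table></body></html>'
    print(learn_xpath(sample, "100"))  # -> //td[@id="price"]/text()

Anchoring on an id when one exists is what makes the learned XPath likely to keep working on the rest of that site's pages; the absolute-path fallback is far more brittle.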
>>> > Date: Wed, 15 Feb 2012 15:24:54 -0800
>>> > Subject: Can scrapy handle 5000+ website crawl and provide structured data?
>>> > From: [email protected]
>>> > To: [email protected]
>>> >
>>> > Hi,
>>> > I'm looking at crawling 5,000+ websites and need a solution. They are real estate listings, so the data is similar, but every site has its own HTML; they are all unique sites. No clean data feed or API is available.
>>> >
>>> > I am looking for a solution that is halfway intelligent, or that I can program intelligence into. Something I can just load the root domains into, which then crawls, captures the data between HTML tags, and presents it in a somewhat orderly manner. I cannot write a unique parser for every site. What I need is something that captures everything, so that I will then know that, say, field XYZ holds the price (because the HTML on every page of that site that had a price was <td id=price> 100 </td>, for example).
>>> >
>>> > Is Scrapy for me?
>>> >
>>> > I'm hoping to load the captured data into some sort of DB, map the fields to what I need (e.g. find the field that is the price and call it "price"), and then that becomes the parser/clean data for that site until the HTML changes.
>>> >
>>> > Any ideas on how to do this with Scrapy, if possible?
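And, purely as a sketch of the "capture everything, map the fields later" approach Dan describes above: a spider along these lines would record the text of every element that carries an id, so that a column such as "price" can be identified and renamed afterwards. The spider name and start URL are placeholders, not anything from this thread.

    import scrapy

    class CaptureEverythingSpider(scrapy.Spider):
        name = "capture_everything"
        # Placeholder URL; in practice the 5,000 root domains would be loaded here.
        start_urls = ["http://www.example.com/listing/1"]

        def parse(self, response):
            item = {"url": response.url}
            # Record the text of every element that has an id attribute, so that
            # e.g. <td id="price"> 100 </td> ends up as item["price"] = "100".
            for el in response.xpath("//*[@id]"):
                key = el.xpath("@id").extract_first()
                value = " ".join(el.xpath("text()").extract()).strip()
                if key and value:
                    item[key] = value
            yield item

The resulting items can be dumped into a database as-is; deciding that the key "price" really is the price then becomes a one-off mapping step per site, which only needs redoing when that site's HTML changes.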
