Hi Everyone,

I am facing the same problem as Dan in one of my projects (the need for a 
scalable, generic crawler, possibly enhanced with probabilistic models).

I was wondering whether you have any new input on the subject, since the 
last response was more than three years ago.

How has Scrapy evolved, both in general and relative to "competitors" like 
Nutch?

Your input would be much appreciated!

Thanks,

Yung


On Friday, February 17, 2012 at 10:31:00 AM UTC+1, Neverlast N wrote:
>
>  
> "Then, of course, you need to actually extract the data you are 
> interested in from these web pages. I am not aware of anything that you 
> can plug into scrapy for this (if anyone is, please let me know). There 
> are a number of techniques for this, but so far I have not seen good 
> open source implementations. If you're writing one, I'd be interested to 
> know."
>
> My experience with Scrapy is the same. I also want to extract real estate 
> data from hundreds of websites, and the best I have come up with is to 
> build a set of tools that generate XPaths from training data, i.e. you 
> manually extract data for a few pages and the configuration is then 
> generated automatically.
>
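For anyone reading this thread now: below is a minimal sketch of the kind of 
XPath-induction tool Dimitris describes, assuming lxml is used for parsing. 
The function name, the sample markup and the printed path are only 
illustrative, not an existing Scrapy feature.

    # Rough sketch: given an example page and a value extracted by hand,
    # derive an XPath to the element that holds that value.
    # Assumes lxml is installed; infer_xpath and the sample HTML are made up.
    from lxml import html

    def infer_xpath(page_html, example_value):
        """Return an XPath for the first element whose text equals example_value."""
        tree = html.fromstring(page_html)
        for element in tree.iter():
            if element.text and element.text.strip() == example_value:
                # getroottree().getpath() gives an absolute XPath to this node
                return tree.getroottree().getpath(element)
        return None

    sample = "<html><body><table><tr><td id='price'> 100 </td></tr></table></body></html>"
    print(infer_xpath(sample, "100"))  # e.g. /html/body/table/tr/td

In practice you would probably generalize the raw path (e.g. prefer id- or 
class-based predicates) so the generated configuration survives small layout 
changes.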
> This would probably reduce the configuration time to less than 5 
> minutes/site, which for 5,000 sites works out to roughly 416 hours. The 
> good news is that this is a $3/h job rather than a $25/h job, i.e. it would 
> cost you less than $1,500 to create the configurations, and you can 
> massively parallelize it to have it done in, say, a week. The bad news is 
> that you will probably need another $2,000 to write custom scrapers for the 
> 20% of sites that can't be configured semi-automatically. If you decide 
> that 80% coverage is good enough, that's fine.
>
> You will still have to write the configuration "wizard" software, though :)
>
> Note: if it's a 5,000-site project, $10k is probably funny money. Hosting 
> will cost you ~$1k/month if you are crawling daily. Are you sure you NEED 
> all those sites, though?
>
> Cheers,
> Dimitris
>
>
>
> > Date: Wed, 15 Feb 2012 15:24:54 -0800
> > Subject: Can scrapy handle 5000+ website crawl and provide structured data?
> > From: [email protected]
> > To: [email protected]
> > 
> > Hi,
> > I'm looking at crawling 5,000+ websites and need a solution. They are
> > real estate listings, so the data is similar, but every site has its
> > own HTML code: they are all unique sites. No clean data feed or API is
> > available.
> > 
> > I am looking for a solution that is halfway intelligent, or that I can
> > program intelligence into: something I can just load the root domains
> > into, which then crawls, captures data between HTML tags, and presents
> > it in a somewhat orderly manner. I cannot write a unique parser for
> > every site. What I need is something that captures everything, so that
> > I then know that, say, field XYZ holds the price (because the HTML on
> > every page of that site that showed a price was
> > <td id=price> 100 </td>), for example.
> > 
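Reading this years later: the "capture everything between tags" idea could 
look roughly like the sketch below, keyed on id/class attributes. lxml and 
the sample markup are assumptions for illustration only.

    # Rough sketch of "capture everything" keyed by id/class, so a field such
    # as td#price can later be mapped to "price". Assumes lxml; sample invented.
    from lxml import html

    def capture_fields(page_html):
        """Return {key: text} for every element that carries an id or class."""
        tree = html.fromstring(page_html)
        fields = {}
        for element in tree.iter():
            key = element.get("id") or element.get("class")
            if key and element.text and element.text.strip():
                fields[key] = element.text.strip()
        return fields

    sample = ("<html><body><table><tr>"
              "<td id='price'> 100 </td><td class='beds'>3</td>"
              "</tr></table></body></html>")
    print(capture_fields(sample))  # {'price': '100', 'beds': '3'}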
> > Is Scrapy for me?
> > 
> > I'm hoping to load the captured data into some sort of DB and map the
> > fields to what I need (e.g. find the field that holds the price and call
> > it price); that then becomes the parser/clean data for that site until
> > the HTML changes.
> > 
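The field-mapping step could then be as simple as a per-site dictionary that 
renames captured keys to canonical names; the site names and mappings below 
are purely hypothetical.

    # Per-site mapping from raw captured keys to canonical field names.
    # The entries are hypothetical; in practice this would live in a DB or config.
    SITE_FIELD_MAP = {
        "example-realty.com": {"price": "price", "beds": "bedrooms"},
        "another-site.com":   {"cost": "price",  "rooms": "bedrooms"},
    }

    def normalize(site, raw_fields):
        """Rename one site's raw keys to canonical names, dropping unmapped keys."""
        mapping = SITE_FIELD_MAP.get(site, {})
        return {mapping[key]: value
                for key, value in raw_fields.items() if key in mapping}

    print(normalize("another-site.com", {"cost": "100", "rooms": "3", "junk": "x"}))
    # {'price': '100', 'bedrooms': '3'}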
> > Any ideas on how to do this with Scrapy, if possible?
> > 
>  
