Have you tried pyspider <https://github.com/binux/pyspider>? It may fit, but it needs some customization.
It uses RabbitMQ with msgpack <http://msgpack.org/> (to support binary content), an asynchronous HTTP fetcher, and a distributed/multiprocessing processor. In your case the scheduler and web UI are probably not needed: you can connect your backend to the fetcher, then to the processor, and then back into your own system. You may need to customize the fetcher to accept JSON and the processor to load scripts from local files rather than from the database. A rough sketch of the backend-to-fetcher side is below the quoted message.

On Wednesday, December 3, 2014 7:02:35 PM UTC+8, [email protected] wrote:
> Hi,
> I am looking for a web-scraping framework that is both easy to use/configure and can also be part of a large backend. The backend gets thousands of requests per second and needs to do scraping as part of its logic. We use RabbitMQ, and I wonder if Scrapy can be part of such a system. Each request carries a different URL (there is a small set of domains, but the path/query etc. is dynamic).
>
> So, I wonder about the following questions:
> 1. Can I pass the URL as an argument? The spider is known by the domain, but the URL is dynamic, so the spider has to receive it dynamically.
> 2. Integration, performance and scale: I've read that in a running system Scrapy can be invoked through the Scrapyd JSON API, which actually starts a process. So in my system, which passes lots of Rabbit messages around, each scraping job would issue a JSON request and we'd end up with lots of concurrent processes. Imagine a single spider takes 2 seconds; the number of processes would then keep growing until it chokes the server. I fear this model is problematic.
>
> I'd appreciate your advice.
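As mentioned above, here is a rough sketch of the backend-to-fetcher connection: publishing a task over RabbitMQ encoded with msgpack, using pika. The queue name, task fields, and connection details here are assumptions for illustration, not pyspider's actual message schema, so check the fetcher code for the exact format it expects.

    import msgpack
    import pika

    # Hypothetical queue name and task layout -- adjust to the fetcher's actual schema.
    QUEUE = "fetcher_in"

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)

    task = {
        "taskid": "req-12345",                     # unique id from your backend
        "url": "http://example.com/dynamic/path",  # the per-request URL
        "fetch": {"timeout": 60},                  # assumed fetch options
    }

    channel.basic_publish(
        exchange="",
        routing_key=QUEUE,
        body=msgpack.packb(task),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )
    connection.close()

msgpack is used instead of JSON here so that binary response content can travel over the same queues on the way back.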

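On question 1 in the quoted message: if you do stay with Scrapy, the dynamic URL can be passed as a spider argument, and Scrapyd forwards any extra schedule.json parameters as spider arguments. A minimal sketch (the spider name and extraction logic are placeholders, not a finished crawler):

    import scrapy

    class DynamicUrlSpider(scrapy.Spider):
        name = "dynamic_url"

        def __init__(self, url=None, *args, **kwargs):
            super(DynamicUrlSpider, self).__init__(*args, **kwargs)
            # The URL arrives as a spider argument (-a url=... or a schedule.json parameter).
            self.url = url

        def start_requests(self):
            yield scrapy.Request(self.url, callback=self.parse)

        def parse(self, response):
            # Placeholder item; replace with your real extraction logic.
            yield {"url": response.url, "title": response.xpath("//title/text()").extract()}

You would then schedule a run with something like `curl http://localhost:6800/schedule.json -d project=myproject -d spider=dynamic_url -d url=http://example.com/some/path`, since Scrapyd passes any parameter it does not recognize through to the spider.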