Since we last exchanged ideas I've thought about it quite a lot, but sadly found it hard to come up with extra real-world uses for the project. Here I'll try to collect the results of my solo brainstorming sessions and invite anyone and everyone to an open discussion on the subject.
I'm posting this to the scrapy-users list because everyone, developer or not, is invited to read it and share their thoughts on my ramblings.

The basic idea is to control spiders through an HTTP API (I'll call it a REST API from now on; correct me on that if you like), somewhat similar to the currently available WebService.

- As Shane said before, it would be even more helpful if the same API gave access to all spiders encapsulated by one and the same project. Recapped: spiders are mapped either by their name in a parameter or by domain. I like the former better, as it feels more natural to address the spider in the same place where you tell it what to do. Both are possible.

- One would be able to dispatch jobs to spiders, either by sending start URLs (cfr. start_urls) or by sending Request objects (cfr. start_requests). There's a rough sketch of what this could look like at the bottom of this mail.

- The user would have full control over the results of these scrapes. The standard case would be for the spider to return Items (cfr. parse). However, the user could also opt for a more interactive approach and intercept the Responses themselves, effectively bypassing the regular parse method.

- The user can choose which pipelines items will go through. Pipelines are most often used for cleaning items and saving them in different formats. Since the user is remote, saving may not make as much sense as when the results are expected to appear locally; cleaning items, however, is a different matter.

- The API supports authentication. I still need to look into this properly, but I would like at least support for API keys: strings that a user supplies to gain access to the API. These keys could have rules tied to them, such as rate limits, a maximum number of uses, expiry dates, ...

- Some vaguer brainstorm material, more akin to the existing WebService: the user can manipulate CrawlSpider Rules, i.e. get, add and delete them.

The API is useful for anyone who wants to access their spiders remotely, be it for testing, practical or demo purposes. A cool addition would therefore be a small, clean and self-explanatory web interface on top of it. It should allow viewing the "raw" requests and responses as they are sent to and received from the API, but also cleaner representations of these messages. This could be really basic, e.g. a visual tree-like view of such objects, or really advanced, e.g. letting the user define widgets for how each field is rendered. Again, this is just a brainstorm and depends completely on where the emphasis of the project ends up.

I'll close this monologue by taking into account the existing projects around Scrapy that provide similar functionality: Scrapyd and the WebService, at least one of which most users will already have glanced at. Scrapyd can start any spider, and that is basically its greatest trump card. The WebService, on the other hand, is enabled automatically for one spider and allows monitoring and controlling it, though mostly the former. This project sits somewhere in between: it should preferably give access to multiple spiders (inside one project) at the same time, while putting the emphasis on interactive control.
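To make the dispatch, authentication and results points above a bit more concrete, here is a rough sketch of how a client could talk to such an API. To be clear: nothing of this exists yet; the base URL, the /spiders/<name>/jobs paths, the X-Api-Key header, the "return" flag and the pipeline name are all made up purely for illustration.

    # Purely hypothetical sketch -- none of these endpoints, parameters or
    # headers exist yet; they only illustrate the brainstorm above.
    import requests

    API = "http://localhost:8080/api"         # made-up base URL
    AUTH = {"X-Api-Key": "my-secret-key"}     # API-key authentication

    # Dispatch a job to a spider selected by name (cfr. start_urls):
    job = requests.post(
        API + "/spiders/myspider/jobs",
        headers=AUTH,
        json={
            "start_urls": ["http://example.com/products"],
            "pipelines": ["myproject.pipelines.CleanItemsPipeline"],
            "return": "items",                # or "responses" to bypass parse()
        },
    ).json()

    # Later, fetch the scraped items (or the raw responses, if requested):
    items = requests.get(
        API + "/spiders/myspider/jobs/%s/items" % job["id"],
        headers=AUTH,
    ).json()

The "return" flag is how I imagine choosing between getting Items back and intercepting the raw Responses, and picking pipelines per job would cover the cleaning-without-saving case.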
On 18 February 2014 22:13, Shane Evans <[email protected]> wrote:

>> Thanks for the great answer, Scrapinghub looks really promising by the way. Generating Parsley sounds interesting, but I feel you've basically got that covered with slybot and a UI on top of that.
>
> Sure. I think there is a lot of interesting work here, but it's not well defined yet. There are many cases where slybot will not do exactly what you want, so I like the idea of then generating python and continuing coding from there. It's also better than browser addons for working with xpaths (due to the fact it uses scrapy).
>
>> I'm currently back to looking in the direction of an HTTP API, yet I feel the project as we discussed it before is a bit immature on its own. If anyone has had any uses for an HTTP API for their Scrapy spiders that required some more intricate functionality, please get back to me so we can discuss how such an API could be extended beyond communicating with a single spider. In the meanwhile, I'll keep looking myself.
>
> I agree, as it stands it's a bit light. I welcome some suggestions, I'll think about it some more too.
>
> One addition I thought about was instead of a single spider, wrap a project and dispatch to any spider. Either based on the spider name passed, or have some domain -> spider mapping. This has come up before and would be useful.
