Since we last exchanged ideas I've thought about it quite a lot, but sadly found it hard to come up with extra real-world uses for the project. Here I'll try to collect the results of my solo brainstorming sessions and invite anyone and everyone to an open discussion on the subject.
I'm posting this to the scrapy-users list because everyone, developer or not, is invited to read it and share their thoughts on my ramblings.

The basic idea is to control spiders through an HTTP API (I'll call it a REST API from now on; correct me on that if you like), somewhat similar to the currently available WebService.

- As Shane said before, it would be even more helpful if the same API gave access to all spiders encapsulated by one and the same project. Recapped: spiders are mapped either by their name in a parameter or by domain. I like the former better, as it feels more natural to address the spider in the same place where you tell it what to do. Both are possible.

- One would be able to dispatch jobs to spiders, either by sending start URLs (cfr. start_urls) or by sending Request objects (cfr. start_requests). There's a rough sketch of what this could look like at the bottom of this mail.

- The user would have full control over the results of these scrapes. The standard case would be for the spider to return Items (cfr. parse). However, the user could also opt for a more interactive approach and intercept the Responses themselves, effectively bypassing the regular parse method.

- The user can choose which pipelines items will go through. Pipelines are most often used for cleaning items and saving them in different formats. Since the user is remote, saving may not make as much sense as when the results are expected to appear locally; cleaning items, however, is a different matter.

- The API supports authentication. I still need to look into this properly, but I would like at least support for API keys: strings that a user supplies to gain access to the API. These keys could have rules tied to them, such as rate limits, a maximum number of uses, expiry dates, ...

- Some vaguer brainstorm material, more akin to the existing WebService: the user can manipulate CrawlSpider Rules, i.e. get, add and delete them.

The API is useful for anyone who wants to access their spiders remotely, be it for testing, practical or demo purposes. A cool addition would therefore be a small, clean and self-explanatory web interface on top of it. It should allow viewing the "raw" requests and responses as they are sent to and received from the API, but also cleaner representations of these messages. This could be really basic, e.g. a visual tree-like view of such objects, or really advanced, e.g. letting the user define widgets for how each field is rendered. Again, this is just a brainstorm and depends completely on where the emphasis of the project ends up.

I'll close this monologue by taking into account the existing projects around Scrapy that provide similar functionality: Scrapyd and the WebService, at least one of which most users will already have glanced at. Scrapyd can start any spider, and that is basically its greatest trump card. The WebService, on the other hand, is enabled automatically for one spider and allows monitoring and controlling it, though mostly the former. This project sits somewhere in between: it should preferably give access to multiple spiders (inside one project) at the same time, while putting the emphasis on interactive control.
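To make the dispatch, authentication and results points above a bit more concrete, here is a rough sketch of how a client could talk to such an API. To be clear: nothing of this exists yet; the base URL, the /spiders/<name>/jobs paths, the X-Api-Key header, the "return" flag and the pipeline name are all made up purely for illustration.

    # Purely hypothetical sketch -- none of these endpoints, parameters or
    # headers exist yet; they only illustrate the brainstorm above.
    import requests

    API = "http://localhost:8080/api"         # made-up base URL
    AUTH = {"X-Api-Key": "my-secret-key"}     # API-key authentication

    # Dispatch a job to a spider selected by name (cfr. start_urls):
    job = requests.post(
        API + "/spiders/myspider/jobs",
        headers=AUTH,
        json={
            "start_urls": ["http://example.com/products"],
            "pipelines": ["myproject.pipelines.CleanItemsPipeline"],
            "return": "items",                # or "responses" to bypass parse()
        },
    ).json()

    # Later, fetch the scraped items (or the raw responses, if requested):
    items = requests.get(
        API + "/spiders/myspider/jobs/%s/items" % job["id"],
        headers=AUTH,
    ).json()

The "return" flag is how I imagine choosing between getting Items back and intercepting the raw Responses, and picking pipelines per job would cover the cleaning-without-saving case.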
On 18 February 2014 22:13, Shane Evans <[email protected]> wrote:

>> Thanks for the great answer, Scrapinghub looks really promising by the way. Generating Parsley sounds interesting, but I feel you've basically got that covered with slybot and a UI on top of that.
>
> Sure. I think there is a lot of interesting work here, but it's not well defined yet. There are many cases where slybot will not do exactly what you want, so I like the idea of then generating python and continuing coding from there. It's also better than browser addons for working with xpaths (due to the fact it uses scrapy).
>
>> I'm currently back to looking in the direction of an HTTP API, yet I feel the project as we discussed it before is a bit immature on its own. If anyone has had any uses for an HTTP API for their Scrapy spiders that required some more intricate functionality, please get back to me so we can discuss how such an API could be extended beyond communicating with a single spider. In the meanwhile, I'll keep looking myself.
>
> I agree, as it stands it's a bit light. I welcome some suggestions, I'll think about it some more too.
>
> One addition I thought about was instead of a single spider, wrap a project and dispatch to any spider. Either based on the spider name passed, or have some domain -> spider mapping. This has come up before and would be useful.
