I think you've added a lot of useful details to the project, and the new ideas look good to me. I can think of some past projects where I could have used them, and that's always a good sign :) I've put a few rough sketches at the bottom, after the quoted thread, to make some of the ideas more concrete.
On 21 February 2014 15:23, Ruben Vereecken <[email protected]> wrote:
> Ever since we last exchanged ideas, I have thought about it quite a lot but sadly found it hard to come up with extra real-world uses for the project. Here I'll try to collect the results of my solo brainstorming sessions and invite anyone and everyone to an open discussion on the subject.
>
> I'm posting this to the scrapy-users list because any user, developer or not, is invited to read it and share their thoughts on my ramblings.
>
> The basic idea is to control spiders through an HTTP API (I'll say REST API from now on; correct me on this if you like), somewhat similar to the currently present WebService.
>
> - As Shane said before, it would be even more helpful if the same API gave access to all spiders encapsulated by one and the same project. So, recapped, spiders are selected either by their name in a parameter or by domain. I like the former better, as it feels more natural to pick the spider in the same place where you tell it what to do. Both are possible.
> - One would be able to dispatch jobs to spiders, either by sending start URLs (cf. start_urls) or by sending Request objects (cf. start_requests).
> - The user would have full control over the results of these crawls. The standard case would be for the spider to return Items (cf. parse). However, the user could also opt for a more interactive approach and intercept Responses as well, effectively bypassing the regular parse method.
> - The user can choose which pipelines items will go through. Pipelines are most often used for cleaning items and saving them in different formats. Since the user is remote, saving might not make as much sense as when the user expects results to appear locally; cleaning items, however, is a different matter.
> - The API supports authentication. I should look into this more properly, but I would like at least support for API keys. Generally these are strings a user supplies to gain access to the API. These keys could have rules tied to them, like rate limits, a maximum number of uses, expiry dates, ...
> - More vague brainstorm material, closer to the currently existing WebService: the user can manipulate CrawlSpider Rules, i.e. get, add and delete them.
>
> The API is useful for those who want to access their spiders remotely, be it for testing, practical or demo purposes. A nice addition would therefore be a small, clean and self-explanatory web interface. It should allow viewing the "raw" requests and responses as they are sent to and received from the API, but also clean representations of these messages. This could be really basic, such as a visual tree-like view of the objects, or really advanced, such as letting the user define widgets for how to render each field. Again, this is just a brainstorm and depends completely on where the emphasis of the project lies.
>
> I'll close this monologue by considering the existing projects around Scrapy that offer similar functionality: Scrapyd and WebService, at least one of which most users have already glanced at. Scrapyd allows starting any spider, and that's basically its greatest strength. WebService, on the other hand, is automatically enabled for a single spider and allows monitoring and controlling that spider, though mostly the former.
> This project is somewhere in between: it should preferably give access to multiple spiders (inside one project) at the same time, while placing the emphasis on interactive control.
>
>
> On 18 February 2014 22:13, Shane Evans <[email protected]> wrote:
>
>>> Thanks for the great answer; Scrapinghub looks really promising, by the way. Generating Parsley sounds interesting, but I feel you've basically got that covered with slybot and a UI on top of it.
>>
>> Sure. I think there is a lot of interesting work here, but it's not well defined yet. There are many cases where slybot will not do exactly what you want, so I like the idea of generating Python and continuing to code from there. It's also better than browser addons for working with XPaths (because it uses Scrapy).
>>
>>> I'm currently back to looking in the direction of an HTTP API, yet I feel the project as we discussed it before is a bit immature on its own. If anyone has needed an HTTP API for their Scrapy spiders that required some more intricate functionality, please get back to me so we can discuss how such an API could be extended beyond communicating with a single spider. In the meantime, I'll keep looking myself.
>>
>> I agree, as it stands it's a bit light. I welcome suggestions; I'll think about it some more too.
>>
>> One addition I thought about: instead of wrapping a single spider, wrap a project and dispatch to any spider, either based on a spider name passed in or on some domain -> spider mapping. This has come up before and would be useful.
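
To make the dispatch idea a bit more concrete (sending start URLs vs. full Request specifications), here is roughly how I picture the server side turning an incoming payload into Request objects. The endpoint path, field names and helper are placeholders I made up, not a proposal for the actual interface:

    # Rough sketch only: the payload fields and the endpoint path are invented.
    from scrapy.http import Request

    def build_requests(payload):
        """Turn a dispatch payload (already parsed from JSON) into Requests."""
        requests = []
        for url in payload.get("start_urls", []):
            requests.append(Request(url))
        for spec in payload.get("requests", []):
            requests.append(Request(
                url=spec["url"],
                method=spec.get("method", "GET"),
                headers=spec.get("headers"),
                body=spec.get("body"),
            ))
        return requests

    # e.g. the body of a hypothetical POST /projects/<project>/spiders/<name>/jobs:
    example_payload = {
        "start_urls": ["http://example.com/products"],
        "requests": [{"url": "http://example.com/products?page=2", "method": "GET"}],
    }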
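
For the API keys, even something this simple would cover expiry dates and a maximum number of uses as a starting point. Again just a sketch; how keys are issued and stored is completely open:

    # Illustrative only: an in-memory key table with optional expiry and a usage cap.
    import time

    API_KEYS = {
        "example-key-123": {"expires": None, "max_uses": 1000, "uses": 0},
    }

    def key_is_valid(key):
        record = API_KEYS.get(key)
        if record is None:
            return False
        if record["expires"] is not None and time.time() > record["expires"]:
            return False
        if record["uses"] >= record["max_uses"]:
            return False
        record["uses"] += 1
        return True

Rate limiting would need a timestamp per key on top of this, but the idea is the same.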
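
And on Shane's domain -> spider mapping: a naive first cut could simply reuse each spider's allowed_domains, something along these lines (purely illustrative; how the mapping would really be configured is an open question):

    # Pick a spider for a URL by matching its host against allowed_domains.
    from urlparse import urlparse  # Python 2, which Scrapy currently targets

    def spider_name_for(url, spider_classes):
        host = urlparse(url).hostname or ""
        for cls in spider_classes:
            for domain in getattr(cls, "allowed_domains", []):
                if host == domain or host.endswith("." + domain):
                    return cls.name
        return None

An explicit mapping in the project settings would of course also work; name-based dispatch stays the simpler option either way.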
