You're heading in the right direction. So, the files used are stdin, stdout and stderr, much like hadoop streaming.
I was thinking a user would do something like 'scrapy crawl scrapystreaming -a cmd=myprogram'. We would write a scrapystreaming spider that would execute myprogram and manage communication with it. The program would write requests to stdout and read responses from stdin (let's ignore logging, signals, stats, etc. for now). myprogram would do something like:

    import sys
    from json import loads, dumps

    # equivalent of scrapy start_requests() - the initial requests
    start_request = dict(method='GET', url='http://...')
    print dumps(start_request)

    for line in sys.stdin:
        response = loads(line)
        print dumps(parse_item(response))
        print dumps(parse_links_to_follow(response))

but of course, it could be written in any language.

I think a good starting place is to look at the Spider class (on github), requests and responses, and understand how spiders interact with scrapy. Getting a simple example like the above working shouldn't be too much work.

We'll also need to think about testing, so it's worth looking at how scrapy tests work. We'll need to do something similar: running test spiders, accessing fake websites. We'll also need to simulate errors like our spider crashing.

On 18 February 2014 07:09, faisal anees <[email protected]> wrote:
> Hi Shane,
>
> Thanks for your response. I looked at how Hadoop Streaming works, and I
> think I now have a vague idea of how this project will work out.
>
> Suppose we do this for a language X: the user would call some functions,
> which would write to a file. This output file would be given to scrapy as
> an input. We would then parse the file and give the entries as inputs to
> particular scrapy methods. All of the results would then be written to a
> file as JSON responses. This output file would be given as input to the
> language X, which would then store the results in some data structure.
>
> Am I right? Right now I am going through the scrapy codebase, after which
> I will provide some examples of what I am trying to do.
>
> Thanks
>
> On Tuesday, February 18, 2014 2:20:14 AM UTC+5:30, shane wrote:
>>
>> Hi Mohammed,
>>
>> It's nice to hear you've found Scrapy useful and are interested in GSoC.
>> Answers to your questions below.
>>
>>> I was interested in this idea on the ideas page, "Support for spiders in
>>> other languages". I had some questions regarding this:
>>>
>>> 1) Do we have to make wrappers or should the code be written in the
>>> other language from scratch?
>>>
>> The other language part can be written from scratch.
>>
>>> 2) Quoting from the ideas page: "The goal of this project is to allow
>>> developers to write spiders simply and easily in any programming
>>> language, while permitting Scrapy to manage concurrency, scheduling,
>>> item exporting, caching, etc." Does this mean this project will enable
>>> any programming language to use Scrapy, or will we be adding support
>>> for languages separately, one by one?
>>>
>> It will enable any language to be used from Scrapy. Users will simply
>> write a program that can read serialized Scrapy responses (probably as
>> JSON) and write serialized Requests and Items.
>>
>> By adding support for a given language in the form of a library, we can
>> make it more pleasant to implement spiders in that language. I used the
>> example of hadoop streaming, which can be used from any language;
>> however, if you use a python library like mrjob, hadoopy, dumbo, etc.,
>> it's a nicer experience. I added this as a stretch goal - it's optional.
>> I expect we can add something for python to make scrapy spiders run most
>> of the time just by changing an import, and possibly add another language
>> or two, depending on time.
>>
>>> 3) Which language will be better? This question will depend on who the
>>> target audience is: developers or scientists. We can expect developers
>>> to be familiar with Javascript/Ruby/Java/Python/etc., whereas
>>> scientists would know C/C++/Python/Java.
>>> This is just my view, I might be wrong too!
>>>
>> I'm not sure :) I expect C & C++ are probably not that convenient or
>> common for spider code; Java, JS & Ruby would probably be used, and
>> python could be useful for existing scrapy users (e.g. running spiders
>> that crash).
>>
>> Maybe someone reading this wants to make a case for a specific language?
>>
>> Cheers,
>>
>> Shane
>>
> --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/groups/opt_out.
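[Editor's note] The wrapper side of the protocol discussed above - a scrapystreaming spider that spawns the user's program and shuttles JSON lines over its stdin/stdout - can be sketched without Scrapy itself. Everything below is a hypothetical illustration of that handshake, not the actual scrapystreaming implementation: the child script, the `run_streaming_spider` helper, and the fields of the fake response (`url`, `status`, `body`) are all assumptions made for the example.

```python
import json
import subprocess
import sys

# A minimal child "spider" speaking the streaming protocol: it emits one
# start request on stdout, then prints an item for each response line it
# reads from stdin. In the real design this could be any language.
CHILD = r"""
import sys, json
print(json.dumps({"method": "GET", "url": "http://example.com"}))
sys.stdout.flush()
for line in sys.stdin:
    response = json.loads(line)
    print(json.dumps({"item": {"title": response["body"]}}))
    sys.stdout.flush()
"""

def run_streaming_spider():
    # The wrapper plays the Scrapy side: read the start request from the
    # child's stdout, "download" it, and feed the response back on stdin.
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    start_request = json.loads(proc.stdout.readline())
    # Pretend we downloaded the page; a real wrapper would hand the
    # request to Scrapy's downloader and serialize the real response.
    fake_response = {"url": start_request["url"], "status": 200, "body": "hello"}
    proc.stdin.write(json.dumps(fake_response) + "\n")
    proc.stdin.close()  # signals end of responses to the child
    item = json.loads(proc.stdout.readline())
    proc.wait()
    return start_request, item
```

Line-delimited JSON keeps the framing trivial in any language, at the cost of forbidding raw newlines inside messages; the flushes on both sides matter, since block-buffered pipes would otherwise deadlock.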
