> Over the past few days, I have been going over the Spider, Request and
> Response classes as you had suggested. I also tried to again understand
> Hadoop streaming. I have decided to make some changes to my idea of the
> implementation. Instead of files, I will be passing data using pipes by
> forking a process, as was originally suggested on the Ideas page.
Sounds good!

> I have implemented some basic functionality, wherein we can create a
> spider and set the domain and start_url properties. For now the "other"
> language is Python, but this other language can easily be anything else,
> as I am only doing writes and reads to stdout and stdin.

I think it's a good idea to get it working in Python first, and then it
should be easy to do the spiders in different languages.

> The structure of the program is this:
>
> - On the terminal you will type:
>   >> scrapy crawl streaming -a Input=/home/faisal/Dropbox/PROGRAMS/SCRAPY/sandbox/INPUT.py -a Output=/home/faisal/Dropbox/PROGRAMS/SCRAPY/sandbox/OUPUT.py
> - Here /home/faisal/Dropbox/PROGRAMS/SCRAPY/sandbox/INPUT.py is a Python
>   (can be any language) file, which sets the domain and start_url
>   properties.
> - Here /home/faisal/Dropbox/PROGRAMS/SCRAPY/sandbox/OUPUT.py is a Python
>   file, which reads a JSON file containing the response as created by
>   the parse method of my spider.
> - streaming.py is the spider.
>
> What do you think about this?

That's starting to resemble the idea now :) I would just use a single
Python file for the spider; I don't think there's any need to separate
input and output. In particular, I expect there may be requests that
depend on previous responses, so interacting with a single process is
easier:

>> scrapy crawl streaming -a cmd=/home/faisal/myspider.py ..

I see your parse method in streaming has some items - this shouldn't
happen. The parse method should be writing serialized responses to the
process (and perhaps reading data..). Getting the interaction between
Scrapy/Twisted and the external process to be smooth and robust is going
to take some work, but I'll help with the design once the project starts.
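To make that a bit more concrete, here is a very rough, untested sketch of
what I mean on the Scrapy side. The blocking reads/writes are only for
illustration (a real implementation would go through Twisted's process
support, e.g. reactor.spawnProcess), and the spider name, the JSON-lines
format and the message keys are all made up for the example:

# Very rough, untested sketch of the Scrapy side. Blocking I/O like this
# won't play nicely with Twisted, but it shows the kind of protocol I mean.
# StreamingSpider and the message keys below are placeholders, not an
# existing API.
import json
import subprocess

import scrapy


class StreamingSpider(scrapy.Spider):
    name = "streaming"

    def __init__(self, cmd, *args, **kwargs):
        super(StreamingSpider, self).__init__(*args, **kwargs)
        # One long-lived child process for the whole crawl, so later
        # requests can depend on earlier responses.
        self.proc = subprocess.Popen(
            [cmd],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            universal_newlines=True,
        )
        # The first line from the child says where the crawl starts.
        first = json.loads(self.proc.stdout.readline())
        self.allowed_domains = first.get("allowed_domains", [])
        self.start_urls = first["start_urls"]

    def parse(self, response):
        # Hand a serialized response to the child...
        self.proc.stdin.write(json.dumps({
            "url": response.url,
            "status": response.status,
            "body": response.text,
        }) + "\n")
        self.proc.stdin.flush()
        # ...then read follow-up URLs until the child sends a blank line.
        while True:
            line = self.proc.stdout.readline()
            if not line.strip():
                break
            yield scrapy.Request(json.loads(line)["url"], callback=self.parse)

The important part is that there is a single child process for the whole
crawl: parse only serializes the response, hands it over, and turns
whatever comes back into new Requests.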

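And the external script (your myspider.py) could then be something as
simple as this - again just a sketch against that same made-up protocol;
the real serialization format is one of the things we'd have to design
properly:

#!/usr/bin/env python
# Sketch of the external side (e.g. myspider.py), speaking the same
# made-up JSON-lines protocol as the spider sketch above: announce the
# start URLs, then for each serialized response read from stdin, write
# any follow-up request URLs to stdout, ending with a blank line.
import json
import sys


def send(obj):
    sys.stdout.write(json.dumps(obj) + "\n")
    sys.stdout.flush()


# First message: where the crawl starts (example.com is just an example).
send({"allowed_domains": ["example.com"],
      "start_urls": ["http://example.com/"]})

for line in sys.stdin:
    response = json.loads(line)
    # Extract whatever you need from response["body"] here, then emit
    # follow-up requests, e.g. send({"url": "http://example.com/page2"}).
    sys.stdout.write("\n")   # blank line: no more requests for this response
    sys.stdout.flush()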