> Over the past few days, I have been going over the Spider, Request and
> Response classes as you had suggested. I also tried to again understand
> Hadoop streaming. I have decided to make some changes to my idea of the
> implementation. Instead of files, I will be passing data using pipes by
> forking a process, as was originally suggested on the Ideas page.
Sounds good!

> I have implemented some basic functionality, wherein we can create a
> spider and set the domain and start_url properties. For now the "other"
> language is Python, but this other language can easily be anything else,
> as I am only doing writes and reads to stdout and stdin.

I think it's a good idea to get it working in Python first, and then it
should be easy to do the spiders in different languages.

> The structure of the program is this:
>
> - On the terminal you will type:
>   >> scrapy crawl streaming -a Input=/home/faisal/Dropbox/PROGRAMS/SCRAPY/sandbox/INPUT.py -a Output=/home/faisal/Dropbox/PROGRAMS/SCRAPY/sandbox/OUPUT.py
> - Here /home/faisal/Dropbox/PROGRAMS/SCRAPY/sandbox/INPUT.py is a Python
>   (can be any language) file, which sets the domain and start_url
>   properties.
> - Here /home/faisal/Dropbox/PROGRAMS/SCRAPY/sandbox/OUPUT.py is a Python
>   file, which reads a JSON file containing the response as created by
>   the parse method of my spider.
> - streaming.py is the spider.
>
> What do you think about this?

That's starting to resemble the idea now :) I would just use a single
Python file for the spider; I don't think there's any need to separate
input and output. In particular, I expect there may be requests that
depend on previous responses, so interacting with a single process is
easier:

>> scrapy crawl streaming -a cmd=/home/faisal/myspider.py ..

I see your parse method in streaming has some items - this shouldn't
happen. The parse method should be writing serialized responses to the
process (and perhaps reading data..). Getting the interaction between
Scrapy/Twisted and the external process to be smooth and robust is going
to take some work, but I'll help with the design once the project starts.
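To make that a bit more concrete, here is a very rough, untested sketch of
what I mean on the Scrapy side. The blocking reads/writes are only for
illustration (a real implementation would go through Twisted's process
support, e.g. reactor.spawnProcess), and the spider name, the JSON-lines
format and the message keys are all made up for the example:

# Very rough, untested sketch of the Scrapy side. Blocking I/O like this
# won't play nicely with Twisted, but it shows the kind of protocol I mean.
# StreamingSpider and the message keys below are placeholders, not an
# existing API.
import json
import subprocess

import scrapy


class StreamingSpider(scrapy.Spider):
    name = "streaming"

    def __init__(self, cmd, *args, **kwargs):
        super(StreamingSpider, self).__init__(*args, **kwargs)
        # One long-lived child process for the whole crawl, so later
        # requests can depend on earlier responses.
        self.proc = subprocess.Popen(
            [cmd],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            universal_newlines=True,
        )
        # The first line from the child says where the crawl starts.
        first = json.loads(self.proc.stdout.readline())
        self.allowed_domains = first.get("allowed_domains", [])
        self.start_urls = first["start_urls"]

    def parse(self, response):
        # Hand a serialized response to the child...
        self.proc.stdin.write(json.dumps({
            "url": response.url,
            "status": response.status,
            "body": response.text,
        }) + "\n")
        self.proc.stdin.flush()
        # ...then read follow-up URLs until the child sends a blank line.
        while True:
            line = self.proc.stdout.readline()
            if not line.strip():
                break
            yield scrapy.Request(json.loads(line)["url"], callback=self.parse)

The important part is that there is a single child process for the whole
crawl: parse only serializes the response, hands it over, and turns
whatever comes back into new Requests.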

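And the external script (your myspider.py) could then be something as
simple as this - again just a sketch against that same made-up protocol;
the real serialization format is one of the things we'd have to design
properly:

#!/usr/bin/env python
# Sketch of the external side (e.g. myspider.py), speaking the same
# made-up JSON-lines protocol as the spider sketch above: announce the
# start URLs, then for each serialized response read from stdin, write
# any follow-up request URLs to stdout, ending with a blank line.
import json
import sys


def send(obj):
    sys.stdout.write(json.dumps(obj) + "\n")
    sys.stdout.flush()


# First message: where the crawl starts (example.com is just an example).
send({"allowed_domains": ["example.com"],
      "start_urls": ["http://example.com/"]})

for line in sys.stdin:
    response = json.loads(line)
    # Extract whatever you need from response["body"] here, then emit
    # follow-up requests, e.g. send({"url": "http://example.com/page2"}).
    sys.stdout.write("\n")   # blank line: no more requests for this response
    sys.stdout.flush()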