Hi all, I would like to work on the Scrapy project idea "IPython based IDE for Scrapy". Here are the logs of my discussion with the assigned mentor, Mikhail Korobov. I would appreciate your feedback on this. Thank you.
__
Hi Mike,

Thank you so much for replying! Sorry for the delayed response. I have done some research online on how we should go about this project. I have also checked out all the references that you gave.

1. IPython widgets are awesome. We definitely need these in order to convert a notebook into a functional IDE for Scrapy. I think we can use SelectorGadget to our advantage. It is open source, and we can adapt it and import it as an IPython widget. We can use this widget to let the developer explore the site and generate the required XPath/CSS selector. We can also use the "link" property of IPython to link the SelectorGadget widget (which acts here like a frontend?) and a cell in the notebook. Clicking on "Confirm" can execute the cell and start scraping with the selected XPath. We can either embed SelectorGadget in the cell output or show it as a popout that renders the HTML of the site and helps the developer get the right XPath.
2. Defaults - We also need to have some default cells. In these cells we can initialize Scrapy and our IPython widgets. If we are using any other JS, we can use IPython to load it as well.
3. HTML and JS - It is possible to run HTML and JS inside an IPython cell. I found out about this from [1], the section about %%html and %%javascript. Magic functions might come in handy in this project. The one about %%doctest too.
4. Autocompletion and snippets - I found out that IPython does provide tab completion (using Tab) and shows docstrings (using Shift-Tab), so with a little bit of hacking I feel we can tailor the autocompletion feature to our requirements. It does have one dependency, though: pyreadline. [2], [3]
5. I also went through inline-requests. Would it be possible to execute Scrapy's inline requests for crawling inside IPython cells? As you pointed out in the comments, maybe this can help in linearizing the flow.
6. Scope - I think it is clear that it will be possible to crawl a single page and extract data from it interactively. But we also need to think about crawlers, where we are going to crawl multiple sites, and this will happen in an async way? So we need a way to handle this too. Scrapy uses async methods to download pages and then calls the callbacks. Could we use the "await" functionality to make things look synchronous in the notebook?
7. Live display of responses - It is also possible to bind a cell and its output so that all the scraped data can be parsed and shown in a meaningful way, all while the scraping is going on. Could this be a feature?
8. Distribution - How are we going to distribute this IDE? I am guessing as an IPython notebook? It would be nice if we supplemented this with a bash script which can check and install all the required dependencies. A more advanced version of the script could install Scrapy as well, kind of like a complete package for web page scraping.
9. Practice - Currently I am learning more about Scrapy, its source code, and IPython widgets.
10. GitHub issues and pull requests - I have forked the Scrapy repo and would like to work on https://github.com/scrapy/scrapy/issues/1309 and https://github.com/scrapy/scrapy/issues/740

I put my introduction mail on this mailing list: https://groups.google.com/forum/#!forum/scrapy-users
Apparently my topic hasn't been approved. It would be great if I could post to the mailing list.

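To make the "await" question from the Scope point a bit more concrete, here is a minimal sketch that uses only Python's standard asyncio module. It is not Scrapy's API: `FAKE_SITE`, `fake_fetch`, and `crawl` are hypothetical names made up for illustration, and a real notebook would have to hand the requests to Scrapy instead of a fake downloader. It only shows how a multi-page crawl can read top to bottom when each download is awaited.

```python
import asyncio

# Hypothetical in-memory "site": maps a URL to the links found on that page.
# A real notebook would hand requests to Scrapy instead of this dictionary.
FAKE_SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": [],
    "http://example.com/b": [],
}


async def fake_fetch(url):
    """Pretend to download `url` and return the links found on it."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return FAKE_SITE[url]


async def crawl(start_url):
    """Fetch the start page, then each linked page, written linearly."""
    visited = [start_url]
    links = await fake_fetch(start_url)
    for link in links:
        await fake_fetch(link)  # no callback: the next line runs afterwards
        visited.append(link)
    return visited


visited = asyncio.run(crawl("http://example.com/"))
print(visited)
```

The code still runs on an event loop, but the crawl logic has no callbacks, which is roughly the "linearized" flow that inline requests aim for.
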
Regards,
Abhishek Shrivastava

[1] IPython Magic Functions - http://ipython.readthedocs.org/en/stable/interactive/magics.html
[2] Stack Overflow question about autocompletion - http://stackoverflow.com/questions/2603798/ipython-tab-completion-not-working
[3] Question about autocompletion and docstrings - http://stackoverflow.com/questions/23687011/ipython-notebook-tab-complete-show-docstring

____________________________________

Hey Abhishek,

2016-03-11 2:16 GMT+05:00 Freakavoid <[email protected]>:

> Hi Mike,
>
> I am Abhishek Shrivastava, a third-year CSE major from UIT-Bhopal,
> India.
> I am looking to work with Scrapy for GSoC 2016 on the project "IPython
> IDE for Scrapy". I am taking the initiative to mail you directly as you are
> an assigned mentor for the project.
>

Glad to hear you're starting early!

>
> I have set up Scrapy for development on my machine (Ubuntu 14.04). I have
> also completed the basic tutorials in the documentation. I am trying
> to run the scrapy-ipython demo notebook, but I am facing this error:
> http://pastie.org/10755388
> Also, I would like to discuss the details of this project further
> (I also mailed the same questions to the mailing list).
>

Hm, which mailing list do you use?

>
> 1. When we say IDE, do we want to develop a Jupyter-based IDE backed by
> QT/GTK? The ideas page says "display HTML page inline in the console". I
> don't understand this part. Do we mean that we will display the HTML page
> inline in the Jupyter console?
>

I think we should use the web-based Jupyter Notebook, not QT/GTK frontends. Have you tried it? It is very cool :)

> 2. Do we want to use the default Jupyter kernel, or something else? Like
> Splash-Jupyter kernels?
>

I don't know! Some things can be done with a standard Jupyter kernel; some features may require a customized Jupyter kernel.

> 3. Will we be supporting built-in templates? My understanding is that we
> will support built-in autocompletion and snippets at least?
>

I haven't thought about this feature; I've never seen snippets working in the IPython notebook. It should be possible to implement them, but I'm not aware of an existing implementation. It can be a nice feature.

> 4. It would be great if we could somehow make the documentation and function
> usage built-in.
>

Yeah, agreed. This should come automatically because IPython can show docstrings, but currently many Scrapy functions and classes don't have docstrings. See https://github.com/scrapy/scrapy/issues/713 - we're moving parts of the Scrapy documentation back to the code step by step; help is appreciated.

I think getting familiar with Scrapy and submitting a few pull requests is a good start. Checking how IPython widgets are created can be helpful (see https://github.com/ipython/ipywidgets). It could also make sense to do a couple of scraping projects, to refresh yourself on what it takes to develop a Scrapy spider and what can be simplified by an interactive environment like IPython.

There are several ways IPython can help Scrapy developers: for example, it can allow developing CSS/XPath selectors more visually, it may allow developing spider callbacks one by one interactively, and it may allow monitoring scraping progress. Likely there are other applications. None of them are straightforward to implement because Scrapy runs in a Twisted reactor :) Scrapy code is async, while in most cases IPython cells are sync; this is also a challenge. You may have to figure out how to plug Scrapy into IPython's event loop.

Some links for inspiration:

* http://selectorgadget.com/ Chrome extension - something like this as an IPython widget may help;
* https://gist.github.com/takluyver/b9663b08ac9a4472afa6 - an example of asyncio cells in IPython (got the link from https://github.com/ipython/ipykernel/issues/109);
* https://github.com/scrapy/scrapy/issues/1144#issuecomment-141843616 may give some ideas on what code in IPython cells may look like.
* http://www.youtube.com/watch?v=JPyvmW-eOLs - a monitoring interface for Scrapy spiders

This project is hard and requires a lot of creativity; there is no single best way to do it. There is only a vague idea that IPython can be helpful for developing Scrapy spiders, and an initial prototype (which is very outdated - as you found, it doesn't even work as-is).

There are many questions to answer, e.g. what should be in an IPython cell? A spider? A spider callback? Something else? Will it help with crawling (i.e. fetching new pages, following links) or only with scraping (i.e. extracting information from responses)? Should it be integrated with Scrapy projects? How to pass settings to spiders developed in IPython? Can the IDE help with development of Scrapy middlewares/extensions/pipelines, or is it only for spiders?

It is absolutely not necessary to answer all these questions, but it would be good to figure out what you want from an IDE, what can be done in GSoC time, and to get a very rough plan on how to implement it.

Hope I haven't overloaded you with information :) Looking forward to hearing from you.
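
One possible way to picture the sync-cell versus async-code challenge mentioned above is the rough sketch below, written with the standard library only. It runs an event loop in a background thread and submits coroutines to it from ordinary synchronous code, the way a notebook cell might. Note the assumptions: asyncio stands in for the Twisted reactor here, `BackgroundLoop` and `fake_download` are hypothetical names invented for illustration, and this is not how Scrapy or IPython currently work.

```python
import asyncio
import threading


class BackgroundLoop:
    """Hypothetical helper: an asyncio event loop running in a daemon thread."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro, timeout=5):
        """Submit a coroutine from synchronous code and block for its result."""
        future = asyncio.run_coroutine_threadsafe(coro, self.loop)
        return future.result(timeout)

    def stop(self):
        """Ask the loop to stop, wait for the thread to finish, then close it."""
        self.loop.call_soon_threadsafe(self.loop.stop)
        self._thread.join()
        self.loop.close()


async def fake_download(url):
    """Stand-in for an asynchronous download; returns a fake HTML body."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


# What a synchronous "cell" could look like: a plain call, a plain result.
bg = BackgroundLoop()
body = bg.run(fake_download("http://example.com/"))
print(body)
bg.stop()
```

The trade-off in this sketch is that the calling code blocks until the result arrives, so it gives up concurrency inside a single cell in exchange for simple, linear cell code.
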
