Hi all,

I would like to work on the Scrapy project idea "IPython-based IDE for 
Scrapy". Below are the logs of my discussion with the assigned mentor, 
Mikhail Korobov. I would appreciate your feedback. Thank you.
__

Hi Mike,

Thank you so much for replying!
Sorry for the delayed response. I have done some research on how we should 
go about this project, and I have also checked out all the references that 
you gave.

1. IPython widgets are awesome. We definitely need these in order to turn 
a notebook into a functional IDE for Scrapy. I think we can use 
SelectorGadget to our advantage: it is open source, and we could port and 
import it as an IPython widget. The developer could use this widget to 
explore the site and generate the required XPath/CSS selector. We could 
also use IPython's "link" feature to connect the selector gadget (which 
acts here as a frontend?) to a cell in the notebook. Clicking "Confirm" 
could execute the cell and start extracting with the chosen XPath.

We can either incorporate the selector gadget in the cell output, or as a 
popout which renders the site's HTML and helps the developer find the 
right XPath.
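To make the Confirm-to-cell wiring concrete, here is a minimal, toolkit-independent sketch of the flow. All names here (SelectorWidget, run_extraction) are hypothetical; a real implementation would build on ipywidgets instead:

```python
# Toolkit-independent sketch of point 1: a "selector widget" that, on
# Confirm, hands the chosen XPath to code that runs the extraction.
# SelectorWidget and run_extraction are hypothetical names.

class SelectorWidget:
    def __init__(self):
        self._on_confirm = None
        self.xpath = None

    def on_confirm(self, handler):
        # Register the code to run when the user clicks Confirm.
        self._on_confirm = handler

    def confirm(self, xpath):
        # In a real widget this would fire from the browser side.
        self.xpath = xpath
        if self._on_confirm:
            self._on_confirm(xpath)

results = []

def run_extraction(xpath):
    # Stand-in for "execute the cell and scrape using this XPath".
    results.append(f"scraped with {xpath}")

widget = SelectorWidget()
widget.on_confirm(run_extraction)
widget.confirm("//div[@class='title']/text()")
print(results)
```

With ipywidgets the same hookup would use a real Button's click handler instead of the confirm() method above.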

2. Defaults - We also need some default cells. In these cells we can 
initialize Scrapy and our IPython widgets. If we use any other JavaScript, 
we can load it through IPython as well.

3. HTML and JS - It is possible to run HTML and JavaScript inside an 
IPython cell. I found out about this from [1], the section about %%html 
and %%javascript. Magic functions might come in handy in this project - 
the %doctest_mode magic too.

4. Autocompletion and snippets - I found out that IPython does provide tab 
completion (using Tab) and docstring display (using Shift-Tab), so with a 
little bit of hacking around I feel we can adapt the autocompletion 
feature to our requirements, though it has a dependency on pyreadline. 
[2], [3]

5. I also went through inline-requests. Would it be possible to execute 
Scrapy's inline requests for crawling inside IPython cells? As you pointed 
out in the comments, maybe this can help in linearizing the flow.
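To illustrate the linearization point - this is not the actual inline-requests API, just a stdlib sketch of the underlying generator trick (fetch, parse, and run_inline are hypothetical names):

```python
# Sketch of the idea behind inline-requests: a driver resumes a
# generator-style callback each time a "response" arrives, so the crawl
# logic reads top-to-bottom instead of being split across callbacks.

def fetch(url):
    # Stand-in for a real download; returns a fake response body.
    return f"<html>content of {url}</html>"

def parse(start_url):
    # Looks synchronous, but each `yield` hands control back to the driver.
    first = yield start_url
    follow_up = yield start_url + "/next"
    return [first, follow_up]

def run_inline(callback, start_url):
    gen = callback(start_url)
    url = next(gen)                      # first requested URL
    try:
        while True:
            url = gen.send(fetch(url))   # resume with the "response"
    except StopIteration as done:
        return done.value

results = run_inline(parse, "http://example.com")
print(results)
```

The real library wraps Scrapy's Request/Response machinery in the same way, which is why callbacks written with it read linearly.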

6. Scope - I think it is clear that it will be possible to crawl a single 
page and extract data from it interactively. But we also need to think 
about crawlers that crawl multiple sites, which happens asynchronously, so 
we need a way to handle that too. Scrapy uses async methods to download 
pages and then invokes the callbacks. Could we use "await" to make things 
appear synchronous in the notebook?
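As a rough sketch of that idea, using only stdlib asyncio (fetch here is a stand-in, not Scrapy's downloader):

```python
# Minimal sketch of how `await` can linearize an async crawl: downloads
# still overlap, but the code reads sequentially.
import asyncio

async def fetch(url):
    await asyncio.sleep(0)           # pretend network latency
    return f"response from {url}"

async def crawl(urls):
    # Start all downloads concurrently, then await the results in order.
    tasks = [asyncio.create_task(fetch(u)) for u in urls]
    return [await t for t in tasks]

pages = asyncio.run(crawl(["http://a.example", "http://b.example"]))
print(pages)
```

In a notebook, `await crawl(...)` could be used directly in a cell since IPython runs an event loop of its own; `asyncio.run` is only needed for a standalone script like this.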

7. Live display of responses - It should also be possible to bind a cell 
to its output so that all the scraped data is parsed and shown in a 
meaningful way while the scraping is going on. Could this be a feature?

8. Distribution - How are we going to distribute this IDE? I am guessing 
as an IPython notebook? But it would be nice if we supplemented it with a 
shell script which checks for and installs all the required dependencies. 
A more advanced version of the script could install Scrapy as well - kind 
of like a complete package for web scraping.
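Such a check script could start out as simple as this sketch (the package list is illustrative, and check_pkg is a made-up helper name):

```shell
#!/bin/sh
# Sketch of a dependency-check script: reports which Python packages are
# importable. An "advanced" version could install the missing ones.
check_pkg() {
    if python3 -c "import $1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing (install with: pip install $1)"
    fi
}

for pkg in scrapy IPython ipywidgets; do
    check_pkg "$pkg"
done
```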

9. Practice - Currently I am learning more about Scrapy, its source code, 
and IPython widgets.

10. GitHub issues and pull requests - I have forked the Scrapy repo and 
would like to work on 
https://github.com/scrapy/scrapy/issues/1309 and 
https://github.com/scrapy/scrapy/issues/740

I posted my introduction mail to this mailing list: 
https://groups.google.com/forum/#!forum/scrapy-users Apparently my topic 
hasn't been approved yet. It would be great if I could post to the mailing 
list.

Regards,
Abhishek Shrivastava

[1] IPython Magic Functions - 
http://ipython.readthedocs.org/en/stable/interactive/magics.html 
[2] Stack overflow question about autocompletion-  
http://stackoverflow.com/questions/2603798/ipython-tab-completion-not-working
[3] Question about autocompletion and docstrings- 
http://stackoverflow.com/questions/23687011/ipython-notebook-tab-complete-show-docstring

____________________________________
Hey Abhishek,


2016-03-11 2:16 GMT+05:00 Freakavoid <[email protected]>:

> Hi Mike,
>
> I am Abhishek Shrivastava, a third-year CSE major from UIT-Bhopal, 
> India. 
> I am looking to work with Scrapy for GSoC 2016 on the project "IPython 
> IDE for Scrapy". I am taking the initiative to mail you directly, as you 
> are an assigned mentor for the project.  
>

Glad to hear you're starting early! 
 

>
> I have set up Scrapy for development on my machine (Ubuntu 14.04). I have 
> also completed the basic tutorials posted in the documentation. I am trying 
> to run the scrapy-ipython demo notebook, but I am facing this error:  
> http://pastie.org/10755388
>
> Also, I would like to discuss the details of this project further
> (I also mailed the same questions to the mailing list)
>

Hm, which mailing list do you use?
 

>
> 1. When we say IDE, do we want to develop a Jupyter-based IDE backed by 
> QT/GTK? The ideas page says "display HTML page inline in the console"; I 
> don't understand this part. Do we mean that we will display the HTML page 
> inline in the Jupyter console? 
>

I think we should use web-based Jupyter Notebook, not QT/GTK frontends. 
Have you tried it? It is very cool :)
 

> 2. Do we want to use the default Jupyter kernel, or something else, like 
> the Splash Jupyter kernel? 
>

I don't know! Something can be done with a standard Jupyter Kernel, some 
features may require a customized Jupyter Kernel.
 

> 3. Will we be supporting built-in templates? My understanding is that we 
> will support built-in autocompletion and snippets at least? 
>

I haven't thought about this feature; I've never seen snippets working in 
the IPython notebook. It should be possible to implement them, but I'm not 
aware of an existing implementation. It could be a nice feature.
 

> 4. It would be great if we could somehow make the documentation and 
> function usage built-in. 
>

Yeah, agreed. This should come automatically because IPython can show 
docstrings, but currently many Scrapy functions and classes don't have 
docstrings. See https://github.com/scrapy/scrapy/issues/713 - we're moving 
parts of the Scrapy documentation back into the code step by step; help is 
appreciated. 

I think getting familiar with Scrapy and submitting a few pull requests is 
a good start. Checking how IPython widgets are created can be helpful (see 
https://github.com/ipython/ipywidgets). It could also make sense to do a 
couple of scraping projects, to refresh yourself on what it takes to 
develop a Scrapy spider and what could be simplified by an interactive 
environment like IPython. 

There are several ways IPython can help Scrapy developers: for example, it 
can make developing CSS/XPath selectors more visual, it may allow 
developing spider callbacks one by one interactively, and it may allow 
monitoring scraping progress. Likely there are other applications. None of 
them are straightforward to implement because Scrapy runs in a Twisted 
reactor :) Scrapy code is async, while in most cases IPython cells are 
sync; this is also a challenge. You may have to figure out how to plug 
Scrapy into IPython's event loop.
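One stdlib-only sketch of that bridge: wrapping a callback-style API (the kind Twisted-based code exposes) in an asyncio Future so a cell can simply await it. Both download_with_callback and download are hypothetical stand-ins, not real Scrapy or Twisted APIs:

```python
# Sketch of the sync/async bridge problem: adapt a callback-style
# "download" so that notebook-style code can await its result.
import asyncio

def download_with_callback(url, callback):
    # Pretend async download that reports its result via a callback,
    # the way Twisted-based code typically does.
    loop = asyncio.get_running_loop()
    loop.call_soon(callback, f"body of {url}")

def download(url):
    # Adapt the callback API to a Future that a cell can await.
    loop = asyncio.get_running_loop()
    fut = loop.create_future()
    download_with_callback(url, fut.set_result)
    return fut

async def main():
    body = await download("http://example.com")
    print(body)

asyncio.run(main())
```

A real solution would likely lean on Twisted's asyncioreactor so that Twisted and IPython share one event loop, but the Future-wrapping idea is the same.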

Some links for the inspiration:

* http://selectorgadget.com/ Chrome extension - something like this as an 
IPython widget may help;
* https://gist.github.com/takluyver/b9663b08ac9a4472afa6 - an example of 
asyncio cells in IPython (got the link from 
https://github.com/ipython/ipykernel/issues/109); 
* https://github.com/scrapy/scrapy/issues/1144#issuecomment-141843616 - may 
give some ideas on how code in IPython cells may look;
* http://www.youtube.com/watch?v=JPyvmW-eOLs - a monitoring interface for 
Scrapy spiders.

This project is hard and requires a lot of creativity; there is no single 
best way to do it. There is only a vague idea that IPython can be helpful 
for developing Scrapy spiders, and an initial prototype (which is very 
outdated; as you found, it doesn't even work as-is). 

There are many questions to answer, e.g. what should be in an IPython cell? 
A spider? A spider callback? Something else? Will it help with crawling 
(i.e. fetching new pages, following links) or only with scraping (i.e. 
extracting information from responses)? Should it be integrated with Scrapy 
projects? How to pass settings to spiders developed in IPython? Can the 
IDE help with the development of Scrapy middlewares/extensions/pipelines, 
or is it only for spiders?

It is absolutely not necessary to answer all these questions, but it'd be 
good to figure out what you want from an IDE, what can be done in the GSoC 
time frame, and to get a very rough plan for how to implement it.

Hope I haven't overloaded you with information :) Looking forward to 
hearing from you.

-- 
You received this message because you are subscribed to the Google Groups 
"scrapy-users" group.