Hi all,

I'm Peng Foo, a postgraduate student majoring in CS at BUAA, a.k.a. Beihang 
University (BHU), China, which is one of the top universities for CS in 
China. (I know it may be unfamiliar to you, but it was shown in "CSI: Cyber" 
S2E11 :-) If you're interested, I'd love to tell you why my university 
changed its name xD)

<https://pic3.zhimg.com/8559284960725caae8dc347cf4d7943a_b.png>


*I am skilled in Python and have a lot of experience with web scraping and 
web development.*


I major in ML/NLP, and Python is the most popular language in the field, so 
I write a lot of Python. I also help my lab with some frontend work using 
JavaScript, HTML5 and jQuery.

I am an intern at Sogou Inc. from November 2015 to April 2016, and my main 
job is web scraping, data cleaning and data analysis in Python. During this 
time I have written a lot of spiders xD
I've crawled big sites like Yahoo/Sina Finance, Tencent News, Yahoo News, 
Sina Weibo (the Chinese counterpart of Twitter) and so on.

I also interned at Lenovo in 2014, where I did machine learning research in 
Python, mainly on gesture recognition. That is when I first met IPython 
Notebook; back then it was still called IPython Notebook rather than 
Jupyter :) 
Now the notebook is my first choice for writing Python on a remote machine, 
instead of writing it locally, copying it to the remote server with scp and 
running it there :)

 
I was amazed at what a Python IDE in the browser can do, and I was stunned 
the first time I used IPython Notebook and matplotlib to draw a line graph 
right inside the IDE, instantly! 

And I really think combining Scrapy and IPython Notebook is a good thing to 
do.

Here are some problems I think I may run into and will have to solve.

1, There are indeed some problems with showing HTML inside HTML, such as 
CSS class conflicts and layout issues, because the Jupyter console is much 
narrower than a browser window.

2, Cross-site/cross-domain issues, such as code that makes AJAX requests, 
and also JS code that jumps out of the page (e.g. redirects).

3, Security issues, such as malicious code or alert pop-ups.

4, Performance: what should we do if the user writes too much show-HTML 
code and the console becomes slow? This is likely to happen, because people 
like to use Jupyter for inline debugging, since it shows the results 
immediately :-)
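
One possible starting point for problems 1 to 3 is to render the scraped 
page inside a sandboxed iframe rather than injecting it into the notebook's 
own DOM. The sketch below is only an illustration under my own assumptions: 
`sandboxed_iframe` is a hypothetical helper, not part of Scrapy or Jupyter, 
and it does nothing for the performance question in problem 4.

```python
from html import escape


def sandboxed_iframe(page_html, height=400):
    """Wrap a scraped page in an <iframe> so it renders in its own document.

    Hypothetical helper for illustration; not an existing Scrapy/Jupyter API.
    """
    # srcdoc carries the whole page as an attribute value, so escape it.
    srcdoc = escape(page_html, quote=True)
    # An empty sandbox attribute switches on every restriction: no scripts,
    # no forms, no popups, no top-level navigation.
    return (
        '<iframe sandbox="" srcdoc="{}" '
        'style="width: 100%; height: {}px; border: 1px solid #ccc;">'
        '</iframe>'
    ).format(srcdoc, int(height))
```

In a notebook, the returned string could be passed to `IPython.display.HTML` 
to render it. Because the page lives in its own document, its CSS classes 
cannot clash with the notebook's, and the empty `sandbox` attribute keeps 
scripts (and so alerts, AJAX calls and redirects) from running.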


And here are the ideas I've always wanted to make happen, which I think may 
fit the Scrapy + IPython idea:

0, A formatted DOM tree shown in the Jupyter console. In web scraping, the 
DOM is as important as the HTML, so maybe we could show both the HTML page 
and the DOM tree in the console. 

1, Visual element selection. This is a feature of most browsers' developer 
tools (F12 in Chrome, Firebug in Firefox). It is such a good feature that I 
use it to locate elements every time I write a spider. I think it would be 
good if, when I select a node in the Jupyter console (assuming idea 0 is 
implemented), I could see the corresponding element highlighted in the HTML 
page :)

2, An XPath/CSS selector generator. I've wanted to implement this idea for 
a long time xD. Whenever I scrape something, I always wish the XPath and 
the CSS selector would just pop up when I select something in the page I am 
scraping: a paragraph, a table, an <a> tag or a <ul> list. Once we can show 
the HTML page in the Jupyter console, this may come true! The XPath or CSS 
selector could be generated automatically when elements in the page are 
selected, and I'm sure it would help a lot!
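
To make ideas 0 and 2 a bit more concrete, here is a rough sketch that 
prints an indented DOM outline together with a positional XPath for every 
element, using only the standard library's `html.parser`. The names 
`DomOutline`, `outline` and `format_outline` are hypothetical, and this 
parser does not repair broken markup the way a browser or lxml does, so the 
paths are only reliable for well-formed pages.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomOutline(HTMLParser):
    """Collect (depth, tag, xpath) for every element, in document order."""

    def __init__(self):
        super().__init__()
        self.rows = []
        # Each entry: (tag, xpath of that element, {child tag: count so far}).
        # The first entry is a virtual root above <html>.
        self.stack = [(None, "", {})]

    def handle_starttag(self, tag, attrs):
        _, parent_path, counts = self.stack[-1]
        # Position among siblings with the same tag name (XPath is 1-based).
        counts[tag] = counts.get(tag, 0) + 1
        xpath = "{}/{}[{}]".format(parent_path, tag, counts[tag])
        self.rows.append((len(self.stack) - 1, tag, xpath))
        if tag not in VOID_TAGS:
            self.stack.append((tag, xpath, {}))

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def outline(page_html):
    """Return a list of (depth, tag, xpath) tuples for the given HTML."""
    parser = DomOutline()
    parser.feed(page_html)
    parser.close()
    return parser.rows


def format_outline(rows):
    """Render the rows as an indented tree, one element per line."""
    return "\n".join("{}{}  {}".format("  " * depth, tag, xpath)
                     for depth, tag, xpath in rows)


if __name__ == "__main__":
    print(format_outline(outline("<ul><li>a</li><li>b</li></ul>")))
```

In the real feature the XPath would of course come from the element the 
user clicks on, and a shorter selector (by id or class) would often be 
nicer than a purely positional one; this only shows that the basic mapping 
from a DOM node to a selector is cheap to compute.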




When I first searched the Python projects for GSoC 2016, I did not find one 
that I'd like to participate in. But earlier today, when I saw Scrapy among 
the Python projects for GSoC 2016, I was so excited and thought "this is 
it!". I just want you to know that I am longing for a chance to work with 
you and contribute code to the Scrapy project!

If I am not a good fit for the "IPython IDE for Scrapy" idea, I would be 
happy to work on any other idea :-)


Best regards,

Peng Foo








 

