Hi all, I'm Peng Foo, a postgraduate student majoring in CS at BUAA (aka Beihang University, BHU) in China, which is one of the top universities for CS in China. (I know it may be unfamiliar to you, but it was shown in "CSI: Cyber" S2E11 :-) If you're interested, I'd love to tell you why my university changed its name xD)
<https://pic3.zhimg.com/8559284960725caae8dc347cf4d7943a_b.png>

*I am skilled in Python and have a lot of experience in web scraping and web development.*

I major in ML/NLP, and Python is the most popular language in the field, so I write Python a lot. I also help my lab with some frontend work using JavaScript, HTML5, and jQuery.

I was an intern at Sogou Inc. from November 2015 to April 2016, where my main job was web scraping, data cleaning, and data analysis with Python. During that time I wrote a lot of spiders xD I've crawled big sites like Yahoo/Sina Finance, Tencent News, Yahoo News, Sina Weibo (the Chinese counterpart of Twitter), and so on.

I also interned at Lenovo in 2014, doing machine learning research with Python, mainly on gesture recognition. That is when I met IPython Notebook; at that time it was still IPython Notebook rather than Jupyter :) Now IPython is the first choice for writing Python remotely, instead of writing Python locally, transferring it to the remote server with scp, and running it there :) I was amazed at what a Python IDE in the browser can do, and I was shocked when I used IPython Notebook and matplotlib to draw a line graph in the IDE instantly! I really think combining Scrapy and IPython Notebook is a good thing to do.

Here are some problems I expect to run into and will need to solve:

1. Showing HTML inside HTML does cause problems, such as CSS class conflicts and layout issues, because the Jupyter console is much narrower than the browser window.
2. Cross-site/cross-domain issues, such as AJAX request code and JavaScript that jumps out of the page.
3. Security issues, such as malicious code or alert pop-ups.
4. Performance: what should we do if the user writes too much show-HTML code and the console becomes slow?
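One possible way to approach problems 1 to 3 is to render the scraped page inside a sandboxed iframe rather than injecting it directly into the notebook's own DOM. The sketch below is only an illustration under that assumption: the helper name `sandboxed_iframe` and its parameters are hypothetical, and the iframe approach is not something Scrapy or Jupyter prescribes. It uses only the standard library to build the markup.

```python
import html


def sandboxed_iframe(page_html, width="100%", height=400):
    """Wrap scraped HTML in a sandboxed <iframe> (hypothetical helper).

    The page's CSS stays inside the frame, so its class names cannot
    clash with the notebook's own styles.
    """
    # Escape the markup so it can sit safely inside the srcdoc attribute.
    escaped = html.escape(page_html, quote=True)
    # An empty `sandbox` attribute applies every restriction: scripts do
    # not run, forms cannot submit, and the frame cannot navigate the
    # top-level page.
    return (
        f'<iframe sandbox srcdoc="{escaped}" '
        f'width="{width}" height="{height}"></iframe>'
    )


# Example: a page fragment with a class name and an inline script.
snippet = sandboxed_iframe('<p class="title">Hello <script>alert(1)</script></p>')
print(snippet)
```

In a notebook cell, the returned string could be passed to `IPython.display.HTML` to display it. Fixing the frame's width and height also gives one place to deal with the narrow-console layout problem, although the right sizes would need experimenting.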
The performance problem is likely to come up, because people like to use Jupyter for inline debugging, since Jupyter shows results immediately :-)

Here are the ideas I've always wanted to make happen, and I think they fit the Scrapy + IPython idea:

0. A formatted DOM tree shown in the Jupyter console. In web scraping, the DOM is as important as the HTML, so maybe we could show both the HTML page and the DOM tree in the console.
1. Visual element selection. It's a feature of most browsers' developer tools (F12 in Chrome, Firebug in Firefox), and it is such a good feature that I use it to locate elements every time I write a spider. It would be good if, when I select a node in the Jupyter console (building on idea 0), the corresponding element were highlighted in the HTML page :)
2. An XPath/CSS selector generator. I've wanted to implement this idea for a long time xD When I scrape something, I always wish the XPath and the CSS selector would just pop up when I select something in the page I'm scraping: a paragraph, a table, an <a> tag, or a <ul> list. Once we can show the HTML page in the Jupyter console, it may come true! The XPath or CSS selector could be generated automatically when elements in the page are selected, and I'm sure it would help a lot!

When I first searched the Python projects for GSoC 2016, I did not find one that I wanted to participate in. But earlier today, when I saw Scrapy among the Python projects for GSoC 2016, I was so excited and thought, "This is it!" I just want you to know that I am longing for a chance to work with you and contribute code to the Scrapy project! If I am not a good fit for the "IPython IDE for Scrapy" idea, I'm happy to work on any other idea :-)

Best regards,
Peng Foo
