Re: GSOC Proposal

Mikhail Korobov Tue, 04 Mar 2014 01:55:31 -0800

A fun fact: "Communication skills" was listed under "Required skills" for 
the Python 3 porting project at some point :) I find Twisted development 
quite active, and in my experience the feedback is usually fast. The Twisted 
Python 3 porting project is ongoing; we shouldn't have to contribute 
Scrapy-specific patches - we should help port the modules that are most 
important for us, following the Twisted Python 3 plan. There are various 
reasons why Python 3 support improvements may receive zero feedback from a 
project: insufficient tests, the author may not be comfortable maintaining 
compatibility himself, etc., but most of them don't apply to Twisted.


I don't think Twisted will block us because

a) there are many Scrapy modules to port, and if some of them depend on an 
unfinished Twisted patch, one can work on another module;
b) if some patch is on a "critical path" we can always use a patched 
Twisted version until the changes are done upstream (making sure the patch 
is reasonable and is not accepted by Twisted only because of lack of time or 
some minor issues). You won't fail GSoC because a good patch is not accepted.
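
As a rough illustration of point (a), here is a minimal sketch of how one could pick the modules that are not blocked at a given moment. All module names and the dependency data are hypothetical placeholders, not taken from Scrapy or Twisted, and `unblocked_modules` is a made-up helper.

```python
def unblocked_modules(deps, ported):
    """Return modules whose upstream dependencies are all already ported."""
    return sorted(
        module for module, needs in deps.items() if set(needs) <= set(ported)
    )


# Hypothetical placeholder data: which upstream parts each module needs.
deps = {
    "scrapy_module_a": ["twisted_part_x"],
    "scrapy_module_b": ["twisted_part_y"],
    "scrapy_module_c": [],
}

# Assumed state of the upstream port, for illustration only.
ported = {"twisted_part_x"}

ready = unblocked_modules(deps, ported)
print(ready)
```

With this made-up data, module b waits on an unported upstream part, while a and c can be worked on right away.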

That said, Twisted rules are unusually strict for open-source projects - if 
you want to contribute a Python 3 patch then make sure to strictly follow 
https://twistedmatrix.com/trac/wiki/Plan/Python3#Methodology. 

There are some issues with Twisted that prevent us from getting started, 
namely the Twisted trial runner. I hope they are only testing issues, as 
twisted.trial itself and the Twisted networking utilities look mostly ported.

I think that porting Scrapy to Tornado (or whatever) is a big project in 
itself, and it is better not to aim to both port Scrapy to Python 3 and 
replace Twisted with Tornado during a single summer.

It's great that you've already ported some code to Python 3! This makes 
your application more solid - some links to your contributions would be 
great. Also, contributing to Twisted before the project starts is a good way 
to get people to trust you, and for you to understand the project better.


On Tuesday, March 4, 2014 at 13:29:42 UTC+6, Edwin Marshall wrote:
>
> I suggested replacing it because scrapy's own gsoc ideas page seems to 
> indicate that it will be trivial: "However, Scrapy only uses a (very 
> small) subset of Twisted. Students working on this should be prepared to 
> port (or drop) certain parts of Twisted that do not yet support Python 3."
>
> That said, if they'd be willing to collaborate and even accept patches to 
> allow the parts that scrapy needs to run on Python 3, then I absolutely 
> think that's a better strategy. My only hesitation comes from having ported 
> an entire networking library (which I shall not name) only to have my pull 
> requests neither accepted nor even commented on after several (4+) months, 
> despite extensive conversations on IRC. In other words, I don't want 
> to fail at GSOC because some patch didn't get accepted. On the other hand, 
> I understand something as large as adding python 3 support taking a bit to 
> trickle through the proper channels, but zero feedback is rather 
> discouraging.
>
> As for actually doing the port, I would, if accepted, use multiple feature 
> branches so that implementations can be compared. Also, having a tornado 
> (or whatever) branch while waiting on feedback from the twisted developers 
> (for example) would allow me to continue development porting other bits 
> without being blocked by a dependency.
>
> Thanks for your quick reply Mikhail and I look forward to discussing this 
> further, assuming it's appropriate to do so.
>
> [1]
>
> On Monday, March 3, 2014 11:46:15 PM UTC-5, Mikhail Korobov wrote:
>>
>> Hi Edwin,
>>
>> That's great you're considering us for GSoC, nice to meet you! 
>> Familiarizing yourself with the API is a good start. 
>> I'm listed as a mentor for the Python 3 porting project, so let me try to 
>> answer your question.
>>
>> Scrapy itself is written in Python, but using extensions that depend on C 
>> libraries is OK. We use them already - lxml and pyOpenSSL are not 
>> pure-python, and twisted also has C modules. 
>>
>> The scraping bottleneck is rarely in the event loop (maybe even never). I 
>> don't think moving away from Twisted can provide big performance benefits, 
>> if any. Also, Twisted is much more than a basic async networking library. 
>> It seems that you can already make Twisted use libuv for its reactor - 
>> check https://github.com/saghul/twisted-pyuv (haven't tried it myself).
>>
>> You're right that replacing twisted with something else (tornado?) could 
>> make porting easier. But it is a very ambitious project because a lot of 
>> Scrapy inner details depend on twisted. There is a semi-documented feature 
>> that twisted Deferred can be returned from spider and middleware methods, 
>> so twisted almost made its way into Scrapy public interface. This means 
>> replacing twisted with something else is a big change, and it will 
>> inevitably break some user code. It is doable, and it may have some 
>> benefits, but the barrier to entry is high. If you want us to consider this 
>> as a project, you need to provide a detailed plan on how you're going to 
>> implement it (creating such a plan could easily take more than a day of 
>> full-time work), and a good description of the benefits of this project. 
>>
>> I'm 80% sure it is a bad idea to replace Twisted - it looks hard, it will 
>> break code and there are not many visible benefits. But I haven't 
>> checked exactly how hard it is or what exactly it will break (e.g. can 
>> http://www.tornadoweb.org/en/stable/twisted.html help?), and I may be 
>> missing some benefits, hence the remaining 20% :)
>>
>> Collaborating with Twisted people looks like a more straightforward way 
>> to make Scrapy work in Python 3: Scrapy doesn't use all of Twisted, many 
>> parts of Twisted are already ported, and a subset of Twisted is already 
>> usable in Python 3.
>>
>> On Tuesday, March 4, 2014 at 4:35:35 UTC+6, Edwin Marshall wrote:
>>>
>>> *Quick Intro*
>>> My name is Edwin Marshall and I currently only write code in my free time. 
>>> GSOC seems like a good way to both improve an open source project and add 
>>> applicable experience to my resume so that I can start coding 
>>> professionally. I don't have any experience with scrapy, but have done some 
>>> web scraping in the past. As such, I intend to take the next few days in 
>>> between work and school to familiarize myself with the API before 
>>> application submissions are open on the 10th. 
>>>
>>> *The Real Question*
>>> I saw that scrapy was interested in porting to python 3 (which I have 
>>> experience doing on multiple projects), but that might also require porting 
>>> of some twisted code. I was wondering if you guys (and gals) have 
>>> considered porting to something pyuv based, such as evergreen? It would be 
>>> interesting to see what performance improvements (if any) could be gained 
>>> by using lightweight threads. An alternative might be gruvi or raw pyuv. I 
>>> was originally thinking about gevent, but last I checked it neither worked 
>>> on python3 nor reliably worked on windows. If the landscape has changed, 
>>> that could also be an option.
>>>
>>> Sorry if any of this sounds presumptuous. I'm horrible at introductions 
>>> and in a bit of a rush (have to go to my unfulfilling day job shortly)
>>>
>>> [1] Evergreen: https://pypi.python.org/pypi/evergreen
>>> [2] Gruvi: https://pypi.python.org/pypi/gruvi/0.9.2
>>> [3] pyuv: https://pypi.python.org/pypi/pyuv
>>>
>>
