Re: Cpython optimization
Qrees wrote:
> Hello. For my Master's dissertation I chose CPython optimization.
> That's why I'd like to ask what your suggestions are as to what could
> be optimized. Well, I know that quite a lot could be. I've downloaded
> the source code (I plan to work on CPython 2.6, and I've downloaded
> the 2.6.3 release). Looking through the code I've found comments like
> "this can be optimized by..." etc., but maybe you could guide me on
> what I should concentrate on in my work?

2.6.3 is about to be replaced with 2.6.4, but even that will not have
improvements that have already been added to 3.x or 2.7. I strongly
suggest that you work with the 3.2 code, since there is then a maximal
chance that your improvements will actually be adopted.

tjr
Re: Cpython optimization
Olof Bjarnason wrote:
> [snip]
> A short question, after having read through most of this thread on the
> same subject (time-optimizing CPython):
> http://mail.python.org/pipermail/python-list/2007-September/098964.html
>
> We are seeing multi-core processors more and more these days, but they
> are all still connected to the same main memory, right? To me that
> means that even though an algorithm can be split up into several
> threads that run on different cores of the processor, any algorithm
> will be memory-speed limited. And memory access is a quite common
> operation for most algorithms.
>
> Then one could ask oneself: what is the point of multiple cores, if
> memory bandwidth is the bottleneck? Specifically, what makes one
> expect any speed gain from parallelizing a sequential algorithm into
> four threads, say, when the memory shuffling is the same speed in both
> scenarios? (Assuming memory access is much slower than ADDs, JMPs and
> such instructions - a quite safe assumption, I presume.)
>
> [ If every core had its own primary memory, the situation would be
> different. It would be more like a distributed/internet-based system,
> spread over several computers. One could actually view each core as a
> separate computer. ]

Don't forget about the on-chip cache! :-)
Re: Cpython optimization
2009/10/22 MRAB <pyt...@mrabarnett.plus.com>:
> Olof Bjarnason wrote:
>> [snip]
>> Then one could ask oneself: what is the point of multiple cores, if
>> memory bandwidth is the bottleneck?
>> [snip]
>
> Don't forget about the on-chip cache! :-)

Sorry for continuing slightly OT:

Yes, that makes matters even more interesting. Caches for single-CPU
boards speed up memory access quite dramatically. Are the caches on
multi-core boards shared among the cores, or does each core have its
own cache? I can only imagine how complicated the read/write logic of
these tiny electronic devices must be, in any case.

Of course caches make memory-access operations much faster, but I'm
guessing register instructions are still orders of magnitude faster
than (cached) memory access. (Or else registers would not really be
needed - you could just view the whole primary memory as an array of
registers!)

So I think my first question is still interesting: What is the point of
multiple cores, if memory is the bottleneck? (It helps to think of
algorithms such as line drawing or ray tracing, which are easy to
parallelize, yet I believe are still faster using a single core instead
of multiple, because of the read/write-to-memory bottleneck. It does
help to bring more workers to the mine if only one is allowed access at
a time - or, more likely, several are allowed, yet it gets so crowded
that queues/waiting are inevitable.)

--
twitter.com/olofb olofb.wordpress.com olofb.wordpress.com/tag/english
Re: Cpython optimization
2009/10/23 Olof Bjarnason <olof.bjarna...@gmail.com>:
> [snip]
> (It helps to think of algorithms such as line drawing or ray tracing,
> which are easy to parallelize, yet I believe are still faster using a
> single core instead of multiple, because of the read/write-to-memory
> bottleneck. It does help to bring more workers to the mine if

Um, typo: "It does NOT help to ..."

> only one is allowed access at a time - or, more likely, several are
> allowed, yet it gets so crowded that queues/waiting are inevitable.)

--
twitter.com/olofb olofb.wordpress.com olofb.wordpress.com/tag/english
Re: Cpython optimization
Olof Bjarnason wrote:
> [snip]
> Then one could ask oneself: what is the point of multiple cores, if
> memory bandwidth is the bottleneck? Specifically, what makes one
> expect any speed gain from parallelizing a sequential algorithm into
> four threads, say, when the memory shuffling is the same speed in both
> scenarios?
> [snip]

Modern (multi-core) processors have several levels of caches that
interact with the other cores in different ways. You should read up on
NUMA: http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access

Stefan
Re: Cpython optimization
2009/10/23 Stefan Behnel <stefan...@behnel.de>:
> Olof Bjarnason wrote:
>> [snip]
>> Then one could ask oneself: what is the point of multiple cores, if
>> memory bandwidth is the bottleneck?
>> [snip]
>
> Modern (multi-core) processors have several levels of caches that
> interact with the other cores in different ways. You should read up on
> NUMA: http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access
>
> Stefan

Thanks, that was a good read. Basically NUMA addresses the problem I
mention by creating several primary memories (called banks) - making
the motherboard contain several computers (= CPU + primary memory)
instead of one primary memory and several cores - if I read it
correctly.

--
twitter.com/olofb olofb.wordpress.com olofb.wordpress.com/tag/english
Re: Cpython optimization
On Fri, 23 Oct 2009 09:45:06 +0200, Olof Bjarnason wrote:

> So I think my first question is still interesting: What is the point
> of multiple cores, if memory is the bottleneck?

Why do you think it is, actually? Some workloads are CPU-bound, some
others are memory- or I/O-bound. You will find plenty of benchmarks on
the Web showing that some workloads scale almost linearly with the
number of CPU cores, while some don't.
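To make the CPU-bound case concrete, here is a minimal benchmark sketch
(mine, not Antoine's; the workload sizes are made up) that times the
same amount of pure-arithmetic work on 1, 2 and 4 worker processes.
Because the loop barely touches main memory, it should scale close to
linearly on a multi-core machine:

    import time
    from multiprocessing import Pool

    def burn(n):
        # CPU-bound: a tight arithmetic loop with almost no memory traffic.
        total = 0
        for i in range(n):
            total += i * i
        return total

    if __name__ == "__main__":
        jobs = [2000000] * 8              # eight equal chunks of work
        for procs in (1, 2, 4):
            start = time.time()
            pool = Pool(procs)            # same total work, more workers
            pool.map(burn, jobs)
            pool.close()
            pool.join()
            print("%d worker(s): %.2fs" % (procs, time.time() - start))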
Re: Cpython optimization
2009/10/23 Antoine Pitrou <solip...@pitrou.net>:
> Why do you think it is, actually? Some workloads are CPU-bound, some
> others are memory- or I/O-bound.

Big I/O operations have even less chance of gaining anything from being
parallelized (quite the opposite, I would guess, due to synchronization
overhead). Operations that use very little memory - say, some
mathematical computation that fits in 16 registers - have a lot to gain
from being parallelized, of course. Maybe computing small-formula
fractals is a good example of this?

My question was about the hardest-to-tell case, which is sadly also the
most common: trying to speed up algorithms that work over big data
structures in primary memory - things like drawing lines, doing image
processing, multiplying large matrices, or ray tracing. Basically,
anything with large amounts of input information/buffers/structures
stored in primary memory.

This would be a way to speed things up in an image-processing algorithm
(a sketch of this tiling scheme follows below):

1. divide the image into four subimages
2. let each core process its part independently
3. fix/merge (along the split lines, for example) into a complete
   result image

Notice that no gain would be achieved if the memory was shared among
the cores, at least not if the operation-per-pixel is faster than the
read/write into memory.

> You will find plenty of benchmarks on the Web showing that some
> workloads scale almost linearly to the number of CPU cores, while
> some don't.

--
twitter.com/olofb olofb.wordpress.com olofb.wordpress.com/tag/english
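A minimal sketch of that tiling scheme (an illustration with a made-up
per-pixel operation, not code from the thread; the "image" is modeled
as a list of pixel rows). Note that for a purely per-pixel operation
nothing actually needs fixing along the split lines in step 3 - only
neighborhood operations would need that:

    from multiprocessing import Pool

    def brighten(band):
        # Step 2: a per-pixel operation; each worker touches only its band.
        return [[min(255, p + 40) for p in row] for row in band]

    def split(image, n):
        # Step 1: cut the image into n horizontal bands (lists of rows).
        step = (len(image) + n - 1) // n
        return [image[i:i + step] for i in range(0, len(image), step)]

    if __name__ == "__main__":
        image = [[(x + y) % 256 for x in range(640)] for y in range(480)]
        pool = Pool(4)
        bands = pool.map(brighten, split(image, 4))
        pool.close()
        pool.join()
        result = [row for band in bands for row in band]  # Step 3: merge
        assert len(result) == len(image)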
Re: Cpython optimization
On Fri, 23 Oct 2009 15:53:02 +0200, Olof Bjarnason wrote:

> This would be a way to speed things up in an image-processing
> algorithm:
> 1. divide the image into four subimages
> 2. let each core process its part independently
> 3. fix/merge (along the split lines, for example) into a complete
>    result image

Well, don't assume you're the first to think about that. I'm sure that
performance-conscious image-processing software already has this kind
of tile-based optimization. (Actually, if you look at the benchmarks of
3D rendering that are regularly done by enthusiast websites, they show
exactly that.)

Antoine.
Re: Cpython optimization
>> This would be a way to speed things up in an image-processing
>> algorithm:
>> 1. divide the image into four subimages
>> 2. let each core process its part independently
>> 3. fix/merge (along the split lines, for example) into a complete
>>    result image
>
> Well, don't assume you're the first to think about that. I'm sure that
> performance-conscious image-processing software already has this kind
> of tile-based optimization. (Actually, if you look at the benchmarks
> of 3D rendering that are regularly done by enthusiast websites, they
> show exactly that.)

No, I didn't assume I was the first to think about that - I wanted to
learn more about how optimization is possible/viable at all on
multi-core motherboards, when memory speed is the bottleneck anyway,
regardless of smart caching technologies. I still have not received a
convincing answer :)

> Antoine.

--
twitter.com/olofb olofb.wordpress.com olofb.wordpress.com/tag/english
Re: Cpython optimization
On Fri, Oct 23, 2009 at 2:31 PM, Olof Bjarnason <olof.bjarna...@gmail.com> wrote:
>>> This would be a way to speed things up in an image-processing
>>> algorithm:
>>> 1. divide the image into four subimages
>>> 2. let each core process its part independently
>>> 3. fix/merge (along the split lines, for example) into a complete
>>>    result image
>>
>> Well, don't assume you're the first to think about that. I'm sure
>> that performance-conscious image-processing software already has this
>> kind of tile-based optimization. [snip]

This is indeed a tried-and-true method for parallelizing certain image
and other grid-based algorithms, but it is in fact not appropriate for
a wide variety of techniques. Things like median filters, where f(A|B)
!= f(A)|f(B) (with | as some sort of concatenation), will not be able
to generate correct results given the scheme you outlined.

> No, I didn't assume I was the first to think about that - I wanted to
> learn more about how optimization is possible/viable at all on
> multi-core motherboards, when memory speed is the bottleneck anyway.
> [snip] I still have not received a convincing answer :)

Give Ulrich Drepper's "What Every Programmer Should Know About Memory"
a read (http://people.redhat.com/drepper/cpumemory.pdf) and you'll hear
all you want to know (and more) about how the memory hierarchy plays
with multi-core.

I don't contribute to CPython, but I am virtually certain that they are
not interested in having the compiler/interpreter try to apply some
generic threading to arbitrary code. The vast majority of Python code
wouldn't benefit from it even if it worked well, and I'm _very_
skeptical that there is any silver bullet for parallelizing general
code. If you think of one, tell Intel; they'll hire you.

_Perhaps_ the numpy or scipy people (I am not associated with either of
them) would be interested in some threading for certain array
operations. Maybe you could write something on top of what they have to
speed up array ops.

Jason
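Jason's f(A|B) != f(A)|f(B) point can be made concrete with a toy
one-dimensional median filter (a hypothetical illustration, not his
code): filtering two halves separately and concatenating the results
differs from filtering the whole signal, because samples near the seam
need neighbors from the other half:

    def median3(seq):
        # Width-3 median filter; the two endpoints are left unchanged.
        out = list(seq)
        for i in range(1, len(seq) - 1):
            out[i] = sorted(seq[i - 1:i + 2])[1]
        return out

    a = [5, 1, 5, 9]
    b = [1, 5, 1, 5]
    whole = median3(a + b)            # f(A|B)
    parts = median3(a) + median3(b)   # f(A)|f(B)
    print(whole)   # [5, 5, 5, 5, 5, 1, 5, 5]
    print(parts)   # [5, 5, 5, 9, 1, 1, 5, 5] - differs at the seam
    assert whole != parts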
Re: Cpython optimization
2009/10/22 John Yeung <gallium.arsen...@gmail.com>:
> On Oct 22, 12:28 am, John Nagle <na...@animats.com> wrote:
>> The Shed Skin people would welcome some help.
>> http://shed-skin.blogspot.com/
>
> People? It's one guy. It apparently started out as a Master's thesis
> as well. ;)
> [snip]
> To the original poster: I am not sure if this is too ambitious for
> your time frame, but one thing many real-world Python users would
> LOVE is some way to get rid of the GIL (while still retaining thread
> safety, single-core performance, etc.). If you could patch CPython to
> make it GIL-less, I think you would have academically publishable
> material, as well as being a hero in the Python community. :D
>
> John

A short question, after having read through most of this thread on the
same subject (time-optimizing CPython):
http://mail.python.org/pipermail/python-list/2007-September/098964.html

We are seeing multi-core processors more and more these days, but they
are all still connected to the same main memory, right? To me that
means that even though an algorithm can be split up into several
threads that run on different cores of the processor, any algorithm
will be memory-speed limited. And memory access is a quite common
operation for most algorithms.

Then one could ask oneself: what is the point of multiple cores, if
memory bandwidth is the bottleneck? Specifically, what makes one expect
any speed gain from parallelizing a sequential algorithm into four
threads, say, when the memory shuffling is the same speed in both
scenarios? (Assuming memory access is much slower than ADDs, JMPs and
such instructions - a quite safe assumption, I presume.)

[ If every core had its own primary memory, the situation would be
different. It would be more like a distributed/internet-based system,
spread over several computers. One could actually view each core as a
separate computer. ]

--
twitter.com/olofb olofb.wordpress.com olofb.wordpress.com/tag/english
Cpython optimization
Hello. For my Master's dissertation I chose CPython optimization.
That's why I'd like to ask what your suggestions are as to what could
be optimized. Well, I know that quite a lot could be. I've downloaded
the source code (I plan to work on CPython 2.6, and I've downloaded the
2.6.3 release). Looking through the code I've found comments like "this
can be optimized by..." etc., but maybe you could guide me on what I
should concentrate on in my work? I have 6-7 months for this, and if I
create something decent I can publish it.

Thank you in advance for any help.
Re: Cpython optimization
Francesco Bochicchio wrote:
> As a simple and plain Python user, I would value a version of Cython
> that could be used to build faster executables out of almost-Python
> code (that is, Python code with a few additional restrictions), maybe
> using type inference to avoid declaring variable types explicitly.

Sorry, type inference has already landed and will be in Cython 0.12.
Plus, you do not need to declare variables in Cython, but you may, if
you choose to.

Stefan
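A sketch of what that means in practice (my example, and my reading of
how Cython's inference is described - not Stefan's code): the function
below is plain Python with no declarations at all, yet when compiled
with Cython, locals such as the loop index and the accumulator can be
inferred to C types; adding explicit cdef declarations by hand remains
optional:

    def dot(xs, ys):
        # Plain Python; also valid Cython input with no declarations.
        total = 0.0                   # inferable as a C double
        for i in range(len(xs)):      # i inferable as a C integer
            total += xs[i] * ys[i]
        return total

    print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0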
Re: Cpython optimization
On Wed, Oct 21, 2009 at 1:28 PM, Qrees <qre...@gmail.com> wrote:
> Hello. For my Master's dissertation I chose CPython optimization.
> [snip]

Could always port it to CUDA ;)

Geremy Condra
Re: Cpython optimization
On Wed, 21 Oct 2009 10:28:55 -0700, Qrees wrote:

> Hello. For my Master's dissertation I chose CPython optimization.
> [snip]

If you don't know it yet, you could find this project interesting:
http://code.google.com/p/unladen-swallow/
They too are trying to improve CPython's speed.

If you are thinking of language variations that trade some flexibility
for speed, you might be interested in Cython: http://www.cython.org/
As a simple and plain Python user, I would value a version of Cython
that could be used to build faster executables out of almost-Python
code (that is, Python code with a few additional restrictions), maybe
using type inference to avoid declaring variable types explicitly.

Another interesting place to look is PyPy:
http://codespeak.net/pypy/dist/pypy/doc/
They too have developed a restricted version of Python (RPython, I
think) which should be faster than CPython. They don't work with the
CPython code base, but they could give you ideas about what the
bottlenecks of Python as a language are.

Ciao - FB
Re: Cpython optimization
> If you don't know it yet, you could find this project interesting:
> http://code.google.com/p/unladen-swallow/

I know about this project. I'll have a look at it, but I'd like to
create something of my own.

> They too are trying to improve CPython's speed. If you are thinking
> of language variations that trade some flexibility for speed, you
> might be interested in Cython: http://www.cython.org/

I was thinking about sacrificing some of Python's flexibility, so thank
you for pointing me to this project - I didn't know about it.

> As a simple and plain Python user, I would value a version of Cython
> that could be used to build faster executables out of almost-Python
> code (that is, Python code with a few additional restrictions), maybe
> using type inference to avoid declaring variable types explicitly.
>
> Another interesting place to look is PyPy:
> http://codespeak.net/pypy/dist/pypy/doc/
> They too have developed a restricted version of Python (RPython, I
> think) which should be faster than CPython. They don't work with the
> CPython code base, but they could give you ideas about what the
> bottlenecks of Python as a language are.
>
> Ciao - FB

I'd like to hear a word from the developers about what they think of
all this :)

BTW: my seminar deals with object-oriented programming.
Re: Cpython optimization
Qrees wrote:
> Hello. For my Master's dissertation I chose CPython optimization.
> [snip]

The Shed Skin people would welcome some help:
http://shed-skin.blogspot.com/

Shed Skin is a Python-to-C++ compiler with automatic type inference.
For the programs it can handle, it's the fastest Python implementation
by a large margin.

The basic restriction in Shed Skin is that, while you don't have to
declare types, each variable generally has to stay the same type
throughout its life. (Different subclasses of the same class are OK.)
This is enough to make huge speedups possible.

John Nagle
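A hypothetical illustration of that restriction (mine, not from the
Shed Skin documentation): the first function keeps its variable a float
for its whole life - the kind of code a type-inferring Python-to-C++
compiler can handle; the second rebinds the same name from int to str,
which defeats this style of inference:

    def stable(n):
        x = 0.0               # x is a float here...
        for i in range(n):
            x += i * 0.5      # ...and stays a float: maps to a C++ double
        return x

    def unstable(flag):
        x = 1                 # x starts life as an int...
        if flag:
            x = "one"         # ...then becomes a str: not type-stable
        return x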
Re: Cpython optimization
On Oct 22, 12:28 am, John Nagle <na...@animats.com> wrote:
> The Shed Skin people would welcome some help.
> http://shed-skin.blogspot.com/

People? It's one guy. It apparently started out as a Master's thesis as
well. ;)

I am a great admirer of the Shed Skin project, and I would be as happy
as anyone to see it progress. However, it seems to me that Shed Skin is
at a stage where what it really needs is plain old more work. (I don't
want to call it grunt work, but it's things like more testing,
implementing more library support, maintaining the Windows build, etc.
Very worthy and worthwhile work, but tough to pass off as academic
graduate work.)

To the original poster: I am not sure if this is too ambitious for your
time frame, but one thing many real-world Python users would LOVE is
some way to get rid of the GIL (while still retaining thread safety,
single-core performance, etc.). If you could patch CPython to make it
GIL-less, I think you would have academically publishable material, as
well as being a hero in the Python community. :D

John
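A small timing sketch of what the GIL costs (my own toy example, not
John's): under CPython, two CPU-bound threads take roughly as long as
doing the same work sequentially, because only one thread executes
Python bytecode at a time:

    import time
    import threading

    def spin(n):
        # Pure bytecode work; never releases the GIL for long.
        count = 0
        while count < n:
            count += 1

    N = 10 ** 7

    start = time.time()
    spin(N)
    spin(N)
    print("sequential:  %.2fs" % (time.time() - start))

    start = time.time()
    threads = [threading.Thread(target=spin, args=(N,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("two threads: %.2fs (no speedup under the GIL)" % (time.time() - start))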