Re: [Tutor] threading mind set
On Mon, 2012-05-14 at 10:31 +1000, Steven D'Aprano wrote:
[...]
> Not hard compared to what?

Compared to sequential programming.

[...]
> My argument is that once you move beyond the one-operation-after-another
> programming model, almost any parallel processing problem is harder than the
> equivalent sequential version, inherently due to the parallelism. Except
> perhaps for "embarrassingly parallel" problems, parallelism adds complexity
> even if your framework abstracts away most of the tedious detail like
> semaphores.
>
> http://en.wikipedia.org/wiki/Embarrassingly_parallel
>
> Once you move beyond sequential execution, you have to think about issues that
> don't apply to sequential programs: how to divide the task up between
> processes/threads/actors/whatever, how to manage their synchronization,
> resource starvation (e.g. deadlocks, livelocks), etc.

Actor systems, dataflow systems and CSP (Communicating Sequential Processes) do not guarantee freedom from deadlock or livelock, but the whole "processes communicating by passing messages, not by sharing data" approach makes it hugely easier to reason about what is happening. Moreover, if, as with CSP, your actors or dataflow operators are themselves sequential, it gets even better.

The secret to parallel processing (in general; there are always exceptions and corner cases) is to write sequential bits that then communicate using queues or channels. No semaphores. No locks. No monitors. Those are tools for operating systems folk and for the folk creating the actor, dataflow and CSP queues and channels.

> We have linear minds and it doesn't take that many real-time parallel tasks to
> overwhelm the human brain. I'm not saying that people can't reason in
> parallel, because we clearly can and do, but it's inherently harder than
> sequential reasoning.

I think if you delve into the psychology of it, our minds are far from linear. Certainly at the electro-chemical level the brain is a massively parallel machine.
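[Editorial illustration: the "sequential bits communicating over channels, no locks" style above can be sketched with nothing but the Python standard library. This is not code from the thread; the producer/consumer functions are made up, and a `None` sentinel stands in for a proper channel-close operation.]

```python
import queue
import threading

def producer(out_q):
    # A purely sequential bit: it touches only its own data and the
    # channel, never shared mutable state.
    for i in range(5):
        out_q.put(i * i)
    out_q.put(None)  # sentinel: no more work

def consumer(in_q, results):
    # Another sequential bit; the queue does all the synchronization.
    while True:
        item = in_q.get()
        if item is None:
            break
        results.append(item)

channel = queue.Queue()
results = []
t1 = threading.Thread(target=producer, args=(channel,))
t2 = threading.Thread(target=consumer, args=(channel, results))
t1.start()
t2.start()
t1.join()
t2.join()
print(results)  # [0, 1, 4, 9, 16]
```

No semaphores or locks appear in user code; `queue.Queue` is the only synchronization primitive, playing the role of a CSP channel.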
Over the last 50 years we have enshrined single processor, single memory in our entire thinking about computing and programming. Our education systems enforce sequential programming for all but the final parallel programming option. The main reason parallel programming is labelled hard is that we have the wrong tools for reasoning about it.

This is the beauty of the 1960s/1970s models of actors, dataflow and CSP: you decompose the problem into small bits, each of which is sequential and comprehensible, and the overall behaviour of the system is an emergent property of the interaction between these small subsystems. Instead of trying to reason about all the communication system-wide, we just worry about what happens within a small subsystem. The hard part is the decomposition. But then the hard part of software has always been the algorithm.

You highlight "embarrassingly parallel", which is the simplest decomposition possible: straight scatter/gather, aka map/reduce. More often than not this is handled by a façade such as "parallel reduce". It is perhaps worth noting that "Big Data" is moving to dataflow processing in a "Big Way" :-) Data mining and the like have been revolutionized by changing their perception of algorithm and of how to decompose problems.

[...]
> Python doesn't have a GIL. Some Python implementations do, most obviously
> CPython, the reference implementation. But Jython and IronPython don't. If the
> GIL is a problem for your program, consider running it on Jython or
> IronPython.

It is true that Python doesn't have a GIL, thanks for the correction. CPython and (until recently) PyPy have a GIL. The PyPy folk are experimenting with software transactional memory (STM) in the interpreter to be able to remove the GIL. To date things are looking very positive.
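[Editorial illustration: the scatter/gather (map/reduce) shape is exactly what `multiprocessing.Pool` provides as a façade. A minimal sketch, not from the thread; the `square` task is hypothetical.]

```python
from multiprocessing import Pool

def square(n):
    # The sequential bit each worker runs independently.
    return n * n

if __name__ == '__main__':
    with Pool(4) as pool:
        # scatter the inputs across worker processes, gather the results
        mapped = pool.map(square, range(10))
    # the "reduce" step is an ordinary sequential fold over the results
    print(sum(mapped))  # 285
```

The decomposition here is trivial (every input is independent), which is precisely why "embarrassingly parallel" problems are the easy case.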
PyPy will rock :-)

Although Guido has said (EuroPython 2010) that he is happy to continue with the GIL in CPython, there are subversive elements (notably the PyPy folk) who are trying to show that STM will work with CPython as well.

Jython is sadly lagging behind in terms of versions of Python supported and is increasingly becoming irrelevant -- unless someone does something soon. Groovy, JRuby and Clojure are the dynamic languages of choice on the JVM.

IronPython is an interesting option, except that there is all the FUD about use of the CLR and having to pay extortion^H^H^H^H^H^H^H^H^H licencing money to Microsoft. Also, Microsoft ceasing to fund IronPython (and IronRuby) is a clear indicator that Microsoft have no intention of supporting use of Python on the CLR. Thus it could end up in the same state as Jython.

--
Russel.
=============================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.win...@ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: rus...@winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

signature.asc
Description: This is a digitally signed message part
Re: [Tutor] threading mind set
On Sun, May 13, 2012 at 8:31 PM, Steven D'Aprano wrote:
>> Using processes and message passing, using dataflow, actors or CSP,
>> parallelism and concurrency is far more straightforward. Not easy,
>> agreed, but then programming isn't easy.
>
> My argument is that once you move beyond the one-operation-after-another
> programming model, almost any parallel processing problem is harder than the
> equivalent sequential version, inherently due to the parallelism. Except
> perhaps for "embarrassingly parallel" problems, parallelism adds complexity
> even if your framework abstracts away most of the tedious detail like
> semaphores.

If you agree that embarrassingly parallel multithreaded frameworks are easy, what do you think of dataflow programming? It is exactly the same, except that you can have multiple tasks, where one task depends on the output of a previous task. It shares the property that it makes no difference in what order things are executed (or sequential vs parallel), so long as the data dependencies are respected -- so it's another case where you don't actually have to think in a non-sequential manner. (Rather, think in a "vectorized" per-work-item manner.)

http://en.wikipedia.org/wiki/Dataflow_programming

It should be clear that not all ways of programming multithreaded code are equal, and some are easier than others. In particular, having mutable state shared between two concurrently-executing procedures is phenomenally hard, and when it's avoided things become simpler.

--
Devin

___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
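[Editorial illustration: the "one task depends on the output of a previous task" point can be sketched with `concurrent.futures`. The `load`/`double`/`total` tasks are made up for the example; the data dependencies, not the submission order, determine what may overlap.]

```python
from concurrent.futures import ThreadPoolExecutor

# Each task is an ordinary sequential function; the dataflow "wiring"
# is simply which results feed which inputs.
def load():
    return [1, 2, 3, 4]

def double(xs):
    return [x * 2 for x in xs]

def total(xs):
    return sum(xs)

with ThreadPoolExecutor() as pool:
    f_load = pool.submit(load)
    # Calling .result() expresses the data dependency explicitly;
    # tasks with no dependency between them would run concurrently.
    f_double = pool.submit(double, f_load.result())
    f_total = pool.submit(total, f_double.result())
    print(f_total.result())  # 20
```

The same code gives the same answer run sequentially or in parallel, which is the order-independence property described above.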
Re: [Tutor] threading mind set
Russel Winder wrote:
> Steven,
>
> On Sun, 2012-05-13 at 10:22 +1000, Steven D'Aprano wrote:
>> carlo locci wrote:
>>> Hello All,
>>> I've started to study python a couple of months ago (and I truly love
>>> it :)), however I'm having some problems understanding how to modify a
>>> sequential script and make it multithreaded (I think it's because I'm
>>> not used to think in that way),
>>
>> No, that's because multithreading and parallel processing is hard.
>
> Shared memory multithreading may be hard due to locks, semaphores,
> monitors, etc., but concurrency and parallelism need not be hard.

Not hard compared to what?

> Using processes and message passing, using dataflow, actors or CSP,
> parallelism and concurrency is far more straightforward. Not easy,
> agreed, but then programming isn't easy.

My argument is that once you move beyond the one-operation-after-another programming model, almost any parallel processing problem is harder than the equivalent sequential version, inherently due to the parallelism. Except perhaps for "embarrassingly parallel" problems, parallelism adds complexity even if your framework abstracts away most of the tedious detail like semaphores.

http://en.wikipedia.org/wiki/Embarrassingly_parallel

Once you move beyond sequential execution, you have to think about issues that don't apply to sequential programs: how to divide the task up between processes/threads/actors/whatever, how to manage their synchronization, resource starvation (e.g. deadlocks, livelocks), etc.

We have linear minds and it doesn't take that many real-time parallel tasks to overwhelm the human brain. I'm not saying that people can't reason in parallel, because we clearly can and do, but it's inherently harder than sequential reasoning.

> The GIL in Python is a bad thing for parallelism. Using the
> multiprocessing package or concurrent.futures gets over the problem.
> Well sort of, these processes are a bit heavyweight compared to what
> can be achieved on the JVM or with Erlang.

Python doesn't have a GIL. Some Python implementations do, most obviously CPython, the reference implementation. But Jython and IronPython don't. If the GIL is a problem for your program, consider running it on Jython or IronPython.

--
Steven
Re: [Tutor] threading mind set
Steven,

On Sun, 2012-05-13 at 10:22 +1000, Steven D'Aprano wrote:
> carlo locci wrote:
>> Hello All,
>> I've started to study python a couple of months ago (and I truly love
>> it :)), however I'm having some problems understanding how to modify a
>> sequential script and make it multithreaded (I think it's because I'm
>> not used to think in that way),
>
> No, that's because multithreading and parallel processing is hard.

Shared memory multithreading may be hard due to locks, semaphores, monitors, etc., but concurrency and parallelism need not be hard. Using processes and message passing, using dataflow, actors or CSP, parallelism and concurrency is far more straightforward. Not easy, agreed, but then programming isn't easy.

>> as well as when it's best to use it (some say that
>> because of the GIL I won't get any real benefit from threading my script).
>
> That depends on what your script does.
>
> In a nutshell, if your program is limited by CPU processing, then using
> threads in Python won't help. (There are other things you can do instead,
> such as launching new Python processes.)

The GIL in Python is a bad thing for parallelism. Using the multiprocessing package or concurrent.futures gets over the problem. Well, sort of: these processes are a bit heavyweight compared to what can be achieved on the JVM or with Erlang.

> If your program is limited by disk or network I/O, then there is a
> possibility you can speed it up with threads.

Or better still use an event based system, cf. Twisted.

[...]

--
Russel.
=============================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.win...@ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: rus...@winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

signature.asc
Description: This is a digitally signed message part
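[Editorial illustration: "launching new Python processes" to get around the GIL for CPU-bound work, via `concurrent.futures.ProcessPoolExecutor`. A sketch with a hypothetical `cpu_bound` task, not code from the thread.]

```python
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n):
    # CPU-bound work: each call runs in its own process with its own
    # interpreter, so one process's GIL cannot serialize the others.
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    # The __main__ guard matters: child processes may re-import this module.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(cpu_bound, [10, 100, 1000]))
    print(results)
```

The "heavyweight" caveat above is real: process start-up and result pickling cost far more than a JVM thread or an Erlang process, so the per-task work needs to be large enough to pay for it.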
Re: [Tutor] threading mind set
bob gailer wrote:
> On 5/12/2012 8:22 PM, Steven D'Aprano wrote:
>> By the way, in future, please don't decorate your code with stars:
>
> I think you got stars because the code was posted in HTML and bolded.
> Plain text readers add the * to show emphasis.

I think you have it the other way around: if you add asterisks around text, some plain text readers hide the * and bold the text. At least, I've never seen anything which does it the other way around. (Possibly until now.)

In any case, I'm using Thunderbird, and it does NOT show stars around text unless they are already there. When I look at the raw email source, I can see the asterisks there. Perhaps Carlo's mail client is trying to be helpful, and failing miserably. While converting HTML tags into simple markup is a nice thing to do for plain text, it plays havoc with code.

--
Steven
Re: [Tutor] threading mind set
def read(): ...

A couple of observations:

1 - it is customary to put all import statements at the beginning of the file.
2 - it is customary to begin variable and function names with a lower case letter.
3 - it is better to avoid using built-in function names and common method names (e.g. read).

def read():
    import csv
    with open('C:\\test\\VDB.csv', 'rb') as somefile:
        read = csv.reader(somefile)
        l = []
        for row in read:
            l += row
        return l

def DirGetSize(cartella):
    import os
    cartella_size = 0
    for (path, dirs, files) in os.walk(cartella):
        for x in files:
            filename = os.path.join(path, x)
            cartella_size += os.path.getsize(filename)
    return cartella_size

import os.path
for x in read():
    if not os.path.exists(x):
        print ' DOES NOT EXIST ON', x
    else:
        S = DirGetSize(x)
        print 'the file size of', x, 'is', S

--
Bob Gailer
919-636-4239
Chapel Hill NC
Re: [Tutor] threading mind set
On 5/12/2012 8:22 PM, Steven D'Aprano wrote:
> By the way, in future, please don't decorate your code with stars:

I think you got stars because the code was posted in HTML and bolded. Plain text readers add the * to show emphasis. When I copied and pasted the code it came out fine.

carlo: in future please post plain text rather than HTML.

--
Bob Gailer
919-636-4239
Chapel Hill NC
Re: [Tutor] threading mind set
carlo locci wrote:
> Hello All,
> I've started to study python a couple of months ago (and I truly love
> it :)), however I'm having some problems understanding how to modify a
> sequential script and make it multithreaded (I think it's because I'm
> not used to think in that way),

No, that's because multithreading and parallel processing is hard.

> as well as when it's best to use it (some say that
> because of the GIL I won't get any real benefit from threading my script).

That depends on what your script does.

In a nutshell, if your program is limited by CPU processing, then using threads in Python won't help. (There are other things you can do instead, such as launching new Python processes.)

If your program is limited by disk or network I/O, then there is a possibility you can speed it up with threads.

> It's my understanding that threading a program in python can be useful
> when we've got some I/O involved,

To see the benefit of threads, it's not enough to have "some" I/O, you need *lots* of I/O. Threads have some overhead. Unless you save at least as much time as just starting and managing the threads consumes, you won't see any speed up.

In my experience, for what little it's worth [emphasis on "little"], unless you can keep at least four threads busy doing separate I/O, it probably isn't worth the time and effort. And it's probably not worth it for trivial scripts -- who cares if you speed your script up from 0.2 seconds to 0.1 seconds?

But as a learning exercise, sure, go ahead and convert your script to threads. One experiment is worth a dozen opinions. You can learn more about threading from here:

http://www.doughellmann.com/PyMOTW/threading/

By the way, in future, please don't decorate your code with stars:

* def read():*
*import csv*
*with open('C:\\test\\VDB.csv', 'rb') as somefile:*
[...]

We should be able to copy and paste your code and have it run immediately, not have to spend time editing it by hand to turn it back into valid Python code that doesn't give a SyntaxError on every line.

See also this: http://sscce.org/

--
Steven
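[Editorial illustration: the "lots of I/O" point above, sketched with simulated I/O waits. `time.sleep` releases the GIL just as a real disk or network wait would, so the eight waits overlap; the numbers are made up for the demonstration.]

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(delay):
    # Simulated I/O: the thread blocks (and releases the GIL)
    # as it would while waiting on a socket or a disk read.
    time.sleep(delay)
    return delay

delays = [0.1] * 8  # eight "requests" of 0.1s each

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, delays))
elapsed = time.perf_counter() - start

# Run sequentially this would take about 0.8s; overlapped in
# threads it takes roughly the longest single wait, ~0.1s.
print('%.2fs for %d waits' % (elapsed, len(results)))
```

With CPU-bound work in place of the sleeps, the GIL would serialize the threads and the speed-up would vanish, which is exactly the distinction drawn above.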
[Tutor] threading mind set
Hello All,
I've started to study python a couple of months ago (and I truly love it :)), however I'm having some problems understanding how to modify a sequential script and make it multithreaded (I think it's because I'm not used to think in that way), as well as when it's best to use it (some say that because of the GIL I won't get any real benefit from threading my script).

It's my understanding that threading a program in python can be useful when we've got some I/O involved, so here is my case: I wrote a quite simple script that reads the first column from a csv file and inserts every row of the value into a tuple, then I created a function which gets me the size of a given path/folder, and I made it loop so that it'll print the folder dimension of each path in the tuple previously created. Here's the code:

* def read():*
*import csv*
*with open('C:\\test\\VDB.csv', 'rb') as somefile:*
*read = csv.reader(somefile)*
*l = []*
*for row in read:*
*l += row*
*return l*
* *
*def DirGetSize(cartella):*
*import os*
*cartella_size = 0*
*for (path, dirs, files) in os.walk(cartella):*
*for x in files:*
*filename = os.path.join(path, x)*
*cartella_size += os.path.getsize(filename)*
*return cartella_size*
* *
*import os.path*
*for x in read():*
*if not os.path.exists(x):*
*print ' DOES NOT EXIST ON', x*
*else:*
*S = DirGetSize(x)*
*print 'the file size of', x, 'is',S*
* *

The script works quite well (at least it does what I want), but my real question is: will I gain any better performance, in terms of speed, out of it if I multithread it? The csv file contains a list of server/path/folder entries, therefore I thought that if I multithread it, it's gonna become much faster since it will perform the *DirGetSize* function almost concurrently, although I'm quite confused by the subject, so I'm not really sure.
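[Editorial illustration: one way the loop above could be threaded, since each directory walk is I/O bound. This is a Python 3 sketch, not a reply from the thread; the `paths` list is a hypothetical stand-in for the values read from the CSV file.]

```python
import os
import os.path
from concurrent.futures import ThreadPoolExecutor

def dir_get_size(cartella):
    # Same walk as the original DirGetSize, renamed per convention.
    total = 0
    for path, dirs, files in os.walk(cartella):
        for name in files:
            total += os.path.getsize(os.path.join(path, name))
    return total

def report(path):
    # One self-contained, sequential task per path: no shared state.
    if not os.path.exists(path):
        return path, None
    return path, dir_get_size(path)

# In the real script these would come from read(); hypothetical here.
paths = ['C:\\test\\a', 'C:\\test\\b']

# Several walks can overlap, since each spends most of its time
# waiting on the filesystem (or, better, on a network share).
with ThreadPoolExecutor(max_workers=4) as pool:
    for path, size in pool.map(report, paths):
        if size is None:
            print('DOES NOT EXIST:', path)
        else:
            print('the folder size of', path, 'is', size)
```

Whether this is actually faster depends on the I/O: walking many remote servers concurrently should win, while hammering a single local disk from four threads may not.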
I would really appreciate anyone who could help me understand when it's useful to implement a multithreaded script and when it's not, and why :) (maybe I'm asking too much), as well as any good resources I can study from.

Thank you in advance to anyone who replies to me, and thank you for having such a mailing list (I discovered it when I watched a Google I/O conference on YouTube).

Thank you guys.