On Thu, Jan 02, 2014 at 04:15:06AM -0500, Keith Winston wrote:

> Separately, I'm also curious about how to process big files. For
> example, I was trying to play 100 million games of chutes & ladders,
> and I crashed my machine, I believe: the game results, including 4
> ints & 2 short lists of ints per game, are gathered into a list, so
> it can become a pretty big list.
How to process big files, in order of "best" to "worst" (in my opinion):

(1) Get more memory.

(2) Find a way to process them in small chunks, e.g. a line at a time, rather than trying to hold the entire file in memory at once.

(3) Split them into small files, then process each file in turn.

(4) Pass them to external tools which are optimized for dealing with huge files.

(5) Find a way to process them using mmap.

(6) Write your own tool for handling huge files.

> I need to do stats and other analyses on it in the end (okay, I
> really don't NEED to play 100 million games of chutes & ladders, but
> as long as I have...): I suppose I could break it into manageable
> (maybe 1 million games each), but that will make some of the stats
> either clunky or wrong (I haven't really attacked that part yet).

It shouldn't. Well, it might, but only if you're doing some pretty sophisticated statistical analysis. Most common statistics can be calculated in a single pass through the data. If you need to calculate (say) the mean, variance and standard deviation, it is moderately straightforward to calculate all three in a single pass without needing to have all the data in memory at once. (Feel free to ask for more detail if you wish.)

Even statistics like the median, which normally requires the entire data set to be in memory so it can be sorted, can be calculated with a couple of passes through the data. I think there is even a single-pass algorithm for it.

> And since I'm not REALLY ready to ask this question, I'll tack it on
> to the end... I'm also beginning to think about how to speed it up:
> I'm imagining my two options are going to be to code some sections
> in a faster language (i.e. C), or maybe to introduce multi-threading
> since I'm working on a multicore machine generally (core I7), and
> I'm doing a lot of iterations of the same thing with no important
> order... seems like a good candidate.
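To give you an idea of option (2), here is a rough sketch. The file name, the `total_moves` function and the one-game-per-line format are all made up for illustration; the point is that iterating over a file object reads one line at a time, so memory use stays small no matter how big the file is:

```python
# Sketch of option (2): process a results file a line at a time instead
# of building one giant list. Assumes a hypothetical file where each
# line holds one game's numbers, separated by spaces.
def total_moves(filename):
    total = 0
    with open(filename) as f:
        for line in f:              # the file object yields one line at a time
            fields = line.split()
            total += int(fields[0])  # e.g. first field = number of spins
    return total
```

You can compute running totals, counts, minimums and maximums this way without ever holding more than one line in memory.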
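Since you asked about the stats: here is the sort of single-pass calculation I mean, using Welford's method (the function name and variable names are my own). It accepts any iterable, including a generator reading game results a line at a time, so the data never needs to fit in memory:

```python
import math

def running_stats(values):
    """Mean, population variance and standard deviation in one pass,
    without storing the data (Welford's method)."""
    n = 0
    mean = 0.0
    m2 = 0.0   # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / n if n else 0.0
    return mean, variance, math.sqrt(variance)
```

This formulation is also numerically better behaved than the naive "sum of squares minus square of sums" approach.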
A game of Chutes and Ladders (or, as we call it in Australia, Snakes and Ladders) surely doesn't need to be optimized. It will spend most of its time waiting for the human user, who is probably a hundred thousand or a million times slower than the computer. But let's say you want to make it faster regardless.

How to make it faster:

(1) Use a smarter algorithm or a less wasteful implementation.

(2) Get more memory.

(3) Get a faster computer.

(4) I think what you are trying to do is a good candidate for PyPy, the optimizing Python JIT compiler, especially if you are playing multiple millions of games.

(5) Profile, profile, profile, profile. Only once you have profiled your code can you possibly hope to guess where the bottlenecks are. After 15 or 20 years of programming with Python, my guesses for the bottlenecks are still wrong more often than they are right.

Assuming you identify the actual bottlenecks:

(6) Try writing them in Cython, which will often give you good speedups.

(7) Or write a C extension.

Should you use threads? Probably not.

- Unless the bottleneck in your code is disk or network I/O, threads will not help; they'll just make your program slower.

- Unless you use IronPython or Jython instead, in which case threads *might* help. But probably not as much as you think.

- If your bottleneck is CPU processing, you can use the multiprocessing module instead of threads.

- You did profile your code first, didn't you?

> Now, I'm probably pretty far from that piece (in my learning
> process), but this is moving along pretty well so I'm open to
> suggestions about how to proceed. I've started switching up my code
> a fair bit to try to make it more OOP, though I'm still rough on
> that part.

Advice about optimization from some experts:

http://en.wikipedia.org/wiki/Program_optimization#Quotes

My favourite quote is:

    The First Rule of Program Optimization: Don't do it.
    The Second Rule of Program Optimization (for experts only!):
    Don't do it yet.
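To show what I mean by (5), profiling is one line with the standard library's cProfile module. The game below is a toy stand-in (the JUMPS table is a made-up subset, not the real board), but the profiling call is the real thing:

```python
import cProfile
import random

# Made-up partial board: square -> square after a chute or ladder.
JUMPS = {1: 38, 4: 14, 16: 6, 47: 26, 49: 11, 87: 24}

def play_game():
    """Play one game; return the number of spins taken to reach 100."""
    pos = 0
    spins = 0
    while pos < 100:
        pos = min(pos + random.randint(1, 6), 100)
        pos = JUMPS.get(pos, pos)
        spins += 1
    return spins

def play_many(n):
    return sum(play_game() for _ in range(n))

if __name__ == "__main__":
    # Sort by cumulative time to see where the time really goes.
    cProfile.run("play_many(10000)", sort="cumulative")
```

The report will tell you how many times each function was called and how long it took, which beats guessing every time.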
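And since each game is independent with no important order, the multiprocessing module fits your problem well. A minimal sketch (again with my made-up toy board; `play_game` takes a dummy argument because `Pool.map` passes one):

```python
import random
from multiprocessing import Pool

# Made-up partial board, for illustration only.
JUMPS = {1: 38, 4: 14, 16: 6, 47: 26, 49: 11, 87: 24}

def play_game(_):
    """One independent game: no shared state, so it parallelizes cleanly."""
    pos = spins = 0
    while pos < 100:
        pos = min(pos + random.randint(1, 6), 100)
        pos = JUMPS.get(pos, pos)
        spins += 1
    return spins

if __name__ == "__main__":
    with Pool() as pool:    # one worker process per CPU core by default
        results = pool.map(play_game, range(10000), chunksize=500)
    print(sum(results) / len(results))   # average game length
```

Note the `if __name__ == "__main__"` guard: multiprocessing needs it so the worker processes don't re-run the pool setup when they import your module. Unlike threads, separate processes are not limited by the GIL, so on your i7 this really can use all the cores.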
-- 
Steven

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor