Re: [Tutor] Shelve & immutable objects
> Separately, I'm also curious about how to process big files. For example, I
> was trying to play 100 million games of chutes & ladders

Instead of actually playing all 100,000,000 games, you could either research the published numbers, or try an algorithm that plays intervals of games and narrows down the smallest interval size that still gives numerically decisive results for chutes & ladders.

___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Shelve & immutable objects
> Separately, I'm also curious about how to process big files. For example, I
> was trying to play 100 million games of chutes & ladders, and I crashed my
> machine, I believe: the game results, including 4 ints & 2 short lists of
> ints per game, are gathered into a list, so it can become a pretty big list.
> I need to do stats and other analyses on it in the end (okay, I really don't
> NEED to play 100 million games of chutes & ladders, but as long as I
> have...): I suppose I could break it into manageable (maybe 1 million games
> each), but that will make some of the stats either clunky or wrong (I
> haven't really attacked that part yet).

This is probably one of those cases where you don't want to persist this in active memory, but rather store it in some kind of long-term store. We can do a back-of-the-envelope calculation to see why.

An int's native size is 4 bytes on 32-bit machines. This is Python, so there's a bit more overhead. Let's roughly say that each record is about 10 ints, so about 40 bytes.

100 * 10^6 records * 40 bytes per record = 4 * 10^9 bytes.

###
>>> import humanize
>>> humanize.intword(4*10**9)
'4 billion'
###

So that's about 4 billion bytes, as a very rough guess. Unfortunately, at that magnitude, the data is way too large to fit into a standard computer's memory. A 32-bit machine can only address up to about 4 billion bytes:

###
>>> humanize.intword(2**32)
'4.3 billion'
###

So trying to juggle all those records in short-term RAM is a non-starter: it won't work. We'll want to do something different instead, such as saving each record into a persistent, on-disk, external database.

If you're interested, bring this up as a separate topic on python-tutor, and I'm sure folks here will be happy to talk about it. Good luck!
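A minimal sketch of that "persistent on-disk database" suggestion, using sqlite3 from the standard library. Note that `play_game` here is a hypothetical stand-in for the real Chutes & Ladders simulation; the point is that results stream straight to disk instead of accumulating in a giant list:

```python
import random
import sqlite3

def play_game():
    # Hypothetical stand-in for the real simulation: returns
    # (turns taken, chutes hit, ladders hit) for one game.
    turns = random.randint(7, 200)
    return turns, random.randint(0, 10), random.randint(0, 10)

conn = sqlite3.connect("games.db")  # on-disk store, not RAM
conn.execute("CREATE TABLE IF NOT EXISTS games (turns INT, chutes INT, ladders INT)")

# Each result is written out immediately, so memory use stays flat
# no matter how many games are played.
for _ in range(100000):
    conn.execute("INSERT INTO games VALUES (?, ?, ?)", play_game())
conn.commit()

# Let the database compute the statistics instead of a giant list:
print(conn.execute("SELECT AVG(turns), MIN(turns), MAX(turns) FROM games").fetchone())
conn.close()
```

Scaling the loop from 100 thousand to 100 million games changes the runtime, but not the memory footprint.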
Re: [Tutor] Shelve & immutable objects
On Thu, Jan 2, 2014 at 4:15 AM, Keith Winston wrote:
> Thanks for all this Eryksun (and Mark!), but... I don't understand why you
> brought gdbm in? Is it something underlying shelve, or a better approach, or
> something else? That last part really puts me in a pickle, and I don't
> understand why.

A Shelf is backed by a container with the following mapping methods:

    keys
    __contains__
    __getitem__
    __setitem__
    __delitem__
    __len__

Shelf will also try to call `close` and `sync` on the container if available. For some reason no one has made Shelf into a context manager (i.e. __enter__ and __exit__), so remember to close() it.

For demonstration purposes, you can use a dict with Shelf:

    >>> sh = shelve.Shelf(dict={})
    >>> sh['alist'] = [1, 2, 3]

The mapping is referenced in the (badly named) `dict` attribute:

    >>> sh.dict
    {b'alist': b'\x80\x03]q\x00(K\x01K\x02K\x03e.'}

Keys are encoded as bytes (UTF-8 by default) and the value is serialized using pickle. This is to support using a database from the dbm module. shelve.open returns an instance of shelve.DbfilenameShelf, which is a subclass of Shelf specialized to open a dbm database.

Here's an overview of Unix dbm databases that Google turned up:

http://www.unixpapa.com/incnote/dbm.html

Note the size restrictions for keys and values in ndbm, which gdbm doesn't have. Using gdbm lifts the restriction on the size of pickled objects (the docs vaguely suggest keeping them "fairly small"). Unfortunately gdbm isn't always available. On my system, dbm defaults to creating a _gdbm.gdbm database, where _gdbm is an extension module that wraps the GNU gdbm library (e.g. libgdbm.so.3).

You can use a different database with Shelf (or a subclass), so long as it has the required methods. For example, shelve.BsdDbShelf is available for use with pybsddb (Debian package "python3-bsddb3"). It exposes the bsddb3 database methods `first`, `next`, `previous`, `last` and `set_location`.

> Separately, I'm also curious about how to process big files.
> ...
> I'm also beginning to think about how to speed it up:

I defer to Steven's sage advice. Look into databases such as sqlite3, as well as numpy, and also add the multiprocessing and concurrent.futures modules to your todo list. Even if you know C/C++, I suggest you use Cython to create CPython extension modules.

http://www.cython.org
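Incidentally, since Shelf doesn't act as a context manager itself (as noted above), contextlib.closing is a convenient way to make sure close() still happens. A small sketch, reusing the dict-backed Shelf trick from earlier in this message:

```python
import shelve
from contextlib import closing

# A plain dict works as the backing mapping for demonstration purposes.
with closing(shelve.Shelf({})) as sh:
    sh['alist'] = [1, 2, 3]
    print(sh['alist'])   # values round-trip through pickle -> [1, 2, 3]
# close() has been called for us here, even if an exception was raised.
```

The same pattern works with shelve.open, since DbfilenameShelf also has a close() method.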
Re: [Tutor] Shelve & immutable objects
On Thu, Jan 02, 2014 at 04:15:06AM -0500, Keith Winston wrote:
> Separately, I'm also curious about how to process big files. For example, I
> was trying to play 100 million games of chutes & ladders, and I crashed my
> machine, I believe: the game results, including 4 ints & 2 short lists of
> ints per game, are gathered into a list, so it can become a pretty big
> list.

How to process big files, in order of "best" to "worst" (in my opinion):

(1) Get more memory.

(2) Find a way to process them in small chunks, e.g. a line at a time, rather than trying to hold the entire file in memory at once.

(3) Split them into small files, then process each file in turn.

(4) Pass them to external tools which are optimized for dealing with huge files.

(5) Find a way to process them using mmap.

(6) Write your own tool for handling huge files.

> I need to do stats and other analyses on it in the end (okay, I
> really don't NEED to play 100 million games of chutes & ladders, but as
> long as I have...): I suppose I could break it into manageable (maybe 1
> million games each), but that will make some of the stats either clunky or
> wrong (I haven't really attacked that part yet).

It shouldn't. Well, it might, but only if you're doing some pretty sophisticated statistical analysis. Most common statistics can be calculated in a single pass through the data. If you need to calculate (say) the mean, variance and standard deviation, it is moderately straightforward to calculate all three in a single pass without needing to have all the data in memory at once. (Feel free to ask for more detail if you wish.)

Even statistics like the median, which normally requires the entire data set to be in memory so it can be sorted, can be calculated with a couple of passes through the data. I think there is even a single-pass algorithm for it.

> And since I'm not REALLY ready to ask this question, I'll tack it on to the
> end...
> I'm also beginning to think about how to speed it up: I'm imagining
> my two options are going to be to code some sections in a faster language
> (i.e. C), or maybe to introduce multi-threading since I'm working on a
> multicore machine generally (core I7), and I'm doing a lot of iterations of
> the same thing with no important order... seems like a good candidate.

A game of Chutes and Ladders (or as we call it in Australia, Snakes and Ladders) surely doesn't need to be optimized. It will spend most of its time waiting for the human user, who is probably a hundred thousand or a million times slower than the computer. But let's say you want to make it faster regardless.

How to make it faster:

(1) Use a smarter algorithm or a less wasteful implementation.

(2) Get more memory.

(3) Get a faster computer.

(4) I think what you are trying to do is a good candidate for PyPy, the optimizing Python JIT compiler, especially if you are playing multiple millions of games.

(5) Profile, profile, profile, profile. Only once you have profiled your code can you possibly hope to guess where the bottlenecks are. After 15 or 20 years of programming with Python, my guesses for the bottlenecks are still wrong more often than they are right.

Assuming you identify the actual bottlenecks:

(6) Try writing them in Cython, which will often give you good speedups.

(7) Or write a C extension.

Should you use threads? Probably not.

- Unless the bottleneck in your code is disk or network I/O, threads will not help; they'll just make your program slower.

- Unless you use IronPython or Jython instead, in which case threads *might* help. But probably not as much as you think.

- If your bottleneck is CPU processing, you can use the multiprocessing module instead of threads.

- You did profile your code first, didn't you?

> Now,
> I'm probably pretty far from that piece (in my learning process), but this
> is moving along pretty well so I'm open to suggestions about how to
> proceed.
> I've started switching up my code a fair bit to try to make it
> more OOP, though I'm still rough on that part.

Advice about optimization from some experts:

http://en.wikipedia.org/wiki/Program_optimization#Quotes

My favourite quote is:

    The First Rule of Program Optimization: Don't do it.
    The Second Rule of Program Optimization (for experts only!): Don't do it yet.

-- Steven
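[For readers curious about the single-pass mean/variance mentioned above: the standard technique is Welford's algorithm. A self-contained sketch, not specific to this thread:]

```python
import math

def running_stats(iterable):
    """Single-pass mean, sample variance and standard deviation
    using Welford's algorithm; never holds the data in memory."""
    n = 0
    mean = 0.0
    m2 = 0.0   # running sum of squared deviations from the current mean
    for x in iterable:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else 0.0
    return mean, variance, math.sqrt(variance)

# Works on a generator, so 100 million values never need to exist at once:
mean, var, sd = running_stats(i % 7 for i in range(1000000))
print(mean, var, sd)
```

Because it consumes any iterable one value at a time, this composes naturally with a generator that plays games on the fly.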
Re: [Tutor] Shelve & immutable objects
Thanks for all this Eryksun (and Mark!), but... I don't understand why you brought gdbm in? Is it something underlying shelve, or a better approach, or something else? That last part really puts me in a pickle, and I don't understand why.

Separately, I'm also curious about how to process big files. For example, I was trying to play 100 million games of chutes & ladders, and I crashed my machine, I believe: the game results, including 4 ints & 2 short lists of ints per game, are gathered into a list, so it can become a pretty big list. I need to do stats and other analyses on it in the end (okay, I really don't NEED to play 100 million games of chutes & ladders, but as long as I have...): I suppose I could break it into manageable pieces (maybe 1 million games each), but that will make some of the stats either clunky or wrong (I haven't really attacked that part yet).

And since I'm not REALLY ready to ask this question, I'll tack it on to the end... I'm also beginning to think about how to speed it up: I'm imagining my two options are going to be to code some sections in a faster language (i.e. C), or maybe to introduce multi-threading, since I'm working on a multicore machine generally (Core i7), and I'm doing a lot of iterations of the same thing with no important order... seems like a good candidate. Now, I'm probably pretty far from that piece (in my learning process), but this is moving along pretty well, so I'm open to suggestions about how to proceed. I've started switching up my code a fair bit to try to make it more OOP, though I'm still rough on that part.

K
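[The "lots of iterations with no important order" observation above is exactly the shape of problem the multiprocessing module handles well. A minimal sketch; `play_game` here is a hypothetical stand-in for the real simulation, not code from this thread:]

```python
import random
from multiprocessing import Pool

def play_game(seed):
    # Hypothetical stand-in: walk a 100-square board with die rolls
    # (no chutes or ladders) and return the number of turns taken.
    rng = random.Random(seed)
    square = 0
    turns = 0
    while square < 100:
        square += rng.randint(1, 6)
        turns += 1
    return turns

if __name__ == "__main__":
    # Pool() starts one worker process per CPU core by default, so the
    # independent games run in parallel across the machine's cores.
    with Pool() as pool:
        turns = pool.map(play_game, range(10000))
    print(sum(turns) / len(turns))   # average game length
```

Because each game depends only on its own seed, no ordering or shared state is needed between workers, which is what makes this problem embarrassingly parallel.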
Re: [Tutor] Shelve & immutable objects
On Wed, Jan 1, 2014 at 11:43 AM, Keith Winston wrote:
> Thanks Danny, I don't understand the re-persisted part, but I'll look into
> it.

Shelf.close calls Shelf.sync to flush the cache. Here's what it does:

    >>> print(inspect.getsource(shelve.Shelf.sync))
        def sync(self):
            if self.writeback and self.cache:
                self.writeback = False
                for key, entry in self.cache.items():
                    self[key] = entry
                self.writeback = True
                self.cache = {}
            if hasattr(self.dict, 'sync'):
                self.dict.sync()

If the writeback cache is enabled, then loading a key also caches the item in a dict. `sync` writes all of the cached items back and clears the cache (for which it should be using `self.cache.clear()`). It also syncs the underlying database if it has a `sync` attribute, which the gdbm database [1] requires. Per its man page, "gdbm defaults to no-sync mode" [2], so the "f" flag for opening in fast mode is actually obsolete [3].

1. If you're using a Debian-family Linux, install the packages python-gdbm and python3-gdbm.
2. http://linux.die.net/man/3/gdbm
3. http://docs.python.org/3/library/dbm#dbm.gnu.open

Have a look at what gets stored in the database. First let's create a database and add a list:

    >>> import shelve, dbm
    >>> sh = shelve.open('class_shelve')
    >>> sh['alist'] = [1, 2, 3]

You could leave the Shelf open and use its underlying `sh.dict`, but I'll reopen the database instead:

    >>> sh.close()
    >>> db = dbm.open('class_shelve')
    >>> data = db['alist']
    >>> data
    b'\x80\x03]q\x00(K\x01K\x02K\x03e.'

shelve serialized the list to a byte string with the pickle protocol. This is a small program that instructs the pickle VM to build the list.
You can manually load the list with pickle.loads:

    >>> import pickle
    >>> pickle.loads(data)
    [1, 2, 3]

You can disassemble the pickle with pickletools.dis, if you're curious:

    >>> import pickletools
    >>> pickletools.dis(data)
        0: \x80 PROTO      3
        2: ]    EMPTY_LIST
        3: q    BINPUT     0
        5: (    MARK
        6: K        BININT1    1
        8: K        BININT1    2
       10: K        BININT1    3
       12: e        APPENDS    (MARK at 5)
       13: .    STOP
    highest protocol among opcodes = 2
Re: [Tutor] Shelve & immutable objects
On 01/01/2014 16:43, Keith Winston wrote:
> Thanks Danny, I don't understand the re-persisted part, but I'll look into
> it. I realized I hadn't done enough homework to justify a question right
> after I sent the first half of that one! Happy New Year!

You do infinitely more work than some who pose questions here!!! Happy New Year to everybody. And as I've already said elsewhere, let's hope 2014 is more code, less bugs :)

--
My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language.

Mark Lawrence
Re: [Tutor] Shelve & immutable objects
Thanks Danny, I don't understand the re-persisted part, but I'll look into it. I realized I hadn't done enough homework to justify a question right after I sent the first half of that one! Happy New Year!
Re: [Tutor] Shelve & immutable objects
On Tue, Dec 31, 2013 at 11:01 PM, Keith Winston wrote:
> I'm working my way slowly through Programming Python by Mark Lutz, and as an
> example of data persistence, he uses this example:

Oops; the email got cut off a bit early. Can you try again?
Re: [Tutor] Shelve & immutable objects
According to:

http://docs.python.org/2/library/shelve.html

the shelve can be opened in 'writeback' mode, which I think might be relevant to your question:

    "By default modified objects are written only when assigned to the shelf
    (see Example). If the optional writeback parameter is set to True, all
    entries accessed are also cached in memory, and written back on sync()
    and close(); this can make it handier to mutate mutable entries in the
    persistent dictionary, but, if many entries are accessed, it can consume
    vast amounts of memory for the cache, and it can make the close
    operation very slow since all accessed entries are written back (there
    is no way to determine which accessed entries are mutable, nor which
    ones were actually mutated)."

Let's try it:

##
>>> import shelve
>>> db = shelve.open('class-shelve')
>>> db['a-list'] = [1, 2, 3]
>>> db.close()

>>> db = shelve.open('class-shelve', writeback=True)
>>> db['a-list'].append("four")
>>> db.close()

>>> db = shelve.open('class-shelve')
>>> db['a-list']
[1, 2, 3, 'four']
##

So yes, you should be able to use a shelve in writeback mode to automatically persist the mutable structures. That being said, the docs do say to be aware of the implications: every accessed entry is going to be re-persisted, because the shelve does not really watch for mutations: it just checks for access.

Happy New Year!
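For contrast, here is the pitfall the writeback flag exists to work around. Without it, mutating a fetched value in place is silently lost, because indexing the shelf unpickles a fresh copy each time (a sketch using a throwaway database name, 'demo-shelve'):

```python
import shelve

db = shelve.open('demo-shelve')    # writeback=False, the default
db['a-list'] = [1, 2, 3]
db['a-list'].append(4)             # mutates a temporary unpickled copy only!
print(db['a-list'])                # still [1, 2, 3]

# The explicit fetch/mutate/reassign dance is what actually persists:
tmp = db['a-list']
tmp.append(4)
db['a-list'] = tmp
print(db['a-list'])                # now [1, 2, 3, 4]
db.close()
```

This is the "assignment/reassignment" pattern from the Lutz example, and writeback=True is the convenience that trades memory for skipping it.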
Re: [Tutor] Shelve & immutable objects
So sorry, I hit return; here's the example:

    import shelve
    db = shelve.open('class-shelve')
    sue = db['sue']
    sue.giveRaise(.25)
    db['sue'] = sue
    tom = db['tom']
    tom.giveRaise(.20)
    db['tom'] = tom
    db.close()

Is it possible to dispense with the assignment/reassignment and just use

    (open shelve)
    db['sue'].giveRaise(.25)
    db['tom'].giveRaise(.20)
    (close shelve)

or is the assignment (or binding?) necessary to unshelve/reshelve the items...
[Tutor] Shelve & immutable objects
I'm working my way slowly through Programming Python by Mark Lutz, and as an example of data persistence, he uses this example:

-- Keith