Re: [Tutor] Shelve & immutable objects

2014-01-02 Thread David Hutto
> Separately, I'm also curious about how to process big files. For example, I
> was trying to play 100 million games of chutes & ladders

Without playing all 100,000,000 games, you could either research the
numbers, or try an algorithm that runs games in increasing intervals and
narrows in on the smallest sample that is still statistically decisive.
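
For instance, here is a minimal sketch of that sampling idea, assuming a
hypothetical play_game() function that returns a single game's statistic
(say, its move count): keep enlarging the sample until the running mean
stops moving.

def estimate_mean(play_game, batch=10000, tolerance=0.001):
    # Play games in batches, stopping once the running mean stabilizes.
    total = count = 0
    previous = None
    while True:
        for _ in range(batch):
            total += play_game()
        count += batch
        mean = total / count
        if previous is not None and abs(mean - previous) < tolerance:
            return mean, count   # estimate settled; no need for 10**8 games
        previous = mean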


Re: [Tutor] Shelve & immutable objects

2014-01-02 Thread Danny Yoo
> Separately, I'm also curious about how to process big files. For example, I
> was trying to play 100 million games of chutes & ladders, and I crashed my
> machine, I believe: the game results, including 4 ints & 2 short lists of
> ints per game, are gathered into a list, so it can become a pretty big list.
> I need to do stats and other analyses on it in the end (okay, I really don't
> NEED to play 100 million games of chutes & ladders, but as long as I
> have...): I suppose I could break it into manageable (maybe 1 million games
> each), but that will make some of the stats either clunky or wrong (I
> haven't really attacked that part yet).

This is probably one of those cases where you don't want to hold all of
this in active memory, but rather store it in some kind of long-term
store.

We can do a back-of-the-envelope calculation to see why.  An int's
native size is 4 bytes on 32-bit machines.  This is Python, so there's
a bit more overhead.  Let's roughly say that each record is about 10
ints, so about 40 bytes.

100 * 10^6 records * 40 bytes per record = 4 * 10^9 bytes.

###
>>> import humanize
>>> humanize.intword(4*10**9)
'4 billion'
###

So that's about 4 billion bytes, as a very rough guess.

Unfortunately, when we get to those magnitudes, that's way too large
to fit into a standard computer's memory.  32-bit machines can only
address up to about 4 billion bytes:


###
>>> humanize.intword(2**32)
'4.3 billion'
###

So trying to juggle all those records in short-term RAM is a
non-starter: it won't work.  We'll want to do something different
instead, such as saving each record into a persistent on-disk,
external database.
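
As a rough sketch of that approach with the standard sqlite3 module (the
file name, table, and columns here are made up for illustration):

import sqlite3

conn = sqlite3.connect('games.db')
conn.execute("CREATE TABLE IF NOT EXISTS games "
             "(moves INTEGER, chutes INTEGER, ladders INTEGER)")

# Write each game's result as it is produced, instead of
# accumulating 100 million results in one Python list.
for game in range(1000):                  # scale up as needed
    result = (42, 3, 5)                   # stand-in for one game's stats
    conn.execute("INSERT INTO games VALUES (?, ?, ?)", result)
conn.commit()

# Stats can then be computed in SQL without loading everything back:
print(conn.execute("SELECT AVG(moves), MAX(moves) FROM games").fetchone())
conn.close()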

If you're interested, bring this up as a separate topic on
python-tutor, and I'm sure folks here will be happy to talk about it.


Good luck!


Re: [Tutor] Shelve & immutable objects

2014-01-02 Thread eryksun
On Thu, Jan 2, 2014 at 4:15 AM, Keith Winston  wrote:
> Thanks for all this Eryksun (and Mark!), but... I don't understand why you
> brought gdbm in? Is it something underlying shelve, or a better approach, or
> something else? That last part really puts me in a pickle, and I don't
> understand why.

A Shelf is backed by a container with the following mapping methods:

keys
__contains__
__getitem__
__setitem__
__delitem__
__len__

Shelf will also try to call `close` and `sync` on the container if
available. For some reason no one has made Shelf into a context
manager (i.e. __enter__ and __exit__), so remember to close() it.
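
In the meantime, contextlib.closing gives with-statement support to
anything that has a close() method, Shelf included:

>>> import shelve
>>> from contextlib import closing
>>> with closing(shelve.open('class_shelve')) as sh:
...     sh['alist'] = [1, 2, 3]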

For demonstration purposes, you can use a dict with Shelf:

>>> sh = shelve.Shelf(dict={})
>>> sh['alist'] = [1,2,3]

The mapping is referenced in the (badly named) `dict` attribute:

>>> sh.dict
{b'alist': b'\x80\x03]q\x00(K\x01K\x02K\x03e.'}

Keys are encoded as bytes (UTF-8 default) and the value is serialized
using pickle. This is to support using a database from the dbm module.

shelve.open returns an instance of shelve.DbfilenameShelf, which is a
subclass of Shelf specialized to open a dbm database.

Here's an overview of Unix dbm databases that Google turned up:

http://www.unixpapa.com/incnote/dbm.html

Note the size restrictions for keys and values in ndbm, which gdbm
doesn't have. Using gdbm lifts the restriction on the size of pickled
objects (the docs vaguely suggest to keep them "fairly small").
Unfortunately gdbm isn't always available.

On my system, dbm defaults to creating a _gdbm.gdbm database, where
_gdbm is an extension module that wraps the GNU gdbm library (e.g.
libgdbm.so.3).

You can use a different database with Shelf (or a subclass), so long
as it has the required methods. For example, shelve.BsdDbShelf is
available for use with pybsddb (Debian package "python3-bsddb3"). It
exposes the bsddb3 database methods `first`, `next`, `previous`,
`last` and `set_location`.

> Separately, I'm also curious about how to process big files.
> ...
> I'm also beginning to think about how to speed it up:

I defer to Steven's sage advice.

Look into databases such as sqlite3, look into numpy, and add the
multiprocessing and concurrent.futures modules to your todo list. Even
if you know C/C++, I suggest using Cython to create CPython extension
modules.

http://www.cython.org
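
As a small sketch of the numpy suggestion above: keeping each game's
ints in a compact typed array instead of a list of Python objects cuts
the memory cost dramatically (4 int32 values cost 16 bytes per game):

import numpy as np

n_games = 1000                       # scale up as needed
results = np.zeros((n_games, 4), dtype=np.int32)
for i in range(n_games):
    results[i] = (37, 3, 2, 5)       # stand-in for one game's four ints

# column-wise stats without any Python-level loop
print(results[:, 0].mean(), results[:, 0].std())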


Re: [Tutor] Shelve & immutable objects

2014-01-02 Thread Steven D'Aprano
On Thu, Jan 02, 2014 at 04:15:06AM -0500, Keith Winston wrote:

> Separately, I'm also curious about how to process big files. For example, I
> was trying to play 100 million games of chutes & ladders, and I crashed my
> machine, I believe: the game results, including 4 ints & 2 short lists of
> ints per game, are gathered into a list, so it can become a pretty big
> list. 

How to process big files, in order of "best" to "worst" (in my opinion):

(1) Get more memory.

(2) Find a way to process them in small chunks, e.g. a line at a time, 
rather than try to hold the entire file in memory at once (see the 
sketch after this list).

(3) Split them into small files, then process each file in turn.

(4) Pass them to external tools which are optimized for dealing with 
huge files.

(5) Find a way to process them using mmap.

(6) Write your own tool for handling huge files.
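
As a sketch of option (2), assuming the results were written out one
game per line to a hypothetical game_results.txt:

# Stream the file a line at a time; only one line is in memory at once.
total = count = 0
with open('game_results.txt') as f:
    for line in f:
        moves = int(line.split()[0])   # assume moves is the first field
        total += moves
        count += 1
print('mean moves:', total / count)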


> I need to do stats and other analyses on it in the end (okay, I
> really don't NEED to play 100 million games of chutes & ladders, but as
> long as I have...): I suppose I could break it into manageable (maybe 1
> million games each), but that will make some of the stats either clunky or
> wrong (I haven't really attacked that part yet).

It shouldn't. Well, it might, but only if you're doing some pretty 
sophisticated statistical analysis. Most common statistics can be 
calculated in a single pass through the data. If you need to calculate 
(say) the mean, variance and standard deviation, it is moderately 
straightforward to calculate all three in a single pass without needing 
to have all the data in memory at once.
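
For example, here is a sketch of one well-known single-pass method,
Welford's algorithm, which computes mean and variance without storing
the data:

def running_stats(values):
    # Welford's online algorithm: one pass, constant memory.
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)     # accumulates squared deviations
    variance = m2 / (n - 1) if n > 1 else 0.0
    return mean, variance

mean, var = running_stats([2, 4, 4, 4, 5, 5, 7, 9])
print(mean, var ** 0.5)              # works on any iterable, however large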

(Feel free to ask for more detail if you wish.)

Even statistics like the median, which normally requires the entire data 
set to be in memory so it can be sorted, can be calculated with a couple 
of passes through the data. There are even single-pass algorithms that 
approximate it.


> And since I'm not REALLY ready to ask this question, I'll tack it on to the
> end... I'm also beginning to think about how to speed it up: I'm imagining
> my two options are going to be to code some sections in a faster language
> (i.e. C), or maybe to introduce multi-threading since I'm working on a
> multicore machine generally (core I7), and I'm doing a lot of iterations of
> the same thing with no important order... seems like a good candidate.

A game of Chutes and Ladders (or as we call it in Australia, Snakes 
and Ladders) surely doesn't need to be optimized. It will spend most of 
its time waiting for the human user, who is probably a hundred thousand 
or million times slower than the computer.

But let's say you want to make it faster regardless. How to make it 
faster:

(1) Use a smarter algorithm or less wasteful implementation.

(2) Get more memory.

(3) Get a faster computer.

(4) I think what you are trying to do is a good candidate for PyPy, the 
optimizing Python JIT compiler. Especially if you are playing multiple 
millions of games.

(5) Profile, profile, profile, profile.

Only once you have profiled your code can you possibly hope to guess 
where the bottlenecks are. After 15 or 20 years of programming with 
Python, my guesses for the bottlenecks are still wrong more often than 
they are right.
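
A minimal profiling sketch with the standard cProfile module, where
main() stands in for whatever plays your games:

import cProfile

def main():
    # stand-in for the real simulation loop
    sum(i * i for i in range(10 ** 6))

# prints call counts and times per function, sorted by cumulative time
cProfile.run('main()', sort='cumulative')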

Assuming you identify the actual bottlenecks:

(6) Try writing them in Cython, which will often give you good speedups.

(7) Or write a C extension.

Should you use threads? Probably not.

- Unless the bottleneck in your code is disk or network I/O, 
  threads will not help; they'll just make your program slower.

- Unless you use IronPython or Jython instead, in which case 
  threads *might* help. But probably not as much as you think.

- If your bottleneck is CPU processing, you can use the 
  multiprocessing module instead of threads (a minimal sketch 
  follows below).

- You did profile your code first, didn't you?
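
Here is a minimal sketch of that multiprocessing approach, splitting
independent games across cores (play_games is a hypothetical stand-in):

import multiprocessing

def play_games(n):
    # stand-in: play n independent games, return their move counts
    return [42] * n

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        # eight workers each play 1000 games; order doesn't matter
        batches = pool.map(play_games, [1000] * 8)
    results = [moves for batch in batches for moves in batch]
    print(len(results))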


> Now,
> I'm probably pretty far from that piece (in my learning process), but this
> is moving along pretty well so I'm open to suggestions about how to
> proceed. I've started switching up my code a fair bit to try to make it
> more OOP, though I'm still rough on that part.


Advice about optimization from some experts:

http://en.wikipedia.org/wiki/Program_optimization#Quotes

My favourite quote is:

The First Rule of Program Optimization: 
Don't do it. 

The Second Rule of Program Optimization (for experts only!): 
Don't do it yet.



-- 
Steven


Re: [Tutor] Shelve & immutable objects

2014-01-02 Thread Keith Winston
Thanks for all this Eryksun (and Mark!), but... I don't understand why you
brought gdbm in? Is it something underlying shelve, or a better approach,
or something else? That last part really puts me in a pickle, and I don't
understand why.

Separately, I'm also curious about how to process big files. For example, I
was trying to play 100 million games of chutes & ladders, and I crashed my
machine, I believe: the game results, including 4 ints & 2 short lists of
ints per game, are gathered into a list, so it can become a pretty big
list. I need to do stats and other analyses on it in the end (okay, I
really don't NEED to play 100 million games of chutes & ladders, but as
long as I have...): I suppose I could break it into manageable (maybe 1
million games each), but that will make some of the stats either clunky or
wrong (I haven't really attacked that part yet).

And since I'm not REALLY ready to ask this question, I'll tack it on to the
end... I'm also beginning to think about how to speed it up: I imagine
my two options are going to be to code some sections in a faster language
(e.g. C), or maybe to introduce multi-threading, since I'm generally working
on a multicore machine (Core i7) and I'm doing a lot of iterations of the
same thing with no important order... seems like a good candidate. Now,
I'm probably pretty far from that piece (in my learning process), but this
is moving along pretty well, so I'm open to suggestions about how to
proceed. I've started switching up my code a fair bit to try to make it
more OOP, though I'm still rough on that part.

K


Re: [Tutor] Shelve & immutable objects

2014-01-01 Thread eryksun
On Wed, Jan 1, 2014 at 11:43 AM, Keith Winston  wrote:
> Thanks Danny, I don't understand the re-persisted part, but I'll look into
> it.

Shelf.close calls Shelf.sync to flush the cache. Here's what it does:

>>> import inspect, shelve
>>> print(inspect.getsource(shelve.Shelf.sync))
    def sync(self):
        if self.writeback and self.cache:
            self.writeback = False
            for key, entry in self.cache.items():
                self[key] = entry
            self.writeback = True
            self.cache = {}
        if hasattr(self.dict, 'sync'):
            self.dict.sync()

If the writeback cache is enabled, then loading a key also caches the
item in a dict. `sync` writes all of the cached items back and clears
the cache (for which it should be using `self.cache.clear()`).

It also syncs the underlying database if it has a `sync` attribute,
which the gdbm database [1] requires. Per its man page, "gdbm defaults
to no-sync mode" [2]. So actually the "f" flag for opening in fast
mode is obsolete [3].

1. If you're using a Debian-family Linux, install the
   packages python-gdbm and python3-gdbm.
2. http://linux.die.net/man/3/gdbm
3. http://docs.python.org/3/library/dbm#dbm.gnu.open

Have a look at what gets stored in the database. First, let's create a
database and add a list:

>>> import shelve, dbm
>>> sh = shelve.open('class_shelve')
>>> sh['alist'] = [1, 2, 3]

You could leave the Shelf open and use its underlying `sh.dict`, but
I'll reopen the database instead:

>>> sh.close()
>>> db = dbm.open('class_shelve')
>>> data = db['alist']
>>> data
b'\x80\x03]q\x00(K\x01K\x02K\x03e.'

shelve serialized the list to a byte string with the pickle protocol.
This is a program that instructs the pickle VM to build the list. You
can manually load the list with pickle.loads:

>>> import pickle
>>> pickle.loads(data)
[1, 2, 3]

You can disassemble the pickle with pickletools.dis, if you're curious:

>>> import pickletools
>>> pickletools.dis(data)
    0: \x80 PROTO      3
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: K        BININT1    1
    8: K        BININT1    2
   10: K        BININT1    3
   12: e        APPENDS    (MARK at 5)
   13: .    STOP
highest protocol among opcodes = 2


Re: [Tutor] Shelve & immutable objects

2014-01-01 Thread Mark Lawrence

On 01/01/2014 16:43, Keith Winston wrote:

> Thanks Danny, I don't understand the re-persisted part, but I'll look
> into it. I realized I hadn't done enough homework to justify a question
> right after I sent the first half of that one! Happy New Year!



You do infinitely more work than some who pose questions here!!!  Happy 
New Year to everybody.  And as I've already said elsewhere, let's hope 
2014 is more code, less bugs :)


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence



Re: [Tutor] Shelve & immutable objects

2014-01-01 Thread Keith Winston
Thanks Danny, I don't understand the re-persisted part, but I'll look into
it. I realized I hadn't done enough homework to justify a question right
after I sent the first half of that one! Happy New Year!


Re: [Tutor] Shelve & immutable objects

2013-12-31 Thread Danny Yoo
On Tue, Dec 31, 2013 at 11:01 PM, Keith Winston  wrote:
> I'm working my way slowly through Programming Python by Mark Lutz, and as an
> example of data persistence, he uses this example:


Ooops; the email got cut off a bit early.  Can you try again?


Re: [Tutor] Shelve & immutable objects

2013-12-31 Thread Danny Yoo
According to:

   http://docs.python.org/2/library/shelve.html

The shelve can be opened in 'writeback' mode, which I think might be
relevant to your question.

"By default modified objects are written only when assigned to the
shelf (see Example). If the optional writeback parameter is set to
True, all entries accessed are also cached in memory, and written back
on sync() and close(); this can make it handier to mutate mutable
entries in the persistent dictionary, but, if many entries are
accessed, it can consume vast amounts of memory for the cache, and it
can make the close operation very slow since all accessed entries are
written back (there is no way to determine which accessed entries are
mutable, nor which ones were actually mutated)."

Let's try it:

##
>>> import shelve
>>> db = shelve.open('class-shelve')
>>> db['a-list'] = [1, 2, 3]
>>> db.close()
>>> db = shelve.open('class-shelve', writeback=True)
>>> db['a-list'].append("four")
>>> db.close()
>>> db = shelve.open('class-shelve')
>>> db['a-list']
[1, 2, 3, 'four']
##


So yes, you should be able to use a shelve in writeback mode to
automatically persist the mutable structures.  That being said, the
docs do say to be aware of the implications: every accessed entry is
going to be re-persisted, because the shelve does not actually watch
for mutations: it just tracks access.
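
For contrast, here is the same mutation without writeback: the change is
silently lost, which is why the book's example reassigns each object
back to its key.

##
>>> db = shelve.open('class-shelve')
>>> db['a-list'].append('five')   # mutates a temporary unpickled copy
>>> db.close()
>>> db = shelve.open('class-shelve')
>>> db['a-list']                  # 'five' was never written back
[1, 2, 3, 'four']
##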

Happy New Year!


Re: [Tutor] Shelve & immutable objects

2013-12-31 Thread Keith Winston
So sorry, I hit return: here's the example:

import shelve
db = shelve.open('class-shelve')

sue = db['sue']
sue.giveRaise(.25)
db['sue'] = sue

tom = db['tom']
tom.giveRaise(.20)
db['tom'] = tom
db.close()

Is it possible to dispense with the assignment/reassignment and just use

(open shelve)
db['sue'].giveRaise(.25)
db['tom'].giveRaise(.20)
(close shelve)

or is the assignment (or binding?) necessary to unshelve/reshelve the
items...


[Tutor] Shelve & immutable objects

2013-12-31 Thread Keith Winston
I'm working my way slowly through Programming Python by Mark Lutz, and as
an example of data persistence, he uses this example:


-- 
Keith