RE: Populating a dictionary, fast [SOLVED SOLVED]

2007-11-16 Thread Michael Bacarella
> Do you really believe that you cannot create or delete a large > dictionary with python versions less than 2.5 (on a 64 bit or multi- > cpu system)? That a bug of this magnitude has not been noticed until > someone posted on clp? You're right, it is completely inappropriate for us to be showing

RE: Populating a dictionary, fast [SOLVED SOLVED]

2007-11-15 Thread Michael Bacarella
> On Thu, 15 Nov 2007 15:51:25 -0500, Michael Bacarella wrote: > > > Since some people missed the EUREKA!, here's the executive summary: > > > > Python2.3: about 45 minutes > > Python2.4: about 45 minutes > > Python2.5: about _30 seconds_

RE: Populating a dictionary, fast [SOLVED SOLVED]

2007-11-15 Thread Michael Bacarella
> On Nov 15, 2:11 pm, Istvan Albert <[EMAIL PROTECTED]> wrote: > > There is nothing wrong with neither creating nor deleting > > dictionaries. > > I suspect what happened is this: on 64 bit > machines the data structures for creating dictionaries > are larger (because pointers take twice as much s

RE: Populating a dictionary, fast [SOLVED]

2007-11-13 Thread Michael Bacarella
> Shouldn't this be: > > id2name[key >> 40][key & 0xff] = name Yes, exactly, I had done hex(pow(2,40)) when I meant hex(pow(2,40)-1) I sent my correction a few minutes afterwards but Mailman queued it for moderator approval (condition with replying to myself?) -- http://mail.pytho

RE: Populating a dictionary, fast [SOLVED]

2007-11-13 Thread Michael Bacarella
See end for solution. > >> (3) Are you sure you need all eight-million-plus items in the cache > >> all at once? > > > > Yes. > > I remain skeptical, but what do I know, I don't even know what you're > doing with the data once you have it :-) It's OK, I'd be skeptical too. ;) > $ cat /proc/cpui

RE: Populating a dictionary, fast [SOLVED SOLVED]

2007-11-12 Thread Michael Bacarella
> > You can download the list of keys from here, it's 43M gzipped: > > http://www.sendspace.com/file/9530i7 > > > > and see it take about 45 minutes with this: > > > > $ cat cache-keys.py > > #!/usr/bin/python > > v = {} > > for line in open('keys.txt'): > > v[long(line.strip())] = True
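For readers who want to time the quoted loop on their own machine, here is a minimal sketch in Python 3 (the thread itself used Python 2, so int() stands in for long()). The load_keys helper and the small generated sample file are hypothetical stand-ins, not part of the original post; the real keys.txt from the thread is not reproduced here, and timings on a tiny sample say nothing about the 45-minute case.

```python
import os
import tempfile
import time


def load_keys(path):
    """Python 3 rendering of the quoted loop; int() replaces Python 2's long()."""
    v = {}
    with open(path) as f:
        for line in f:
            v[int(line.strip())] = True
    return v


# Hypothetical stand-in for keys.txt: 1000 distinct integers, one per line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    for i in range(1000):
        tmp.write("%d\n" % (i * 2654435761 % 2**64))
    sample_path = tmp.name

start = time.perf_counter()
keys = load_keys(sample_path)
elapsed = time.perf_counter() - start
os.remove(sample_path)

print("loaded %d keys in %.3f s" % (len(keys), elapsed))
```

Pointing load_keys at a real key file and running it under different interpreter versions is one way to repeat the comparison the thread describes.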

RE: Populating a dictionary, fast

2007-11-12 Thread Michael Bacarella
> > and see it take about 45 minutes with this: > > > > $ cat cache-keys.py > > #!/usr/bin/python > > v = {} > > for line in open('keys.txt'): > > v[long(line.strip())] = True > > On my system (windows vista) your code (using your data) runs in: > > 36 seconds with python 2.4 > 25 seconds

RE: Populating a dictionary, fast [SOLVED]

2007-11-12 Thread Michael Bacarella
> id2name[key >> 40][key & 0x100] = name Oops, typo. It's actually: id2name[key >> 40][key & 0xff] = name -- http://mail.python.org/mailman/listinfo/python-list
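A minimal sketch of the two-level lookup this post corrects, written in Python 3 rather than the thread's Python 2. The preview above shows the mask as 0xff, while a later reply in the thread mentions hex(pow(2,40)-1), so the sketch assumes the intended mask is the low 40 bits. The names LOW_BITS, LOW_MASK, put and get are hypothetical and do not come from the original posts.

```python
from collections import defaultdict

LOW_BITS = 40                    # assumption: split point taken from `key >> 40`
LOW_MASK = (1 << LOW_BITS) - 1   # assumption: 2**40 - 1, i.e. 0xffffffffff

# Outer dict is keyed by the high bits, each inner dict by the low 40 bits.
id2name = defaultdict(dict)


def put(key, name):
    """Store name under a 64-bit key split into (high bits, low bits)."""
    id2name[key >> LOW_BITS][key & LOW_MASK] = name


def get(key):
    """Look the name up again using the same split."""
    return id2name[key >> LOW_BITS][key & LOW_MASK]


# Sample records in the id:name shape described in the first post of the thread.
put(11293102971459182412, "Descriptive unique name for this record")
put(950918240981208142, "Another name for another record")

print(get(950918240981208142))
print(hex(pow(2, 40)), hex(pow(2, 40) - 1))
```

The high and low parts together identify the key uniquely, so two different keys never collide in the same inner slot.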

Re: Populating a dictionary, fast

2007-11-11 Thread Michael Bacarella
> > I tried your code (with one change, time on feedback lines) and got the > > same terrible > > performance against my data set. > > > > To prove that my machine is sane, I ran the same against your generated > > sample file and got _excellent_ performance. Start to finish in under a minute.

Re: Populating a dictionary, fast

2007-11-11 Thread Michael Bacarella
> > This would seem to implicate the line id2name[id] = name as being excruciatingly slow. > > As others have pointed out there is no way that this takes 45 > minutes. Must be something with your system or setup. > > A functionally equivalent code for me runs in about 49 seconds! > (it ends up usi

Re: Populating a dictionary, fast

2007-11-11 Thread Michael Bacarella
Firstly, thank you for all of your help so far, I really appreciate it. > > So, you think the Python's dict implementation degrades towards O(N) > > performance when it's fed millions of 64-bit pseudo-random longs? > > No. Yes. I tried your code (with one change, time on feedback lines) and go

Re: Populating a dictionary, fast

2007-11-11 Thread Michael Bacarella
> Steven D'Aprano wrote: > > (2) More memory will help avoid paging. If you can't get more memory, try > > more virtual memory. It will still be slow, but at least the operating > > system doesn't have to try moving blocks around as much. > > Based on his previous post, it would seem he has 7GB

Re: Populating a dictionary, fast

2007-11-11 Thread Michael Bacarella
> - Original Message > From: Paul Rubin <http://[EMAIL PROTECTED]> > To: python-list@python.org > Sent: Sunday, November 11, 2007 12:45:44 AM > Subject: Re: Populating a dictionary, fast > > Michael Bacarella <[EMAIL PROTECTED]> writes: > > If on

Re: Populating a dictionary, fast

2007-11-10 Thread Michael Bacarella
> That's an awfully complicated way to iterate over a file. Try this > instead: > > id2name = {} > for line in open('id2name.txt'): >id,name = line.strip().split(':') >id = long(id) >id2name[id] = name > > > This takes about 45 *minutes* > > > On my system, it takes about a minute an

Populating a dictionary, fast

2007-11-10 Thread Michael Bacarella
The id2name.txt file is an index of primary keys to strings. They look like this: 11293102971459182412:Descriptive unique name for this record\n 950918240981208142:Another name for another record\n The file's properties are: # wc -l id2name.txt 8191180 id2name.txt # du -h id2name.txt 517M

RE: Using python as primary language

2007-11-08 Thread Michael Bacarella
> > > > A multithreaded application in Python will only use a single CPU > on > > > > multi-CPU machines due to big interpreter lock, whereas the > "right > > > thing" > > > > happens in Java. > > > > > > Note that this is untrue for many common uses of threading (e.g. > using > > > threads to wait

RE: Using python as primary language

2007-11-08 Thread Michael Bacarella
> > How do you feel about multithreading support? > > > > A multithreaded application in Python will only use a single CPU on > > multi-CPU machines due to big interpreter lock, whereas the "right > thing" > > happens in Java. > > Note that this is untrue for many common uses of threading (e.g. u
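The quoted reply's point about threads that wait can be illustrated with a small sketch, assuming CPython and using time.sleep as a stand-in for a blocking call such as a socket read. The wait_for_io helper and the 0.2 second delay are hypothetical choices for illustration only. Because a thread gives up the interpreter lock while it is blocked, the four waits overlap rather than run back to back.

```python
import threading
import time


def wait_for_io(seconds):
    # time.sleep stands in for a blocking call such as a socket read;
    # CPython releases the interpreter lock while a thread blocks like this.
    time.sleep(seconds)


start = time.perf_counter()
threads = [threading.Thread(target=wait_for_io, args=(0.2,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# Run one after another, the four waits would take about 0.8 s in total.
print("4 threads, each waiting 0.2 s, finished in %.2f s" % elapsed)
```

The same experiment with a CPU-bound function in place of wait_for_io would not show this overlap under the interpreter lock, which is the case the original question was about.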

RE: Using python as primary language

2007-11-08 Thread Michael Bacarella
> In our company we are looking for one language to be used as default > language. So far Python looks like a good choice (slacking behind > Java). A few requirements that the language should be able cope with > are: How do you feel about multithreading support? A multithreaded application in Pyt

RE: Populating huge data structures from disk

2007-11-06 Thread Michael Bacarella
> > Very sure. If we hit the disk at all performance drops > > unacceptably. The application has low locality of reference so > > on-demand caching isn't an option. We get the behavior we want when > > we pre-cache; the issue is simply that it takes so long to build > > this cache. > > The way I

RE: How do I get the PC's Processor speed?

2007-11-06 Thread Michael Bacarella
> > This information is hardware dependent and probably unreliable. > > > > Why not run a benchmark and report the results instead? > > Like bogomips? http://en.wikipedia.org/wiki/Bogomips > > That's an interesting idea, but this is in a login script, so I can't > exactly run benchmarks while lo

RE: Populating huge data structures from disk

2007-11-06 Thread Michael Bacarella
> Note that you're not doing the same thing at all. You're > pre-allocating the array in the C code, but not in Python (and I don't > think you can). Is there some reason you're growing a 8 gig array 8 > bytes at a time? > > They spend about the same amount of time in system, but Python spends 4.7

RE: How do I get the PC's Processor speed?

2007-11-06 Thread Michael Bacarella
> On the problem PCs, both of these methods give me the same information > (i.e. only the processor name). However, if I go to "System > Properties" and look at the "General" tab, it lists the CPU name and > processor speed. Does anyone else know of another way to get at this > information? This i

RE: Populating huge data structures from disk

2007-11-06 Thread Michael Bacarella
> > For various reasons I need to cache about 8GB of data from disk into core on > > application startup. > > Are you sure? On PC hardware, at least, doing this doesn't make any > guarantee that accessing it actually going to be any faster. Is just > mmap()ing the file a problem for some reason? >

Populating huge data structures from disk

2007-11-06 Thread Michael Bacarella
For various reasons I need to cache about 8GB of data from disk into core on application startup. Building this cache takes nearly 2 hours on modern hardware. I am surprised to discover that the bottleneck here is CPU. The reason this is surprising is because I expect something like this to

httplib hangs in read / strace says recvfrom()

2007-09-13 Thread Michael Bacarella
tplib.py from Python 2.3 and also dropped in the one from Python 2.5 with no difference. Running on Linux kernel 2.6 (CentOS's, specifically). Any responses CC me as I'm not subscribed [since Python has worked so flawlessly for me otherwise ] -- Michael Bacarella <[EMAIL PROTECTE