> Do you really believe that you cannot create or delete a large
> dictionary with python versions less than 2.5 (on a 64 bit or multi-
> cpu system)? That a bug of this magnitude has not been noticed until
> someone posted on clp?
You're right, it is completely inappropriate for us to be showing
> On Thu, 15 Nov 2007 15:51:25 -0500, Michael Bacarella wrote:
>
> > Since some people missed the EUREKA!, here's the executive summary:
> >
> > Python2.3: about 45 minutes
> > Python2.4: about 45 minutes
> > Python2.5: about _30 seconds_
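For anyone wanting to reproduce the comparison locally, here is a scaled-down sketch of the test (file I/O omitted; the key count and seed are my own choices, not from the thread, which used ~8.1 million keys):

```python
import random
import time

# Scaled-down sketch of the benchmark discussed above: populate a dict
# with pseudo-random 64-bit integer keys and time it. n and seed are
# arbitrary stand-ins for the original 8.1-million-key file.
def time_dict_population(n=100_000, seed=42):
    rng = random.Random(seed)
    keys = [rng.getrandbits(64) for _ in range(n)]
    start = time.time()
    v = {}
    for k in keys:
        v[k] = True
    return time.time() - start, len(v)

elapsed, size = time_dict_population()
print("inserted %d keys in %.3f seconds" % (size, elapsed))
```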
> On Nov 15, 2:11 pm, Istvan Albert <[EMAIL PROTECTED]> wrote:
> > There is nothing wrong with either creating or deleting
> > dictionaries.
>
> I suspect what happened is this: on 64 bit
> machines the data structures for creating dictionaries
> are larger (because pointers take twice as much space).
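That hypothesis is easy to check on any box; `struct.calcsize("P")` reports the native pointer width:

```python
import struct
import sys

# On a 64-bit build pointers are 8 bytes, so pointer-sized dict slots
# take twice the space they do on a 32-bit build.
pointer_size = struct.calcsize("P")
is_64bit = sys.maxsize > 2**32
print("pointer size: %d bytes (64-bit build: %s)" % (pointer_size, is_64bit))
```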
> Shouldn't this be:
>
> id2name[key >> 40][key & 0xffffffffff] = name
Yes, exactly, I had done hex(pow(2,40)) when I meant hex(pow(2,40)-1)
I sent my correction a few minutes afterwards but Mailman
queued it for moderator approval (condition with replying to
myself?)
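To spell out the off-by-one being corrected here (hex values follow from the hex(pow(2,40)) remark above; the sample key is taken from later in the thread):

```python
# pow(2, 40) is a single bit just above the mask; pow(2, 40) - 1 is the
# all-ones 40-bit mask that actually keeps the low 40 bits of a key.
assert hex(2**40) == '0x10000000000'
mask = 2**40 - 1
assert hex(mask) == '0xffffffffff'

# Splitting and rejoining a 64-bit key:
key = 11293102971459182412
high, low = key >> 40, key & mask
assert (high << 40) | low == key
print(hex(mask))
```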
--
http://mail.python.org/mailman/listinfo/python-list
See end for solution.
> >> (3) Are you sure you need all eight-million-plus items in the cache
> >> all at once?
> >
> > Yes.
>
> I remain skeptical, but what do I know, I don't even know what you're
> doing with the data once you have it :-)
It's OK, I'd be skeptical too. ;)
> > You can download the list of keys from here, it's 43M gzipped:
> > http://www.sendspace.com/file/9530i7
> >
> > and see it take about 45 minutes with this:
> >
> > $ cat cache-keys.py
> > #!/usr/bin/python
> > v = {}
> > for line in open('keys.txt'):
> >     v[long(line.strip())] = True
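A Python 3 rendering of the quoted script, with an elapsed-time line added; the one-decimal-integer-per-line format and default filename are taken from the post above:

```python
import time

# Python 3 version of the quoted cache-keys.py, with timing feedback.
# Assumes the keys file holds one decimal integer per line.
def load_keys(path='keys.txt'):
    start = time.time()
    v = {}
    with open(path) as f:
        for line in f:
            v[int(line.strip())] = True
    print("loaded %d keys in %.1f seconds" % (len(v), time.time() - start))
    return v

# usage: v = load_keys('keys.txt')
```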
>
> On my system (windows vista) your code (using your data) runs in:
>
> 36 seconds with python 2.4
> 25 seconds with python 2.5
> id2name[key >> 40][key & 0x10000000000] = name
Oops, typo. It's actually:
id2name[key >> 40][key & 0xffffffffff] = name
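For readers joining late: the corrected line splits each 64-bit key across two dict levels. A self-contained sketch of that idea (the 40-bit split is from the thread; the defaultdict shorthand and helper names are mine):

```python
from collections import defaultdict

# Two-level dict sketch: the top-level dict is keyed by the high bits
# of a 64-bit id, each sub-dict by the low 40 bits (mask 0xffffffffff).
MASK40 = (1 << 40) - 1
id2name = defaultdict(dict)

def put(key, name):
    id2name[key >> 40][key & MASK40] = name

def get(key):
    return id2name[key >> 40][key & MASK40]

put(11293102971459182412, 'Descriptive unique name for this record')
print(get(11293102971459182412))
```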
> > I tried your code (with one change, time on feedback lines) and got
> > the same terrible performance against my data set.
> >
> > To prove that my machine is sane, I ran the same against your
> > generated sample file and got _excellent_ performance. Start to
> > finish in under a minute.
> >
> > This would seem to implicate the line id2name[id] = name as being
> > excruciatingly slow.
>
> As others have pointed out there is no way that this takes 45
> minutes. Must be something with your system or setup.
>
> Functionally equivalent code runs for me in about 49 seconds!
> (it ends up usi
Firstly, thank you for all of your help so far, I really appreciate it.
> > So, you think Python's dict implementation degrades towards O(N)
> > performance when it's fed millions of 64-bit pseudo-random longs?
>
> No.
Yes.
I tried your code (with one change, time on feedback lines) and got
the same terrible performance against my data set.
> Steven D'Aprano wrote:
> > (2) More memory will help avoid paging. If you can't get more
> > memory, try more virtual memory. It will still be slow, but at
> > least the operating system doesn't have to try moving blocks
> > around as much.
>
> Based on his previous post, it would seem he has 7GB
> - Original Message
> From: Paul Rubin <http://[EMAIL PROTECTED]>
> To: python-list@python.org
> Sent: Sunday, November 11, 2007 12:45:44 AM
> Subject: Re: Populating a dictionary, fast
>
> Michael Bacarella <[EMAIL PROTECTED]> writes:
> > If on
> That's an awfully complicated way to iterate over a file. Try this
> instead:
>
> id2name = {}
> for line in open('id2name.txt'):
>     id, name = line.strip().split(':')
>     id = long(id)
>     id2name[id] = name
>
> > This takes about 45 *minutes*
> >
> On my system, it takes about a minute an
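One caveat with the loop above: a bare split(':') breaks if a name itself contains a colon, and `id` shadows the built-in. A defensive variant (Python 3; the function wrapper is my own addition):

```python
# Variant of the loop above: split on the first ':' only so names
# containing colons survive, and avoid shadowing the built-in id().
def load_id2name(path='id2name.txt'):
    id2name = {}
    with open(path) as f:
        for line in f:
            key, name = line.rstrip('\n').split(':', 1)
            id2name[int(key)] = name
    return id2name
```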
The id2name.txt file is an index of primary keys to strings. They look like
this:
11293102971459182412:Descriptive unique name for this record\n
950918240981208142:Another name for another record\n
The file's properties are:
# wc -l id2name.txt
8191180 id2name.txt
# du -h id2name.txt
517M    id2name.txt
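For anyone who would rather not download the 517M file, a small generator sketch producing the same key:name layout (the row count and record names here are made up, scaled down from the real 8,191,180 lines):

```python
import random

# Write a synthetic id2name.txt-style file: decimal 64-bit key, ':',
# then a name, one record per line.
def write_sample(path, n=1000, seed=0):
    rng = random.Random(seed)
    with open(path, 'w') as f:
        for i in range(n):
            f.write('%d:record number %d\n' % (rng.getrandbits(64), i))
```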
> > > > A multithreaded application in Python will only use a single CPU
> > > > on multi-CPU machines due to big interpreter lock, whereas the
> > > > "right thing" happens in Java.
> > >
> > > Note that this is untrue for many common uses of threading (e.g.
> > > using threads to wait for blocking I/O).
> > How do you feel about multithreading support?
> >
> > A multithreaded application in Python will only use a single CPU on
> > multi-CPU machines due to big interpreter lock, whereas the "right
> > thing" happens in Java.
>
> Note that this is untrue for many common uses of threading (e.g. using
> threads to wait for blocking I/O).
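The point about threads that wait is easy to demonstrate; time.sleep stands in for blocking I/O here (the interpreter lock is released while a thread blocks):

```python
import threading
import time

# Ten threads each block for 0.2s; because the lock is released while
# blocking, the waits overlap and the batch takes ~0.2s, not ~2s.
def wait():
    time.sleep(0.2)

start = time.time()
threads = [threading.Thread(target=wait) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print("10 overlapped 0.2s waits took %.2f seconds" % elapsed)
```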
> In our company we are looking for one language to be used as default
> language. So far Python looks like a good choice (slacking behind
> Java). A few requirements that the language should be able to cope
> with are:
How do you feel about multithreading support?
A multithreaded application in Python will only use a single CPU on
multi-CPU machines due to big interpreter lock, whereas the "right
thing" happens in Java.
> > Very sure. If we hit the disk at all performance drops
> > unacceptably. The application has low locality of reference so
> > on-demand caching isn't an option. We get the behavior we want when
> > we pre-cache; the issue is simply that it takes so long to build
> > this cache.
>
> The way I
> > This information is hardware dependent and probably unreliable.
> >
> > Why not run a benchmark and report the results instead?
> > Like bogomips? <http://en.wikipedia.org/wiki/Bogomips>
>
> That's an interesting idea, but this is in a login script, so I can't
> exactly run benchmarks while logging in.
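Even in a login script, a tiny benchmark is cheap. A sketch with timeit (the workload is an arbitrary choice of mine; the score only means anything compared against the identical snippet on other machines):

```python
import timeit

# Time a fixed CPU-bound workload; lower is faster.
score = timeit.timeit('sum(i * i for i in range(1000))', number=200)
print("benchmark score: %.4f seconds" % score)
```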
> Note that you're not doing the same thing at all. You're
> pre-allocating the array in the C code, but not in Python (and I don't
> think you can). Is there some reason you're growing a 8 gig array 8
> bytes at a time?
>
> They spend about the same amount of time in system, but Python spends 4.7
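A rough Python-side illustration of the pre-allocation point, using the array module (sizes scaled far down from the 8-gig example; note CPython's array growth is amortized, so the gap is smaller than a naive realloc-per-append would suggest):

```python
from array import array
import time

N = 1_000_000  # scaled down from the 8-gig example

# Grow an array of 8-byte ints one append at a time...
start = time.time()
grown = array('q')
for i in range(N):
    grown.append(i)
t_grow = time.time() - start

# ...versus allocating the full buffer once and filling it in place.
start = time.time()
prealloc = array('q', bytes(8 * N))  # one zero-filled allocation
for i in range(N):
    prealloc[i] = i
t_fill = time.time() - start

print("append: %.3fs  preallocate+fill: %.3fs" % (t_grow, t_fill))
```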
> On the problem PCs, both of these methods give me the same information
> (i.e. only the processor name). However, if I go to "System
> Properties" and look at the "General" tab, it lists the CPU name and
> processor speed. Does anyone else know of another way to get at this
> information?
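One more place Windows keeps the full CPU name is the registry. A cross-platform sketch (the registry path and value name are the standard ones; the non-Windows fallback is my own addition):

```python
import platform

# On Windows, read ProcessorNameString from the registry; elsewhere
# fall back to platform.processor(), which may be empty.
def cpu_name():
    if platform.system() == 'Windows':
        import winreg
        key = winreg.OpenKey(
            winreg.HKEY_LOCAL_MACHINE,
            r'HARDWARE\DESCRIPTION\System\CentralProcessor\0')
        try:
            return winreg.QueryValueEx(key, 'ProcessorNameString')[0]
        finally:
            winreg.CloseKey(key)
    return platform.processor() or None

print(cpu_name())
```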
This i
> > For various reasons I need to cache about 8GB of data from disk
> > into core on application startup.
>
> Are you sure? On PC hardware, at least, doing this doesn't make any
> guarantee that accessing it is actually going to be any faster. Is
> just mmap()ing the file a problem for some reason?
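A minimal sketch of the mmap() route being suggested: map the file read-only and let the kernel page data in on first access instead of copying 8GB eagerly (the helper name is mine):

```python
import mmap

# Map a file read-only; bytes are faulted in lazily by the OS.
def open_mapped(path):
    f = open(path, 'rb')
    return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# usage: m = open_mapped('id2name.txt'); m[0:20] reads the first bytes
```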
>
For various reasons I need to cache about 8GB of data from disk into core on
application startup.
Building this cache takes nearly 2 hours on modern hardware. I am surprised
to discover that the bottleneck here is CPU.
The reason this is surprising is because I expect something like this to
tplib.py from Python 2.3 and also dropped in the one from
Python 2.5 with no difference. Running on Linux kernel 2.6 (CentOS's,
specifically).
Any responses CC me as I'm not subscribed [since Python has worked so
flawlessly for me otherwise].
--
Michael Bacarella <[EMAIL PROTECTED]>