bearophileh...@lycos.com wrote:
On Apr 28, 2:54 pm, forrest yang <gforrest.y...@gmail.com> wrote:
I'm trying to load a big file, about 9,000,000 lines, into a dict.
The lines look something like
1 2 3 4
2 2 3 4
3 4 5 6

code:
for line in open(file):
   arr=line.strip().split('\t')
   dict[line.split(None, 1)[0]]=arr

But the dict gets really slow as I load more data into memory. By
the way, the Mac I use has 16 GB of memory.
Is this caused by poor performance of the dict when it extends its
memory, or by something else?
Can anyone suggest a better solution?

Keys are integers,

Actually strings. But this is probably not the problem here.

so they are very efficiently managed by the dict.
If I do this:
d = dict.fromkeys(xrange(9000000))
It takes only a little more than a second on my normal PC.
So probably the problem isn't in the dict, it's the I/O

If the OP experiences a noticeable slowdown during the process, then I doubt the problem is with I/O. If he finds the process to be slow but of constant slowness, then it may or may not have to do with I/O, but probably not as the single factor.

Hint : don't guess, profile.
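As a rough illustration of that hint, here is a minimal sketch using the standard cProfile and pstats modules. It is written for Python 3 (the thread itself predates it), and load(), profile_load() and the generated sample lines are hypothetical stand-ins for the OP's loop and real file, not anything posted in the thread.

```python
import cProfile
import io
import pstats


def load(lines):
    """Build the dict roughly the way the OP does: key is the first field."""
    d = {}
    for line in lines:
        arr = line.strip().split('\t')
        d[arr[0]] = arr
    return d


def profile_load(lines, top=5):
    """Run load() under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(load, lines)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats('cumulative').print_stats(top)
    return result, out.getvalue()


# Hypothetical sample data standing in for the real 9,000,000-line file.
sample = ['%d\t2\t3\t4\n' % i for i in range(100000)]
d, report = profile_load(sample)
print(report)
```

The report lists where the time goes (strip, split, the loop itself), which is more reliable than guessing between the dict and I/O. To include the I/O cost, pass the open file object instead of a list.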

and/or the
list allocation. A possible suggestion is to not split the arrays,

The OP is actually splitting a string.

but
keep it as strings, and split them only when you use them:

d = {}
for line in open(file):
  line = line.strip()
  d[line.split(None, 1)[0]] = line

You still split the string - but only once, which is indeed better !-)

But you can have your cake and eat it too:

d = {}
for line in open(thefile):
   arr = line.strip().split()
   d[arr[0]] = arr


if that's not fast enough you can simplify it:

d = {}
for line in open(file):
  d[line.split(None, 1)[0]] = line

I doubt this will save that much processing time...
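Rather than doubting, one can measure. The following is a small sketch with the standard timeit module (Python 3) comparing the three loops discussed in this thread. The function names and the generated sample lines are made up for the example, and the timings will of course depend on the machine and the data.

```python
import timeit

# Hypothetical sample data; the real file has about 9,000,000 lines.
lines = ['%d\t2\t3\t4\n' % i for i in range(200000)]


def split_twice(lines):
    """The OP's version: the line is split two times."""
    d = {}
    for line in lines:
        arr = line.strip().split('\t')
        d[line.split(None, 1)[0]] = arr
    return d


def split_once(lines):
    """Split once and reuse the first field as the key."""
    d = {}
    for line in lines:
        arr = line.strip().split()
        d[arr[0]] = arr
    return d


def keep_line(lines):
    """Store the raw line; split it only when it is used."""
    d = {}
    for line in lines:
        d[line.split(None, 1)[0]] = line
    return d


timings = {}
for func in (split_twice, split_once, keep_line):
    # Best of three runs, one full pass over the data per run.
    timings[func.__name__] = min(
        timeit.repeat(lambda: func(lines), number=1, repeat=3))

for name, seconds in timings.items():
    print('%-12s %.3f s' % (name, seconds))
```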

If you have memory problems still, then you can only keep the line
number as dict values, or even absolute file positions, to seek later.
You can also use memory mapped files.
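The file-position idea could look something like the sketch below (Python 3). build_offset_index() and fetch() are hypothetical helper names, and the file is opened in binary mode so that byte offsets are exact, which means keys and fields are bytes here. The temporary file only stands in for the real data.

```python
import os
import tempfile


def build_offset_index(path):
    """Map the first field of each line to the byte offset of that line."""
    index = {}
    offset = 0
    with open(path, 'rb') as f:
        for line in f:
            fields = line.split(None, 1)
            if fields:  # skip blank lines
                index[fields[0]] = offset
            offset += len(line)
    return index


def fetch(path, index, key):
    """Seek to the stored offset and split that single line on demand."""
    with open(path, 'rb') as f:
        f.seek(index[key])
        return f.readline().split()


# Small temporary file with the sample lines from the original post.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'1\t2\t3\t4\n2\t2\t3\t4\n3\t4\t5\t6\n')
    path = tmp.name

index = build_offset_index(path)
row = fetch(path, index, b'3')
os.remove(path)
print(index, row)
```

The dict then holds one small integer per line instead of a list of strings, at the price of a seek and a read for each lookup.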

Tell us how the performance is now.

IMHO, not much better...
--
http://mail.python.org/mailman/listinfo/python-list