Re: memory efficient set/dictionary

2007-06-11 Thread Josiah Carlson
koara wrote:
>>> I would recommend using a database, since it meets your
>>> requirements (off-memory, fast, persistent). The bsddb module
>>> (Berkeley DB) even gives you a dictionary-like interface.
>>> http://www.python.org/doc/lib/module-bsddb.html
>> Standard SQL databases can work for this, but generally your
>> recommendation of using bsddb works very well for int -> int mappings.
>> In particular, I would suggest using a btree, if only because I have had
>> troubles in the past with colliding keys in bsddb.hash (and recno is
>> just a flat file, and will attempt to create a file of i*(record size)
>> bytes in order to write to record number i).
>>
>> As an alternative, there are many methods known from search engines for
>> mapping int -> [int, int, ...], which can be implemented as int -> int,
>> where the second int is a pointer to an address on disk.  Looking into a
>> few of the open source search implementations may be worthwhile.
> 
> Thanks guys! I will look into bsddb; hopefully it doesn't keep all
> keys in memory. I couldn't find the answer to that during my (very
> brief) look at the documentation.

No, bsddb does not keep all data in memory.


> And how about the extra memory used for storing integers in a set/dict?
> Is there a simple answer?

A non-long integer on a 32-bit Python uses 12 bytes; a non-long
integer on a 64-bit Python uses 24 bytes.

Each entry in a dictionary on a 32-bit Python uses 12 bytes: 4 for the
hash, 4 for the key pointer, and 4 for the value pointer. Double that
to 24 bytes per entry on a 64-bit Python (I don't know whether the hash
actually has its range increased, but if the pointers aren't doubled in
size, you get alignment issues that can slow down dictionary access).

Set entries only have the hash and the key pointer, so they use 8/16
bytes each (32-/64-bit).
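
To put those numbers in perspective, here is a back-of-envelope
estimate (a rough sketch: the one-billion entry count is a hypothetical
example, it ignores the hash table's overallocation of slots, and
64-bit keys would actually be longs on a 32-bit build and cost more
than 12 bytes, so this underestimates):

    # Lower-bound estimate for a dict of int -> int on a 32-bit
    # Python, using the per-object and per-entry sizes above.
    n = 10 ** 9          # one billion entries (assumption)
    int_obj = 12         # bytes per non-long int object, 32-bit build
    dict_entry = 12      # hash + key pointer + value pointer
    # every distinct key and value is its own int object, plus the entry:
    total = n * (dict_entry + 2 * int_obj)
    print("%.1f GB at minimum" % (total / float(2 ** 30)))

That is well beyond what fits in RAM (or in a 32-bit address space),
which is why an on-disk structure is the practical route at this scale.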


  - Josiah


Re: memory efficient set/dictionary

2007-06-11 Thread koara
> > I would recommend using a database, since it meets your
> > requirements (off-memory, fast, persistent). The bsddb module
> > (Berkeley DB) even gives you a dictionary-like interface.
> > http://www.python.org/doc/lib/module-bsddb.html
>
> Standard SQL databases can work for this, but generally your
> recommendation of using bsddb works very well for int -> int mappings.
> In particular, I would suggest using a btree, if only because I have had
> troubles in the past with colliding keys in bsddb.hash (and recno is
> just a flat file, and will attempt to create a file of i*(record size)
> bytes in order to write to record number i).
>
> As an alternative, there are many methods known from search engines for
> mapping int -> [int, int, ...], which can be implemented as int -> int,
> where the second int is a pointer to an address on disk.  Looking into a
> few of the open source search implementations may be worthwhile.

Thanks guys! I will look into bsddb; hopefully it doesn't keep all
keys in memory. I couldn't find the answer to that during my (very
brief) look at the documentation.

And how about the extra memory used for storing integers in a set/dict?
Is there a simple answer?



Re: memory efficient set/dictionary

2007-06-10 Thread Josiah Carlson
Rafael Darder Calvo wrote:
>> > > Please recommend a module that allows persistent set/dict storage +
>> > > fast query that best fits my problem,
>> >
>> > What is the problem you are trying to solve? How many keys do you have?
>>
>> Corpus processing. There are on the order of billions to tens of
>> billions of keys (64-bit integers).
>>
> I would recommend using a database, since it meets your
> requirements (off-memory, fast, persistent). The bsddb module
> (Berkeley DB) even gives you a dictionary-like interface.
> http://www.python.org/doc/lib/module-bsddb.html

Standard SQL databases can work for this, but generally your
recommendation of using bsddb works very well for int -> int mappings.
In particular, I would suggest using a btree, if only because I have had
troubles in the past with colliding keys in bsddb.hash (and recno is
just a flat file, and will attempt to create a file of i*(record size)
bytes in order to write to record number i).
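
A minimal sketch of that btree approach, assuming Python 2.5's bsddb
module (the filename and helper names are illustrative): bsddb keys and
values must be byte strings, so the 64-bit integers are packed with
struct, and big-endian packing keeps non-negative keys in numeric order.

    # Sketch: an on-disk int -> int mapping backed by a bsddb btree.
    import struct
    import bsddb

    db = bsddb.btopen('mapping.db', 'c')   # 'c' creates the file if needed

    def put(key, value):
        db[struct.pack('>q', key)] = struct.pack('>q', value)

    def get(key, default=None):
        packed_key = struct.pack('>q', key)
        if db.has_key(packed_key):
            return struct.unpack('>q', db[packed_key])[0]
        return default

    put(1234567890123456789, 42)
    print(get(1234567890123456789))   # -> 42
    db.sync()
    db.close()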

As an alternative, there are many methods known from search engines for
mapping int -> [int, int, ...], which can be implemented as int -> int,
where the second int is a pointer to an address on disk.  Looking into a
few of the open source search implementations may be worthwhile.
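
And a hypothetical sketch of that pointer-to-disk layout: an index (for
example the btree above) maps each packed key to an (offset, count)
pair in a flat postings file that holds the packed int lists. The names
and the on-disk format here are illustrative assumptions, not an
established API.

    # Hypothetical postings-file layout: index maps packed key ->
    # (offset, count); the actual int lists live in a flat data file.
    import struct

    def write_postings(data_file, index, key, values):
        data_file.seek(0, 2)               # append at end of file
        offset = data_file.tell()
        data_file.write(struct.pack('>%dq' % len(values), *values))
        index[struct.pack('>q', key)] = struct.pack('>qi', offset,
                                                    len(values))

    def read_postings(data_file, index, key):
        offset, count = struct.unpack('>qi', index[struct.pack('>q', key)])
        data_file.seek(offset)
        return list(struct.unpack('>%dq' % count, data_file.read(8 * count)))

Here data_file would be something like open('postings.dat', 'r+b') (or
'w+b' on first use) and index the bsddb btree from the previous sketch.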

  - Josiah


Re: memory efficient set/dictionary

2007-06-10 Thread Rafael Darder Calvo
> > > Please recommend a module that allows persistent set/dict storage +
> > > fast query that best fits my problem,
> >
> > What is the problem you are trying to solve? How many keys do you have?
>
> Corpus processing. There are on the order of billions to tens of
> billions of keys (64-bit integers).
>
I would recommend using a database, since it meets your
requirements (off-memory, fast, persistent). The bsddb module
(Berkeley DB) even gives you a dictionary-like interface.
http://www.python.org/doc/lib/module-bsddb.html

regards,
Rafa


Re: memory efficient set/dictionary

2007-06-10 Thread koara
Hello Steven,

On Jun 10, 5:29 pm, Steven D'Aprano
<[EMAIL PROTECTED]> wrote:
> > ...
> How do you know it won't fit in main memory if you don't know the
> overhead? A guess? You've tried it and your computer crashed?

Exactly.

> > Please recommend a module that allows persistent set/dict storage +
> > fast query that best fits my problem,
>
> Usually I love guessing what people's problems are before making a
> recommendation, but I'm feeling whimsical so I think I'll ask first.
>
> What is the problem you are trying to solve? How many keys do you have?

Corpus processing. There are on the order of billions to tens of
billions of keys (64-bit integers).

> Can you group them in some way, e.g. alphabetically? Do you need to search
> on random keys, or can you queue them and do them in the order of your
> choice?


Yes, keys in sets and dictionaries can be grouped in any way; order
doesn't matter. Not sure what you mean.
Yes, I need fast random access (at least I do without having to
rethink and rewrite everything, which is what I'd like to avoid with
the help of this thread :-)

Thanks for the reply!




Re: memory efficient set/dictionary

2007-06-10 Thread Steven D'Aprano
On Sun, 10 Jun 2007 07:27:56 -0700, koara wrote:

> What is the best way to go about using a large set (or dictionary)
> that doesn't fit into main memory? What is Python's (2.5, let's say)
> overhead for storing an int in a set, and how much for storing an
> int -> int mapping in a dict?

How do you know it won't fit in main memory if you don't know the
overhead? A guess? You've tried it and your computer crashed?


> Please recommend a module that allows persistent set/dict storage +
> fast query that best fits my problem, 

Usually I love guessing what people's problems are before making a
recommendation, but I'm feeling whimsical so I think I'll ask first.

What is the problem you are trying to solve? How many keys do you have?
Can you group them in some way, e.g. alphabetically? Do you need to search
on random keys, or can you queue them and do them in the order of your
choice?


> and as lightweight as possible.
> For queries, the hit ratio is about 10%. Fast updates would be nice,
> but I can rewrite the algorithm so that the data is static, so update
> speed is not critical.
> 
> Or am I better off not using Python here? Cheers.


-- 
Steven.



memory efficient set/dictionary

2007-06-10 Thread koara
What is the best way to go about using a large set (or dictionary)
that doesn't fit into main memory? What is Python's (2.5, let's say)
overhead for storing an int in a set, and how much for storing an
int -> int mapping in a dict?
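
One way to answer the overhead question empirically is to watch the
process's memory grow while filling a dict. A rough, Linux-only sketch
(Python 2.5 has no sys.getsizeof; the one-million count is arbitrary):

    # Rough empirical check of per-entry dict overhead.  Linux-only:
    # it reads the resident set size from /proc/self/status.
    def rss_kb():
        for line in open('/proc/self/status'):
            if line.startswith('VmRSS'):
                return int(line.split()[1])

    before = rss_kb()
    d = dict((i, i + 1) for i in xrange(1000000))
    after = rss_kb()
    print("~%d bytes per entry" % ((after - before) * 1024 / len(d)))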

Please recommend a module that allows persistent set/dict storage +
fast query that best fits my problem, and as lightweight as possible.
For queries, the hit ratio is about 10%. Fast updates would be nice,
but I can rewrite the algorithm so that the data is static, so update
speed is not critical.

Or am I better off not using Python here? Cheers.
