I have a bit of a specialized request. I'm reading a table of strings (specifically fixed length 36 char uuids generated via uuid.uuid4() in the standard library) from a file and creating a set out of it. Then my program is free to make whatever modifications to this set.
When I go back to save this set, I'd like to be able to only save the new items. Currently I am creating a copy of the set as soon as I load it and then when I go back to save it, i'm calculating the difference and saving just the difference. There are many problems with this approach so far: 1) Calculating the difference via the standard set implementation is not very scalable -> O(n) I presume 2) Maintaining a copy wastes memory 3) I don't have a good solution if I delete items from the set (calculating the difference will return an empty set but I need to actually delete stuff). I was thinking of writing a specialized set implementation (I have no idea how to start on this) which tracks new items (added after initialization) in a separate space and exposes a new API call which would enable me to directly get those values. This is kind of ugly and doesn't solve problem 3. I also thought of using a hastable but I'm storing many ( > 1 million) of these sets in the same file (most of them will be empty or contain just a few values but a few of them will be very large - excess of 10,000 items). The file is already separated into tablespaces. Does anyone have any ideas? Thanks in advance. Prateek PS: Yes - I need blazing fast performance - simply pickling/unpickling won't do. Memory constraints are important but definitely secondary. Disk space constraints are not very important. -- http://mail.python.org/mailman/listinfo/python-list