Prateek <[EMAIL PROTECTED]> wrote: > I have a bit of a specialized request. > > I'm reading a table of strings (specifically fixed length 36 char > uuids generated via uuid.uuid4() in the standard library) from a file > and creating a set out of it. > Then my program is free to make whatever modifications to this set. > > When I go back to save this set, I'd like to be able to only save the > new items. Currently I am creating a copy of the set as soon as I load > it and then when I go back to save it, i'm calculating the difference > and saving just the difference. > > There are many problems with this approach so far: > 1) Calculating the difference via the standard set implementation is > not very scalable -> O(n) I presume
Yep, you do have to look at every item, so O(N) is an obvious lower bound. > 2) Maintaining a copy wastes memory > 3) I don't have a good solution if I delete items from the set > (calculating the difference will return an empty set but I need to > actually delete stuff). (3) is easy -- the difference originalset-finalset is the set of things you have to delete. > Does anyone have any ideas? Your problem is really a thinly disguised version of one of the most ancient and venerable ones in data processing -- the "master file / transaction file" problem. With sets (which are built as hash tables) you have very powerful weapons for this fray (it used to be, many years ago, that sorting and "in-step processing of sorted files" was the only feasible approach to such problems), but you still have to wield them properly. In your shoes, I would write a class whose instances hold three sets: -- the "master set" is what you originally read from the file -- the "added set" is the set of things you've added since then -- the "deleted set" is the set of things you've deleted since them You can implement a set-like interface; for example, your .add method will need to remove the item from the deleted set (if it was there), then add it to the added set (if it wasn't already in the master set); and so on, for each and every method you actually desire (including no doubt __contains__ for membership testing). E.g., with suitably obvious names for the three sets, we could have...: class cleverset(object): ... def __contains__(self, item): if item in self.deleted: return False elif item in self.added: return True else: return item in self.master When the time comes to save back to disk, you'll perform deletions and additions based on self.deleted and self.added, of course. I'm not addressing the issue of how to best represent such sets on disk because it's not obvious to me what operations you need to perform on the on-disk representation -- for example, deletions can be very costly unless you can afford to simply set a "logically deleted" flag on deleted items; but, depending on how often you delete stuff, that will eventually degrade performance due to fragmentation, and maintaining the indexing (if you need one) can be a bear. Normally, I would use a good relational database (which has no doubt already solved on my behalf all of these issues) for purpose of persistent storage of such data -- usually sqlite for reasonably small amounts of data, PostgreSQL if I need enormous scalability and other "enterprise" features, but I hear other people are happy with many other engines for relational DBs, such as Oracle (used to be superb if you could afford full-time specialized db admins), IBM's DB2, SAP's database (now opensourced), Firebird, even MySQL (I've never, but NEVER, had any good experience with the latter, but I guess maybe it's only me, as otherwise it could hardly be so popular). The main point here is that a relational DB's tables _should_ be already very well optimized for storing "sets", since those table ARE exactly nothing but sets of tuples (and a 1-item tuple is a tuple for all that:-). Alex -- http://mail.python.org/mailman/listinfo/python-list