Re: Writing huge Sets() to disk
Hi, could someone tell me what all does and what all doesn't copy references in Python? I have found that my script, after reaching some state and taking say 600MB, pushes its internal dictionaries to hard disk. The for loop consumes another 300MB (as gathered by vmstat) to push the data to dictionaries, then releases a little bit less than 300MB, and the program starts to fill up its internal dictionaries again; when they are full it will do the flush again ... The point here is that this code takes a lot of extra memory. I believe it's the references problem, and I remember complaints from friends facing the same problem. I'm a newbie, yes, but I don't have this problem with Perl. OK, I want to improve my Python knowledge ... :-))

def push_to_disk(self):
    _dict_on_disk_tuple = (None,
        self._dict_on_disk1, self._dict_on_disk2, self._dict_on_disk3,
        self._dict_on_disk4, self._dict_on_disk5, self._dict_on_disk6,
        self._dict_on_disk7, self._dict_on_disk8, self._dict_on_disk9,
        self._dict_on_disk10, self._dict_on_disk11, self._dict_on_disk12,
        self._dict_on_disk13, self._dict_on_disk14, self._dict_on_disk15,
        self._dict_on_disk16, self._dict_on_disk17, self._dict_on_disk18,
        self._dict_on_disk19, self._dict_on_disk20)
    _size = 0
    #
    # sizes of these tmpdicts range from 10-1 entries for each!
    #
    for _tmpdict in (self._tmpdict1, self._tmpdict2, self._tmpdict3,
        self._tmpdict4, self._tmpdict5, self._tmpdict6, self._tmpdict7,
        self._tmpdict8, self._tmpdict9, self._tmpdict10, self._tmpdict11,
        self._tmpdict12, self._tmpdict13, self._tmpdict14, self._tmpdict15,
        self._tmpdict16, self._tmpdict17, self._tmpdict18, self._tmpdict19,
        self._tmpdict20):
        _size += 1
        if _tmpdict:
            _dict_on_disk = _dict_on_disk_tuple[_size]
            for _word, _value in _tmpdict.iteritems():
                try:
                    _string = _dict_on_disk[_word]
                    # I discard _a and _b; maybe _string.find(' ')
                    # combined with a slice would do better?
                    _abs_count, _a, _b, _expected_freq = _string.split()
                    _abs_count = int(_abs_count).__add__(_value)
                    _t = (str(_abs_count), '0', '0', '0')
                except KeyError:
                    _t = (str(_value), '0', '0', '0')
                # this writes a copy to the dict, right?
                _dict_on_disk[_word] = ' '.join(_t)
    #
    # clear the temporary dictionaries in ourself
    # I think this works as expected and really does release memory
    #
    for _tmpdict in (self._tmpdict1, self._tmpdict2, self._tmpdict3,
        self._tmpdict4, self._tmpdict5, self._tmpdict6, self._tmpdict7,
        self._tmpdict8, self._tmpdict9, self._tmpdict10, self._tmpdict11,
        self._tmpdict12, self._tmpdict13, self._tmpdict14, self._tmpdict15,
        self._tmpdict16, self._tmpdict17, self._tmpdict18, self._tmpdict19,
        self._tmpdict20):
        _tmpdict.clear()

The above routine doesn't release the memory back when it exits. See, the loop takes 25 minutes already, and it keeps getting longer as the program is only at about 1/3 or 1/4 of the total input. The rest of my code is fast in contrast to this (below 1 minute).
-rw--- 1 mmokrejs users 257376256 Jan 17 11:38 diskdict12.db
-rw--- 1 mmokrejs users 267157504 Jan 17 11:35 diskdict11.db
-rw--- 1 mmokrejs users 266534912 Jan 17 11:28 diskdict10.db
-rw--- 1 mmokrejs users 253149184 Jan 17 11:21 diskdict9.db
-rw--- 1 mmokrejs users 250232832 Jan 17 11:14 diskdict8.db
-rw--- 1 mmokrejs users 246349824 Jan 17 11:07 diskdict7.db
-rw--- 1 mmokrejs users     19488 Jan 17 11:02 diskdict6.db
-rw--- 1 mmokrejs users  66584576 Jan 17 10:59 diskdict5.db
-rw--- 1 mmokrejs users   5750784 Jan 17 10:57 diskdict4.db
-rw--- 1 mmokrejs users    311296 Jan 17 10:57 diskdict3.db
-rw--- 1 mmokrejs users 295895040 Jan 17 10:56 diskdict20.db
-rw--- 1 mmokrejs users 293634048 Jan 17 10:49 diskdict19.db
-rw--- 1 mmokrejs users 299892736 Jan 17 10:43 diskdict18.db
-rw--- 1 mmokrejs users 272334848 Jan 17 10:36 diskdict17.db
-rw--- 1 mmokrejs users 274825216 Jan 17 10:30 diskdict16.db
-rw--- 1 mmokrejs users 273104896 Jan 17 10:23 diskdict15.db
-rw--- 1 mmokrejs users 272678912 Jan 17 10:18 diskdict14.db
-rw--- 1 mmokrejs users 260407296 Jan 17 10:13 diskdict13.db

Some spoke about mmapped files. Could I take advantage of that with the bsddb module? Is gdbm better in some ways? Recently you have said dictionary operations are fast ... Once more: I want to turn off locking support. I can make the values strings of fixed size, if mmap() were available. The number of keys doesn't grow much in time; mostly there are only updates.

Thanks for any ideas.

martin
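[As to the find-plus-slice idea in the code comment above, a minimal sketch of that variant might look like this (untested; it reuses the names from the routine and relies on the fact that the last three fields are rewritten as '0 0 0' anyway):

    # Only the first field is needed, so stop at the first space
    # instead of splitting the whole four-field record:
    _i = _string.find(' ')
    _abs_count = int(_string[:_i]) + _value
    _dict_on_disk[_word] = str(_abs_count) + ' 0 0 0'
]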
Re: Writing huge Sets() to disk
Martin MOKREJŠ wrote:
> Hi, could someone tell me what all does and what all doesn't copy
> references in Python? I have found that my script, after reaching some
> state and taking say 600MB, pushes its internal dictionaries to hard
> disk. The for loop consumes another 300MB (as gathered by vmstat) to
> push the data to dictionaries, then releases a little bit less than
> 300MB, and the program starts to fill up its internal dictionaries
> again; when they are full it will do the flush again ...

Almost anything you do copies references.

> The point here is that this code takes a lot of extra memory. I believe
> it's the references problem, and I remember complaints from friends
> facing the same problem. I'm a newbie, yes, but I don't have this
> problem with Perl. OK, I want to improve my Python knowledge ... :-))

[long code extract snipped]

> The above routine doesn't release the memory back when it exits.

That's probably because there isn't any memory it can reasonably be expected to release. What memory would *you* expect it to release?

The member variables are all still accessible as member variables until you run your loop at the end to clear them, so no way could Python release them.

Some hints:

When posting code, try to post complete examples which actually work. I don't know what type the self._dict_on_diskXX variables are supposed to be. It makes a big difference if they are dictionaries (so you are trying to hold everything in memory at one time) or shelve.Shelf objects which would store the values on disc in a reasonably efficient manner.

Even if they are Shelf objects, I see no reason here why you have to process everything at once. Write a simple function which processes one tmpdict object into one dict_on_disk object and then closes the dict_on_disk object. If you want to compare results later then do that by reopening the dict_on_disk objects when you have deleted all the tmpdicts.

Extract out everything you want to do into a class which has at most one tmpdict and one dict_on_disk. That way your code will be a lot easier to read.

Make your code more legible by using fewer underscores.

What on earth is the point of an explicit call to __add__? If Guido had meant us to use __add__ he wouldn't have created '+'.

What is the purpose of dict_on_disk? Is it for humans to read the data? If not, then don't store everything as a string. Far better to just store a tuple of your values, then you don't have to use split or cast the strings to integers. If you do want humans to read some final output then produce that separately from the working data files.

You split out 4 values from dict_on_disk and set three of them to 0. Is that really what you meant, or should you be preserving the previous values?

Here is some (untested) code which might help you:

import shelve

def push_to_disc(data, filename):
    database = shelve.open(filename)
    try:
        for key in data:
            if database.has_key(key):
                count, a, b, expected = database[key]
                database[key] = count + data[key], a, b, expected
            else:
                database[key] = data[key], 0, 0, 0
    finally:
        database.close()
    data.clear()

Call that once for each input dictionary and your data will be written out to a disc file and the internal dictionary cleared without any great spike of memory use.
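[For illustration, the function above might be driven one dictionary at a time, so memory never spikes by more than one tmpdict (a sketch, untested; the attribute names come from the original post and the filenames are made up):

    # Flush each temporary dictionary into its own shelve file;
    # push_to_disc() clears the dict it was given.
    tmpdicts = (self._tmpdict1, self._tmpdict2, self._tmpdict3)  # ... up to 20
    for n, tmpdict in enumerate(tmpdicts):
        push_to_disc(tmpdict, 'diskdict%d.db' % (n + 1))
]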
Re: Writing huge Sets() to disk
Duncan Booth wrote:
> Martin MOKREJŠ wrote:
>> Hi, could someone tell me what all does and what all doesn't copy
>> references in Python? [...]
>
> Almost anything you do copies references.

But what does this?:

x = 'x'
a = x[2:]
b = z + len(x)
dict[a] = b

>> The above routine doesn't release the memory back when it exits.
>
> That's probably because there isn't any memory it can reasonably be
> expected to release. What memory would *you* expect it to release?

Those 300MB get allocated/reserved when the posted loop gets executed. When the loop exits, almost all of it is returned/deallocated. Yes, almost. :(

> The member variables are all still accessible as member variables until
> you run your loop at the end to clear them, so no way could Python
> release them.

OK, I wanted to know if there's some assignment using a reference which makes the internal garbage collector not recycle the memory -- for example, the target dictionary still keeping a reference to the temporary dictionary.

> Some hints:
>
> When posting code, try to post complete examples which actually work. I
> don't know what type the self._dict_on_diskXX variables are supposed to
> be. It makes a big difference if they are dictionaries (so you are
> trying to hold everything in memory at one time) or shelve.Shelf
> objects which would store the values on disc in a reasonably efficient
> manner.

The self._dict_on_diskXX are bsddb files; self._tmpdictXX are builtin dictionaries.

> Even if they are Shelf objects, I see no reason here why you have to

I gathered from previous discussion it's faster to use bsddb directly, so no shelve.

> process everything at once. Write a simple function which processes one
> tmpdict object into one dict_on_disk object and then closes the

That's what I do, but in the for loop ...

> dict_on_disk object. If you want to compare results later then do that by

OK, I got your point.

> reopening the dict_on_disk objects when you have deleted all the
> tmpdicts.

That's what I do (not shown).

> Extract out everything you want to do into a class which has at most one
> tmpdict and one dict_on_disk. That way your code will be a lot easier
> to read.
>
> Make your code more legible by using fewer underscores.
>
> What on earth is the point of an explicit call to __add__? If Guido had
> meant us to use __add__ he wouldn't have created '+'.

To invoke addition directly on the object. It's faster than letting Python figure out that I'm summing up int() plus int(). It definitely helped a lot when using Decimal(a) + Decimal(b), where I got rid of thousands of calls to Decimal.__new__, __init__ and I think a few other methods of decimal as well -- I think getcontext too.

> What is the purpose of dict_on_disk? Is it for humans to read the data?
> If not, then don't store everything as a string. Far better to just
> store a

It is processed for humans later.
> tuple of your values, then you don't have to use split or cast the
> strings

bsddb complains that I can store only a string as a key or value. I'd love to store a tuple.

import bsddb
_words1 = bsddb.btopen('db1.db', 'c')
_words1['a'] = 1

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.3/bsddb/__init__.py", line 120, in __setitem__
    self.db[key] = value
TypeError: Key and Data values must be of type string or None.

How can I record a number then?

> to integers. If you do want humans to read some final output then
> produce that separately from the working data files.
>
> You split out 4 values from dict_on_disk and set three of them to 0. Is
> that really what you meant, or should you be preserving the previous
> values?

No, overwrite them, i.e. invalidate them. Originally I recorded only the first, but computing the latter numbers is so expensive that I have to store them. As walking through the dictionaries is so slow, I gave up on the idea to store just one and, a lot later in the program, walk once again through the dictionary and 'fix' it by computing the missing values.

> Here is some (untested) code which might help you:
>
> import shelve

Why shelve? To have the ability to record a tuple? Isn't it cheaper to convert to string and back and write to bsddb, compared to this overhead?

> def push_to_disc(data, filename):
>     database = shelve.open(filename)
>     try:
>         for key in data:
>             if database.has_key(key):
>                 count, a, b, expected = database[key]
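[Since bsddb only accepts strings, numbers have to be serialized on the way in and parsed on the way out. Two possible sketches (illustrations, not from the thread): plain str()/int() conversion, or struct.pack for fixed-size binary records -- the '>IIIf' layout is only a guess at what the four fields might hold:

    import bsddb
    import struct

    db = bsddb.btopen('counts.db', 'c')

    # Simplest: convert on the way in, parse on the way out.
    db['a'] = str(1)
    n = int(db['a'])

    # Fixed-size alternative: three unsigned 32-bit ints plus a
    # 32-bit float packed into a 16-byte record.
    db['b'] = struct.pack('>IIIf', 1, 0, 0, 0.0)
    abs_count, a, b, expected_freq = struct.unpack('>IIIf', db['b'])

    db.close()
]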
Re: Writing huge Sets() to disk
Steve Holden wrote:
> Martin MOKREJŠ wrote:
>> Hi, could someone tell me what all does and what all doesn't copy
>> references in Python? I have found that my script, after reaching some
>> state and taking say 600MB, pushes its internal dictionaries to hard
>> disk. The for loop consumes another 300MB (as gathered by vmstat) to
>> push the data to dictionaries, then releases a little bit less than
>> 300MB, and the program starts to fill up its internal dictionaries
>> again; when they are full it will do the flush again ...
>>
>> The point here is that this code takes a lot of extra memory. I believe
>> it's the references problem, and I remember complaints from friends
>> facing the same problem. I'm a newbie, yes, but I don't have this
>> problem with Perl. OK, I want to improve my Python knowledge ... :-))
>
> Right ho! In fact I suspect you are still quite new to programming as a
> whole, for reasons that may become clear as we proceed.
>
>> def push_to_disk(self):
>>     _dict_on_disk_tuple = (None,
>>         self._dict_on_disk1, self._dict_on_disk2, self._dict_on_disk3,
>>         self._dict_on_disk4, self._dict_on_disk5, self._dict_on_disk6,
>>         self._dict_on_disk7, self._dict_on_disk8, self._dict_on_disk9,
>>         self._dict_on_disk10, self._dict_on_disk11, self._dict_on_disk12,
>>         self._dict_on_disk13, self._dict_on_disk14, self._dict_on_disk15,
>>         self._dict_on_disk16, self._dict_on_disk17, self._dict_on_disk18,
>>         self._dict_on_disk19, self._dict_on_disk20)

The None above is there just because I didn't want to evaluate something like x+1 all the time:

    for x in _dict_on_disk_tuple:
        key = newdict[x+1][key] = ...

This None doesn't hurt me; with it, x contains exactly the value I need several times in the loop ...

> It's a bit unfortunate that all those instance variables are global to
> the method, as it means we can't clearly see what you intend them to
> do. However ...
>
> Whenever I see such code, it makes me suspect that the approach to the
> problem could be more subtle. It appears you have decided to partition
> your data into twenty chunks somehow. The algorithm is clearly not
> coded in a way that would make it easy to modify the number of chunks.

No, it's not, but that's not the speed problem, really. ;)

> [Hint: by easy I mean modifying a statement that reads chunks = 20 to
> read chunks = 40, for example]. To avoid this, we might use (say) a
> list of temp dicts, for example (the length of this could easily then
> be parameterized as mentioned). So where (my psychic powers tell me)
> your __init__() method currently contains
>
>     self._dict_on_disk1 = something()
>     self._dict_on_disk2 = something()
>     ...
>     self._dict_on_disk20 = something()

Almost. They are bsddb dictionary files.

> I would have written
>
>     self._disk_dicts = []
>     for i in range(20):
>         self._disk_dicts.append(something)
>
> Then again, I probably have an advantage over you. I'm such a crappy
> typist I can guarantee I'd make at least six mistakes doing it your
> way :-)

Originally I had this; I just wanted to get rid of one small list. ;)

>>     _size = 0

> What with all these leading underscores I presume it must be VERY
> important to keep these objects' instance variables private. Do you
> have a particular reason for that, or just general Perl-induced
> paranoia? :-)

See below. ;)

>>     #
>>     # sizes of these tmpdicts range from 10-1 entries for each!
>>     #
>>     for _tmpdict in (self._tmpdict1, self._tmpdict2, self._tmpdict3,
>>         self._tmpdict4, self._tmpdict5, self._tmpdict6, self._tmpdict7,
>>         self._tmpdict8, self._tmpdict9, self._tmpdict10, self._tmpdict11,
>>         self._tmpdict12, self._tmpdict13, self._tmpdict14, self._tmpdict15,
>>         self._tmpdict16, self._tmpdict17, self._tmpdict18, self._tmpdict19,
>>         self._tmpdict20):
>>         _size += 1
>>         if _tmpdict:
>>             _dict_on_disk = _dict_on_disk_tuple[_size]
>>             for _word, _value in _tmpdict.iteritems():
>>                 try:
>>                     _string = _dict_on_disk[_word]
>>                     # I discard _a and _b; maybe _string.find(' ')
>>                     # combined with a slice would do better?
>>                     _abs_count, _a, _b, _expected_freq = _string.split()
>>                     _abs_count = int(_abs_count).__add__(_value)
>>                     _t = (str(_abs_count), '0', '0', '0')
>>                 except KeyError:
>>                     _t = (str(_value), '0', '0', '0')
>>                 # this writes a copy to the dict, right?
>>                 _dict_on_disk[_word] = ' '.join(_t)
>>     #
>>     # clear the temporary dictionaries in ourself
>>     # I think this works as expected and really does release memory
>>     #
>>     for _tmpdict in (self._tmpdict1, self._tmpdict2, self._tmpdict3,
>>         self._tmpdict4, self._tmpdict5, self._tmpdict6, self._tmpdict7,
>>         self._tmpdict8, self._tmpdict9, self._tmpdict10, self._tmpdict11,
>>         self._tmpdict12, self._tmpdict13, self._tmpdict14, self._tmpdict15,
>>         self._tmpdict16, self._tmpdict17, self._tmpdict18, self._tmpdict19,
>>         self._tmpdict20):
>>         _tmpdict.clear()

> There you go again with that huge tuple. You
Re: Writing huge Sets() to disk
Tim Peters wrote:
> [Martin MOKREJŠ]
>> ... I gave up the theoretical approach. Practically, I might need up
>> to store maybe those 1E15 keys.
>
> We should work on our multiplication skills here <wink>. You don't
> have enough disk space to store 1E15 keys. If your keys were just one
> byte each, you would need to have 4 thousand disks of 250GB each to
> store 1E15 keys. How much disk space do you actually have? I'm
> betting you have no more than one 250GB disk.
> ...
> [Istvan Albert]
>>> On my system storing 1 million words of length 15 as keys of a
>>> python dictionary is around 75MB.
>
>> Fine, that's what I wanted to hear. How do you improve the algorithm?
>> Do you delay indexing to the very latest moment or do you let your
>> computer index 999 999 times just for fun?
>
> It remains wholly unclear to me what the algorithm you want might be.
> As I mentioned before, if you store keys in sorted text files, you can
> do intersection and difference very efficiently just by using the Unix
> `comm` utility.

This comm(1) approach doesn't work for me. It somehow fails to detect common entries when the offset is too big.

file 1:

A F G I K M N R V AA AI FG FR GF GI GR IG IK IN IV KI MA NG RA RI VF
AIK FGR FRA GFG GIN GRI IGI IGR IKI ING IVF KIG MAI NGF RAA RIG

file 2:

W W W W W W W W W W AA AI FG FR GF GI GR IG IK IN IV KI MA NG RA RI VF
A A A A A A A A A A A A
Re: Writing huge Sets() to disk
[Martin MOKREJŠ]
> This comm(1) approach doesn't work for me. It somehow fails to detect
> common entries when the offset is too big.
>
> file 1:
> A F G I K M N R V AA AI FG FR GF GI GR IG IK IN IV KI MA NG RA RI VF
> AIK FGR FRA GFG GIN GRI IGI IGR IKI ING IVF KIG MAI NGF RAA RIG
>
> file 2:
> W W W W W W W W W W AA AI FG FR GF GI GR IG IK IN IV KI MA NG RA RI VF
> A A A A A A A A A A A A

I'll repeat:

>> As I mentioned before, if you store keys in *sorted* text files ...

Those files aren't in sorted order, so of course `comm` can't do anything useful with them. Do `man sort`; sorting is not optional here.
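[For what it's worth, here is a rough Python equivalent of `sort file1 > file1.srt; sort file2 > file2.srt; comm -12 file1.srt file2.srt` (a sketch, assuming one word per line; the -12 flags make comm print only the lines common to both files):

    def common_lines(path1, path2):
        # Both inputs must already be sorted. Walk them in lockstep,
        # advancing whichever side is behind -- each file is read
        # exactly once, so memory use is constant regardless of size.
        f1, f2 = open(path1), open(path2)
        line1, line2 = f1.readline(), f2.readline()
        while line1 and line2:
            if line1 < line2:
                line1 = f1.readline()
            elif line1 > line2:
                line2 = f2.readline()
            else:
                yield line1
                line1, line2 = f1.readline(), f2.readline()
        f1.close()
        f2.close()
]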
Re: Writing huge Sets() to disk
Tim Peters wrote:
> [Martin MOKREJŠ]
>> This comm(1) approach doesn't work for me. It somehow fails to detect
>> common entries when the offset is too big. [...]
>
> I'll repeat:
>>> As I mentioned before, if you store keys in *sorted* text files ...
>
> Those files aren't in sorted order, so of course `comm` can't do
> anything useful with them. Do `man sort`; sorting is not optional here.

I did read the manpage, but it seems I did not execute sort(1) from within my Python code, so the files were unsorted and I had not realized it yet. Thanks!

m.
Re: Writing huge Sets() to disk
On Mon, 10 Jan 2005 17:11:09 +0100, Martin MOKREJŠ [EMAIL PROTECTED] wrote:

> Hi,
> I have sets.Set() objects having up to 20E20 items,

What notation are you using when you write 20E20? IOW, ISTM 1E9 is a billion. So 20E20 would be 2000 billion billion. Please clarify ;-)

> each is composed of up to 20 characters. Keeping them in memory on a
> 1GB machine puts me quickly into swap. I don't want to use the
> dictionary approach, as I don't see a sense to store None as a value.
> The items in a set are unique.
>
> How can I write them efficiently to disk? To be more exact, I have 20
> sets. _set1 has 1E20 keys of size 1 character.
>
> alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q',
>             'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
> for aa1 in alphabet:
>     # l = [aa1]
>     # _set1.add(aa1)
>     for aa2 in alphabet:
>         # l.append(aa2)
>         # _set2.add(''.join(l))
> [cut]
>
> The reason I went for sets instead of lists is the speed, availability
> of unique, common and other methods. What would you propose as an
> elegant solution? Actually, even those nested for loops take ages. :(

If you will explain a little what you are doing with these set items, perhaps someone will think of another way to represent and use your data.

Regards,
Bengt Richter
Re: Writing huge Sets() to disk
Martin MOKREJŠ wrote:
> But I don't think I can use one-way hashes, as I need to reconstruct
> the string later. I have to study hard to get an idea what the
> proposed code really does.

Scott David Daniels wrote:
>> Tim Peters wrote:
>>> Call the set of all English words E; G, C, and P similarly. This
>>> Python expression then gives the set of words common to all 4:
>>>
>>>     E & G & C & P
>>>
>>> and for those unique to Polish:
>>>
>>>     P - E - G - C
>>
>> One attack is to define a hash to get sizes smaller. Suppose you find
>> your word sets are 10**8 large, and you find you need to deal with
>> sets of size 10**6. Then if you define a hash that produces an
>> integer below 100, you'll find:
>>
>>     E & G & C & P == Union(E_k & G_k & C_k & P_k) for k in range(100)
>>     P - E - G - C == Union(P_k - E_k - G_k - C_k) for k in range(100)
>>
>> where X_k = [v for v in X if somehash(v) == k]

> I don't understand here what would be the result from the hashing
> function. :( ... Can you clarify this please in more detail?

The trick is to build the X_k values into files. For example:

    for base in range(0, 100, 10):  # work in blocks of 10 (or whatever)
        for language in ('English', 'German', 'Czech', 'Polish'):
            source = open(language)
            dests = [open('%s_%s.txt' % (language.lower(), base + n), 'w')
                     for n in range(10)]
            for word in source:  # Actually this is probably more involved
                code = somehash(word) - base
                if 0 <= code < 10:
                    dests[code].write(word + '\n')
            for dest in dests:
                dest.close()
            source.close()

After running the above code, you get 400 files with names like, for example, 'english_33.txt'. 'english_33.txt' contains only words in English which hash to 33. Then you need to sort the different files (like 'english_33.txt') before using them in the next phase. If you can sort and dupe-elim in the same step, do that and get rid of the calls to dupelim below. Otherwise use dupelim below when reading the sorted files. If you must, you might want to do the sort and dupe-elim in Python:

    for language in ('english', 'german', 'czech', 'polish'):
        for hashcode in range(100):
            filename = '%s_%s.txt' % (language, hashcode)
            source = open(filename)
            lines = sorted(source)
            source.close()
            dest = open(filename, 'w')
            for line in dupelim(lines):
                dest.write(line)
            dest.close()

>> def inboth(left, right): ...
>> def leftonly(left, right): ...

> Aaaah, so the functions just walk one line in left, one line in right;
> if the values don't match, the value in left is unique; it walks again
> one line in left and checks if it now matches the value in the right
> file at the last position, and so on until it finds the same value in
> the right file?

Exactly -- you must make sure you deal with cases where you pass values, match values, and run out of source; that keeps these from being four-liners. For example:

    Ponly = open('polishonly.txt', 'w')
    every = open('every.txt', 'w')
    for hashcode in range(100):
        English = open('english_%s.txt' % hashcode)

> ^^ this is some kind of eval?

If hashcode == 33 then 'english_%s.txt' % hashcode == 'english_33.txt'. So we are just finding the language-specific source for this hash.

    ...
        for unmatched in leftonly(leftonly(leftonly(dupelim(Polish),
                dupelim(English)), dupelim(German)), dupelim(Czech)):
            Ponly.write(unmatched)
        for source in (Polish, English, German, Czech):  # (paraphrased)
            source.seek(0)
        for match in inboth(inboth(dupelim(Polish), dupelim(English)),
                            inboth(dupelim(German), dupelim(Czech))):
            every.write(match)
        for source in (Polish, English, German, Czech):  # (paraphrased)
            source.close()
    Ponly.close()
    every.close()

--Scott David Daniels
[EMAIL PROTECTED]
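[Scott's code leaves somehash undefined. For illustration only, one possible choice (an assumption, not from the thread) is any deterministic function that spreads words over the 100 buckets:

    def somehash(word, buckets=100):
        # Simple, stable string hash folded into 0..buckets-1.
        # Stability across runs matters because the partition files
        # written in one phase are read back in a later one.
        h = 0
        for ch in word:
            h = (h * 31 + ord(ch)) % 1000003
        return h % buckets
]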
Re: Writing huge Sets() to disk
On Tue, Jan 11, 2005 at 12:33:42AM +0200, Simo Melenius wrote:
> John Lenton [EMAIL PROTECTED] writes:
>> you probably want to look into building set-like objects on top of
>> tries, given the homogeneity of your language. You should see
>> improvements both in size and speed.
>
> Ternary search trees give _much_ better space-efficiency compared to
> tries, at the expense of only slightly slower build and search time.
> This is especially essential as the OP mentioned he could have huge
> sets of data.

hmm! sounds like *some*body needs to go read up on ternary trees, then. Ok, ok, I'm going...

--
John Lenton ([EMAIL PROTECTED]) -- Random fortune:
Fortune finishes the great quotations, #6
"But, soft! What light through yonder window breaks?"
It's nothing, honey. Go back to sleep.
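[To make the ternary-search-tree suggestion concrete, a minimal sketch (not from the thread; insertion and membership test only):

    class TSTNode(object):
        __slots__ = ('ch', 'lo', 'eq', 'hi', 'is_word')
        def __init__(self, ch):
            self.ch = ch
            self.lo = self.eq = self.hi = None
            self.is_word = False

    class TST(object):
        def __init__(self):
            self.root = None

        def add(self, word):
            if word:
                self.root = self._add(self.root, word, 0)

        def _add(self, node, word, i):
            ch = word[i]
            if node is None:
                node = TSTNode(ch)
            if ch < node.ch:
                node.lo = self._add(node.lo, word, i)
            elif ch > node.ch:
                node.hi = self._add(node.hi, word, i)
            elif i + 1 < len(word):
                node.eq = self._add(node.eq, word, i + 1)
            else:
                node.is_word = True
            return node

        def __contains__(self, word):
            node, i = self.root, 0
            while node is not None and word:
                ch = word[i]
                if ch < node.ch:
                    node = node.lo
                elif ch > node.ch:
                    node = node.hi
                elif i + 1 < len(word):
                    node, i = node.eq, i + 1
                else:
                    return node.is_word
            return False

    t = TST()
    for w in ('AA', 'AI', 'FG'):
        t.add(w)
    assert 'AI' in t and 'AB' not in t

Unlike a dict, the tree shares common prefixes among keys, which is where the space saving for this kind of homogeneous data comes from.]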
Re: Writing huge Sets() to disk
Martin MOKREJŠ [EMAIL PROTECTED] writes:
> I have sets.Set() objects having up to 20E20 items, just imagine, you
> want to compare how many words are in the English, German, Czech,
> Polish dictionary. You collect words from every language and record
> them in a dict or Set, as you wish. Once you have those Sets or dicts
> for those 4 languages, you ask for common words and for those unique
> to Polish. I have no estimates of real-world numbers, but we might be
> in range of 1E6 or 1E8? I believe in any case, huge.

They'll be less than 1e6 and so not huge by the standards of today's computers. You could use ordinary dicts or sets.

1e20 is another matter. I doubt that there are any computers in the world with that much storage. How big is your dataset REALLY?

> I wanted to be able to get a list of words NOT found in say Polish,
> and therefore wanted to have a list of all theoretically existing
> words. In principle, I can drop this idea of having an ideal,
> theoretical lexicon. But I have to store those real-world dictionaries
> to hard drive anyway.

One way you could do it is by dumping all the words sequentially to disk, then sorting the resulting disk file using an O(n log n) offline algorithm.

Basically data sets of this size are outside of what you can easily handle with builtin Python operations without putting some thought into algorithms and data structures. From "ribosome" I'm guessing you're doing computational biology. If you're going to be writing code for these kinds of problems on a regular basis, you probably ought to read a book or two on the topic. CLRS is a good choice:

http://theory.lcs.mit.edu/~clr/
http://mitpress.mit.edu/algorithms/
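[A sketch of the dump-then-sort idea (an illustration, not from the thread; heapq.merge appeared in Python 2.6, so on the Python 2.3 of the day one would hand-roll the merge or shell out to Unix sort):

    import heapq, os, tempfile

    def sort_huge_file(infile, outfile, chunk_size=1000000):
        # Classic external sort: sort bounded chunks in RAM, spill
        # each sorted run to a temp file, then k-way merge the runs.
        runs, words = [], []
        for line in open(infile):
            words.append(line)
            if len(words) >= chunk_size:
                runs.append(_spill(sorted(words)))
                words = []
        if words:
            runs.append(_spill(sorted(words)))
        out = open(outfile, 'w')
        for line in heapq.merge(*[open(r) for r in runs]):
            out.write(line)
        out.close()
        for r in runs:
            os.remove(r)

    def _spill(lines):
        fd, path = tempfile.mkstemp()
        f = os.fdopen(fd, 'w')
        f.writelines(lines)
        f.close()
        return path
]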
Re: Writing huge Sets() to disk
Martin MOKREJŠ [EMAIL PROTECTED] writes:
> Yes, I am. I still don't get what that acronym CLRS stands for ... :(

CLRS = the names of the authors: Cormen, Leiserson, Rivest, and Stein, if I spelled those correctly. :)
Re: Writing huge Sets() to disk
[Istvan Albert]
#- I think that you need to first understand how dictionaries work.
#- The time needed to insert a key is independent of
#- the number of values in the dictionary.

[Batista, Facundo]
> Are you sure? I think that is true while the hashes don't collide. If
> you have collisions, time starts to depend on element quantity. But
> I'm not sure; Tim surely can enlighten us.

I can only point the way to enlightenment -- you have to contemplate your own navel (or Guido's, but he's kind of shy that way).

What Istvan Albert said is close enough in context. The *expected* (mean) number of collisions in a Python dict probe is less than 1, regardless of dict size. That assumes the collision distribution is no worse than random. It's possible that all dict keys hash to the same table slot, and then each insertion is O(len(dict)). It's possible to contrive such cases even if the hash function is a good one. But it's not going to happen by accident (and, when it does <wink>, open a bug report -- we'll improve the key type's hash function then).
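[A quick, unscientific way to watch this in practice (a sketch; absolute numbers will vary by machine):

    import time

    d = {}
    block = 100000
    for i in range(10):
        start = time.time()
        for n in xrange(i * block, (i + 1) * block):
            d[str(n)] = None  # insert 100k fresh string keys
        # elapsed time per block stays roughly flat as the dict grows
        print len(d), time.time() - start
]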
Re: Writing huge Sets() to disk
Batista, Facundo wrote:
> [Martin MOKREJŠ]
> #- I have sets.Set() objects having up to 20E20 items,
> #- each is composed of up to 20 characters. Keeping
>
> Are you really sure??

Either I'll have to construct them all over again, say 20-30 times, or I'll find a way to keep them on disk.

> #- How can I write them efficiently to disk? To be more exact,
>
> I think that there's some mistake here. At least you'll need a disk of
> 34694 EXABYTES!!!

Hmm, you are right. So 20E15 then? I definitely need to be in range 1-14. ;-) Anyway, they are large.

M.
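[For the record, Facundo's figure checks out if each of the 20E20 items carries about 20 bytes (a back-of-the-envelope check, not from the thread):

    items = 20e20          # the claimed number of set items
    bytes_each = 20        # up to 20 characters per item
    print items * bytes_each / 2.0 ** 60   # ~34694 (binary) exabytes
]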
RE: Writing huge Sets() to disk
[Martin MOKREJŠ]
#- At least you'll need a disk of 34694 EXABYTES!!!
#-
#- Hmm, you are right. So 20E15 then? I definitely need to be

Right. Now you only need 355 PETABytes. Nowadays disk is cheap, but...

#- in range 1-14. ;-)

Why?

.    Facundo

Bitácora De Vuelo: http://www.taniquetil.com.ar/plog
PyAr - Python Argentina: http://pyar.decode.com.ar/
RE: Writing huge Sets() to disk
[Martin MOKREJŠ]
#- I have sets.Set() objects having up to 20E20 items,
#- each is composed of up to 20 characters. Keeping

Are you really sure??

#- How can I write them efficiently to disk? To be more exact,

I think that there's some mistake here. At least you'll need a disk of 34694 EXABYTES!!!

.    Facundo

Bitácora De Vuelo: http://www.taniquetil.com.ar/plog
PyAr - Python Argentina: http://pyar.decode.com.ar/
Re: Writing huge Sets() to disk
Batista, Facundo wrote:
> [Martin MOKREJŠ]
> #- At least you'll need a disk of 34694 EXABYTES!!!
> #-
> #- Hmm, you are right. So 20E15 then? I definitely need to be
>
> Right. Now you only need 355 PETABytes. Nowadays disk is cheap, but...
>
> #- in range 1-14. ;-)
>
> Why?

I need to test for the occurrence of every such combination in some real-world examples. Many, many will be missing, and those I have to keep. I've no clue how big the drop will be, so I rather expect full size. So the generated, theoretical lexicon I could generate on the fly, but the results I have to keep.

Even if I give up on this, can I write Sets onto a disk while keeping their nice built-in methods?

M.
Re: Writing huge Sets() to disk
Martin MOKREJŠ [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]
> Hi,
> I have sets.Set() objects having up to 20E20 items, each is composed
> of up to 20 characters. Keeping them in memory on a 1GB machine puts
> me quickly into swap. I don't want to use the dictionary approach, as
> I don't see a sense to store None as a value. The items in a set are
> unique.
>
> How can I write them efficiently to disk? To be more exact, I have 20
> sets. _set1 has 1E20 keys of size 1 character.
>
> alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q',
>             'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
> for aa1 in alphabet:
>     # l = [aa1]
>     # _set1.add(aa1)
>     for aa2 in alphabet:
>         # l.append(aa2)
>         # _set2.add(''.join(l))
> [cut]
>
> The reason I went for sets instead of lists is the speed, availability
> of unique, common and other methods. What would you propose as an
> elegant solution? Actually, even those nested for loops take ages. :(
> M.

_set1 only has 19 keys of size 1 character - 'A' is duplicated.

Assuming you replace 'A' with another character (say 'Z'), then here is what you get:

_set1 - 20 elements (20**1)
_set2 - 400 elements (20**2)
_set3 - 8000 elements (20**3)
...
_set20 - 20**20 ~ 10^(1.301*20) or 1E26

If you have no duplicates in your alphabet, then you won't need to use sets - every combination will be unique. In this case, just use a generator.

Here's a recursive generator approach that may save you a bunch of nested editing (set maxDepth as high as you dare):

alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'Z', 'Q',
            'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')

maxDepth = 3

def genNextCombo(root=list(), depth=1):
    for a in alphabet:
        thisRoot = root + [a]
        yield ''.join(thisRoot)
        if depth < maxDepth:
            for c in genNextCombo(thisRoot, depth + 1):
                yield c

for c in genNextCombo():
    print c  # or write to file, or whatever

-- Paul
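[As a side note (not from the thread), later Pythons (2.6+) can generate the same strings without recursion via itertools.product, albeit grouped by length rather than depth-first:

    import itertools

    def gen_combos(alphabet, max_depth):
        # All strings of length 1..max_depth over the alphabet.
        for depth in range(1, max_depth + 1):
            for combo in itertools.product(alphabet, repeat=depth):
                yield ''.join(combo)
]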
Re: Writing huge Sets() to disk
Robert Brewer wrote:
> Martin MOKREJŠ wrote:
>> I have sets.Set() objects having up to 20E20 items, each is composed
>> of up to 20 characters. Keeping them in memory on a 1GB machine puts
>> me quickly into swap. I don't want to use the dictionary approach, as
>> I don't see a sense to store None as a value. The items in a set are
>> unique. How can I write them efficiently to disk?
>
> got shelve*?

I know about shelve, but doesn't it work like a dictionary? Why should I use shelve for this? Then it's faster to use bsddb directly and use a string as a key and None as a value, I'd guess.

Even for that, note that for the data contained in _set11, the index should be (could be) optimized for key size 11 -- there are no other record sizes. Similarly, _set15 has all keys of size 15. In the docs for bsddb, anydbm and the other modules, I don't see how to optimize for that. Without this optimization, I think it would be even slower. And shelve gives me exactly such an unoptimized, general index on a dictionary. Maybe I'm wrong; I'm just a beginner here.

Thanks
M.
Re: Writing huge Sets() to disk
Paul McGuire wrote:
> Martin MOKREJŠ [EMAIL PROTECTED] wrote in message
> news:[EMAIL PROTECTED]
>> Hi,
>> I have sets.Set() objects having up to 20E20 items, each is composed
>> of up to 20 characters. Keeping them in memory on a 1GB machine puts
>> me quickly into swap. I don't want to use the dictionary approach, as
>> I don't see a sense to store None as a value. The items in a set are
>> unique.
>>
>> How can I write them efficiently to disk? To be more exact, I have 20
>> sets. _set1 has 1E20 keys of size 1 character.
>>
>> alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q',
>>             'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
>> for aa1 in alphabet:
>>     # l = [aa1]
>>     # _set1.add(aa1)
>>     for aa2 in alphabet:
>>         # l.append(aa2)
>>         # _set2.add(''.join(l))
>> [cut]
>>
>> The reason I went for sets instead of lists is the speed, availability
>> of unique, common and other methods. What would you propose as an
>> elegant solution? Actually, even those nested for loops take ages. :(
>> M.
>
> _set1 only has 19 keys of size 1 character - 'A' is duplicated.

Ahh, you are right. That's a typo; yet I have to figure out which char I have to use. But 'Z' will do for the tests anyway.

> Assuming you replace 'A' with another character (say 'Z'), then here
> is what you get:
>
> _set1 - 20 elements (20**1)
> _set2 - 400 elements (20**2)
> _set3 - 8000 elements (20**3)
> ...
> _set20 - 20**20 ~ 10^(1.301*20) or 1E26
>
> If you have no duplicates in your alphabet, then you won't need to use
> sets - every combination will be unique. In this case, just use a
> generator.

As I noted in a later response, I will actually compare these ideal sets to some real-world examples. I have no expectation how many entries will be common and how many unique. The only thing I know is that even in the real world, I might be in size ranges of 2E6 or perhaps 2E8?

> Here's a recursive generator approach that may save you a bunch of
> nested editing (set maxDepth as high as you dare):
>
> alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'Z', 'Q',
>             'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
>
> maxDepth = 3
>
> def genNextCombo(root=list(), depth=1):
>     for a in alphabet:
>         thisRoot = root + [a]
>         yield ''.join(thisRoot)
>         if depth < maxDepth:
>             for c in genNextCombo(thisRoot, depth + 1):
>                 yield c
>
> for c in genNextCombo():
>     print c  # or write to file, or whatever

Works nice, but http://www.python.org/doc/2.3.4/ref/yield.html doesn't really tell what happens here. :( But it is quite fast and definitely saves those stupid, recursively hand-written for loops.

Thanks Paul!

M.
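[Since the reference page cited above is terse, a toy example may help show what yield does: each yield hands one value to the caller and suspends the function, which resumes right after the yield on the next iteration:

    def countdown(n):
        while n > 0:
            yield n       # suspend here, hand n to the caller
            n -= 1        # resume here on the next iteration

    for i in countdown(3):
        print i           # prints 3, 2, 1
]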
RE: Writing huge Sets() to disk
Martin MOKREJŠ wrote:
> Robert Brewer wrote:
>> Martin MOKREJŠ wrote:
>>> I have sets.Set() objects having up to 20E20 items, each is composed
>>> of up to 20 characters. [...] How can I write them efficiently to
>>> disk?
>>
>> got shelve*?
>
> I know about shelve, but doesn't it work like a dictionary? Why should
> I use shelve for this? Then it's faster to use bsddb directly and use
> a string as a key and None as a value, I'd guess.

If you're using Python 2.3, then a sets.Set *is* implemented with a dictionary, with None values. It simply has some extra methods to make it behave like a set. In addition, the Set class already has builtin methods for pickling and unpickling.

So it's probably faster to use bsddb directly, but why not find out by trying 2 lines of code that use shelve? The time-consuming part of your quest is writing the timed test suite that will indicate which route will be fastest, which you'll have to do regardless.

Robert Brewer
MIS
Amor Ministries
[EMAIL PROTECTED]
Re: Writing huge Sets() to disk
On Mon, 2005-01-10 at 11:11, Martin MOKREJŠ wrote:
> Hi,
> I have sets.Set() objects having up to 20E20 items, each is composed
> of up to 20 characters. Keeping them in memory on a 1GB machine puts
> me quickly into swap. I don't want to use the dictionary approach, as
> I don't see a sense to store None as a value. The items in a set are
> unique.

Let's be realistic. Your house is on fire and you are remodeling the basement.

Assuming you are on a 64 bit machine with full 64 bit addressing, your absolute upper limit on the size of a set is 2^64, or 18446744073709551616 bytes. Your real upper limit is at least an order of magnitude smaller. You are asking us how to store 20E20, or 2*10**21, items in a Set. That is still an order of magnitude greater than the number of *bits* you can address. Your desktop might not be able to enumerate all of these strings in your lifetime, much less index and store them. We might as well be discussing the number of angels that can sit on the head of a pin.

Any discussion of a list vs Set/dict is a small micro-optimization matter dwarfed by the fact that there don't exist machines to hold this data. The consideration of Set vs. dict is an even less important matter of syntactic sugar.

To me, it sounds like you are taking an AI class and trying to deal with a small search space by brute force. First, stop banging your head against the wall algorithmically. Nobody lost their job for saying NP != P. Then tell us what you are trying to do; perhaps there is a better way, perhaps the problem is unsolvable and there is a heuristic that will satisfy your needs.

Adam DePrince
Re: Writing huge Sets() to disk
Adam DePrince wrote:
> On Mon, 2005-01-10 at 11:11, Martin MOKREJŠ wrote:
>> Hi,
>> I have sets.Set() objects having up to 20E20 items, each is composed
>> of up to 20 characters. Keeping them in memory on a 1GB machine puts
>> me quickly into swap. I don't want to use the dictionary approach, as
>> I don't see a sense to store None as a value. The items in a set are
>> unique.
>
> Let's be realistic. Your house is on fire and you are remodeling the
> basement. [...]
>
> To me, it sounds like you are taking an AI class and trying to deal
> with a small search space by brute force. First, stop banging your
> head against the wall algorithmically. Nobody lost their job for
> saying NP != P. Then tell us what you are trying to do; perhaps there
> is a better way, perhaps the problem is unsolvable and there is a
> heuristic that will satisfy your needs.

Hi Adam,
just imagine you want to compare how many words are in the English, German, Czech and Polish dictionaries. You collect words from every language and record them in a dict or Set, as you wish. Once you have those Sets or dicts for those 4 languages, you ask for common words and for those unique to Polish. I have no estimates of real-world numbers, but we might be in range of 1E6 or 1E8? I believe in any case, huge.

My concern is actually purely scientific, not really related to analysis of these 4 languages, but I believe it describes my intent quite well.

I wanted to be able to get a list of words NOT found in, say, Polish, and therefore wanted to have a list of all theoretically existing words. In principle, I can drop this idea of having an ideal, theoretical lexicon. But I have to store those real-world dictionaries to hard drive anyway.

Martin
Re: Writing huge Sets() to disk
[Martin MOKREJŠ]
> just imagine, you want to compare how many words are in English,
> German, Czech, Polish dictionary. You collect words from every
> language and record them in dict or Set, as you wish.

Call the set of all English words E; G, C, and P similarly.

> Once you have those Set's or dict's for those 4 languages, you ask
> for common words

This Python expression then gives the set of words common to all 4:

    E & G & C & P

> and for those unique to Polish.

    P - E - G - C

is a reasonably efficient way to compute that.

> I have no estimates of real-world numbers, but we might be in range
> of 1E6 or 1E8? I believe in any case, huge.

No matter how large, it's utterly tiny compared to the number of character strings that *aren't* words in any of these languages. English has a lot of words, but nobody estimates it at over 2 million (including scientific jargon, like names for chemical compounds):

http://www.worldwidewords.org/articles/howmany.htm

> My concern is actually purely scientific, not really related to
> analysis of these 4 languages, but I believe it describes my intent
> quite well. I wanted to be able to get a list of words NOT found in
> say Polish, and therefore wanted to have a list of all theoretically
> existing words. In principle, I can drop this idea of having an
> ideal, theoretical lexicon. But I have to store those real-world
> dictionaries to hard drive anyway.

Real-world dictionaries shouldn't be a problem. I recommend you store each as a plain text file, one word per line. Then, e.g., to convert that into a set of words, do

    f = open('EnglishWords.txt')
    set_of_English_words = set(f)
    f.close()

You'll have a trailing newline character in each word, but that doesn't really matter.

Note that if you sort the word-per-line text files first, the Unix `comm` utility can be used to perform intersection and difference on a pair at a time with virtually no memory burden (and regardless of file size).
Re: Writing huge Sets() to disk
Tim Peters wrote:
> [Martin MOKREJŠ]
>> just imagine, you want to compare how many words are in English,
>> German, Czech, Polish dictionary. You collect words from every
>> language and record them in dict or Set, as you wish.
>
> Call the set of all English words E; G, C, and P similarly.
>
>> Once you have those Set's or dict's for those 4 languages, you ask
>> for common words
>
> This Python expression then gives the set of words common to all 4:
>
>     E & G & C & P
>
>> and for those unique to Polish.
>
>     P - E - G - C
>
> is a reasonably efficient way to compute that.

Nice -- is it equivalent to the common/unique methods of Sets?

>> I have no estimates of real-world numbers, but we might be in range
>> of 1E6 or 1E8? I believe in any case, huge.
>
> No matter how large, it's utterly tiny compared to the number of
> character strings that *aren't* words in any of these languages.
> English has a lot of words, but nobody estimates it at over 2 million
> (including scientific jargon, like names for chemical compounds):
>
> http://www.worldwidewords.org/articles/howmany.htm

As I've said, I analyze in reality something other than languages; however, it can be described with the example of words in different languages. But nevertheless, imagine 1E6 words of size 15. That's maybe 1.5GB of raw data. Will sets be appropriate, you think?

> Real-world dictionaries shouldn't be a problem. I recommend you store
> each as a plain text file, one word per line. Then, e.g., to convert
> that into a set of words, do
>
>     f = open('EnglishWords.txt')
>     set_of_English_words = set(f)

I'm aware I can't keep set_of_English_words in memory.

>     f.close()

M.
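[To answer the method question directly: sets.Set spells these operations intersection() and difference() -- there are no methods literally named common or unique -- and the operators are equivalent. A small sketch with made-up words:

    from sets import Set   # Python 2.3; later versions have a built-in set

    E = Set(['ok', 'cat'])
    G = Set(['ok', 'hund'])
    C = Set(['ok', 'pes'])
    P = Set(['ok', 'kot'])

    assert E & G & C & P == E.intersection(G).intersection(C).intersection(P)
    assert P - E - G - C == P.difference(E).difference(G).difference(C)
    # E & G & C & P -> Set(['ok']); P - E - G - C -> Set(['kot'])
]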
Re: Writing huge Sets() to disk
Martin MOKREJŠ wrote:
> But nevertheless, imagine 1E6 words of size 15. That's maybe 1.5GB of
> raw data. Will sets be appropriate, you think?

You started out with 20E20, then cut back to 1E15 keys; now it is down to one million, but you claim that these will take 1.5 GB.

On my system storing 1 million words of length 15 as keys of a python dictionary is around 75MB.

I.
Re: Writing huge Sets() to disk
Istvan Albert wrote:
> Martin MOKREJŠ wrote:
>> But nevertheless, imagine 1E6 words of size 15. That's maybe 1.5GB of
>> raw data. Will sets be appropriate, you think?
>
> You started out with 20E20, then cut back to 1E15 keys; now it is down
> to one million, but you claim that these will take 1.5 GB.

I gave up the theoretical approach. Practically, I might need to store maybe up to those 1E15 keys.

So you say 1 million words is better stored in a dictionary than in a set, using my own function to get out those unique or common words?

> On my system storing 1 million words of length 15 as keys of a python
> dictionary is around 75MB.

Fine, that's what I wanted to hear. How do you improve the algorithm? Do you delay indexing to the very latest moment, or do you let your computer index 999 999 times just for fun?
Re: Writing huge Sets() to disk
Martin MOKREJŠ wrote:
> Istvan Albert wrote:
> [...]
> So you say 1 million words is better stored in a dictionary than in a
> set, using my own function to get out those unique or common words?

I have said nothing even remotely like that.

> Fine, that's what I wanted to hear. How do you improve the algorithm?
> Do you delay indexing to the very latest moment, or do you let your
> computer index 999 999 times just for fun?

I think that you need to first understand how dictionaries work. The time needed to insert a key is independent of the number of values in the dictionary.

Istvan.
Re: Writing huge Sets() to disk
[Martin MOKREJŠ]
> ... I gave up the theoretical approach. Practically, I might need up
> to store maybe those 1E15 keys.

We should work on our multiplication skills here <wink>. You don't have enough disk space to store 1E15 keys. If your keys were just one byte each, you would need to have 4 thousand disks of 250GB each to store 1E15 keys. How much disk space do you actually have? I'm betting you have no more than one 250GB disk.

...

[Istvan Albert]
>> On my system storing 1 million words of length 15 as keys of a
>> python dictionary is around 75MB.

[Martin MOKREJŠ]
> Fine, that's what I wanted to hear. How do you improve the algorithm?
> Do you delay indexing to the very latest moment or do you let your
> computer index 999 999 times just for fun?

It remains wholly unclear to me what the algorithm you want might be. As I mentioned before, if you store keys in sorted text files, you can do intersection and difference very efficiently just by using the Unix `comm` utility.
Re: Writing huge Sets() to disk
You probably want to look into building set-like objects on top of tries, given the homogeneity of your language. You should see improvements both in size and speed.
RE: Writing huge Sets() to disk
[Istvan Albert]
#- I think that you need to first understand how dictionaries work.
#- The time needed to insert a key is independent of
#- the number of values in the dictionary.

Are you sure? I think that is true while the hashes don't collide. If you have collisions, time starts to depend on element quantity. But I'm not sure; Tim surely can enlighten us.

Asking-for-god's-word-ly yours,

.    Facundo

Bitácora De Vuelo: http://www.taniquetil.com.ar/plog
PyAr - Python Argentina: http://pyar.decode.com.ar/
Re: Writing huge Sets() to disk
Tim Peters wrote:
> [Martin MOKREJŠ]
>> ... I gave up the theoretical approach. Practically, I might need up
>> to store maybe those 1E15 keys.
>
> We should work on our multiplication skills here <wink>. You don't
> have enough disk space to store 1E15 keys. If your keys were just one
> byte each, you would need to have 4 thousand disks of 250GB each to
> store 1E15 keys. How much disk space do you actually have? I'm
> betting you have no more than one 250GB disk.

I can get about 700GB on RAID5, but this doesn't help here, of course. :( I definitely appreciate your comments; I somehow forgot to make sure I can store it. I was mainly distracted by the fact that I might anyway hit almost the same size if there are too few words used in reality. Therefore, when looking for those words not found in such a dictionary, I'd be close to the theoretical, maximal size of, say, the order of E15 or E14.

> [Istvan Albert]
>>> On my system storing 1 million words of length 15 as keys of a
>>> python dictionary is around 75MB.
>
>> Fine, that's what I wanted to hear. How do you improve the algorithm?
>> Do you delay indexing to the very latest moment or do you let your
>> computer index 999 999 times just for fun?
>
> It remains wholly unclear to me what the algorithm you want might

My intent is to do some analysis in protein science. But it can really be thought of as analysing those 4 different languages.

> be. As I mentioned before, if you store keys in sorted text files,
> you can do intersection and difference very efficiently just by using
> the Unix `comm` utility.

Now I got your point. I understand that comm(1) is written in C, but it still has to scan file1 once and file2 n-times, where n is the number of lines in file1, right? Without any index ... I'll consider it; actually I will test, thanks!

I was really hoping I'd get an answer on how to alter the indexes for dictionaries in Python.

You convinced me not to even try to construct the theoretical dictionary, as it would take ages just to create. Even if I'd manage, I couldn't save it (neither the theoretical nor, possibly, even the dict(theoret) - dict(real) result).

Still, before I give up the whole project, once more:

I'll parse some text files, isolate separate words and add them to either a Set(), list, dict, or flatfile, line by line. Depending on the above, I'll compare them and look for those unique to some language. I need to keep track of the frequencies used in every such language, so the dict approach would be the best. The number stored as a value would be a float ^H^H^H^H^H^H Decimal() type - a very small number.

Once more, I expect to have between E4 or E5 to E8??? words stored in 20 dictionaries (remember, words of sizes in range 1-20?). Each of those 20 dictionaries should be, I believe, indexed just once. The indexing method should know that all entries in a given file are of the same size, i.e. 5 chars, 15 chars, 20 chars etc.

I already did implement the logic to walk through those 20 different dictionaries from language a and from language b and find out those unique to a or common to both of them. Why I went to ask on this list was to make sure I took the right approach. Sets seemed to be the better solution for the simple comparison (common, unique). To keep track of those very small frequencies, I have to have dictionaries anyway. I say that again: how can I improve the speed of dictionary indexing? It doesn't matter here if I have 10E4 keys in a dictionary or 10E8 keys in a dictionary.

Or I wanted to hear: go for Sets(); having a set with 1E8 keys might take 1/10 of the size you need for a dictionary ...
... but you cannot dump them efficiently to disk. Using shelve will cost you maybe 2x the size of the dictionary approach and will also be slower than writing the dictionary directly. Something along those lines.

I really appreciate your input!

Martin
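[To make that plan concrete, a rough sketch of the per-length dictionaries with frequency counts (untested; the 20-way split by word length and the bsddb files are from the thread, the rest is assumption):

    import bsddb

    class WordStats(object):
        # One in-memory dict and one bsddb file per word length (1..20).
        def __init__(self, prefix, max_len=20):
            self.tmp = [None] + [{} for n in range(max_len)]
            self.disk = [None] + [bsddb.btopen('%s%d.db' % (prefix, n), 'c')
                                  for n in range(1, max_len + 1)]

        def add(self, word):
            d = self.tmp[len(word)]
            d[word] = d.get(word, 0) + 1

        def flush(self):
            # Merge each tmp dict into its on-disk counterpart, then
            # clear it so the memory really is released.
            for tmp, disk in zip(self.tmp[1:], self.disk[1:]):
                for word, count in tmp.iteritems():
                    if disk.has_key(word):
                        count += int(disk[word])
                    disk[word] = str(count)
                tmp.clear()

    # Usage: stats.add(word) for every parsed word; stats.flush()
    # whenever memory use gets high.
]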