Re: Writing huge Sets() to disk

2005-01-17 Thread Martin MOKREJ
Hi,
 could someone tell me what does and what doesn't copy
references in Python? I have found that my script, after reaching some
state and taking say 600MB, pushes its internal dictionaries
to hard disk. The for loop consumes another 300MB (as gathered
by vmstat) to push the data to the dictionaries, then releases
a little less than 300MB, and the program starts to fill up
its internal dictionaries again; when they are full it will do the
flush again ...
 The point here is that this code takes a lot of extra memory.
I believe it's the references problem, and I remember complaints
from friends facing the same problem. I'm a newbie, yes, but I don't
have this problem with Perl. OK, I want to improve my Python
knowledge ... :-))

def push_to_disk(self):
    _dict_on_disk_tuple = (None, self._dict_on_disk1, self._dict_on_disk2,
        self._dict_on_disk3, self._dict_on_disk4, self._dict_on_disk5,
        self._dict_on_disk6, self._dict_on_disk7, self._dict_on_disk8,
        self._dict_on_disk9, self._dict_on_disk10, self._dict_on_disk11,
        self._dict_on_disk12, self._dict_on_disk13, self._dict_on_disk14,
        self._dict_on_disk15, self._dict_on_disk16, self._dict_on_disk17,
        self._dict_on_disk18, self._dict_on_disk19, self._dict_on_disk20)
    _size = 0
    #
    # sizes of these tmpdicts range from 10-1 entries for each!
    for _tmpdict in (self._tmpdict1, self._tmpdict2, self._tmpdict3,
        self._tmpdict4, self._tmpdict5, self._tmpdict6, self._tmpdict7,
        self._tmpdict8, self._tmpdict9, self._tmpdict10, self._tmpdict11,
        self._tmpdict12, self._tmpdict13, self._tmpdict14, self._tmpdict15,
        self._tmpdict16, self._tmpdict17, self._tmpdict18, self._tmpdict19,
        self._tmpdict20):
        _size += 1
        if _tmpdict:
            _dict_on_disk = _dict_on_disk_tuple[_size]
            for _word, _value in _tmpdict.iteritems():
                try:
                    _string = _dict_on_disk[_word]
                    # I discard _a and _b, maybe _string.find(' ') combined
                    # with a slice would do better?
                    _abs_count, _a, _b, _expected_freq = _string.split()
                    _abs_count = int(_abs_count).__add__(_value)
                    _t = (str(_abs_count), '0', '0', '0')
                except KeyError:
                    _t = (str(_value), '0', '0', '0')
                # this writes a copy to the dict, right?
                _dict_on_disk[_word] = ' '.join(_t)
    #
    # clear the temporary dictionaries in ourself
    # I think this works as expected and really does release memory
    #
    for _tmpdict in (self._tmpdict1, self._tmpdict2, self._tmpdict3,
        self._tmpdict4, self._tmpdict5, self._tmpdict6, self._tmpdict7,
        self._tmpdict8, self._tmpdict9, self._tmpdict10, self._tmpdict11,
        self._tmpdict12, self._tmpdict13, self._tmpdict14, self._tmpdict15,
        self._tmpdict16, self._tmpdict17, self._tmpdict18, self._tmpdict19,
        self._tmpdict20):
        _tmpdict.clear()

  The above routine doesn't release the memory back when it
exits. 


  See, the loop already takes 25 minutes, and it keeps getting longer;
the program is only about 1/3 or 1/4 of the way through the total input.
The rest of my code is fast in contrast to this (under 1 minute).
-rw---  1 mmokrejs users 257376256 Jan 17 11:38 diskdict12.db
-rw---  1 mmokrejs users 267157504 Jan 17 11:35 diskdict11.db
-rw---  1 mmokrejs users 266534912 Jan 17 11:28 diskdict10.db
-rw---  1 mmokrejs users 253149184 Jan 17 11:21 diskdict9.db
-rw---  1 mmokrejs users 250232832 Jan 17 11:14 diskdict8.db
-rw---  1 mmokrejs users 246349824 Jan 17 11:07 diskdict7.db
-rw---  1 mmokrejs users 19488 Jan 17 11:02 diskdict6.db
-rw---  1 mmokrejs users  66584576 Jan 17 10:59 diskdict5.db
-rw---  1 mmokrejs users   5750784 Jan 17 10:57 diskdict4.db
-rw---  1 mmokrejs users    311296 Jan 17 10:57 diskdict3.db
-rw---  1 mmokrejs users 295895040 Jan 17 10:56 diskdict20.db
-rw---  1 mmokrejs users 293634048 Jan 17 10:49 diskdict19.db
-rw---  1 mmokrejs users 299892736 Jan 17 10:43 diskdict18.db
-rw---  1 mmokrejs users 272334848 Jan 17 10:36 diskdict17.db
-rw---  1 mmokrejs users 274825216 Jan 17 10:30 diskdict16.db
-rw---  1 mmokrejs users 273104896 Jan 17 10:23 diskdict15.db
-rw---  1 mmokrejs users 272678912 Jan 17 10:18 diskdict14.db
-rw---  1 mmokrejs users 260407296 Jan 17 10:13 diskdict13.db
   Some spoke about mmapped files. Could I take advantage of that
with the bsddb module or bsddb? Is gdbm better in some ways? Recently
you have said dictionary operations are fast ... Once more: I want to
turn off locking support. I can make the values strings of fixed size,
if mmap() would be available. The number of keys doesn't grow much in
time; mostly there are only updates.
   

Thanks for any ideas.
martin
--
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-17 Thread Duncan Booth
Martin MOKREJŠ wrote:

 Hi,
   could someone tell me what does and what doesn't copy
 references in Python? I have found that my script, after reaching some
 state and taking say 600MB, pushes its internal dictionaries
 to hard disk. The for loop consumes another 300MB (as gathered
 by vmstat) to push the data to the dictionaries, then releases
 a little less than 300MB, and the program starts to fill up
 its internal dictionaries again; when they are full it will do the
 flush again ...

Almost anything you do copies references.

  
   The point here is that this code takes a lot of extra memory.
 I believe it's the references problem, and I remember complaints
 from friends facing the same problem. I'm a newbie, yes, but I don't
 have this problem with Perl. OK, I want to improve my Python
 knowledge ... :-))
 
 
 
 
long code extract snipped
 
 
The above routine doesn't release the memory back when it
 exits. 
That's probably because there isn't any memory it can reasonably be 
expected to release. What memory would *you* expect it to release?

The member variables are all still accessible as member variables until you 
run your loop at the end to clear them, so no way could Python release 
them.

Some hints:

When posting code, try to post complete examples which actually work. I 
don't know what type the self._dict_on_diskXX variables are supposed to be. 
It makes a big difference if they are dictionaries (so you are trying to 
hold everything in memory at one time) or shelve.Shelf objects which would 
store the values on disc in a reasonably efficient manner.

Even if they are Shelf objects, I see no reason here why you have to 
process everything at once. Write a simple function which processes one 
tmpdict object into one dict_on_disk object and then closes the 
dict_on_disk object. If you want to compare results later then do that by 
reopening the dict_on_disk objects when you have deleted all the tmpdicts.

Extract out everything you want to do into a class which has at most one 
tmpdict and one dict_on_disk. That way your code will be a lot easier to 
read.

Make your code more legible by using fewer underscores.

What on earth is the point of an explicit call to __add__? If Guido had 
meant us to use __add__ he wouldn't have created '+'.

What is the purpose of dict_on_disk? Is it for humans to read the data? If 
not, then don't store everything as a string. Far better to just store a 
tuple of your values then you don't have to use split or cast the strings 
to integers. If you do want humans to read some final output then produce 
that separately from the working data files.

You split out 4 values from dict_on_disk and set three of them to 0. Is 
that really what you meant, or should you be preserving the previous values?

Here is some (untested) code which might help you:

import shelve

def push_to_disc(data, filename):
    database = shelve.open(filename)
    try:
        for key in data:
            if database.has_key(key):
                count, a, b, expected = database[key]
                database[key] = count + data[key], a, b, expected
            else:
                database[key] = data[key], 0, 0, 0
    finally:
        database.close()

    data.clear()

Call that once for each input dictionary and your data will be written out 
to a disc file and the internal dictionary cleared without any great spike 
of memory use.
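
A possible driver loop for the function above, assuming the temporary
dictionaries live in a list alongside matching file names (both names here
are hypothetical, not from the original post):

tmpdicts = [dict() for _ in range(20)]
filenames = ['diskdict%d.db' % (i + 1) for i in range(20)]

for data, filename in zip(tmpdicts, filenames):
    if data:                          # skip empty chunks
        push_to_disc(data, filename)  # writes the chunk, then clears it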
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-17 Thread Martin MOKREJ
Duncan Booth wrote:
Martin MOKREJ wrote:

Hi,
 could someone tell me what does and what doesn't copy
references in Python? I have found that my script, after reaching some
state and taking say 600MB, pushes its internal dictionaries
to hard disk. The for loop consumes another 300MB (as gathered
by vmstat) to push the data to the dictionaries, then releases
a little less than 300MB, and the program starts to fill up
its internal dictionaries again; when they are full it will do the
flush again ...

Almost anything you do copies references.

But what does this do?:
x = 'x'
a = x[2:]
b = z + len(x)
dict[a] = b

 The point here is that this code takes a lot of extra memory.
I believe it's the references problem, and I remember complaints
from friends facing the same problem. I'm a newbie, yes, but I don't
have this problem with Perl. OK, I want to improve my Python
knowledge ... :-))


long code extract snipped
  The above routine doesn't release the memory back when it
exits. 
That's probably because there isn't any memory it can reasonably be 
expected to release. What memory would *you* expect it to release?
Those 300MB get allocated/reserved when the posted loop gets
executed. When the loop exits, almost all of it is returned/deallocated.
Yes, almost. :(
The member variables are all still accessible as member variables until you 
run your loop at the end to clear them, so no way could Python release 
them.
OK, I wanted to know if there's some assignment using a reference
that makes the garbage collector unable to recycle the memory,
for example because the target dictionary still keeps a reference to the
temporary dictionary.
Some hints:
When posting code, try to post complete examples which actually work. I 
don't know what type the self._dict_on_diskXX variables are supposed to be. 
It makes a big difference if they are dictionaries (so you are trying to 
hold everything in memory at one time) or shelve.Shelf objects which would 
store the values on disc in a reasonably efficient manner.
The self._dict_on_diskXX are bsddb files, self._tmpdictXX are builtin 
dictionaries.
Even if they are Shelf objects, I see no reason here why you have to 
I gathered from previous discussion it's faster to use bsddb directly,
so no shelve.
process everything at once. Write a simple function which processes one 
tmpdict object into one dict_on_disk object and then closes the
That's what I do, but in the for loop ...
dict_on_disk object. If you want to compare results later then do that by
OK, I got your point.
reopening the dict_on_disk objects when you have deleted all the tmpdicts.
That's what I do (not shown).
Extract out everything you want to do into a class which has at most one 
tmpdict and one dict_on_disk That way your code will be a lot easier to 
read.

Make your code more legible by using fewer underscores.
What on earth is the point of an explicit call to __add__? If Guido had 
meant us to use __add__ he wouldn't have created '+'.
To invoke addition directly on the object. It's faster than letting
Python figure out that I sum up int() plus int(). It definitely
helped a lot when using Decimal(a) + Decimal(b), where I got rid
of thousands of calls to Decimal.__new__, __init__ and I think a few other
methods of decimal as well - I think getcontext too. 
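
Whether the explicit __add__ call actually wins can be checked with the
timeit module; a small, hedged sketch (numbers vary by machine and Python
version, and this is not from the original thread):

import timeit

plus   = timeit.timeit('a + b',        setup='a, b = 3, 4', number=10**6)
dunder = timeit.timeit('a.__add__(b)', setup='a, b = 3, 4', number=10**6)
print('a + b       %.3fs' % plus)    # operator dispatch
print('a.__add__   %.3fs' % dunder)  # explicit attribute lookup + call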

What is the purpose of dict_on_disk? Is it for humans to read the data? If 
not, then don't store everything as a string. Far better to just store a 
For humans it is processed later.
tuple of your values then you don't have to use split or cast the strings 
bsddb complains at me that I can store only a string as a key or value.
I'd love to store a tuple.
import bsddb
_words1 = bsddb.btopen('db1.db', 'c')
_words1['a'] = 1
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.3/bsddb/__init__.py", line 120, in __setitem__
    self.db[key] = value
TypeError: Key and Data values must be of type string or None.

How can I record a number then?
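
One hedged way around the strings-only restriction of the old Python 2.x
bsddb module is to serialize the value yourself: str()/int() for a single
counter, or marshal for a small tuple. A sketch (not from the thread; db
stands for any bsddb btree object):

import marshal

def store_count(db, word, count):
    # bsddb accepts only string keys/values, so turn the number into one.
    db[word] = str(count)

def load_count(db, word):
    return int(db[word])

def store_tuple(db, word, values):
    # marshal round-trips a tuple of ints/strings through a byte string.
    db[word] = marshal.dumps(values)

def load_tuple(db, word):
    return marshal.loads(db[word])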

to integers. If you do want humans to read some final output then produce 
that separately from the working data files.

You split out 4 values from dict_on_disk and set three of them to 0. Is 
that really what you meant, or should you be preserving the previous values?
No, overwrite them, i.e. invalidate them. Originally I recorded only the first,
but computing the latter numbers is so expensive that I have to store them.
As walking through the dictionaries is so slow, I gave up on the idea of
storing just one and, much later in the program, walking once again through the
dictionary to 'fix' it by computing the missing values.
Here is some (untested) code which might help you:
import shelve
Why shelve? To have the ability to record a tuple? Isn't it cheaper
to convert to a string and back and write to bsddb, compared to this overhead?
def push_to_disc(data, filename):
    database = shelve.open(filename)
    try:
        for key in data:
            if database.has_key(key):
                count, a, b, expected = database[key]

Re: Writing huge Sets() to disk

2005-01-17 Thread Martin MOKREJ
Steve Holden wrote:
Martin MOKREJ wrote:
Hi,
 could someone tell me what does and what doesn't copy
references in Python? I have found that my script, after reaching some
state and taking say 600MB, pushes its internal dictionaries
to hard disk. The for loop consumes another 300MB (as gathered
by vmstat) to push the data to the dictionaries, then releases
a little less than 300MB, and the program starts to fill up
its internal dictionaries again; when they are full it will do the
flush again ...
 The point here is that this code takes a lot of extra memory.
I believe it's the references problem, and I remember complaints
from friends facing the same problem. I'm a newbie, yes, but I don't
have this problem with Perl. OK, I want to improve my Python
knowledge ... :-))
Right ho! In fact I suspect you are still quite new to programming as a 
whole, for reasons that may become clear as we proceed.


def push_to_disk(self):
    _dict_on_disk_tuple = (None, self._dict_on_disk1, self._dict_on_disk2,
        self._dict_on_disk3, self._dict_on_disk4, self._dict_on_disk5,
        self._dict_on_disk6, self._dict_on_disk7, self._dict_on_disk8,
        self._dict_on_disk9, self._dict_on_disk10, self._dict_on_disk11,
        self._dict_on_disk12, self._dict_on_disk13, self._dict_on_disk14,
        self._dict_on_disk15, self._dict_on_disk16, self._dict_on_disk17,
        self._dict_on_disk18, self._dict_on_disk19, self._dict_on_disk20)
The None above is there just so I didn't have to evaluate something
like x+1 all the time:
for x in _dict_on_disk_tuple:
   key = 
   newdict[x+1][key] = ...
This None doesn't hurt me; then x contains exactly the value I need several
times in the loop ...

It's a bit unfortunate that all those instance variables are global to 
the method, as it means we can't clearly see what you intend them to do. 
However ...

Whenever I see such code, it makes me suspect that the approach to the 
problem could be more subtle. It appears you have decided to partition 
your data into twenty chunks somehow. The algorithm is clearly not coded 
in a way that would make it easy to modify the number of chunks.
No, it's not, but that's not the speed problem, really. ;)
[Hint: by easy I mean modifying a statement that reads
chunks = 20
to read
chunks = 40
for example]. To avoid this, we might use (say) a list of temp dicts, 
for example (the length of this could easily then be parameterized as 
mentioned). So where (my psychic powers tell me) your __init__() method 
currently contains

self._dict_on_disk1 = something()
self._dict_on_disk2 = something()
...
self._dict_on_disk20 = something()
Almost. They are bsddb dictionary files.
I would have written

self._disk_dicts = []
for i in range(20):
    self._disk_dicts.append(something)
Then again, I probably have an advantage over you. I'm such a crappy 
typist I can guarantee I'd make at least six mistakes doing it your way :-)
Originally I had this; I just wanted to get rid of one small list. ;)
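
For what it's worth, the list-based layout Steve suggests would also let
push_to_disk collapse to one loop. A minimal sketch, assuming parallel
_tmpdicts/_disk_dicts lists built in __init__ (these attribute names are
hypothetical, not from the posted code):

def push_to_disk(self):
    for tmpdict, diskdict in zip(self._tmpdicts, self._disk_dicts):
        if not tmpdict:
            continue
        for word, value in tmpdict.iteritems():
            try:
                # first field of the stored record is the absolute count
                abs_count = int(diskdict[word].split()[0]) + value
            except KeyError:
                abs_count = value
            diskdict[word] = ' '.join((str(abs_count), '0', '0', '0'))
        tmpdict.clear()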

   _size = 0

What with all these leading underscores I presume it must be VERY 
important to keep these object's instance variables private. Do you have 
a particular reason for that, or just general Perl-induced paranoia? :-)
See below. ;)

    #
    # sizes of these tmpdicts range from 10-1 entries for each!
    for _tmpdict in (self._tmpdict1, self._tmpdict2, self._tmpdict3,
        self._tmpdict4, self._tmpdict5, self._tmpdict6, self._tmpdict7,
        self._tmpdict8, self._tmpdict9, self._tmpdict10, self._tmpdict11,
        self._tmpdict12, self._tmpdict13, self._tmpdict14, self._tmpdict15,
        self._tmpdict16, self._tmpdict17, self._tmpdict18, self._tmpdict19,
        self._tmpdict20):
        _size += 1
        if _tmpdict:
            _dict_on_disk = _dict_on_disk_tuple[_size]
            for _word, _value in _tmpdict.iteritems():
                try:
                    _string = _dict_on_disk[_word]
                    # I discard _a and _b, maybe _string.find(' ')
                    # combined with a slice would do better?
                    _abs_count, _a, _b, _expected_freq = _string.split()
                    _abs_count = int(_abs_count).__add__(_value)
                    _t = (str(_abs_count), '0', '0', '0')
                except KeyError:
                    _t = (str(_value), '0', '0', '0')
                # this writes a copy to the dict, right?
                _dict_on_disk[_word] = ' '.join(_t)
    #
    # clear the temporary dictionaries in ourself
    # I think this works as expected and really does release memory
    #
    for _tmpdict in (self._tmpdict1, self._tmpdict2, self._tmpdict3,
        self._tmpdict4, self._tmpdict5, self._tmpdict6, self._tmpdict7,
        self._tmpdict8, self._tmpdict9, self._tmpdict10, self._tmpdict11,
        self._tmpdict12, self._tmpdict13, self._tmpdict14, self._tmpdict15,
        self._tmpdict16, self._tmpdict17, self._tmpdict18, self._tmpdict19,
        self._tmpdict20):
        _tmpdict.clear()

There you go again with that huge tuple. You 

Re: Writing huge Sets() to disk

2005-01-14 Thread Martin MOKREJ
Tim Peters wrote:
[Martin MOKREJ]
...
I gave up the theoretical approach. Practically, I might need
to store maybe those 1E15 keys.

We should work on our multiplication skills here <wink>.  You don't
have enough disk space to store 1E15 keys.  If your keys were just one
byte each, you would need to have 4 thousand disks of 250GB each to
store 1E15 keys.  How much disk space do you actually have?  I'm
betting you have no more than one 250GB disk.
...
[Istvan Albert]
On my system storing 1 million words of length 15
as keys of a python dictionary is around 75MB.

Fine, that's what I wanted to hear. How do you improve the algorithm?
Do you delay indexing to the very latest moment or do you let your
computer index 999 999 times just for fun?

It remains wholly unclear to me what the algorithm you want might
be.  As I mentioned before, if you store keys in sorted text files,
you can do intersection and difference very efficiently just by using
the Unix `comm` utility.
This comm(1) approach doesn't work for me. It somehow fails to detect
common entries when the offset is too big.
file 1:
A
F
G
I
K
M
N
R
V
AA
AI
FG
FR
GF
GI
GR
IG
IK
IN
IV
KI
MA
NG
RA
RI
VF
AIK
FGR
FRA
GFG
GIN
GRI
IGI
IGR
IKI
ING
IVF
KIG
MAI
NGF
RAA
RIG
file 2:
W
W
W
W
W
W
W
W
W
W
AA
AI
FG
FR
GF
GI
GR
IG
IK
IN
IV
KI
MA
NG
RA
RI
VF
A
A
A
A
A
A
A
A
A
A
A
A
--
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-14 Thread Tim Peters
[Martin MOKREJ]
 This comm(1) approach doesn't work for me. It somehow fails to
 detect common entries when the offset is too big.

 file 1:

 A
 F
 G
 I
 K
 M
 N
 R
 V
 AA
 AI
 FG
 FR
 GF
 GI
 GR
 IG
 IK
 IN
 IV
 KI
 MA
 NG
 RA
 RI
 VF
 AIK
 FGR
 FRA
 GFG
 GIN
 GRI
 IGI
 IGR
 IKI
 ING
 IVF
 KIG
 MAI
 NGF
 RAA
 RIG
 
 file 2:
 
 W
 W
 W
 W
 W
 W
 W
 W
 W
 W
 AA
 AI
 FG
 FR
 GF
 GI
 GR
 IG
 IK
 IN
 IV
 KI
 MA
 NG
 RA
 RI
 VF
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A

I'll repeat:

 As I mentioned before, if you store keys in sorted text files ...

Those files aren't in sorted order, so of course `comm` can't do
anything useful with them.  Do `man sort`; sorting is not optional
here.
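
For completeness, the sorting can also be done from Python before handing
the files to comm; a small sketch (the file names are placeholders):

# Sort each word-per-line file in place; comm requires sorted input.
for filename in ('file1.txt', 'file2.txt'):
    f = open(filename)
    words = sorted(set(f))   # sort and drop duplicate lines
    f.close()
    out = open(filename, 'w')
    out.writelines(words)
    out.close()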
--
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-14 Thread Martin MOKREJ
Tim Peters wrote:
[Martin MOKREJ]
This comm(1) approach doesn't work for me. It somehow fails to
detect common entries when the offset is too big.
[...]
I'll repeat:

As I mentioned before, if you store keys in sorted text files ...

Those files aren't in sorted order, so of course `comm` can't do
anything useful with them.  Do `man sort`; sorting is not optional
here.
I did read the manpage, but somehow it seems I did not execute sort(1)
from within my Python code, so it was unsorted and I did not realize
it yet. Thanks!
m.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-11 Thread Bengt Richter
On Mon, 10 Jan 2005 17:11:09 +0100, Martin MOKREJŠ [EMAIL PROTECTED] wrote:

Hi,
  I have sets.Set() objects having up to 20E20 items,
What notation are you using when you write 20E20?
IOW, ISTM 1E9 is a billion. So 20E20 would be 2000 billion billion.
Please clarify ;-)

each is composed of up to 20 characters. Keeping
them in memory on a 1GB machine puts me quickly into swap.
I don't want to use the dictionary approach, as I don't see a sense
in storing None as a value. The items in a set are unique.

  How can I write them efficiently to disk? To be more exact,
I have 20 sets. _set1 has 1E20 keys of size 1 character.

alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q', 'F',
            'Y', 'W', 'K', 'R', 'H', 'D', 'E')
for aa1 in alphabet:
    # l = [aa1]
    #_set1.add(aa1)
    for aa2 in alphabet:
        # l.append(aa2)
        #_set2.add(''.join(l))
[cut]

  The reason I went for sets instead of lists is the speed,
availability of unique, common and other methods.
What would you propose as an elegant solution?
Actually, even those nested for loops take ages. :(

If you will explain a little what you are doing with these set items
perhaps someone will think of another way to represent and use your data.

Regards,
Bengt Richter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-10 Thread Scott David Daniels
Martin MOKREJ wrote:
 But I don't think I can use one-way hashes, as I need to reconstruct
the string later. I have to study hard to get an idea what the proposed 
 code really does.
Scott David Daniels wrote:
Tim Peters wrote:
 Call the set of all English words E; G, C, and P similarly.
This Python expression then gives the set of words common to all 4:
E & G & C & P
and for those unique to Polish.
P - E - G - C
One attack is to define a hash to get sizes smaller.  Suppose you find
your word sets are 10**8 large, and you find you need to deal with sets
of size 10**6.  Then if you define a hash that produces an integer below
100, you'll find:
E & G & C & P == Union(E_k & G_k & C_k & P_k) for k in range(100)
P - E - G - C == Union(P_k - E_k - G_k - C_k) for k in range(100)
where X_k = [v for v in X if somehash(v) == k]
I don't understand here what would be the result from the hashing 
function. :(  ... Can you clarify this please in more detail?
The trick is to build the X_k values into files.
For example:
for base in range(0, 100, 10):
    # work in blocks of 10 (or whatever)
    for language in ('English', 'German', 'Czech', 'Polish'):
        source = open(language)
        dests = [open('%s_%s.txt' % (language.lower(), base + n), 'w')
                 for n in range(10)]
        for word in source:  # Actually this is probably more involved
            code = somehash(word) - base
            if 0 <= code < base:
                dests[code].write(word + '\n')
        for dest in dests:
            dest.close()
        source.close()
After running the above code, you get 400 files with names like, for
example, 'english_33.txt'.  'english_33.txt' contains only words in
English which hash to 33.  Then you need to sort the different files
(like 'english_33.txt') before using them in the next phase.  If you
can sort and dupe-elim in the same step, do that and get rid of the
calls to dupelim below.  Otherwise use dupelim below when reading the
sorted files.  If you must, you might want to do the sort and
dupe-elim in Python:
for language in ('english', 'german', 'czech', 'polish'):
    for hashcode in range(100):
        filename = '%s_%s.txt' % (language, hashcode)
        source = open(filename)
        lines = sorted(source)
        source.close()
        dest = open(filename, 'w')
        for line in dupelim(lines):
            dest.write(line)
        dest.close()
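
Note: somehash in the first block above is left undefined in the post; any
function mapping a word to an integer below 100 would do. A possible
stand-in (purely an assumption, not Scott's code):

def somehash(word):
    # Bucket a word into range(100).  A plain character sum is reproducible
    # across runs, which matters if the partitioned files are reused later;
    # hash(word) % 100 would also work but can differ between Python builds.
    return sum([ord(c) for c in word]) % 100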
 def inboth(left, right): 
 def leftonly(left, right): 
Aaaah, so the functions just walk one line in left, one line in right; 
if the values don't match, the value in left is unique, and it walks again one
line in left and checks whether it now matches the value in the right file
at the last position, and so on until it finds the same value in the right file?
Exactly, you must make sure you deal with cases where you pass values,
match values, and run out of source -- that keeps these from being four-
liners.
 For example:

 Ponly = open('polishonly.txt', 'w')
 every = open('every.txt', 'w')
 for hashcode in range(100):
     English = open('english_%s.txt' % hashcode)
 ^^ this is some kind of eval?
If hashcode == 33 then 'english_%s.txt' % hashcode == 'english_33.txt'
So we are just finding the language-specific for-this-hash source.
...
     for unmatched in leftonly(leftonly(leftonly(dupelim(Polish),
             dupelim(English)), dupelim(German)), dupelim(Czech)):
         Ponly.write(unmatched)
     for source in (Polish, English, German, Czech):  # (paraphrased)
         source.seek(0)
     for match in inboth(inboth(dupelim(Polish), dupelim(English)),
             inboth(dupelim(German), dupelim(Czech))):
         every.write(match)
     for source in (Polish, English, German, Czech):  # (paraphrased)
         source.close()
 Ponly.close()
 every.close()
--Scott David Daniels
[EMAIL PROTECTED]
--
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-10 Thread John Lenton
On Tue, Jan 11, 2005 at 12:33:42AM +0200, Simo Melenius wrote:
 John Lenton [EMAIL PROTECTED] writes:
 
  you probably want to look into building set-like objects on top of
  tries, given the homogeneity of your language. You should see
  improvements both in size and speed.
 
 Ternary search trees give _much_ better space-efficiency compared to
 tries, at the expense of only slightly slower build and search time.
 This is especially essential as the OP mentioned he could have huge
 sets of data.

hmm! sounds like *some*body needs to go read up on ternary trees,
then.

Ok, ok, I'm going...

-- 
John Lenton ([EMAIL PROTECTED]) -- Random fortune:
Fortune finishes the great quotations, #6

But, soft!  What light through yonder window breaks?
It's nothing, honey.  Go back to sleep.


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Writing huge Sets() to disk

2005-01-10 Thread Paul Rubin
Martin MOKREJŠ [EMAIL PROTECTED] writes:
   I have sets.Set() objects having up to 20E20 items,
   just imagine, you want to compare how many words are in English, German,
 Czech, Polish dictionary. You collect words from every language and record
 them in dict or Set, as you wish.
 
   Once you have those Set's or dict's for those 4 languages, you ask
 for common words and for those unique to Polish. I have no estimates
 of real-world numbers, but we might be in range of 1E6 or 1E8?
 I believe in any case, huge.

They'll be less than 1e6 and so not huge by the standards of today's
computers.  You could use ordinary dicts or sets.

1e20 is another matter.  I doubt that there are any computers in the
world with that much storage.  How big is your dataset REALLY?

   I wanted to be able to get a list of words NOT found in say Polish,
 and therefore wanted to have a list of all, theoretically existing words.
 In principle, I can drop this idea of having ideal, theoretical lexicon.
 But have to store those real-world dictionaries anyway to hard drive.

One way you could do it is by dumping all the words sequentially to
disk, then sorting the resulting disk file using an O(n log n) offline
algorithm.
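
A minimal sketch of that dump-then-sort idea in Python, using sorted runs
plus a k-way merge (chunk size, helper names and temporary files are my own
assumptions, not from the post):

import heapq, os, tempfile

def _write_run(chunk):
    # Sort one in-memory chunk and write it out as a temporary run file.
    chunk.sort()
    fd, path = tempfile.mkstemp(text=True)
    f = os.fdopen(fd, 'w')
    f.writelines(chunk)
    f.close()
    return path

def external_sort(infile, outfile, chunk_size=1000000):
    # Sort a huge word-per-line file with bounded memory: sort fixed-size
    # chunks, write each as a run, then merge the runs.
    runs, chunk = [], []
    for line in open(infile):
        chunk.append(line)
        if len(chunk) >= chunk_size:
            runs.append(_write_run(chunk))
            chunk = []
    if chunk:
        runs.append(_write_run(chunk))
    out = open(outfile, 'w')
    for line in heapq.merge(*[open(r) for r in runs]):
        out.write(line)
    out.close()
    for r in runs:
        os.remove(r)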

Basically data sets of this size are outside of what you can easily
handle with builtin Python operations without putting some thought
into algorithms and data structures.  From ribosome I'm guessing
you're doing computational biology.  If you're going to be writing
code for these kinds of problems on a regular basis, you probably
ought to read a book or two on the topic.  CLRS is a good choice:

  http://theory.lcs.mit.edu/~clr/
  http://mitpress.mit.edu/algorithms/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-10 Thread Paul Rubin
Martin MOKREJŠ [EMAIL PROTECTED] writes:
 Yes, I'm. I still don't get what that acronym CLRS stands for ... :(

CLRS = the names of the authors, Cormen, Leiserson, Rivest, and Stein,
if I spelled those correctly.  :)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-10 Thread Tim Peters
[Istvan Albert] 
 #- I think that you need to first understand how dictionaries work. 
 #- The time needed to insert a key is independent of 
 #- the number of values in the dictionary. 

[Batista, Facundo]
 Are you sure? 
 
 I think that is true while the hashes don't collide. If you have collisions,
 time starts to depend on element quantity. But I'm not sure

 Tim sure can enlighten us. 

I can only point the way to enlightenment -- you have to contemplate
your own navel (or Guido's, but he's kind of shy that way).

What Istvan Albert said is close enough in context.  The *expected*
(mean) number of collisions in a Python dict probe is less than 1,
regardless of dict size.  That assumes the collision distribution is
no worse than random.  It's possible that all dict keys hash to the
same table slot, and then each insertion is O(len(dict)).  It's
possible to contrive such cases even if the hash function is a good
one.  But it's not going to happen by accident (and, when it does
<wink>, open a bug report -- we'll improve the key type's hash
function then).
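
A quick way to see the point empirically (a rough sketch; absolute numbers
depend on the machine and Python version):

import timeit

# Time inserting/updating one key in dicts of very different sizes; the
# per-operation cost should stay roughly flat when collisions are rare.
for size in (10**3, 10**5, 10**6):
    setup = 'd = dict.fromkeys(range(%d))' % size
    t = timeit.timeit('d[-1] = None', setup=setup, number=10**6)
    print('%8d keys: %.3fs' % (size, t))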
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-10 Thread Martin MOKREJ
Batista, Facundo wrote:
[Martin MOKREJ?]
#-   I have sets.Set() objects having up to 20E20 items,
#- each is composed of up to 20 characters. Keeping
Are you really sure??
Either I'll have to construct them all over again say
20-30 times, or I'll find a way to keep them on disk.

#-   How can I write them efficiently to disk? To be more exact,
I think that there's some mistake here.
At least you'll need a disk of 34694 EXABYTES!!!
Hmm, you are right. So 20E15 then? I definitely need to be
in range 1-14. ;-)
Anyway they are large.
M.
--
http://mail.python.org/mailman/listinfo/python-list


RE: Writing huge Sets() to disk

2005-01-10 Thread Batista, Facundo
[Martin MOKREJ]


#-  At least you'll need a disk of 34694 EXABYTES!!!
#- 
#- Hmm, you are right. So 20E15 then? I definitely need to be


Right. Now you only need 355 PETABytes. Nowadays disk is cheap, but...



#- in range 1-14. ;-)


Why?



. Facundo


Bitcora De Vuelo: http://www.taniquetil.com.ar/plog
PyAr - Python Argentina: http://pyar.decode.com.ar/






-- 
http://mail.python.org/mailman/listinfo/python-list

RE: Writing huge Sets() to disk

2005-01-10 Thread Batista, Facundo
[Martin MOKREJ?]


#- I have sets.Set() objects having up to 20E20 items,
#- each is composed of up to 20 characters. Keeping


Are you really sure??



#- How can I write them efficiently to disk? To be more exact,


I think that there's some mistake here.


At least you'll need a disk of 34694 EXABYTES!!!



. Facundo


Bitcora De Vuelo: http://www.taniquetil.com.ar/plog
PyAr - Python Argentina: http://pyar.decode.com.ar/





-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Writing huge Sets() to disk

2005-01-10 Thread Martin MOKREJ
Batista, Facundo wrote:
[Martin MOKREJ]
#-  At least you'll need a disk of 34694 EXABYTES!!!
#- 
#- Hmm, you are right. So 20E15 then? I definitely need to be

Right. Now you only need 355 PETABytes. Nowadays disk is cheap, but...
#- in range 1-14. ;-)
Why?
I need to test for the occurrence of every such combination in some real-world
examples. Many, many will be missing, and those I have to keep.
I've no clue how big the drop will be, so I rather expect full size.
So, the generated, theoretical lexicon I could generate on the fly,
but the results I have to keep.
Even if I give up on this, can I write Sets onto a disk while keeping
their nice, built-in methods?
M.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-10 Thread Paul McGuire
Martin MOKREJŠ [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
 Hi,
   I have sets.Set() objects having up to 20E20 items,
 each is composed of up to 20 characters. Keeping
 them in memory on a 1GB machine puts me quickly into swap.
 I don't want to use the dictionary approach, as I don't see a sense
 in storing None as a value. The items in a set are unique.

   How can I write them efficiently to disk? To be more exact,
 I have 20 sets. _set1 has 1E20 keys of size 1 character.

 alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q',
'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
 for aa1 in alphabet:
     # l = [aa1]
     #_set1.add(aa1)
     for aa2 in alphabet:
         # l.append(aa2)
         #_set2.add(''.join(l))
 [cut]

   The reason I went for sets instead of lists is the speed,
 availability of unique, common and other methods.
 What would you propose as an elegant solution?
 Actually, even those nested for loops take ages. :(
 M.

_set1 only has 19 keys of size 1 character - 'A' is duplicated.

Assuming you replace 'A' with another character (say 'Z'), then here is what
you get:
_set1 - 20 elements (20**1)
_set2 - 400 elements (20**2)
_set3 - 8000 elements (20**3)
...
_set20 - 20**20 ~ 10 ^ (1.301*20) or 1E26

If you have no duplicates in your alphabet, then you won't need to use sets;
every combination will be unique.  In this case, just use a generator.

Here's a recursive generator approach that may save you a bunch of nested
editing (set maxDepth as high as you dare):

alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'Z', 'Q', 'F',
            'Y', 'W', 'K', 'R', 'H', 'D', 'E')

maxDepth = 3

def genNextCombo(root=list(), depth=1):
    for a in alphabet:
        thisRoot = root + [a]
        yield ''.join(thisRoot)
        if depth < maxDepth:
            for c in genNextCombo(thisRoot, depth+1):
                yield c

for c in genNextCombo():
    print c  # or write to file, or whatever
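
In later Python versions the same enumeration can also be written with
itertools.product (which did not exist at the time of this thread); a
sketch that yields the same strings, though in a different order:

import itertools

def all_combos(alphabet, max_depth):
    # Every string of length 1..max_depth over the alphabet.
    for depth in range(1, max_depth + 1):
        for combo in itertools.product(alphabet, repeat=depth):
            yield ''.join(combo)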



-- Paul


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-10 Thread Martin MOKREJ
Robert Brewer wrote:
Martin MOKREJ wrote:
 I have sets.Set() objects having up to 20E20 items,
each is composed of up to 20 characters. Keeping
them in memory on a 1GB machine puts me quickly into swap.
I don't want to use the dictionary approach, as I don't see a sense
in storing None as a value. The items in a set are unique.
 How can I write them efficiently to disk?

got shelve*?
I know about shelve, but doesn't it work like a dictionary?
Why should I use shelve for this? Then it's faster to use
bsddb directly and use a string as a key and None as a value, I'd guess.
Even for that, note that even for the data contained in _set11,
the index should be (could be) optimized for key size 11.
There are no other record sizes.
Similarly, _set15 has all keys of size 15. In the docs of the bsddb, anydbm
and other modules, I don't see how to optimize for that. Without
this optimization, I think it would be even slower. And shelve
gives me exactly such an unoptimized, general index on a dictionary.
Maybe I'm wrong, I'm just a beginner here.
Thanks
M.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-10 Thread Martin MOKREJ
Paul McGuire wrote:
Martin MOKREJ [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
Hi,
 I have sets.Set() objects having up to 20E20 items,
each is composed of up to 20 characters. Keeping
them in memory on a 1GB machine puts me quickly into swap.
I don't want to use the dictionary approach, as I don't see a sense
in storing None as a value. The items in a set are unique.
 How can I write them efficiently to disk? To be more exact,
I have 20 sets. _set1 has 1E20 keys of size 1 character.
alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q',
'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
for aa1 in alphabet:
    # l = [aa1]
    #_set1.add(aa1)
    for aa2 in alphabet:
        # l.append(aa2)
        #_set2.add(''.join(l))
[cut]
 The reason I went for sets instead of lists is the speed,
availability of unique, common and other methods.
What would you propose as an elegant solution?
Actually, even those nested for loops take ages. :(
M.

_set1 only has 19 keys of size 1 character - 'A' is duplicated.
Ahh, you are right. That's a typo, yet I have to figure out which char
I have to use. But 'Z' will do for the tests anyway.
Assuming you replace 'A' with another character (say 'Z'), then here is what
you get:
_set1 - 20 elements (20**1)
_set2 - 400 elements (20**2)
_set3 - 8000 elements (20**3)
...
_set20 - 20**20 ~ 10 ^ (1.301*20) or 1E26
If you have no duplicates in your alphabet, then you won't need to use sets;
every combination will be unique.  In this case, just use a generator.
As I noted in a later response, actually I will compare these ideal sets to
some real-world examples. I have no expectation of how many entries will be
common and how many unique. The only thing I know is that even in the real
world, I might be in size ranges of 2E6 or perhaps 2E8?
Here's a recursive generator approach that may save you a bunch of nested
editing (set maxDepth as high as you dare):
alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'Z', 'Q', 'F',
            'Y', 'W', 'K', 'R', 'H', 'D', 'E')
maxDepth = 3
def genNextCombo(root=list(), depth=1):
    for a in alphabet:
        thisRoot = root + [a]
        yield ''.join(thisRoot)
        if depth < maxDepth:
            for c in genNextCombo(thisRoot, depth+1):
                yield c
for c in genNextCombo():
    print c  # or write to file, or whatever
Works nicely, but http://www.python.org/doc/2.3.4/ref/yield.html
doesn't really tell what happens here. :( But it is quite fast and
definitely saves those stupid recursively hand-written for
loops.
Thanks Paul!
M.
--
http://mail.python.org/mailman/listinfo/python-list


RE: Writing huge Sets() to disk

2005-01-10 Thread Robert Brewer
Martin MOKREJ wrote:
 Robert Brewer wrote:
  Martin MOKREJ wrote:
  
   I have sets.Set() objects having up to 20E20 items,
 each is composed of up to 20 characters. Keeping
 them in memory on a 1GB machine puts me quickly into swap.
 I don't want to use the dictionary approach, as I don't see a sense
 in storing None as a value. The items in a set are unique.
 
   How can I write them efficiently to disk?
  
  
  got shelve*?
 
 I know about shelve, but doesn't it work like a dictionary?
 Why should I use shelve for this? Then it's faster to use
 bsddb directly and use string as a key and None as a value, I'd guess.

If you're using Python 2.3, then a sets.Set *is* implemented with a dictionary, 
with None values. It simply has some extra methods to make it behave like a 
set. In addition, the Set class already has builtin methods for pickling and 
unpickling.

So it's probably faster to use bsddb directly, but why not find out by trying 2 
lines of code that uses shelve? The time-consuming part of your quest is 
writing the timed test suite that will indicate which route will be fastest, 
which you'll have to do regardless.
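
The "2 lines of code" could look roughly like this (a sketch assuming the
Python 2.3-era sets module discussed in the thread, and an arbitrary file
name):

import shelve
from sets import Set

db = shelve.open('words11.db')            # hypothetical file name
db['words_len11'] = Set(['ABCDEFGHIJK'])  # shelve pickles the Set for us
restored = db['words_len11']
db.close()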


Robert Brewer
MIS
Amor Ministries
[EMAIL PROTECTED]
--
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-10 Thread Adam DePrince
On Mon, 2005-01-10 at 11:11, Martin MOKREJŠ wrote:
 Hi,
   I have sets.Set() objects having up to 20E20 items,
 each is composed of up to 20 characters. Keeping
 them in memory on a 1GB machine puts me quickly into swap.
 I don't want to use the dictionary approach, as I don't see a sense
 in storing None as a value. The items in a set are unique.

Let's be realistic.  Your house is on fire and you are remodeling the
basement.

Assuming you are on a 64 bit machine with full 64 bit addressing, your
absolute upper limit on the size of a set is 2^64, or
18446744073709551616 bytes.  Your real upper limit is at least an order
of magnitude smaller.

You are asking us how to store 20E20, or 2,000,000,000,000,000,000,000, items
in a Set.  That is still an order of magnitude greater than the number
of *bits* you can address.  Your desktop might not be able to enumerate
all of these strings in your lifetime, much less index and store them.

We might as well be discussing the number of angels that can sit on the
head of a pin.  Any discussion of a list vs Set/dict is a small micro
optimization matter dwarfed by the fact that there don't exist machines
to hold this data.  The consideration of Set vs. dict is an even less
important matter of syntactic sugar.

To me, it sounds like you are taking an AI class and trying to deal with
a small search space by brute force.  First, stop banging your head
against the wall algorithmically.  Nobody lost their job for saying NP
!= P.  Then tell us what you are trying to do; perhaps there is a better
way, perhaps the problem is unsolvable and there is a heuristic that
will satisfy your needs. 



Adam DePrince 


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-10 Thread Martin MOKREJ
Adam DePrince wrote:
On Mon, 2005-01-10 at 11:11, Martin MOKREJŠ wrote:
Hi,
 I have sets.Set() objects having up to 20E20 items,
each is composed of up to 20 characters. Keeping
them in memory on a 1GB machine puts me quickly into swap.
I don't want to use the dictionary approach, as I don't see a sense
in storing None as a value. The items in a set are unique.

Let's be realistic.  Your house is on fire and you are remodeling the
basement.
Assuming you are on a 64 bit machine with full 64 bit addressing, your
absolute upper limit on the size of a set is 2^64, or
18446744073709551616 bytes.  Your real upper limit is at least an order
of magnitude smaller.
You are asking us how to store 20E20, or 2,000,000,000,000,000,000,000, items
in a Set.  That is still an order of magnitude greater than the number
of *bits* you can address.  Your desktop might not be able to enumerate
all of these strings in your lifetime, much less index and store them.
We might as well be discussing the number of angels that can sit on the
head of a pin.  Any discussion of a list vs Set/dict is a small micro
optimization matter dwarfed by the fact that there don't exist machines
to hold this data.  The consideration of Set vs. dict is an even less
important matter of syntactic sugar.
To me, it sounds like you are taking an AI class and trying to deal with
a small search space by brute force.  First, stop banging your head
against the wall algorithmically.  Nobody lost their job for saying NP
!= P.  Then tell us what you are trying to do; perhaps there is a better
way, perhaps the problem is unsolvable and there is a heuristic that
will satisfy your needs. 
Hi Adam,
 just imagine, you want to compare how many words are in English, German,
Czech, Polish dictionary. You collect words from every language and record
them in dict or Set, as you wish.
 Once you have those Set's or dict's for those 4 languages, you ask
for common words and for those unique to Polish. I have no estimates
of real-world numbers, but we might be in range of 1E6 or 1E8?
I believe in any case, huge.
 My concern is actually purely scientific, not really related to analysis
of these 4 languages, but I believe it describes my intent quite well.
 I wanted to be able to get a list of words NOT found in say Polish,
and therefore wanted to have a list of all, theoretically existing words.
In principle, I can drop this idea of having ideal, theoretical lexicon.
But have to store those real-world dictionaries anyway to hard drive.
Martin
--
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-10 Thread Tim Peters
[Martin MOKREJ]
  just imagine, you want to compare how many words are in English, German,
 Czech, Polish dictionary. You collect words from every language and record
 them in dict or Set, as you wish.

Call the set of all English words E; G, C, and P similarly.

  Once you have those Set's or dict's for those 4 languages, you ask
 for common words

This Python expression then gives the set of words common to all 4:

E & G & C & P

 and for those unique to Polish.

P - E - G - C

is a reasonably efficient way to compute that.

 I have no estimates
 of real-world numbers, but we might be in range of 1E6 or 1E8?
 I believe in any case, huge.

No matter how large, it's utterly tiny compared to the number of
character strings that *aren't* words in any of these languages. 
English has a lot of words, but nobody estimates it at over 2 million
(including scientific jargon, like names for chemical compounds):

http://www.worldwidewords.org/articles/howmany.htm

 My concern is actually purely scientific, not really related to analysis
 of these 4 languages, but I believe it describes my intent quite well.

  I wanted to be able to get a list of words NOT found in say Polish,
 and therefore wanted to have a list of all, theoretically existing words.
 In principle, I can drop this idea of having ideal, theoretical lexicon.
 But have to store those real-world dictionaries anyway to hard drive.

Real-world dictionaries shouldn't be a problem.  I recommend you store
each as a plain text file, one word per line.  Then, e.g., to convert
that into a set of words, do

f = open('EnglishWords.txt')
set_of_English_words = set(f)
f.close()

You'll have a trailing newline character in each word, but that
doesn't really matter.
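
If the trailing newlines do bother you, stripping them costs one extra
expression (a small sketch, not from the post):

f = open('EnglishWords.txt')
set_of_English_words = set(line.strip() for line in f)
f.close()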

Note that if you sort the word-per-line text files first, the Unix
`comm` utility can be used to perform intersection and difference on a
pair at a time with virtually no memory burden (and regardless of file
size).
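
Where comm isn't handy, the same pair-at-a-time intersection/difference can
be done as a streaming merge over two sorted word-per-line files; a rough
Python sketch of the idea (my own, not from the thread):

def compare_sorted(sorted_a, sorted_b):
    # Walk two sorted files in lockstep, like comm(1): yields ('both', w),
    # ('a', w) or ('b', w) tuples using O(1) memory.
    fa, fb = open(sorted_a), open(sorted_b)
    a, b = fa.readline(), fb.readline()
    while a and b:
        if a == b:
            yield 'both', a.strip()
            a, b = fa.readline(), fb.readline()
        elif a < b:
            yield 'a', a.strip()
            a = fa.readline()
        else:
            yield 'b', b.strip()
            b = fb.readline()
    while a:
        yield 'a', a.strip()
        a = fa.readline()
    while b:
        yield 'b', b.strip()
        b = fb.readline()
    fa.close()
    fb.close()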
--
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-10 Thread Martin MOKREJ
Tim Peters wrote:
[Martin MOKREJ]
just imagine, you want to compare how many words are in English, German,
Czech, Polish dictionary. You collect words from every language and record
them in dict or Set, as you wish.

Call the set of all English words E; G, C, and P similarly.

Once you have those Set's or dict's for those 4 languages, you ask
for common words

This Python expression then gives the set of words common to all 4:
E & G & C & P

and for those unique to Polish.

P - E - G - C
is a reasonably efficient way to compute that.
Nice, is it equivalent to common / unique methods of Sets?

I have no estimates
of real-world numbers, but we might be in range of 1E6 or 1E8?
I believe in any case, huge.

No matter how large, it's utterly tiny compared to the number of
character strings that *aren't* words in any of these languages. 
English has a lot of words, but nobody estimates it at over 2 million
(including scientific jargon, like names for chemical compounds):

http://www.worldwidewords.org/articles/howmany.htm
As I've said, in reality I analyze something else than languages.
However, it can be described with the example of words in different languages.
But nevertheless, imagine 1E6 words of size 15. That's maybe 1.5GB of raw
data. Will sets be appropriate, you think?
My concern is actually purely scientific, not really related to analysis
of these 4 languages, but I believe it describes my intent quite well.
I wanted to be able to get a list of words NOT found in say Polish,
and therefore wanted to have a list of all, theoretically existing words.
In principle, I can drop this idea of having ideal, theoretical lexicon.
But have to store those real-world dictionaries anyway to hard drive.

Real-word dictionaries shouldn't be a problem.  I recommend you store
each as a plain text file, one word per line.  Then, e.g., to convert
that into a set of words, do
f = open('EnglishWords.txt')
set_of_English_words = set(f)
I'm aware I can't keep set_of_English_words in memory.
f.close()
M.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-10 Thread Istvan Albert
Martin MOKREJ wrote:

But nevertheless, imagine 1E6 words of size 15. That's maybe 1.5GB of raw
data. Will sets be appropriate you think?
You started out with 20E20, then cut back to 1E15 keys;
now it is down to one million, but you claim that these
will take 1.5 GB.
On my system storing 1 million words of length 15
as keys of a python dictionary is around 75MB.
I.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-10 Thread Martin MOKREJ
Istvan Albert wrote:
Martin MOKREJ wrote:

But nevertheless, imagine 1E6 words of size 15. That's maybe 1.5GB of raw
data. Will sets be appropriate you think?

You started out with 20E20 then cut back to 1E15 keys
now it is down to one million but you claim that these
will take 1.5 GB.
I gave up the theoretical approach. Practically, I might need
to store maybe those 1E15 keys.
So you say 1 million words is better to store in a dictionary than
in a set and use your own function to get out those unique or common
words?
On my system storing 1 million words of length 15
as keys of a python dictionary is around 75MB.
Fine, that's what I wanted to hear. How do you improve the algorithm?
Do you delay indexing to the very latest moment or do you let your
computer index 999 999 times just for fun?
I.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-10 Thread Istvan Albert
Martin MOKREJ wrote:
Istvan Albert wrote:

So you say 1 million words is better to store in a dictionary than
in a set and use your own function to get out those unique or common
words?
I have said nothing even remotely like that.
Fine, that's what I wanted to hear. How do you improve the algorithm?
Do you delay indexing to the very latest moment or do you let your
computer index 999 999 times just for fun?
I think that you need to first understand how dictionaries work.
The time needed to insert a key is independent of
the number of values in the dictionary.
Istvan.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-10 Thread Tim Peters
[Martin MOKREJ]
 ...
 
  I gave up the theoretical approach. Practically, I might need
  to store maybe those 1E15 keys.

We should work on our multiplication skills here <wink>.  You don't
have enough disk space to store 1E15 keys.  If your keys were just one
byte each, you would need to have 4 thousand disks of 250GB each to
store 1E15 keys.  How much disk space do you actually have?  I'm
betting you have no more than one 250GB disk.

...

[Istvan Albert]
 On my system storing 1 million words of length 15
 as keys of a python dictionary is around 75MB.

 Fine, that's what I wanted to hear. How do you improve the algorithm?
 Do you delay indexing to the very latest moment or do you let your
 computer index 999 999 times just for fun?

It remains wholly unclear to me what the algorithm you want might
be.  As I mentioned before, if you store keys in sorted text files,
you can do intersection and difference very efficiently just by using
the Unix `comm` utility.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Writing huge Sets() to disk

2005-01-10 Thread John Lenton
you probably want to look into building set-like objects on top of
tries, given the homogeneity of your language. You should see
improvements both in size and speed.

-- 
http://mail.python.org/mailman/listinfo/python-list


RE: Writing huge Sets() to disk

2005-01-10 Thread Batista, Facundo
[Istvan Albert]


#- I think that you need to first understand how dictionaries work.
#- The time needed to insert a key is independent of
#- the number of values in the dictionary.


Are you sure?


I think that is true while the hashes don't collide. If you have collisions, time starts to depend on element quantity. But I'm not sure

Tim sure can enlighten us.


Asking-for-god-word--ly yours,


. Facundo


Bitcora De Vuelo: http://www.taniquetil.com.ar/plog
PyAr - Python Argentina: http://pyar.decode.com.ar/






-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Writing huge Sets() to disk

2005-01-10 Thread Martin MOKREJ
Tim Peters wrote:
[Martin MOKREJ]
...
I gave up the theoretical approach. Practically, I might need
to store maybe those 1E15 keys.

We should work on our multiplication skills here <wink>.  You don't
have enough disk space to store 1E15 keys.  If your keys were just one
byte each, you would need to have 4 thousand disks of 250GB each to
store 1E15 keys.  How much disk space do you actually have?  I'm
betting you have no more than one 250GB disk.
I can get about 700GB on raid5, but this doesn't help here of course. :(
I definitely appreciate your comments; I somehow forgot to make sure
I can store it. I was mainly distracted by the fact that I might anyway
hit almost the same size, if there are too few words used in reality.
Therefore, when looking for those words not found in such a dictionary,
I'd be close to the theoretical, maximal size of say on the order of E15
or E14.

...
[Istvan Albert]
On my system storing 1 million words of length 15
as keys of a python dictionary is around 75MB.

Fine, that's what I wanted to hear. How do you improve the algorithm?
Do you delay indexing to the very latest moment or do you let your
computer index 999 999 times just for fun?

It remains wholly unclear to me what the algorithm you want might
My intent is to do some analysis in protein science. But it can really be
thought of as analysing those 4 different languages.
be.  As I mentioned before, if you store keys in sorted text files,
you can do intersection and difference very efficiently just by using
the Unix `comm` utility.
Now I got your point. I understand that comm(1) is written in C, but it still
has to scan file1 once and file2 n times, where n is the number of lines
in file1, right? Without any index ... I'll consider it, and actually will
test it, thanks!
I was really hoping I'd get an answer on how to alter the indexes for
dictionaries in Python.
You convinced me not to even try to construct the theoretical dictionary,
as it will take ages just to create. Even if I'd manage, I couldn't
save it (the theoretical one and possibly not even the dict(theoret) - dict(real)
result).
Still, before I give up on the whole project, once more:
I'll parse some text files, isolate separate words and add them to
either a Set(), list, dict, or flatfile, line by line.
Depending on the above, I'll compare them and look for those unique
to some language. I need to keep track of the frequencies used
in every such language, so the dict approach would be the best.
The number stored as a value would be a float ^H^H^H^H^H^H Decimal()
type - a very small number.
Once more, I expect to have between E4 or E5 to E8??? words
stored in 20 dictionaries (remember, words of sizes in range 1-20).
Each of those 20 dictionaries should be, I believe, indexed just once.
The indexing method should know that all entries in a given file are of the
same size, i.e. 5 chars, 15 chars, 20 chars etc.
I already did implement the logic to walk through those 20 different
dictionaries from language a and from language b and find out those
unique to a or common to both of them. Why I went to ask on this list
was to make sure I took the right approach. Sets seemed to be the better
solution for the simple comparison (common, unique). To keep track of those
very small frequencies, I anyway have to have dictionaries. I'll say
it again: how can I improve the speed of dictionary indexing?
It doesn't matter here if I have 10E4 keys in a dictionary or
10E8 keys in a dictionary.
Or I wanted to hear: go for Sets(); having a set with 1E8 keys
might take 1/10 of the size you need for a dictionary ... but
you cannot dump them efficiently to disk. Using shelve will
cost you maybe 2x the size of the dictionary approach and will
also be slower than writing the dictionary directly.
Something along those lines. I really appreciate your input!
Martin
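
For reference, a compact sketch of the counting pipeline described above
(words grouped into 20 dictionaries by length, with per-word counts); the
file format and function name are my assumptions, not Martin's code:

def count_words(filenames, max_len=20):
    # One dict per word length 1..max_len; values are absolute counts.
    counts = [dict() for _ in range(max_len)]
    for filename in filenames:
        for line in open(filename):
            for word in line.split():
                if 1 <= len(word) <= max_len:
                    d = counts[len(word) - 1]
                    d[word] = d.get(word, 0) + 1
    return counts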

--
http://mail.python.org/mailman/listinfo/python-list