Tim Peters wrote:
[Martin MOKREJŠ]

...

I gave up the theoretical approach. Practically, I might need to
store maybe up to 1E15 keys.


We should work on our multiplication skills here <wink>.  You don't
have enough disk space to store 1E15 keys.  If your keys were just one
byte each, you would need to have 4 thousand disks of 250GB each to
store 1E15 keys.  How much disk space do you actually have?  I'm
betting you have no more than one 250GB disk.
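
The arithmetic, spelled out as a quick Python check (one byte per key
is of course a wildly optimistic lower bound, since any real container
adds per-key overhead):

    n_keys = 10**15           # keys to store
    bytes_per_key = 1         # optimistic lower bound
    disk_bytes = 250 * 10**9  # one 250GB disk

    total_bytes = n_keys * bytes_per_key
    print(total_bytes / disk_bytes)   # -> 4000.0 disks of 250GB each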

...

[Istvan Albert]

On my system storing 1 million words of length 15
as keys of a python dictionary is around 75MB.
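
For what it's worth, a rough sketch of how one might reproduce that
kind of measurement; sys.getsizeof counts the dict table and the key
objects but misses some allocator overhead, so the figure is only
approximate:

    import random
    import string
    import sys

    # Build 1 million distinct 15-character keys (random example data).
    words = set()
    while len(words) < 10**6:
        words.add(''.join(random.choice(string.ascii_uppercase)
                          for _ in range(15)))

    d = dict.fromkeys(words, None)

    # Approximate footprint: the dict's own table plus the key strings.
    total = sys.getsizeof(d) + sum(sys.getsizeof(k) for k in d)
    print("approx %.1f MB" % (total / 1e6))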


Fine, that's what I wanted to hear. How do you improve the algorithm?
Do you delay indexing until the very last moment, or do you let your
computer re-index 999 999 times just for fun?


It remains wholly unclear to me what "the algorithm" you want might
be.  As I mentioned before, if you store keys in sorted text files,
you can do intersection and difference very efficiently just by using
the Unix `comm` utility.
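
For illustration, a minimal Python sketch of the same single-pass
merge that comm does for intersection; the filenames are assumed, and
both files must already be sorted in ascending line order:

    def sorted_intersection(path_a, path_b):
        """Yield lines common to two files, each sorted ascending."""
        with open(path_a) as fa, open(path_b) as fb:
            a, b = fa.readline(), fb.readline()
            while a and b:
                if a < b:
                    a = fa.readline()      # only in file A; skip it
                elif a > b:
                    b = fb.readline()      # only in file B; skip it
                else:
                    yield a                # common line
                    a, b = fa.readline(), fb.readline()

    for line in sorted_intersection('file1.txt', 'file2.txt'):
        print(line, end='')

The same two-pointer loop gives you set difference as well: just yield
the lines that only one side has instead of the matches.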

This comm(1) approach doesn't work for me. It somehow fails to detect common entries when the offset between them in the two files is too big.

file 1:

A
F
G
I
K
M
N
R
V
AA
AI
FG
FR
GF
GI
GR
IG
IK
IN
IV
KI
MA
NG
RA
RI
VF
AIK
FGR
FRA
GFG
GIN
GRI
IGI
IGR
IKI
ING
IVF
KIG
MAI
NGF
RAA
RIG


file 2:

W
W
W
W
W
W
W
W
W
W
AA
AI
FG
FR
GF
GI
GR
IG
IK
IN
IV
KI
MA
NG
RA
RI
VF
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
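
For what it's worth, a likely cause: comm(1) assumes both inputs are
in plain lexicographic sort order, but these files are grouped by word
length instead. Lexicographically "AA" sorts before "F", yet here it
comes after "V"; once the two merge cursors cross like that, comm
silently stops matching. Sorting both files first fixes it. A minimal
sketch, with the filenames assumed:

    import subprocess

    # Rewrite each file in plain lexicographic (byte) order, which is
    # the order comm(1)'s merge expects.
    for name in ('file1.txt', 'file2.txt'):
        with open(name) as f:
            lines = sorted(f)
        with open(name, 'w') as f:
            f.writelines(lines)

    # Now comm -12 prints exactly the common entries (AA, AI, FG, ...).
    subprocess.run(['comm', '-12', 'file1.txt', 'file2.txt'])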

