I've taken a look at your data and, somewhat to my surprise, came to
the same conclusion you did: for your dataset, a select followed by a
filter is the fastest solution.

You might not know that you can select on two attributes at once, so
you could have written:

subvw = vw.select(f1=pickon, state=state)

but the following is faster:

subvw = vw.select(f1=pickon)
rs = subvw.filter(lambda row: row.state == state)
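
For readers without Metakit at hand, the two-step pattern can be
mimicked in plain Python.  This is only an analogy: the Row tuple, the
select helper and the sample rows are made up for illustration and are
not Metakit's API, so it shows the logic and says nothing about speed.

```python
from collections import namedtuple

# Hypothetical row type standing in for a Metakit row (not Metakit's API).
Row = namedtuple("Row", ["f1", "state"])

rows = [Row(1, "a"), Row(1, "b"), Row(2, "a"), Row(1, "a")]

def select(view, **criteria):
    """Return the rows whose attributes equal every keyword criterion."""
    return [r for r in view
            if all(getattr(r, k) == v for k, v in criteria.items())]

pickon, state = 1, "a"

# One-step version: match both attributes at once.
one_step = select(rows, f1=pickon, state=state)

# Two-step version: narrow on f1 first, then filter the smaller result.
subvw = select(rows, f1=pickon)
two_step = [r for r in subvw if r.state == state]

print(one_step == two_step)
```

Both versions pick out the same rows; the second only runs the state
test on the rows that survived the f1 match.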

You mention that it takes 1.7 seconds to examine one set of data.
That seems slow to me, which suggests that the problem is different
from the one you describe: on 100,000 records generated through the
mechanism you gave, a search using the select followed by the filter
takes only 0.2 seconds.

Now, I think a better solution to your problem lies in how you store
the data.  You are in desperate need of subviews.

db = metakit.storage(DB,1)
vw = db.getas('test[f1:I,f2:s,data[seq:I,state:s]]').ordered(1)

I am setting up a view ordered on f1:I here for quick lookups.  Each
row has a subtable, data, which holds your state information.  This
subtable is the key: it dramatically reduces the size of the outer view.
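
To picture why nesting shrinks the outer view, here is a plain-Python
sketch.  The sample rows are invented and the dict merely stands in for
the outer view; none of this is Metakit's API.

```python
# Flat layout: one row per (f1, f2, seq, state) combination (sample data).
flat = [
    (7, "7", 0, "0"),
    (7, "7", 1, "1"),
    (7, "7", 2, "2"),
    (9, "9", 0, "0"),
    (9, "9", 1, "1"),
]

# Nested layout: one outer row per f1, holding its own list of (seq, state).
nested = {}
for f1, f2, seq, state in flat:
    outer = nested.setdefault(f1, {"f2": f2, "data": []})
    outer["data"].append((seq, state))

print("%d flat rows -> %d outer rows" % (len(flat), len(nested)))

# A lookup touches one outer row, then scans only that row's small subtable.
wanted = [seq for seq, state in nested[7]["data"] if state == "1"]
print(wanted)
```

The search on f1 runs over the short outer table, and the state test
only ever sees the handful of rows that belong to that key.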

Lookups on my 100000 row table now take 8.79299998283e-005 seconds
each, which is several orders of magnitude better than what we were
finding before.  Additionally, the filtering time for the data subviews
is almost inconsequential.  If the number of keys (f1 in this case) is
small, say around 100000 rows or so, you can get much better
performance by using hash tables.  If it is much larger, say 200000+,
you can use blocked and ordered views with a minor speed hit.
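
The difference between a hashed and an ordered lookup can be sketched
with the standard library: a dict plays the role of the hash table and
bisect over a sorted list plays the role of the ordered view.  The keys
are invented and this is an analogy only; it says nothing about
Metakit's actual timings.

```python
import bisect

keys = [42, 5, 1200, 8, 17]     # sample keys, invented for illustration

# Hash-style index: key -> position, one probe per lookup on average.
hashed = dict((k, i) for i, k in enumerate(keys))

# Ordered-style index: sorted keys, one binary search per lookup.
ordered = sorted(keys)

def find_ordered(sorted_keys, key):
    """Return the position of key in sorted_keys, or -1 if it is absent."""
    pos = bisect.bisect_left(sorted_keys, key)
    if pos < len(sorted_keys) and sorted_keys[pos] == key:
        return pos
    return -1

print(hashed.get(17, -1))          # position in the original order
print(find_ordered(ordered, 17))   # position in the sorted order
print(find_ordered(ordered, 99))   # missing key
```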

Of course, it will take longer to load the data, but this might be
outweighed by the newfound quickness of your searches.

One caveat is that when you add rows to a hashed or ordered view, the
index returned by append is not the actual index in the ordered table,
so you can't just blindly use it.  I'm hoping to fix this bug in
metakit at some point; until then, if you need the index you will have
to search for it using view.find.  Here is some test code:

import metakit, random, os, time

# Start from a fresh database file.
DB = "foo.mk"
if os.path.exists(DB):
    os.remove(DB)
db = metakit.storage(DB, 1)
# Outer view ordered on f1; each row carries a nested "data"
# subview holding (seq, state) pairs.
vw = db.getas('test[f1:I,f2:s,data[seq:I,state:s]]').ordered(1)

num = 10000
for i in range(num):
    # Commit every 1000 rows and report progress every 10000.
    if i % 1000 == 0:
        if i % 10000 == 0:
            print(i)
        db.commit()

    f1 = random.randint(0, num)
    index = vw.find(f1=f1)
    if index == -1:
        badindex = vw.append((f1, '%d' % f1))
        # we need to search for the correct index
        # since append returns the index from the
        # original view, not the mapping view
        index = vw.find(f1=f1)
    data = vw[index].data
    # Add between 0 and 4 rows to this key's subview.
    seq = random.randint(0, 4)
    for s in range(seq):
        data.append((s, '%d' % s))

# Time num random lookups on the ordered view.
t1 = time.time()
for i in range(num):
    s = random.randint(0, num)
    index = vw.find(f1=s)
t2 = time.time()
print("%s seconds per search" % ((t2 - t1) / num))
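
The append caveat can also be pictured with a small plain-Python model.
The OrderedView class below is hypothetical: it only mimics the
behaviour described above, where append reports a position in the
underlying storage rather than in the ordered mapping, and is not how
Metakit is implemented.

```python
import bisect

class OrderedView(object):
    """Toy model: rows are stored in arrival order, exposed in key order."""

    def __init__(self):
        self._storage = []   # keys in the order they were appended
        self._sorted = []    # the same keys, kept sorted

    def append(self, key):
        # Mimics the caveat: the returned index refers to the
        # underlying storage, not to the ordered view.
        self._storage.append(key)
        bisect.insort(self._sorted, key)
        return len(self._storage) - 1

    def find(self, key):
        # Index of key in the ordered view, or -1 if it is absent.
        pos = bisect.bisect_left(self._sorted, key)
        if pos < len(self._sorted) and self._sorted[pos] == key:
            return pos
        return -1

    def __getitem__(self, index):
        return self._sorted[index]

vw = OrderedView()
for key in (30, 10, 20):
    vw.append(key)

badindex = vw.append(5)   # position in storage order
index = vw.find(5)        # position in the ordered view
print("append gave %d, find gave %d" % (badindex, index))
```

In this model vw[badindex] is a different row from the one just added,
which is why the test code looks the index up again with find.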


_____________________________________________
Metakit mailing list - [EMAIL PROTECTED]
http://www.equi4.com/mailman/listinfo/metakit
