Nicolas Lehuen wrote:
Hi Jim,

You've done some pretty impressive work here. What surprises me is the
O(n) behaviour on DBM and FS. This seems to mean that indexes (or
indices, if you prefer) are not used.

ext2/ext3 use a linked list for directory entries, hence O(n) when adding a file.

For DBM, well, if BDB could not handle indexes, this would be big
news. Are you 100% sure that the Berkeley implementation is used?

I used dbhash, which according to the Python docs is the interface to the bsddb module. The code is pretty much the same as in DbmSession. A code snippet is at the bottom of this message. Running "/usr/bin/file bsd.dbm" gives:

bsd.dbm: Berkeley DB (Hash, version 8, native byte-order)

**** Brain Wave ****

It just occurred to me that the performance problem may be related to opening and closing the dbm file for every record insertion. Adjusting the test so that the file is only opened once, I get O(1) and a great speed boost: 0.2 seconds per 1000 records all the way up to 50,000 records. At that point I start to see periodic performance hits due to disk syncing, but that is to be expected.
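
To make the two patterns concrete, here is a minimal sketch, not the benchmark itself. It assumes Python 3's dbm and pickle modules as stand-ins for dbhash and cPickle, and the function names insert_reopening, insert_open_once and count_keys are hypothetical.

```python
import dbm
import os
import pickle
import tempfile
import time


def insert_reopening(path, records):
    """Open and close the dbm file for every record (the slow pattern)."""
    for key, value in records:
        db = dbm.open(path, 'c')
        try:
            db[key] = pickle.dumps(value)
        finally:
            db.close()


def insert_open_once(path, records):
    """Open the dbm file once and reuse the handle for all records."""
    db = dbm.open(path, 'c')
    try:
        for key, value in records:
            db[key] = pickle.dumps(value)
    finally:
        db.close()


def count_keys(path):
    """Return the number of keys stored in the dbm file."""
    db = dbm.open(path, 'r')
    try:
        return len(db.keys())
    finally:
        db.close()


if __name__ == "__main__":
    # Illustrative data only; timings depend on the dbm backend and disk.
    test_dir = tempfile.mkdtemp()
    records = [("sid%05d" % n, {"_timeout": 3600, "stuff": "test"})
               for n in range(1000)]
    for func in (insert_reopening, insert_open_once):
        target = os.path.join(test_dir, func.__name__ + ".dbm")
        start = time.time()
        func(target, records)
        print(func.__name__, time.time() - start)
```

Which backend dbm.open picks depends on the platform, so absolute numbers from this sketch are not comparable with the dbhash figures above.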

I have no idea what to do with this knowledge unless we can figure out a way to keep the dbm file open across multiple requests. Ideas??

**** End of Wave ****

For FS, I don't know about ext3, but in ReiserFS or the Win NT
filesystem, there are indexes that should speed up file lookups, and
should certainly not yield an O(n) performance.

Don't forget, I only benchmarked creating new session files. Reading from or writing to existing files may be an entirely different matter. Certainly one of the benefits of ReiserFS is that it can handle a large number of small files efficiently.

Anyway, implementing
FS2 instead of FS is not that difficult, and if it yields predictable
results even on ext3, then we should go for it.

Already done - it's just a couple of extra lines. Doing some testing today.

As for the MySQL implementation, well, I've promised it many
times, but I can provide a connection pool implementation that could
speed up application code as well as your session management code.
What I would need to do is to make it mod_python friendly, i.e. make
it configurable through PythonOption directives. Do you think it would
be a good idea to integrate it into mod_python?

Connection pooling seems like a common request on the mailing list, so I'd say yes.
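
For readers unfamiliar with the idea, a connection pool can be sketched roughly as below. This is not Nicolas's implementation: the ConnectionPool class and make_connection factory are hypothetical, written for Python 3, and sqlite3 stands in for a MySQL driver so the example stays self-contained.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Hold a fixed number of open connections and lend them out."""

    def __init__(self, factory, size=4):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout=None):
        # Block until a connection is free, then return it to the pool
        # when the caller's with-block ends.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)


def make_connection():
    # check_same_thread=False lets a pooled connection be handed to
    # a different thread than the one that created it.
    return sqlite3.connect(":memory:", check_same_thread=False)


if __name__ == "__main__":
    pool = ConnectionPool(make_connection, size=2)
    with pool.connection() as conn:
        print(conn.execute("SELECT 1").fetchone())
```

The point of the pattern is that the cost of opening a connection is paid once per pool slot rather than once per request.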

Regards,
Nicolas


Code snippet from my benchmark script.

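import os
import time
import cPickle
import dbhash

# _new_sid, write_header, print_progress and write_result are helpers
# that are not shown in this snippet.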

def create_bsd(test_dir, test_runs, number_of_files, do_sync=False):
    if not os.path.exists(test_dir):
        os.mkdir(test_dir)
    dbmfile = "%s/bsd.dbm" % (test_dir)
    dbmtype = dbhash
    i = 0
    timeout = 3600
    count = 0
    total_files = 0
    results_file = '%s/bsd.%02d.results' % (test_dir,i)

    results = open(results_file,'w')
    write_header(results, 'dbhash', test_runs, number_of_files)
    while count < test_runs:
        start_time = time.time()
        i = 0
        while i < number_of_files:
            sid = _new_sid()
            data = {'_timeout': timeout,
                    '_accessed': time.time(),
                    'stuff': 'test test test test test'
                    }
            # dbm file is opened and closed for each insertion
            # which is the same as the current DbmSession
            # implementation
            dbm = dbmtype.open(dbmfile, 'c')
            dbm[sid] = cPickle.dumps(data)
            dbm.close()
            i += 1
            total_files += 1


        count += 1
        elapsed_time = time.time() - start_time
        print_progress(count, i, elapsed_time)
        write_result(results,count, i, elapsed_time, total_files)
        results.flush()
        if do_sync:
            s
