Jason,

Thanks for your examples - there are lots of useful SQLAlchemy coding
hints in there for me...
On Apr 5, 5:30 pm, jason kirtland <[EMAIL PROTECTED]> wrote:
> GK wrote:
> > Michael,
> >
> > Thank you for your response - it was very helpful for me.
> >
> > It turns out my main problem was that I was importing an order of
> > magnitude or so more data than I realized, but you were also right
> > about using flush().
> >
> > You were also right about the overhead of creating extra indexes. In
> > the spirit of putting some data in a public space...
> >
> > Starting with an empty database, with a test dataset of 1200 values
> > (about 1150 unique insertions) and flushing after every insertion I
> > have the following timings:
> >
> > No extra indexes: 2:00
> > Three extra indexes: 2:15
> >
> > This is using SQLite with a flat file on a 1.8GHz laptop. The records
> > are each nearly 1Kb. There's an overhead of about 5 seconds for
> > reading the data, so most of the above time is loading the database.
> >
> > I haven't yet had time to judge how the performance varies with
> > larger datasets.
>
> 2:00 seems very high - is that 2 minutes? Below are two similar bulk
> table loads. The first uses the same insert-or-update methodology and
> only the relational layer (no ORM) - that clocks in at 1.25 seconds on
> my laptop. The second is an ORM implementation with a different
> duplicate detection methodology - that clocks in at 2.0 seconds.
>
> ---
>
> ##
> ## Relational version
> ##
>
> import os
> import time
> import random
> from sqlalchemy import *
> from sqlalchemy.exceptions import IntegrityError
>
> data_cols = (
>     'sendadr', 'fromadr', 'toadr', 'sendnam', 'fromnam',
>     'tonam', 'subject', 'received', 'spam', 'folderid')
> chunk = lambda: '%x' % random.getrandbits(400)
> dataset = [dict((col, chunk()) for col in data_cols)
>            for _ in xrange(1200)]
> dupes = random.sample(dataset, 50)
>
> db = '1krows.db'
> if os.path.exists(db):
>     os.unlink(db)
>
> engine = create_engine('sqlite:///%s' % db)
> metadata = MetaData(engine)
> table = Table('t', metadata,
>     Column('id', Integer, primary_key=True),
>     Column('occurs', Integer, default=1),
>     *(Column(col, Text) for col in data_cols))
> table.append_constraint(UniqueConstraint(*data_cols))
> metadata.create_all()
>
> table.insert().execute(dupes)
> assert table.select().count().scalar() == 50
>
> start = time.time()
>
> insert = table.insert()
> update = (table.update().
>           where(and_(*((table.c[col] == bindparam(col))
>                        for col in data_cols))).
>           values({'occurs': table.c.occurs + 1}))
> conn = engine.connect()
> tx = conn.begin()
> for row in dataset:
>     try:
>         conn.execute(insert, row)
>     except IntegrityError:
>         conn.execute(update, row)
> tx.commit()
>
> end = time.time()
>
> assert table.select().count().scalar() == 1200
> assert select([func.count(table.c.id)],
>               table.c.occurs == 2).scalar() == 50
> print "elapsed: %04f" % (end - start)
>
> ##
> ## ORM version
> ##
>
> import hashlib
> import os
> import time
> import random
> from sqlalchemy import *
> from sqlalchemy.orm import *
>
> data_cols = (
>     'sendadr', 'fromadr', 'toadr', 'sendnam', 'fromnam',
>     'tonam', 'subject', 'received', 'spam', 'folderid')
> chunk = lambda: '%x' % random.getrandbits(400)
> dataset = [dict((col, chunk()) for col in data_cols)
>            for _ in xrange(1200)]
>
> def hashrow(row):
>     return hashlib.sha1(
>         ','.join(row[c] for c in data_cols)).hexdigest()
>
> dupes = []
> for row in random.sample(dataset, 50):
>     dupe = row.copy()
>     dupe['hash'] = hashrow(dupe)
>     dupes.append(dupe)
>
> db = '1krows.db'
> if os.path.exists(db):
>     os.unlink(db)
>
> engine = create_engine('sqlite:///%s' % db)
> metadata = MetaData(engine)
> table = Table('t', metadata,
>     Column('id', Integer, primary_key=True),
>     Column('occurs', Integer, default=1),
>     Column('hash', String(40), unique=True),
>     *(Column(col, Text) for col in data_cols))
> metadata.create_all()
>
> table.insert().execute(dupes)
> assert table.select().count().scalar() == 50
>
> class Email(object):
>     def __init__(self, **kw):
>         for key, value in kw.items():
>             setattr(self, key, value)
>
>     def hashval(self):
>         return hashrow(dict((col, getattr(self, col))
>                             for col in data_cols))
>
> mapper(Email, table)
>
> start = time.time()
> session = create_session()
> session.begin()
>
> data = [Email(**row) for row in dataset]
>
> chunk, remaining = [], [(e.hashval(), e) for e in data]
> while remaining:
>     chunk, remaining = remaining[:100], remaining[100:]
>     by_hash = dict(chunk)
>     dupes = (session.query(Email).
>              filter(Email.hash.in_(by_hash.keys()))).all()
>     for dupe in dupes:
>         dupe.occurs += 1
>         by_hash.pop(dupe.hash)
>     for hashval, email in by_hash.items():
>         email.hash = hashval
>         session.save(email)
>     session.flush()
> session.commit()
>
> end = time.time()
>
> assert table.select().count().scalar() == 1200
> assert select([func.count(table.c.id)],
>               table.c.occurs == 2).scalar() == 50
> print "elapsed: %04f" % (end - start)

Yes, my timing was 2 minutes. I ran your examples and got the same
times as you quote, so I started wondering what was causing the
dramatic difference.

Starting with your ORM sample, I started stripping out optimizations
and got progressively longer run times:

1. as supplied: 2.0 seconds
2. removed the hash optimization (using the original columns for the
   uniqueness constraint and the existing-value query): 5.7 seconds
3. removed the chunking optimization (i.e. flushing after every new
   email processed instead of after every 100): 7.7 seconds
4. removed the session.begin() and session.commit() around all the
   data: 131 seconds (!)

I went back to my original code and tried adding a begin/commit around
the top-level operation, and voila! I now get about 7 seconds to
process about 500 messages (my previous test data having been lost).
This is a vast improvement.

So it appears that, even for a non-transactional database session,
wrapping the execution in begin()/commit() is vital for performance.
Thanks for helping me to find this!
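For the record, the difference boils down to the pattern below. This
is a minimal, untested sketch that reuses the Email mapping, dataset
and session API from your ORM example above, not my actual import
code:

# Sketch only - assumes the Email class, mapper() and dataset
# defined in the ORM example above.

# Slow: with no explicit transaction, each flush() runs in its own
# implicit transaction, so SQLite syncs the file for every record.
session = create_session()
for row in dataset:
    session.save(Email(**row))
    session.flush()

# Fast: one explicit transaction around the whole load - the work is
# synced to disk once, when commit() is called.
session = create_session()
session.begin()
for row in dataset:
    session.save(Email(**row))
session.flush()
session.commit()

On my data, that one change was the difference between minutes and
seconds.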
#g