"engine.dispose() is an inefficient operation, as the Engine is a
factory for connections, not a connection itself.   If you'd like
there to be zero actual database connections open when the application
is idle, you can disable pooling using NullPool.    "

Very good point. Will do!
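
If I understand correctly, that means creating the engine roughly like
this (the URL below is just a placeholder for our actual database):

from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# Disable pooling so no connection is held open while the process is idle.
# (The URL is a placeholder, not our real database.)
engine = create_engine("postgresql://user:pass@host/dbname",
                       poolclass=NullPool)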

To be more clear, I was actually doing this:

all_events = list(session.query(Event).filter(....).order_by(...).all())
session.close()
session.bind.dispose()

for a in analyses_parameters:
    analysis = Analysis(....)
    events_in_this_analysis = all_events[...]
    analysis.events = events_in_this_analysis
    session.add(analysis)
session.commit()
session.close()
session.bind.dispose()

Thus, there is no connection to the database during the for loop. And
that loop takes a long time, so the lack of a connection is much
appreciated. The issue comes during the session.commit(), which has
thousands of Analyses with hundreds of thousands of associated Events.
I can watch the database and the open connections, and I see that:

1. All these Analyses are inserted (which takes a while).
2. THEN the Analysis-Event associations are inserted, which also takes
a while. It's during this time that I see the connection for this
session go idle for a few cycles, then insert, then go idle, then
insert, and so on. This is the behavior that seemed like it could be
improved. It's unclear to me from your comments whether or not it
actually can be.

"SQLAlchemy 0.7 will also perform the INSERT of all the above records
as a single prepared statement if you assign a primary key value to
the objects ahead of time"
Good idea, but unfortunately not doable in this case; lots of
independent processes are doing this at the same time, so I don't see
a coherent way to effectively determine primary keys ahead of time.
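
(Just to check that I follow the suggestion, I read it as something
like the sketch below, where reserve_ids() is purely hypothetical: it
stands in for whatever mechanism (e.g. a database sequence) could hand
out unique ids across all the processes, and I'm assuming the primary
key column is simply "id".)

# Hypothetical sketch of pre-assigning primary keys so the INSERTs can
# be batched; reserve_ids() is made up for illustration only.
ids = reserve_ids(len(analyses_parameters))

for pk, params in zip(ids, analyses_parameters):
    analysis = Analysis(**params)   # stand-in for the real constructor call
    analysis.id = pk                # primary key assigned ahead of time
    session.add(analysis)
session.commit()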

"Or, you can get the same effect by inserting the records using a non-
ORM executemany, where in this case you wouldn't need to pre-assign
primary key values:

Session.execute(Event.__table__.insert(), params=[{a:evt.a,
b:evt.b, ...} for evt in events}]
... "

I'd actually be inserting into Analysis_Event_Association. That
association table needs the primary keys of the Analyses and the
Events. Doing that manually might look like:

(using NullPool)
all_events = list(session.query(Event).filter(....).order_by(...).values(Event.id))
session.close()

for a in analyses_parameters:
    analysis = Analysis(....)
    session.add(analysis)
session.commit()
session.close()

all_analyses = list(session.query(Analysis).filter(....).order_by(...).values(Analysis.id))
session.close()

for a in all_analyses:
    events_in_this_analysis_ids = all_events[...]
    for e in events_in_this_analysis_ids:
        session.add(Analysis_Event_Association(analysis_id=a, event_id=e))
session.commit()
session.close()

That seems like it would be the most efficient approach in terms of
connections only being open during each session.commit(), with each of
those commits doing its INSERTs as efficiently as possible. The Python
side, however, looks a lot less elegant and simple, which makes me
think I'm missing something.
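
(For what it's worth, I imagine the last loop above could also become
the non-ORM executemany you described, aimed at the association table.
association_pairs below is just a stand-in for the (analysis_id,
event_id) tuples already worked out in Python:)

# Rough sketch of a non-ORM executemany against the association table.
# association_pairs is hypothetical: a list of (analysis_id, event_id)
# tuples computed beforehand.
session.execute(
    Analysis_Event_Association.__table__.insert(),
    params=[{"analysis_id": a_id, "event_id": e_id}
            for a_id, e_id in association_pairs]
)
session.commit()
session.close()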

Thanks!

On May 28, 12:11 pm, Michael Bayer <mike...@zzzcomputing.com> wrote:
> On May 28, 2012, at 10:52 AM, Jeff wrote:
>
> > There are Analyses, which have Events. There's an
> > Analysis_Event_Association table. Events are written before the
> > Analyses. Later, I have a for loop calculating multiple Analyses. For
> > each Analysis, we identify the Events it includes and append them,
> > using sqlalchemy's relationship plumbing. This just writes the
> > appropriate row to the association table. It works great:
>
> > all_events =
> > list(session.query(Event).filter(....).order_by(...).all())
>
> > for a in analyses_parameters:
> >    analysis = Analysis(....)
> >    events_in_this_analysis = all_events[...]
> >    analysis.events = events_in_this_analysis
> >    session.append(analysis)
>
> > session.commit()
> > session.close()
> > session.bind.dispose()
>
> > The issue is that I'm in a situation where there are many, many
> > processes writing to the same database, and the database only allows
> > us a limited number of connections. So I'm trying to have connections
> > open only when they're actually being used, hence the
> > session.bind.dispose().
>
> engine.dispose() is an inefficient operation, as the Engine is a factory for 
> connections, not a connection itself.   If you'd like there to be zero actual 
> database connections open when the application is idle, you can disable 
> pooling using NullPool.
>
>
>
> > Writing all the analyses is a single insert, and thus is efficient
> > with regards to having a connection open. HOWEVER, it appears that
> > each of the Events associations we appended are being inserted
> > individually, which is bad.  The connection sits there, then is used,
> > then sits there, then is used, etc. This contributes to the maximum
> > connections open, which is a limited resource.
>
> Just to make sure we're on the same page, the Session procures a single DBAPI 
> connection, and uses it for the lifespan of that transaction.   So the above 
> code is using only one connection until COMMIT is called.   Unless there is a 
> significant period of latency in the above loop, it would be way less 
> efficient and also defeat transactional guarantees to use a new connection 
> for each INSERT.   If your app is using MyISAM tables then the transactional 
> guarantees are moot in any case, though.   But breaking up a tight loop to 
> repeatedly connect over and over again for each record being inserted, in the 
> effort to use fewer concurrent connections, is very unlikely to help the 
> situation as you're adding lots more work and time spent to the operation.
>
> > I'd like all the
> > appended Events association to be inserted in one go. Is there a way
> > to do this? Am I correct about what's going on?
>
> SQLAlchemy 0.7 will also perform the INSERT of all the above records as a 
> single prepared statement if you assign a primary key value to the objects 
> ahead of time; the ORM no longer needs to fetch the newly inserted primary 
> key in this case, so you'd see a single statement sent to the DBAPI, and the 
> DBAPI then executes the statement in a more efficient manner for each set of 
> parameters (typically using prepared statement mechanics).   It still 
> ultimately emits individual INSERT statements over the wire but just at a 
> lower, more performant level on the Python side.
>
> Or, you can get the same effect by inserting the records using a non-ORM 
> executemany, where in this case you wouldn't need to pre-assign primary key 
> values:
>
> Session.execute(Event.__table__.insert(), params=[{a:evt.a, b:evt.b, ...} for 
> evt in events}]
>
> the disadvantage to the above is that the Session doesn't know about these 
> newly inserted Event objects.  You'd need to make sure you don't add 
> equivalent records to the Session itself or they will conflict, and if you 
> want to work with these Event objects in the Session immediately subsequent 
> to the above operation you'd need to load them in (as that would be the only 
> way to get at their newly generated primary key values).
