>
> I understand this somewhat and had done that, but am still confused by one
> thing: in my pure ORM case (for "joe"), how did the identity map get hold
> of the auto-increment ids when there was no apparent emitted SQL?  It seems
> there wasn't one, although I suppose it could have been done behind the
> scenes and not logged?
>
> The examples you showed me on pastebin all seemed to involve the ORM going
> out to fetch rows, so when they are fetched via SELECT all the info is
> there.
>

But there *was* no apparent SELECT!  I don't think I expressed myself
properly... below is a dump of the SQL logging for my pastebin example (some
prefix cruft removed):

BEGIN (implicit)
INSERT INTO users (name) VALUES (?)
('joe',)
INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
('jo...@example.com', 1)
INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
('jo...@example.com', 1)
INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
('jo...@example.com', 1)
INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
('jo...@example.com', 1)
INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
('jo...@example.com', 1)
INSERT INTO users (name) VALUES (?)
('bob',)
INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
(('bo...@example.com', 2), ('bo...@example.com', 2), ('bo...@example.com', 2), ('bo...@example.com', 2), ('bo...@example.com', 2))


And this is *immediately* followed by the "for addr in joe.addresses: print
addr.id, addr" print loop, which prints:

1 <Address('jo...@example.com')>
2 <Address('jo...@example.com')>
3 <Address('jo...@example.com')>
4 <Address('jo...@example.com')>
5 <Address('jo...@example.com')>


Note that the ids are there, and there wasn't a SELECT to be found anywhere
in the SQL emission logs.  This is what leads me to believe that either A)
there was one and it just didn't get logged, or B) they are somehow coming
back on the insert.  If B, I'd like to get hold of that mechanism for my
bulk inserts.
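
A minimal sketch of what I imagine B would look like at the DBAPI level
(plain sqlite3 here; whether SQLAlchemy actually does this internally is my
assumption):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur = conn.cursor()
cur.execute("INSERT INTO users (name) VALUES (?)", ("joe",))
print cur.lastrowid  # -> 1: the auto-increment id, with no SELECT emitted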

> so really you're only concerned about a bulk INSERT, ok
>

Yes, and this comes exclusively from actual measurements and observed query
behavior (and the problem is really only the INSERT flood when using the ORM
alone).

>> Thanks - I'm not sure I 100% get it.  I think it is simply that I can
>> make an ORM object instance not associated with a session (I think you call
>> this "detached"), directly assign the PK (if I know it), and then when I
>> call Session.add() the ORM will trust me and use the provided PK for
>> integrating the object into the identity-map.  Is that right?
>>
> Correct, but I've never had anyone do this before.    I'm not yet sure I
> like advising someone to manipulate the Session directly in this way.   We
> need to be careful what we advise because in two weeks it will show up on
> tutorials everywhere then we get complaints that "SQLAlchemy is too hard"
> because everyone thinks they're supposed to put things straight into the
> identity map (exaggerating here, but only slightly).
>

No worries - I will steer very clear of anything that is not advised.  My
initial objective was to find out what *is* advised when cheating on the ORM
and doing things (bulk inserts) directly, as well as trying to figure out
some other things, like how autoinc ids are magically known without an
apparent SELECT.  You are being very helpful on all counts.

> well therein lies why the ORM most of the time can't use bulk INSERT calls
> in a backend-neutral way.   Your approach, which essentially is that of
> swapping in some optimized code where flush() would normally take place, is
> subject to the same rules of the road.  Assuming the backend is one that
> uses a "get the most recent ID" approach, which includes SQLite, MySQL and
> MSSQL, the Python DBAPI, and in fact every database client API I've ever
> worked with, only supports returning you the "last inserted id" one INSERT
> at a time - that is, only via cursor.execute(), and not
> cursor.executemany()  (background on DBAPI at
> http://www.python.org/dev/peps/pep-0249/).   On Postgresql, Oracle,
> Firebird, the underlying construct to generate ids is the sequence - on
> those, you could run the sequence externally to the inserts for the full
> range of integer IDs first.
>
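
To make sure I follow the execute()-vs-executemany() point, here is the
limitation as I understand it, in a quick sqlite3-only sketch:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
cur = conn.cursor()

cur.execute("INSERT INTO t (v) VALUES (?)", ("a",))
print cur.lastrowid   # defined: one "last inserted id" per execute() call

cur.executemany("INSERT INTO t (v) VALUES (?)", [("b",), ("c",)])
print cur.lastrowid   # undefined after executemany() per PEP 249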

The main application is using PostgreSQL, but at the moment there is no
Postgres-specific code, and everything still works perfectly (and is tested
regularly) on SQLite, the latter being mostly for quick/easy debugging.

Snagging a bunch of ids from the sequence in advance is an interesting idea
to avoid the "post fetch of primary key" you mentioned.  That will be
blazing fast, versus reading back the ids after the inserts (assuming no
magic id returns as per earlier).  The added bonus/appeal with that is that
the ORM can work normally without having to resort to the trickery that
merging non-ORM operations into the ORM requires.

> so to do this, you'd need:
>
> 1. a way to make primary key ids fast, i.e. a sequence.   or not using
> autoinc if you're on MySQL, etc.
> 2. a function that will copy those inserted rows into ORM objects, where
> you use set_committed_state() for those attributes you'd like loaded (must
> include at least the primary key identifying attributes) and state.key =
> instance_key(obj) (paraphrased) to set up the state and use session.add().
> At some point SQLA could grow a make_detached(obj) function.
> 2a:  Fine here's make_detached:
>
> from sqlalchemy.orm.attributes import instance_state
>
> def make_detached(obj):
>     """Given a transient object with state, turn it into 'detached', that
>     is, has an instance key based on its primary key identity.  obj must
>     have primary key attributes populated."""
>     state = instance_state(obj)
>     state.key = state.mapper._identity_key_from_state(state)
>
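
Just to check my understanding, here is how I would expect to use that (a
sketch only; I am assuming set_committed_value() is the function you were
referring to as set_committed_state()):

from sqlalchemy.orm.attributes import set_committed_value

addr = Address('jo...@example.com')  # starts out transient
set_committed_value(addr, 'id', 1)   # PK attribute populated as "loaded"
make_detached(addr)                  # now has an instance key -> detached
sess.add(addr)                       # into the identity map; no INSERT at flush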

Thanks!  Although after really chewing on all of this and trying out
different things, I'm now content to stay in the ORM and keep away from the
need for that make_detached function (see below).  The batch insert
behaviour you pointed out when pre-assigning PKs with ORM inserts is the
big win here, I think.  I confirmed this batch insert ORM behaviour with
the following simple snippet:

numEmails = 5
joeEmails = [Address("joe%...@example.com" % i) for i in range(numEmails)]
# Now manually assign some known unique PK ids in advance...
#  - this is Bad code and is only a cheat for this example
for i, mail in enumerate(joeEmails, 10000):
    mail.id = i
joe.addresses = joeEmails
sess.add(joe)
# flush and see that all addresses are bulk inserted with a single INSERT...
sess.flush()


So I never need to leave the ORM, and don't even need to do trickery like
forcing unnatural transient->detached state transitions (although all of
this has been extremely informative).

The only remaining thing is to make sure my manually assigned PKs are
legit/unique.  In my PostgreSQL case I can hard-code the fetching of a
block of ids from a sequence with:

def GetSequenceBlock(ORMSession, SeqName, BlockSize):
    q = "SELECT nextval('%s') FROM generate_series(1, %d)" % \
        (SeqName, BlockSize)
    return [v[0] for v in ORMSession.execute(q).fetchall()]

Going further, I've now successfully implemented this technique with
PostgreSQL in a bunch of code, sticking with the ORM the whole way.  I've
added this classmethod to the base class I use for my DeclarativeBase:

class MyOrmBase(object):
    @classmethod
    def GetIdSequenceBlock(cls, ORMSession, BlockSize):
        """Returns a list of BlockSize ids from the sequence for the
        current table's id.  If no sequence can be found (dialect
        doesn't support it?) an empty list is returned.

        Assumptions:
         - the column with the sequence is named "id"
         - standard SQLAlchemy sequence naming convention

        Currently only supports PostgreSQL.

        """
        seqName = "%s_id_seq" % cls.__tablename__
        q = "SELECT nextval('%s') FROM generate_series(1, %d)" % \
            (seqName, BlockSize)
        try:
            ids = [v[0] for v in ORMSession.execute(q).fetchall()]
        except Exception:  # could be various exceptions (no sequence, etc.)
            ids = []
        return ids


And when I want to do a bulk insert using this I do something like:

mails = [Address("joe%...@example.com" % i) for i in range(mail_count)]
pregen_ids = Address.GetIdSequenceBlock(sess, mail_count)
for i, id in enumerate(pregen_ids):
    mails[i].id = id
# and this flush does it all in one INSERT...
#  - for PostgreSQL, anyway
#  - for SQLite, GetIdSequenceBlock returns no ids, so no PKs are
#    pre-assigned and it falls back to an INSERT per row
sess.flush()


I'm quite happy with that, except for the fact that it breaks the seal and
introduces PostgreSQL-specific code (raw SQL at that) into the application.

I really wish there was a way to do this with the stock autoincrementing PK
in SQLite, but it seems I'll have to use a different method of pre-fetching
ids there.  On the raw SQL point, I see that calling
nextid = connection.execute(seq) will get the next value in the sequence for
all relevant dialects, but there doesn't seem to be an equivalent way in the
dialects to get a batch of them in one shot.
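
For dialects with sequences I suppose I could fall back to something like
this (a sketch only; it costs one round trip per id, so it is nowhere near
as fast as the generate_series trick):

from sqlalchemy import Sequence

def GetSequenceBlockPortable(connection, SeqName, BlockSize):
    """Sketch: fetch BlockSize ids one at a time via the Core Sequence
    construct, on any sequence-supporting dialect (PostgreSQL, Oracle,
    Firebird)."""
    seq = Sequence(SeqName)
    return [connection.execute(seq) for _ in range(BlockSize)]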

To the original question and subject header: if in future I come across any
cases where I feel the need to drop to the SQL expression language, I now
have some extra tools available to work with.  I expect the #1 approach will
simply be to expire any related attributes and let the ORM refresh them as
needed, as that seems like the simplest and least sketchy route.  If I'm
desperately trying to avoid the extra SELECTs at some point, I'll dip into
something like make_detached(), but I agree with you that the simple
existence of such a thing makes life more complicated for people who don't
know what they are doing.
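
For reference, the expire route I have in mind is just something like this
(a sketch; addresses_table here is a stand-in name for the mapped Table):

# bulk insert via the SQL expression language, then expire the stale
# collection so the ORM re-SELECTs it on next access:
sess.execute(addresses_table.insert(), rows)  # rows is a list of dicts
sess.expire(joe, ['addresses'])
for addr in joe.addresses:  # first access triggers the refresh SELECT
    print addr.id, addr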

Once again - thanks!

Russ
