> > I understand this somewhat and had done that, but am still confused by
> > one thing: in my pure ORM case (for "joe"), how did the identity map get
> > a hold of the auto-increment ids when there was no apparent emitted SQL?
> > It seems there wasn't one done, although I suppose it could have been
> > done behind the scenes and not logged?
>
> the examples you showed me on pastebin all seemed to involve the ORM going
> out to fetch rows, so when they are fetched via SELECT all the info is
> there.
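For reference, here is a minimal, runnable reconstruction of the kind of mapping in play. The class and column names are guesses at the pastebin code, not the original, and the import paths assume SQLAlchemy 1.4+:

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    addresses = relationship("Address")

class Address(Base):
    __tablename__ = "addresses"
    id = Column(Integer, primary_key=True)
    email = Column(String)
    FK_user_id = Column(Integer, ForeignKey("users.id"))

engine = create_engine("sqlite://")  # in-memory; echo=True shows the SQL log
Base.metadata.create_all(engine)

sess = Session(engine)
joe = User(name="joe")
joe.addresses = [Address(email="joe@example.com") for _ in range(5)]
sess.add(joe)
sess.flush()  # emits only INSERTs, yet the ids come back populated
print([addr.id for addr in joe.addresses])
```

With echo=True on the engine, the log for this flush contains INSERT statements only - no SELECT - which is exactly the behaviour discussed below.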
But there *was* no apparent SELECT! I don't think I expressed myself
properly... below is a dump of the SQL logging for my pastebin example (some
prefix cruft removed):

    BEGIN (implicit)
    INSERT INTO users (name) VALUES (?)
    ('joe',)
    INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
    ('jo...@example.com', 1)
    INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
    ('jo...@example.com', 1)
    INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
    ('jo...@example.com', 1)
    INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
    ('jo...@example.com', 1)
    INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
    ('jo...@example.com', 1)
    INSERT INTO users (name) VALUES (?)
    ('bob',)
    INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
    (('bo...@example.com', 2), ('bo...@example.com', 2), ('bo...@example.com', 2),
     ('bo...@example.com', 2), ('bo...@example.com', 2))

And this is *immediately* followed by the "for addr in joe.addresses: print
addr.id, addr" print loop, which prints:

    1 <Address('jo...@example.com')>
    2 <Address('jo...@example.com')>
    3 <Address('jo...@example.com')>
    4 <Address('jo...@example.com')>
    5 <Address('jo...@example.com')>

Note that the ids are there, and there wasn't a SELECT to be found anywhere
in the SQL emission logs. This is what leads me to believe that either A)
there was one and it just didn't get logged, or B) they are somehow coming
back on the insert. If B, I'd like to get a hold of that for my bulk
inserts.

> so really you're only concerned about a bulk INSERT, ok

Yes, and this comes exclusively from actual measurements and observed query
behavior (and that is really only the INSERT flood when using the ORM only).

> > Thanks - I'm not sure I 100% get it.
> > I think it is simply that I can make an ORM object instance not
> > associated with a session (I think you call this "detached"), directly
> > assign the PK (if I know it), and then when I call Session.add() the ORM
> > will trust me and use the provided PK for integrating the object into
> > the identity map. Is that right?
>
> Correct, but I've never had anyone do this before. I'm not yet sure I like
> advising someone to manipulate the Session directly in this way. We need
> to be careful what we advise, because in two weeks it will show up in
> tutorials everywhere and then we get complaints that "SQLAlchemy is too
> hard" because everyone thinks they're supposed to put things straight into
> the identity map (exaggerating here, but only slightly).

No worries - I will steer very clear of anything that is not advised. My
initial objective was to find out what *is* advised when cheating on the ORM
and doing things (bulk inserts) directly, as well as trying to figure out
some other things, like how autoinc ids are magically known without an
apparent SELECT. You are being very helpful on all counts.

> well, therein lies why the ORM most of the time can't use bulk INSERT
> calls in a backend-neutral way. Your approach, which essentially is that
> of swapping in some optimized code where flush() would normally take
> place, is subject to the same rules of the road. Assuming the backend is
> one that uses a "get the most recent ID" approach - which includes SQLite,
> MySQL and MSSQL - the Python DBAPI, and in fact every database client API
> I've ever worked with, only supports returning you the "last inserted id"
> one INSERT at a time: that is, only via cursor.execute(), and not
> cursor.executemany() (background on the DBAPI at
> http://www.python.org/dev/peps/pep-0249/). On PostgreSQL, Oracle, and
> Firebird, the underlying construct used to generate ids is the sequence -
> on those, you could run the sequence externally to the inserts for the
> full range of integer IDs first.
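The cursor.execute() vs. cursor.executemany() point just quoted can be seen directly with the stdlib sqlite3 driver (the table and email values here are just for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (id INTEGER PRIMARY KEY, email TEXT)")
cur = conn.cursor()

# One INSERT at a time: the generated key comes back via cursor.lastrowid.
ids = []
for i in range(3):
    cur.execute("INSERT INTO addresses (email) VALUES (?)",
                ("joe%d@example.com" % i,))
    ids.append(cur.lastrowid)
print(ids)  # [1, 2, 3]

# The bulk path: a single executemany() call is faster, but PEP 249 leaves
# lastrowid undefined here, so the generated ids cannot be read back per row.
cur.executemany("INSERT INTO addresses (email) VALUES (?)",
                [("bob%d@example.com" % i,) for i in range(3)])
```

This is why a backend-neutral ORM flush that needs the new ids back is stuck with one execute() per row on lastrowid-style backends.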
The main application is using PostgreSQL, but at the moment there is no
Postgres-specific code, and everything still works perfectly (and is tested
regularly) on SQLite. The latter being mostly for quick/easy debugging.

Snagging a bunch of ids from the sequence in advance is an interesting idea
to avoid the "post fetch of primary key" you mentioned. That will be blazing
fast, versus reading back the ids after the inserts (assuming no magic id
returns as per earlier). The added bonus/appeal with that is that the ORM
can work normally, without having to resort to the trickery that merging
non-ORM operations into the ORM requires.

> so to do this, you'd need:
>
> 1. a way to make primary key ids fast, i.e. a sequence, or not using
> autoinc if you're on MySQL, etc.
>
> 2. a function that will copy those inserted rows into ORM objects, where
> you use set_committed_state() for those attributes you'd like loaded (must
> include at least the primary key identifying attributes) and state.key =
> instance_key(obj) (paraphrased) to set up the state, and use
> session.add(). At some point SQLA could grow a make_detached(obj)
> function.
>
> 2a: Fine, here's make_detached:
>
>     from sqlalchemy.orm.attributes import instance_state
>
>     def make_detached(obj):
>         """Given a transient object with state, turn it into 'detached',
>         that is, has an instance key based on its primary key identity.
>         obj must have primary key attributes populated."""
>         state = instance_state(obj)
>         state.key = state.mapper._identity_key_from_state(state)

Thanks! Although after really chewing on all of this and trying out
different things, I'm now content to stay in the ORM and keep away from the
need for that make_detached function (see below). Your pointing out of the
batch insert behaviour when pre-assigning PKs with ORM inserts is the big
win here, I think.
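That batching behaviour can also be verified at the cursor level with an event hook. A sketch, assuming SQLAlchemy 1.4+ import paths and a made-up single-table model:

```python
from sqlalchemy import Column, Integer, String, create_engine, event
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Address(Base):
    __tablename__ = "addresses"
    id = Column(Integer, primary_key=True)
    email = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

insert_calls = []  # one entry per cursor-level INSERT call

@event.listens_for(engine, "before_cursor_execute")
def track(conn, cursor, statement, parameters, context, executemany):
    if statement.startswith("INSERT"):
        insert_calls.append(executemany)

sess = Session(engine)
# PKs pre-assigned up front, so the flush needs nothing back from the DB
# and can batch all rows into a single executemany() INSERT.
mails = [Address(id=10000 + i, email="joe%d@example.com" % i)
         for i in range(5)]
sess.add_all(mails)
sess.flush()
print(insert_calls)
```

If the PKs were left unassigned, the same hook would instead record one execute() per row on a lastrowid-style backend.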
I confirmed this batch insert ORM behaviour with the following simple
snippet:

    numEmails = 5
    joeEmails = [Address("joe%...@example.com" % i) for i in range(numEmails)]
    # Now manually assign some known unique PK ids in advance...
    #  - this is Bad code and is only a cheat for this example
    for i, mail in enumerate(joeEmails, 10000):
        mail.id = i
    joe.addresses = joeEmails
    sess.add(joe)
    # flush and see that all addresses are bulk inserted with a single INSERT...
    sess.flush()

So I never need to leave the ORM, and don't even need to do trickery like
forcing unnatural transient->detached state transitions (although all of
this has been extremely informative). The only remaining thing is to make
sure my manually assigned PKs are legit/unique. In my case, with PostgreSQL,
I can hard-code the fetching of a block of ids from a sequence with:

    def GetSequenceBlock(ORMSession, SeqName, BlockSize):
        q = "SELECT nextval('%s') FROM generate_series(1, %d)" \
            % (SeqName, BlockSize)
        ...

Going further, I've now successfully implemented this technique with
PostgreSQL in a bunch of code, sticking with the ORM the whole way. I've
added this classmethod to the base class I use for my DeclarativeBase:

    class MyOrmBase(object):
        @classmethod
        def GetIdSequenceBlock(cls, ORMSession, BlockSize):
            """Returns a list of BlockSize ids from the sequence for the
            current table's id. If no sequence can be found (dialect
            doesn't support it?) an empty list is returned.

            Assumptions:
              - the column with the sequence is named "id"
              - standard SQLAlchemy sequence naming convention

            Currently only supports PostgreSQL.
            """
            seqName = "%s_id_seq" % cls.__tablename__
            q = "SELECT nextval('%s') FROM generate_series(1, %d)" \
                % (seqName, BlockSize)
            try:
                ids = [v[0] for v in ORMSession.execute(q).fetchall()]
            except Exception:  # could be various exceptions
                ids = []
            return ids

And when I want to do a bulk insert using this, I do something like:

    mails = [Address("joe%...@example.com" % i) for i in range(mail_count)]
    pregen_ids = cls.GetIdSequenceBlock(sess, mail_count)
    for i, id in enumerate(pregen_ids):
        mails[i].id = id
    # and this flush does it all in one INSERT...
    #  - for PostgreSQL, anyway
    #  - for SQLite it does an INSERT per row, due to no PK pre-assign
    sess.flush()

I'm quite happy with that, except for the fact that it breaks the seal and
introduces PostgreSQL-specific code (raw SQL at that). I really wish there
was a way to do this with the stock autoincrementing PK in SQLite, but it
seems I'll have to use a different method of pre-fetching ids there.

On the raw SQL point, I see that calling nextid = connection.execute(seq)
will get the next value in the sequence for all relevant dialects, but there
doesn't seem to be an equivalent way in the dialects to get a batch of them
in one shot.

To the original question and subject header: if in future I come across any
cases where I feel the need to drop to the SQL expression language, I now
have some extra tools available to work with it. I expect the #1 approach
will simply be to expire any related attributes and let the ORM refresh as
needed, as that seems like the simplest and least sketchy route. If I'm
desperately trying to avoid the extra SELECTs at some point, I'll dip into
something like make_detached(), but I agree with you that the simple
existence of such a thing makes it more complicated when you don't know what
you are doing.

Once again - thanks!

Russ

--
You received this message because you are subscribed to the Google Groups
"sqlalchemy" group.