Re: [sqlalchemy] Speed up bulk inserts

2013-11-13 Thread Achim

On Wednesday, November 6, 2013 21:58:53 UTC+1, Michael Bayer wrote:

 I wrote a full post regarding this topic on stackoverflow at  
 http://stackoverflow.com/questions/11769366/why-is-sqlalchemy-insert-with-sqlite-25-times-slower-than-using-sqlite3-directly/11769768#11769768.
   If you start with this, I can answer more specific questions. 


The article was very helpful, thanks. I still want to figure out the best 
balance between convenience and speed for my use case. Does the following 
make sense, and is it possible?

I work only with PostgreSQL and I'm sure that all involved objects have a 
unique id column called 'id'. So before doing a session.commit(), I could 
check how many objects are in my session. As I'm just bulk inserting, I know 
that all of them are new and don't have their id set yet. Now I ask the 
database for that number of new ids, iterate over the objects in my session, 
and set the ids. Internally all ids would come from a single sequence, so I 
don't have to care about object types and so on. Afterwards SQLAlchemy should 
be aware that the ids have already been set, so no generated ids have to be 
returned, and the session.commit() should be much simpler and faster.
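
In code I imagine something like this (just a sketch; the helper and the 
sequence name 'global_id_seq' are made up):

from sqlalchemy import text

def preassign_ids(session, seq_name='global_id_seq'):
    # everything pending in the session is a new insert without an id yet
    pending = [obj for obj in session.new if obj.id is None]
    if not pending:
        return
    # draw exactly len(pending) values from the shared sequence
    # in a single round trip
    rows = session.execute(
        text("SELECT nextval(:seq) FROM generate_series(1, :n)"),
        {'seq': seq_name, 'n': len(pending)})
    for obj, (new_id,) in zip(pending, rows):
        obj.id = new_id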

Sounds like a still quite simple, but hopefully much faster solution. Do 
you agree?

kind regards,
Achim



[sqlalchemy] Is it possible to obtain actual values inserted into database in a SessionExtension?

2013-11-13 Thread Mark Bird
I am trying to write a generic auditing extension, but I have some 
inconsistencies I'd like to iron out if possible.

The problem I have is that when an object is updated via a relationship 
rather than a Column attribute, the value returned from 
attributes.get_history is a SQLAlchemy object, and not the value that is in 
the database. This is more pronounced on a many-to-many relationship - the 
secondary table isn't marked as new/dirty, even though that is the table 
that will actually have data inserted into it. Instead the 2 tables joined 
by the secondary table are marked as dirty, although no data is changed in 
those tables. I understand this makes sense at the SQLAlchemy level - but 
is there a way to get the actual database changes? 

A code example to illustrate this:

from sqlalchemy.types import Integer
from sqlalchemy import Column, ForeignKey, create_engine
from sqlalchemy.orm import relationship, backref, attributes, sessionmaker
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm.interfaces import SessionExtension

BASE = declarative_base()

ENGINE = create_engine("sqlite://")

class AuditListener(SessionExtension):
    def after_flush(self, session, *args):
        for attr in ("new", "dirty", "deleted"):
            print attr
            for obj in getattr(session, attr):
                print "%s" % obj.__class__.__name__
                for col in obj.__mapper__.iterate_properties:
                    added, unchanged, deleted = attributes.get_history(
                        obj, col.key)
                    if added:
                        print "%s: %s" % (col.key, str(added))

SESSION = sessionmaker(
    bind=ENGINE,
    extension=(AuditListener())
)

class FooBar(BASE):
    __tablename__ = 'foobar'
    foo_id = Column('foo_id',
                    ForeignKey('foo.foo_id'),
                    primary_key=True)

    bar_id = Column('bar_id',
                    ForeignKey('bar.bar_id'),
                    primary_key=True)

class Foo(BASE):
    __tablename__ = 'foo'
    foo_id = Column('foo_id',
                    Integer,
                    primary_key=True)

    bars = relationship('Bar',
                        secondary=FooBar.__table__,
                        backref=backref('foos'))

class Bar(BASE):
    __tablename__ = 'bar'
    bar_id = Column('bar_id',
                    Integer,
                    primary_key=True)


def example():
    BASE.metadata.create_all(ENGINE, checkfirst=False)

    session = SESSION()
    f = Foo(foo_id=1)
    b1 = Bar(bar_id=1)
    b2 = Bar(bar_id=2)
    session.add(f)
    session.add(b1)
    session.add(b2)
    session.commit()

    # add relationship between f and b1 & b2:
    f.bars = [b1, b2]
    session.commit()

if __name__ == '__main__':
    example()

If you run this code you will see this output:

new
Bar
bar_id: [2]
Foo
foo_id: [1]
Bar
bar_id: [1]
dirty
deleted
new
dirty
Bar
Foo
bars: [<__main__.Bar object at 0x27b4910>, <__main__.Bar object at 
0x277b050>]
Bar
deleted

As you can see, when the many-to-many relationship is committed the output 
only denotes an update to the bars relationship of Foo. There is nothing 
indicating that 2 new FooBar entries were created.

Any ideas?

Thanks,

Mark.



Re: [sqlalchemy] Is it possible to obtain actual values inserted into database in a SessionExtension?

2013-11-13 Thread Michael Bayer

On Nov 13, 2013, at 7:00 AM, Mark Bird mark.a.b...@gmail.com wrote:

 I am trying to write a generic auditing extension, but I have some 
 inconsistencies I'd like to iron out if possible.
 
 The problem I have is that when an object is updated via a relationship 
 rather than a Column attribute, the value returned from 
 attributes.get_history is a SQLAlchemy object, and not the value that is in 
 the database. This is more pronounced on a many-to-many relationship - the 
 secondary table isn't marked as new/dirty, even though that is the table that 
 will actually have data inserted into it. Instead the 2 tables joined by the 
 secondary table are marked as dirty, although no data is changed in those 
 tables. I understand this makes sense at the SQLAlchemy level - but is there 
 a way to get the actual database changes? 


yeah I’ve tried to stress in the documentation that using “secondary” in 
relationship() in conjunction with an ORM object that is actually mapped to 
that table will lead to confusion.  There is no history present for the table 
that you specify for “secondary”; it is only subject to inserts/deletes based 
on changes to the collection to which it refers, e.g. Foo.bars, and these 
inserts/deletes are calculated and executed within a closed flush action.  The 
FooBar class does not play a role of any kind in this operation since you have 
not instantiated a FooBar object.

You need to decide whether you want to work with Foo-FooBar-Bar objects, or 
with Foo-Bar directly with “secondary” in between, and then stick to just one 
or the other.
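
e.g., a rough sketch of the association-object style (the extra relationship 
here is a hypothetical addition to your classes), where the FooBar rows are 
created explicitly and therefore show up in session.new:

class Foo(BASE):
    __tablename__ = 'foo'
    foo_id = Column('foo_id', Integer, primary_key=True)
    # map the association rows directly instead of using "secondary"
    foobars = relationship('FooBar', backref='foo')

f = Foo(foo_id=1)
session.add(f)
# an explicit FooBar object lands in session.new, so the audit
# listener sees the insert into the association table
f.foobars.append(FooBar(bar_id=1))
session.commit()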



 
 A code example to illustrate this:
 
 from sqlalchemy.types import Integer
 from sqlalchemy import Column, ForeignKey, create_engine
 from sqlalchemy.orm import relationship, backref, attributes, sessionmaker
 from sqlalchemy.ext.declarative import declarative_base
 from sqlalchemy.orm.interfaces import SessionExtension
 
 BASE = declarative_base()
 
 ENGINE = create_engine(sqlite://)
 
 class AuditListener(SessionExtension):
 def after_flush(self, session, *args):
 for attr in (new, dirty, deleted):
 print attr
 for obj in getattr(session, attr):
 print %s % obj.__class__.__name__
 for col in obj.__mapper__.iterate_properties:
 added, unchanged, deleted = attributes.get_history(obj, 
 col.key)
 if added:
 print %s: %s % (col.key, str(added))
 
 SESSION = sessionmaker(
 bind=ENGINE,
 extension=(AuditListener())
 )
 
 class FooBar(BASE):
 __tablename__ = 'foobar'
 foo_id = Column('foo_id',
 ForeignKey('foo.foo_id'),
 primary_key=True)
 
 bar_id = Column('bar_id',
 ForeignKey('bar.bar_id'),
 primary_key=True)
 
 class Foo(BASE):
 __tablename__ = 'foo'
 foo_id = Column('foo_id',
 Integer,
 primary_key=True)
 
 bars = relationship('Bar',
 secondary=FooBar.__table__,
 backref=backref('foos'))
 
 class Bar(BASE):
 __tablename__ = 'bar'
 bar_id = Column('bar_id',
 Integer,
 primary_key=True)
 
 
 def example():
 BASE.metadata.create_all(ENGINE, checkfirst=False)
 
 session = SESSION()
 f = Foo(foo_id=1)
 b1 = Bar(bar_id=1)
 b2 = Bar(bar_id=2)
 session.add(f)
 session.add(b1)
 session.add(b2)
 session.commit()
 
 # add relationship between f and b1  b2:
 f.bars = [b1, b2]
 session.commit()
 
 if __name__ == '__main__':
 example()
 
 If you run this code you will see this output:
 
 new
 Bar
 bar_id: [2]
 Foo
 foo_id: [1]
 Bar
 bar_id: [1]
 dirty
 deleted
 new
 dirty
 Bar
 Foo
 bars: [__main__.Bar object at 0x27b4910, __main__.Bar object at 
 0x277b050]
 Bar
 deleted
 
 As you can see, the output when the many-to-many relationship is committed is 
 to denote an update to the bars relationship of Foo. There is nothing 
 indicating there are 2 new FooBar entries created.
 
 Any ideas?
 
 Thanks,
 
 Mark.
 
 
 -- 
 You received this message because you are subscribed to the Google Groups 
 sqlalchemy group.
 To unsubscribe from this group and stop receiving emails from it, send an 
 email to sqlalchemy+unsubscr...@googlegroups.com.
 To post to this group, send email to sqlalchemy@googlegroups.com.
 Visit this group at http://groups.google.com/group/sqlalchemy.
 For more options, visit https://groups.google.com/groups/opt_out.





Re: [sqlalchemy] Speed up bulk inserts

2013-11-13 Thread Michael Bayer

On Nov 13, 2013, at 4:57 AM, Achim do...@procoders.net wrote:

 
 
 Sounds like a still quite simple, but hopefully much faster solution. Do you 
 agree?


sure that should be fine, if you can pre-calc your PKs.   It just won’t work 
under any kind of concurrency, as in such a situation there could be 
interleaved INSERTs from different processes.






Re: [sqlalchemy] Speed up bulk inserts

2013-11-13 Thread Michael Bayer

On Nov 13, 2013, at 11:52 AM, Claudio Freire klaussfre...@gmail.com wrote:

 On Wed, Nov 13, 2013 at 1:45 PM, Michael Bayer mike...@zzzcomputing.com 
 wrote:
 
 
 
 
 sure that should be fine, if you can pre-calc your PKs.   It just won’t work
 under any kind of concurrency, as in such a situation there could be
 interleaved INSERTs from different processes.
 
 
 Postgresql sequences already handle that kind of concurrency scenario.


how exactly? if two transactions T1 and T2 both pull a number from a sequence, 
T1 gets 40, T2 gets 41, then we pre-calc 10 inserts that have not yet occurred 
for each, T1 has 40-50, T2 has 41-51, the number is totally wrong for both - in 
reality it would be some random distribution of 40-60 between T1 and T2.   No?






Re: [sqlalchemy] Speed up bulk inserts

2013-11-13 Thread Achim
According to 
http://www.postgresql.org/docs/9.2/static/functions-sequence.html, 
sequences are non-transactional:

"Advance the sequence object to its next value and return that value. This 
is done atomically: even if multiple sessions execute nextval concurrently, 
each will safely receive a distinct sequence value.
...
To avoid blocking concurrent transactions that obtain numbers from the 
same sequence, a nextval operation is never rolled back; that is, once a 
value has been fetched it is considered used, even if the transaction that 
did the nextval later aborts. This means that aborted transactions might 
leave unused holes in the sequence of assigned values."



Re: [sqlalchemy] Speed up bulk inserts

2013-11-13 Thread Claudio Freire
On Wed, Nov 13, 2013 at 2:04 PM, Michael Bayer mike...@zzzcomputing.com wrote:



 how exactly, if two transactions T1 and T2 both pull a number from a 
 sequence, T1 gets 40, T2 gets 41, then we pre-calc 10 inserts that have not 
 yet occurred for each, T1 has 40-50, T2 has 41-51, the number is totally 
 wrong for both - in reality it would be some random distribution of 40-60 
 between T1 and T2.   No ?


No, you ask the same sequence for 10 ids, and the sequence allocates 
T1 40-49 and T2 50-59.



Re: [sqlalchemy] Speed up bulk inserts

2013-11-13 Thread Michael Bayer

On Nov 13, 2013, at 12:22 PM, Claudio Freire klaussfre...@gmail.com wrote:

 
 
 No, you ask for 10 ids to the same sequence, and the sequence
 allocates T1 40-49, and T2 50-59

oh right, I’m thinking of the trick as applied to MySQL.   Sure, if you 
actually run the seq’s, you’re fine.




Re: [sqlalchemy] A noop query?

2013-11-13 Thread Michael Bayer

On Nov 12, 2013, at 6:45 AM, warvariuc victor.varvar...@gmail.com wrote:

 I have a model with a relationship:
 
 class Customer(Base):
 
     __tablename__ = 'customer'
     ...
     customer_wishes = relationship('CustomerWish', back_populates='customer',
                                    lazy='subquery')
 
 When I query a customer from the db:
 
 15:32:00INFO [sqlalchemy.colorsql]:  0.005s  SELECT * FROM customer WHERE 
 customer.id = 1599622
 15:32:00INFO [sqlalchemy.colorsql]:  0.004s  SELECT * FROM (SELECT 
 customer.id AS customer_id FROM customer WHERE customer.id = 1599622) AS 
 anon_1
 JOIN customer_wish ON anon_1.customer_id = customer_wish.customer_id
 ORDER BY anon_1.customer_id
 
 This subquery (SELECT customer.id AS customer_id FROM customer WHERE 
 customer.id = 1599622) looks strange. Not sure if it affects performance, 
 or whether it has any meaning at all?

it’s not ideal, but the loading system currently doesn’t try to second-guess 
that the WHERE clause actually means the whole subquery isn’t needed.   There’s 
another kind of loading I may add someday where, instead of a subquery, we do an 
IN with the IDs we already have; that would alleviate this particular issue.  
The primary key is indexed, so the lookup isn’t bad here, though overall the 
subquery makes life a bit harder for the DB (much harder if you’re on MySQL).
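
in the meantime you can do that IN load by hand; a rough sketch with the 
models from your question (assuming a CustomerWish.customer_id column):

wishes = (
    session.query(CustomerWish)
    .filter(CustomerWish.customer_id.in_([1599622]))
    .all()
)
# emits: SELECT ... FROM customer_wish
#        WHERE customer_wish.customer_id IN (1599622)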







Re: [sqlalchemy] pre-populated instances

2013-11-13 Thread Michael Bayer
typically if I had “create_all()” running in my app I’d probably have a 
function like “install_payload()” right after it that puts in all the data we 
need.

actually what I usually do is a pipeline approach:

install_app() -> create_tables(), install_payload_x(), install_payload_y(), 
etc.
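
a rough sketch of that pipeline (Language as in your message; the base class 
and everything else here is hypothetical):

from sqlalchemy.orm import Session

def install_payload_languages(session):
    # seed rows the app always expects to exist
    for code in ('en_US', 'pt_BR', 'es_ES'):
        session.add(Language(code))

def install_app(engine):
    # schema first, then each payload step, then one commit
    Base.metadata.create_all(engine)
    session = Session(bind=engine)
    install_payload_languages(session)
    session.commit()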



On Nov 11, 2013, at 7:54 AM, Richard Gerd Kuesters rich...@humantech.com.br 
wrote:

 Hi all,
 
 I have a CI that generates my app database. So far so good.
 
 I would like to tweak them a bit. Many of these instances (like Language) 
 can have pre-populated values (en_US, pt_BR, es_ES, etc.). What's the 
 recommended technique (or the one closest to Python+SA) to accomplish this?
 
 For now I'm using the after_create event, but it doesn't work with 
 declarative (unless Language.__table__ is used), and I would rather use the 
 natural way of creating objects: lang = Language('pt_BR'); 
 session.add(lang)
 
 
 Any ideas?
 
 Kind regards,
 Richard.
 





Re: [sqlalchemy] Mixing ORM and set-returning-functions (e.g. generate_series)

2013-11-13 Thread Daniel Grace
I know it's been a few days, but I've finally had time to actually get back 
to this

On Thursday, November 7, 2013 6:18:57 PM UTC-8, Michael Bayer wrote:

 On Nov 7, 2013, at 6:46 PM, Daniel Grace thisgen...@gmail.com wrote:

[...]

 that’s basically equivalent.   If you want the identical syntax with that 
 function, the @compiles construct would be a start, but you’d be probably 
 making a custom FromClause subclass which is a little involved.


I'd think that a generic form of this might actually be a way to go, but 
I'm not (yet) familiar enough with sqlalchemy's internals to make a lot of 
headway in designing any 'sane' solution.

It's worth noting PostgreSQL supports column-level aliasing on anything, 
not just a set-returning-function, such that the following is valid:

CREATE TABLE foo ( id SERIAL NOT NULL);
INSERT INTO foo ( id ) VALUES (1), (2), (3);
SELECT bar.baz FROM foo AS bar(baz);

That said, it's not cleanly usable in most cases -- you can't alias a 
column by name, so you have to know the exact order columns appear in 
(which you might not know if you didn't reflect nor create the table -- a 
column definition that doesn't appear in the model might be in the table, 
for instance).

That said, it'd seem like a generic "Set Returning Function" implementation 
would need to do the following:
* Subclass from GenericFunction to track data types of input values and 
whatever voodoo is required for bind parameters.  (I think this also gives 
us the behavior of "SELECT generate_series(...)" being treated like "SELECT 
FROM ... generate_series(...)" for free)
* Subclass from FromClause or perhaps even Alias to track types and names 
of output values in self.c
* Have an alternate/extended 'alias' implementation capable of defining 
aliases on a per-column level, so "AS alias" becomes "AS alias(column_alias, 
...)"

So usage might be something like:

series = sql.srfunc.generate_series(1, 10, output=[Column('value', 
Integer())])
foo = series.alias("foo").column_alias('value', 'v')
bar = foo.alias("bar")  # I'm assuming that aliasing an existing alias Does 
The Right Thing(tm), I've never tried it.

session.query(series.c.value) 
# SELECT generate_series.value FROM generate_series(1, 10) AS 
generate_series(value)

session.query(foo.c.v)
# SELECT foo.v FROM generate_series(1, 10) AS foo(v)

session.query(foo.c.v + bar.c.v).select_from(foo).join(bar, foo.c.v < 
bar.c.v) 
# SELECT foo.v + bar.v FROM generate_series(1, 10) AS foo(v) JOIN 
generate_series(1, 10) AS bar(v) ON foo.v < bar.v;



Thoughts?

-- Daniel 



Re: [sqlalchemy] Mixing ORM and set-returning-functions (e.g. generate_series)

2013-11-13 Thread Michael Bayer
On Nov 13, 2013, at 6:12 PM, Daniel Grace thisgenericn...@gmail.com wrote:

 [...]

see attached for demo, should get you started

from sqlalchemy.sql.selectable import FromClause

from sqlalchemy import select, func, Column, Integer
from sqlalchemy.ext.compiler import compiles

class PGAsFunc(FromClause):
    named_with_column = True

    def __init__(self, func, columns, alias=None):
        self.func = func
        self._column_args = columns
        self.name = alias or func.name
        self._from_objects = [self]

    def alias(self, name):
        return PGAsFunc(self.func, self.c, name)

    def column_alias(self, from_, to):
        pg = PGAsFunc(
            self.func,
            self.c,
            self.name)
        # this isn't really a built-in operation so klunk it through
        pg.c
        pg._columns[to] = pg._columns[from_]._make_proxy(pg, to)
        pg._columns.remove(pg.c[from_])
        return pg

    def _populate_column_collection(self):
        for c in self._column_args:
            c._make_proxy(self)

class MyFuncThing(object):
    def __getattr__(self, key):
        def create_function(*args, **kw):
            output = kw['output']
            return PGAsFunc(getattr(func, key)(*args), output)
        return create_function


@compiles(PGAsFunc)
def compile(element, compiler, **kw):
    col_kwargs = kw.copy()
    col_kwargs['include_table'] = False
    return "%s AS %s(%s)" % (
        compiler.process(element.func, **kw),
        element.name,
        ", ".join(
            compiler.process(expr, **col_kwargs)
            for expr in element.c
        )
    )


series = MyFuncThing().generate_series(1, 10, output=[Column('value', Integer())])
foo = series.alias("foo").column_alias('value', 'v')
bar = foo.alias("bar")  # I'm assuming that aliasing an existing alias Does The Right Thing(tm), I've never tried it.

from sqlalchemy.orm import Session
session = Session()

print session.query(series.c.value)
print session.query(foo.c.v)
print session.query(foo.c.v + bar.c.v).select_from(foo).join(bar, foo.c.v < bar.c.v)

