Re: [sqlalchemy] session.expire{,_all}_unmodified for dropping objects that are unchanged?!

2018-01-18 Thread Torsten Landschoff
Hello Mike,

first, thanks for considering my ramblings and thanks for the reply!

On 17.01.2018 23:23, Mike Bayer wrote:
> this is overkill. Here's what your code looks like without an ORM: 
Yes, I think my message was lost in translation or rather in
oversimplification.
> this is very specific to the very quirky and odd thing I'm doing in
> the first place, which is that I'm creating an endlessly long linked
> list - I detach the element.
The actual code does not create an endlessly long linked list of course,
it's more like a directed acyclic graph.
Nodes may have many outgoing edges so it's not so simple to know where
to cut :-)
>> This made me think about our usage of "session.expire": Mostly, we use it to
>> tell SQLAlchemy that this is a good time to forget this object in the
>> expectation that there are no pending changes. I found 13 calls to
>> session.expire in our code base.
> The way that loop is structured, I would not want to expire all
> attributes that have been flushed.   When you make a new Change(), it
> has a predecessor for which it needs to locate its primary key in
> order to assemble it into the foreign key column of the successor.  If
> you've expired that Change() object completely, you are forcing the
> ORM to un-expire the object using another SELECT to ensure it hasn't
> been DELETEd, so that it can use the Change.id value.  The more

True. And I admit that I did not think about cutting the links; I just
wanted to drop objects (and all related links) that are no longer needed.
I actually prefetch the primary keys of all objects touched or referred
to by a batch of incoming objects. After processing the batch, I drop
all objects not referred to by the next incoming batch and load those
required for processing the next batch.
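In a heavily simplified sketch (expire_unneeded and needed_ids are just
names made up for this mail, and it assumes the mapped classes expose their
primary key as .id like Change above):

def expire_unneeded(session, needed_ids):
    """ Expire every flushed object whose primary key the next batch
    does not reference; *needed_ids* comes from the prefetch step. """
    for obj in list(session):          # iterating a Session yields its objects
        if obj.id not in needed_ids:
            session.expire(obj)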

> you can tell if an object is part of that "modified" collection using
> instance_state(obj).modified - if that's false, it's definitely clean.
> _modified isn't public.  You can also look inside of session.dirty.
Thanks for the pointer, I actually looked at the implementation of
session.dirty to find the internal _modified.
I replaced this with instance_state(obj).modified.
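So the check now looks roughly like this (a sketch; expire_if_clean is just
a name made up for this mail, and sqlalchemy.inspect(obj) would return the
same state object):

from sqlalchemy.orm.attributes import instance_state

def expire_if_clean(session, obj):
    """ Expire *obj* only if the ORM reports no pending changes on it. """
    if instance_state(obj).modified:
        return False        # definitely has pending changes, keep it
    session.expire(obj)
    return True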
> ".new" or ".deleted").But the most in-depth method,
> session.is_modified(), does that state.modified check first before
> digging in, so since you can likely tolerate false positives you can
> stick with that.
False positives are not an issue; if an instance is kept for one extra
batch, it will most likely be dropped in the next one.
> it's "safe" but as before if the object is involved in other objects
> still to be flushed, a new SELECT will need to be emitted to unexpire
> the object.
I dropped safe_expire_all, it was only there for completeness.
About the extra SELECT for unexpiring: I will gladly take that
performance hit over the risk of reintroducing that bug.

Calling expire() for an instance that had pending updates was
(obviously) a bug in my application. I just wanted to go one step
further than just fixing the spot where it appeared and make it harder
to go there again.
> the case you illustrate is surprising and unusual. I would mark it
> up to a specific case you have here where you happen to be building up
> an endlessly long chain of objects for which the issue of memory is
> most cleanly handled by an explicit periodic break of the chain.
The initial code was better here: session.commit() would have done that
for us, via the implicit expire_all directly after the flush.
As you noted above, this led to a number of extra SELECTs to unexpire
all the objects still needed.

BTW: Our application is a rich client where the user interface also
keeps mapped objects loaded in an ORM session, since reloading is quite
expensive. Updates are done mostly in worker threads and processes and
broadcast to all database sessions. For updating the display, all
modified objects are expired and refreshed from the database.
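In code that is roughly (a simplified sketch; refresh_display and
modified_ids are placeholders for whatever the broadcast delivers, and it
assumes the primary key is exposed as .id):

def refresh_display(ui_session, modified_ids):
    """ Re-read broadcast-modified objects in the UI session. """
    for obj in list(ui_session):
        if obj.id in modified_ids:
            ui_session.refresh(obj)   # expires the object and re-SELECTs its row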

This is where I actually messed up and refreshed from the DB inside a
running transaction - before flushing the changes entered by the user. m(
Finding the cause made me think of other misuses of expiration.
>> Note that the documentation does not warn about dropping pending changes due 
>> to calls to expire.
> it's mentioned here:
> http://docs.sqlalchemy.org/en/latest/orm/session_state_management.html#session-expire
> but just not in any kind of boldface:
Oops, sorry, I only checked the reference documentation.
> I actually probably wouldn't make expire() do what it does today, e.g.
> expire un-flushed changes.   When expire() was first created it was
> likely not possible or very easy to hit only DB-persisted attributes.
>  The attribute history system was rewritten several times since then
> and it probably would be pretty simple today to add a flag expire(obj,
> persistent_only=True) something like that.   But it's one of those
> things where when someone needs that, I want to see why, because I
> think there's usually another way to handle the situation.

Re: [sqlalchemy] session.expire{,_all}_unmodified for dropping objects that are unchanged?!

2018-01-17 Thread Mike Bayer
On Wed, Jan 17, 2018 at 11:06 AM, Torsten Landschoff wrote:
> Hello world,
>
> tl;dr: I would like to have the ability to expire objects from the session
> but only if they aren't new/dirty/deleted.
>
> Have a look at this example code:
>
> import random
>
> from sqlalchemy import Column, Integer, String, ForeignKey, create_engine
> from sqlalchemy.orm import relationship, backref, sessionmaker
> from sqlalchemy.ext.declarative import declarative_base
>
> Base = declarative_base()
>
> lorem_ipsum = """Lorem ipsum dolor sit amet, consectetur [...]""" * 42
>
> class Change(Base):
>     __tablename__ = "change"
>     id = Column(Integer, primary_key=True)
>     predecessor_id = Column(Integer, ForeignKey("change.id"))
>     comment = Column(String)
>
>     predecessor = relationship("Change", foreign_keys=[predecessor_id],
>                                uselist=False)
>
>     def __repr__(self):
>         return "<{0} #{1}: {2}>".format(self.__class__.__name__, self.id,
>                                         self.comment)
>
> engine = create_engine("sqlite:///demo.db", echo=False)
> Base.metadata.create_all(engine)
>
> session = sessionmaker(engine)()
>
> current_change = Change(comment="Initial commit")
> session.add(current_change)
>
> for batch in range(42):
>     for x in range(1000):
>         current_change = Change(
>             predecessor=current_change,
>             comment="Squashed bug #{0}\n\n{1}".format(
>                 random.randrange(0, 100), lorem_ipsum)
>         )
>         session.add(current_change)
>     session.flush()
>
> session.commit()
>
> Running this (with the complete lorem ipsum text) on my machine, I get this:
>
> $ ulimit -v 1024000
> $ python safe_expire.py
> Traceback (most recent call last):
>   File "safe_expire.py", line 61, in 
> random.randrange(0, 100), lorem_ipsum)
> MemoryError
>
>
> Quite obviously the problem here is that all the mapped objects are kept in
> memory even after being flushed (and before being committed).
> This is just a contrived example to show the effect I had in our application
> which (among other things) implements a DVCS on top of SQLAlchemy (on top of
> a DB).
>
> Our real life use is not as simple, basically incoming objects are inserted
> into the database bottom up (in SQL insertion order, because of foreign key
> constraints).
> For performance reasons the code processes incoming changes in batches and
> tries to keep the objects relevant for relationships and sanity checking in
> the database session - for us, session.expire_on_commit is set to False.
>
> During the batch operation the code would run out of memory, because nothing
> is expired automatically. This was a surprise to me, because the session
> only keeps a weak reference to its mapped objects. It seems like the
> relationships between objects keep them alive. As updates are processed
> bottom up this basically means that no mapped objects are ever dropped.
>
> This is easy to fix by manually expiring old entries from the session, like
> this:
>
> diff --git a/baseline.py b/baseline.py
> index 44e1b7a..df4eb24 100644
> --- a/baseline.py
> +++ b/baseline.py
> @@ -50,7 +50,9 @@ current_change = Change(comment="Initial commit")
>  session.add(current_change)
>
>  for batch in range(42):
> +    old_entries = []
>      for x in range(1000):
> +        old_entries.append(current_change)
>          current_change = Change(
>              predecessor=current_change,
>              comment="Squashed bug #{0}\n\n{1}".format(
> @@ -59,4 +61,7 @@ for batch in range(42):
>          session.add(current_change)
>      session.flush()
>
> +    for entry in old_entries:
> +        session.expire(entry)
> +
>  session.commit()
>
>
> So this is what I did.

this is overkill.  Here's what your code looks like without an ORM:

class Link(object):
    def __init__(self, previous=None):
        self.previous = previous


current_link = Link()
print("Current link!  %s" % current_link)

for batch in range(42):
    for x in range(1000):
        current_link = Link(current_link)
        print("Current link!  %s" % current_link)


This code has the exact same problem, without any expire() or anything
like that.So if you were say writing these objects to a file or
something like that, how would you deal with this loop building up an
enormous chain of un-collectable Link objects?  If it were me, I'd
simply do this:

for batch in range(42):
    for x in range(1000):
        current_link = Link(current_link)
        print("Current link!  %s" % current_link)

    current_link.previous = None

this is very specific to the very quirky and odd thing I'm doing in
the first place, which is that I'm creating an endlessly long linked
list - I detach the element.

The analogue with the ORM would be to expire this specific link, which
we can do each flush so that the entire previous chain goes away:

for batch in range(42):
    for x in range(1000):
        current_change = Change(
            predecessor=current_change,
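In full that loop might look like this (a sketch along the lines of the Link
example above, expiring only the "predecessor" attribute once per batch so
the previous chain can be garbage collected):

for batch in range(42):
    for x in range(1000):
        current_change = Change(
            predecessor=current_change,
            comment="Squashed bug #{0}\n\n{1}".format(
                random.randrange(0, 100), lorem_ipsum)
        )
        session.add(current_change)
    session.flush()

    # drop only the loaded "predecessor" reference; the FK column stays usable
    session.expire(current_change, ["predecessor"])

session.commit()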

[sqlalchemy] session.expire{,_all}_unmodified for dropping objects that are unchanged?!

2018-01-17 Thread Torsten Landschoff
Hello world,

tl;dr: I would like to have the ability to expire objects from the
session but only if they aren't new/dirty/deleted.

Have a look at this example code:

import random

from sqlalchemy import Column, Integer, String, ForeignKey, create_engine
from sqlalchemy.orm import relationship, backref, sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

lorem_ipsum = """Lorem ipsum dolor sit amet, consectetur [...]""" * 42class 
Change(Base):
__tablename__ = "change" id = Column(Integer, primary_key=True)
predecessor_id = Column(Integer, ForeignKey("change.id"))
comment = Column(String)

predecessor = relationship("Change", foreign_keys=[predecessor_id], 
uselist=False)

def __repr__(self):
return "<{0} #{1}: {2}>".format(self.__class__.__name__, self.id, 
self.comment)

engine = create_engine("sqlite:///demo.db", echo=False)
Base.metadata.create_all(engine)

session = sessionmaker(engine)()

current_change = Change(comment="Initial commit")
session.add(current_change)

for batch in range(42):
    for x in range(1000):
        current_change = Change(
            predecessor=current_change,
            comment="Squashed bug #{0}\n\n{1}".format(
                random.randrange(0, 100), lorem_ipsum)
        )
        session.add(current_change)
    session.flush()

session.commit()

Running this (with the complete lorem ipsum text) on my machine, I get this:

> $ ulimit -v 1024000
> $ python safe_expire.py
> Traceback (most recent call last):
>   File "safe_expire.py", line 61, in 
>     random.randrange(0, 100), lorem_ipsum)
> MemoryError

Quite obviously the problem here is that all the mapped objects are kept
in memory even after being flushed (and before being committed).
This is just a contrived example to show the effect I had in our
application which (among other things) implements a DVCS on top of
SQLAlchemy (on top of a DB).

Our real life use is not as simple; basically, incoming objects are
inserted into the database bottom up (in SQL insertion order, because of
foreign key constraints).
For performance reasons the code processes incoming changes in batches
and tries to keep the objects relevant for relationships and sanity
checking in the database session - for us, session.expire_on_commit is
set to False.

During the batch operation the code would run out of memory, because
nothing is expired automatically. This was a surprise to me, because the
session only keeps a weak reference to its mapped objects. It seems like
the relationships between objects keep them alive. As updates are
processed bottom up this basically means that no mapped objects are ever
dropped.
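A quick way to see this, as a sketch using the Change class from above and
no Session at all:

import gc
import weakref

a = Change(comment="first")
b = Change(predecessor=a, comment="second")
probe = weakref.ref(a)

del a
gc.collect()
print(probe())   # still alive: b.predecessor holds a plain strong reference

del b
gc.collect()
print(probe())   # None: once the successor is gone, the predecessor can go too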

This is easy to fix by manually expiring old entries from the session,
like this:

diff --git a/baseline.py b/baseline.py
index 44e1b7a..df4eb24 100644
--- a/baseline.py
+++ b/baseline.py
@@ -50,7 +50,9 @@ current_change = Change(comment="Initial commit")
 session.add(current_change)
 
 for batch in range(42):
+    old_entries = []
     for x in range(1000):
+        old_entries.append(current_change)
         current_change = Change(
             predecessor=current_change,
             comment="Squashed bug #{0}\n\n{1}".format(
@@ -59,4 +61,7 @@ for batch in range(42):
         session.add(current_change)
     session.flush()
 
+    for entry in old_entries:
+        session.expire(entry)
+
 session.commit()


So this is what I did. This works fine, until somebody (maybe myself)
changed the old objects without an intermediate flush.
This was really hard to track down: somehow updates never made it to the
database, and SQLAlchemy did not even execute the expected SQL.

This made me think about our usage of "session.expire": Mostly, we use
it to tell SQLAlchemy that this is a good time to forget this object in
the expectation that there are no pending changes. I found 13 calls to
session.expire in our code base.

So I would like to replace it with something that gives SQLAlchemy a
hint that it probably is a good time to drop these things.

I came up with this code:

def safe_expire(session, o):
    """ Drop *o* from *session* if it has no pending changes. """
    if o in session.identity_map._modified:
        return False
    else:
        session.expire(o)
        return True


def safe_expire_all(session):
    """ Drop all objects with no pending changes from *session*. """
    id_map = session.identity_map
    unmodified = [o for key, o in id_map.iteritems()
                  if o not in id_map._modified]

    for o in unmodified:
        session.expire(o)

    return unmodified
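In the example above I would then call it once per batch, roughly like this
(a sketch, not the actual application code):

for batch in range(42):
    for x in range(1000):
        current_change = Change(
            predecessor=current_change,
            comment="Squashed bug #{0}\n\n{1}".format(
                random.randrange(0, 100), lorem_ipsum)
        )
        session.add(current_change)
    session.flush()
    safe_expire_all(session)   # only objects without pending changes are dropped

session.commit()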


Does this look like a safe thing to do? (I wonder about updated
relationships etc.)

Am I alone running into adverse effects of session.expire? Note that the
documentation does not warn about dropping pending changes due to calls
to expire. In hindsight this seems quite obvious, but I did not think
about this when writing that code (because I knew the objects in
question are clean anyway).

Greetings and thanks for any insights,