Re: [sqlalchemy] Best practices for sub-dividing a large collection relationship

Dane K Barney Thu, 30 Dec 2021 18:04:28 -0800

Hey Mike, thanks for your speedy reply.

I guess I left out some key details and simplified my examples a little too 
much, which is probably why you're confused.


For one, I have actually implemented downstream_edges and upstream_edges as 
dictionary collections, i.e.:

src_node = relationship(
    "Node",
    foreign_keys=[src_node_id],
    backref=backref(
        "downstream_edges",
        collection_class=attribute_mapped_collection("downstream_key"),
    )
)

@property
def downstream_key(self):
    return self.dst_node, self.color

The purpose of this is to provide an easy way of doing "upserts" -- that 
is, checking if an edge already exists before inserting, since the edge 
table has a UNIQUE KEY (`src_node_id`,`dst_node_id`,`color`) that would be 
violated otherwise.

This allows me to do:

downstream_key = (dst_node, color)
try:
edge = node.downstream_edges[downstream_key]
except KeyError:
edge = Edge(dst_node=dst_node, color=color)
node.downstream_edges[downstream_key] = edge

I realize that this means self.downstream_edges is being fully loaded every 
time I do this "upsert". But this is why I can't follow your suggestion of 
simply adding Edges with Edge(src_node=src_node, dst_node=dst_node, 
color=color), because there's no check to prevent violating the UNIQUE KEY. 
So the question then is what is the most performant way of achieving this 
"upsert" behavior without triggering a full load of the downstream_edges.

There is a second problem I'm running into when trying to make use of 
dynamic relationships, or any relationship definition which requires flush 
+ query to load. When I add a new edge to a collection, if I want to have 
event listeners respond to those modifications, for example by using 
@event.listens_for(Session, "before_flush") 
then I don't seem to be allowed to reference any relationship inside that 
event listener which requires database querying to fetch. I'm assuming this 
is because the "before_flush" event listener cannot itself load a 
relationship that would require a flush to load, because we are already in 
mid-flush.
Given that, if I've added a new edge to downstream_edges, which is 
lazy="dynamic" let's say, how can I see all the items in that collection, 
or in a subset of that collection (which could be a lazy="dynamic" 
relationship or a primaryjoin + viewonly=True relationship) without 
requiring an additional flush at an a point in time when one is not allowed?

Note that I have also tried to accomplish this using "after_flush" and 
"after_flush_postexec" event listeners, but the problem with these is I 
don't seem to be allowed to make further ORM modifications inside these 
listeners which would require more UPDATE statements, as I get an SAWarning 
telling me what I'm doing is not recommended.
Thanks,
-Dane
On Thursday, December 30, 2021 at 5:33:38 PM UTC-8 Mike Bayer wrote:

>
>
> On Thu, Dec 30, 2021, at 8:10 PM, Dane K Barney wrote:
>
> Given the following example model:
>
> class Node(Base):
>     __tablename__ = "node"
>     id = Column(Integer, primary_key=True, autoincrement=True)
>
> class Edge(Base):
>     __tablename__ = "edge"
>     id = Column(Integer, primary_key=True, autoincrement=True)
>     color = Column(Enum(Color))
>     src_node_id = Column(Integer, ForeignKey("node.id"))
>     dst_node_id = Column(Integer, ForeignKey("node.id"))
>     src_node = relationship(
>         "Node",
>         foreign_keys=[src_node_id],
>         backref=backref("downstream_edges")
>     )
>     dst_node = relationship(
>         "Node",
>         foreign_keys=[dst_node_id],
>         backref=backref("upstream_edges")
>     )
>
> The collections Node.downstream_edges and Node.upstream_edges are so large 
> that it is detrimental to load them entirely on a regular basis (although I 
> still need the ability to do so). So for performance reasons, I am trying 
> to come up with a way of sub-dividing these relationships based on the 
> "color" attribute. One way I have tried to do this is by defining 
> additional relationships on the Node class which filter the query using a 
> custom primaryjoin. For example:
>
> blue_downstream_edges = relationship(
>     "Edge",
>     primaryjoin="and_(Node.id == Edge.src_node_id, Edge.color == 'blue'",
>     viewonly=True,
> )
>
> The downside to this approach is that the viewonly=True attribute means I 
> cannot directly make modifications to this collection. When I want to add 
> new Edges, I have to do it on the complete downstream_edges relationship, 
> which will trigger the full load that I'm trying to avoid.
>
>
> I would try to add new Edge objects without appending them to any list, 
> just create a new Edge(src_node=n1, dest_node=n2).   
>
>
>
> Another idea I had was to use the lazy="dynamic" attribute on the 
> downstream_edges and upstream_edges relationships, so that I could do 
> something like:
>
> blue_downstream_edges = node.downstream_edges.filter(Edge.color == 'blue')
>
>
> yes, you should put "dynamic" on these relationships.
>
>
> As I understand it, I would then be able to append items to this subset 
> collection. However, those changes would not be seen in the complete 
> downstream_edges collection.
>
>
> they would ?  they need to be flushed first, but that's how it works, if 
> you want to see the "complete collection" then you are loading that 
> complete collection.
>
> OTOH if you want to just have the "complete collection" all in memory at 
> once, OK, then ...you wouldn't use dynamic?   is it kind of like two 
> different modes of operation you want to use, one where collections are 
> fully in memory and another where they arent ?   
>
>
>
> In short, what I'm trying to achieve is a way to provide subsets of a 
> large collection, which are modifiable, and those modifications are visible 
> in both the larger collection and the subset collection.
>
> I'm guessing this would need to involve some sort of shared access to the 
> same in-memory ORM mapper class, because a modification made to one 
> representation of the collection that has not yet been flushed/persisted to 
> the database would not be visible to another representation of that 
> collection, assuming it requires a database query to load.
>
> Is this possible at all?
>
>
> -- 
> SQLAlchemy - 
> The Python SQL Toolkit and Object Relational Mapper
>  
> http://www.sqlalchemy.org/
>  
> To post example code, please provide an MCVE: Minimal, Complete, and 
> Verifiable Example. See http://stackoverflow.com/help/mcve for a full 
> description.
> --- 
> You received this message because you are subscribed to the Google Groups 
> "sqlalchemy" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/sqlalchemy/d834ff0d-654b-4b55-906c-0f062c7bb26fn%40googlegroups.com
>  
> <https://groups.google.com/d/msgid/sqlalchemy/d834ff0d-654b-4b55-906c-0f062c7bb26fn%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
>
>

-- 
SQLAlchemy - 
The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable 
Example.  See  http://stackoverflow.com/help/mcve for a full description.
--- 
You received this message because you are subscribed to the Google Groups 
"sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/sqlalchemy/1820d488-7511-4909-b3b3-c961dc32dba7n%40googlegroups.com.

Re: [sqlalchemy] Best practices for sub-dividing a large collection relationship

Reply via email to