if you need an in-memory dictionary of all the edges at once, then where is the 
part where you don't want the whole thing loaded at once?  or is the idea that 
this dictionary is checking for uniqueness only within the scope of what's being 
added in the current session (I'd use a separate dictionary for that, if that's 
the case)?
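if it's that session-scoped case, a plain dict keyed on the edge's identifying 
tuple is all you need - roughly like this (names here are just illustrative, 
not from your code):

```python
# sketch: dedupe edges being added in the current session only,
# keyed by (src_id, dst_id, color).  make_edge is whatever factory
# actually constructs your Edge - illustrative, not your real API.
pending_edges = {}

def add_edge_once(pending_edges, src_id, dst_id, color, make_edge):
    key = (src_id, dst_id, color)
    if key not in pending_edges:
        pending_edges[key] = make_edge(src_id, dst_id, color)
    return pending_edges[key]
```

you'd throw the dict away (or clear it) when the session's work is done, so 
nothing stays loaded beyond the current batch.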

your message has broken formatting (not line wrapping) so it's hard to read.   
I don't have a clear picture of what's needed - you can always maintain separate 
relationships where one is "dynamic" and the other is not,  or you could maintain 
a separate dictionary of Edge objects entirely.   For example, when you make a new 
Edge object, just put it in a global weak-referencing dictionary keyed by 
(srcnode, destnode), and that's how you can check for duplicate Edge objects.
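roughly like this - a sketch using weakref.WeakValueDictionary, with a stand-in 
Edge class (your real Edge is the mapped class, of course, and the function 
name is just illustrative):

```python
import weakref

# sketch: global weak-referencing registry of Edge objects, keyed by
# (srcnode, destnode).  entries vanish automatically once nothing else
# holds a strong reference to the Edge.
edge_registry = weakref.WeakValueDictionary()

class Edge:  # stand-in for the mapped Edge class
    def __init__(self, src_node, dst_node):
        self.src_node = src_node
        self.dst_node = dst_node

def get_or_create_edge(src_node, dst_node):
    key = (src_node, dst_node)
    edge = edge_registry.get(key)
    if edge is None:
        edge = Edge(src_node, dst_node)
        edge_registry[key] = edge
    return edge
```

calling get_or_create_edge() twice with the same pair hands back the same Edge 
object instead of creating a duplicate, and the registry doesn't keep Edges 
alive on its own.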



On Thu, Dec 30, 2021, at 9:04 PM, Dane K Barney wrote:
> Hey Mike, thanks for your speedy reply.
> 
> I guess I left out some key details and simplified my examples a little too 
> much, which is probably why you're confused.
> 
> For one, I have actually implemented downstream_edges and upstream_edges as 
> dictionary collections, i.e.:
> 
> src_node = relationship(
>     "Node",
>     foreign_keys=[src_node_id],
>     backref=backref(
>         "downstream_edges",
>         collection_class=attribute_mapped_collection("downstream_key"),
>     )
> )
> 
> @property
> def downstream_key(self):
>     return self.dst_node, self.color
> 
> 
> The purpose of this is to provide an easy way of doing "upserts" -- that is, 
> checking if an edge already exists before inserting, since the edge table has 
> a UNIQUE KEY (`src_node_id`,`dst_node_id`,`color`) that would be violated 
> otherwise.
> 
> This allows me to do:
> 
> downstream_key = (dst_node, color)
> try:
>     edge = node.downstream_edges[downstream_key]
> except KeyError:
>     edge = Edge(dst_node=dst_node, color=color)
>     node.downstream_edges[downstream_key] = edge
> 
> I realize that this means self.downstream_edges is fully loaded every time I 
> do this "upsert". But this is why I can't follow your suggestion of simply 
> adding Edges with Edge(src_node=src_node, dst_node=dst_node, color=color): 
> there's no check to prevent violating the UNIQUE KEY. So the question is: 
> what is the most performant way of achieving this "upsert" behavior without 
> triggering a full load of downstream_edges?
> 
> There is a second problem I'm running into when trying to make use of dynamic 
> relationships, or any relationship definition which requires a flush + query 
> to load. When I add a new edge to a collection and want event listeners to 
> respond to that modification, for example via 
> @event.listens_for(Session, "before_flush"), I don't seem to be allowed to 
> reference any relationship inside that event listener which requires a 
> database query to fetch. I'm assuming this is because the "before_flush" 
> listener cannot itself load a relationship that would require a flush to 
> load, since we are already mid-flush.
> Given that, if I've added a new edge to downstream_edges, which is 
> lazy="dynamic" let's say, how can I see all the items in that collection, or 
> in a subset of that collection (which could be a lazy="dynamic" relationship 
> or a primaryjoin + viewonly=True relationship), without requiring an 
> additional flush at a point in time when one is not allowed?
> 
> Note that I have also tried to accomplish this using "after_flush" and 
> "after_flush_postexec" event listeners, but the problem with these is I don't 
> seem to be allowed to make further ORM modifications inside these listeners 
> that would require more UPDATE statements, as I get an SAWarning telling me 
> that what I'm doing is not recommended.
> Thanks,
> -Dane
> On Thursday, December 30, 2021 at 5:33:38 PM UTC-8 Mike Bayer wrote:
>> 
>> 
>> On Thu, Dec 30, 2021, at 8:10 PM, Dane K Barney wrote:
>>> Given the following example model:
>>> 
>>> class Node(Base):
>>>     __tablename__ = "node"
>>>     id = Column(Integer, primary_key=True, autoincrement=True)
>>> 
>>> class Edge(Base):
>>>     __tablename__ = "edge"
>>>     id = Column(Integer, primary_key=True, autoincrement=True)
>>>     color = Column(Enum(Color))
>>>     src_node_id = Column(Integer, ForeignKey("node.id"))
>>>     dst_node_id = Column(Integer, ForeignKey("node.id"))
>>>     src_node = relationship(
>>>         "Node",
>>>         foreign_keys=[src_node_id],
>>>         backref=backref("downstream_edges")
>>>     )
>>>     dst_node = relationship(
>>>         "Node",
>>>         foreign_keys=[dst_node_id],
>>>         backref=backref("upstream_edges")
>>>     )
>>> 
>>> The collections Node.downstream_edges and Node.upstream_edges are so large 
>>> that it is detrimental to load them entirely on a regular basis (although I 
>>> still need the ability to do so). So for performance reasons, I am trying 
>>> to come up with a way of sub-dividing these relationships based on the 
>>> "color" attribute. One way I have tried to do this is by defining 
>>> additional relationships on the Node class which filter the query using a 
>>> custom primaryjoin. For example:
>>> 
>>> blue_downstream_edges = relationship(
>>>     "Edge",
>>>     primaryjoin="and_(Node.id == Edge.src_node_id, Edge.color == 'blue')",
>>>     viewonly=True,
>>> )
>>> 
>>> The downside to this approach is that the viewonly=True attribute means I 
>>> cannot directly make modifications to this collection. When I want to add 
>>> new Edges, I have to do it on the complete downstream_edges relationship, 
>>> which will trigger the full load that I'm trying to avoid.
>> 
>> I would try to add new Edge objects without appending them to any list; just 
>> create a new Edge(src_node=n1, dst_node=n2).   
>> 
>> 
>>> 
>>> Another idea I had was to use the lazy="dynamic" attribute on the 
>>> downstream_edges and upstream_edges relationships, so that I could do 
>>> something like:
>>> 
>>> blue_downstream_edges = node.downstream_edges.filter(Edge.color == 'blue')
>> 
>> yes, you should put "dynamic" on these relationships.
>> 
>>> 
>>> As I understand it, I would then be able to append items to this subset 
>>> collection. However, those changes would not be seen in the complete 
>>> downstream_edges collection.
>> 
>> they would?  they need to be flushed first, but that's how it works - if you 
>> want to see the "complete collection" then you are loading that complete 
>> collection.
>> 
>> OTOH if you want to just have the "complete collection" all in memory at 
>> once, OK, then you wouldn't use dynamic?   is it kind of like two 
>> different modes of operation you want to use, one where collections are 
>> fully in memory and another where they aren't?   
>> 
>> 
>>> 
>>> 
>>> In short, what I'm trying to achieve is a way to provide subsets of a large 
>>> collection, which are modifiable, and those modifications are visible in 
>>> both the larger collection and the subset collection.
>>> 
>>> I'm guessing this would need to involve some sort of shared access to the 
>>> same in-memory ORM mapper class, because a modification made to one 
>>> representation of the collection that has not yet been flushed/persisted to 
>>> the database would not be visible to another representation of that 
>>> collection, assuming it requires a database query to load.
>>> 
>>> Is this possible at all?
>>> 
>>> 
>>> -- 
>>> SQLAlchemy - 
>>> The Python SQL Toolkit and Object Relational Mapper
>>>  
>>> http://www.sqlalchemy.org/
>>>  
>>> To post example code, please provide an MCVE: Minimal, Complete, and 
>>> Verifiable Example. See http://stackoverflow.com/help/mcve for a full 
>>> description.
>>> --- 
>>> You received this message because you are subscribed to the Google Groups 
>>> "sqlalchemy" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an 
>>> email to sqlalchemy+...@googlegroups.com.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/sqlalchemy/d834ff0d-654b-4b55-906c-0f062c7bb26fn%40googlegroups.com
>>>  
>>> <https://groups.google.com/d/msgid/sqlalchemy/d834ff0d-654b-4b55-906c-0f062c7bb26fn%40googlegroups.com?utm_medium=email&utm_source=footer>.
>> 
> 
> 
