Re: Proposed change in ORM model save logic

Ian Foote Fri, 15 Oct 2021 10:39:43 -0700

Hi Barry,

Based on this I'm cautiously in favour. You set out a good case, it's much
easier to understand the "why" here than in the original ticket.


I'd avoid the extra optimisation of accessing __dict__ directly though - if
__set__ gains extra functionality, I'd prefer not to accidentally miss it
here.

Ian

On Fri, 15 Oct 2021 at 07:49, Barry Johnson <bajohn...@epicor.com> wrote:

> Trac ticket #33191 was recently closed as a “wontfix”, but Carlton
> encouraged bringing the matter to this group.  The actual issue is a small
> one, but it seems that Django could do better than it is doing today with
> one small change, and avoid the need for a counter-intuitive "x = x"
> statement added to application code.
>
> The claim is that Django is unnecessarily clearing the cached entry for an
> in-memory related object at a time that cache does not need to be cleared,
> and that this cache clearing can result in unwanted lazy reads down the
> line.  A one-line fix for the problem is suggested.
>
> Many apologies in advance for the extreme length of this post.  The
> requested change is small and subtle (although quite beneficial in some
> cases); we must dive deep into code to understand WHY it is important.
>
>
>
> Background
> ———————
>
> My team is developing a major retail point of sale application; it’s
> currently live in nearly 1,000 stores and we’re negotiating contracts that
> could bring it to another 10,000 stores across the US over the next three
> to five years before going world-wide.  The backend is Django and
> PostgreSQL.
>
> One of the typical problems we face has to do with importing data from
> existing systems.  This data usually comes to us in large sets of flat
> ASCII files, perhaps a gigabyte or few at a time.  We have to parse this
> incoming data and fill our tables as quickly as possible.  It’s not usual
> for one such data import to create tens of millions of rows across a more
> than a hundred tables.  Our processes to do this are working reasonably
> well.
>
> In many cases, the incoming stream of data generates related records in
> several tables at a time.  For example, our product import populates
> related data across a dozen tables organized into parent/child hierarchies
> that are up to four levels deep.  The incoming data is grouped by product
> (not by functional data type); the rows and columns of the records for a
> single product will create ORM objects scattered across those dozen ORM
> models.  The top-level model, Product, has four child models with
> references to Product; and each of those child models may have other child
> tables referring back to THOSE models, etc.
>
> If the database schema sounds surprisingly complex, it is.  Big box retail
> is more complex than most people realize.  Each store typically carries
> 30,000 to 70,000 individual products; some are even larger.  Some of the
> retail chains to which we market our systems have more than $1 billion USD
> per year in sales.
>
> To be efficient, we process this incoming data in chunks:  We may
> instantiate ORM objects representing, say, 5,000 products and all of their
> related child objects.  For example, we’ll create a new instance of a
> Product model, then instantiate the various children that reference that
> product.  So a very typical pattern is something like
>
>      product = Product(values)
>      pv = ProductVariant(product=product, **more_values)
>      upc = UPC(product_variant=product_variant, **upc_values)
>
> It’s not that simple, of course; we’re reading in sequential data that
> generates multiple instances of the children in loops, etc., but
> essentially building a list of products and the multi-level hierarchy of
> child objects that dependent upon each of those products.  We then use
> bulk_create() to create all of the top-level products in one operation,
> then bulk_create() the various lists of first-level children, then
> bulk_create the children of that second level, etc.  LOTS of bulk creates.
>
>
>
> Prior to Django 3.2, we had to explicitly set the associated “_id” field
> for each child’s reference to a parent after the list of parents were
> bulk_created.  Using our examples above, we would see things like:
>
> bulk_create(list_of_products)
>
> for each variant in list_of_product_variants:
>     variant.product_id = variant.product.id
> bulk_create(list_of_product_variants)
>
> for each upc in list_of_upc_entries:
>    upc.product_variant_id = upc.product_variant.id
> bulk_create(list_of_upc_entries)
>
>             […]
>
> Again, this is somewhat simplifying the code, but the key takeaway is that
> older versions of Django required us to manually pick up the primary key
> value from recently created instances and set the associated “_id” value in
> each instance pointing at the recently created objects.
>
> *As expected, setting the “_id” value of a foreign key field clears the
> internal cache entry containing the reference to the parent object.*  We
> would fully expect that if we said “variant.product_id = 10”, that
> “variant.product” would be considered “not yet loaded”, and a reference to
> variant.product would generate a lazy read.  We don’t disagree with that
> interpretation or behavior, and that is the existing behavior that Carlton
> implicitly referenced when he closed the ticket in question as “wontfix”.
>
> But…
>
>
>
> Relatively Recent Django Changes
> ——————————————————
>
> In the Django 3.2 timeframe, Ticket #29497 (“Saving parent object after
> setting on child leads to an unexpected data loss in bulk_create()”) has
> affected code in this area.  Consider this sequence of code:
>
>    my_parent = Parent()
>    my_child = Child(parent=parent)
>    my_parent.save()
>    my_child.save()
>
> (Yes, our real business case involves use of bulk_create() for thousands
> of objects at a time, but this simpler example using only .save()
> illustrates the problem in a stripped-down manner.)
>
> What is the expected result in the database for the child’s foreign key to
> the parent?   Well…   We think most application programmers would expect
> the row for child object to contain a foreign key refererence to the parent
> object.
>
> Instead, the pre-3.2 Django behavior was that the child would be saved
> with a null value in the foreign key reference to parent…  Because, when we
> save the parent (and internally populate the parent’s primary key with a
> newly assigned ID), Django is unaware of the in-memory reference from child
> to parent, and cannot automatically set child_parent_id = child.parent.id,
> so it remains a null value.
>
> If the foreign key from child to parent is required, the save of the child
> object will fail because of the null value, and the developer will quickly
> realize something is wrong.  However, if the foreign key to the parent is
> NOT required, then the child WILL be quietly saved to the database with a
> null foreign key to the parent... and an unwary developer will wind up with
> unexpected null values.
>
> This was considered an unintentional loss of data and reported on ticket
> #29497.
>
> The solution to this problem was clever:  It recognizes that this is a
> case where my_child.parent_id and my_child.parent.pk are out of sync with
> one another (something we normally never see in Django), and that this
> particular case can be detected and solved.  In particular, the logic
> implemented as part of the ticket effectively changes the internal priority
> of these two related values.
>
> Prior to 3.2, the code to build the SQL INSERT statement for saving the
> child would always use the “parent_id” attribute regardless of whether
> “parent” was an ORM instance or not.  The new 3.2 code does a quick initial
> test:  If the “parent” attribute contains an ORM object with a non-null
> primary key AND if “parent_id” does not have a value, then set parent_id =
> parent.pk and proceed with the remainder of the save.
>
> In other words, if we don’t have a value in “parent_id” but we DO have a
> value in “parent.pk”, then fix parent_id before we move on to build the
> SQL.   Nice!
>
> But…
>
>
>
> The change we are requesting
> ——————————————————
>
> For those that want to see the code in this area, check out the method
> _prepare_related_fields_for_save() in db.models.base.py.  The specific
> line of code in question is:
>     setattr(self, field.attname, obj.pk)
>
> It uses an internal “setter” to set “parent_id” to the primary key of the
> referenced object.
>
> Note this point carefully:  The value of “obj” is the referenced instance
> of the parent in question.  Just a few lines earlier in the code, “obj” is
> set with this statement:
>     obj = getattr(self, field.name, None)
>
> When we chase down this getter, it is returning the value in
> self._state.fields_cache[fieldname].  *In other words, it’s the actual
> instance of the parent referenced in memory from the child.*  The
> application-level code has clearly and intentionally made this reference
> from child to parent, and the fact that we saved the parent to the database
> doesn’t (well, shouldn’t) affect that relationship.  It certainly would NOT
> affect the relationship if the parent was updated instead of inserted, and
> we would expect consistent behavior from .save() regardless of insert vs.
> update.
>
> So, back to the narrow problem.  This line of code:
>     setattr(self, field.attname, obj.pk)
> causes the internal cached reference from parent to child to be cleared!
>
> Well, yes, of course it does.  This setattr() operation is effectively
> doing the same thing as
>     child.parent_id = 10
> which we’ve already said OUGHT to invalidate an existing reference to
> “child.parent”.
>
> But here’s the crux of the argument:  The code within
> _prepare_related_fields_for_save() ISN’T application-level code clearly
> altering the value of an ID field.  Instead, it’s internal Django code
> attempting to reconcile a mismatch between the cached ID of the parent
> (child.parent_id) and primary key value of the in-memory reference to a
> model instance (child.parent.pk).  And most importantly, it’s reconciling
> that difference by believing that the referenced instance is more accurate
> and should be preserved.  Yet the very act of preserving that referenced
> instance’s primary key is removing the reference to the very instance that
> is considered more accurate.
>
> Wait a moment — to preserve a pointer to the referenced instance, we’re
> going to pluck one tiny value out of it AND THEN throw the referenced
> instance away!  To those of my generation, this reminds us of a 50+ year
> old infamous quotation, “To save the village we had to destroy it.”
>
>
>
> Why does it matter?
> ————————————
>
> If the sole use of the parent and child objects in this example was to get
> data saved to the database, it really wouldn’t matter.
>
> However, in our specific use case, we have to record certain types of
> operations in an internal log.  (It’s an “inventory transaction register”;
> whenever we change something in inventory that has an effect on the value
> of inventory, we must log that change).  That logging logic is actually
> relative to ORM instances that are two levels down from the top in our
> hierarchy of parent/child objects being bulk created.  We have a method to
> which we pass each of these not-top-level objects, and the log entries
> includes pieces of data from those objects’ parent and grandparent objects.
>
> When we access the parent object through the child we’re logging, we wind
> up causing lazy reads of the objects we just saved to the database (and
> whom we still have in memory, and whom we THOUGHT were still referenced
> from our child objects).  In our case, we actually get 2 lazy reads for
> each store-specific product being imported; 50,000 items being imported for
> a four-store chain causes an extra 400,000 unnecessary SQL SELECT
> statements.  Removing these potential lazy reads is the performance
> improvement we’d like to share with the world.
>
> And perhaps just as importantly, it avoids differences in behavior caused
> when a referenced object is updated vs. inserted into the database during a
> save.  When it is updated, there is no need to alter the primary key, so
> the referenced object remains referenced.  But if the referenced object is
> inserted, the internal code that updates the primary key for that instance
> removes the reference.  We normally don’t like these types of behavioral
> differences in Django – we want .save() to behave the same way regardless
> of inserting vs. updating.
>
>
>
> There is a workaround
> ——————————————————
>
> There is a fairly simple workaround, and it harkens back to the days of
> pre-3.2 Django.
>
>    my_parent = Parent()
>    my_child = Child(parent=parent)
>    my_parent.save()
>    my_child.parent = my_child.parent    # Add this line to work around the
> problem.
>    my_child.save()
>
> Yes, we can do this.  This rather unintuitive statement (that appears to
> be useless) obtains the internal cached reference to the parent object,
> then set both child.parent_id and the internal cached reference back to
> that instance.  (The code in the setter first sets child.parent_id (which
> clears the cached entry)  then inserts the originally referenced parent
> back into the internal cache.)
>
> This workaround greatly resembles the statement from days of old:
> (“child.parent_id = child.parent.id”), but has the additional benefit of
> keeping our in-memory reference to parent (and to all of the objects that
> parent itself might be referencing as well), and avoiding lazy reads in the
> future.
>
>
>
> Our suggested change
> ——————————————————
>
> After this very long-winded explanation, we are proposing a small change
> that makes the workaround unnecessary while still avoiding the potential
> performance hit from lazy queries.
>
> Let’s look back to our friend _prepare_related_fields_for_save().
>
> We argue there is a STRONG belief that preparing an object for saving to
> the database should not substantially alter the content of that
> object.  Yet this method does not adhere to that belief:  While the results
> saved to the database are correct, it has altered the in-memory object by
> discarding the internal reference to a parent object that the application
> level code has carefully set.
>
> In general, we like what _prepare_related_fields_for_save() is doing — we
> like the fact that it’s reconciling the difference between the copy of the
> referenced object’s primary key (child.parent_id) and the in-memory
> reference (child.parent.pk).  We agree that this method is doing the
> right thing by using the primary key value from the referenced object.  We
> simply do not think that the parent instance should be removed from the
> internal cache as part of this change.
>
> Instead of resolving this difference in keys by executing:
>     setattr(self, field.attname, obj.pk)
>
> We believe it should instead:
>     setattr(self, field.name, obj)
>
> This would effectively replace the pseudo-code
>     child.parent_id = child.parent.pk
> With
>     child.parent = child.parent
>
> It replaces the value of child.parent_id with child.parent.pk, but
> without the side effect of discarding the in-memory relation of the child
> and parent instances.
>
>
>
> Finally, there is a small performance improvement that could be made to
> this suggestion.
>
> If we look at what the current
>     setattr(self, field.attname, obj.pk)
> actually does, we have to look at the __set__() method of class
> ForeignKeyDeferredAttribute in
> django.db.models.fields.related_descriptors.py.
>
> The code in this method is:
>
> def __set__(self, instance, value):
>     if instance.__dict__.get(self.field.attname) != value and
> self.field.is_cached(instance):
>         self.field.delete_cached_value(instance)
>     instance.__dict__[self.field.attname] = value
>
> It does what we expect:  If the “parent_id” value is changing and there is
> a cached in-memory reference, then clear the cached entry before setting
> the integer value.
>
> But note that the value of the “parent_id” attribute is really stored in
> the dictionary instance.__dict__[self.field.attname]
>
> Our proposed  change back in _prepare_related_fields_for_save() has the
> goal of setting the value of the integer “parent_id” while preserving the
> cached in-memory reference to the parent.
>
> Instead of doing
>     setattr(self, field.name, obj)
> (Which accomplishes the goal by essentially clearing the cache, then
> restoring it)
>
> We could just do this:
>     self.__dict__[self.field.attname] = obj.pk
>
> It is a slight denormalization of the logic that manages the value of the
> deferred foreign key attribute, but it has a tiny performance improvement
> by not removing, then replacing a dictionary entry.
>
>
>
> What are WE doing about this problem?
> —————————————————————
>
> We’re going to be staying with Django 3.2 until the next long-term support
> release is available, so this change has no immediacy for our particular
> application.  However, we think it is a nice extension that removes a
> non-obvious behavior in Django 3.2 (the .save() method can sometimes
> discard in-memory references to other ORM objects) and seems like it
> enhances the magic of the ORM framework.
>
> In the meantime, we’ll be plugging non-intuitive lines of code that look
> like
>     child.parent = child.parent
>
> into about fifty locations throughout our app, as well as training a team
> of 20 server-side developers to watch out for this edge case in any future
> code they write.  It’s all doable, and we’ll do it… but we really like
> Django because it does “automagically” handle so many little database
> oddities and nuances.  Over time, we’d like to get rid of “surprise” code
> like these lines that an inexperienced developer might remove because they
> look like they do nothing.
>
>
>
> IF YOU’VE REACHED THE END
> —————————————————
>
> Thank you!  I’m amazed you stuck through it all.
>
> What do you think?
>
> --
> You received this message because you are subscribed to the Google Groups
> "Django developers (Contributions to Django itself)" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to django-developers+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/django-developers/596D3E68-4A37-423D-AEDF-64D2A5658DE8%40epicor.com
> <https://groups.google.com/d/msgid/django-developers/596D3E68-4A37-423D-AEDF-64D2A5658DE8%40epicor.com?utm_medium=email&utm_source=footer>
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-developers+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/CAFv-zf%2BUCZo2C73Tom6OGCVTz_c2undO1X1LF0nNa_QHztB_5g%40mail.gmail.com.

Re: Proposed change in ORM model save logic

Reply via email to