Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL
Brian, that's an excellent write-up, thanks! From what I can tell, it looks very unlikely that MySQL will add the "RETURNING" syntax. MariaDB, however, has supported "RETURNING" for DELETE operations *only* for about 5 years, and has a 2-year-old issue to add it for INSERT: https://jira.mariadb.org/browse/MDEV-10014

However, it also seems like Django takes the "common denominator" approach towards supporting MySQL and MariaDB (since they are so similar, it treats them as the same database), so even if one or the other were to add support, I doubt that functionality would be exposed through the ORM unless they *both* were to support it. Therefore I think we can assume we won't see that functionality any time soon.

I'll probably do the performance investigation described in #4290 some time late next week. I'd like some of the refactoring that has been going on in the stages API to be completed before I proceed with that.

On Wed, Jan 16, 2019 at 4:11 PM Brian Bouterse wrote:
> Today @jortel, @dalley, @daviddavis, and @asmacdo have been taking a look
> at our options for resolving the PK issue that prevents Pulp from running
> on MariaDB. After considering various possible solutions on a call, and
> @dalley doing a little bit of research into the feasibility of those, I
> believe there are only two options:
>
> A) switch to UUIDs, or B) implement a fallback that saves one at a time
> if there's no "RETURNING".
>
> In reflecting on this, I have two observations about option (B). First,
> the root cause is that MariaDB doesn't implement the "RETURNING" option
> like PostgreSQL and Oracle do. The best solution would be for MariaDB to
> do that. If they aren't able to fix it where the problem is created,
> investing a good deal of time and energy into making Pulp run fast on
> MariaDB doesn't seem worth it to me.
>
> Secondly, our fallback to one-by-one mode would effectively cause Pulp
> sync on MariaDB to experience the slowdowns described in this ticket [0],
> which were generally identified as unacceptable for Pulp users. In terms
> of timeline, since we had talked about MariaDB compatibility being an RC
> blocker, I don't think we want to wait for MariaDB to make any changes,
> like an implementation of "RETURNING".
>
> In thinking about option (A), it resolves the compatibility issue and
> allows plugin writers to avoid the "my dbs don't work the same with
> bulk_create" problem entirely, so it's a great solution to the central
> problem (db compatibility). My only concern with (A) is that we know it
> is somewhat slower in terms of write performance, read performance, and
> index size. @dalley has an issue written up [1] to look at the impact of
> adopting (A) on Pulp workloads.
>
> So the next step would be to complete [1] and share the results. After
> looking at them we would make a yes/no decision on adopting UUIDs as
> likely our only feasible resolution. Any other ideas on what to do or how
> to do it are welcome.
>
> [0]: https://pulp.plan.io/issues/3770
> [1]: https://pulp.plan.io/issues/4290
>
> -Brian
>
> On Wed, Jan 9, 2019 at 1:41 PM Matthias Dellweg wrote:
>> On Wed, 9 Jan 2019 08:46:18 -0500 David Davis wrote:
>>> The Rubygems api includes sha as part of the metadata for a gem.
>>> Couldn't you use that as part of the natural key?
>>>
>>> I'm surprised that Chef's supermarket API doesn't include this as
>>> well. Maybe we could open a feature request?
>>>
>>> David
>>
>> WRT rubygems I use repositories generated by `gem generate_index`.
>> There is no checksum in the metadata that I found.
>> But also the whole thing seems to be documented rather poorly.
>> Using rubygems.org as a remote is not advisable until there are proper
>> filters implemented. ;)
>>
>> Matthias

___
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev
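[Editor's note] Option (B) above — falling back to saving objects one at a time when the database cannot report PKs from a bulk INSERT — can be sketched in plain Python. This is a minimal illustration of the control flow only; `FakeBackend` and `bulk_create_with_fallback` are hypothetical stand-ins, not Django or Pulp APIs:

```python
class FakeBackend:
    """Simulates a database that assigns auto-increment integer PKs."""

    def __init__(self, supports_returning):
        self.supports_returning = supports_returning
        self._next_pk = 1
        self.rows = []

    def insert(self, objs):
        """Insert rows; report the assigned PKs only if RETURNING works."""
        pks = []
        for obj in objs:
            pk = self._next_pk
            self._next_pk += 1
            self.rows.append((pk, obj))
            pks.append(pk)
        # PostgreSQL/Oracle-style backends can report PKs back (RETURNING);
        # MySQL/MariaDB-style backends cannot.
        return pks if self.supports_returning else None


def bulk_create_with_fallback(backend, objs):
    """One bulk INSERT when PKs come back; otherwise N single INSERTs."""
    if backend.supports_returning:
        return backend.insert(objs)  # fast path: a single statement
    pks = []
    for obj in objs:  # slow path: this is the per-row cost option (B) pays
        backend.insert([obj])
        # a single-row INSERT lets us read back the last assigned PK
        pks.append(backend.rows[-1][0])
    return pks
```

Either path yields usable PKs, but the fallback issues one statement per object, which is the sync slowdown Brian refers to.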
Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL
Today @jortel, @dalley, @daviddavis, and @asmacdo have been taking a look at our options for resolving the PK issue that prevents Pulp from running on MariaDB. After considering various possible solutions on a call, and @dalley doing a little bit of research into the feasibility of those, I believe there are only two options:

A) switch to UUIDs, or B) implement a fallback that saves one at a time if there's no "RETURNING".

In reflecting on this, I have two observations about option (B). First, the root cause is that MariaDB doesn't implement the "RETURNING" option like PostgreSQL and Oracle do. The best solution would be for MariaDB to do that. If they aren't able to fix it where the problem is created, investing a good deal of time and energy into making Pulp run fast on MariaDB doesn't seem worth it to me.

Secondly, our fallback to one-by-one mode would effectively cause Pulp sync on MariaDB to experience the slowdowns described in this ticket [0], which were generally identified as unacceptable for Pulp users. In terms of timeline, since we had talked about MariaDB compatibility being an RC blocker, I don't think we want to wait for MariaDB to make any changes, like an implementation of "RETURNING".

In thinking about option (A), it resolves the compatibility issue and allows plugin writers to avoid the "my dbs don't work the same with bulk_create" problem entirely, so it's a great solution to the central problem (db compatibility). My only concern with (A) is that we know it is somewhat slower in terms of write performance, read performance, and index size. @dalley has an issue written up [1] to look at the impact of adopting (A) on Pulp workloads.

So the next step would be to complete [1] and share the results. After looking at them we would make a yes/no decision on adopting UUIDs as likely our only feasible resolution. Any other ideas on what to do or how to do it are welcome.

[0]: https://pulp.plan.io/issues/3770
[1]: https://pulp.plan.io/issues/4290

-Brian

On Wed, Jan 9, 2019 at 1:41 PM Matthias Dellweg wrote:
> On Wed, 9 Jan 2019 08:46:18 -0500 David Davis wrote:
>> The Rubygems api includes sha as part of the metadata for a gem.
>> Couldn't you use that as part of the natural key?
>>
>> I'm surprised that Chef's supermarket API doesn't include this as
>> well. Maybe we could open a feature request?
>>
>> David
>
> WRT rubygems I use repositories generated by `gem generate_index`.
> There is no checksum in the metadata that I found.
> But also the whole thing seems to be documented rather poorly.
> Using rubygems.org as a remote is not advisable until there are proper
> filters implemented. ;)
>
> Matthias
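[Editor's note] Option (A) works because a UUID primary key is generated on the client *before* the row is saved, so the application never needs the database to report it back. A minimal sketch using the standard library's `uuid4`; the `Unit` class and `bulk_create` function are illustrative, not Pulp's actual models:

```python
import uuid


class Unit:
    """Hypothetical content unit with a client-generated UUID primary key."""

    def __init__(self, name):
        # The PK is assigned in the application, before any INSERT runs,
        # so no RETURNING support is needed to learn it afterwards.
        self.pk = uuid.uuid4()
        self.name = name


def bulk_create(units):
    """Stand-in for a bulk INSERT: every unit already carries its PK."""
    _rows = [(str(u.pk), u.name) for u in units]  # values a backend would receive
    return units
```

After `bulk_create` returns, the units can immediately be related to other objects on any backend, which is exactly the property integer PKs lose on MariaDB.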
Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL
On Wed, 9 Jan 2019 08:46:18 -0500 David Davis wrote:
> The Rubygems api includes sha as part of the metadata for a gem.
> Couldn't you use that as part of the natural key?
>
> I'm surprised that Chef's supermarket API doesn't include this as
> well. Maybe we could open a feature request?
>
> David

WRT rubygems I use repositories generated by `gem generate_index`.
There is no checksum in the metadata that I found.
But also the whole thing seems to be documented rather poorly.
Using rubygems.org as a remote is not advisable until there are proper
filters implemented. ;)

Matthias
Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL
On Wed, Jan 09, 2019 at 08:46:18AM -0500, David Davis wrote:
> The Rubygems api includes sha as part of the metadata for a gem.
> Couldn't you use that as part of the natural key?
>
> I'm surprised that Chef's supermarket API doesn't include this as well.
> Maybe we could open a feature request?

W.r.t. the Supermarket API I haven't found anything (neither in the documentation nor in the actual responses). We could open a feature request, but there isn't only the official Supermarket; there is also Chef Server, Minimart, (the deprecated) berkshelf API server, and JFrog Artifactory. I wouldn't like to depend on a cutting-edge feature of the Supermarket API, which other upstream servers would have to incorporate to be usable with pulp_cookbook.

Moreover, the current metadata is provided as one big chunk of data (a single request to the 'universe' endpoint). The digest would probably be obtained by extending the response of the GET request for detailed information about a specific cookbook version (i.e. name and version). I plan to enrich the metadata using this endpoint at some point in time, but I want to make it optional (for Chef Supermarket, this would cause thousands of requests during an initial repo sync, even when using the on_demand policy).

> David
>
> On Tue, Jan 8, 2019 at 2:50 PM Simon Baatz <[1]gmbno...@gmail.com> wrote:
>
>> On 08.01.2019 17:16, Jeff Ortel wrote:
>>> On 1/3/19 1:28 PM, Simon Baatz wrote:
>>>> On Thu, Jan 03, 2019 at 01:02:57PM -0500, David Davis wrote:
>>>>> I don't think that using integer ids with bulk_create and
>>>>> supporting mysql/mariadb are necessarily mutually exclusive. I
>>>>> think there might be a way to find the records created using
>>>>> bulk_create if we know the natural key. It might be more
>>>>> performant than using UUIDs as well.
>>>> This assumes that there is a natural key. For content types with no
>>>> digest information in the metadata, there may be a natural key for
>>>> content within a repo version only, but no natural key for the
>>>> overall content. (If we want to support non-immediate modes for
>>>> such content. In immediate mode, a digest can be computed from the
>>>> associated artifact(s).)
>>> Can you give some examples of Content without a natural key?
>>
>> For example, the metadata obtained for Cookbooks is "version" and
>> "name" (the same seems to apply to Ruby Gems). With immediate sync
>> policy, we can add a digest to each content unit as we know the digest
>> of the associated artifact. Thus, the natural key is "version",
>> "name", and "digest".
>>
>> In "non-immediate mode", we only have "version" and "name" to work
>> with during sync. Now, there is a trade-off (I think) and we have the
>> following possibilities:
>>
>> 1. Just pretend that "version" and "name" are unique. We have a
>> natural key, but it may lead to the cross-repo effects that I
>> described a while ago on the list.
>> 2. Use "version" and "name" as a natural key within a repo version,
>> but not globally. In this scenario, it may turn out that two content
>> units are actually the same after downloading.
>>
>> I prefer option 2: Content sharing is not perfect, but as a user, I
>> don't have to fear that repositories mix up content that happens to
>> have the same name and version.
>>
>> There is also an extension of 2., which allows content sharing during
>> sync for immediate mode. Define a "pseudo" natural key on global
>> content level: "version", "name" and "digest". "digest" may be null.
>> Two content units are considered the same if they match in all three
>> attributes and these attributes are not null. But even in immediate
>> mode, the artifact will not be downloaded if "name" and "version" are
>> already present in the repository version the sync is based on. A
>> pipeline for this could look like:
>>
>>     def pipeline_stages(self, new_version):
>>         pipeline = [
>>             self.first_stage,
>>             QueryExistingContentUnits(new_version=new_version),
>>             ExistingContentNeedsNoArtifacts(),
>>         ]
>>         if self.download_artifacts:
>>             pipeline.extend([ArtifactDownloader(), ArtifactSaver(),
>>                              UpdateContentWithDownloadResult(),
>>                              QueryExistingContentUnits()])
>>         pipeline.extend([ContentUnitSaver()])
>>         return pipeline
>>
>> QueryExistingContentUnits(new_version=new_version) associates based on
>> the "repo version key";
>> QueryExistingContentUnits() associates globally based on the "pseudo
>> natural key" (digest must be set to match at all).
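[Editor's note] The "pseudo" natural key rule discussed in this thread — two units are globally identical only if name, version, and digest all match and the digest is not null, while (name, version) alone suffices within a repo version — can be sketched like this. The function names and dict representation are illustrative, not Pulp's API:

```python
# Sketch of the matching rules for the "pseudo" natural key described
# in the thread. Units are plain dicts here purely for illustration.

def pseudo_key_matches(unit_a, unit_b):
    """Global match: all three attributes equal and digest present."""
    if unit_a["digest"] is None or unit_b["digest"] is None:
        return False  # without a digest, no global identity claim is made
    return (unit_a["name"], unit_a["version"], unit_a["digest"]) == \
           (unit_b["name"], unit_b["version"], unit_b["digest"])


def repo_version_key_matches(unit_a, unit_b):
    """Within one repo version, (name, version) alone identifies a unit."""
    return (unit_a["name"], unit_a["version"]) == \
           (unit_b["name"], unit_b["version"])
```

This captures the trade-off of option 2: units synced without a digest never alias globally, but can still be deduplicated within the repo version being built.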
Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL
The Rubygems api includes sha as part of the metadata for a gem. Couldn't you use that as part of the natural key?

I'm surprised that Chef's supermarket API doesn't include this as well. Maybe we could open a feature request?

David

On Tue, Jan 8, 2019 at 2:50 PM Simon Baatz wrote:
> On 08.01.2019 17:16, Jeff Ortel wrote:
>> On 1/3/19 1:28 PM, Simon Baatz wrote:
>>> On Thu, Jan 03, 2019 at 01:02:57PM -0500, David Davis wrote:
>>>> I don't think that using integer ids with bulk_create and
>>>> supporting mysql/mariadb are necessarily mutually exclusive. I
>>>> think there might be a way to find the records created using
>>>> bulk_create if we know the natural key. It might be more performant
>>>> than using UUIDs as well.
>>> This assumes that there is a natural key. For content types with no
>>> digest information in the metadata, there may be a natural key for
>>> content within a repo version only, but no natural key for the
>>> overall content. (If we want to support non-immediate modes for such
>>> content. In immediate mode, a digest can be computed from the
>>> associated artifact(s).)
>> Can you give some examples of Content without a natural key?
>
> For example, the metadata obtained for Cookbooks is "version" and
> "name" (the same seems to apply to Ruby Gems). With immediate sync
> policy, we can add a digest to each content unit as we know the digest
> of the associated artifact. Thus, the natural key is "version",
> "name", and "digest".
>
> In "non-immediate mode", we only have "version" and "name" to work
> with during sync. Now, there is a trade-off (I think) and we have the
> following possibilities:
>
> 1. Just pretend that "version" and "name" are unique. We have a
> natural key, but it may lead to the cross-repo effects that I
> described a while ago on the list.
> 2. Use "version" and "name" as a natural key within a repo version,
> but not globally. In this scenario, it may turn out that two content
> units are actually the same after downloading.
>
> I prefer option 2: Content sharing is not perfect, but as a user, I
> don't have to fear that repositories mix up content that happens to
> have the same name and version.
>
> There is also an extension of 2., which allows content sharing during
> sync for immediate mode. Define a "pseudo" natural key on global
> content level: "version", "name" and "digest". "digest" may be null.
> Two content units are considered the same if they match in all three
> attributes and these attributes are not null. But even in immediate
> mode, the artifact will not be downloaded if "name" and "version" are
> already present in the repository version the sync is based on. A
> pipeline for this could look like:
>
>     def pipeline_stages(self, new_version):
>         pipeline = [
>             self.first_stage,
>             QueryExistingContentUnits(new_version=new_version),
>             ExistingContentNeedsNoArtifacts(),
>         ]
>         if self.download_artifacts:
>             pipeline.extend([ArtifactDownloader(), ArtifactSaver(),
>                              UpdateContentWithDownloadResult(),
>                              QueryExistingContentUnits()])
>         pipeline.extend([ContentUnitSaver()])
>         return pipeline
>
> QueryExistingContentUnits(new_version=new_version) associates based on
> the "repo version key";
> QueryExistingContentUnits() associates globally based on the "pseudo
> natural key" (digest must be set to match at all).
Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL
I also want to vote for solution 2. The only thing I want to add is multi-tenancy capability: the repository of one user should really not be affected (from a functionality viewpoint) by actions on another repository. Sharing the actual files for performance is desirable, but sharing content units might be a problem in any case.

On Tue, 8 Jan 2019 20:49:41 +0100 Simon Baatz wrote:
> On 08.01.2019 17:16, Jeff Ortel wrote:
>> On 1/3/19 1:28 PM, Simon Baatz wrote:
>>> On Thu, Jan 03, 2019 at 01:02:57PM -0500, David Davis wrote:
>>>> I don't think that using integer ids with bulk_create and
>>>> supporting mysql/mariadb are necessarily mutually exclusive. I
>>>> think there might be a way to find the records created using
>>>> bulk_create if we know the natural key. It might be more performant
>>>> than using UUIDs as well.
>>> This assumes that there is a natural key. For content types with no
>>> digest information in the metadata, there may be a natural key for
>>> content within a repo version only, but no natural key for the
>>> overall content. (If we want to support non-immediate modes for such
>>> content. In immediate mode, a digest can be computed from the
>>> associated artifact(s).)
>> Can you give some examples of Content without a natural key?
>
> For example, the metadata obtained for Cookbooks is "version" and
> "name" (the same seems to apply to Ruby Gems). With immediate sync
> policy, we can add a digest to each content unit as we know the digest
> of the associated artifact. Thus, the natural key is "version",
> "name", and "digest".
>
> In "non-immediate mode", we only have "version" and "name" to work
> with during sync. Now, there is a trade-off (I think) and we have the
> following possibilities:
>
> 1. Just pretend that "version" and "name" are unique. We have a
> natural key, but it may lead to the cross-repo effects that I
> described a while ago on the list.
> 2. Use "version" and "name" as a natural key within a repo version,
> but not globally. In this scenario, it may turn out that two content
> units are actually the same after downloading.
>
> I prefer option 2: Content sharing is not perfect, but as a user, I
> don't have to fear that repositories mix up content that happens to
> have the same name and version.
>
> There is also an extension of 2., which allows content sharing during
> sync for immediate mode. Define a "pseudo" natural key on global
> content level: "version", "name" and "digest". "digest" may be null.
> Two content units are considered the same if they match in all three
> attributes and these attributes are not null. But even in immediate
> mode, the artifact will not be downloaded if "name" and "version" are
> already present in the repository version the sync is based on. A
> pipeline for this could look like:
>
>     def pipeline_stages(self, new_version):
>         pipeline = [
>             self.first_stage,
>             QueryExistingContentUnits(new_version=new_version),
>             ExistingContentNeedsNoArtifacts(),
>         ]
>         if self.download_artifacts:
>             pipeline.extend([ArtifactDownloader(), ArtifactSaver(),
>                              UpdateContentWithDownloadResult(),
>                              QueryExistingContentUnits()])
>         pipeline.extend([ContentUnitSaver()])
>         return pipeline
>
> QueryExistingContentUnits(new_version=new_version) associates based on
> the "repo version key";
> QueryExistingContentUnits() associates globally based on the "pseudo
> natural key" (digest must be set to match at all).
Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL
On 08.01.2019 17:16, Jeff Ortel wrote:
> On 1/3/19 1:28 PM, Simon Baatz wrote:
>> On Thu, Jan 03, 2019 at 01:02:57PM -0500, David Davis wrote:
>>> I don't think that using integer ids with bulk_create and
>>> supporting mysql/mariadb are necessarily mutually exclusive. I
>>> think there might be a way to find the records created using
>>> bulk_create if we know the natural key. It might be more performant
>>> than using UUIDs as well.
>> This assumes that there is a natural key. For content types with no
>> digest information in the metadata, there may be a natural key for
>> content within a repo version only, but no natural key for the
>> overall content. (If we want to support non-immediate modes for such
>> content. In immediate mode, a digest can be computed from the
>> associated artifact(s).)
> Can you give some examples of Content without a natural key?

For example, the metadata obtained for Cookbooks is "version" and "name" (the same seems to apply to Ruby Gems). With immediate sync policy, we can add a digest to each content unit as we know the digest of the associated artifact. Thus, the natural key is "version", "name", and "digest".

In "non-immediate mode", we only have "version" and "name" to work with during sync. Now, there is a trade-off (I think) and we have the following possibilities:

1. Just pretend that "version" and "name" are unique. We have a natural key, but it may lead to the cross-repo effects that I described a while ago on the list.
2. Use "version" and "name" as a natural key within a repo version, but not globally. In this scenario, it may turn out that two content units are actually the same after downloading.

I prefer option 2: Content sharing is not perfect, but as a user, I don't have to fear that repositories mix up content that happens to have the same name and version.

There is also an extension of 2., which allows content sharing during sync for immediate mode. Define a "pseudo" natural key on global content level: "version", "name" and "digest". "digest" may be null. Two content units are considered the same if they match in all three attributes and these attributes are not null. But even in immediate mode, the artifact will not be downloaded if "name" and "version" are already present in the repository version the sync is based on. A pipeline for this could look like:

    def pipeline_stages(self, new_version):
        pipeline = [
            self.first_stage,
            QueryExistingContentUnits(new_version=new_version),
            ExistingContentNeedsNoArtifacts(),
        ]
        if self.download_artifacts:
            pipeline.extend([ArtifactDownloader(), ArtifactSaver(),
                             UpdateContentWithDownloadResult(),
                             QueryExistingContentUnits()])
        pipeline.extend([ContentUnitSaver()])
        return pipeline

QueryExistingContentUnits(new_version=new_version) associates based on the "repo version key";
QueryExistingContentUnits() associates globally based on the "pseudo natural key" (digest must be set to match at all).
Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL
On 1/3/19 1:28 PM, Simon Baatz wrote:
> On Thu, Jan 03, 2019 at 01:02:57PM -0500, David Davis wrote:
>> I don't think that using integer ids with bulk_create and supporting
>> mysql/mariadb are necessarily mutually exclusive. I think there might
>> be a way to find the records created using bulk_create if we know the
>> natural key. It might be more performant than using UUIDs as well.
> This assumes that there is a natural key. For content types with no
> digest information in the metadata, there may be a natural key for
> content within a repo version only, but no natural key for the overall
> content. (If we want to support non-immediate modes for such content.
> In immediate mode, a digest can be computed from the associated
> artifact(s).)

Can you give some examples of Content without a natural key?

> Of course, there are ways around that (use a UUID as the "natural" key,
> or add a UUID to the repo version key fields), but I would like to
> avoid that.
>
>> On Thu, Jan 3, 2019 at 11:04 AM Dennis Kliban <[1]dkli...@redhat.com> wrote:
>>> Thank you Daniel for the explanation and for filing an issue[0] to do
>>> performance analysis of UUIDs. I really hope that we can switch back
>>> to using UUIDs so we can bring back MariaDB support for Pulp 3.
>>> [0] [2]https://pulp.plan.io/issues/4290
>>>
>>> On Wed, Dec 5, 2018 at 1:35 PM Daniel Alley <[3]dal...@redhat.com> wrote:
>>>> To rephrase the problem a little bit:
>>>> We need to bulk_create() a bunch of objects, and then after we do
>>>> that we want to immediately be able to relate them with other
>>>> objects, which means we need the PKs of the objects that were just
>>>> created. In the case of auto-increment integer PKs, we can't know
>>>> that PK value before it gets saved into the database. Luckily,
>>>> PostgreSQL (and Oracle) support a "RETURNING" keyword that provides
>>>> this information. The raw SQL would look something like this:
>>>>
>>>>     INSERT INTO items (name) VALUES ('bear') RETURNING id;
>>>>
>>>> Django uses this feature to set the PK field on the model objects it
>>>> returns when you call bulk_create() on a list of unsaved model
>>>> objects. Unfortunately, MySQL doesn't support this, so there's no
>>>> way to figure out what the PKs of the objects you just saved were,
>>>> so the ORM can't set that information on the returned model objects.
>>>> UUID PKs circumvent this because the PK gets created outside of the
>>>> database, prior to being saved in the database, and so Django *can*
>>>> know what the PK will be when it gets saved.
>>>>
>>>> On Wed, Dec 5, 2018 at 12:11 PM Brian Bouterse <[4]bbout...@redhat.com> wrote:
>>>>> +1 to experimentation and also making sure that we understand the
>>>>> performance implications of the decision. I'm replying to this
>>>>> earlier note to restate my observations of the problem a bit more.
>>>>> More ideas and thoughts are welcome. This is a decision with a lot
>>>>> of aspects to consider.
>>>>>
>>>>> On Tue, Nov 20, 2018 at 10:00 AM Patrick Creech <[5]pcre...@redhat.com> wrote:
>>>>>> On Mon, 2018-11-19 at 17:08 -0500, Brian Bouterse wrote:
>>>>>>> When we switched from UUID to integers for the PK, we lost the
>>>>>>> ability to have the primary key set during bulk_create with
>>>>>>> databases other than PostgreSQL [0].
>>>>>>>
>>>>>>> With a goal of database agnosticism for Pulp3, if plugin writers
>>>>>>> plan to use bulk_create with any object inherited from one of
>>>>>>> ours, they will get different behaviors on different databases
>>>>>>> and they won't have PKs that they may require. bulk_create is a
>>>>>>> normal django thing, so plugin writers making a django plugin
>>>>>>> should be able to use it. This concerned me already, but today
>>>>>>> it was also brought up by non-RH plugin writers [1] in a PR.
>>>>>>>
>>>>>>> The tradeoffs between UUIDs versus PKs are pretty well summed up
>>>>>>> in our ticket where we discussed that change [2]. Note, we did
>>>>>>> not consider this bulk_create downside at that time, which I
>>>>>>> think is the most significant downside to consider.
>>>>>>>
>>>>>>> Having bulk_create effectively not available for plugin writers
>>>>>>> (since we can't rely on its pks being returned) I think is a
>>>>>>> non-starter for me. I love how short the UUIDs made our URLs, so
>>>>>>> that's the tradeoff mainly in my mind. Those balanced against
>>>>>>> each other, I think we should switch back.
>>>>>>>
>>>>>>> Another option is to become PostgreSQL only, which (though I
>>>>>>> love psql) I think would be the wrong choice for Pulp from what
>>>>>>> I've heard from its users.
>>>>>>>
>>>>>>> What do you think? What should we do?
>>>>>>
>>>>>> So, my mind immediately goes to this question, which might be
>>>>>> useful for others to help make decisions, so I'll ask:
>>>>>>
>>>>>> When you say: "we lost the ability to have the primary key set
>>>>>> during bulk_create"
>>>>>>
>>>>>> Can you clarify what you mean by this? My mind immediately goes
>>>>>> to this chain of events: When you use bulk_create, the existing
>>>>>> in-memory model objects representing the
Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL
I think adding a separate UUID column seems like a feasible workaround for Content without a natural key. But I can understand the desire to avoid creating an artificial natural key that way. I guess we can weigh the performance results of using UUID PKs against how often we expect plugin Content to have no natural key.

David

On Thu, Jan 3, 2019 at 2:28 PM Simon Baatz wrote:
> On Thu, Jan 03, 2019 at 01:02:57PM -0500, David Davis wrote:
>> I don't think that using integer ids with bulk_create and supporting
>> mysql/mariadb are necessarily mutually exclusive. I think there might
>> be a way to find the records created using bulk_create if we know the
>> natural key. It might be more performant than using UUIDs as well.
>
> This assumes that there is a natural key. For content types with no
> digest information in the metadata, there may be a natural key for
> content within a repo version only, but no natural key for the overall
> content. (If we want to support non-immediate modes for such content.
> In immediate mode, a digest can be computed from the associated
> artifact(s).)
>
> Of course, there are ways around that (use a UUID as the "natural" key,
> or add a UUID to the repo version key fields), but I would like to
> avoid that.
>
>> On Thu, Jan 3, 2019 at 11:04 AM Dennis Kliban <[1]dkli...@redhat.com> wrote:
>>> Thank you Daniel for the explanation and for filing an issue[0] to do
>>> performance analysis of UUIDs. I really hope that we can switch back
>>> to using UUIDs so we can bring back MariaDB support for Pulp 3.
>>> [0] [2]https://pulp.plan.io/issues/4290
>>>
>>> On Wed, Dec 5, 2018 at 1:35 PM Daniel Alley <[3]dal...@redhat.com> wrote:
>>>> To rephrase the problem a little bit:
>>>> We need to bulk_create() a bunch of objects, and then after we do
>>>> that we want to immediately be able to relate them with other
>>>> objects, which means we need the PKs of the objects that were just
>>>> created. In the case of auto-increment integer PKs, we can't know
>>>> that PK value before it gets saved into the database. Luckily,
>>>> PostgreSQL (and Oracle) support a "RETURNING" keyword that provides
>>>> this information. The raw SQL would look something like this:
>>>>
>>>>     INSERT INTO items (name) VALUES ('bear') RETURNING id;
>>>>
>>>> Django uses this feature to set the PK field on the model objects it
>>>> returns when you call bulk_create() on a list of unsaved model
>>>> objects. Unfortunately, MySQL doesn't support this, so there's no
>>>> way to figure out what the PKs of the objects you just saved were,
>>>> so the ORM can't set that information on the returned model objects.
>>>> UUID PKs circumvent this because the PK gets created outside of the
>>>> database, prior to being saved in the database, and so Django *can*
>>>> know what the PK will be when it gets saved.
>>>>
>>>> On Wed, Dec 5, 2018 at 12:11 PM Brian Bouterse <[4]bbout...@redhat.com> wrote:
>>>>> +1 to experimentation and also making sure that we understand the
>>>>> performance implications of the decision. I'm replying to this
>>>>> earlier note to restate my observations of the problem a bit more.
>>>>> More ideas and thoughts are welcome. This is a decision with a lot
>>>>> of aspects to consider.
>>>>>
>>>>> On Tue, Nov 20, 2018 at 10:00 AM Patrick Creech <[5]pcre...@redhat.com> wrote:
>>>>>> On Mon, 2018-11-19 at 17:08 -0500, Brian Bouterse wrote:
>>>>>>> When we switched from UUID to integers for the PK, we lost the
>>>>>>> ability to have the primary key set during bulk_create with
>>>>>>> databases other than PostgreSQL [0].
>>>>>>>
>>>>>>> With a goal of database agnosticism for Pulp3, if plugin writers
>>>>>>> plan to use bulk_create with any object inherited from one of
>>>>>>> ours, they will get different behaviors on different databases
>>>>>>> and they won't have PKs that they may require. bulk_create is a
>>>>>>> normal django thing, so plugin writers making a django plugin
>>>>>>> should be able to use it. This concerned me already, but today
>>>>>>> it was also brought up by non-RH plugin writers [1] in a PR.
>>>>>>>
>>>>>>> The tradeoffs between UUIDs versus PKs are pretty well summed up
>>>>>>> in our ticket where we discussed that change [2]. Note, we did
>>>>>>> not consider this bulk_create downside at that time, which I
>>>>>>> think is the most significant downside to consider.
>>>>>>>
>>>>>>> Having bulk_create effectively not available for plugin writers
>>>>>>> (since we can't rely on its pks being returned) I think is a
>>>>>>> non-starter for me. I love how short the UUIDs made our URLs, so
>>>>>>> that's the tradeoff mainly in my mind. Those balanced against
>>>>>>> each other, I think we should switch back.
>>>>>>>
>>>>>>> Another option is to become PostgreSQL only, which (though I
>>>>>>> love psql) I think would be the wrong
Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL
On Thu, Jan 03, 2019 at 01:02:57PM -0500, David Davis wrote:
> I don't think that using integer ids with bulk_create and supporting
> mysql/mariadb are necessarily mutually exclusive. I think there might
> be a way to find the records created using bulk_create if we know the
> natural key. It might be more performant than using UUIDs as well.

This assumes that there is a natural key. For content types with no digest information in the metadata, there may be a natural key for content within a repo version only, but no natural key for the overall content. (That matters if we want to support non-immediate modes for such content; in immediate mode, a digest can be computed from the associated artifact(s).)

Of course, there are ways around that (use a UUID as the "natural" key, or add a UUID to the repo version key fields), but I would like to avoid that.

[...]
Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL
I don't think that using integer ids with bulk_create and supporting mysql/mariadb are necessarily mutually exclusive. I think there might be a way to find the records created using bulk_create if we know the natural key. It might be more performant than using UUIDs as well.

David

On Thu, Jan 3, 2019 at 11:04 AM Dennis Kliban wrote:
> Thank you Daniel for the explanation and for filing an issue[0] to do
> performance analysis of UUIDs.
>
> I really hope that we can switch back to using UUIDs so we can bring back
> MariaDB support for Pulp 3.
>
> [0] https://pulp.plan.io/issues/4290
>
> [...]
Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL
Thank you Daniel for the explanation and for filing an issue[0] to do performance analysis of UUIDs.

I really hope that we can switch back to using UUIDs so we can bring back MariaDB support for Pulp 3.

[0] https://pulp.plan.io/issues/4290

On Wed, Dec 5, 2018 at 1:35 PM Daniel Alley wrote:
> To rephrase the problem a little bit:
>
> We need to bulk_create() a bunch of objects, and then after we do that we
> want to immediately be able to relate them with other objects, which means
> we need the PKs of the objects that were just created.
>
> [...]
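For contrast, the UUID approach Dennis is hoping for sidesteps the problem entirely, because the key is generated client-side and is known before the INSERT ever happens. A minimal stdlib sketch (the table names are hypothetical, not Pulp's schema):

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE content (pk TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE repo_content (repo_id INTEGER, content_pk TEXT)")

# The PK is generated in application code, so it is known *before* the
# rows are saved -- no RETURNING support is needed from the database.
rows = [(str(uuid.uuid4()), name) for name in ("bear", "fox")]
conn.executemany("INSERT INTO content (pk, name) VALUES (?, ?)", rows)

# Related rows can be written right away, with no follow-up query.
conn.executemany(
    "INSERT INTO repo_content (repo_id, content_pk) VALUES (1, ?)",
    [(pk,) for pk, _ in rows],
)
```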
Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL
@dalley that is an excellent writeup of what is going on.

Since we could be benefiting from PostgreSQL-only Django features, I think we should add MySQL to the test matrix to ensure that at least pulp/pulp and pulp/pulpcore-plugin are SQL agnostic. I made this issue to track that work: https://pulp.plan.io/issues/4270

Please raise any concerns or ideas on what we could do better for this.

On Wed, Dec 5, 2018 at 1:34 PM Daniel Alley wrote:
> To rephrase the problem a little bit:
>
> [...]
Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL
To rephrase the problem a little bit:

We need to bulk_create() a bunch of objects, and then after we do that we want to immediately be able to relate them with other objects, which means we need the PKs of the objects that were just created.

In the case of auto-increment integer PKs, we can't know that PK value before it gets saved into the database. Luckily, PostgreSQL (and Oracle) support a "RETURNING" keyword that does provide this information. The raw SQL would look something like this:

> INSERT INTO items (name) values ('bear') RETURNING id;

Django uses this feature to set the PK field on the model objects it returns when you call bulk_create() on a list of unsaved model objects.

Unfortunately, MySQL doesn't support this, so there's no way to figure out what the PKs of the objects you just saved were, so the ORM can't set that information on the returned model objects.

UUID PKs circumvent this because the PK gets created outside of the database, prior to being saved in the database, and so Django *can* know what the PK will be when it gets saved.

On Wed, Dec 5, 2018 at 12:11 PM Brian Bouterse wrote:
> +1 to experimentation and also making sure that we understand the
> performance implications of the decision. I'm replying to this earlier note
> to restate my observations of the problem a bit more.
>
> [...]
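Daniel's RETURNING example can be made concrete. The sketch below uses plain sqlite3 so it is self-contained (SQLite gained RETURNING in 3.35, with syntax analogous to PostgreSQL's) and also shows the fallback that a backend without RETURNING forces on you: re-selecting by a unique column. The `items` table mirrors the raw SQL above; the helper name is made up:

```python
import sqlite3

def bulk_insert_ids(conn, names):
    """Insert rows and return their new primary keys, in input order.

    Uses INSERT ... RETURNING where the backend supports it (PostgreSQL,
    Oracle, SQLite >= 3.35); otherwise falls back to re-selecting by a
    unique column -- the kind of extra query MySQL would require.
    """
    if sqlite3.sqlite_version_info >= (3, 35, 0):
        return [
            conn.execute(
                "INSERT INTO items (name) VALUES (?) RETURNING id", (name,)
            ).fetchone()[0]
            for name in names
        ]
    # Fallback: the INSERTs succeed, but no ids come back...
    conn.executemany("INSERT INTO items (name) VALUES (?)",
                     [(n,) for n in names])
    # ...so a second query is needed to learn what they were.
    placeholders = ",".join("?" * len(names))
    by_name = {
        name: pk
        for pk, name in conn.execute(
            f"SELECT id, name FROM items WHERE name IN ({placeholders})", names
        )
    }
    return [by_name[n] for n in names]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
ids = bulk_insert_ids(conn, ["bear", "fox", "owl"])
```

Django's bulk_create() does the RETURNING branch for you on PostgreSQL; the point of the thread is that the fallback branch is what nobody has written for MySQL.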
Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL
+1 to experimentation and also making sure that we understand the performance implications of the decision. I'm replying to this earlier note to restate my observations of the problem a bit more.

More ideas and thoughts are welcome. This is a decision with a lot of aspects to consider.

On Tue, Nov 20, 2018 at 10:00 AM Patrick Creech wrote:
> On Mon, 2018-11-19 at 17:08 -0500, Brian Bouterse wrote:
> > When we switched from UUID to integers for the PK, we lost the ability
> > to have the primary key set during bulk_create with databases other
> > than PostgreSQL [0].
> >
> > With a goal of database agnosticism for Pulp3, if plugin writers plan
> > to use bulk_create with any object inherited from one of ours, they
> > will get different behaviors on different databases and they won't have
> > PKs that they may require. bulk_create is a normal django thing, so
> > plugin writers making a django plugin should be able to use it. This
> > concerned me already, but today it was also brought up by non-RH plugin
> > writers [1] in a PR.
> >
> > The tradeoffs between UUIDs versus PKs are pretty well summed up in our
> > ticket where we discussed that change [2]. Note, we did not consider
> > this bulk_create downside at that time, which I think is the most
> > significant downside to consider.
> >
> > Having bulk_create effectively not available for plugin writers (since
> > we can't rely on its pks being returned) I think is a non-starter for
> > me. I love how short the UUIDs made our URLs so that's the tradeoff
> > mainly in my mind. Those balanced against each other, I think we should
> > switch back.
> >
> > Another option is to become PostgreSQL only which (though I love psql)
> > I think would be the wrong choice for Pulp from what I've heard from
> > its users.
> >
> > What do you think? What should we do?
>
> So, my mind immediately goes to this question, which might be useful for
> others to help make decisions, so I'll ask:
>
> When you say:
>
> "we lost the ability to have the primary key set during bulk_create"
>
> Can you clarify what you mean by this?
>
> My mind immediately goes to this chain of events:
>
> When you use bulk_create, the existing in-memory model objects
> representing the data to create do not get updated with the primary key
> values that are created in the database.
>
> Upon a subsequent query of the database, for the exact same set of
> objects just added, those objects _will_ have the primary key populated.
>
> In other words,
>
> The database records themselves get the auto-increment IDs added, they
> just don't get reported back in that query to the ORM layer, therefore it
> takes a subsequent query to get those ids out.
>
> Does that about sum it up?

Yes this describes the situation, but there is a bit more to tell. Since PostgreSQL does return the ids, the subsequent query that could be done to get the ids isn't written in code today. We didn't need to because we developed it against PostgreSQL. I'm pretty sure that if you configure Pulp against MySQL, Pulp won't work, which I think is a problem. So I'm observing two things here: 1) This is a hazard that causes code to unexpectedly be only compliant with PostgreSQL. 2) Pulp itself fell into this hazard and we need to fix that too.

Do you also see these two issues? What should be done about these?

> > [0]: https://docs.djangoproject.com/en/2.1/ref/models/querysets/#bulk-create
> > [1]: https://github.com/pulp/pulp/pull/3764#discussion_r234780702
> > [2]: https://pulp.plan.io/issues/3848

___
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev
Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL
On Wed, 2018-12-05 at 09:34 -0500, Daniel Alley wrote: > Perhaps, but it's not a -1 so much as a call for more experimentation and > testing. I wouldn't feel comfortable saying > Pulp is MySQL "compatible" if (if!) it was an order of magnitude slower than > Pulp on Postgres, and we never found out > about that because we never tested it... I think that kind of "compatibility" > would be a net negative to Pulp. So I > just want us to make sure these bases are covered if we're going to make that > claim. > > s/it's chart chart/it's a pretty nasty chart It shows the beginnings of an exponential curve of performance degredation as your record count grows. yes, nasty. > > On Wed, Dec 5, 2018 at 9:27 AM Dennis Kliban wrote: > > It looks like the chart was generated using MySQL 5.0.45 which was released > > at least 10 years ago[0]. I don't think > > we can rely on such old results. > > > > [0] https://en.wikipedia.org/wiki/MySQL#Milestones https://mariadb.com/kb/en/library/guiduuid-performance/ MariaDB (the MySQL fork after oracle took over), still has concerns about guid/uuid performance. The article here is about 1-2 years old, and it spells out some of the still current-ish concerns with utilizing uuids/guids in the mysql world. I would argue that yes, you should at least pay attention to these old results, as far as a "this used to be a problem, lets make sure it still isn't" Most of these int vs uuid/guid concerns in the database world still apply today, and either side has nuanced tradeoffs for each technology (postgresql, mysql/mariadb, oracle, sqlserver, etc...) that will affect wich choice is best. FWIW, _generally_ the compromise solutions I've seen are where the actual pk is an int, and isn't exposed outside of the db/model layer. There is also an attached UUID/GUID that is code-generated that is treated as an object reference in the api/code layers. There are other solutions out there as well. 
> > > > On Wed, Dec 5, 2018 at 9:18 AM Daniel Alley wrote: > > > I just want to point out that using UUID PKs works perfectly fine on > > > PostgreSQL but is considered a Bad Idea™ on > > > MySQL for performance reasons. > > > > > > http://kccoder.com/mysql/uuid-vs-int-insert-performance/ > > > > > > > > > > > > It's hard to notice at first, but the blue and red lines (representing > > > integer PKs) are tracking near the bottom. > > > > > > I did my testing with PostgreSQL, and I would completely agree that the > > > tiny performance hit we noticed there > > > would take a backseat to the functional benefits Brian is pointing out. > > > But if we really, truly want to be > > > database agnostic, we should put more thought into this change (and > > > others going forwards). > > > > > > Another factor that makes this a more complicated decision is that the > > > limitations on using bulk_create() with > > > multi-table models are more of a "simplification" on the Django side than > > > a fundamental limitation. According to > > > this comment [0] in the Django source code, and this issue [1] it's > > > likely possible on PostgreSQL as-is, if we > > > were willing to mess around inside the ORM a bit. And it could be > > > possible on MySQL also *if* we used UUID PKs. > > > And maybe the performance benefits of being able to use bulk_create() > > > would override or reduce the performance > > > downsides of using UUID with MySQL. I don't know about that though... > > > that's chart chart and without some > > > experimentation this is all speculation. > > > > > > TL;DR If we want to stay DB agnostic it needs to be worked into our > > > decision making process and not be an > > > afterthought This ^. While on the surface the orm layer helps provide you with the ability to say you're db agnostic, to truly have the same experience across database technologies will require up front research and informed choices. 
> > > [0] https://github.com/django/django/blob/master/django/db/models/query.py#L438
> > > [1] https://code.djangoproject.com/ticket/28821

___
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev
Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL
On Mon, 2018-11-19 at 17:08 -0500, Brian Bouterse wrote:
> When we switched from UUID to integers for the PK we lost the ability to have the primary key set during bulk_create with databases other than PostgreSQL [0].
>
> With a goal of database agnosticism for Pulp3, if plugin writers plan to use bulk_create with any object inherited from one of ours, they will get different behaviors on different databases and they won't have PKs that they may require. bulk_create is a normal django thing, so plugin writers making a django plugin should be able to use it. This concerned me already, but today it was also brought up by non-RH plugin writers [1] in a PR.
>
> The tradeoffs between UUIDs versus integer PKs are pretty well summed up in our ticket where we discussed that change [2]. Note, we did not consider this bulk_create downside at that time, which I think is the most significant downside to consider.
>
> Having bulk_create effectively not available for plugin writers (since we can't rely on its pks being returned) I think is a non-starter for me. I love how short the UUIDs made our URLs, so that's the tradeoff mainly in my mind. Those balanced against each other, I think we should switch back.
>
> Another option is to become PostgreSQL only, which (though I love psql) I think would be the wrong choice for Pulp from what I've heard from its users.
>
> What do you think? What should we do?

So, my mind immediately goes to this question, which might be useful for others to help make decisions, so I'll ask:

When you say "we lost the ability to have the primary key set during bulk_create", can you clarify what you mean by this? My mind immediately goes to this chain of events:

When you use bulk_create, the existing in-memory model objects representing the data to create do not get updated with the primary key values that are created in the database. Upon a subsequent query of the database, for the exact same set of objects just added, those objects _will_ have the primary key populated.
In other words, the database records themselves get the auto-increment IDs added; they just don't get reported back to the ORM layer in that query, so it takes a subsequent query to get those ids out. Does that about sum it up?

> [0]: https://docs.djangoproject.com/en/2.1/ref/models/querysets/#bulk-create
> [1]: https://github.com/pulp/pulp/pull/3764#discussion_r234780702
> [2]: https://pulp.plan.io/issues/3848
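That chain of events can be sketched with a toy simulation (plain Python, not real Django; `FakeBackend` and `Unit` are hypothetical stand-ins for the ORM and a content model):

```python
# Toy model of the chain of events: bulk INSERT assigns auto-increment ids
# in the database, but without RETURNING support those ids never reach the
# in-memory objects, so a follow-up query is needed to recover them.
class Unit:
    def __init__(self, name):
        self.name = name
        self.pk = None  # unset until the ORM learns the database id

class FakeBackend:
    def __init__(self, supports_returning):
        self.supports_returning = supports_returning
        self.rows = {}   # pk -> name, stands in for the table
        self._auto = 0   # stands in for the auto-increment counter

    def bulk_create(self, objs):
        for obj in objs:
            self._auto += 1
            self.rows[self._auto] = obj.name  # row exists either way
            if self.supports_returning:
                # PostgreSQL-style "INSERT ... RETURNING id"
                obj.pk = self._auto

    def refetch_pk(self, name):
        # the subsequent query that does see the ids
        return next(pk for pk, n in self.rows.items() if n == name)
```

On the MySQL/MariaDB path, `obj.pk` stays `None` after `bulk_create` even though the row was inserted; on the RETURNING path the in-memory object is populated immediately.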
[Pulp-dev] Concerns about bulk_create and PostgreSQL
When we switched from UUID to integers for the PK we lost the ability to have the primary key set during bulk_create with databases other than PostgreSQL [0].

With a goal of database agnosticism for Pulp3, if plugin writers plan to use bulk_create with any object inherited from one of ours, they will get different behaviors on different databases and they won't have PKs that they may require. bulk_create is a normal django thing, so plugin writers making a django plugin should be able to use it. This concerned me already, but today it was also brought up by non-RH plugin writers [1] in a PR.

The tradeoffs between UUIDs versus integer PKs are pretty well summed up in our ticket where we discussed that change [2]. Note, we did not consider this bulk_create downside at that time, which I think is the most significant downside to consider.

Having bulk_create effectively not available for plugin writers (since we can't rely on its pks being returned) I think is a non-starter for me. I love how short the UUIDs made our URLs, so that's the tradeoff mainly in my mind. Those balanced against each other, I think we should switch back.

Another option is to become PostgreSQL only, which (though I love psql) I think would be the wrong choice for Pulp from what I've heard from its users.

What do you think? What should we do?

[0]: https://docs.djangoproject.com/en/2.1/ref/models/querysets/#bulk-create
[1]: https://github.com/pulp/pulp/pull/3764#discussion_r234780702
[2]: https://pulp.plan.io/issues/3848
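The reason switching back to UUIDs resolves the bulk_create problem can be shown in a few lines: the key is generated in application code before the INSERT, so the database never needs to report anything back. This is a stdlib-only sketch; the `Content` class is illustrative, not Pulp's actual model.

```python
import uuid

# With client-generated UUID PKs, every object already has a usable pk
# before any database round-trip, on any backend — no RETURNING needed.
class Content:
    def __init__(self, name):
        self.pk = uuid.uuid4()  # assigned in code, not by the database
        self.name = name

units = [Content("rpm"), Content("deb"), Content("file")]
pks = {u.pk for u in units}
```

In Django terms this corresponds to a `UUIDField(primary_key=True, default=uuid.uuid4)` on the base model, which is why bulk_create with UUID PKs behaves the same on PostgreSQL and MySQL/MariaDB.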