Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL

2019-01-16 Thread Daniel Alley
Brian, that's an excellent writup, thanks!

>From what I can tell, it looks very very unlikely that MySQL will add the
"returning" syntax.  MariaDB however has supported "returning" for DELETE
operations *only* for about 5 years, and has a 2 year old issue to add it
for "INSERT" https://jira.mariadb.org/browse/MDEV-10014

However, it also seems like Django takes the "common denominator" approach
towards supporting them (since they are so similar, it treats them as the
same database), so even if one or the other were to add support, I doubt
that functionality would be exposed through the ORM unless they *both* were
to support it.  Therefore I think we can assume we won't see that
functionality any time soon.

I'll probably do the performance investigation as described in #4290 some
time late next week.  I'd like for some of the refactoring that has been
going on in the stages API to be completed before I proceed with that.

On Wed, Jan 16, 2019 at 4:11 PM Brian Bouterse  wrote:

> Today @jortel, @dalley, @daviddavis, and @asmacdo been taking a look at
> our options in terms of resolving the PK issue that prevents Pulp from
> running on MariaDB. After considering various of possible solutions on a
> call, and @dalleydoing a little bit of research into the feasibility of
> those, I believe there are only two options:
>
> A) switch to UUIDs or B) implement a fallback that saves one at a time if
> there's no "RETURNING".
>
> In reflecting on this, I have two observations about option (B). First,
> the root cause of this is that MariaDB doesn't implement the "RETURNING"
> option like PostgreSQL and Oracle does. The best solution would be for
> MariaDB to do that. If they aren't able to fix it where the problem is
> created, investing a good deal of time and energy into how to make Pulp run
> fast on MariaDB doesn't seem worth it to me. Secondly, our fallback to
> one-by-one mode would effectively cause Pulp sync on MariaDB to experience
> the slowdowns described in this ticket [0] which were generally identified
> as unacceptable for Pulp users. In terms of timeline, since we had talked
> about MariaDB compatibility being an RC blocker, I don't think we want to
> wait for MariaDB to make any changes, like an implementation for
> "RETURNING".
>
> In thinking about option (A) it resolves the compatibility issue and
> allows plugin writers to avoid the "my dbs don't work the same with
> bulk_create" problem entirely, so it's a great solution to the central
> problem (db compatibility). My only concern with (A) is that we know it is
> a some amount slower in terms of write performance, read performance, and
> index size concerns. @dalley has an issue written up [1] to look at the
> impact of adopting (A) on Pulp workloads.
>
> So the next step would be to complete [1] and share the results. After
> looking at them we would yes/no the adopting of UUIDs as likely our only
> feasible resolution. Any other ideas on what to do or how to do it are
> welcome.
>
> [0]: https://pulp.plan.io/issues/3770
> [1]: https://pulp.plan.io/issues/4290
>
> -Brian
>
>
>
> On Wed, Jan 9, 2019 at 1:41 PM Matthias Dellweg  wrote:
>
>> On Wed, 9 Jan 2019 08:46:18 -0500
>> David Davis  wrote:
>>
>> > The Rubygems api includes sha as part of the metadata for a gem.
>> > Couldn't you use that as part of the natural key?
>> >
>> > I'm surprised that Chef's supermarket API doesn't include this as
>> > well. Maybe we could open a feature request?
>> >
>> > David
>>
>> WRT rubygems i use repositories generated by `gem generate_index`.
>> There is no chechsum in the metadata that i found.
>> But also the whole thing seems to be documented rather poorly.
>> Using rubygems.org as a remote is not adviseable until there are proper
>> filters implemented. ;)
>>   Matthias
>> ___
>> Pulp-dev mailing list
>> Pulp-dev@redhat.com
>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>
> ___
> Pulp-dev mailing list
> Pulp-dev@redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-dev
>
___
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev


Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL

2019-01-16 Thread Brian Bouterse
Today @jortel, @dalley, @daviddavis, and @asmacdo been taking a look at our
options in terms of resolving the PK issue that prevents Pulp from running
on MariaDB. After considering various of possible solutions on a call, and
@dalleydoing a little bit of research into the feasibility of those, I
believe there are only two options:

A) switch to UUIDs or B) implement a fallback that saves one at a time if
there's no "RETURNING".

In reflecting on this, I have two observations about option (B). First, the
root cause of this is that MariaDB doesn't implement the "RETURNING" option
like PostgreSQL and Oracle does. The best solution would be for MariaDB to
do that. If they aren't able to fix it where the problem is created,
investing a good deal of time and energy into how to make Pulp run fast on
MariaDB doesn't seem worth it to me. Secondly, our fallback to one-by-one
mode would effectively cause Pulp sync on MariaDB to experience the
slowdowns described in this ticket [0] which were generally identified as
unacceptable for Pulp users. In terms of timeline, since we had talked
about MariaDB compatibility being an RC blocker, I don't think we want to
wait for MariaDB to make any changes, like an implementation for
"RETURNING".

In thinking about option (A) it resolves the compatibility issue and allows
plugin writers to avoid the "my dbs don't work the same with bulk_create"
problem entirely, so it's a great solution to the central problem (db
compatibility). My only concern with (A) is that we know it is a some
amount slower in terms of write performance, read performance, and index
size concerns. @dalley has an issue written up [1] to look at the impact of
adopting (A) on Pulp workloads.

So the next step would be to complete [1] and share the results. After
looking at them we would yes/no the adopting of UUIDs as likely our only
feasible resolution. Any other ideas on what to do or how to do it are
welcome.

[0]: https://pulp.plan.io/issues/3770
[1]: https://pulp.plan.io/issues/4290

-Brian



On Wed, Jan 9, 2019 at 1:41 PM Matthias Dellweg  wrote:

> On Wed, 9 Jan 2019 08:46:18 -0500
> David Davis  wrote:
>
> > The Rubygems api includes sha as part of the metadata for a gem.
> > Couldn't you use that as part of the natural key?
> >
> > I'm surprised that Chef's supermarket API doesn't include this as
> > well. Maybe we could open a feature request?
> >
> > David
>
> WRT rubygems i use repositories generated by `gem generate_index`.
> There is no chechsum in the metadata that i found.
> But also the whole thing seems to be documented rather poorly.
> Using rubygems.org as a remote is not adviseable until there are proper
> filters implemented. ;)
>   Matthias
> ___
> Pulp-dev mailing list
> Pulp-dev@redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-dev
>
___
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev


Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL

2019-01-09 Thread Matthias Dellweg
On Wed, 9 Jan 2019 08:46:18 -0500
David Davis  wrote:

> The Rubygems api includes sha as part of the metadata for a gem.
> Couldn't you use that as part of the natural key?
> 
> I'm surprised that Chef's supermarket API doesn't include this as
> well. Maybe we could open a feature request?
> 
> David

WRT rubygems i use repositories generated by `gem generate_index`.
There is no chechsum in the metadata that i found.
But also the whole thing seems to be documented rather poorly.
Using rubygems.org as a remote is not adviseable until there are proper
filters implemented. ;)
  Matthias


pgpyQz2C8Cd5M.pgp
Description: OpenPGP digital signature
___
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev


Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL

2019-01-09 Thread Simon Baatz
On Wed, Jan 09, 2019 at 08:46:18AM -0500, David Davis wrote:
>The Rubygems api includes sha as part of the metadata for a gem.
>Couldn't you use that as part of the natural key?
>I'm surprised that Chef's supermarket API doesn't include this as well.
>Maybe we could open a feature request?

W.r.t.  the Supermarket API I haven't found anything (neither in the
documentation nor in the actual responses).  We could open a feature
request, but there isn't only the official Supermarket, there is also
Chef Server, Minimart, (the deprecated) berkshelf API server and
JFrog Artifactory.  I wouldn't like to depend on a cutting edge
feature of the Supermarket API, which other upstream servers have to
incorporate to be usable with pulp_cookbook.

Moreover, the current metadata is provided as one big chunk of data
(single request to 'universe' endpoint).  The digest would probably be
obtained by extending the response to GET detailed information about
a specific cookbook version (i.e.  name and version).  I plan to
enrich metadata using this endpoint at some point in time, but I want
to make it optional (for Chef Supermarket, this would cause thousands
of requests during an initial repo sync even when using on_demand
policy)


>David
> 
>On Tue, Jan 8, 2019 at 2:50 PM Simon Baatz <[1]gmbno...@gmail.com>
>wrote:
> 
>  On 08.01.2019 17:16, Jeff Ortel wrote:
>  >
>  >
>  > On 1/3/19 1:28 PM, Simon Baatz wrote:
>  >> On Thu, Jan 03, 2019 at 01:02:57PM -0500, David Davis wrote:
>  >>> I don't think that using integer ids with bulk_create and
>  >>> supporting
>  >>> mysql/mariadb are necessarily mutually exclusive. I think
>  there
>  >>> might
>  >>> be a way to find the records created using bulk_create if we
>  >>> know the
>  >>> natural key. It might be more performant than using UUIDs as
>  well.
>  >> This assumes that there is a natural key.  For content types with
>  no
>  >> digest information in the meta data, there may be a natural key
>  >> for content within a repo version only, but no natural key for
>  the
>  >> overall content.  (If we want to support non-immediate modes for
>  such
>  >> content.  In immediate mode, a digest can be computed from the
>  >> associated artifact(s)).
>  >
>  > Can you give some examples of Content without a natural key?
>  For example, the meta-data obtained for Cookbooks is "version" and
>  "name" (the same seems to apply to Ruby Gems). With immediate sync
>  policy, we can add a digest to each content unit as we know the
>  digest
>  of the associated artifact. Thus, the natural key is "version",
>  "name",
>  and "digest"
>  In "non-immediate mode", we only have "version" and "name" to work
>  with
>  during sync. Now, there is a trade-off (I think) and we have the
>  following possibilities:
>  1. Just pretend that "version" and "name" are unique. We have a
>  natural
>  key, but it may lead to the cross-repo effects that I described a
>  while
>  ago on the list.
>  2. Use "version" and "name" as natural key within a repo version,
>  but
>  not globally. In this scenario, it may turn out that two content
>  units
>  are actually the same after downloading.
>  I prefer option 2: Content sharing is not perfect, but as a user, I
>  don't have to fear that repositories mix-up content that happens to
>  have
>  the same name and version.
>  There is also an extension of 2., which allows content sharing
>  during
>  sync for immediate mode. Define a "pseudo" natural key on global
>  content level: "version", "name" and "digest". "digest" may be null.
>  Two
>  content units are considered the same if they match in all three
>  attributes and these attributes are not null. But even in immediate
>  mode, the artifact will not be downloaded if "name" and "version"
>  are
>  already present in the repository version the sync is based on. A
>  pipeline for this could look like:
>  def pipeline_stages(self, new_version):
>  pipeline = [
>  self.first_stage,
>  QueryExistingContentUnits(new_version=new_version),
>  ExistingContentNeedsNoArtifacts()
>  ]
>  if self.download_artifacts:
>  pipeline.extend([ArtifactDownloader(), ArtifactSaver(),
>   UpdateContentWithDownloadResult(),
>  QueryExistingContentUnits()])
>  pipeline.extend([ContentUnitSaver()])
>  return pipeline
>  QueryExistingContentUnits(new_version=new_version) associates based
>  on
>  the "repo version key",
>  QueryExistingContentUnits() associates globally based on the "pseudo
>  natural key" (digest must be set to match at all)
>  

Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL

2019-01-09 Thread David Davis
The Rubygems api includes sha as part of the metadata for a gem. Couldn't
you use that as part of the natural key?

I'm surprised that Chef's supermarket API doesn't include this as well.
Maybe we could open a feature request?

David


On Tue, Jan 8, 2019 at 2:50 PM Simon Baatz  wrote:

> On 08.01.2019 17:16, Jeff Ortel wrote:
> >
> >
> > On 1/3/19 1:28 PM, Simon Baatz wrote:
> >> On Thu, Jan 03, 2019 at 01:02:57PM -0500, David Davis wrote:
> >>> I don't think that using integer ids with bulk_create and
> >>> supporting
> >>> mysql/mariadb are necessarily mutually exclusive. I think there
> >>> might
> >>> be a way to find the records created using bulk_create if we
> >>> know the
> >>> natural key. It might be more performant than using UUIDs as well.
> >> This assumes that there is a natural key.  For content types with no
> >> digest information in the meta data, there may be a natural key
> >> for content within a repo version only, but no natural key for the
> >> overall content.  (If we want to support non-immediate modes for such
> >> content.  In immediate mode, a digest can be computed from the
> >> associated artifact(s)).
> >
> > Can you give some examples of Content without a natural key?
>
> For example, the meta-data obtained for Cookbooks is "version" and
> "name" (the same seems to apply to Ruby Gems). With immediate sync
> policy, we can add a digest to each content unit as we know the digest
> of the associated artifact. Thus, the natural key is "version", "name",
> and "digest"
>
> In "non-immediate mode", we only have "version" and "name" to work with
> during sync. Now, there is a trade-off (I think) and we have the
> following possibilities:
>
> 1. Just pretend that "version" and "name" are unique. We have a natural
> key, but it may lead to the cross-repo effects that I described a while
> ago on the list.
> 2. Use "version" and "name" as natural key within a repo version, but
> not globally. In this scenario, it may turn out that two content units
> are actually the same after downloading.
>
> I prefer option 2: Content sharing is not perfect, but as a user, I
> don't have to fear that repositories mix-up content that happens to have
> the same name and version.
>
> There is also an extension of 2., which allows content sharing during
> sync for immediate mode. Define a "pseudo" natural key on global
> content level: "version", "name" and "digest". "digest" may be null. Two
> content units are considered the same if they match in all three
> attributes and these attributes are not null. But even in immediate
> mode, the artifact will not be downloaded if "name" and "version" are
> already present in the repository version the sync is based on. A
> pipeline for this could look like:
>
> def pipeline_stages(self, new_version):
> pipeline = [
> self.first_stage,
> QueryExistingContentUnits(new_version=new_version),
> ExistingContentNeedsNoArtifacts()
> ]
> if self.download_artifacts:
> pipeline.extend([ArtifactDownloader(), ArtifactSaver(),
>  UpdateContentWithDownloadResult(),
> QueryExistingContentUnits()])
> pipeline.extend([ContentUnitSaver()])
> return pipeline
>
> QueryExistingContentUnits(new_version=new_version) associates based on
> the "repo version key",
> QueryExistingContentUnits() associates globally based on the "pseudo
> natural key" (digest must be set to match at all)
>
> ___
> Pulp-dev mailing list
> Pulp-dev@redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-dev
>
___
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev


Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL

2019-01-09 Thread Matthias Dellweg
I also want to vote for solution 2.
Only thing i want to add is multi-tenancy capability. The repository of
one user should really not be (from a functionality view point)
affected by actions on another repository. Sharing the actual files for
performance is desireable, but sharing content units might be a problem
in any case.

On Tue, 8 Jan 2019 20:49:41 +0100
Simon Baatz  wrote:

> On 08.01.2019 17:16, Jeff Ortel wrote:
> >
> >
> > On 1/3/19 1:28 PM, Simon Baatz wrote:  
> >> On Thu, Jan 03, 2019 at 01:02:57PM -0500, David Davis wrote:  
> >>>     I don't think that using integer ids with bulk_create and
> >>> supporting
> >>>     mysql/mariadb are necessarily mutually exclusive. I think
> >>> there might
> >>>     be a way to find the records created using bulk_create if we
> >>> know the
> >>>     natural key. It might be more performant than using UUIDs as
> >>> well.  
> >> This assumes that there is a natural key.  For content types with
> >> no digest information in the meta data, there may be a natural key
> >> for content within a repo version only, but no natural key for the
> >> overall content.  (If we want to support non-immediate modes for
> >> such content.  In immediate mode, a digest can be computed from the
> >> associated artifact(s)).  
> >
> > Can you give some examples of Content without a natural key?  
> 
> For example, the meta-data obtained for Cookbooks is "version" and
> "name" (the same seems to apply to Ruby Gems). With immediate sync
> policy, we can add a digest to each content unit as we know the digest
> of the associated artifact. Thus, the natural key is "version",
> "name", and "digest"
> 
> In "non-immediate mode", we only have "version" and "name" to work
> with during sync. Now, there is a trade-off (I think) and we have the
> following possibilities:
> 
> 1. Just pretend that "version" and "name" are unique. We have a
> natural key, but it may lead to the cross-repo effects that I
> described a while ago on the list.
> 2. Use "version" and "name" as natural key within a repo version, but
> not globally. In this scenario, it may turn out that two content units
> are actually the same after downloading.
> 
> I prefer option 2: Content sharing is not perfect, but as a user, I
> don't have to fear that repositories mix-up content that happens to
> have the same name and version.
> 
> There is also an extension of 2., which allows content sharing during
> sync for immediate mode. Define a "pseudo" natural key on global 
> content level: "version", "name" and "digest". "digest" may be null.
> Two content units are considered the same if they match in all three
> attributes and these attributes are not null. But even in immediate
> mode, the artifact will not be downloaded if "name" and "version" are
> already present in the repository version the sync is based on. A
> pipeline for this could look like:
> 
>     def pipeline_stages(self, new_version):
>     pipeline = [
>     self.first_stage,
>     QueryExistingContentUnits(new_version=new_version),
>     ExistingContentNeedsNoArtifacts()
>     ]
>     if self.download_artifacts:
>     pipeline.extend([ArtifactDownloader(), ArtifactSaver(),
>  UpdateContentWithDownloadResult(),
> QueryExistingContentUnits()])
>     pipeline.extend([ContentUnitSaver()])
>     return pipeline
> 
> QueryExistingContentUnits(new_version=new_version) associates based on
> the "repo version key",
> QueryExistingContentUnits() associates globally based on the "pseudo
> natural key" (digest must be set to match at all)
> 
> ___
> Pulp-dev mailing list
> Pulp-dev@redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-dev


pgpEZ85uiTF85.pgp
Description: OpenPGP digital signature
___
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev


Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL

2019-01-08 Thread Simon Baatz
On 08.01.2019 17:16, Jeff Ortel wrote:
>
>
> On 1/3/19 1:28 PM, Simon Baatz wrote:
>> On Thu, Jan 03, 2019 at 01:02:57PM -0500, David Davis wrote:
>>>     I don't think that using integer ids with bulk_create and
>>> supporting
>>>     mysql/mariadb are necessarily mutually exclusive. I think there
>>> might
>>>     be a way to find the records created using bulk_create if we
>>> know the
>>>     natural key. It might be more performant than using UUIDs as well.
>> This assumes that there is a natural key.  For content types with no
>> digest information in the meta data, there may be a natural key
>> for content within a repo version only, but no natural key for the
>> overall content.  (If we want to support non-immediate modes for such
>> content.  In immediate mode, a digest can be computed from the
>> associated artifact(s)).
>
> Can you give some examples of Content without a natural key?

For example, the meta-data obtained for Cookbooks is "version" and
"name" (the same seems to apply to Ruby Gems). With immediate sync
policy, we can add a digest to each content unit as we know the digest
of the associated artifact. Thus, the natural key is "version", "name",
and "digest"

In "non-immediate mode", we only have "version" and "name" to work with
during sync. Now, there is a trade-off (I think) and we have the
following possibilities:

1. Just pretend that "version" and "name" are unique. We have a natural
key, but it may lead to the cross-repo effects that I described a while
ago on the list.
2. Use "version" and "name" as natural key within a repo version, but
not globally. In this scenario, it may turn out that two content units
are actually the same after downloading.

I prefer option 2: Content sharing is not perfect, but as a user, I
don't have to fear that repositories mix-up content that happens to have
the same name and version.

There is also an extension of 2., which allows content sharing during
sync for immediate mode. Define a "pseudo" natural key on global 
content level: "version", "name" and "digest". "digest" may be null. Two
content units are considered the same if they match in all three
attributes and these attributes are not null. But even in immediate
mode, the artifact will not be downloaded if "name" and "version" are
already present in the repository version the sync is based on. A
pipeline for this could look like:

    def pipeline_stages(self, new_version):
    pipeline = [
    self.first_stage,
    QueryExistingContentUnits(new_version=new_version),
    ExistingContentNeedsNoArtifacts()
    ]
    if self.download_artifacts:
    pipeline.extend([ArtifactDownloader(), ArtifactSaver(),
 UpdateContentWithDownloadResult(),
QueryExistingContentUnits()])
    pipeline.extend([ContentUnitSaver()])
    return pipeline

QueryExistingContentUnits(new_version=new_version) associates based on
the "repo version key",
QueryExistingContentUnits() associates globally based on the "pseudo
natural key" (digest must be set to match at all)

___
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev


Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL

2019-01-08 Thread Jeff Ortel




On 1/3/19 1:28 PM, Simon Baatz wrote:

On Thu, Jan 03, 2019 at 01:02:57PM -0500, David Davis wrote:

I don't think that using integer ids with bulk_create and supporting
mysql/mariadb are necessarily mutually exclusive. I think there might
be a way to find the records created using bulk_create if we know the
natural key. It might be more performant than using UUIDs as well.

This assumes that there is a natural key.  For content types with no
digest information in the meta data, there may be a natural key
for content within a repo version only, but no natural key for the
overall content.  (If we want to support non-immediate modes for such
content.  In immediate mode, a digest can be computed from the
associated artifact(s)).


Can you give some examples of Content without a natural key?



Of course, there are ways around that (use a UUID as the "natural" key,
or add a UUID to the repo version key fields), but I would like to
avoid that.


On Thu, Jan 3, 2019 at 11:04 AM Dennis Kliban <[1]dkli...@redhat.com>
wrote:

Thank you Daniel for the explanation and for filing an issue[0] to do
performance analysis of UUIDs.
I really hope that we can switch back to using UUIDs so we can bring
back MariaDB support for Pulp 3.
[0] [2]https://pulp.plan.io/issues/4290

On Wed, Dec 5, 2018 at 1:35 PM Daniel Alley <[3]dal...@redhat.com>
wrote:

To rephrase the problem a little bit:
We need to bulk_create() a bunch of objects, and then after we do that
we want to immediately be able to relate them with other objects, which
means we need their PKs of the objects that were just created.
In the case of auto-increment integer PKs, we can't know that PK value
before it gets saved into the database.  Luckily, PostgreSQL (and
Oracle) support a "RETURNING" keyword that does provides this
information.  The raw SQL would look something like this:

INSERT INTO items (name) values ('bear') RETURNING id;

Django uses this feature to set the PK field on the model objects it
returns when you call bulk_create() on a list of unsaved model objects.
Unfortunately, MySQL doesn't support this, so there's no way to figure
out what the PKs of the objects you just saved were, so the ORM can't
set that information on the returned model objects.
UUID PKs circumvent this because the PK gets created outside of the
database, prior to being saved in the database, and so Django *can*
know what the PK will be when it gets saved.

On Wed, Dec 5, 2018 at 12:11 PM Brian Bouterse <[4]bbout...@redhat.com>
wrote:

+1 to experimentation and also making sure that we understand the
performance implications of the decision. I'm replying to this earlier
note to restate my observations of the problem a bit more.
More ideas and thoughts are welcome. This is a decision with a lot of
aspects to consider.
On Tue, Nov 20, 2018 at 10:00 AM Patrick Creech <[5]pcre...@redhat.com>
wrote:

  On Mon, 2018-11-19 at 17:08 -0500, Brian Bouterse wrote:
  > When we switched from UUID to integers for the PK
  > with databases other than PostgreSQL [0].
  >
  > With a goal of database agnosticism for Pulp3, if plugin writers
  plan to use bulk_create with any object inherited
  > from one of ours, they can't will get different behaviors on
  different databases and they won't have PKs that they may
  > require. bulk_create is a normal django thing, so plugin writers
  making a django plugin should be able to use it. This
  > concerned me already, but today it was also brought up by non-RH
  plugin writers also [1] in a PR.
  >
  > The tradeoffs bteween UUIDs versus PKs are pretty well summed up
  in our ticket where we discussed that change [2].
  > Note, we did not consider this bulk_create downside at that time,
  which I think is the most significant downside to
  > consider.
  >
  > Having bulk_create effectively not available for plugin writers
  (since we can't rely on its pks being returned) I
  > think is a non-starter for me. I love how short the UUIDs made our
  URLs so that's the tradeoff mainly in my mind.
  > Those balanced against each other, I think we should switch back.
  >
  > Another option is to become PostgreSQL only which (though I love
  psql) I think would be the wrong choice for Pulp from
  > what I've heard from its users.
  >
  > What do you think? What should we do?
  So, my mind immediately goes to this question, which might be
  usefull for others to help make decisions, so I'll ask:
  When you say:
  "we lost the ability to have the primary key set during bulk_create"
  Can you clarify what you mean by this?
  My mind immediately goes to this chain of events:
  When you use bulk_create, the existing in-memory model
  objects representing the 

Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL

2019-01-03 Thread David Davis
I think adding a separate UUID column seems like a feasible workaround for
Content without a natural key. But I can understand the desire not to do so
in order to create an artificial natural key.

I guess we can weigh the performance results of using UUID PKs against how
often we expect plugin Content to have no natural key.

David


On Thu, Jan 3, 2019 at 2:28 PM Simon Baatz  wrote:

> On Thu, Jan 03, 2019 at 01:02:57PM -0500, David Davis wrote:
> >I don't think that using integer ids with bulk_create and supporting
> >mysql/mariadb are necessarily mutually exclusive. I think there might
> >be a way to find the records created using bulk_create if we know the
> >natural key. It might be more performant than using UUIDs as well.
>
> This assumes that there is a natural key.  For content types with no
> digest information in the meta data, there may be a natural key
> for content within a repo version only, but no natural key for the
> overall content.  (If we want to support non-immediate modes for such
> content.  In immediate mode, a digest can be computed from the
> associated artifact(s)).
>
> Of course, there are ways around that (use a UUID as the "natural" key,
> or add a UUID to the repo version key fields), but I would like to
> avoid that.
>
> >On Thu, Jan 3, 2019 at 11:04 AM Dennis Kliban <[1]dkli...@redhat.com>
> >wrote:
> >
> >Thank you Daniel for the explanation and for filing an issue[0] to do
> >performance analysis of UUIDs.
> >I really hope that we can switch back to using UUIDs so we can bring
> >back MariaDB support for Pulp 3.
> >[0] [2]https://pulp.plan.io/issues/4290
> >
> >On Wed, Dec 5, 2018 at 1:35 PM Daniel Alley <[3]dal...@redhat.com>
> >wrote:
> >
> >To rephrase the problem a little bit:
> >We need to bulk_create() a bunch of objects, and then after we do that
> >we want to immediately be able to relate them with other objects,
> which
> >means we need their PKs of the objects that were just created.
> >In the case of auto-increment integer PKs, we can't know that PK value
> >before it gets saved into the database.  Luckily, PostgreSQL (and
> >Oracle) support a "RETURNING" keyword that does provides this
> >information.  The raw SQL would look something like this:
> >
> > INSERT INTO items (name) values ('bear') RETURNING id;
> >
> >Django uses this feature to set the PK field on the model objects it
> >returns when you call bulk_create() on a list of unsaved model
> objects.
> >Unfortunately, MySQL doesn't support this, so there's no way to figure
> >out what the PKs of the objects you just saved were, so the ORM can't
> >set that information on the returned model objects.
> >UUID PKs circumvent this because the PK gets created outside of the
> >database, prior to being saved in the database, and so Django *can*
> >know what the PK will be when it gets saved.
> >
> >On Wed, Dec 5, 2018 at 12:11 PM Brian Bouterse <[4]
> bbout...@redhat.com>
> >wrote:
> >
> >+1 to experimentation and also making sure that we understand the
> >performance implications of the decision. I'm replying to this earlier
> >note to restate my observations of the problem a bit more.
> >More ideas and thoughts are welcome. This is a decision with a lot of
> >aspects to consider.
> >On Tue, Nov 20, 2018 at 10:00 AM Patrick Creech <[5]
> pcre...@redhat.com>
> >wrote:
> >
> >  On Mon, 2018-11-19 at 17:08 -0500, Brian Bouterse wrote:
> >  > When we switched from UUID to integers for the PK
> >  > with databases other than PostgreSQL [0].
> >  >
> >  > With a goal of database agnosticism for Pulp3, if plugin writers
> >  plan to use bulk_create with any object inherited
> >  > from one of ours, they can't will get different behaviors on
> >  different databases and they won't have PKs that they may
> >  > require. bulk_create is a normal django thing, so plugin writers
> >  making a django plugin should be able to use it. This
> >  > concerned me already, but today it was also brought up by non-RH
> >  plugin writers also [1] in a PR.
> >  >
> >  > The tradeoffs bteween UUIDs versus PKs are pretty well summed up
> >  in our ticket where we discussed that change [2].
> >  > Note, we did not consider this bulk_create downside at that time,
> >  which I think is the most significant downside to
> >  > consider.
> >  >
> >  > Having bulk_create effectively not available for plugin writers
> >  (since we can't rely on its pks being returned) I
> >  > think is a non-starter for me. I love how short the UUIDs made our
> >  URLs so that's the tradeoff mainly in my mind.
> >  > Those balanced against each other, I think we should switch back.
> >  >
> >  > Another option is to become PostgreSQL only which (though I love
> >  psql) I think would be the wrong 

Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL

2019-01-03 Thread Simon Baatz
On Thu, Jan 03, 2019 at 01:02:57PM -0500, David Davis wrote:
>I don't think that using integer ids with bulk_create and supporting
>mysql/mariadb are necessarily mutually exclusive. I think there might
>be a way to find the records created using bulk_create if we know the
>natural key. It might be more performant than using UUIDs as well.

This assumes that there is a natural key.  For content types with no
digest information in the meta data, there may be a natural key
for content within a repo version only, but no natural key for the
overall content.  (If we want to support non-immediate modes for such
content.  In immediate mode, a digest can be computed from the
associated artifact(s)).

Of course, there are ways around that (use a UUID as the "natural" key,
or add a UUID to the repo version key fields), but I would like to
avoid that.

>On Thu, Jan 3, 2019 at 11:04 AM Dennis Kliban <[1]dkli...@redhat.com>
>wrote:
> 
>Thank you Daniel for the explanation and for filing an issue[0] to do
>performance analysis of UUIDs.
>I really hope that we can switch back to using UUIDs so we can bring
>back MariaDB support for Pulp 3.
>[0] [2]https://pulp.plan.io/issues/4290
> 
>On Wed, Dec 5, 2018 at 1:35 PM Daniel Alley <[3]dal...@redhat.com>
>wrote:
> 
>To rephrase the problem a little bit:
>We need to bulk_create() a bunch of objects, and then after we do that
>we want to immediately be able to relate them with other objects, which
>means we need their PKs of the objects that were just created.
>In the case of auto-increment integer PKs, we can't know that PK value
>before it gets saved into the database.  Luckily, PostgreSQL (and
>Oracle) support a "RETURNING" keyword that does provides this
>information.  The raw SQL would look something like this:
> 
> INSERT INTO items (name) values ('bear') RETURNING id;
> 
>Django uses this feature to set the PK field on the model objects it
>returns when you call bulk_create() on a list of unsaved model objects.
>Unfortunately, MySQL doesn't support this, so there's no way to figure
>out what the PKs of the objects you just saved were, so the ORM can't
>set that information on the returned model objects.
>UUID PKs circumvent this because the PK gets created outside of the
>database, prior to being saved in the database, and so Django *can*
>know what the PK will be when it gets saved.
> 
>On Wed, Dec 5, 2018 at 12:11 PM Brian Bouterse <[4]bbout...@redhat.com>
>wrote:
> 
>+1 to experimentation and also making sure that we understand the
>performance implications of the decision. I'm replying to this earlier
>note to restate my observations of the problem a bit more.
>More ideas and thoughts are welcome. This is a decision with a lot of
>aspects to consider.
>On Tue, Nov 20, 2018 at 10:00 AM Patrick Creech <[5]pcre...@redhat.com>
>wrote:
> 
>  On Mon, 2018-11-19 at 17:08 -0500, Brian Bouterse wrote:
>  > When we switched from UUID to integers for the PK
>  > with databases other than PostgreSQL [0].
>  >
>  > With a goal of database agnosticism for Pulp3, if plugin writers
>  plan to use bulk_create with any object inherited
>  > from one of ours, they can't will get different behaviors on
>  different databases and they won't have PKs that they may
>  > require. bulk_create is a normal django thing, so plugin writers
>  making a django plugin should be able to use it. This
>  > concerned me already, but today it was also brought up by non-RH
>  plugin writers also [1] in a PR.
>  >
>  > The tradeoffs bteween UUIDs versus PKs are pretty well summed up
>  in our ticket where we discussed that change [2].
>  > Note, we did not consider this bulk_create downside at that time,
>  which I think is the most significant downside to
>  > consider.
>  >
>  > Having bulk_create effectively not available for plugin writers
>  (since we can't rely on its pks being returned) I
>  > think is a non-starter for me. I love how short the UUIDs made our
>  URLs so that's the tradeoff mainly in my mind.
>  > Those balanced against each other, I think we should switch back.
>  >
>  > Another option is to become PostgreSQL only which (though I love
>  psql) I think would be the wrong choice for Pulp from
>  > what I've heard from its users.
>  >
>  > What do you think? What should we do?
>  So, my mind immediately goes to this question, which might be
>  usefull for others to help make decisions, so I'll ask:
>  When you say:
>  "we lost the ability to have the primary key set during bulk_create"
>  Can you clarify what you mean by this?
>  My mind immediately goes to this chain of events:
>  When you use bulk_create, the existing in-memory model
>  objects representing the data to create 

Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL

2019-01-03 Thread David Davis
I don't think that using integer ids with bulk_create and supporting
mysql/mariadb are necessarily mutually exclusive. I think there might be a
way to find the records created using bulk_create if we know the natural
key. It might be more performant than using UUIDs as well.

David


On Thu, Jan 3, 2019 at 11:04 AM Dennis Kliban  wrote:

> Thank you Daniel for the explanation and for filing an issue[0] to do
> performance analysis of UUIDs.
>
> I really hope that we can switch back to using UUIDs so we can bring back
> MariaDB support for Pulp 3.
>
> [0] https://pulp.plan.io/issues/4290
>
> On Wed, Dec 5, 2018 at 1:35 PM Daniel Alley  wrote:
>
>> To rephrase the problem a little bit:
>>
>> We need to bulk_create() a bunch of objects, and then after we do that we
>> want to immediately be able to relate them with other objects, which means
>> we need their PKs of the objects that were just created.
>>
>> In the case of auto-increment integer PKs, we can't know that PK value
>> before it gets saved into the database.  Luckily, PostgreSQL (and Oracle)
>> support a "RETURNING" keyword that does provides this information.  The raw
>> SQL would look something like this:
>>
>>> INSERT INTO items (name) values ('bear') RETURNING id;
>>>
>>> Django uses this feature to set the PK field on the model objects it
>> returns when you call bulk_create() on a list of unsaved model objects.
>>
>> Unfortunately, MySQL doesn't support this, so there's no way to figure
>> out what the PKs of the objects you just saved were, so the ORM can't set
>> that information on the returned model objects.
>>
>> UUID PKs circumvent this because the PK gets created outside of the
>> database, prior to being saved in the database, and so Django *can* know
>> what the PK will be when it gets saved.
>>
>> On Wed, Dec 5, 2018 at 12:11 PM Brian Bouterse 
>> wrote:
>>
>>> +1 to experimentation and also making sure that we understand the
>>> performance implications of the decision. I'm replying to this earlier note
>>> to restate my observations of the problem a bit more.
>>>
>>> More ideas and thoughts are welcome. This is a decision with a lot of
>>> aspects to consider.
>>>
>>>
>>> On Tue, Nov 20, 2018 at 10:00 AM Patrick Creech 
>>> wrote:
>>>
 On Mon, 2018-11-19 at 17:08 -0500, Brian Bouterse wrote:
 > When we switched from UUID to integers for the PK
 > with databases other than PostgreSQL [0].
 >
 > With a goal of database agnosticism for Pulp3, if plugin writers plan
 to use bulk_create with any object inherited
 > from one of ours, they can't will get different behaviors on
 different databases and they won't have PKs that they may
 > require. bulk_create is a normal django thing, so plugin writers
 making a django plugin should be able to use it. This
 > concerned me already, but today it was also brought up by non-RH
 plugin writers also [1] in a PR.
 >
 > The tradeoffs bteween UUIDs versus PKs are pretty well summed up in
 our ticket where we discussed that change [2].
 > Note, we did not consider this bulk_create downside at that time,
 which I think is the most significant downside to
 > consider.
 >
 > Having bulk_create effectively not available for plugin writers
 (since we can't rely on its pks being returned) I
 > think is a non-starter for me. I love how short the UUIDs made our
 URLs so that's the tradeoff mainly in my mind.
 > Those balanced against each other, I think we should switch back.
 >
 > Another option is to become PostgreSQL only which (though I love
 psql) I think would be the wrong choice for Pulp from
 > what I've heard from its users.
 >
 > What do you think? What should we do?

 So, my mind immediately goes to this question, which might be usefull
 for others to help make decisions, so I'll ask:

 When you say:

 "we lost the ability to have the primary key set during bulk_create"

 Can you clarify what you mean by this?

 My mind immediately goes to this chain of events:

 When you use bulk_create, the existing in-memory model objects
 representing the data to create do not get
 updated with the primary key values that are created in the database.

 Upon a subsequent query of the database, for the exact same set
 of objects just added, those objects _will_ have
 the primary key populated.

 In other words,

 The database records themselves get the auto-increment IDs
 added, they just don't get reported back in that
 query to the ORM layer, therefore it takes a subsequent query to get
 those ids out.

 Does that about sum it up?

>>>
>>> Yes this describes the situation, but there is a bit more to tell. Since
>>> PostgreSQL does return the ids the subsequent query that could be done to
>>> get the ids isn't written in code today. We didn't need to 

Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL

2019-01-03 Thread Dennis Kliban
Thank you Daniel for the explanation and for filing an issue[0] to do
performance analysis of UUIDs.

I really hope that we can switch back to using UUIDs so we can bring back
MariaDB support for Pulp 3.

[0] https://pulp.plan.io/issues/4290

On Wed, Dec 5, 2018 at 1:35 PM Daniel Alley  wrote:

> To rephrase the problem a little bit:
>
> We need to bulk_create() a bunch of objects, and then after we do that we
> want to immediately be able to relate them with other objects, which means
> we need their PKs of the objects that were just created.
>
> In the case of auto-increment integer PKs, we can't know that PK value
> before it gets saved into the database.  Luckily, PostgreSQL (and Oracle)
> support a "RETURNING" keyword that does provides this information.  The raw
> SQL would look something like this:
>
>> INSERT INTO items (name) values ('bear') RETURNING id;
>>
>> Django uses this feature to set the PK field on the model objects it
> returns when you call bulk_create() on a list of unsaved model objects.
>
> Unfortunately, MySQL doesn't support this, so there's no way to figure out
> what the PKs of the objects you just saved were, so the ORM can't set that
> information on the returned model objects.
>
> UUID PKs circumvent this because the PK gets created outside of the
> database, prior to being saved in the database, and so Django *can* know
> what the PK will be when it gets saved.
>
> On Wed, Dec 5, 2018 at 12:11 PM Brian Bouterse 
> wrote:
>
>> +1 to experimentation and also making sure that we understand the
>> performance implications of the decision. I'm replying to this earlier note
>> to restate my observations of the problem a bit more.
>>
>> More ideas and thoughts are welcome. This is a decision with a lot of
>> aspects to consider.
>>
>>
>> On Tue, Nov 20, 2018 at 10:00 AM Patrick Creech 
>> wrote:
>>
>>> On Mon, 2018-11-19 at 17:08 -0500, Brian Bouterse wrote:
>>> > When we switched from UUID to integers for the PK
>>> > with databases other than PostgreSQL [0].
>>> >
>>> > With a goal of database agnosticism for Pulp3, if plugin writers plan
>>> to use bulk_create with any object inherited
>>> > from one of ours, they can't will get different behaviors on different
>>> databases and they won't have PKs that they may
>>> > require. bulk_create is a normal django thing, so plugin writers
>>> making a django plugin should be able to use it. This
>>> > concerned me already, but today it was also brought up by non-RH
>>> plugin writers also [1] in a PR.
>>> >
>>> > The tradeoffs bteween UUIDs versus PKs are pretty well summed up in
>>> our ticket where we discussed that change [2].
>>> > Note, we did not consider this bulk_create downside at that time,
>>> which I think is the most significant downside to
>>> > consider.
>>> >
>>> > Having bulk_create effectively not available for plugin writers (since
>>> we can't rely on its pks being returned) I
>>> > think is a non-starter for me. I love how short the UUIDs made our
>>> URLs so that's the tradeoff mainly in my mind.
>>> > Those balanced against each other, I think we should switch back.
>>> >
>>> > Another option is to become PostgreSQL only which (though I love psql)
>>> I think would be the wrong choice for Pulp from
>>> > what I've heard from its users.
>>> >
>>> > What do you think? What should we do?
>>>
>>> So, my mind immediately goes to this question, which might be usefull
>>> for others to help make decisions, so I'll ask:
>>>
>>> When you say:
>>>
>>> "we lost the ability to have the primary key set during bulk_create"
>>>
>>> Can you clarify what you mean by this?
>>>
>>> My mind immediately goes to this chain of events:
>>>
>>> When you use bulk_create, the existing in-memory model objects
>>> representing the data to create do not get
>>> updated with the primary key values that are created in the database.
>>>
>>> Upon a subsequent query of the database, for the exact same set
>>> of objects just added, those objects _will_ have
>>> the primary key populated.
>>>
>>> In other words,
>>>
>>> The database records themselves get the auto-increment IDs
>>> added, they just don't get reported back in that
>>> query to the ORM layer, therefore it takes a subsequent query to get
>>> those ids out.
>>>
>>> Does that about sum it up?
>>>
>>
>> Yes this describes the situation, but there is a bit more to tell. Since
>> PostgreSQL does return the ids the subsequent query that could be done to
>> get the ids isn't written in code today. We didn't need to because we
>> developed it against PostgreSQL. I'm pretty sure that if you configure Pulp
>> against MySQL Pulp won't work, which I think is a problem. So I'm observing
>> two things here. 1) This is a hazard that causes code to unexpectedly be
>> only compliant with PostgreSQL. 2) Pulp itself fell into this hazard and we
>> need to fix that too
>>
>> Do you also see these two issues? What should be done about these?
>>
>>
>>>
>>> >
>>> > [0]:
>>> 

Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL

2018-12-14 Thread Brian Bouterse
@dalley that is an excellent writeup of what is going on.

So since we could be benefitting from PostgreSQL-only Django features, I
think we should add MySQL to the test matrix to ensure that at least
pulp/pulp and pulp/pulpcore-plugin are SQL agnostic. I made this issue here
to track that work. https://pulp.plan.io/issues/4270

Please raise any concerns or ideas on what we could do better for this.

On Wed, Dec 5, 2018 at 1:34 PM Daniel Alley  wrote:

> To rephrase the problem a little bit:
>
> We need to bulk_create() a bunch of objects, and then after we do that we
> want to immediately be able to relate them with other objects, which means
> we need their PKs of the objects that were just created.
>
> In the case of auto-increment integer PKs, we can't know that PK value
> before it gets saved into the database.  Luckily, PostgreSQL (and Oracle)
> support a "RETURNING" keyword that does provides this information.  The raw
> SQL would look something like this:
>
>> INSERT INTO items (name) values ('bear') RETURNING id;
>>
>> Django uses this feature to set the PK field on the model objects it
> returns when you call bulk_create() on a list of unsaved model objects.
>
> Unfortunately, MySQL doesn't support this, so there's no way to figure out
> what the PKs of the objects you just saved were, so the ORM can't set that
> information on the returned model objects.
>
> UUID PKs circumvent this because the PK gets created outside of the
> database, prior to being saved in the database, and so Django *can* know
> what the PK will be when it gets saved.
>
> On Wed, Dec 5, 2018 at 12:11 PM Brian Bouterse 
> wrote:
>
>> +1 to experimentation and also making sure that we understand the
>> performance implications of the decision. I'm replying to this earlier note
>> to restate my observations of the problem a bit more.
>>
>> More ideas and thoughts are welcome. This is a decision with a lot of
>> aspects to consider.
>>
>>
>> On Tue, Nov 20, 2018 at 10:00 AM Patrick Creech 
>> wrote:
>>
>>> On Mon, 2018-11-19 at 17:08 -0500, Brian Bouterse wrote:
>>> > When we switched from UUID to integers for the PK
>>> > with databases other than PostgreSQL [0].
>>> >
>>> > With a goal of database agnosticism for Pulp3, if plugin writers plan
>>> to use bulk_create with any object inherited
>>> > from one of ours, they can't will get different behaviors on different
>>> databases and they won't have PKs that they may
>>> > require. bulk_create is a normal django thing, so plugin writers
>>> making a django plugin should be able to use it. This
>>> > concerned me already, but today it was also brought up by non-RH
>>> plugin writers also [1] in a PR.
>>> >
>>> > The tradeoffs bteween UUIDs versus PKs are pretty well summed up in
>>> our ticket where we discussed that change [2].
>>> > Note, we did not consider this bulk_create downside at that time,
>>> which I think is the most significant downside to
>>> > consider.
>>> >
>>> > Having bulk_create effectively not available for plugin writers (since
>>> we can't rely on its pks being returned) I
>>> > think is a non-starter for me. I love how short the UUIDs made our
>>> URLs so that's the tradeoff mainly in my mind.
>>> > Those balanced against each other, I think we should switch back.
>>> >
>>> > Another option is to become PostgreSQL only which (though I love psql)
>>> I think would be the wrong choice for Pulp from
>>> > what I've heard from its users.
>>> >
>>> > What do you think? What should we do?
>>>
>>> So, my mind immediately goes to this question, which might be usefull
>>> for others to help make decisions, so I'll ask:
>>>
>>> When you say:
>>>
>>> "we lost the ability to have the primary key set during bulk_create"
>>>
>>> Can you clarify what you mean by this?
>>>
>>> My mind immediately goes to this chain of events:
>>>
>>> When you use bulk_create, the existing in-memory model objects
>>> representing the data to create do not get
>>> updated with the primary key values that are created in the database.
>>>
>>> Upon a subsequent query of the database, for the exact same set
>>> of objects just added, those objects _will_ have
>>> the primary key populated.
>>>
>>> In other words,
>>>
>>> The database records themselves get the auto-increment IDs
>>> added, they just don't get reported back in that
>>> query to the ORM layer, therefore it takes a subsequent query to get
>>> those ids out.
>>>
>>> Does that about sum it up?
>>>
>>
>> Yes this describes the situation, but there is a bit more to tell. Since
>> PostgreSQL does return the ids the subsequent query that could be done to
>> get the ids isn't written in code today. We didn't need to because we
>> developed it against PostgreSQL. I'm pretty sure that if you configure Pulp
>> against MySQL Pulp won't work, which I think is a problem. So I'm observing
>> two things here. 1) This is a hazard that causes code to unexpectedly be
>> only compliant with PostgreSQL. 2) Pulp itself fell 

Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL

2018-12-05 Thread Daniel Alley
To rephrase the problem a little bit:

We need to bulk_create() a bunch of objects, and then after we do that we
want to immediately be able to relate them with other objects, which means
we need their PKs of the objects that were just created.

In the case of auto-increment integer PKs, we can't know that PK value
before it gets saved into the database.  Luckily, PostgreSQL (and Oracle)
support a "RETURNING" keyword that does provides this information.  The raw
SQL would look something like this:

> INSERT INTO items (name) values ('bear') RETURNING id;
>
> Django uses this feature to set the PK field on the model objects it
returns when you call bulk_create() on a list of unsaved model objects.

Unfortunately, MySQL doesn't support this, so there's no way to figure out
what the PKs of the objects you just saved were, so the ORM can't set that
information on the returned model objects.

UUID PKs circumvent this because the PK gets created outside of the
database, prior to being saved in the database, and so Django *can* know
what the PK will be when it gets saved.

On Wed, Dec 5, 2018 at 12:11 PM Brian Bouterse  wrote:

> +1 to experimentation and also making sure that we understand the
> performance implications of the decision. I'm replying to this earlier note
> to restate my observations of the problem a bit more.
>
> More ideas and thoughts are welcome. This is a decision with a lot of
> aspects to consider.
>
>
> On Tue, Nov 20, 2018 at 10:00 AM Patrick Creech 
> wrote:
>
>> On Mon, 2018-11-19 at 17:08 -0500, Brian Bouterse wrote:
>> > When we switched from UUID to integers for the PK
>> > with databases other than PostgreSQL [0].
>> >
>> > With a goal of database agnosticism for Pulp3, if plugin writers plan
>> to use bulk_create with any object inherited
>> > from one of ours, they can't will get different behaviors on different
>> databases and they won't have PKs that they may
>> > require. bulk_create is a normal django thing, so plugin writers making
>> a django plugin should be able to use it. This
>> > concerned me already, but today it was also brought up by non-RH plugin
>> writers also [1] in a PR.
>> >
>> > The tradeoffs bteween UUIDs versus PKs are pretty well summed up in our
>> ticket where we discussed that change [2].
>> > Note, we did not consider this bulk_create downside at that time, which
>> I think is the most significant downside to
>> > consider.
>> >
>> > Having bulk_create effectively not available for plugin writers (since
>> we can't rely on its pks being returned) I
>> > think is a non-starter for me. I love how short the UUIDs made our URLs
>> so that's the tradeoff mainly in my mind.
>> > Those balanced against each other, I think we should switch back.
>> >
>> > Another option is to become PostgreSQL only which (though I love psql)
>> I think would be the wrong choice for Pulp from
>> > what I've heard from its users.
>> >
>> > What do you think? What should we do?
>>
>> So, my mind immediately goes to this question, which might be usefull for
>> others to help make decisions, so I'll ask:
>>
>> When you say:
>>
>> "we lost the ability to have the primary key set during bulk_create"
>>
>> Can you clarify what you mean by this?
>>
>> My mind immediately goes to this chain of events:
>>
>> When you use bulk_create, the existing in-memory model objects
>> representing the data to create do not get
>> updated with the primary key values that are created in the database.
>>
>> Upon a subsequent query of the database, for the exact same set
>> of objects just added, those objects _will_ have
>> the primary key populated.
>>
>> In other words,
>>
>> The database records themselves get the auto-increment IDs added,
>> they just don't get reported back in that
>> query to the ORM layer, therefore it takes a subsequent query to get
>> those ids out.
>>
>> Does that about sum it up?
>>
>
> Yes this describes the situation, but there is a bit more to tell. Since
> PostgreSQL does return the ids the subsequent query that could be done to
> get the ids isn't written in code today. We didn't need to because we
> developed it against PostgreSQL. I'm pretty sure that if you configure Pulp
> against MySQL Pulp won't work, which I think is a problem. So I'm observing
> two things here. 1) This is a hazard that causes code to unexpectedly be
> only compliant with PostgreSQL. 2) Pulp itself fell into this hazard and we
> need to fix that too
>
> Do you also see these two issues? What should be done about these?
>
>
>>
>> >
>> > [0]:
>> https://docs.djangoproject.com/en/2.1/ref/models/querysets/#bulk-create
>> > [1]: https://github.com/pulp/pulp/pull/3764#discussion_r234780702
>> > [2]: https://pulp.plan.io/issues/3848
>> > ___
>> > Pulp-dev mailing list
>> > Pulp-dev@redhat.com
>> > https://www.redhat.com/mailman/listinfo/pulp-dev
>>
>> ___
>> Pulp-dev mailing list
>> 

Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL

2018-12-05 Thread Brian Bouterse
+1 to experimentation and also making sure that we understand the
performance implications of the decision. I'm replying to this earlier note
to restate my observations of the problem a bit more.

More ideas and thoughts are welcome. This is a decision with a lot of
aspects to consider.


On Tue, Nov 20, 2018 at 10:00 AM Patrick Creech  wrote:

> On Mon, 2018-11-19 at 17:08 -0500, Brian Bouterse wrote:
> > When we switched from UUID to integers for the PK
> > with databases other than PostgreSQL [0].
> >
> > With a goal of database agnosticism for Pulp3, if plugin writers plan to
> use bulk_create with any object inherited
> > from one of ours, they can't will get different behaviors on different
> databases and they won't have PKs that they may
> > require. bulk_create is a normal django thing, so plugin writers making
> a django plugin should be able to use it. This
> > concerned me already, but today it was also brought up by non-RH plugin
> writers also [1] in a PR.
> >
> > The tradeoffs bteween UUIDs versus PKs are pretty well summed up in our
> ticket where we discussed that change [2].
> > Note, we did not consider this bulk_create downside at that time, which
> I think is the most significant downside to
> > consider.
> >
> > Having bulk_create effectively not available for plugin writers (since
> we can't rely on its pks being returned) I
> > think is a non-starter for me. I love how short the UUIDs made our URLs
> so that's the tradeoff mainly in my mind.
> > Those balanced against each other, I think we should switch back.
> >
> > Another option is to become PostgreSQL only which (though I love psql) I
> think would be the wrong choice for Pulp from
> > what I've heard from its users.
> >
> > What do you think? What should we do?
>
> So, my mind immediately goes to this question, which might be usefull for
> others to help make decisions, so I'll ask:
>
> When you say:
>
> "we lost the ability to have the primary key set during bulk_create"
>
> Can you clarify what you mean by this?
>
> My mind immediately goes to this chain of events:
>
> When you use bulk_create, the existing in-memory model objects
> representing the data to create do not get
> updated with the primary key values that are created in the database.
>
> Upon a subsequent query of the database, for the exact same set of
> objects just added, those objects _will_ have
> the primary key populated.
>
> In other words,
>
> The database records themselves get the auto-increment IDs added,
> they just don't get reported back in that
> query to the ORM layer, therefore it takes a subsequent query to get those
> ids out.
>
> Does that about sum it up?
>

Yes this describes the situation, but there is a bit more to tell. Since
PostgreSQL does return the ids the subsequent query that could be done to
get the ids isn't written in code today. We didn't need to because we
developed it against PostgreSQL. I'm pretty sure that if you configure Pulp
against MySQL Pulp won't work, which I think is a problem. So I'm observing
two things here. 1) This is a hazard that causes code to unexpectedly be
only compliant with PostgreSQL. 2) Pulp itself fell into this hazard and we
need to fix that too

Do you also see these two issues? What should be done about these?


>
> >
> > [0]:
> https://docs.djangoproject.com/en/2.1/ref/models/querysets/#bulk-create
> > [1]: https://github.com/pulp/pulp/pull/3764#discussion_r234780702
> > [2]: https://pulp.plan.io/issues/3848
> > ___
> > Pulp-dev mailing list
> > Pulp-dev@redhat.com
> > https://www.redhat.com/mailman/listinfo/pulp-dev
>
> ___
> Pulp-dev mailing list
> Pulp-dev@redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-dev
>
___
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev


Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL

2018-12-05 Thread Patrick Creech
On Wed, 2018-12-05 at 09:34 -0500, Daniel Alley wrote:
> Perhaps, but it's not a -1 so much as a call for more experimentation and 
> testing.  I wouldn't feel comfortable saying
> Pulp is MySQL "compatible" if (if!) it was an order of magnitude slower than 
> Pulp on Postgres, and we never found out
> about that because we never tested it... I think that kind of "compatibility" 
> would be a net negative to Pulp.  So I
> just want us to make sure these bases are covered if we're going to make that 
> claim.
> 
> s/it's chart chart/it's a pretty nasty chart

It shows the beginnings of an exponential curve of performance degredation as 
your record count grows.  yes, nasty.

> 
> On Wed, Dec 5, 2018 at 9:27 AM Dennis Kliban  wrote:
> > It looks like the chart was generated using MySQL 5.0.45 which was released 
> > at least 10 years ago[0]. I don't think
> > we can rely on such old results. 
> > 
> > [0] https://en.wikipedia.org/wiki/MySQL#Milestones

https://mariadb.com/kb/en/library/guiduuid-performance/

MariaDB (the MySQL fork after oracle took over), still has concerns about 
guid/uuid performance.  The article here is
about 1-2 years old, and it spells out some of the still current-ish concerns 
with utilizing uuids/guids in the mysql
world.

I would argue that yes, you should at least pay attention to these old results, 
as far as a "this used to be a problem,
lets make sure it still isn't"

Most of these int vs uuid/guid concerns in the database world still apply 
today, and either side has nuanced tradeoffs
for each technology (postgresql, mysql/mariadb, oracle, sqlserver, etc...) that 
will affect wich choice is best.

FWIW, _generally_ the compromise solutions I've seen are where the actual pk is 
an int, and isn't exposed outside of the
db/model layer.  There is also an attached UUID/GUID that is code-generated 
that is treated as an object reference in
the api/code layers.  There are other solutions out there as well.

> > 
> > On Wed, Dec 5, 2018 at 9:18 AM Daniel Alley  wrote:
> > > I just want to point out that using UUID PKs works perfectly fine on 
> > > PostgreSQL but is considered a Bad Idea™ on
> > > MySQL for performance reasons.  
> > > 
> > > http://kccoder.com/mysql/uuid-vs-int-insert-performance/
> > > 
> > > 
> > > 
> > > It's hard to notice at first, but the blue and red lines (representing 
> > > integer PKs) are tracking near the bottom.
> > > 
> > > I did my testing with PostgreSQL, and I would completely agree that the 
> > > tiny performance hit we noticed there
> > > would take a backseat to the functional benefits Brian is pointing out.  
> > > But if we really, truly want to be
> > > database agnostic, we should put more thought into this change (and 
> > > others going forwards).
> > > 
> > > Another factor that makes this a more complicated decision is that the 
> > > limitations on using bulk_create() with
> > > multi-table models are more of a "simplification" on the Django side than 
> > > a fundamental limitation.  According to
> > > this comment [0] in the Django source code, and this issue [1] it's 
> > > likely possible on PostgreSQL as-is, if we
> > > were willing to mess around inside the ORM a bit.  And it could be 
> > > possible on MySQL also *if* we used UUID PKs. 
> > > And maybe the performance benefits of being able to use bulk_create() 
> > > would override or reduce the performance
> > > downsides of using UUID with MySQL.  I don't know about that though... 
> > > that's chart chart and without some
> > > experimentation this is all speculation.
> > > 
> > > TL;DR If we want to stay DB agnostic it needs to be worked into our 
> > > decision making process and not be an
> > > afterthought

This ^.  While on the surface the orm layer helps provide you with the ability 
to say you're db agnostic, to truly have
the same experience across database technologies will require up front research 
and informed choices.

> > > 
> > > [0] 
> > > https://github.com/django/django/blob/master/django/db/models/query.py#L438
> > > [1] https://code.djangoproject.com/ticket/28821
> > > 
> > > 

___
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev


Re: [Pulp-dev] Concerns about bulk_create and PostgreSQL

2018-11-20 Thread Patrick Creech
On Mon, 2018-11-19 at 17:08 -0500, Brian Bouterse wrote:
> When we switched from UUID to integers for the PK
> with databases other than PostgreSQL [0].
> 
> With a goal of database agnosticism for Pulp3, if plugin writers plan to use 
> bulk_create with any object inherited
> from one of ours, they can't will get different behaviors on different 
> databases and they won't have PKs that they may
> require. bulk_create is a normal django thing, so plugin writers making a 
> django plugin should be able to use it. This
> concerned me already, but today it was also brought up by non-RH plugin 
> writers also [1] in a PR.
> 
> The tradeoffs bteween UUIDs versus PKs are pretty well summed up in our 
> ticket where we discussed that change [2].
> Note, we did not consider this bulk_create downside at that time, which I 
> think is the most significant downside to
> consider.
> 
> Having bulk_create effectively not available for plugin writers (since we 
> can't rely on its pks being returned) I
> think is a non-starter for me. I love how short the UUIDs made our URLs so 
> that's the tradeoff mainly in my mind.
> Those balanced against each other, I think we should switch back.
> 
> Another option is to become PostgreSQL only which (though I love psql) I 
> think would be the wrong choice for Pulp from
> what I've heard from its users.
> 
> What do you think? What should we do?

So, my mind immediately goes to this question, which might be usefull for 
others to help make decisions, so I'll ask:

When you say: 

"we lost the ability to have the primary key set during bulk_create"

Can you clarify what you mean by this?

My mind immediately goes to this chain of events:

When you use bulk_create, the existing in-memory model objects 
representing the data to create do not get
updated with the primary key values that are created in the database.  

Upon a subsequent query of the database, for the exact same set of 
objects just added, those objects _will_ have
the primary key populated.

In other words, 

The database records themselves get the auto-increment IDs added, they 
just don't get reported back in that
query to the ORM layer, therefore it takes a subsequent query to get those ids 
out.

Does that about sum it up?


> 
> [0]: https://docs.djangoproject.com/en/2.1/ref/models/querysets/#bulk-create
> [1]: https://github.com/pulp/pulp/pull/3764#discussion_r234780702
> [2]: https://pulp.plan.io/issues/3848
> ___
> Pulp-dev mailing list
> Pulp-dev@redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-dev

___
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev


[Pulp-dev] Concerns about bulk_create and PostgreSQL

2018-11-19 Thread Brian Bouterse
When we switched from UUID to integers for the PK we lost the ability to
have the primary key set during bulk_create with databases other than
PostgreSQL [0].

With a goal of database agnosticism for Pulp3, if plugin writers plan to
use bulk_create with any object inherited from one of ours, they can't will
get different behaviors on different databases and they won't have PKs that
they may require. bulk_create is a normal django thing, so plugin writers
making a django plugin should be able to use it. This concerned me already,
but today it was also brought up by non-RH plugin writers also [1] in a PR.

The tradeoffs bteween UUIDs versus PKs are pretty well summed up in our
ticket where we discussed that change [2]. Note, we did not consider this
bulk_create downside at that time, which I think is the most significant
downside to consider.

Having bulk_create effectively not available for plugin writers (since we
can't rely on its pks being returned) I think is a non-starter for me. I
love how short the UUIDs made our URLs so that's the tradeoff mainly in my
mind. Those balanced against each other, I think we should switch back.

Another option is to become PostgreSQL only which (though I love psql) I
think would be the wrong choice for Pulp from what I've heard from its
users.

What do you think? What should we do?

[0]: https://docs.djangoproject.com/en/2.1/ref/models/querysets/#bulk-create
[1]: https://github.com/pulp/pulp/pull/3764#discussion_r234780702
[2]: https://pulp.plan.io/issues/3848
___
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev