Re: [Pulp-dev] Integer IDs in Pulp 3

Brian Bouterse Thu, 12 Jul 2018 05:47:28 -0700

I'm +1 on grooming that ticket and sprint nominating it. I commented on a
question there about how to handle RQ.


On Wed, Jul 11, 2018 at 4:53 PM, Dennis Kliban <[email protected]> wrote:

> Thanks David. I am in favor of this change.
>
> On Wed, Jul 11, 2018 at 4:39 PM, David Davis <[email protected]>
> wrote:
>
>> There is now:
>>
>> https://pulp.plan.io/issues/3848
>>
>> David
>>
>>
>> On Wed, Jul 11, 2018 at 4:23 PM Brian Bouterse <[email protected]>
>> wrote:
>>
>>> A 30% improvement I think is a good case for integers over uuids.
>>>
>>> Is there a ticket tracking that change?
>>>
>>> On Wed, Jul 11, 2018 at 3:55 PM, Daniel Alley <[email protected]> wrote:
>>>
>>>> w/ creating 400,000 units, the non-uuid PK is 30% faster at 42.22
>>>> seconds vs. 55.98 seconds.
>>>>
>>>> w/ searching through the same 400,000 units, performance is still about
>>>> 30% faster.  Doing a filter for file content units that have a
>>>> relative_path__startswith={some random letter} (I put UUIDs in all the
>>>> fields) takes about 0.44 seconds if the model has a UUID pk and about 0.33
>>>> seconds if the model has a default Django auto-incrementing PK.
>>>>
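For reference, the "30% faster" figures above can be read two ways, depending on whether the difference is measured against the UUID time or the integer time. The sketch below reruns that arithmetic using only the timings quoted in this thread; the helper names are made up for illustration.

```python
def time_saved(slow: float, fast: float) -> float:
    """Fraction of wall-clock time saved relative to the slower run."""
    return (slow - fast) / slow


def speedup(slow: float, fast: float) -> float:
    """Fractional throughput gain of the faster run over the slower one."""
    return slow / fast - 1


# Timings quoted above: (UUID PK seconds, integer PK seconds).
timings = {
    "create 400,000 units": (55.98, 42.22),
    "startswith filter": (0.44, 0.33),
}

for label, (uuid_s, int_s) in timings.items():
    print(
        f"{label}: {time_saved(uuid_s, int_s):.1%} less time, "
        f"{speedup(uuid_s, int_s):.1%} more throughput"
    )
```

Both readings bracket the quoted figure: roughly 25% less time, or roughly 33% more throughput.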
>>>> On Wed, Jul 11, 2018 at 11:03 AM, Daniel Alley <[email protected]>
>>>> wrote:
>>>>
>>>>> So, since I've already been working on some Pulp 3 benchmarking I
>>>>> decided to go ahead and benchmark this to get some actual data.
>>>>>
>>>>> Disclaimer:  The following data is using bulk_create() with a
>>>>> modified, flat, non-inheriting content model, not the multi-table
>>>>> inherited content model we're currently using.  Pulp 3 also does not
>>>>> use bulk_create() at the moment, though it likely will end up using it
>>>>> eventually.
>>>>>
>>>>> Using normal IDs instead of UUIDs was between 13% and 25% faster with
>>>>> 15,000 units.  15,000 units isn't really enough to actually test index
>>>>> performance, so I'm rerunning it with a few hundred thousand units, but
>>>>> that will take a substantial amount of time to run.  I'll follow up
>>>>> later.
>>>>>
>>>>> As far as search/update performance goes, that probably has better
>>>>> margins than just insert performance, but I'll need to write new code to
>>>>> benchmark that properly.
>>>>>
>>>>> On Thu, May 24, 2018 at 11:52 AM, David Davis <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Agreed on performance. Some more Googling turns up mixed opinions on
>>>>>> whether UUID performance is worse or not. If this is a significant
>>>>>> reason to switch, I agree we should test out the performance.
>>>>>>
>>>>>> Regarding the disk size, I think the cost of using UUIDs is cumulative.
>>>>>> Larger PKs mean bigger index sizes, bigger FKs, etc. I agree that it’s
>>>>>> probably not a major concern but I wouldn’t say it’s trivial.
>>>>>>
>>>>>> David
>>>>>>
>>>>>> On Thu, May 24, 2018 at 11:27 AM, Sean Myers <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Responses inline.
>>>>>>>
>>>>>>> On 05/23/2018 02:26 PM, David Davis wrote:
>>>>>>> > Before the release of Pulp 3.0 GA, I think it’s worth just checking in to
>>>>>>> > make sure we want to use UUIDs over integer based IDs. Changing from UUIDs
>>>>>>> > to ints would be a very easy change at this point (1-2 lines of code) but
>>>>>>> > after GA ships, it would be hard if not impossible to switch.
>>>>>>> >
>>>>>>> > I think there are a number of reasons why we might want to consider integer
>>>>>>> > IDs:
>>>>>>> >
>>>>>>> > - Better performance all around for inserts[0], searches, indexing, etc
>>>>>>>
>>>>>>> I don't really care either way, but it's worth pointing out that UUIDs are
>>>>>>> integers (in the sense that the entire internet can be reduced to a single
>>>>>>> integer since it's all just bits). To the best of my knowledge they are equally
>>>>>>> performant to integers and stored in similar ways in Postgres.
>>>>>>>
>>>>>>> You linked a MySQL experiment, done using a version of MySQL that is nearly 10
>>>>>>> years old. If there are concerns about the performance of UUID PKs vs. int PKs
>>>>>>> in Pulp, we should compare apples to apples and profile Pulp using UUID PKs,
>>>>>>> profile Pulp using integer PKs, and then compare the two.
>>>>>>>
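One way to set up that kind of like-for-like comparison is sketched below. It is only a stand-in: it uses Python's built-in sqlite3 module rather than Pulp, Django, or Postgres, and the table name, column names, and `time_inserts` helper are invented for illustration, so its timings say nothing about Pulp itself. The point is the shape of the test: same schema and same amount of data, with only the PK type varying.

```python
import sqlite3
import time
import uuid


def time_inserts(pk_kind: str, n_rows: int) -> float:
    """Insert n_rows into a fresh in-memory table and return elapsed seconds."""
    conn = sqlite3.connect(":memory:")
    if pk_kind == "int":
        conn.execute(
            "CREATE TABLE content (id INTEGER PRIMARY KEY, relative_path TEXT)"
        )
        rows = [(i, uuid.uuid4().hex) for i in range(1, n_rows + 1)]
    elif pk_kind == "uuid":
        # SQLite has no native UUID type, so the 16 raw bytes go in a BLOB.
        conn.execute(
            "CREATE TABLE content (id BLOB PRIMARY KEY, relative_path TEXT)"
        )
        rows = [(uuid.uuid4().bytes, uuid.uuid4().hex) for _ in range(n_rows)]
    else:
        raise ValueError(f"unknown pk_kind: {pk_kind!r}")

    start = time.perf_counter()
    with conn:  # commit the whole batch as one transaction
        conn.executemany("INSERT INTO content VALUES (?, ?)", rows)
    elapsed = time.perf_counter() - start

    # Sanity check that every row actually landed.
    (count,) = conn.execute("SELECT COUNT(*) FROM content").fetchone()
    assert count == n_rows
    conn.close()
    return elapsed


if __name__ == "__main__":
    for kind in ("int", "uuid"):
        print(kind, f"{time_inserts(kind, 100_000):.3f}s")
```

A real comparison would repeat each run several times and report the spread, since a single timing can easily fall inside the margin of error.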
>>>>>>> In my small-scale testing (100,000 randomly generated content rows of a
>>>>>>> proto-RPM content model, 1000 repositories randomly related to each, no db funny
>>>>>>> business beyond enforced uniqueness constraints), there was either no
>>>>>>> difference, or what difference there was fell into the margin of error.
>>>>>>>
>>>>>>> > - Less storage required (4 bytes for int vs 16 bytes for UUIDs)
>>>>>>>
>>>>>>> Well, okay...UUIDs are *huge* integers. But it's the length of an IPv6 address
>>>>>>> vs. the length of an IPv4 address. While it's true that 4 < 16, both are still
>>>>>>> pretty small. Trivially so, I think.
>>>>>>>
>>>>>>> Without taking relations into account, a table with a million rows should be a
>>>>>>> little less than twelve mega(mebi)bytes larger. Even at scale, the size
>>>>>>> difference is negligible, especially when compared to the size on disk of the
>>>>>>> actual content you'd need to be storing that those million rows represent.
>>>>>>>
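The "a little less than twelve mega(mebi)bytes" estimate can be checked with a few lines of arithmetic. This sketch counts only the raw key bytes quoted above (4 vs. 16) and ignores indexes, foreign keys, and per-row overhead, all of which would add to the real difference.

```python
INT_PK_BYTES = 4    # 32-bit integer key, per the figure quoted above
UUID_PK_BYTES = 16  # 128-bit UUID
ROWS = 1_000_000

extra_bytes = (UUID_PK_BYTES - INT_PK_BYTES) * ROWS
extra_mb = extra_bytes / 1000**2   # megabytes (powers of ten)
extra_mib = extra_bytes / 1024**2  # mebibytes (powers of two)

print(f"{extra_bytes:,} bytes = {extra_mb:.2f} MB = {extra_mib:.2f} MiB")
# prints: 12,000,000 bytes = 12.00 MB = 11.44 MiB
```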
>>>>>>> > - Hrefs would be shorter (e.g. /pulp/api/v3/repositories/1/)
>>>>>>> > - In line with other apps like Katello
>>>>>>>
>>>>>>> I think these two are definitely worth considering, though.
>>>>>>>
>>>>>>> > There are some downsides to consider though:
>>>>>>> >
>>>>>>> > - Integer ids expose info like how many records there are
>>>>>>>
>>>>>>> This was the main intent, if I recall correctly. UUID PKs are not:
>>>>>>> - monotonically increasing
>>>>>>> - variably sized (string length, not bit length)
>>>>>>>
>>>>>>> So an object's PK doesn't give you any indication of how many other objects may
>>>>>>> be in the same collection, and while the Hrefs are long, for any given resource
>>>>>>> they will always be a predictable size.
>>>>>>>
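A quick illustration of the "predictable size" point, using Python's standard uuid module. The href prefix is the one from the example earlier in the thread; the specific integer PKs are arbitrary values picked for the demonstration.

```python
import uuid

BASE = "/pulp/api/v3/repositories/"  # prefix from the example href above

# Integer PKs: the href gets longer as the PK gains digits.
int_hrefs = [f"{BASE}{pk}/" for pk in (1, 42, 100000)]

# UUID PKs: the canonical string form is always 36 characters
# (32 hex digits plus 4 hyphens), so every href has the same length.
uuid_hrefs = [f"{BASE}{uuid.uuid4()}/" for _ in range(3)]

print("int href lengths: ", sorted({len(h) for h in int_hrefs}))
print("uuid href lengths:", sorted({len(h) for h in uuid_hrefs}))
```

Random (version 4) UUIDs are used here, which is also why consecutive keys carry no ordering or count information.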
>>>>>>> The major downside is really that they're a pain in the butt to type out when
>>>>>>> compared to int PKs, so if users are in a situation where they do have to type
>>>>>>> these things out, I think something has gone wrong.
>>>>>>>
>>>>>>> If users typing in PKs can't be avoided, UUIDs probably should be avoided. I
>>>>>>> recognize that this is effectively a restatement of "Hrefs would be shorter" in
>>>>>>> the context of how that impacts the user.
>>>>>>>
>>>>>>> > - Can’t support sharding or multiple dbs (are we ever going to need this?)
>>>>>>>
>>>>>>> A very good question. To the best of my recollection this was never stated as a
>>>>>>> hard requirement; it was only ever mentioned like it is here, as a potential
>>>>>>> positive side-effect of UUID keys. If collision-avoidance is not desired, and
>>>>>>> will certainly never be desired, then a normal integer field would likely be a
>>>>>>> less astonishing[0] user experience, and therefore a better user experience.
>>>>>>>
>>>>>>> [0]: https://en.wikipedia.org/wiki/Principle_of_least_astonishment
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Pulp-dev mailing list
>>>>>>> [email protected]
>>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev