Re: [Pulp-dev] Performance testing results, autoincrement ID vs UUID primary keys

David Davis Thu, 07 Mar 2019 07:31:26 -0800

The changes to switch to UUIDs have been merged. I opened issues against
all the Pulp 3 plugins I could think of to update their docs. There may be
some other changes needed too though.


David


On Wed, Mar 6, 2019 at 9:18 AM David Davis <[email protected]> wrote:

> Since there seem to be no objections to switching to UUIDs, I’d like to
> propose we merge the PRs[0][1] that will switch core to use UUID PKs
> tomorrow (in 24 hours). After that, we'll open redmine issues to update
> plugins to use UUIDs.
>
> [0] https://github.com/pulp/pulpcore/pull/16
> [1] https://github.com/pulp/pulpcore-plugin/pull/69
>
> David
>
>
> On Tue, Mar 5, 2019 at 5:15 PM Jeff Ortel <[email protected]> wrote:
>
>> +1 to switching back to UUIDs for the reasons Brian gave.
>>
>> On 3/1/19 2:23 PM, Brian Bouterse wrote:
>>
>> I've finally gotten to read through the numbers and this thread. It is a
>> tradeoff but I am +1 for switching to UUIDs. I focus on the PostgreSQL UUID
>> vs int case because that is our default database. I don't think too much
>> about how things perform on MariaDB because they can improve their own
>> performance to catch up to PostgreSQL, which regularly performs better
>> afaict. I agree with the assessment of a roughly 30% slowdown in the large unit
>> cases for PostgreSQL. Still, I believe the advantages of switching to UUIDs
>> are worth it. Two main reasons stick out in my mind.
>>
>> 1. Our core code and all plugin code will always be compatible with
>> common db backends even when using bulk_create()
>> 2. We get database sharding with postgresql which you can only do with
>> UUID pks. I was advised this years ago by jcline.
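
Item 1 can be illustrated with a minimal sketch. It is not Pulp's code, and it uses Python's built-in sqlite3 module rather than Django's bulk_create(); the table name `content` and the helper `bulk_insert_with_uuids` are hypothetical. The idea it shows: when the application generates the key, every row's primary key is already known before the bulk insert, so nothing depends on the database reporting generated IDs back.

```python
import sqlite3
import uuid


def bulk_insert_with_uuids(conn, names):
    """Insert all rows with one executemany() call and return their keys.

    The keys are generated client-side, so they are known before the
    database sees the rows (hypothetical helper, for illustration only).
    """
    rows = [(str(uuid.uuid4()), name) for name in names]
    conn.executemany("INSERT INTO content (id, name) VALUES (?, ?)", rows)
    return [pk for pk, _ in rows]


# In-memory database with a text primary key holding the UUID string.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE content (id TEXT PRIMARY KEY, name TEXT NOT NULL)")

pks = bulk_insert_with_uuids(conn, ["a.iso", "b.iso", "c.iso"])

# Read the keys back to confirm they match what was generated up front.
stored = [row[0] for row in conn.execute("SELECT id FROM content")]
```

With an autoincrement key the same helper would have to ask the database which IDs it assigned, and whether a bulk insert can return them depends on the backend.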
>>
>> Performance and compatibility are a pretty classic trade-off. Overall
>> I've found that initial releases launch with less performance and improve
>> (often significantly) over time. Consider the interpreter pypy (not pypi).
>> It went from "roughly 2000x slower [at initial launch] than CPython, to
>> roughly 7x faster [now]" [0]. I think launching a Pulp 3.0 that is 30%
>> slower in the worst case but runs everywhere with zero "db-behavior
>> surprises" is worth it. Conversely, if we don't adopt UUIDs, how will we
>> address item 1 before the RC?
>>
>> @dawalker for the "can we have both" option, we probably can have some
>> db-specific codepaths, but I don't think an application-wide PK type
>> change controlled by a setting is feasible to support. The db-specific
>> codepaths are one way performance improves over time. For the initial
>> release, to keep things simple, I hope we don't have conditional database
>> codepaths.
>>
>> More discussion on this change is encouraged. Thanks @dalley so much for
>> all the detailed investigation!
>>
>> [0]:
>> https://morepypy.blogspot.com/2018/09/the-first-15-years-of-pypy.html
>>
>> Thank you,
>> Brian
>>
>> On Fri, Mar 1, 2019 at 2:51 PM Dana Walker <[email protected]> wrote:
>>
>>> As I brought up on irc, I don't know how manageable the added maintenance
>>> burden would be going forward, but I would prefer that we use some sort of
>>> setting to choose uuid or id depending on whether the database is MariaDB
>>> or PostgreSQL. I want us to work everywhere, but I'm really concerned about
>>> the impact on our users of a 30-40% drop in speed and storage efficiency.
>>>
>>> David wrote up a quick Proof of Concept after I brought this up but
>>> wasn't necessarily advocating it himself. I think Daniel and Dennis
>>> expressed some concerns. I'd like to see more people discussing it here,
>>> with reasoning/examples on how doable something like this could be.
>>>
>>> If it's not on the table, I understand, but want to make sure we've
>>> considered all reasonable options, and that might not be a simple binary of
>>> either/or.
>>>
>>> Thanks,
>>>
>>> --Dana
>>>
>>> Dana Walker
>>>
>>> Associate Software Engineer
>>>
>>> Red Hat
>>>
>>> <https://www.redhat.com>
>>> <https://red.ht/sig>
>>>
>>>
>>> On Fri, Mar 1, 2019 at 9:15 AM David Davis <[email protected]> wrote:
>>>
>>>> I just want to bump this thread. If we hope to make the Pulp 3 RC date,
>>>> we need feedback today.
>>>>
>>>> David
>>>>
>>>>
>>>> On Wed, Feb 27, 2019 at 5:09 PM Matt Pusateri <[email protected]> wrote:
>>>>
>>>>> Not sure whether Webyog (https://www.webyog.com/) will give a free
>>>>> Monyog license to an open source project, but that might help diagnose
>>>>> the MariaDB performance. Monyog is really nice; I wish it supported
>>>>> Postgres.
>>>>>
>>>>> Matt P.
>>>>>
>>>>> On Tue, Feb 26, 2019 at 7:23 PM Daniel Alley <[email protected]> wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> We've had an ongoing discussion about whether Pulp would be able to
>>>>>> perform acceptably if we switched back to UUID primary keys. I've
>>>>>> finished doing the performance testing and I *think* the answer is yes.
>>>>>> Although, to be honest, I'm not sure that I understand why, in the case
>>>>>> of MariaDB.
>>>>>>
>>>>>> I linked my testing methodology and results here:
>>>>>> https://pulp.plan.io/issues/4290#note-18
>>>>>>
>>>>>> To summarize, I tested the following:
>>>>>>
>>>>>> * How long it takes to perform subsequent large (lazy) syncs, with
>>>>>> lots of content in the database (100-400k content units)
>>>>>> * How long it takes to perform various small but important database
>>>>>> queries
>>>>>>
>>>>>> The results were oddly contradictory in some cases.
>>>>>>
>>>>>> The first four syncs (202,000 content units total) behaved mostly the
>>>>>> same on PostgreSQL whether it used an autoincrement or UUID primary key.
>>>>>> Subsequent syncs had a performance drop of 30-40%. Likewise, the
>>>>>> code snippets performed 30+% worse. Sync time scaled linearly"ish" with
>>>>>> the amount of content in the repository in both cases, which was a bit
>>>>>> surprising to me. The size of the database at the end was about 40%
>>>>>> larger with UUID primary keys, 736 MB vs 521 MB. The gap would be
>>>>>> smaller in typical usage when you consider that most content types have
>>>>>> more metadata than FileContent (what I was testing).
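
The two database sizes reported above can be turned into a relative growth figure with a few lines of arithmetic. This is only a sketch: the helper name `percent_increase` is hypothetical, and the 4-byte integer key width is an assumption about the autoincrement column, not something stated in the thread.

```python
import uuid


def percent_increase(baseline, value):
    """Relative growth of value over baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Database sizes reported in the message above, in MB.
autoincrement_mb = 521
uuid_mb = 736

growth = percent_increase(autoincrement_mb, uuid_mb)
print(f"UUID database is {growth:.1f}% larger")  # prints "UUID database is 41.3% larger"

# Raw key width: a UUID is 16 bytes. The integer width below is an
# assumption (a 32-bit autoincrement column), used only for comparison.
uuid_key_bytes = len(uuid.uuid4().bytes)
int_key_bytes = 4
```

Each primary key, and each foreign key and index entry that refers to it, carries the wider value, which is consistent with the gap shrinking for content types whose rows hold more metadata.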
>>>>>>
>>>>>> Autoincrement PostgreSQL (left) vs. UUID PostgreSQL (right) in diff
>>>>>> form
>>>>>> https://www.diffchecker.com/40AF8vvM
>>>>>>
>>>>>> With MariaDB the first sync was almost 80% slower than the first sync
>>>>>> w/ PostgreSQL, but every subsequent sync was as fast or faster, despite
>>>>>> the tests of specific queries performing multiple times worse.
>>>>>> Additionally the sync performance did not decrease as rapidly as it did
>>>>>> under PostgreSQL. With MariaDB, one of my test queries that worked fine
>>>>>> when backed by PostgreSQL ended up hanging endlessly and I had to cut it
>>>>>> off after 25 or so minutes. [0] I would consider that a blocker to
>>>>>> claiming we support MariaDB / MySQL.
>>>>>>
>>>>>> But overall I'm not sure how to interpret the fact that on one hand
>>>>>> the real-usage performance is equal or better, and on the other hand the
>>>>>> performance of some of the underlying queries is noticeably worse. Maybe
>>>>>> there's some weird caching going on in the backend, or the generated
>>>>>> indexes are different?
>>>>>>
>>>>>> UUID PostgreSQL (left) vs. UUID MariaDB (right) in diff form
>>>>>> https://www.diffchecker.com/W1nnIQgj
>>>>>>
>>>>>> I'd like to invite some discussion on this, but nothing I've
>>>>>> mentioned seems like it would be a problem for going forward with using
>>>>>> UUID primary keys in a general sense. If we're all in agreement about
>>>>>> that engineering decision then we can move forward with that work.
>>>>>>
>>>>>> [0] for *some* but not all repository versions.  No idea what's up
>>>>>> there.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Pulp-dev mailing list
>>>>>> [email protected]
>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>>
