Re: [Pulp-dev] Performance testing results, autoincrement ID vs UUID primary keys

Jeff Ortel Tue, 05 Mar 2019 14:15:06 -0800

+1 to switching back to UUIDs for the reasons Brian gave.


On 3/1/19 2:23 PM, Brian Bouterse wrote:

I've finally gotten to read through the numbers and this thread. It isa tradeoff but I am +1 for switching to UUIDs. I focus on thePostgreSQL UUID vs int case because that is our default database. Idon't think too much about how things perform on MariaDB because theycan improve their own performance to catch up to PostgreSQL whichregularly is performing better afaict. I agree with the assessment of30% ish slowdown in the large unit cases for PostgreSQL. Still, Ibelieve the advantages of switching to UUIDs are worth it. Two mainreasons stick out in my mind.

1. Our core code and all plugin code will always be compatible withcommon db backends even when using bulk_create()2. We get database sharding with postgresql which you can only do withUUID pks. I was advised this years ago by jcline.

Performance and compatibility are a pretty classic trade-off. OverallI've found that initial releases launch with less performance andimprove (often significantly) overtime. Consider the interpreter pypy(not pypi). It started "roughly 2000x slower [at initial launch] thanCPython, to roughly 7x faster [now]" [0]. Launching Pulp 3.0 that is30% slower in the worst-case but runs everywhere with zero"db-behavior surprises" I think is worth it. Also conversely, if wedon't adopt UUIDs, how will we address item 1 pre RC?

@dawalker for the "can we have both" option, we probably can have somedb-specific codepaths, but I don't think doing an application wide PKtype change as a setting is feasible to support. The db specificcodepaths are one way performance improves over time. For the initialrelease, to keep things simple I hope we don't have conditionaldatabase codepaths (for now).

More discussion on this change is encouraged. Thanks @dalley so muchfor all the detailed investigation!


[0]: https://morepypy.blogspot.com/2018/09/the-first-15-years-of-pypy.html

Thank you,
Brian

On Fri, Mar 1, 2019 at 2:51 PM Dana Walker <[email protected]<mailto:[email protected]>> wrote:


    As I brought up on irc, I don't know how feasible the
    complications to maintenance would be going forward, but I would
    prefer if we could use some sort of settings in order to choose
    uuid or id based on MariaDB or PostgreSQL.  I want us to work
    everywhere, but I'm really concerned at the impact to our users of
    a 30-40% efficiency drop in speed and storage.

    David wrote up a quick Proof of Concept after I brought this up
    but wasn't necessarily advocating it himself.  I think Daniel and
    Dennis expressed some concerns.  I'd like to see more people
    discussing it here with reasoning/examples on how doable something
    like this could be?

    If it's not on the table, I understand, but want to make sure
    we've considered all reasonable options, and that might not be a
    simple binary of either/or.

    Thanks,

    --Dana

    Dana Walker

    Associate Software Engineer

    Red Hat

    <https://www.redhat.com>

    <https://red.ht/sig>



    On Fri, Mar 1, 2019 at 9:15 AM David Davis <[email protected]
    <mailto:[email protected]>> wrote:

        I just want to bump this thread. If we hope to make the Pulp 3
        RC date, we need feedback today.

        David


        On Wed, Feb 27, 2019 at 5:09 PM Matt Pusateri
        <[email protected] <mailto:[email protected]>> wrote:

            Not sure if https://www.webyog.com/ Monyog will give a
            free opensource project license.  But that might help
            diagnose the MariaDB performance.  Monyog is really nice,
            I wish it supported Postgres.

            Matt P.

            On Tue, Feb 26, 2019 at 7:23 PM Daniel Alley
            <[email protected] <mailto:[email protected]>> wrote:

                Hello all,

                We've had an ongoing discussion about whether Pulp
                would be able to perform acceptably if we switched
                back to UUID primary keys.  I've finished doing the
                performance testing and I *think* the answer is yes. 
                Although to be honest, I'm not sure that I understand
                why, in the case of MariaDB.

                I linked my testing methodology and results here:
                https://pulp.plan.io/issues/4290#note-18

                To summarize, I tested the following:

                * How long it takes to perform subsequent large (lazy)
                syncs, with lots of content in the database (100-400k
                content units)
                * How long it takes to perform various small but
                important database queries

                The results were weirdly in contrast in some cases.

                The first four syncs (202,000 content total) behaved
                mostly the same on PostgreSQL whether it used an
                autoincrement or UUID primary key. Subsequent syncs
                had a performance drop of between 30-40%.  Likewise,
                the code snippets performed 30+% worse.  Sync time
                scaled linearly"ish" with the amont of content in the
                repository in both cases, which was a bit surprising
                to me.  The size of the database at the end was 30-40%
                larger with UUID primary keys, 736 MB vs 521 MB.  The
                gap would be smaller in typical usage when you
                consider that most content types have more metadata
                than FileContent (what I was testing).

                Autoincrement PostgreSQL (left) vs. UUID PostgreSQL
                (right) in diff form
                https://www.diffchecker.com/40AF8vvM

                With MariaDB the first sync was almost 80% slower than
                the first sync w/ PostgreSQL, but every subsequent
                sync was as fast or faster, despite the tests of
                specific queries performing multiple times worse. 
                Additionally the sync performance did not decrease as
                rapidly as it did under PostgreSQL.  With MariaDB, one
                of my test queries that worked fine when backed by
                PostgreSQL ended up hanging endlessly and I had to cut
                it off after 25 or so minutes. [0] I would consider
                that a blocker to claiming we support MariaDB / MySQL.

                But overall I'm not sure how to interpret the fact
                that on one hand the real-usage performance is equal
                or better better, and on the performance of some of
                the underlying queries is noticably worse.  Maybe
                there's some weird caching going on in the backend, or
                the generated indexes are different?

                UUID PostgreSQL (left) vs. UUID MariaDB (right) in
                diff form
                https://www.diffchecker.com/W1nnIQgj

                I'd like to invite some discussion on this, but
                nothing I've mentioned seems like it would be a
                problem for going forwards with using UUID primary
                keys in a general sense.  If we're all in agreement
                about that engineering decision then we can move
                forwards with that work.

                [0] for *some* but not all repository versions.  No
                idea what's up there.







                _______________________________________________
                Pulp-dev mailing list
                [email protected] <mailto:[email protected]>
                https://www.redhat.com/mailman/listinfo/pulp-dev

            _______________________________________________
            Pulp-dev mailing list
            [email protected] <mailto:[email protected]>
            https://www.redhat.com/mailman/listinfo/pulp-dev

        _______________________________________________
        Pulp-dev mailing list
        [email protected] <mailto:[email protected]>
        https://www.redhat.com/mailman/listinfo/pulp-dev

    _______________________________________________
    Pulp-dev mailing list
    [email protected] <mailto:[email protected]>
    https://www.redhat.com/mailman/listinfo/pulp-dev


_______________________________________________
Pulp-dev mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/pulp-dev

_______________________________________________
Pulp-dev mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/pulp-dev

Re: [Pulp-dev] Performance testing results, autoincrement ID vs UUID primary keys

Reply via email to