+1 to switching back to UUIDs for the reasons Brian gave.

On 3/1/19 2:23 PM, Brian Bouterse wrote:
I've finally gotten to read through the numbers and this thread. It is a tradeoff but I am +1 for switching to UUIDs. I focus on the PostgreSQL UUID vs int case because that is our default database. I don't think too much about how things perform on MariaDB because they can improve their own performance to catch up to PostgreSQL which regularly is performing better afaict. I agree with the assessment of 30% ish slowdown in the large unit cases for PostgreSQL. Still, I believe the advantages of switching to UUIDs are worth it. Two main reasons stick out in my mind.

1. Our core code and all plugin code will always be compatible with common db backends even when using bulk_create() 2. We get database sharding with postgresql which you can only do with UUID pks. I was advised this years ago by jcline.

Performance and compatibility are a pretty classic trade-off. Overall I've found that initial releases launch with less performance and improve (often significantly) overtime. Consider the interpreter pypy (not pypi).  It started "roughly 2000x slower [at initial launch] than CPython, to roughly 7x faster [now]" [0]. Launching Pulp 3.0 that is 30% slower in the worst-case but runs everywhere with zero "db-behavior surprises" I think is worth it. Also conversely, if we don't adopt UUIDs, how will we address item 1 pre RC?

@dawalker for the "can we have both" option, we probably can have some db-specific codepaths, but I don't think doing an application wide PK type change as a setting is feasible to support. The db specific codepaths are one way performance improves over time. For the initial release, to keep things simple I hope we don't have conditional database codepaths (for now).

More discussion on this change is encouraged. Thanks @dalley so much for all the detailed investigation!

[0]: https://morepypy.blogspot.com/2018/09/the-first-15-years-of-pypy.html

Thank you,
Brian

On Fri, Mar 1, 2019 at 2:51 PM Dana Walker <dawal...@redhat.com <mailto:dawal...@redhat.com>> wrote:

    As I brought up on irc, I don't know how feasible the
    complications to maintenance would be going forward, but I would
    prefer if we could use some sort of settings in order to choose
    uuid or id based on MariaDB or PostgreSQL.  I want us to work
    everywhere, but I'm really concerned at the impact to our users of
    a 30-40% efficiency drop in speed and storage.

    David wrote up a quick Proof of Concept after I brought this up
    but wasn't necessarily advocating it himself.  I think Daniel and
    Dennis expressed some concerns.  I'd like to see more people
    discussing it here with reasoning/examples on how doable something
    like this could be?

    If it's not on the table, I understand, but want to make sure
    we've considered all reasonable options, and that might not be a
    simple binary of either/or.

    Thanks,

    --Dana

    Dana Walker

    Associate Software Engineer

    Red Hat

    <https://www.redhat.com>

    <https://red.ht/sig>



    On Fri, Mar 1, 2019 at 9:15 AM David Davis <davidda...@redhat.com
    <mailto:davidda...@redhat.com>> wrote:

        I just want to bump this thread. If we hope to make the Pulp 3
        RC date, we need feedback today.

        David


        On Wed, Feb 27, 2019 at 5:09 PM Matt Pusateri
        <mpusa...@redhat.com <mailto:mpusa...@redhat.com>> wrote:

            Not sure if https://www.webyog.com/ Monyog will give a
            free opensource project license.  But that might help
            diagnose the MariaDB performance.  Monyog is really nice,
            I wish it supported Postgres.

            Matt P.

            On Tue, Feb 26, 2019 at 7:23 PM Daniel Alley
            <dal...@redhat.com <mailto:dal...@redhat.com>> wrote:

                Hello all,

                We've had an ongoing discussion about whether Pulp
                would be able to perform acceptably if we switched
                back to UUID primary keys.  I've finished doing the
                performance testing and I *think* the answer is yes. 
                Although to be honest, I'm not sure that I understand
                why, in the case of MariaDB.

                I linked my testing methodology and results here:
                https://pulp.plan.io/issues/4290#note-18

                To summarize, I tested the following:

                * How long it takes to perform subsequent large (lazy)
                syncs, with lots of content in the database (100-400k
                content units)
                * How long it takes to perform various small but
                important database queries

                The results were weirdly in contrast in some cases.

                The first four syncs (202,000 content total) behaved
                mostly the same on PostgreSQL whether it used an
                autoincrement or UUID primary key. Subsequent syncs
                had a performance drop of between 30-40%.  Likewise,
                the code snippets performed 30+% worse.  Sync time
                scaled linearly"ish" with the amont of content in the
                repository in both cases, which was a bit surprising
                to me.  The size of the database at the end was 30-40%
                larger with UUID primary keys, 736 MB vs 521 MB.  The
                gap would be smaller in typical usage when you
                consider that most content types have more metadata
                than FileContent (what I was testing).

                Autoincrement PostgreSQL (left) vs. UUID PostgreSQL
                (right) in diff form
                https://www.diffchecker.com/40AF8vvM

                With MariaDB the first sync was almost 80% slower than
                the first sync w/ PostgreSQL, but every subsequent
                sync was as fast or faster, despite the tests of
                specific queries performing multiple times worse. 
                Additionally the sync performance did not decrease as
                rapidly as it did under PostgreSQL.  With MariaDB, one
                of my test queries that worked fine when backed by
                PostgreSQL ended up hanging endlessly and I had to cut
                it off after 25 or so minutes. [0] I would consider
                that a blocker to claiming we support MariaDB / MySQL.

                But overall I'm not sure how to interpret the fact
                that on one hand the real-usage performance is equal
                or better better, and on the performance of some of
                the underlying queries is noticably worse.  Maybe
                there's some weird caching going on in the backend, or
                the generated indexes are different?

                UUID PostgreSQL (left) vs. UUID MariaDB (right) in
                diff form
                https://www.diffchecker.com/W1nnIQgj

                I'd like to invite some discussion on this, but
                nothing I've mentioned seems like it would be a
                problem for going forwards with using UUID primary
                keys in a general sense.  If we're all in agreement
                about that engineering decision then we can move
                forwards with that work.

                [0] for *some* but not all repository versions.  No
                idea what's up there.







                _______________________________________________
                Pulp-dev mailing list
                Pulp-dev@redhat.com <mailto:Pulp-dev@redhat.com>
                https://www.redhat.com/mailman/listinfo/pulp-dev

            _______________________________________________
            Pulp-dev mailing list
            Pulp-dev@redhat.com <mailto:Pulp-dev@redhat.com>
            https://www.redhat.com/mailman/listinfo/pulp-dev

        _______________________________________________
        Pulp-dev mailing list
        Pulp-dev@redhat.com <mailto:Pulp-dev@redhat.com>
        https://www.redhat.com/mailman/listinfo/pulp-dev

    _______________________________________________
    Pulp-dev mailing list
    Pulp-dev@redhat.com <mailto:Pulp-dev@redhat.com>
    https://www.redhat.com/mailman/listinfo/pulp-dev


_______________________________________________
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev

_______________________________________________
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev

Reply via email to