+1 to switching back to UUIDs for the reasons Brian gave.
On 3/1/19 2:23 PM, Brian Bouterse wrote:
I've finally gotten to read through the numbers and this thread. It is
a tradeoff but I am +1 for switching to UUIDs. I focus on the
PostgreSQL UUID vs int case because that is our default database. I
don't think too much about how things perform on MariaDB because they
can improve their own performance to catch up to PostgreSQL which
regularly is performing better afaict. I agree with the assessment of
30% ish slowdown in the large unit cases for PostgreSQL. Still, I
believe the advantages of switching to UUIDs are worth it. Two main
reasons stick out in my mind.
1. Our core code and all plugin code will always be compatible with
common db backends even when using bulk_create()
2. We get database sharding with postgresql which you can only do with
UUID pks. I was advised this years ago by jcline.
Performance and compatibility are a pretty classic trade-off. Overall
I've found that initial releases launch with less performance and
improve (often significantly) overtime. Consider the interpreter pypy
(not pypi). It started "roughly 2000x slower [at initial launch] than
CPython, to roughly 7x faster [now]" [0]. Launching Pulp 3.0 that is
30% slower in the worst-case but runs everywhere with zero
"db-behavior surprises" I think is worth it. Also conversely, if we
don't adopt UUIDs, how will we address item 1 pre RC?
@dawalker for the "can we have both" option, we probably can have some
db-specific codepaths, but I don't think doing an application wide PK
type change as a setting is feasible to support. The db specific
codepaths are one way performance improves over time. For the initial
release, to keep things simple I hope we don't have conditional
database codepaths (for now).
More discussion on this change is encouraged. Thanks @dalley so much
for all the detailed investigation!
[0]: https://morepypy.blogspot.com/2018/09/the-first-15-years-of-pypy.html
Thank you,
Brian
On Fri, Mar 1, 2019 at 2:51 PM Dana Walker <dawal...@redhat.com
<mailto:dawal...@redhat.com>> wrote:
As I brought up on irc, I don't know how feasible the
complications to maintenance would be going forward, but I would
prefer if we could use some sort of settings in order to choose
uuid or id based on MariaDB or PostgreSQL. I want us to work
everywhere, but I'm really concerned at the impact to our users of
a 30-40% efficiency drop in speed and storage.
David wrote up a quick Proof of Concept after I brought this up
but wasn't necessarily advocating it himself. I think Daniel and
Dennis expressed some concerns. I'd like to see more people
discussing it here with reasoning/examples on how doable something
like this could be?
If it's not on the table, I understand, but want to make sure
we've considered all reasonable options, and that might not be a
simple binary of either/or.
Thanks,
--Dana
Dana Walker
Associate Software Engineer
Red Hat
<https://www.redhat.com>
<https://red.ht/sig>
On Fri, Mar 1, 2019 at 9:15 AM David Davis <davidda...@redhat.com
<mailto:davidda...@redhat.com>> wrote:
I just want to bump this thread. If we hope to make the Pulp 3
RC date, we need feedback today.
David
On Wed, Feb 27, 2019 at 5:09 PM Matt Pusateri
<mpusa...@redhat.com <mailto:mpusa...@redhat.com>> wrote:
Not sure if https://www.webyog.com/ Monyog will give a
free opensource project license. But that might help
diagnose the MariaDB performance. Monyog is really nice,
I wish it supported Postgres.
Matt P.
On Tue, Feb 26, 2019 at 7:23 PM Daniel Alley
<dal...@redhat.com <mailto:dal...@redhat.com>> wrote:
Hello all,
We've had an ongoing discussion about whether Pulp
would be able to perform acceptably if we switched
back to UUID primary keys. I've finished doing the
performance testing and I *think* the answer is yes.
Although to be honest, I'm not sure that I understand
why, in the case of MariaDB.
I linked my testing methodology and results here:
https://pulp.plan.io/issues/4290#note-18
To summarize, I tested the following:
* How long it takes to perform subsequent large (lazy)
syncs, with lots of content in the database (100-400k
content units)
* How long it takes to perform various small but
important database queries
The results were weirdly in contrast in some cases.
The first four syncs (202,000 content total) behaved
mostly the same on PostgreSQL whether it used an
autoincrement or UUID primary key. Subsequent syncs
had a performance drop of between 30-40%. Likewise,
the code snippets performed 30+% worse. Sync time
scaled linearly"ish" with the amont of content in the
repository in both cases, which was a bit surprising
to me. The size of the database at the end was 30-40%
larger with UUID primary keys, 736 MB vs 521 MB. The
gap would be smaller in typical usage when you
consider that most content types have more metadata
than FileContent (what I was testing).
Autoincrement PostgreSQL (left) vs. UUID PostgreSQL
(right) in diff form
https://www.diffchecker.com/40AF8vvM
With MariaDB the first sync was almost 80% slower than
the first sync w/ PostgreSQL, but every subsequent
sync was as fast or faster, despite the tests of
specific queries performing multiple times worse.
Additionally the sync performance did not decrease as
rapidly as it did under PostgreSQL. With MariaDB, one
of my test queries that worked fine when backed by
PostgreSQL ended up hanging endlessly and I had to cut
it off after 25 or so minutes. [0] I would consider
that a blocker to claiming we support MariaDB / MySQL.
But overall I'm not sure how to interpret the fact
that on one hand the real-usage performance is equal
or better better, and on the performance of some of
the underlying queries is noticably worse. Maybe
there's some weird caching going on in the backend, or
the generated indexes are different?
UUID PostgreSQL (left) vs. UUID MariaDB (right) in
diff form
https://www.diffchecker.com/W1nnIQgj
I'd like to invite some discussion on this, but
nothing I've mentioned seems like it would be a
problem for going forwards with using UUID primary
keys in a general sense. If we're all in agreement
about that engineering decision then we can move
forwards with that work.
[0] for *some* but not all repository versions. No
idea what's up there.
_______________________________________________
Pulp-dev mailing list
Pulp-dev@redhat.com <mailto:Pulp-dev@redhat.com>
https://www.redhat.com/mailman/listinfo/pulp-dev
_______________________________________________
Pulp-dev mailing list
Pulp-dev@redhat.com <mailto:Pulp-dev@redhat.com>
https://www.redhat.com/mailman/listinfo/pulp-dev
_______________________________________________
Pulp-dev mailing list
Pulp-dev@redhat.com <mailto:Pulp-dev@redhat.com>
https://www.redhat.com/mailman/listinfo/pulp-dev
_______________________________________________
Pulp-dev mailing list
Pulp-dev@redhat.com <mailto:Pulp-dev@redhat.com>
https://www.redhat.com/mailman/listinfo/pulp-dev
_______________________________________________
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev
_______________________________________________
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev