Re: [Pulp-dev] Performance testing results, autoincrement ID vs UUID primary keys

2019-03-07 Thread David Davis
The changes to switch to UUIDs have been merged. I opened issues against
all the Pulp 3 plugins I could think of to update their docs. There may be
some other changes needed too though.

David


On Wed, Mar 6, 2019 at 9:18 AM David Davis  wrote:

> Since there seems to be no objections to switching to UUIDs, I’d like to
> propose we merge the PRs[0][1] that will switch core to use UUID PKs
> tomorrow (in 24 hours). After that, we'll open redmine issues to update
> plugins to use UUIDs.
>
> [0] https://github.com/pulp/pulpcore/pull/16
> [1] https://github.com/pulp/pulpcore-plugin/pull/69
>
> David
>
>
> On Tue, Mar 5, 2019 at 5:15 PM Jeff Ortel  wrote:
>
>> +1 to switching back to UUIDs for the reasons Brian gave.
>>
>> On 3/1/19 2:23 PM, Brian Bouterse wrote:
>>
>> I've finally gotten to read through the numbers and this thread. It is a
>> tradeoff but I am +1 for switching to UUIDs. I focus on the PostgreSQL UUID
>> vs int case because that is our default database. I don't think too much
>> about how things perform on MariaDB because they can improve their own
>> performance to catch up to PostgreSQL which regularly is performing better
>> afaict. I agree with the assessment of 30% ish slowdown in the large unit
>> cases for PostgreSQL. Still, I believe the advantages of switching to UUIDs
>> are worth it. Two main reasons stick out in my mind.
>>
>> 1. Our core code and all plugin code will always be compatible with
>> common db backends even when using bulk_create()
>> 2. We get database sharding with postgresql which you can only do with
>> UUID pks. I was advised this years ago by jcline.
>>
>> Performance and compatibility are a pretty classic trade-off. Overall
>> I've found that initial releases launch with less performance and improve
>> (often significantly) overtime. Consider the interpreter pypy (not pypi).
>> It started "roughly 2000x slower [at initial launch] than CPython, to
>> roughly 7x faster [now]" [0]. Launching Pulp 3.0 that is 30% slower in the
>> worst-case but runs everywhere with zero "db-behavior surprises" I think is
>> worth it. Also conversely, if we don't adopt UUIDs, how will we address
>> item 1 pre RC?
>>
>> @dawalker for the "can we have both" option, we probably can have some
>> db-specific codepaths, but I don't think doing an application wide PK type
>> change as a setting is feasible to support. The db specific codepaths are
>> one way performance improves over time. For the initial release, to keep
>> things simple I hope we don't have conditional database codepaths (for now).
>>
>> More discussion on this change is encouraged. Thanks @dalley so much for
>> all the detailed investigation!
>>
>> [0]:
>> https://morepypy.blogspot.com/2018/09/the-first-15-years-of-pypy.html
>>
>> Thank you,
>> Brian
>>
>> On Fri, Mar 1, 2019 at 2:51 PM Dana Walker  wrote:
>>
>>> As I brought up on irc, I don't know how feasible the complications to
>>> maintenance would be going forward, but I would prefer if we could use some
>>> sort of settings in order to choose uuid or id based on MariaDB or
>>> PostgreSQL.  I want us to work everywhere, but I'm really concerned at the
>>> impact to our users of a 30-40% efficiency drop in speed and storage.
>>>
>>> David wrote up a quick Proof of Concept after I brought this up but
>>> wasn't necessarily advocating it himself.  I think Daniel and Dennis
>>> expressed some concerns.  I'd like to see more people discussing it here
>>> with reasoning/examples on how doable something like this could be?
>>>
>>> If it's not on the table, I understand, but want to make sure we've
>>> considered all reasonable options, and that might not be a simple binary of
>>> either/or.
>>>
>>> Thanks,
>>>
>>> --Dana
>>>
>>> Dana Walker
>>>
>>> Associate Software Engineer
>>>
>>> Red Hat
>>>
>>> 
>>> 
>>>
>>>
>>> On Fri, Mar 1, 2019 at 9:15 AM David Davis 
>>> wrote:
>>>
 I just want to bump this thread. If we hope to make the Pulp 3 RC date,
 we need feedback today.

 David


 On Wed, Feb 27, 2019 at 5:09 PM Matt Pusateri 
 wrote:

> Not sure if https://www.webyog.com/ Monyog will give a free
> opensource project license.  But that might help diagnose the MariaDB
> performance.  Monyog is really nice, I wish it supported Postgres.
>
> Matt P.
>
> On Tue, Feb 26, 2019 at 7:23 PM Daniel Alley 
> wrote:
>
>> Hello all,
>>
>> We've had an ongoing discussion about whether Pulp would be able to
>> perform acceptably if we switched back to UUID primary keys.  I've 
>> finished
>> doing the performance testing and I *think* the answer is yes.  Although 
>> to
>> be honest, I'm not sure that I understand why, in the case of MariaDB.
>>
>> I linked my testing methodology and results here:
>> https://pulp.plan.io/issues/4290#note-18
>>
>> To summarize, I tested the following:
>>
>> * How long it takes 

Re: [Pulp-dev] Performance testing results, autoincrement ID vs UUID primary keys

2019-03-06 Thread David Davis
Since there seems to be no objections to switching to UUIDs, I’d like to
propose we merge the PRs[0][1] that will switch core to use UUID PKs
tomorrow (in 24 hours). After that, we'll open redmine issues to update
plugins to use UUIDs.

[0] https://github.com/pulp/pulpcore/pull/16
[1] https://github.com/pulp/pulpcore-plugin/pull/69

David


On Tue, Mar 5, 2019 at 5:15 PM Jeff Ortel  wrote:

> +1 to switching back to UUIDs for the reasons Brian gave.
>
> On 3/1/19 2:23 PM, Brian Bouterse wrote:
>
> I've finally gotten to read through the numbers and this thread. It is a
> tradeoff but I am +1 for switching to UUIDs. I focus on the PostgreSQL UUID
> vs int case because that is our default database. I don't think too much
> about how things perform on MariaDB because they can improve their own
> performance to catch up to PostgreSQL which regularly is performing better
> afaict. I agree with the assessment of 30% ish slowdown in the large unit
> cases for PostgreSQL. Still, I believe the advantages of switching to UUIDs
> are worth it. Two main reasons stick out in my mind.
>
> 1. Our core code and all plugin code will always be compatible with common
> db backends even when using bulk_create()
> 2. We get database sharding with postgresql which you can only do with
> UUID pks. I was advised this years ago by jcline.
>
> Performance and compatibility are a pretty classic trade-off. Overall I've
> found that initial releases launch with less performance and improve (often
> significantly) overtime. Consider the interpreter pypy (not pypi).  It
> started "roughly 2000x slower [at initial launch] than CPython, to roughly
> 7x faster [now]" [0]. Launching Pulp 3.0 that is 30% slower in the
> worst-case but runs everywhere with zero "db-behavior surprises" I think is
> worth it. Also conversely, if we don't adopt UUIDs, how will we address
> item 1 pre RC?
>
> @dawalker for the "can we have both" option, we probably can have some
> db-specific codepaths, but I don't think doing an application wide PK type
> change as a setting is feasible to support. The db specific codepaths are
> one way performance improves over time. For the initial release, to keep
> things simple I hope we don't have conditional database codepaths (for now).
>
> More discussion on this change is encouraged. Thanks @dalley so much for
> all the detailed investigation!
>
> [0]: https://morepypy.blogspot.com/2018/09/the-first-15-years-of-pypy.html
>
> Thank you,
> Brian
>
> On Fri, Mar 1, 2019 at 2:51 PM Dana Walker  wrote:
>
>> As I brought up on irc, I don't know how feasible the complications to
>> maintenance would be going forward, but I would prefer if we could use some
>> sort of settings in order to choose uuid or id based on MariaDB or
>> PostgreSQL.  I want us to work everywhere, but I'm really concerned at the
>> impact to our users of a 30-40% efficiency drop in speed and storage.
>>
>> David wrote up a quick Proof of Concept after I brought this up but
>> wasn't necessarily advocating it himself.  I think Daniel and Dennis
>> expressed some concerns.  I'd like to see more people discussing it here
>> with reasoning/examples on how doable something like this could be?
>>
>> If it's not on the table, I understand, but want to make sure we've
>> considered all reasonable options, and that might not be a simple binary of
>> either/or.
>>
>> Thanks,
>>
>> --Dana
>>
>> Dana Walker
>>
>> Associate Software Engineer
>>
>> Red Hat
>>
>> 
>> 
>>
>>
>> On Fri, Mar 1, 2019 at 9:15 AM David Davis  wrote:
>>
>>> I just want to bump this thread. If we hope to make the Pulp 3 RC date,
>>> we need feedback today.
>>>
>>> David
>>>
>>>
>>> On Wed, Feb 27, 2019 at 5:09 PM Matt Pusateri 
>>> wrote:
>>>
 Not sure if https://www.webyog.com/ Monyog will give a free opensource
 project license.  But that might help diagnose the MariaDB performance.
 Monyog is really nice, I wish it supported Postgres.

 Matt P.

 On Tue, Feb 26, 2019 at 7:23 PM Daniel Alley  wrote:

> Hello all,
>
> We've had an ongoing discussion about whether Pulp would be able to
> perform acceptably if we switched back to UUID primary keys.  I've 
> finished
> doing the performance testing and I *think* the answer is yes.  Although 
> to
> be honest, I'm not sure that I understand why, in the case of MariaDB.
>
> I linked my testing methodology and results here:
> https://pulp.plan.io/issues/4290#note-18
>
> To summarize, I tested the following:
>
> * How long it takes to perform subsequent large (lazy) syncs, with
> lots of content in the database (100-400k content units)
> * How long it takes to perform various small but important database
> queries
>
> The results were weirdly in contrast in some cases.
>
> The first four syncs (202,000 content total) behaved mostly the same
> on PostgreSQL whether it used an 

Re: [Pulp-dev] Performance testing results, autoincrement ID vs UUID primary keys

2019-03-05 Thread Jeff Ortel

+1 to switching back to UUIDs for the reasons Brian gave.

On 3/1/19 2:23 PM, Brian Bouterse wrote:
I've finally gotten to read through the numbers and this thread. It is 
a tradeoff but I am +1 for switching to UUIDs. I focus on the 
PostgreSQL UUID vs int case because that is our default database. I 
don't think too much about how things perform on MariaDB because they 
can improve their own performance to catch up to PostgreSQL which 
regularly is performing better afaict. I agree with the assessment of 
30% ish slowdown in the large unit cases for PostgreSQL. Still, I 
believe the advantages of switching to UUIDs are worth it. Two main 
reasons stick out in my mind.


1. Our core code and all plugin code will always be compatible with 
common db backends even when using bulk_create()
2. We get database sharding with postgresql which you can only do with 
UUID pks. I was advised this years ago by jcline.


Performance and compatibility are a pretty classic trade-off. Overall 
I've found that initial releases launch with less performance and 
improve (often significantly) overtime. Consider the interpreter pypy 
(not pypi).  It started "roughly 2000x slower [at initial launch] than 
CPython, to roughly 7x faster [now]" [0]. Launching Pulp 3.0 that is 
30% slower in the worst-case but runs everywhere with zero 
"db-behavior surprises" I think is worth it. Also conversely, if we 
don't adopt UUIDs, how will we address item 1 pre RC?


@dawalker for the "can we have both" option, we probably can have some 
db-specific codepaths, but I don't think doing an application wide PK 
type change as a setting is feasible to support. The db specific 
codepaths are one way performance improves over time. For the initial 
release, to keep things simple I hope we don't have conditional 
database codepaths (for now).


More discussion on this change is encouraged. Thanks @dalley so much 
for all the detailed investigation!


[0]: https://morepypy.blogspot.com/2018/09/the-first-15-years-of-pypy.html

Thank you,
Brian

On Fri, Mar 1, 2019 at 2:51 PM Dana Walker > wrote:


As I brought up on irc, I don't know how feasible the
complications to maintenance would be going forward, but I would
prefer if we could use some sort of settings in order to choose
uuid or id based on MariaDB or PostgreSQL.  I want us to work
everywhere, but I'm really concerned at the impact to our users of
a 30-40% efficiency drop in speed and storage.

David wrote up a quick Proof of Concept after I brought this up
but wasn't necessarily advocating it himself.  I think Daniel and
Dennis expressed some concerns.  I'd like to see more people
discussing it here with reasoning/examples on how doable something
like this could be?

If it's not on the table, I understand, but want to make sure
we've considered all reasonable options, and that might not be a
simple binary of either/or.

Thanks,

--Dana

Dana Walker

Associate Software Engineer

Red Hat







On Fri, Mar 1, 2019 at 9:15 AM David Davis mailto:davidda...@redhat.com>> wrote:

I just want to bump this thread. If we hope to make the Pulp 3
RC date, we need feedback today.

David


On Wed, Feb 27, 2019 at 5:09 PM Matt Pusateri
mailto:mpusa...@redhat.com>> wrote:

Not sure if https://www.webyog.com/ Monyog will give a
free opensource project license.  But that might help
diagnose the MariaDB performance.  Monyog is really nice,
I wish it supported Postgres.

Matt P.

On Tue, Feb 26, 2019 at 7:23 PM Daniel Alley
mailto:dal...@redhat.com>> wrote:

Hello all,

We've had an ongoing discussion about whether Pulp
would be able to perform acceptably if we switched
back to UUID primary keys.  I've finished doing the
performance testing and I *think* the answer is yes. 
Although to be honest, I'm not sure that I understand
why, in the case of MariaDB.

I linked my testing methodology and results here:
https://pulp.plan.io/issues/4290#note-18

To summarize, I tested the following:

* How long it takes to perform subsequent large (lazy)
syncs, with lots of content in the database (100-400k
content units)
* How long it takes to perform various small but
important database queries

The results were weirdly in contrast in some cases.

The first four syncs (202,000 content total) behaved
mostly the same on PostgreSQL whether it used an
autoincrement or UUID primary key. Subsequent syncs
had a performance drop of 

Re: [Pulp-dev] Performance testing results, autoincrement ID vs UUID primary keys

2019-03-01 Thread Daniel Alley
I'm less concerned with the difference between autoincrement vs. UUID
speed, and more concerned with how quickly performance was getting worse
with database size on PostgreSQL in both cases (and not on MariaDB
strangely).  There's probably a *lot* that can be done to improve
performance that we just haven't even looked into yet, and a few weeks of
effort on that front would make a much larger difference (probably) than
the type of PK.  Not that we should disregard it entirely.  The PK decision
has to be made soon though and the work I mentioned will have to wait a bit.

But I do think that If we *really* wanted to support MySQL/MariaDB while
retaining autoincrement PKs, the best option would be a small
MySQL-specific reimplementation of "bulk_create()" that would just call
.save() on all the objects in a loop.  It would probably be *much* slower
for MySQL, but it would be fairly simple (only a couple of lines), it would
work for both without compromising PostgreSQL performance, it would avoid
making the docs more confusing to users and it would be a lot less risky.

@Brian  Would sharding actually be valuable? Have any Pulp users approached
the sort of scale where it would be the right thing to do.  From what I've
heard, a single PostgreSQL installation is capable of handling 20 Terabytes
without tremendous issue. I can't imagine Pulp's database growing so large
that it would be more economical to manage a second database server than it
would be to add more storage to the server you have. I can be convinced
otherwise though.

I think a more compelling point 2 would be that in the multi-tenant use
case, UUIDs would make it vastly more difficult for one API user to gather
information on another user than autoincrement PKs.  Which, even though
we're not going to handle multi-tenant out of the gate, is a reasonable
thing to think about and possibly a good reason to go in that direction.



On Fri, Mar 1, 2019 at 3:24 PM Brian Bouterse  wrote:

> I've finally gotten to read through the numbers and this thread. It is a
> tradeoff but I am +1 for switching to UUIDs. I focus on the PostgreSQL UUID
> vs int case because that is our default database. I don't think too much
> about how things perform on MariaDB because they can improve their own
> performance to catch up to PostgreSQL which regularly is performing better
> afaict. I agree with the assessment of 30% ish slowdown in the large unit
> cases for PostgreSQL. Still, I believe the advantages of switching to UUIDs
> are worth it. Two main reasons stick out in my mind.
>
> 1. Our core code and all plugin code will always be compatible with common
> db backends even when using bulk_create()
> 2. We get database sharding with postgresql which you can only do with
> UUID pks. I was advised this years ago by jcline.
>
> Performance and compatibility are a pretty classic trade-off. Overall I've
> found that initial releases launch with less performance and improve (often
> significantly) overtime. Consider the interpreter pypy (not pypi).  It
> started "roughly 2000x slower [at initial launch] than CPython, to roughly
> 7x faster [now]" [0]. Launching Pulp 3.0 that is 30% slower in the
> worst-case but runs everywhere with zero "db-behavior surprises" I think is
> worth it. Also conversely, if we don't adopt UUIDs, how will we address
> item 1 pre RC?
>
> @dawalker for the "can we have both" option, we probably can have some
> db-specific codepaths, but I don't think doing an application wide PK type
> change as a setting is feasible to support. The db specific codepaths are
> one way performance improves over time. For the initial release, to keep
> things simple I hope we don't have conditional database codepaths (for now).
>
> More discussion on this change is encouraged. Thanks @dalley so much for
> all the detailed investigation!
>
> [0]: https://morepypy.blogspot.com/2018/09/the-first-15-years-of-pypy.html
>
> Thank you,
> Brian
>
> On Fri, Mar 1, 2019 at 2:51 PM Dana Walker  wrote:
>
>> As I brought up on irc, I don't know how feasible the complications to
>> maintenance would be going forward, but I would prefer if we could use some
>> sort of settings in order to choose uuid or id based on MariaDB or
>> PostgreSQL.  I want us to work everywhere, but I'm really concerned at the
>> impact to our users of a 30-40% efficiency drop in speed and storage.
>>
>> David wrote up a quick Proof of Concept after I brought this up but
>> wasn't necessarily advocating it himself.  I think Daniel and Dennis
>> expressed some concerns.  I'd like to see more people discussing it here
>> with reasoning/examples on how doable something like this could be?
>>
>> If it's not on the table, I understand, but want to make sure we've
>> considered all reasonable options, and that might not be a simple binary of
>> either/or.
>>
>> Thanks,
>>
>> --Dana
>>
>> Dana Walker
>>
>> Associate Software Engineer
>>
>> Red Hat
>>
>> 
>> 
>>
>>
>> 

Re: [Pulp-dev] Performance testing results, autoincrement ID vs UUID primary keys

2019-03-01 Thread Brian Bouterse
I've finally gotten to read through the numbers and this thread. It is a
tradeoff but I am +1 for switching to UUIDs. I focus on the PostgreSQL UUID
vs int case because that is our default database. I don't think too much
about how things perform on MariaDB because they can improve their own
performance to catch up to PostgreSQL which regularly is performing better
afaict. I agree with the assessment of 30% ish slowdown in the large unit
cases for PostgreSQL. Still, I believe the advantages of switching to UUIDs
are worth it. Two main reasons stick out in my mind.

1. Our core code and all plugin code will always be compatible with common
db backends even when using bulk_create()
2. We get database sharding with postgresql which you can only do with UUID
pks. I was advised this years ago by jcline.

Performance and compatibility are a pretty classic trade-off. Overall I've
found that initial releases launch with less performance and improve (often
significantly) overtime. Consider the interpreter pypy (not pypi).  It
started "roughly 2000x slower [at initial launch] than CPython, to roughly
7x faster [now]" [0]. Launching Pulp 3.0 that is 30% slower in the
worst-case but runs everywhere with zero "db-behavior surprises" I think is
worth it. Also conversely, if we don't adopt UUIDs, how will we address
item 1 pre RC?

@dawalker for the "can we have both" option, we probably can have some
db-specific codepaths, but I don't think doing an application wide PK type
change as a setting is feasible to support. The db specific codepaths are
one way performance improves over time. For the initial release, to keep
things simple I hope we don't have conditional database codepaths (for now).

More discussion on this change is encouraged. Thanks @dalley so much for
all the detailed investigation!

[0]: https://morepypy.blogspot.com/2018/09/the-first-15-years-of-pypy.html

Thank you,
Brian

On Fri, Mar 1, 2019 at 2:51 PM Dana Walker  wrote:

> As I brought up on irc, I don't know how feasible the complications to
> maintenance would be going forward, but I would prefer if we could use some
> sort of settings in order to choose uuid or id based on MariaDB or
> PostgreSQL.  I want us to work everywhere, but I'm really concerned at the
> impact to our users of a 30-40% efficiency drop in speed and storage.
>
> David wrote up a quick Proof of Concept after I brought this up but wasn't
> necessarily advocating it himself.  I think Daniel and Dennis expressed
> some concerns.  I'd like to see more people discussing it here with
> reasoning/examples on how doable something like this could be?
>
> If it's not on the table, I understand, but want to make sure we've
> considered all reasonable options, and that might not be a simple binary of
> either/or.
>
> Thanks,
>
> --Dana
>
> Dana Walker
>
> Associate Software Engineer
>
> Red Hat
>
> 
> 
>
>
> On Fri, Mar 1, 2019 at 9:15 AM David Davis  wrote:
>
>> I just want to bump this thread. If we hope to make the Pulp 3 RC date,
>> we need feedback today.
>>
>> David
>>
>>
>> On Wed, Feb 27, 2019 at 5:09 PM Matt Pusateri 
>> wrote:
>>
>>> Not sure if https://www.webyog.com/ Monyog will give a free opensource
>>> project license.  But that might help diagnose the MariaDB performance.
>>> Monyog is really nice, I wish it supported Postgres.
>>>
>>> Matt P.
>>>
>>> On Tue, Feb 26, 2019 at 7:23 PM Daniel Alley  wrote:
>>>
 Hello all,

 We've had an ongoing discussion about whether Pulp would be able to
 perform acceptably if we switched back to UUID primary keys.  I've finished
 doing the performance testing and I *think* the answer is yes.  Although to
 be honest, I'm not sure that I understand why, in the case of MariaDB.

 I linked my testing methodology and results here:
 https://pulp.plan.io/issues/4290#note-18

 To summarize, I tested the following:

 * How long it takes to perform subsequent large (lazy) syncs, with lots
 of content in the database (100-400k content units)
 * How long it takes to perform various small but important database
 queries

 The results were weirdly in contrast in some cases.

 The first four syncs (202,000 content total) behaved mostly the same on
 PostgreSQL whether it used an autoincrement or UUID primary key.
 Subsequent syncs had a performance drop of between 30-40%.  Likewise, the
 code snippets performed 30+% worse.  Sync time scaled linearly"ish" with
 the amont of content in the repository in both cases, which was a bit
 surprising to me.  The size of the database at the end was 30-40% larger
 with UUID primary keys, 736 MB vs 521 MB.  The gap would be smaller in
 typical usage when you consider that most content types have more metadata
 than FileContent (what I was testing).

 Autoincrement PostgreSQL (left) vs. UUID PostgreSQL (right) in diff form
 

Re: [Pulp-dev] Performance testing results, autoincrement ID vs UUID primary keys

2019-03-01 Thread Dana Walker
As I brought up on irc, I don't know how feasible the complications to
maintenance would be going forward, but I would prefer if we could use some
sort of settings in order to choose uuid or id based on MariaDB or
PostgreSQL.  I want us to work everywhere, but I'm really concerned at the
impact to our users of a 30-40% efficiency drop in speed and storage.

David wrote up a quick Proof of Concept after I brought this up but wasn't
necessarily advocating it himself.  I think Daniel and Dennis expressed
some concerns.  I'd like to see more people discussing it here with
reasoning/examples on how doable something like this could be?

If it's not on the table, I understand, but want to make sure we've
considered all reasonable options, and that might not be a simple binary of
either/or.

Thanks,

--Dana

Dana Walker

Associate Software Engineer

Red Hat





On Fri, Mar 1, 2019 at 9:15 AM David Davis  wrote:

> I just want to bump this thread. If we hope to make the Pulp 3 RC date, we
> need feedback today.
>
> David
>
>
> On Wed, Feb 27, 2019 at 5:09 PM Matt Pusateri  wrote:
>
>> Not sure if https://www.webyog.com/ Monyog will give a free opensource
>> project license.  But that might help diagnose the MariaDB performance.
>> Monyog is really nice, I wish it supported Postgres.
>>
>> Matt P.
>>
>> On Tue, Feb 26, 2019 at 7:23 PM Daniel Alley  wrote:
>>
>>> Hello all,
>>>
>>> We've had an ongoing discussion about whether Pulp would be able to
>>> perform acceptably if we switched back to UUID primary keys.  I've finished
>>> doing the performance testing and I *think* the answer is yes.  Although to
>>> be honest, I'm not sure that I understand why, in the case of MariaDB.
>>>
>>> I linked my testing methodology and results here:
>>> https://pulp.plan.io/issues/4290#note-18
>>>
>>> To summarize, I tested the following:
>>>
>>> * How long it takes to perform subsequent large (lazy) syncs, with lots
>>> of content in the database (100-400k content units)
>>> * How long it takes to perform various small but important database
>>> queries
>>>
>>> The results were weirdly in contrast in some cases.
>>>
>>> The first four syncs (202,000 content total) behaved mostly the same on
>>> PostgreSQL whether it used an autoincrement or UUID primary key.
>>> Subsequent syncs had a performance drop of between 30-40%.  Likewise, the
>>> code snippets performed 30+% worse.  Sync time scaled linearly"ish" with
>>> the amont of content in the repository in both cases, which was a bit
>>> surprising to me.  The size of the database at the end was 30-40% larger
>>> with UUID primary keys, 736 MB vs 521 MB.  The gap would be smaller in
>>> typical usage when you consider that most content types have more metadata
>>> than FileContent (what I was testing).
>>>
>>> Autoincrement PostgreSQL (left) vs. UUID PostgreSQL (right) in diff form
>>> https://www.diffchecker.com/40AF8vvM
>>>
>>> With MariaDB the first sync was almost 80% slower than the first sync w/
>>> PostgreSQL, but every subsequent sync was as fast or faster, despite the
>>> tests of specific queries performing multiple times worse.  Additionally
>>> the sync performance did not decrease as rapidly as it did under
>>> PostgreSQL.  With MariaDB, one of my test queries that worked fine when
>>> backed by PostgreSQL ended up hanging endlessly and I had to cut it off
>>> after 25 or so minutes. [0]  I would consider that a blocker to claiming we
>>> support MariaDB / MySQL.
>>>
>>> But overall I'm not sure how to interpret the fact that on one hand the
>>> real-usage performance is equal or better better, and on the performance of
>>> some of the underlying queries is noticably worse.  Maybe there's some
>>> weird caching going on in the backend, or the generated indexes are
>>> different?
>>>
>>> UUID PostgreSQL (left) vs. UUID MariaDB (right) in diff form
>>> https://www.diffchecker.com/W1nnIQgj
>>>
>>> I'd like to invite some discussion on this, but nothing I've mentioned
>>> seems like it would be a problem for going forwards with using UUID primary
>>> keys in a general sense.  If we're all in agreement about that engineering
>>> decision then we can move forwards with that work.
>>>
>>> [0] for *some* but not all repository versions.  No idea what's up there.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> ___
>>> Pulp-dev mailing list
>>> Pulp-dev@redhat.com
>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>
>> ___
>> Pulp-dev mailing list
>> Pulp-dev@redhat.com
>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>
> ___
> Pulp-dev mailing list
> Pulp-dev@redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-dev
>
___
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev


Re: [Pulp-dev] Performance testing results, autoincrement ID vs UUID primary keys

2019-03-01 Thread David Davis
I just want to bump this thread. If we hope to make the Pulp 3 RC date, we
need feedback today.

David


On Wed, Feb 27, 2019 at 5:09 PM Matt Pusateri  wrote:

> Not sure if https://www.webyog.com/ Monyog will give a free opensource
> project license.  But that might help diagnose the MariaDB performance.
> Monyog is really nice, I wish it supported Postgres.
>
> Matt P.
>
> On Tue, Feb 26, 2019 at 7:23 PM Daniel Alley  wrote:
>
>> Hello all,
>>
>> We've had an ongoing discussion about whether Pulp would be able to
>> perform acceptably if we switched back to UUID primary keys.  I've finished
>> doing the performance testing and I *think* the answer is yes.  Although to
>> be honest, I'm not sure that I understand why, in the case of MariaDB.
>>
>> I linked my testing methodology and results here:
>> https://pulp.plan.io/issues/4290#note-18
>>
>> To summarize, I tested the following:
>>
>> * How long it takes to perform subsequent large (lazy) syncs, with lots
>> of content in the database (100-400k content units)
>> * How long it takes to perform various small but important database
>> queries
>>
>> The results were weirdly in contrast in some cases.
>>
>> The first four syncs (202,000 content total) behaved mostly the same on
>> PostgreSQL whether it used an autoincrement or UUID primary key.
>> Subsequent syncs had a performance drop of between 30-40%.  Likewise, the
>> code snippets performed 30+% worse.  Sync time scaled linearly"ish" with
>> the amont of content in the repository in both cases, which was a bit
>> surprising to me.  The size of the database at the end was 30-40% larger
>> with UUID primary keys, 736 MB vs 521 MB.  The gap would be smaller in
>> typical usage when you consider that most content types have more metadata
>> than FileContent (what I was testing).
>>
>> Autoincrement PostgreSQL (left) vs. UUID PostgreSQL (right) in diff form
>> https://www.diffchecker.com/40AF8vvM
>>
>> With MariaDB the first sync was almost 80% slower than the first sync w/
>> PostgreSQL, but every subsequent sync was as fast or faster, despite the
>> tests of specific queries performing multiple times worse.  Additionally
>> the sync performance did not decrease as rapidly as it did under
>> PostgreSQL.  With MariaDB, one of my test queries that worked fine when
>> backed by PostgreSQL ended up hanging endlessly and I had to cut it off
>> after 25 or so minutes. [0]  I would consider that a blocker to claiming we
>> support MariaDB / MySQL.
>>
>> But overall I'm not sure how to interpret the fact that on one hand the
>> real-usage performance is equal or better better, and on the performance of
>> some of the underlying queries is noticably worse.  Maybe there's some
>> weird caching going on in the backend, or the generated indexes are
>> different?
>>
>> UUID PostgreSQL (left) vs. UUID MariaDB (right) in diff form
>> https://www.diffchecker.com/W1nnIQgj
>>
>> I'd like to invite some discussion on this, but nothing I've mentioned
>> seems like it would be a problem for going forwards with using UUID primary
>> keys in a general sense.  If we're all in agreement about that engineering
>> decision then we can move forwards with that work.
>>
>> [0] for *some* but not all repository versions.  No idea what's up there.
>>
>>
>>
>>
>>
>>
>>
>> ___
>> Pulp-dev mailing list
>> Pulp-dev@redhat.com
>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>
> ___
> Pulp-dev mailing list
> Pulp-dev@redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-dev
>
___
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev


Re: [Pulp-dev] Performance testing results, autoincrement ID vs UUID primary keys

2019-02-27 Thread Matt Pusateri
Not sure if https://www.webyog.com/ Monyog will give a free opensource
project license.  But that might help diagnose the MariaDB performance.
Monyog is really nice, I wish it supported Postgres.

Matt P.

On Tue, Feb 26, 2019 at 7:23 PM Daniel Alley  wrote:

> Hello all,
>
> We've had an ongoing discussion about whether Pulp would be able to
> perform acceptably if we switched back to UUID primary keys.  I've finished
> doing the performance testing and I *think* the answer is yes.  Although to
> be honest, I'm not sure that I understand why, in the case of MariaDB.
>
> I linked my testing methodology and results here:
> https://pulp.plan.io/issues/4290#note-18
>
> To summarize, I tested the following:
>
> * How long it takes to perform subsequent large (lazy) syncs, with lots of
> content in the database (100-400k content units)
> * How long it takes to perform various small but important database queries
>
> The results were weirdly in contrast in some cases.
>
> The first four syncs (202,000 content total) behaved mostly the same on
> PostgreSQL whether it used an autoincrement or UUID primary key.
> Subsequent syncs had a performance drop of between 30-40%.  Likewise, the
> code snippets performed 30+% worse.  Sync time scaled linearly"ish" with
> the amont of content in the repository in both cases, which was a bit
> surprising to me.  The size of the database at the end was 30-40% larger
> with UUID primary keys, 736 MB vs 521 MB.  The gap would be smaller in
> typical usage when you consider that most content types have more metadata
> than FileContent (what I was testing).
>
> Autoincrement PostgreSQL (left) vs. UUID PostgreSQL (right) in diff form
> https://www.diffchecker.com/40AF8vvM
>
> With MariaDB the first sync was almost 80% slower than the first sync w/
> PostgreSQL, but every subsequent sync was as fast or faster, despite the
> tests of specific queries performing multiple times worse.  Additionally
> the sync performance did not decrease as rapidly as it did under
> PostgreSQL.  With MariaDB, one of my test queries that worked fine when
> backed by PostgreSQL ended up hanging endlessly and I had to cut it off
> after 25 or so minutes. [0]  I would consider that a blocker to claiming we
> support MariaDB / MySQL.
>
> But overall I'm not sure how to interpret the fact that on one hand the
> real-usage performance is equal or better better, and on the performance of
> some of the underlying queries is noticably worse.  Maybe there's some
> weird caching going on in the backend, or the generated indexes are
> different?
>
> UUID PostgreSQL (left) vs. UUID MariaDB (right) in diff form
> https://www.diffchecker.com/W1nnIQgj
>
> I'd like to invite some discussion on this, but nothing I've mentioned
> seems like it would be a problem for going forwards with using UUID primary
> keys in a general sense.  If we're all in agreement about that engineering
> decision then we can move forwards with that work.
>
> [0] for *some* but not all repository versions.  No idea what's up there.
>
>
>
>
>
>
>
> ___
> Pulp-dev mailing list
> Pulp-dev@redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-dev
>
___
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev


Re: [Pulp-dev] Performance testing results, autoincrement ID vs UUID primary keys

2019-02-27 Thread Daniel Alley
Yes, I used the "started_at" and "finished_at" timestamps.  And there's
definitely things we can do to speed up sync times since they dropped by a
sizable amount since the last time I did this testing. I'm not sure where
that slowdown could have come from but I'm sure we can figure it out.

On Wed, Feb 27, 2019 at 4:10 AM David Davis  wrote:

> Daniel,
>
> Thanks for the work on this. I'm wondering where you got the times from.
> The task timestamps? I'm asking because when you say 30-40% slow down, I am
> wondering if that's the overall time it takes to sync or if that's just
> part of the sync. I think it's the former which I do find a bit troubling.
> That said, I think I agree with your conclusion that we should probably
> switch to UUIDs anyway. Perhaps we can find other ways to speed up sync
> times.
>
> David
>
>
> On Wed, Feb 27, 2019 at 1:23 AM Daniel Alley  wrote:
>
>> Hello all,
>>
>> We've had an ongoing discussion about whether Pulp would be able to
>> perform acceptably if we switched back to UUID primary keys.  I've finished
>> doing the performance testing and I *think* the answer is yes.  Although to
>> be honest, I'm not sure that I understand why, in the case of MariaDB.
>>
>> I linked my testing methodology and results here:
>> https://pulp.plan.io/issues/4290#note-18
>>
>> To summarize, I tested the following:
>>
>> * How long it takes to perform subsequent large (lazy) syncs, with lots
>> of content in the database (100-400k content units)
>> * How long it takes to perform various small but important database
>> queries
>>
>> The results were weirdly in contrast in some cases.
>>
>> The first four syncs (202,000 content total) behaved mostly the same on
>> PostgreSQL whether it used an autoincrement or UUID primary key.
>> Subsequent syncs had a performance drop of between 30-40%.  Likewise, the
>> code snippets performed 30+% worse.  Sync time scaled linearly"ish" with
>> the amont of content in the repository in both cases, which was a bit
>> surprising to me.  The size of the database at the end was 30-40% larger
>> with UUID primary keys, 736 MB vs 521 MB.  The gap would be smaller in
>> typical usage when you consider that most content types have more metadata
>> than FileContent (what I was testing).
>>
>> Autoincrement PostgreSQL (left) vs. UUID PostgreSQL (right) in diff form
>> https://www.diffchecker.com/40AF8vvM
>>
>> With MariaDB the first sync was almost 80% slower than the first sync w/
>> PostgreSQL, but every subsequent sync was as fast or faster, despite the
>> tests of specific queries performing multiple times worse.  Additionally
>> the sync performance did not decrease as rapidly as it did under
>> PostgreSQL.  With MariaDB, one of my test queries that worked fine when
>> backed by PostgreSQL ended up hanging endlessly and I had to cut it off
>> after 25 or so minutes. [0]  I would consider that a blocker to claiming we
>> support MariaDB / MySQL.
>>
>> But overall I'm not sure how to interpret the fact that on one hand the
>> real-usage performance is equal or better better, and on the performance of
>> some of the underlying queries is noticably worse.  Maybe there's some
>> weird caching going on in the backend, or the generated indexes are
>> different?
>>
>> UUID PostgreSQL (left) vs. UUID MariaDB (right) in diff form
>> https://www.diffchecker.com/W1nnIQgj
>>
>> I'd like to invite some discussion on this, but nothing I've mentioned
>> seems like it would be a problem for going forwards with using UUID primary
>> keys in a general sense.  If we're all in agreement about that engineering
>> decision then we can move forwards with that work.
>>
>> [0] for *some* but not all repository versions.  No idea what's up there.
>>
>>
>>
>>
>>
>>
>>
>> ___
>> Pulp-dev mailing list
>> Pulp-dev@redhat.com
>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>
>
___
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev


Re: [Pulp-dev] Performance testing results, autoincrement ID vs UUID primary keys

2019-02-27 Thread David Davis
Daniel,

Thanks for the work on this. I'm wondering where you got the times from.
The task timestamps? I'm asking because when you say 30-40% slow down, I am
wondering if that's the overall time it takes to sync or if that's just
part of the sync. I think it's the former which I do find a bit troubling.
That said, I think I agree with your conclusion that we should probably
switch to UUIDs anyway. Perhaps we can find other ways to speed up sync
times.

David


On Wed, Feb 27, 2019 at 1:23 AM Daniel Alley  wrote:

> Hello all,
>
> We've had an ongoing discussion about whether Pulp would be able to
> perform acceptably if we switched back to UUID primary keys.  I've finished
> doing the performance testing and I *think* the answer is yes.  Although to
> be honest, I'm not sure that I understand why, in the case of MariaDB.
>
> I linked my testing methodology and results here:
> https://pulp.plan.io/issues/4290#note-18
>
> To summarize, I tested the following:
>
> * How long it takes to perform subsequent large (lazy) syncs, with lots of
> content in the database (100-400k content units)
> * How long it takes to perform various small but important database queries
>
> The results were weirdly in contrast in some cases.
>
> The first four syncs (202,000 content total) behaved mostly the same on
> PostgreSQL whether it used an autoincrement or UUID primary key.
> Subsequent syncs had a performance drop of between 30-40%.  Likewise, the
> code snippets performed 30+% worse.  Sync time scaled linearly"ish" with
> the amont of content in the repository in both cases, which was a bit
> surprising to me.  The size of the database at the end was 30-40% larger
> with UUID primary keys, 736 MB vs 521 MB.  The gap would be smaller in
> typical usage when you consider that most content types have more metadata
> than FileContent (what I was testing).
>
> Autoincrement PostgreSQL (left) vs. UUID PostgreSQL (right) in diff form
> https://www.diffchecker.com/40AF8vvM
>
> With MariaDB the first sync was almost 80% slower than the first sync w/
> PostgreSQL, but every subsequent sync was as fast or faster, despite the
> tests of specific queries performing multiple times worse.  Additionally
> the sync performance did not decrease as rapidly as it did under
> PostgreSQL.  With MariaDB, one of my test queries that worked fine when
> backed by PostgreSQL ended up hanging endlessly and I had to cut it off
> after 25 or so minutes. [0]  I would consider that a blocker to claiming we
> support MariaDB / MySQL.
>
> But overall I'm not sure how to interpret the fact that on one hand the
> real-usage performance is equal or better better, and on the performance of
> some of the underlying queries is noticably worse.  Maybe there's some
> weird caching going on in the backend, or the generated indexes are
> different?
>
> UUID PostgreSQL (left) vs. UUID MariaDB (right) in diff form
> https://www.diffchecker.com/W1nnIQgj
>
> I'd like to invite some discussion on this, but nothing I've mentioned
> seems like it would be a problem for going forwards with using UUID primary
> keys in a general sense.  If we're all in agreement about that engineering
> decision then we can move forwards with that work.
>
> [0] for *some* but not all repository versions.  No idea what's up there.
>
>
>
>
>
>
>
> ___
> Pulp-dev mailing list
> Pulp-dev@redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-dev
>
___
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev


[Pulp-dev] Performance testing results, autoincrement ID vs UUID primary keys

2019-02-26 Thread Daniel Alley
Hello all,

We've had an ongoing discussion about whether Pulp would be able to perform
acceptably if we switched back to UUID primary keys.  I've finished doing
the performance testing and I *think* the answer is yes.  Although to be
honest, I'm not sure that I understand why, in the case of MariaDB.

I linked my testing methodology and results here:
https://pulp.plan.io/issues/4290#note-18

To summarize, I tested the following:

* How long it takes to perform subsequent large (lazy) syncs, with lots of
content in the database (100-400k content units)
* How long it takes to perform various small but important database queries

The results were weirdly in contrast in some cases.

The first four syncs (202,000 content total) behaved mostly the same on
PostgreSQL whether it used an autoincrement or UUID primary key.
Subsequent syncs had a performance drop of between 30-40%.  Likewise, the
code snippets performed 30+% worse.  Sync time scaled linearly"ish" with
the amont of content in the repository in both cases, which was a bit
surprising to me.  The size of the database at the end was 30-40% larger
with UUID primary keys, 736 MB vs 521 MB.  The gap would be smaller in
typical usage when you consider that most content types have more metadata
than FileContent (what I was testing).

Autoincrement PostgreSQL (left) vs. UUID PostgreSQL (right) in diff form
https://www.diffchecker.com/40AF8vvM

With MariaDB the first sync was almost 80% slower than the first sync w/
PostgreSQL, but every subsequent sync was as fast or faster, despite the
tests of specific queries performing multiple times worse.  Additionally
the sync performance did not decrease as rapidly as it did under
PostgreSQL.  With MariaDB, one of my test queries that worked fine when
backed by PostgreSQL ended up hanging endlessly and I had to cut it off
after 25 or so minutes. [0]  I would consider that a blocker to claiming we
support MariaDB / MySQL.

But overall I'm not sure how to interpret the fact that on one hand the
real-usage performance is equal or better better, and on the performance of
some of the underlying queries is noticably worse.  Maybe there's some
weird caching going on in the backend, or the generated indexes are
different?

UUID PostgreSQL (left) vs. UUID MariaDB (right) in diff form
https://www.diffchecker.com/W1nnIQgj

I'd like to invite some discussion on this, but nothing I've mentioned
seems like it would be a problem for going forwards with using UUID primary
keys in a general sense.  If we're all in agreement about that engineering
decision then we can move forwards with that work.

[0] for *some* but not all repository versions.  No idea what's up there.
___
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev