Thanks for sending this concise summary of where we stand on this. I think the format is great and very helpful!
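For anyone skimming, here is a minimal sketch of one possible way to store only additions/removals, as in the second proposal from David's summary below. Instead of one row per (repo version, content unit) pair, each association records the version in which a unit was added and, if applicable, the version in which it was removed. Everything in it is an assumption made for illustration: the `Association` and `Repository` classes, the `version_added`/`version_removed` field names, and the in-memory list standing in for a database table. It is not the code in the PR.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Set


@dataclass
class Association:
    """One content unit's membership span in a repository (hypothetical model)."""
    content_id: str
    version_added: int
    version_removed: Optional[int] = None  # None means "still present in latest"


@dataclass
class Repository:
    latest: int = 0
    associations: List[Association] = field(default_factory=list)

    def new_version(self, add: Iterable[str] = (), remove: Iterable[str] = ()) -> int:
        """Create the next version from the latest one (linear model only)."""
        self.latest += 1
        number = self.latest
        to_remove = set(remove)
        # Close the span of every unit being removed; nothing is copied.
        for assoc in self.associations:
            if assoc.version_removed is None and assoc.content_id in to_remove:
                assoc.version_removed = number
        # Open a new span only for units that are not already present.
        present = self.content(number)
        for content_id in sorted(set(add) - present):
            self.associations.append(Association(content_id, number))
        return number

    def content(self, number: int) -> Set[str]:
        """Content of a version is a membership test; no replay of earlier diffs."""
        return {
            a.content_id
            for a in self.associations
            if a.version_added <= number
            and (a.version_removed is None or a.version_removed > number)
        }


repo = Repository()
repo.new_version(add=["a", "b", "c"])      # version 1
repo.new_version(add=["d"], remove=["a"])  # version 2
print(sorted(repo.content(1)))  # ['a', 'b', 'c']
print(sorted(repo.content(2)))  # ['b', 'c', 'd']
print(len(repo.associations))   # 4
```

In this sketch, creating version 2 touches two records (one closed, one added) no matter how many units the repository holds, which is the linear case the summary optimizes for. Basing a new version on an older base_version would not map onto this layout as directly.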
On Mon, Dec 18, 2017 at 5:09 PM, David Davis <[email protected]> wrote:
> tl;dr - @dkliban, @bmbouter, and I met and we propose adopting the second proposal because it has better performance and is more in line with how we think users will use repository versions (i.e. in a linear fashion rather than a tree/branching model). We've also updated the user stories to remove the base_version features and we're hoping to get @mhrivnak's PR merged this week.
>
> # Background
>
> I ran through some performance tests on the first proposal which involved storing a direct relationship between repository versions and content. The results[0] show that for a small/medium-size system with 100M associations between repository versions and content, it would take about a minute to create a new repo version with 10,000 units in the database. 100M associations also required a table size of at least 7GB and an index size of 15GB.
>
> I don't think this is a dealbreaker in and of itself. It's possible we could do some optimizations if we really want to adopt the first proposal (e.g. use int keys instead of UUIDs, table partitioning, etc). I think it's worth asking though what we want to optimize for, which brings me to the next point.
>
> # Linear vs Branching
>
> A main consideration for us was how users would use Pulp 3. The strength of the second proposal (in which additions/removals are stored) shows when a few units are added to or removed from the latest repo version. This case captures how a majority of users will create new versions in Pulp. This is basically a linear sort of model in which new versions are always based off the previous version.
>
> The first proposal better supports creating versions from a base_version which may or may not be the latest version. This is a branching sort of model (like git) that offers more flexibility to our users but we feel like a majority of the time, users would not be doing this when creating a new version. And optimizing for a less frequently used use case is imprudent.
>
> Therefore, we think it makes sense to adopt the second proposal and store only additions/removals of content from a repository version. Also, we think that the base_version feature (allowing users to make changes to an older repo version) should not be a part of the MVP and maybe we can consider it for 3.1+.
>
> # Next Steps
>
> We've updated the user stories in the MVP document to remove the terminology around base_version[1]. We're going to break them up into separate user stories under our Repo Version tracker[2] and add a few of the basic ones around CRD repo versions to the sprint.
>
> Also, we're going to work on accepting @mhrivnak's repo version PR[3]. I think it's mostly ready, and just needs some re-review and ACKs.
>
> # Feedback
>
> If you have any thoughts, please respond. We're hoping to get the ball rolling on repo versions ASAP. Thank you all for your help!
>
> [0] https://github.com/daviddavis/pulp_repo_version_test#results
> [1] https://pulp.plan.io/projects/pulp/wiki/Pulp_3_Minimum_Viable_Product/diff?utf8=%E2%9C%93&version=136&version_from=135&commit=View+differences
> [2] https://pulp.plan.io/issues/3209
> [3] https://github.com/pulp/pulp/pull/3228
>
>
> David
>
> On Sun, Dec 17, 2017 at 3:30 PM, Michael Hrivnak <[email protected]> wrote:
>>
>> I decided to rebase the PR onto latest 3.0-dev just so it doesn't get too stale, particularly since the un-nesting work had a substantial impact. I also updated the gist containing tests. Feel free to have a look.
>>
>> I also addressed all the feedback on the PR. I did not implement any new behavior, such as adding a boolean value to the version model, since it seems like discussions may not be complete about what to name it and how it should be used. That seems easy enough to implement as an additional change.
>>
>> On Mon, Dec 4, 2017 at 10:11 AM, Dennis Kliban <[email protected]> wrote:
>>>
>>> I am looking forward to discussing the use cases. I hope we can get versioned repositories into 3.0. Thanks everyone for the discussion so far.
>>>
>>> -Dennis
>>>
>>> On Fri, Dec 1, 2017 at 5:16 PM, Brian Bouterse <[email protected]> wrote:
>>>>
>>>> Thank you all for such great discussion!
>>>>
>>>> To recap some discussion we had today: we are going to look at the versioned repos use cases at an upcoming MVP call in the near future (probably 12/8). Look for the pulp-list announcement. If you have use cases you want to share, you can add them in red in the Versioned Repos section of the MVP here:
>>>> https://pulp.plan.io/projects/pulp/wiki/Pulp_3_Minimum_Viable_Product/#Versioned-Repositories
>>>>
>>>> Once the use cases are known, we can look at the PR and see if it fulfills them. From the discussion today, the general consensus is that the gap will be relatively small, which makes including it in Pulp3 feasible.
>>>>
>>>> @misa providing those types of features may be possible. Imagine an optional attribute on a repo version named 'frozen' that defaults to True. While the latest repo_version for a repo has frozen=False, any action that would normally create a new repo version (copy, add/remove, delete, etc) would act on the existing repo version and *not* create a new one. Then the user can update the frozen attribute of the repo version when they want, which commits the transaction as a repo version. I don't think this would be too hard to implement.
>>>>
>>>>
>>>> On Thu, Nov 30, 2017 at 3:20 PM, Michael Hrivnak <[email protected]> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Nov 30, 2017 at 11:43 AM, Mihai Ibanescu <[email protected]> wrote:
>>>>>>
>>>>>> I am late to the thread, so I apologize if I repeat things that have been discussed already.
>>>>>>
>>>>>> Is it a meaningful use case to publish an older version of the repo? Once published, do you keep track of which version got published, and how do you decide which version to push next? This seems like a complication to me.
>>>>>>
>>>>>
>>>>> A publication will have a reference to the version that it was created from. To illustrate how that would get used: Your CTO calls early on a Saturday morning and says "I read in the news about a major security flaw in cowsay, and I know our applications depend heavily on it. What version do we have deployed right now???!!!" You can concretely determine which publications are being currently "distributed" to your infrastructure, and from there see their exact content sets by virtue of the repo version.
>>>>>
>>>>> Then there is the promotion workflow, which in Pulp 2 requires a lot of copying and re-publishing. With repo versions, you'll have a sequence of versions of course. Let's say there's 1, 2 and 3. Version 1 is deployed now, version 2 is undergoing testing, and version 3 got created last night by the weekly sync job you set up. You would have two different distributors that make these publications available to clients: one for production, and one for testing. "Promotion" becomes just the act of updating the reference on a distribution to a different publication. When testing on version 2 is done, assuming it passes, you can update the production distribution to make it use version 2.
>>>>>
>>>>> There are a few use cases for publishing an old version.
>>>>>
>>>>> One is: I want to publish the same exact content set two different ways, with two different publishers. If the contents change between publishes, I want a guarantee that it won't cause the second publish to use different content than the first.
>>>>>
>>>>> Second: I like the state of the content in a repo as it is right now. I want to publish that exact content set. If any changes happen to the content in that repo between now and when my publish task gets run by a worker, I don't want those changes to affect the publish I'm requesting right now.
>>>>>
>>>>> Third: I want the ability to roll back from a bad content set to a known-good one. How many publications must I keep around to have confidence that if I need to roll back some distance, that publication will still be available? It's valuable to know I can re-publish an older version any time I need it.
>>>>>
>>>>> Fourth: In some cases you may decide after-the-fact that you need to publish the same content set a different way. Maybe you went to kickstart from a yum repo and then remembered that (this is a true story) one version of your installer is too old to know about sha256 checksums, so you have to go re-publish the same content set with different settings for how the metadata gets generated.
>>>>>
>>>>> Otherwise, just as reproducible builds of software are very valuable, reproducible publishes of repositories are valuable for similar reasons.
>>>>>
>>>>>
>>>>>>
>>>>>> As a user / content developer, it seems more useful to me to always publish the latest (i.e. don't have an optional version for publishing), but have the ability to copy from a specific version of a repo into another repo (or the same repo, effectively reverting the content of latest).
>>>>>>
>>>>>> So I would shift the discussion away from the REST API (for now), and more into the expected behavior for manipulating content within pulp. The operations I am aware of are: syncing units, importing units, copying/deleting units, and I am seeking clarification on how versioning will work for each.
>>>>>>
>>>>>> Syncing is probably the easiest, because it can handle all the changes internally and create a new version at the end.
>>>>>>
>>>>>> For importing, if you don't want to create unnecessary intermediate versions that are meaningless, I would want the ability to upload more than one unit and associate it to the repo, and then create a version. In other words, a transactional multi-upload.
>>>>>
>>>>>
>>>>> Indeed. We want to have a behavior in Pulp 3 anyway that lets you arbitrarily add and remove multiple content units in one operation. That's one of the more notable missing features from Pulp 2. As Brian has pointed out, one option is to let a user directly POST to a "versions" endpoint and express what content they want to add/remove. Even without repo versions, we'd still want an API that lets you bulk add/remove.
>>>>>
>>>>>>
>>>>>> For copying, as suggested above, I want to optionally specify the version.
>>>>>>
>>>>>> Deleting by itself is not hard, it does what it needs to do and then creates a version.
>>>>>>
>>>>>> The more complicated use case would be: what if I wanted to change the contents of repoA:
>>>>>> * add 3 packages from repo1 version 1
>>>>>> * add 4 packages from repo2 (latest)
>>>>>> * delete 5 packages
>>>>>>
>>>>>> and at the end have a single version change for repoA.
>>>>>>
>>>>>> Or, for the same repoA:
>>>>>> * delete all units of type "rpm" and name "glibc"
>>>>>> * copy unit type "rpm" and name "glibc" from two versions ago
>>>>>>
>>>>>>
>>>>>> If you wanted this use case, then you need a new resource type, somewhat similar to a Task, let's call it Transaction. It is tied to the repository it operates on (repoA in the example above), and locks it from further changes until the transaction is committed or aborted. It could be implemented internally as a repository. You start with the current contents of repoA, and you perform whatever operations you need to do (including changing repo metadata). When you "commit" the Transaction, it becomes *the* new version of the repository and unlocks repoA.
>>>>>
>>>>>
>>>>> Yep, we're on the same page with the use case I think. The other option is to let you as a user query for whatever content you care about adding and removing; find it however you see fit. Then use the bulk add/remove feature to carry that out in one operation.
>>>>>
>>>>> I do like the idea of persistently storing a Transaction as you call it, and possibly even letting a user build one explicitly. Even just as an implementation detail, any bulk add/remove endpoint may need to store the requested changes temporarily in the database as a means to get the input from the web handler to a celery worker. We probably don't want to stuff 10k+ content references into an AMQP message and pass them all in as an argument to the task. And if we're going to store them in the DB, maybe it would make sense to expose that to the user and let them create a Transaction directly.
>>>>>
>>>>>>
>>>>>> Whether a Version is a full copy of the repo or a delta is an implementation detail. I would argue for full copy, otherwise you run into the inefficiencies of cvs which had to apply patches in reverse order just to get to a version in the past. I would find it more useful to have a repo diff resource (diff version 1 with version 3, or repo1 version 1 with repo2 latest).
>>>>>
>>>>>
>>>>> Agreed that it's an implementation detail. In the case of cvs and similar, all changes had to be applied sequentially in order to construct a final product. When you're only tracking set membership, querying becomes MUCH simpler and is very efficient.
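The Transaction resource discussed above (start from the current contents of repoA, apply several adds and removes while the repo is locked, then commit them as a single new version) might be sketched as follows. This is only an illustration under stated assumptions: `Repo`, `Transaction`, `RepositoryLocked` and their methods are hypothetical names, versions are kept as in-memory frozensets purely for brevity, and none of this reflects an actual Pulp API.

```python
class RepositoryLocked(Exception):
    """Raised when a second transaction is opened on a locked repo (hypothetical)."""


class Repo:
    def __init__(self):
        self.versions = [frozenset()]  # index 0 is the empty starting version
        self.locked = False

    @property
    def latest(self):
        return self.versions[-1]


class Transaction:
    """Stages adds/removes against one repo and commits them as one version."""

    def __init__(self, repo):
        if repo.locked:
            raise RepositoryLocked("another transaction is open on this repo")
        repo.locked = True
        self.repo = repo
        self.staged = set(repo.latest)  # start with the current contents
        self.is_open = True

    def _require_open(self):
        if not self.is_open:
            raise RuntimeError("transaction already committed or aborted")

    def _close(self):
        self.is_open = False
        self.repo.locked = False

    def add(self, units):
        self._require_open()
        self.staged |= set(units)

    def remove(self, units):
        self._require_open()
        self.staged -= set(units)

    def commit(self):
        """Publish the staged set as *the* new version and unlock the repo."""
        self._require_open()
        self.repo.versions.append(frozenset(self.staged))
        self._close()
        return len(self.repo.versions) - 1

    def abort(self):
        """Discard staged changes and unlock the repo; no version is created."""
        self._require_open()
        self._close()


repo_a = Repo()
setup = Transaction(repo_a)
setup.add({"old1", "old2", "old3", "old4", "old5", "keep"})
setup.commit()  # version 1

txn = Transaction(repo_a)
txn.add({"r1-a", "r1-b", "r1-c"})                     # 3 packages chosen from repo1 version 1
txn.add({"r2-a", "r2-b", "r2-c", "r2-d"})             # 4 packages chosen from repo2 (latest)
txn.remove({"old1", "old2", "old3", "old4", "old5"})  # delete 5 packages
number = txn.commit()
print(number, len(repo_a.latest))  # 2 8
```

Three separate operations end up as a single version change for repoA. A persistent variant would store the staged changes in the database rather than in memory, along the lines Michael describes for getting bulk input from the web handler to a worker.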
>>>>>
>>>>>>
>>>>>>
>>>>>> Unfortunately, it is a rather large paradigm shift, and not one that you can push in a 3.0 -> 3.1 transition. Parts of it will need to land in 3.0 proper, determining what can be left out is an exercise to the reader who managed to keep up with my long emails.
>>>>>>
>>>>>> Hey, a man can dream.
>>>>>
>>>>>
>>>>> I'm dreaming with you! (and also likely putting people to sleep with my own long emails) I also think this is a hallmark behavior that is important to get right conceptually, and very important to a variety of stakeholders.
>>>>>
>>>>> Thanks a lot for sharing your insight! If you have more thoughts on these use cases, please keep them coming.
>>>>>
>>>>> _______________________________________________
>>>>> Pulp-dev mailing list
>>>>> [email protected]
>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>
>> --
>>
>> Michael Hrivnak
>>
>> Principal Software Engineer, RHCE
>>
>> Red Hat
