Both of those tickets look good to me. I can clearly see how the proposed changes will simplify the plugin writing experience.
Thank you both for putting together this plan. On Thu, Apr 19, 2018 at 2:50 PM, Brian Bouterse <[email protected]> wrote: > Jeff and I met and we put together two pieces of work which would create a > declarative interface for a plugin writer to use. This would be used in > stead of the Changeset interface by plugin writers. Whether or not to > continue including the ChangeSet in the plugin API is still being discussed. > > There seemed to be interest in offering an interface like this so on > Monday we will put together a PR so that we can see what it looks like and > how hard it would be to switch. Look at these stories in the hopes that we > can groom them and put them on the sprint. > > * https://pulp.plan.io/issues/3570 > * https://pulp.plan.io/issues/3582 > > Our plan is to start on ^ on Monday so if there are questions, ideas, or > concerns let us know. Once we have something to share, we'll email back to > this thread. Feel free to comment on the issues directly also. > > Thanks, > Brian & Jeff > > > On Mon, Apr 16, 2018 at 3:10 PM, Dennis Kliban <[email protected]> wrote: > >> On Mon, Apr 16, 2018 at 2:13 PM, Dennis Kliban <[email protected]> >> wrote: >> >>> On Mon, Apr 16, 2018 at 12:21 PM, Jeff Ortel <[email protected]> wrote: >>> >>>> Thanks for the proposal, Brian. I also commented on the issue. >>>> >>>> On 04/16/2018 09:41 AM, Brian Bouterse wrote: >>>> >>>> I wrote up a description of the opportunity I see here [0]. I put a >>>> high level pro/con analysis below. I would like feedback on (a) if this >>>> adequately addresses the problem statements, (b) if there are alternatives, >>>> and (c) does this improve the plugin wrtier's experience enough to adopt >>>> this? >>>> >>>> pros: >>>> * significantly less plugin code to write. Compare the Thing example >>>> code versus the current docs. >>>> >>>> +1 >>>> >>>> * Higher performing with metadata downloading and parsing being >>>> included in stream processing. This causes sync's for pulp_ansible to start >>>> 6+ min earlier. >>>> >>>> >>>> This could also be done currently with the ChangeSet as-is. >>>> >>>> >>>> cons: >>>> * Progress reporting doesn't know how many things it's processing (it's >>>> a stream). So user's would see progress as "X things completed", not "X of >>>> Y things completed". Y can't be known until just before the stream >>>> processing completes otherwise it's not stream processing. >>>> >>>> >>>> I'm not a fan of the SizedIterator either. >>>> I contemplated this when designing the ChangeSet. An alternative I >>>> considered was to report progress like OSTree does. It reports progress by >>>> periodically updating the expected TOTAL. It's better than nothing. >>>> >>>> >>> What if we allow plugin writers to optionally provide a total number >>> when instantiating the ChangeSet? I bet there will be cases where the >>> number of items in the repository version will be known without having to >>> fully parse all the metadata. In these cases the progress reporting could >>> be more informative. >>> >>> >> >> Here is another idea for progress reporting for stream processing: have >> ChangeSet create a separate progress report for downloads. The total could >> by dynamically updated as downloads are scheduled. The complete count can >> be updated after each successful download. >> >> Any limitations in progress reporting are outweighed by the efficiency >> gained by having plugins always use stream processing. Just imagine not >> having to wait for the RPM plugin to finish "processing metadata" to start >> downloading content. >> >> >>> >>>> >>>> [0]: https://pulp.plan.io/issues/3570 >>>> >>>> Thanks! >>>> Brian >>>> >>>> >>>> >>>> On Thu, Apr 12, 2018 at 7:12 PM, Jeff Ortel <[email protected]> wrote: >>>> >>>>> >>>>> >>>>> On 04/12/2018 04:00 PM, Brian Bouterse wrote: >>>>> >>>>> >>>>> On Thu, Apr 12, 2018 at 11:53 AM, Jeff Ortel <[email protected]> >>>>> wrote: >>>>> >>>>>> >>>>>> >>>>>> On 04/12/2018 10:01 AM, Brian Bouterse wrote: >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> On 04/11/2018 03:29 PM, Brian Bouterse wrote: >>>>>>> >>>>>>> I think we should look into this in the near-term. Changing an >>>>>>> interface on an object used by all plugins will be significantly easier, >>>>>>> earlier. >>>>>>> >>>>>>> >>>>>>> On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 04/11/2018 10:59 AM, Brian Bouterse wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On 04/06/2018 09:15 AM, Brian Bouterse wrote: >>>>>>>>> >>>>>>>>> Several plugins have started using the Changesets including >>>>>>>>> pulp_ansible, pulp_python, pulp_file, and perhaps others. The >>>>>>>>> Changesets >>>>>>>>> provide several distinct points of value which are great, but there >>>>>>>>> are two >>>>>>>>> challenges I want to bring up. I want to focus only on the problem >>>>>>>>> statements first. >>>>>>>>> >>>>>>>>> 1. There is redundant "differencing" code in all plugins. The >>>>>>>>> Changeset interface requires the plugin writer to determine what >>>>>>>>> units need >>>>>>>>> to be added and those to be removed. This requires all plugin writers >>>>>>>>> to >>>>>>>>> write the same non-trivial differencing code over and over. For >>>>>>>>> example, >>>>>>>>> you can see the same non-trivial differencing code present in >>>>>>>>> pulp_ansible >>>>>>>>> <https://github.com/pulp/pulp_ansible/blob/d0eb9d125f9a6cdc82e2807bcad38749967a1245/pulp_ansible/app/tasks/synchronizing.py#L217-L306>, >>>>>>>>> pulp_file >>>>>>>>> <https://github.com/pulp/pulp_file/blob/30afa7cce667b57d8fe66d5fc1fe87fd77029210/pulp_file/app/tasks/synchronizing.py#L114-L193>, >>>>>>>>> and pulp_python >>>>>>>>> <https://github.com/pulp/pulp_python/blob/066d33990e64b5781c8419b96acaf2acf1982324/pulp_python/app/tasks/sync.py#L172-L223>. >>>>>>>>> Line-wise, this "differencing" code makes up a large portion (maybe >>>>>>>>> 50%) of >>>>>>>>> the sync code itself in each plugin. >>>>>>>>> >>>>>>>>> >>>>>>>>> Ten lines of trivial set logic hardly seems like a big deal but >>>>>>>>> any duplication is worth exploring. >>>>>>>>> >>>>>>>> It's more than ten lines. Take pulp_ansible for example. By my >>>>>>>> count (the linked to section) it's 89 lines, which out of 306 lines of >>>>>>>> plugin code for sync is 29% of extra redundant code. The other plugins >>>>>>>> have >>>>>>>> similar numbers. So with those numbers in mind, what do you think? >>>>>>>> >>>>>>>> >>>>>>>> I was counting the lines (w/o comments) in find_delta() based on >>>>>>>> the linked code. Which functions are you counting? >>>>>>>> >>>>>>> >>>>>>> I was counting the find_delta, build_additions, and build_removals >>>>>>> methods. Regardless of how the lines are counted, that differencing >>>>>>> code is >>>>>>> the duplication I'm talking about. There isn't a way to use the >>>>>>> changesets >>>>>>> without duplicating that differencing code in a plugin. >>>>>>> >>>>>>> >>>>>>> The differencing code is limited to find_delta() and perhaps >>>>>>> build_removals(). Agreed, the line count is less useful than >>>>>>> specifically >>>>>>> identifying duplicate code. Outside of find_delta(), I see similar code >>>>>>> (in part because it got copied from file plugin) but not seeing actual >>>>>>> duplication. Can you be more specific? >>>>>>> >>>>>> >>>>>> Very similar code or identical code, I think it begs the question why >>>>>> are we having plugin writer's do this at all? What value are they >>>>>> creating >>>>>> with it? I don't have a reasonable answer to that question, so the >>>>>> requirement for plugin writer's to write that code brings me back to the >>>>>> problem statement: "plugin writers have redundant differencing code when >>>>>> using Changesets". More info on why it is valuable for the plugin writer >>>>>> to >>>>>> do the differencing code versus the Changesets would be helpful. >>>>>> >>>>>> >>>>>> The ChangeSet abstraction (and API) is based on following division of >>>>>> responsibility: >>>>>> >>>>>> The plugin (with an understanding of the remote and its content): >>>>>> - Download metadata. >>>>>> - Parse metadata >>>>>> - Based on the metadata: >>>>>> - determine content to be added to the repository. >>>>>> - define how artifacts are downloaded. >>>>>> - construct content >>>>>> - determine content to be removed to the repository. >>>>>> >>>>>> Core (without understand of specific remote or its content): >>>>>> - Provide low level API for plugin to affect the changes it has >>>>>> determined need to be made to the repository. This is downloaders, >>>>>> models >>>>>> etc. >>>>>> - Provide high(er) level API for plugin to affect the changes it >>>>>> has determined need to be made to the repository. This is the ChangeSet. >>>>>> >>>>>> Are you proposing that this is not the correct division? >>>>>> >>>>> >>>>> Yes I believe these problem statements suggest we should adjust the >>>>> plugin writer's responsibilities when interacting with the Changesets in >>>>> two specific ways. It's not exactly the language you used, but I believe >>>>> the following two responsibilities could be moved into the Changesets >>>>> entirely: >>>>> >>>>> - determining if any given Artifact or Content unit is already present >>>>> in Pulp (aka computing what needs tobe added) >>>>> >>>>> >>>>> Did you mean *added* to the repository or *created* in pulp. >>>>> Currently, the plugin determines the content that needs to be added to the >>>>> repository. This is modeled using a PendingContent which fully defines >>>>> the >>>>> Content (unit) and its PendingArtifact(s) which are included in the >>>>> *additions*. The ChangeSet does determine whether or not any >>>>> artifacts need to be downloaded (and downloads them based on policy) and >>>>> determines which Content needs to be *created* vs simply added to the >>>>> repository. The plugin blindly assumes that none of the *pending* >>>>> content has yet been created pulp. This accomplishes 2 things. 1) >>>>> reduces >>>>> complexity and decision making by the plugin. 2) provides the ChangeSet >>>>> with all the information needed to *create* and *download* as >>>>> needed. The *additions* represents what the plugin wants to be added >>>>> to the repository to synchronize it with the remote repository. >>>>> >>>>> - determining which content units need to be removed (aka computing >>>>> the removals) >>>>> >>>>> >>>>> I don't see how the ChangeSet has enough information to do this. The >>>>> plugin can (most likely will) make the decision about what to remove based >>>>> on remote metadata, policy and configuration. >>>>> >>>>> >>>>> ^ goals are a restating of the problem statement that plugin writers >>>>> are asked to do differencing calculations when the Changesets could >>>>> provide >>>>> that to the plugin writer instead. >>>>> >>>>>> >>>>>> >>>>>>> >>>>>>> >>>>>>> So a shorter, simpler problem statement is: "to use the changesets >>>>>>> plugin writers have to do extra work to compute additions and removals >>>>>>> parameters". >>>>>>> >>>>>>> >>>>>>> This statement ^ is better but still too vague to actually solve. >>>>>>> Can we elaborate on specifically what "to do extra work" means? >>>>>>> >>>>>> >>>>>> Sure. Removing that vague language is one way to resolve its >>>>>> vagueness. Here's a revised problem statement: "to use the >>>>>> changesets plugin writers have to compute additions and removals >>>>>> parameters". This problem statement would be resolved by a solution that >>>>>> causes the plugin writer to never have to produce these parameters and be >>>>>> replaced by an interface that would require less effort from a plugin >>>>>> writer. >>>>>> >>>>>> >>>>>> I think it's the plugin's responsibility to determine the >>>>>> difference. Aside from that: without an understanding of the metadata >>>>>> and >>>>>> content type, how could the ChangeSet do this? What might that looks >>>>>> like? >>>>>> >>>>> >>>>> If I'm understanding this correctly, the Changesets already do this >>>>> for additions right? Help check my understanding. If a plugin writer >>>>> delivers PendingContent and PendingArtifacts to the Changesets as >>>>> 'additions', the Changesets will recognize them as already downloaded and >>>>> not download them right? If this is the case, what is the benefit of >>>>> having >>>>> plugin writers also try to figure out if things should be downloaded? >>>>> >>>>> >>>>> As you pointed out, the plugin writer does not need to figure out what >>>>> needs to be downloaded. >>>>> >>>>> >>>>> >>>>>> >>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> 2. Plugins can't do end-to-end stream processing. The Changesets >>>>>>>>> themselves do stream processing, but when you call into >>>>>>>>> changeset.apply_and_drain() you have to have fully parsed the metadata >>>>>>>>> already. Currently when fetching all metadata from Galaxy, >>>>>>>>> pulp_ansible >>>>>>>>> takes about 380 seconds (6+ min). This means that the actual Changeset >>>>>>>>> content downloading starts 380 seconds later than it could. At the >>>>>>>>> heart of >>>>>>>>> the problem, the fetching+parsing of the metadata is not part of the >>>>>>>>> stream >>>>>>>>> processing. >>>>>>>>> >>>>>>>>> >>>>>>>>> The additions/removals can be any interable (like generator) and >>>>>>>>> by using ChangeSet.apply() and iterating the returned object, the >>>>>>>>> pluign >>>>>>>>> can "turn the crank" while downloading and processing the metadata. >>>>>>>>> The >>>>>>>>> ChangeSet.apply_and_drain() is just a convenience method. I don't >>>>>>>>> see how >>>>>>>>> this is a limitation of the ChangeSet. >>>>>>>>> >>>>>>>> >>>>>>>> That is new info for me (and maybe everyone). OK so Changesets have >>>>>>>> two interfaces. apply() and apply_and_drain(). Why do we have two >>>>>>>> interfaces when apply() can support all existing use cases (that I >>>>>>>> know of) >>>>>>>> and do end-to-end stream processing but apply_and_drain() cannot? I >>>>>>>> see all >>>>>>>> of our examples (and all of our new plugins) using apply_and_drain(). >>>>>>>> >>>>>>>> >>>>>>>> The ChangeSet.apply() was how I designed (and documented) it. Not >>>>>>>> sure when/who added the apply_and_drain(). +1 for removing it. >>>>>>>> >>>>>>> >>>>>>> I read through the changeset docs. I think this stream processing >>>>>>> thing is still a problem but perhaps in how we're presenting the >>>>>>> Changeset >>>>>>> with it's arguments. I don't think apply() versus apply_and_drain() are >>>>>>> at >>>>>>> all related. Regardless of if you are using apply() or >>>>>>> apply_and_drain(), >>>>>>> the Changeset requires an 'additions' and 'removals' arguments. This >>>>>>> sends >>>>>>> a clear message to the plugin writer that they need to compute additions >>>>>>> and removals. They will fetch the metadata to compute these which is >>>>>>> mostly >>>>>>> how the changeset documentation reads. To know that they could present a >>>>>>> generator that would correctly allow the metdata from inside the >>>>>>> Changeset >>>>>>> is I feel as non-obvious. I want the high-performing implementation to >>>>>>> be >>>>>>> the obvious one. >>>>>>> >>>>>>> So what about a problem statement like this: "Changesets are >>>>>>> presented such that when you call into them you should already have >>>>>>> fetched >>>>>>> the metadata"? >>>>>>> >>>>>>> >>>>>>> I'm not sure what is meant by "presented". If this means that we >>>>>>> should provide an example of how the ChangeSet can be used by plugins >>>>>>> (with >>>>>>> large metadata) in such a way that does not require downloading all the >>>>>>> metadata first - that sounds like a good idea. >>>>>>> >>>>>> >>>>>> Cool so this is transitioning to ideas for resolution. The solution >>>>>> to add documentation on how to do this with the existing interface is one >>>>>> option. My concern with adding additional docs on how to use the current >>>>>> interface better is that if users choose to follow the existing docs then >>>>>> they will have the stream processing problem once again. To me, this >>>>>> suggests that this new example should actually replace the existing >>>>>> documentation. >>>>>> >>>>>> >>>>>> Seems like both example would be useful. I'm not convinced that all >>>>>> plugins would benefit from this. For example: the File plugin manifest >>>>>> is >>>>>> small and would likely not benefit from the extra complexity. For >>>>>> complicated plugins (like RPM), can differencing decision be made before >>>>>> analyzing the entire metadata (eg: primary.xml)? Also, it's not clear to >>>>>> me how this would work using the Downloader. Are you suggesting that the >>>>>> plugin would parse/process metadata files while they're being downloaded? >>>>>> Perhaps a better understanding of the flow to be supported would help me >>>>>> understand this. >>>>>> >>>>> >>>>> Yes I am suggesting just that: that the Changesets could facilitate >>>>> parse/processing metadata files while actual content named in those files >>>>> is also being downloaded. I have a straightforward idea on how to achieve >>>>> this. It's short and easy enough to write up (no code), but I want to make >>>>> sure I'm not moving beyond the problem statement without others. Is there >>>>> more we want to do on these problem statements, or would answering a bit >>>>> about one way it could work be helpful? >>>>> >>>>> >>>>> The *additions* can be (and usually is) a generator. The generator >>>>> can yield based on metadata as it is downloaded and digested. In this >>>>> way, >>>>> the ChangeSet already facilitates this. >>>>> >>>>> >>>>> Just to state my expectations: Moving beyond the problem statement I >>>>> don't consider to be a commitment to solve it; just an agreement on what >>>>> we're solving as we discuss various resolutions. Problem statements can >>>>> also always be revisited. Either way forward is fine w/ me, just let me >>>>> know how we should continue >>>>> >>>>> >>>>> So far, I'm not convinced that any specific problems/deficiencies have >>>>> been identified. That said, you seem to have a different abstraction in >>>>> mind. I would be interested in reviewing it and how it would be used by >>>>> plugin writers. It may help illustrate the gains that you are >>>>> envisioning. >>>>> >>>>> >>>>> >>>>>> >>>>>> >>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Do you see the same challenges I do? Are these the right problem >>>>>>>>> statements? I think with clear problem statements a solution will be >>>>>>>>> easy >>>>>>>>> to see and agree on. >>>>>>>>> >>>>>>>>> >>>>>>>>> I'm not convinced that these are actual problems/challenges that >>>>>>>>> need to be addressed in the near term. >>>>>>>>> >>>>>>>>> >>>>>>>>> Thanks! >>>>>>>>> Brian >>>>>>>>> >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Pulp-dev mailing >>>>>>>>> [email protected]https://www.redhat.com/mailman/listinfo/pulp-dev >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Pulp-dev mailing list >>>>>>>>> [email protected] >>>>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Pulp-dev mailing list >>>>>> [email protected] >>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev >>>>>> >>>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> Pulp-dev mailing list >>>>> [email protected] >>>>> https://www.redhat.com/mailman/listinfo/pulp-dev >>>>> >>>>> >>>> >>>> >>>> _______________________________________________ >>>> Pulp-dev mailing list >>>> [email protected] >>>> https://www.redhat.com/mailman/listinfo/pulp-dev >>>> >>>> >>> >> >> _______________________________________________ >> Pulp-dev mailing list >> [email protected] >> https://www.redhat.com/mailman/listinfo/pulp-dev >> >> >
_______________________________________________ Pulp-dev mailing list [email protected] https://www.redhat.com/mailman/listinfo/pulp-dev
