After some offline discussion with several Pulp devs, we decided to dedicate this thread to one problem - duplicates (and move the other problem - filtering/validation - to a different thread).
The current proposal is to have a repo_key on a content model (thanks, Simon) and ensure its uniqueness at repository version creation time in pulpcore. I updated the issue https://pulp.plan.io/issues/5008 accordingly, see 'Solution' section of the description. Please share your thoughts or concerns, preferably on the issue. Thank you, Tanya On Wed, Jul 24, 2019 at 10:02 PM Brian Bouterse <[email protected]> wrote: > > > On Mon, Jul 22, 2019 at 4:47 AM Tatiana Tereshchenko <[email protected]> > wrote: > >> >> >> On Sun, Jul 21, 2019 at 3:00 PM Brian Bouterse <[email protected]> >> wrote: >> >>> >>> >>> On Sun, Jul 21, 2019 at 6:23 AM Tatiana Tereshchenko < >>> [email protected]> wrote: >>> >>>> +1 to the idea of a repo_key. >>>> >>>> Should we also add the ability to apply custom validation of the >>>> content being added? >>>> Similar to a repo_key, Content model can optionally provide an >>>> additional validator. >>>> Use cases: >>>> - for pulp_file to avoid relative path overlap - e.g. 'a/b' and 'a' >>>> >>> In thinking this over more, I'm unsure that pulp_file has the use case. >>> Two different Artifacts having relative paths 'a' and 'a/b' in one repo >>> version doesn't seem problematic. This problem statement is similar to the >>> Distribution.base_path overlap problem statement where it's unavoidably >>> ambiguous which Distribution should be matched when base_paths are allowed >>> to overlap. In this case for pulp_file, it's not ambiguous in the same way, >>> the relative_path I expect to match to exactly 1 content unit either 'a', >>> or 'a/b', but not both. What do you think about this? >>> >> >> I agree that the problem is similar to the Distribution.base_path >> overlap. If I understand you correctly, yes, it's not a problem if you >> query content one by one. >> What about use cases when we want to have a repo version on a filesystem? >> E.g. Browsable repositoiries (this feature has already been asked for by >> our stakeholders), export (e.g. rsync). >> > I see now that the exporting to a filesystem use case make any repo > containing units 'a' and 'a/b' because 'a' can't be both a directory and a > file on POSIX filesystems. We should identify this requirement in our > plugin docs somehow. > > I was thinking webservers themselves are in the same situation. They have > to store content on disk, but they also have to serve 'a' and 'a/b' to the > user. To make this work they have 'a' actually serve up index.html. In > practice the binary data served at 'a' isn't stored at 'a' on the > filesystem but actually 'a/index.html' or 'a/index.txt' or something. The > data you would serve up as a directory's response would be an index of the > directory so this makes logical sense to me also. At export time one option > is to translate 'a' to 'a/index.html' which in terms of having webservers > serving it back up is kind of necessary. What do you think? > > If ^ is what we do then the overlapping 'a' and 'a/b' I think would be ok > again. What do you think about this? > > >> >>> - for pulp_rpm to filter by signature/signing key >>>> >>> Can we expand on this use case a bit? Is it that the repo version should >>> only contains units signed or unsigned rpms? Or is it that we are ok with a >>> mixture as long as each NEVRA is unique? I suspect the former, but I want >>> to be sure. >>> >> >> I think it should contain only signed units and optionally signed by >> specific keys only. See pulp 2 feature description >> https://docs.pulpproject.org/plugins/pulp_rpm/user-guide/features.html#package-signatures-and-gpg-key-id-filtering >> Another use case which comes to mind is: keeping the last N versions of a >> unit within a repo verison. >> > I agree w/ this use case. In the context of repo_key, I don't see a > relationship between the repo_key mechanism for preserving repo-level > uniqueness and how it would also resolve this. Did you have an idea that > could handle all of these cases with one mechanism? > > >> Tanya >> >> >>> >>>> Plugins can solve it by defining their own stage but it seems like >>>> almost any plugin needs to ensure absence of collisions specific to it, >>>> even the simple pulp_file. >>>> It means that our default pipeline becomes less useful and will be >>>> hardly ever used by any [currently known] plugins. >>>> >>>> Any thoughts? >>>> >>>> Tanya >>>> >>>> >>>> On Mon, Jul 8, 2019 at 9:09 PM Brian Bouterse <[email protected]> >>>> wrote: >>>> >>>>> I want to retell Simon's proposal to have "Content defines a >>>>> 'repo_key' similar to a unit_key. This key must be unique within a repo >>>>> version (and not globally like the unit_key." >>>>> >>>>> We could adopt his proposal to have the repo_key tuple defined on >>>>> Content in pulpcore. If we left the add/remove APIs in core and adopt for >>>>> both sync and add/remove a "keep newest to associate" functionality >>>>> described earlier in the thread. This "keep newest to associate" code >>>>> would >>>>> be used by sync in the form of a core stage that is a generalized version >>>>> of the RemoveDuplicates stage. This would become part of the default >>>>> pipeline for all users of Stages API. I think this would be better than >>>>> plugin writers implementing it over and over and also less effort for >>>>> plugin wrtiers. This design would meet the current needs of pulp_cookbook, >>>>> pulp_file, and pulp_rpm which are the only 3 places I know we have this >>>>> problem so far, but I believe more content types are susceptible to this. >>>>> >>>>> What do you think we should do? >>>>> >>>>> Thanks! >>>>> Brian >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Thu, Jun 27, 2019 at 4:03 AM Tatiana Tereshchenko < >>>>> [email protected]> wrote: >>>>> >>>>>> Sure, the code can be de-duplicated. >>>>>> My main worry is that it's a responsibility of a plugin writer not to >>>>>> forget to ensure uniqueness constraints within a repo version for every >>>>>> workflow (sync, copy, anything else) where a repo version is created. >>>>>> Every time before RepositoryVersion.create() is called, there should >>>>>> be a check that there is no colliding content in a repo version. >>>>>> It would be much more reliable and friendly, in my opinion, if plugin >>>>>> writer could define rules/callbacks/whatever for each content and it >>>>>> would >>>>>> be applied to any repository creation. >>>>>> At the same time this eliminates the flexibility to define different >>>>>> logic for content collision for different workflows, however I'm not sure >>>>>> if such a use case exists or is desired. >>>>>> >>>>>> Tanya >>>>>> >>>>>> >>>>>> On Wed, Jun 26, 2019 at 6:49 PM Austin Macdonald <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> @Tanya Tereshchenko <[email protected]> >>>>>>> >>>>>>>> Do I understand correctly that it doesn't cover the sync case and >>>>>>>> it's only about explicit repo version creation? >>>>>>>> >>>>>>> >>>>>>> I don't mean that add/remove could not share code with remove >>>>>>> duplicate stage. I wanted to point out that we have a problem here (how >>>>>>> to >>>>>>> remove duplicates) that has similar patterns to other problems with add >>>>>>> remove (recursive, copy, deciding which content to keep with a >>>>>>> collision, >>>>>>> etc.) I don't doubt that pulpcore could help solve these problems, but I >>>>>>> think that as we approach our GA, we should consider solving this >>>>>>> problem >>>>>>> (for now) by getting out of the way of plugin writers rather than by >>>>>>> implementing code that is supposed to work for all plugins. I suspect >>>>>>> that >>>>>>> plenty of the plugins will be implementing their own add/remove anyway. >>>>>>> >>>>>>> On Tue, Jun 25, 2019 at 12:56 PM David Davis <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> I don't think this solution would work in the case of creating a >>>>>>>> new repository version. Suppose for example you had two content units >>>>>>>> that >>>>>>>> collide, one in a repo version and one older unit that a user >>>>>>>> explicitly >>>>>>>> wants to add to the repo version. If the latter one is older, then what >>>>>>>> would happen? >>>>>>>> >>>>>>>> David >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Jun 25, 2019 at 12:48 PM Brian Bouterse < >>>>>>>> [email protected]> wrote: >>>>>>>> >>>>>>>>> Having a way for units to express their uniqueness per repo sounds >>>>>>>>> good because then more areas of Pulp's code could answer the question: >>>>>>>>> "will I have a duplicate if I add content X to repo_version Y". >>>>>>>>> >>>>>>>>> Let's assume we know that situation is about to occur during sync >>>>>>>>> for example, what do we do about it? In the errata case we know the >>>>>>>>> "new" >>>>>>>>> one should replace the existing one. Maybe we start to 'order' the >>>>>>>>> units >>>>>>>>> with colliding repo keys and keep the newest one always? Would this >>>>>>>>> work >>>>>>>>> for pulp_cookbook and pulp_rpm? Would it generalize? Is this what you >>>>>>>>> imagined? >>>>>>>>> >>>>>>>>> On Tue, Jun 25, 2019 at 5:30 AM Tatiana Tereshchenko < >>>>>>>>> [email protected]> wrote: >>>>>>>>> >>>>>>>>>> Do I understand correctly that it doesn't cover the sync case and >>>>>>>>>> it's only about explicit repo version creation? >>>>>>>>>> So the suggestion is to implement the same logic twice: for sync >>>>>>>>>> case - RemoveDuplicates stage and/or maybe some custom stage (e.g. to >>>>>>>>>> disallow overlapping paths), and for direct repo version creation - >>>>>>>>>> your >>>>>>>>>> proposal. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Mon, Jun 24, 2019 at 3:13 PM Austin Macdonald < >>>>>>>>>> [email protected]> wrote: >>>>>>>>>> >>>>>>>>>>> I have a design in mind for solving this problem: >>>>>>>>>>> >>>>>>>>>>> 1. Remove POST to RepositoryVersion (no general add/remove >>>>>>>>>>> endpoint). >>>>>>>>>>> 2. Add an endpoint to kick off an add/remove task, namespaced by >>>>>>>>>>> plugin. ie `POST pulp/api/v3/docker/add-remove/` >>>>>>>>>>> This view can be provided to all plugins by the plugin >>>>>>>>>>> template, and will be based on the current RepositoryVersionCreate: >>>>>>>>>>> >>>>>>>>>>> https://github.com/pulp/pulpcore/blob/master/pulpcore/app/viewsets/repository.py#L221-L258 >>>>>>>>>>> Note: the main purpose of this view is to kick off the >>>>>>>>>>> general add/remove task, which will be unchanged: >>>>>>>>>>> >>>>>>>>>>> https://github.com/pulp/pulpcore/blob/master/pulpcore/app/tasks/repository.py#L70 >>>>>>>>>>> 3. Add an add/remove serializer to the plugin API. >>>>>>>>>>> 3. Plugins needing further customization can provide their own >>>>>>>>>>> task and subclassed serializer. >>>>>>>>>>> >>>>>>>>>>> This gives the plugin writer full control over the endpoint >>>>>>>>>>> (customizable arguments and validation), and full control over the >>>>>>>>>>> flow >>>>>>>>>>> (extra logic, depsolving, enforced uniqueness). It only uses the >>>>>>>>>>> existing >>>>>>>>>>> patterns (and existing required knowledge), but requires no work >>>>>>>>>>> (other >>>>>>>>>>> than using the template) for the simple case. >>>>>>>>>>> >>>>>>>>>>> On Mon, Jun 3, 2019 at 2:56 PM Simon Baatz <[email protected]> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> On Mon, Jun 03, 2019 at 09:11:07AM -0400, David Davis wrote: >>>>>>>>>>>> > @Simon I like the idea behind the repo_key solution you >>>>>>>>>>>> came up with. >>>>>>>>>>>> > Can you be more specific around cases you think that it >>>>>>>>>>>> couldn't >>>>>>>>>>>> > handle? I imagine that plugin writers could use properties >>>>>>>>>>>> or >>>>>>>>>>>> > denormailzation (ie additional database columns) to solve >>>>>>>>>>>> cases where >>>>>>>>>>>> > they need uniqueness across data that isn't in the >>>>>>>>>>>> database. In a worst >>>>>>>>>>>> > case scenario, they can't use the pulpcore solution and >>>>>>>>>>>> just have to >>>>>>>>>>>> > roll their own. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> What I wrote probably sounded too pessimistic. You are right, in >>>>>>>>>>>> most cases that should be doable. >>>>>>>>>>>> >>>>>>>>>>>> I agree that we could have a simple default solution that just >>>>>>>>>>>> requires to specify a couple of field names in the easiest >>>>>>>>>>>> case. As you >>>>>>>>>>>> say, it should be possible use custom logic in a plugin if >>>>>>>>>>>> required. >>>>>>>>>>>> >>>>>>>>>>>> Here is the case I was thinking of that it can't handle: >>>>>>>>>>>> >>>>>>>>>>>> In pulp_file, a uniqueness constraint on "relative_path" would >>>>>>>>>>>> allow >>>>>>>>>>>> content units "a" and "a/b" to be in a repo version. >>>>>>>>>>>> >>>>>>>>>>>> However, we may want file repos to be representable on an >>>>>>>>>>>> actual file >>>>>>>>>>>> system (e.g. when exporting them as tar files). For the repo >>>>>>>>>>>> above, >>>>>>>>>>>> this does not work, as "a" can't be a file and a directory at >>>>>>>>>>>> the >>>>>>>>>>>> same time on a standard Unix file system. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> Pulp-dev mailing list >>>>>>>>>>>> [email protected] >>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev >>>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Pulp-dev mailing list >>>>>>>>>>> [email protected] >>>>>>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev >>>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> Pulp-dev mailing list >>>>>>>>>> [email protected] >>>>>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev >>>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Pulp-dev mailing list >>>>>>>>> [email protected] >>>>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev >>>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Pulp-dev mailing list >>>>>>>> [email protected] >>>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev >>>>>>>> >>>>>>> _______________________________________________ >>>>>> Pulp-dev mailing list >>>>>> [email protected] >>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev >>>>>> >>>>> _______________________________________________ >>>> Pulp-dev mailing list >>>> [email protected] >>>> https://www.redhat.com/mailman/listinfo/pulp-dev >>>> >>>
_______________________________________________ Pulp-dev mailing list [email protected] https://www.redhat.com/mailman/listinfo/pulp-dev
