I agree with that problem statement. pulp_file may want to have the same Content at two different paths in different RepositoryVersions (or even the same RepositoryVersion). Without this capability, a user could never "move" where content lives in a RepositoryVersion if it's already been placed in any other RepositoryVersion.
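To make the pulp_file case concrete, here is a rough sketch in plain Python. It is not pulpcore code: ContentSketch, RepoVersionSketch and their methods are names I made up for illustration. It assumes a repository version tracks paths separately from content, so the same content unit can sit at two paths and be "moved" without creating a second content unit.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ContentSketch:
    """Hypothetical stand-in for a content unit, identified only by its digest."""

    sha256: str


@dataclass
class RepoVersionSketch:
    """Hypothetical repository version: maps relative_path -> content unit."""

    paths: dict = field(default_factory=dict)

    def add(self, relative_path, content):
        # A path can hold only one content unit within a repository version.
        if relative_path in self.paths:
            raise ValueError(f"{relative_path} is already taken")
        self.paths[relative_path] = content

    def move(self, old_path, new_path):
        # Moving only touches the mapping; the content unit itself is unchanged.
        if new_path in self.paths:
            raise ValueError(f"{new_path} is already taken")
        self.paths[new_path] = self.paths.pop(old_path)


unit = ContentSketch(sha256="abc123")  # made-up digest

version_1 = RepoVersionSketch()
version_1.add("a/foo.iso", unit)

version_2 = RepoVersionSketch()
version_2.add("b/foo.iso", unit)  # same unit, different path
version_2.add("c/foo.iso", unit)  # same unit twice in one version
version_2.move("b/foo.iso", "d/foo.iso")
```

With relative_path stored on the content side instead, each of those three locations would need its own content unit.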
Additionally, pulp_maven may need to sync two repositories in the wild that already contain the same content in two locations. I offer this as an example not to pile on, but because it's a multi-content artifact, which I believe we will also need to consider as we work towards a solution.

I've been spending time on developing a solution, but it needs more work, so it's not ready yet. Also, other katello and galaxy_ng work continues to pre-empt this, so it could take a while.

On Thu, May 7, 2020 at 3:39 AM Matthias Dellweg <mdell...@redhat.com> wrote:

> Users need to be able to store the same content unit at different
> relative paths in different repository versions. This problem is not unique
> to the RPM plugin. Do we agree about that?
>
> Yes, we agree. In pulp_deb relative_path is part of the content's
> natural_key to circumvent this problem. So this creates two content units
> that only differ in relative_path. At least they share the artifact.
>
> On Thu, May 7, 2020 at 2:06 AM Dennis Kliban <dkli...@redhat.com> wrote:
>
>> I'd like to provide a little bit more context for my previous email by
>> going back to the original problem statement:
>>
>> On Wed, Apr 1, 2020 at 9:23 AM Daniel Alley <dal...@redhat.com> wrote:
>>
>>> Problem:
>>>
>>> Currently, a relative_path is tied to content in Pulp. This means that
>>> if a content unit exists in two places within a repository or across
>>> repositories, it has to be stored as two separate content units. This
>>> creates redundant data and potential confusion for users.
>>>
>>> As a specific example, we need to support mirroring content in pulp_rpm
>>> <https://pulp.plan.io/issues/6353>. Currently, for each location at
>>> which a single package is stored, we’ll need to create a content unit. We
>>> could end up with several records representing a single package. Users may
>>> be confused about why they see multiple records for a package and they may
>>> have trouble, for example, deciding which content unit to copy.
>>>
>> Users need to be able to store the same content unit at different
>> relative paths in different repository versions. This problem is not unique
>> to the RPM plugin. Do we agree about that?
>>
>> I've been working on a potential solution that solves this problem in a
>> document[0]. It is a complicated change and the document does not fully
>> capture the plan yet. Feedback and help on the design is welcome.
>>
>> [0] https://hackmd.io/02KBjCD3Q0WP7p4ALwzhJw?edit
>>
>>
>> On Mon, May 4, 2020 at 4:11 PM Dennis Kliban <dkli...@redhat.com> wrote:
>>
>>> I've reached two conclusions while trying to formulate a solution:
>>>
>>> This problem needs to be solved at the repository version level.
>>> Repository membership needs to be tracked at the artifact level, and not
>>> content level as it is now.
>>>
>>> On Thu, Apr 30, 2020 at 1:11 PM Daniel Alley <dal...@redhat.com> wrote:
>>>
>>>> Cool, so the only difference is whether to try to store the
>>>> relationship in the DB, or leverage the fact that we already have the
>>>> metadata and just re-parse it.
>>>>
>>>> I know the latter approach has yet to be written up, but my concern
>>>> there is that adding another layer of indirection between "repository
>>>> version" and "content" is going to have an adverse impact on performance,
>>>> since it is already the most complex and demanding query we issue to the DB
>>>> and one of the most common and important.
>>>>
>>>> On Thu, Apr 30, 2020 at 12:50 PM David Davis <davidda...@redhat.com>
>>>> wrote:
>>>>
>>>>> Yes but I was imagining the mapping would be stored not as Content but
>>>>> as a separate object. So we wouldn't use filename for the mapping (rather
>>>>> we'd use ContentArtifact pk) and we wouldn't need to change
>>>>> ContentArtifact's relative_path at all. That said, I think your solution
>>>>> captures the idea though and is better in some ways.
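A small sketch of how I read David's variant, in case it helps the comparison: the remembered paths are keyed by ContentArtifact pk, and publishing falls back to a default layout when no path was recorded. The function name, the (pk, filename) input shape and the "Packages" default directory are all assumptions of mine, not existing pulpcore or pulp_rpm API.

```python
def published_paths(content_artifacts, path_by_ca_pk, default_dir="Packages"):
    """Return {ContentArtifact pk: relative path to publish at}.

    content_artifacts: iterable of (pk, filename) pairs (hypothetical shape).
    path_by_ca_pk: paths remembered from upstream metadata, keyed by pk.
    default_dir: assumed default layout, used when no path was remembered.
    """
    result = {}
    for pk, filename in content_artifacts:
        # Prefer the remembered upstream path; otherwise use the default layout.
        result[pk] = path_by_ca_pk.get(pk, f"{default_dir}/{filename}")
    return result


# Made-up example data: only the first artifact has a remembered path.
example = published_paths(
    [(1, "foo-1.0.rpm"), (2, "bar-2.0.rpm")],
    {1: "os/x86_64/f/foo-1.0.rpm"},
)
```

ContentArtifact.relative_path is never touched here, which is the point of keeping the mapping in a separate object.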
>>>>>
>>>>> Changing the RepositoryContent model to point to ContentArtifacts and
>>>>> store relative_paths is probably the best and most correct solution in
>>>>> theory. However, it's going to be painful to implement for both core and
>>>>> plugins.
>>>>>
>>>>> David
>>>>>
>>>>>
>>>>> On Thu, Apr 30, 2020 at 12:33 PM Daniel Alley <dal...@redhat.com>
>>>>> wrote:
>>>>>
>>>>>> @David Davis <davidda...@redhat.com> so this proposal would go
>>>>>> something like this, correct?:
>>>>>>
>>>>>> * For the signed metadata / exact mirror use-case we need to store
>>>>>>   the repository metadata itself as a content unit inside the
>>>>>>   RepositoryVersion anyway (because the hash must be equal)
>>>>>> * Because we have this metadata lying around, we can reference it at
>>>>>>   publish time to discover the appropriate PublishedArtifact.relative_path
>>>>>> * Create a map of "filename" -> "location_href" and look up the
>>>>>>   filename of each RPM package to find the appropriate path
>>>>>> * This should be pretty fast for the RPM plugin since createrepo_c
>>>>>>   is doing all the hard work
>>>>>> * Data migration to ensure ContentArtifact.relative_path is only
>>>>>>   storing the filename (and I would suggest we also change the name to
>>>>>>   "filename")
>>>>>> * If metadata isn't present in the RepositoryVersion, then just tweak
>>>>>>   the PublishedArtifact.relative_path so that it uses whichever our default
>>>>>>   repo layout is
>>>>>>
>>>>>> On Tue, Apr 28, 2020 at 11:41 AM David Davis <davidda...@redhat.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, that's correct. During our meeting we discussed two options:
>>>>>>> the first was to extend RepositoryContent to store relative path per
>>>>>>> ContentArtifact as storing a relative_path per Content won't work for
>>>>>>> multi-Artifact Content units.
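To spell out the multi-Artifact point with a toy example: a single path per content unit has room for one location, while a path per (content, artifact) pair can place every file. The unit name, file names and paths below are made up, and published_files is a hypothetical helper, not an existing Pulp function.

```python
# Hypothetical unit "unit-1" made of two artifacts; all names are illustrative.
# A single relative_path per content unit has room for only one location:
path_per_content = {"unit-1": "dists/demo/unit-1.bin"}

# One relative_path per (content, artifact) pair can place both files:
path_per_content_artifact = {
    ("unit-1", "unit-1.bin"): "dists/demo/unit-1.bin",
    ("unit-1", "unit-1.sig"): "dists/demo/unit-1.sig",
}


def published_files(content_id, mapping):
    """List every path at which a content unit's artifacts are published."""
    return sorted(
        path for (cid, _artifact), path in mapping.items() if cid == content_id
    )
```

Here published_files("unit-1", path_per_content_artifact) yields both paths, which the per-content dictionary has no way to express.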
>>>>>>>
>>>>>>> An alternative that I pitched was to have plugins (or maybe even
>>>>>>> core someday) store this information outside RepositoryContent and then
>>>>>>> use this information during publishing to set relative_path on
>>>>>>> PublishedArtifacts. We'd have to modify the content app if we wanted to
>>>>>>> support pass through publications but I think asking plugins to use
>>>>>>> published artifacts in this case is warranted. That said, I don't think
>>>>>>> anyone else was keen on this idea though.
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Apr 28, 2020 at 10:30 AM Matthias Dellweg <mdell...@redhat.com> wrote:
>>>>>>>
>>>>>>>> That is only used for passthrough publication afaik. If you publish
>>>>>>>> each content unit "by hand", you create a new relative path for each
>>>>>>>> published artifact. That is why it can be empty and the content can
>>>>>>>> still be published.
>>>>>>>>
>>>>>>>> On Tue, Apr 28, 2020 at 4:09 PM Daniel Alley <dal...@redhat.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> We realized in our discussion that the original proposal described
>>>>>>>>> in my email will not work, because "relative_path" ultimately
>>>>>>>>> describes the path of the published *artifacts* (not content), and
>>>>>>>>> for content types with multiple artifacts, storing this information
>>>>>>>>> in a field on RepositoryContent would not be possible.
>>>>>>>>>
>>>>>>>>> On Mon, Apr 27, 2020 at 6:08 PM Daniel Alley <dal...@redhat.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> There is a video call scheduled to discuss this issue tomorrow
>>>>>>>>>> (Tuesday April 28th) at 13:30 UTC (please convert to your local
>>>>>>>>>> time).
>>>>>>>>>> https://meet.google.com/scy-csbx-qiu
>>>>>>>>>>
>>>>>>>>>> On Sat, Apr 25, 2020 at 7:02 AM David Davis <davidda...@redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I had a chance to think about this some more yesterday and
>>>>>>>>>>> wanted to email out my thoughts.
>>>>>>>>>>> I also think that this change sounds scary and will have a
>>>>>>>>>>> big impact on plugin writers so I thought of a couple
>>>>>>>>>>> alternatives:
>>>>>>>>>>>
>>>>>>>>>>> First, we could add a relative_path field to RepositoryContent
>>>>>>>>>>> instead of moving it there. This would be an optional field. It
>>>>>>>>>>> would be up to plugins to manage this field and they would still
>>>>>>>>>>> need to populate the relative_path field on ContentArtifact. But
>>>>>>>>>>> plugins could use this optional field to store relative paths per
>>>>>>>>>>> repository and then use this field when generating publications.
>>>>>>>>>>>
>>>>>>>>>>> The second alternative is one that is already laid out in the
>>>>>>>>>>> original email but to call it out again: it would be to not solve
>>>>>>>>>>> this in pulpcore. RPM would create its own object that would map
>>>>>>>>>>> content in a repository to relative_paths.
>>>>>>>>>>>
>>>>>>>>>>> David
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 21, 2020 at 9:22 AM Quirin Pamp <p...@atix.de>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I am not currently very well versed in the classes involved,
>>>>>>>>>>>> but moving relative_path around sounds slightly scary with the
>>>>>>>>>>>> potential to break things.
>>>>>>>>>>>>
>>>>>>>>>>>> As such, I would be interested to be kept in the loop as this
>>>>>>>>>>>> moves forward.
>>>>>>>>>>>> (Mailing list once there is some movement is entirely
>>>>>>>>>>>> sufficient 😉)
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Quirin Pamp
>>>>>>>>>>>> ------------------------------
>>>>>>>>>>>> *From:* pulp-dev-boun...@redhat.com <pulp-dev-boun...@redhat.com>
>>>>>>>>>>>> on behalf of Ina Panova <ipan...@redhat.com>
>>>>>>>>>>>> *Sent:* 21 April 2020 14:07:13
>>>>>>>>>>>> *To:* Daniel Alley <dal...@redhat.com>
>>>>>>>>>>>> *Cc:* Pulp-dev <pulp-dev@redhat.com>
>>>>>>>>>>>> *Subject:* Re: [Pulp-dev] the "relative path" problem
>>>>>>>>>>>>
>>>>>>>>>>>> Daniel,
>>>>>>>>>>>>
>>>>>>>>>>>> how about setting up a meeting and brainstorming the alternatives,
>>>>>>>>>>>> pros/cons there?
>>>>>>>>>>>>
>>>>>>>>>>>> --------
>>>>>>>>>>>> Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Ina Panova
>>>>>>>>>>>> Senior Software Engineer| Pulp| Red Hat Inc.
>>>>>>>>>>>>
>>>>>>>>>>>> "Do not go where the path may lead,
>>>>>>>>>>>> go instead where there is no path and leave a trail."
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Apr 17, 2020 at 5:57 PM Daniel Alley <dal...@redhat.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Bump, this item needs to move forwards soon. Does anyone have
>>>>>>>>>>>> any thoughts?
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Apr 1, 2020 at 9:40 AM Pavel Picka <ppi...@redhat.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> I'd like to add one more question to this topic. Do you think
>>>>>>>>>>>> it is a blocker for PRs [0] & [1]? While testing [2] these
>>>>>>>>>>>> features I haven't run into a real-world example where two
>>>>>>>>>>>> packages with exactly the same name appear.
>>>>>>>>>>>> I think this is a 'must have' feature, but until we solve/decide
>>>>>>>>>>>> it we can have the two features working, maybe with a warning in
>>>>>>>>>>>> the docs for users that this can happen in some 'special'
>>>>>>>>>>>> repositories.
>>>>>>>>>>>>
>>>>>>>>>>>> To follow the topic directly, I like the proposed move to
>>>>>>>>>>>> 'RepositoryContent' and adding it to its uniqueness constraint
>>>>>>>>>>>> (if I understand well).
>>>>>>>>>>>>
>>>>>>>>>>>> [0] https://github.com/pulp/pulp_rpm/pull/1657
>>>>>>>>>>>> [1] https://github.com/pulp/pulp_rpm/pull/1642
>>>>>>>>>>>> [2] tested with centos 7, 8, opensuse and SLE repositories
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Apr 1, 2020 at 3:22 PM Daniel Alley <dal...@redhat.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> We'd like to start a discussion on the "relative path problem"
>>>>>>>>>>>> identified recently.
>>>>>>>>>>>>
>>>>>>>>>>>> Problem:
>>>>>>>>>>>>
>>>>>>>>>>>> Currently, a relative_path is tied to content in Pulp. This
>>>>>>>>>>>> means that if a content unit exists in two places within a
>>>>>>>>>>>> repository or across repositories, it has to be stored as two
>>>>>>>>>>>> separate content units. This creates redundant data and potential
>>>>>>>>>>>> confusion for users.
>>>>>>>>>>>>
>>>>>>>>>>>> As a specific example, we need to support mirroring content in
>>>>>>>>>>>> pulp_rpm <https://pulp.plan.io/issues/6353>. Currently, for
>>>>>>>>>>>> each location at which a single package is stored, we’ll need to
>>>>>>>>>>>> create a content unit. We could end up with several records
>>>>>>>>>>>> representing a single package. Users may be confused about why
>>>>>>>>>>>> they see multiple records for a package and they may have
>>>>>>>>>>>> trouble, for example, deciding which content unit to copy.
>>>>>>>>>>>>
>>>>>>>>>>>> Proposed Solution:
>>>>>>>>>>>>
>>>>>>>>>>>> Move “relative_path” from its current location on
>>>>>>>>>>>> ContentArtifact, to RepositoryContent. This will require a sizable
>>>>>>>>>>>> data migration. It is possible that, in rare cases, repository
>>>>>>>>>>>> versions may change slightly due to deduplication.
>>>>>>>>>>>>
>>>>>>>>>>>> A repository-version-wide uniqueness constraint will be present
>>>>>>>>>>>> on “relative_path”, independently of any other repository
>>>>>>>>>>>> uniqueness constraints (repo_key_fields) defined by the plugin
>>>>>>>>>>>> writer.
>>>>>>>>>>>>
>>>>>>>>>>>> Modify the Stages API so that the relative_path can be
>>>>>>>>>>>> processed in the correct location; instead of
>>>>>>>>>>>> “DeclarativeArtifact” it will likely need to go on
>>>>>>>>>>>> “DeclarativeContent”.
>>>>>>>>>>>>
>>>>>>>>>>>> Remove “location_href” from the RPM Package content model. It
>>>>>>>>>>>> was never a true part of the RPM (file) metadata; it is derived
>>>>>>>>>>>> from the repository metadata. So storing it as a part of the
>>>>>>>>>>>> Content unit doesn’t entirely make sense.
>>>>>>>>>>>>
>>>>>>>>>>>> Alternatives
>>>>>>>>>>>>
>>>>>>>>>>>> In most cases, a content unit will have a single relative path.
>>>>>>>>>>>> Creating a general solution to solve a one-off problem is usually
>>>>>>>>>>>> not a good idea. As an alternative, we could look at another
>>>>>>>>>>>> solution for mirroring content. One example might be to create a
>>>>>>>>>>>> new object (e.g. RpmRepoMirrorContentMapping) that maps content
>>>>>>>>>>>> to specific paths within a repo or repo version.
>>>>>>>>>>>>
>>>>>>>>>>>> Questions
>>>>>>>>>>>>
>>>>>>>>>>>> - How do we handle this in pulp_file? How are content units
>>>>>>>>>>>>   identified in pulp_file without relative_path?
>>>>>>>>>>>>    - Checksum?
>>>>>>>>>>>> - How was this problem handled in Pulp 2?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Please weigh in if you have any input on potential problems
>>>>>>>>>>>> with the proposal, potential alternate solutions, or other
>>>>>>>>>>>> insights or questions!
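On the proposed repository-version-wide uniqueness constraint: as I understand it, the same content may appear at several paths, but no path may appear twice within one repository version. A minimal sketch of that check in plain Python follows; duplicate_paths and the (relative_path, content_id) tuple shape are my own assumptions, and a real implementation would presumably enforce this in the database rather than in a loop.

```python
from collections import Counter


def duplicate_paths(entries):
    """Return the relative paths used more than once in one repository version.

    entries: iterable of (relative_path, content_id) pairs (hypothetical shape).
    """
    counts = Counter(path for path, _content_id in entries)
    return sorted(path for path, seen in counts.items() if seen > 1)


# Same content at two different paths: allowed, no duplicates reported.
mirrored = [("a/foo.rpm", 1), ("b/foo.rpm", 1)]

# Two different content units competing for one path: reported as a conflict.
conflicting = [("a/foo.rpm", 1), ("a/foo.rpm", 2)]
```

This check is independent of repo_key_fields, which compare content fields rather than paths.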
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Pulp-dev mailing list
>>>>>>>>>>>> Pulp-dev@redhat.com
>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Pavel Picka
>>>>>>>>>>>> Red Hat