On 04/12/2018 04:00 PM, Brian Bouterse wrote:
On Thu, Apr 12, 2018 at 11:53 AM, Jeff Ortel <[email protected]> wrote:
On 04/12/2018 10:01 AM, Brian Bouterse wrote:
On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel <[email protected]> wrote:
On 04/11/2018 03:29 PM, Brian Bouterse wrote:
I think we should look into this in the near term. Changing
an interface on an object used by all plugins will be
significantly easier the earlier we do it.
On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel <[email protected]> wrote:
On 04/11/2018 10:59 AM, Brian Bouterse wrote:
On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel <[email protected]> wrote:
On 04/06/2018 09:15 AM, Brian Bouterse wrote:
Several plugins have started using the Changesets,
including pulp_ansible, pulp_python, pulp_file,
and perhaps others. The Changesets provide several
distinct points of value, which are great, but
there are two challenges I want to bring up. I
want to focus only on the problem statements first.
1. There is redundant "differencing" code in all
plugins. The Changeset interface requires the
plugin writer to determine what units need to be
added and those to be removed. This requires all
plugin writers to write the same non-trivial
differencing code over and over. For example, you
can see the same non-trivial differencing code
present in pulp_ansible
<https://github.com/pulp/pulp_ansible/blob/d0eb9d125f9a6cdc82e2807bcad38749967a1245/pulp_ansible/app/tasks/synchronizing.py#L217-L306>,
pulp_file
<https://github.com/pulp/pulp_file/blob/30afa7cce667b57d8fe66d5fc1fe87fd77029210/pulp_file/app/tasks/synchronizing.py#L114-L193>,
and pulp_python
<https://github.com/pulp/pulp_python/blob/066d33990e64b5781c8419b96acaf2acf1982324/pulp_python/app/tasks/sync.py#L172-L223>.
Line-wise, this "differencing" code makes up a
large portion (maybe 50%) of the sync code itself
in each plugin.
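The duplicated logic is roughly this shape in each
plugin (a minimal sketch; the names and the mirror
flag are illustrative, not the actual plugin code):

    def find_delta(remote_keys, local_keys, mirror=True):
        # Keys present on the remote but missing locally: to be added.
        additions = remote_keys - local_keys
        # In mirror mode, local keys absent from the remote are removed.
        removals = (local_keys - remote_keys) if mirror else set()
        return additions, removals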
Ten lines of trivial set logic hardly seems like a
big deal, but any duplication is worth exploring.
It's more than ten lines. Take pulp_ansible for
example. By my count (the linked-to section), it's 89
lines, which out of 306 lines of plugin sync code is
29% redundant code. The other plugins have
similar numbers. So with those numbers in mind, what do
you think?
I was counting the lines (w/o comments) in find_delta()
based on the linked code. Which functions are you counting?
I was counting the find_delta, build_additions, and
build_removals methods. Regardless of how the lines are
counted, that differencing code is the duplication I'm
talking about. There isn't a way to use the changesets
without duplicating that differencing code in a plugin.
The differencing code is limited to find_delta() and perhaps
build_removals(). Agreed, the line count is less useful than
specifically identifying duplicate code. Outside of
find_delta(), I see similar code (in part because it got
copied from the file plugin) but I'm not seeing actual
duplication. Can you be more specific?
Very similar code or identical code, I think it begs the question:
why are we having plugin writers do this at all? What value are
they creating with it? I don't have a reasonable answer to that
question, so the requirement for plugin writers to write that
code brings me back to the problem statement: "plugin writers
have redundant differencing code when using Changesets". More
info on why it is valuable for the plugin writer to do the
differencing versus having the Changesets do it would be helpful.
The ChangeSet abstraction (and API) is based on the following
division of responsibility:
The plugin (with an understanding of the remote and its content):
- Download metadata.
- Parse metadata.
- Based on the metadata:
  - determine content to be added to the repository.
  - define how artifacts are downloaded.
  - construct content.
  - determine content to be removed from the repository.
Core (without understanding of the specific remote or its content):
- Provide a low-level API for the plugin to effect the changes it
  has determined need to be made to the repository. This is
  downloaders, models, etc.
- Provide a high(er)-level API for the plugin to effect the changes
  it has determined need to be made to the repository. This is the
  ChangeSet.
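In code, that division looks roughly like this (a minimal sketch;
ChangeSet is the pulpcore.plugin.changeset class, but the constructor
signature and the helper functions here are assumptions):

    from pulpcore.plugin.changeset import ChangeSet

    def synchronize(remote, repository):
        metadata = fetch_metadata(remote)               # plugin: download metadata
        entries = parse(metadata)                       # plugin: parse metadata
        additions = build_additions(entries, remote)    # plugin: decide what to add
        removals = build_removals(entries, repository)  # plugin: decide what to remove
        # Core: the ChangeSet downloads artifacts, creates content, and
        # applies the changes to the repository.
        changeset = ChangeSet(remote, additions=additions, removals=removals)
        for report in changeset.apply():
            pass  # inspect reports / handle failures as needed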
Are you proposing that this is not the correct division?
Yes, I believe these problem statements suggest we should adjust the
plugin writer's responsibilities when interacting with the Changesets
in two specific ways. It's not exactly the language you used, but I
believe the following two responsibilities could be moved into the
Changesets entirely:
- determining if any given Artifact or Content unit is already present
in Pulp (aka computing what needs to be added)
Did you mean /added/ to the repository or /created/ in pulp? Currently,
the plugin determines the content that needs to be added to the
repository. This is modeled using a PendingContent, which fully defines
the Content (unit) and its PendingArtifact(s), which are included in the
/additions/. The ChangeSet does determine whether or not any artifacts
need to be downloaded (and downloads them based on policy) and
determines which Content needs to be /created/ vs simply added to the
repository. The plugin blindly assumes that none of the /pending/
content has yet been created in pulp. This accomplishes two things:
1) it reduces complexity and decision making by the plugin, and 2) it
provides the ChangeSet with all the information needed to /create/ and
/download/ as needed. The /additions/ represents what the plugin wants
added to the repository to synchronize it with the remote repository.
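For illustration, a single addition is fully defined up front,
something like this (a sketch; FileContent and the entry fields are
hypothetical, and the PendingContent/PendingArtifact signatures are
assumptions):

    from pulpcore.plugin.changeset import PendingArtifact, PendingContent
    from pulpcore.plugin.models import Artifact

    def pending_content(entry, remote):
        # The plugin fully defines the unit and its artifact(s); the
        # ChangeSet decides whether to download and/or create them.
        content = FileContent(relative_path=entry.path, digest=entry.digest)
        artifact = Artifact(size=entry.size, sha256=entry.digest)
        return PendingContent(
            content,
            artifacts={PendingArtifact(artifact, entry.url, entry.path)})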
- determining which content units need to be removed (aka computing
the removals)
I don't see how the ChangeSet has enough information to do this. The
plugin can (and most likely will) make the decision about what to
remove based on remote metadata, policy, and configuration.
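For example, a mirror-mode plugin might compute removals along these
lines (a sketch; the helpers are hypothetical):

    def build_removals(repository, remote_keys):
        # Mirror policy: content whose natural key no longer appears in
        # the remote metadata is removed. repository_content() and
        # natural_key() are hypothetical helpers.
        for content in repository_content(repository):
            if content.natural_key() not in remote_keys:
                yield content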
These goals restate the problem statement that plugin writers
are asked to do differencing calculations when the Changesets could
provide that to the plugin writer instead.
So a shorter, simpler problem statement is: "to use the
changesets plugin writers have to do extra work to compute
additions and removals parameters".
This statement ^ is better but still too vague to actually
solve. Can we elaborate on specifically what "to do extra
work" means?
Sure. Removing that vague language is one way to resolve it.
Here's a revised problem statement: "to use the changesets,
plugin writers have to compute the additions and removals
parameters". This problem statement would be resolved by a
solution in which the plugin writer never has to produce these
parameters, replaced instead by an interface that requires less
effort from the plugin writer.
I think it's the plugin's responsibility to determine the
difference. Aside from that: without an understanding of the
metadata and content type, how could the ChangeSet do this? What
might that look like?
If I'm understanding this correctly, the Changesets already do this
for additions, right? Help check my understanding. If a plugin writer
delivers PendingContent and PendingArtifacts to the Changesets as
'additions', the Changesets will recognize them as already downloaded
and not download them, right? If this is the case, what is the benefit
of having plugin writers also try to figure out if things should be
downloaded?
As you pointed out, the plugin writer does not need to figure out what
needs to be downloaded.
2. Plugins can't do end-to-end stream processing.
The Changesets themselves do stream processing,
but when you call into changeset.apply_and_drain()
you have to have fully parsed the metadata
already. Currently when fetching all metadata from
Galaxy, pulp_ansible takes about 380 seconds (6+
min). This means that the actual Changeset content
downloading starts 380 seconds later than it
could. At the heart of the problem, the
fetching+parsing of the metadata is not part of
the stream processing.
The additions/removals can be any iterable (like a
generator) and, by using ChangeSet.apply() and
iterating the returned object, the plugin can "turn
the crank" while downloading and processing the
metadata. The ChangeSet.apply_and_drain() is just a
convenience method. I don't see how this is a
limitation of the ChangeSet.
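For example (a sketch; the helpers are hypothetical, and
apply() behaves as described above):

    from pulpcore.plugin.changeset import ChangeSet

    def additions(remote):
        # parse_metadata_incrementally() and pending_content() are
        # hypothetical helpers; the point is that this is lazy.
        for entry in parse_metadata_incrementally(remote):
            yield pending_content(entry, remote)

    changeset = ChangeSet(remote, additions=additions(remote), removals=[])
    for report in changeset.apply():
        # Each iteration "turns the crank": metadata parsing, artifact
        # downloading, and content creation all advance together.
        pass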
That is new info for me (and maybe everyone). OK, so
Changesets have two interfaces: apply() and
apply_and_drain(). Why do we have two interfaces when
apply() can support all existing use cases (that I know
of) and do end-to-end stream processing, but
apply_and_drain() cannot? I see all of our examples
(and all of our new plugins) using apply_and_drain().
ChangeSet.apply() was how I designed (and
documented) it. I'm not sure when or by whom
apply_and_drain() was added. +1 for removing it.
I read through the changeset docs. I think this stream
processing thing is still a problem, but perhaps in how we're
presenting the Changeset with its arguments. I don't think
apply() versus apply_and_drain() is related at all.
Regardless of whether you are using apply() or apply_and_drain(),
the Changeset requires 'additions' and 'removals'
arguments. This sends a clear message to the plugin writer
that they need to compute additions and removals. They will
fetch the metadata to compute these, which is mostly how the
changeset documentation reads. That they could instead pass a
generator that feeds the metadata lazily from inside the
Changeset is, I feel, non-obvious. I want
the high-performing implementation to be the obvious one.
So what about a problem statement like this: "Changesets are
presented such that when you call into them you should
already have fetched the metadata"?
I'm not sure what is meant by "presented". If this means
that we should provide an example of how the ChangeSet can be
used by plugins (with large metadata) in such a way that does
not require downloading all the metadata first, that sounds
like a good idea.
Cool, so this is transitioning to ideas for resolution. Adding
documentation on how to do this with the existing
interface is one option. My concern with adding additional docs
on how to use the current interface better is that if users
choose to follow the existing docs, they will have the stream
processing problem once again. To me, this suggests that this new
example should actually replace the existing documentation.
Seems like both examples would be useful. I'm not convinced that
all plugins would benefit from this. For example: the File plugin
manifest is small and would likely not benefit from the extra
complexity. For complicated plugins (like RPM), can differencing
decisions be made before analyzing the entire metadata (eg:
primary.xml)? Also, it's not clear to me how this would work
using the Downloader. Are you suggesting that the plugin would
parse/process metadata files while they're being downloaded?
Perhaps a better understanding of the flow to be supported would
help me understand this.
Yes, I am suggesting just that: the Changesets could facilitate
parsing/processing metadata files while the actual content named in
those files is also being downloaded. I have a straightforward idea on
how to achieve this. It's short and easy enough to write up (no code),
but I want to make sure I'm not moving beyond the problem statement
without others. Is there more we want to do on these problem
statements, or would answering a bit about one way it could work be
helpful?
The /additions/ can be (and usually is) a generator. The generator can
yield based on metadata as it is downloaded and digested. In this way,
the ChangeSet already facilitates this.
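Concretely, something like this (a sketch; the incremental parser and
the downloader usage are assumptions):

    def additions(remote):
        # get_downloader() usage is an assumption about the remote API;
        # parse_entries() is a hypothetical incremental parser that yields
        # entries as bytes arrive, before the full file has downloaded.
        downloader = remote.get_downloader(remote.url)
        for entry in parse_entries(downloader):
            yield pending_content(entry, remote)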
Just to state my expectations: moving beyond the problem statement
isn't something I consider a commitment to solve it, just an agreement
on what we're solving as we discuss various resolutions. Problem
statements can also always be revisited. Either way forward is fine
with me; just let me know how we should continue.
So far, I'm not convinced that any specific problems/deficiencies have
been identified. That said, you seem to have a different abstraction in
mind. I would be interested in reviewing it and how it would be used by
plugin writers. It may help illustrate the gains that you are envisioning.
Do you see the same challenges I do? Are these the
right problem statements? I think with clear
problem statements a solution will be easy to see
and agree on.
I'm not convinced that these are actual
problems/challenges that need to be addressed in
the near term.
Thanks!
Brian
_______________________________________________
Pulp-dev mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/pulp-dev