[0]: https://pulp.plan.io/issues/3570
Thanks!
Brian
On Thu, Apr 12, 2018 at 7:12 PM, Jeff Ortel <[email protected]> wrote:
On 04/12/2018 04:00 PM, Brian Bouterse wrote:
On Thu, Apr 12, 2018 at 11:53 AM, Jeff Ortel <[email protected]> wrote:
On 04/12/2018 10:01 AM, Brian Bouterse wrote:
On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel <[email protected]> wrote:
On 04/11/2018 03:29 PM, Brian Bouterse wrote:
I think we should look into this in the near term.
Changing an interface on an object used by all plugins
will be significantly easier the earlier we do it.
On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel <[email protected]> wrote:
On 04/11/2018 10:59 AM, Brian Bouterse wrote:
On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel <[email protected]> wrote:
On 04/06/2018 09:15 AM, Brian Bouterse wrote:
Several plugins have started using the
Changesets including pulp_ansible,
pulp_python, pulp_file, and perhaps others.
The Changesets provide several distinct
points of value which are great, but there
are two challenges I want to bring up. I want
to focus only on the problem statements first.
1. There is redundant "differencing" code in
all plugins. The Changeset interface requires
the plugin writer to determine what units
need to be added and those to be removed.
This requires all plugin writers to write the
same non-trivial differencing code over and
over. For example, you can see the same
non-trivial differencing code present in
pulp_ansible
<https://github.com/pulp/pulp_ansible/blob/d0eb9d125f9a6cdc82e2807bcad38749967a1245/pulp_ansible/app/tasks/synchronizing.py#L217-L306>,
pulp_file
<https://github.com/pulp/pulp_file/blob/30afa7cce667b57d8fe66d5fc1fe87fd77029210/pulp_file/app/tasks/synchronizing.py#L114-L193>,
and pulp_python
<https://github.com/pulp/pulp_python/blob/066d33990e64b5781c8419b96acaf2acf1982324/pulp_python/app/tasks/sync.py#L172-L223>.
Line-wise, this "differencing" code makes up
a large portion (maybe 50%) of the sync code
itself in each plugin.
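For concreteness, here is roughly the shape of the
differencing code being repeated (a sketch modeled on
pulp_file's find_delta(); the Key fields and the two
inputs are illustrative, not the exact plugin code):

    from collections import namedtuple

    Key = namedtuple('Key', ('relative_path', 'digest'))
    Delta = namedtuple('Delta', ('additions', 'removals'))

    def find_delta(remote_metadata, repository_content, mirror=True):
        """Compute the content keys to add and (optionally) remove."""
        remote = {Key(e.relative_path, e.digest) for e in remote_metadata}
        local = {Key(c.relative_path, c.digest) for c in repository_content}
        additions = remote - local
        removals = (local - remote) if mirror else set()
        return Delta(additions=additions, removals=removals)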
Ten lines of trivial set logic hardly seem
like a big deal, but any duplication is worth
exploring.
It's more than ten lines. Take pulp_ansible for
example. By my count (the linked-to section) it's
89 lines, which, out of 306 lines of plugin sync
code, is 29% redundant code. The other plugins
have similar numbers. So with those numbers in
mind, what do you think?
I was counting the lines (w/o comments) in
find_delta() based on the linked code. Which
functions are you counting?
I was counting the find_delta, build_additions, and
build_removals methods. Regardless of how the lines are
counted, that differencing code is the duplication I'm
talking about. There isn't a way to use the changesets
without duplicating that differencing code in a plugin.
The differencing code is limited to find_delta() and
perhaps build_removals(). Agreed, the line count is
less useful than specifically identifying duplicate
code. Outside of find_delta(), I see similar code (in
part because it got copied from the file plugin) but
I'm not seeing actual duplication. Can you be more specific?
Whether the code is very similar or identical, it raises the
question: why are we having plugin writers do this at all?
What value are they creating with it? I don't have a
reasonable answer to that question, so the requirement for
plugin writers to write that code brings me back to the
problem statement: "plugin writers have redundant
differencing code when using Changesets". More info on why
it is valuable for the plugin writer to do the differencing
rather than the Changesets would be helpful.
The ChangeSet abstraction (and API) is based on the following
division of responsibility:
The plugin (with an understanding of the remote and its
content):
- Download metadata.
- Parse metadata.
- Based on the metadata:
  - determine content to be added to the repository.
  - define how artifacts are downloaded.
  - construct content.
  - determine content to be removed from the repository.
Core (without understanding of the specific remote or its content):
- Provide a low-level API for the plugin to effect the changes it
has determined need to be made to the repository. This is the
downloaders, models, etc.
- Provide a high(er)-level API for the plugin to effect the
changes it has determined need to be made to the repository.
This is the ChangeSet.
Are you proposing that this is not the correct division?
Yes I believe these problem statements suggest we should adjust
the plugin writer's responsibilities when interacting with the
Changesets in two specific ways. It's not exactly the language
you used, but I believe the following two responsibilities could
be moved into the Changesets entirely:
- determining if any given Artifact or Content unit is already
present in Pulp (aka computing what needs to be added)
Did you mean /added/ to the repository or /created/ in pulp?
Currently, the plugin determines the content that needs to be
added to the repository. This is modeled using a PendingContent,
which fully defines the Content (unit) and its PendingArtifact(s),
which are included in the /additions/. The ChangeSet does
determine whether or not any artifacts need to be downloaded (and
downloads them based on policy) and determines which Content needs
to be /created/ vs simply added to the repository. The plugin
blindly assumes that none of the /pending/ content has yet been
created in pulp. This accomplishes two things: 1) it reduces
complexity and decision-making in the plugin, and 2) it provides
the ChangeSet with all the information needed to /create/ and
/download/ as needed. The /additions/ represents what the plugin
wants added to the repository to synchronize it with the remote
repository.
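For illustration, a sketch of the plugin side of that
(modeled on how pulp_file uses the changeset API; exact
constructor signatures may differ slightly, and the
FileContent model is just an example):

    from pulpcore.plugin.changeset import PendingArtifact, PendingContent
    from pulpcore.plugin.models import Artifact

    from pulp_file.app.models import FileContent  # illustrative model

    def build_additions(entries):
        """Yield a fully-defined PendingContent per metadata entry.

        The plugin assumes none of this content exists in pulp yet;
        the ChangeSet decides what actually needs to be created or
        downloaded.
        """
        for entry in entries:
            content = FileContent(relative_path=entry.relative_path,
                                  digest=entry.digest)
            artifact = Artifact(size=entry.size, sha256=entry.digest)
            yield PendingContent(
                content,
                artifacts={
                    PendingArtifact(artifact, entry.url,
                                    entry.relative_path),
                })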
- determining which content units need to be removed (aka
computing the removals)
I don't see how the ChangeSet has enough information to do this.
The plugin can (most likely will) make the decision about what to
remove based on remote metadata, policy and configuration.
^ These goals are a restatement of the problem statement: plugin
writers are asked to do differencing calculations that the
Changesets could do for them instead.
So a shorter, simpler problem statement is: "to use the
changesets plugin writers have to do extra work to
compute additions and removals parameters".
This statement ^ is better but still too vague to
actually solve. Can we elaborate on specifically what
"to do extra work" means?
Sure. Removing that vague language is one way to resolve it.
Here's a revised problem statement: "to use the
changesets plugin writers have to compute additions and
removals parameters". This problem statement would be
resolved by a solution where the plugin writer never has to
produce these parameters, replaced by an interface that
requires less effort from the plugin writer.
I think it's the plugin's responsibility to determine the
difference. Aside from that: without an understanding of the
metadata and content type, how could the ChangeSet do this?
What might that look like?
If I'm understanding this correctly, the Changesets already do
this for additions right? Help check my understanding. If a
plugin writer delivers PendingContent and PendingArtifacts to the
Changesets as 'additions', the Changesets will recognize them as
already downloaded and not download them, right? If this is the
case, what is the benefit of having plugin writers also try to
figure out if things should be downloaded?
As you pointed out, the plugin writer does not need to figure out
what needs to be downloaded.
2. Plugins can't do end-to-end stream
processing. The Changesets themselves do
stream processing, but when you call into
changeset.apply_and_drain() you have to have
fully parsed the metadata already. Currently
when fetching all metadata from Galaxy,
pulp_ansible takes about 380 seconds (6+
min). This means that the actual Changeset
content downloading starts 380 seconds later
than it could. At the heart of the problem,
the fetching+parsing of the metadata is not
part of the stream processing.
The additions/removals can be any iterable
(like a generator), and by using
ChangeSet.apply() and iterating the returned
object, the plugin can "turn the crank" while
downloading and processing the metadata. The
ChangeSet.apply_and_drain() is just a
convenience method. I don't see how this is a
limitation of the ChangeSet.
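For illustration, a sketch of the two entry points
(import path per the plugin API; the loop body is just a
placeholder):

    from pulpcore.plugin.changeset import ChangeSet

    def synchronize(importer, additions, removals):
        """apply() hands the crank to the plugin: each iteration
        advances the ChangeSet, and if additions is a generator,
        metadata is pulled and parsed on demand rather than up
        front."""
        changeset = ChangeSet(importer, additions=additions,
                              removals=removals)
        for report in changeset.apply():
            pass  # react to each ChangeReport, e.g. progress reporting
        # The convenience form drains the same iterator internally:
        # changeset.apply_and_drain()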
That is new info for me (and maybe everyone). OK,
so Changesets have two interfaces: apply() and
apply_and_drain(). Why do we have two interfaces
when apply() can support all existing use cases
(that I know of) and do end-to-end stream
processing but apply_and_drain() cannot? I see all
of our examples (and all of our new plugins) using
apply_and_drain().
The ChangeSet.apply() was how I designed (and
documented) it. Not sure when/who added the
apply_and_drain(). +1 for removing it.
I read through the changeset docs. I think this stream
processing thing is still a problem, but perhaps in how
we're presenting the Changeset with its arguments. I
don't think apply() versus apply_and_drain() is at all
related. Regardless of whether you are using apply() or
apply_and_drain(), the Changeset requires
'additions' and 'removals' arguments. This sends a
clear message to the plugin writer that they need to
compute additions and removals. They will fetch the
metadata to compute these, which is mostly how the
changeset documentation reads. That they could instead
present a generator that correctly allows the metadata
to be fetched and parsed from inside the Changeset run
is, I feel, non-obvious. I want the high-performing
implementation to be the obvious one.
So what about a problem statement like this:
"Changesets are presented such that when you call into
them you should already have fetched the metadata"?
I'm not sure what is meant by "presented". If this
means that we should provide an example of how the
ChangeSet can be used by plugins (with large metadata)
in such a way that does not require downloading all the
metadata first - that sounds like a good idea.
Cool so this is transitioning to ideas for resolution. The
solution to add documentation on how to do this with the
existing interface is one option. My concern with adding
additional docs on how to use the current interface better
is that if users choose to follow the existing docs then
they will have the stream processing problem once again. To
me, this suggests that this new example should actually
replace the existing documentation.
Seems like both examples would be useful. I'm not convinced
that all plugins would benefit from this. For example: the
File plugin manifest is small and would likely not benefit
from the extra complexity. For complicated plugins (like
RPM), can differencing decisions be made before analyzing the
entire metadata (eg: primary.xml)? Also, it's not clear to
me how this would work using the Downloader. Are you
suggesting that the plugin would parse/process metadata files
while they're being downloaded? Perhaps a better
understanding of the flow to be supported would help me
understand this.
Yes, I am suggesting just that: that the Changesets could
facilitate parsing/processing metadata files while the actual
content named in those files is also being downloaded. I have a
straightforward idea on how to achieve this. It's short and easy
enough to write up (no code), but I want to make sure I'm not
moving beyond the problem statement without others. Is there more
we want to do on these problem statements, or would a short
writeup of one way it could work be helpful?
The /additions/ can be (and usually is) a generator. The
generator can yield based on metadata as it is downloaded and
digested. In this way, the ChangeSet already facilitates this.
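A sketch of that flow (iter_metadata_entries() and
pending_content_for() are hypothetical helpers standing in
for the plugin's incremental download and parse of metadata;
ChangeSet usage as in the earlier sketch):

    def additions(remote):
        # Hypothetical: parse metadata incrementally as it downloads
        # and yield pending content immediately. The ChangeSet only
        # iterates this generator, so content downloads interleave
        # with metadata parsing instead of waiting for it.
        for entry in iter_metadata_entries(remote.metadata_url):
            yield pending_content_for(entry)

    changeset = ChangeSet(importer, additions=additions(remote),
                          removals=[])
    for report in changeset.apply():
        pass  # content downloads while metadata is still being parsed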
Just to state my expectations: I don't consider moving beyond
the problem statement to be a commitment to solve it; just
an agreement on what we're solving as we discuss various
resolutions. Problem statements can also always be revisited.
Either way forward is fine with me; just let me know how we
should continue.
So far, I'm not convinced that any specific problems/deficiencies
have been identified. That said, you seem to have a different
abstraction in mind. I would be interested in reviewing it and how
it would be used by plugin writers. It may help illustrate the
gains that you are envisioning.
Do you see the same challenges I do? Are
these the right problem statements? I think
with clear problem statements a solution will
be easy to see and agree on.
I'm not convinced that these are actual
problems/challenges that need to be addressed
in the near term.
Thanks!
Brian
_______________________________________________
Pulp-dev mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/pulp-dev