Re: [Pulp-dev] Fwd: Re: Changesets Challenges

Jeff Ortel Wed, 11 Apr 2018 15:08:15 -0700


On 04/11/2018 03:29 PM, Brian Bouterse wrote:

I think we should look into this in the near-term. Changing an interface on an object used by all plugins will be significantly easier, earlier.

On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel <[email protected]<mailto:[email protected]>> wrote:




    On 04/11/2018 10:59 AM, Brian Bouterse wrote:



    On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel <[email protected] <mailto:[email protected]>> wrote:

        On 04/06/2018 09:15 AM, Brian Bouterse wrote:

        Several plugins have started using the Changesets including
        pulp_ansible, pulp_python, pulp_file, and perhaps others.
        The Changesets provide several distinct points of value
        which are great, but there are two challenges I want to
        bring up. I want to focus only on the problem statements first.

        1. There is redundant "differencing" code in all plugins.
        The Changeset interface requires the plugin writer to
        determine what units need to be added and those to be
        removed. This requires all plugin writers to write the same
        non-trivial differencing code over and over. For example,
        you can see the same non-trivial differencing code present
        in pulp_ansible
        
<https://github.com/pulp/pulp_ansible/blob/d0eb9d125f9a6cdc82e2807bcad38749967a1245/pulp_ansible/app/tasks/synchronizing.py#L217-L306>,
        pulp_file
        
<https://github.com/pulp/pulp_file/blob/30afa7cce667b57d8fe66d5fc1fe87fd77029210/pulp_file/app/tasks/synchronizing.py#L114-L193>,
        and pulp_python
        
<https://github.com/pulp/pulp_python/blob/066d33990e64b5781c8419b96acaf2acf1982324/pulp_python/app/tasks/sync.py#L172-L223>.
        Line-wise, this "differencing" code makes up a large portion
        (maybe 50%) of the sync code itself in each plugin.


        Ten lines of trivial set logic hardly seems like a big deal
        but any duplication is worth exploring.
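
For readers following the thread, the set logic under discussion can be sketched roughly as below. This is only an illustration, assuming each unit can be reduced to a hashable natural key; the find_delta() signature and the sample keys are hypothetical and are not taken from any of the linked plugins.

```python
from typing import Hashable, Iterable, Set, Tuple


def find_delta(
    remote_keys: Iterable[Hashable],
    local_keys: Iterable[Hashable],
) -> Tuple[Set[Hashable], Set[Hashable]]:
    """Return (additions, removals) as sets of natural keys.

    Hypothetical helper: the signature is an assumption made for
    illustration, not the plugins' actual code.
    """
    remote = set(remote_keys)
    local = set(local_keys)
    additions = remote - local  # present upstream, missing from the repository
    removals = local - remote   # present in the repository, gone upstream
    return additions, removals


# Made-up (name, version) keys, purely for demonstration.
remote = {("foo", "1.0"), ("bar", "2.1"), ("baz", "0.3")}
local = {("foo", "1.0"), ("qux", "4.0")}
additions, removals = find_delta(remote, local)
print(sorted(additions))
print(sorted(removals))
```

The sketch builds both key sets in memory before differencing, which is the simplest form of the logic and says nothing about how the plugins turn keys into units to add or remove.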

    It's more than ten lines. Take pulp_ansible for example. By my
    count (the linked to section) it's 89 lines, which out of 306
    lines of plugin code for sync is 29% of extra redundant code. The
    other plugins have similar numbers. So with those numbers in
    mind, what do you think?


    I was counting the lines (w/o comments) in find_delta() based on
    the linked code.  Which functions are you counting?

I was counting the find_delta, build_additions, and build_removals methods. Regardless of how the lines are counted, that differencing code is the duplication I'm talking about. There isn't a way to use the changesets without duplicating that differencing code in a plugin.

The differencing code is limited to find_delta() and perhaps build_removals(). Agreed, the line count is less useful than specifically identifying duplicate code. Outside of find_delta(), I see similar code (in part because it got copied from the file plugin) but I'm not seeing actual duplication. Can you be more specific?

So a shorter, simpler problem statement is: "to use the changesets, plugin writers have to do extra work to compute the additions and removals parameters".

This statement ^ is better but still too vague to actually solve. Can we elaborate on specifically what "to do extra work" means?

        2. Plugins can't do end-to-end stream processing. The
        Changesets themselves do stream processing, but when you
        call into changeset.apply_and_drain() you have to have fully
        parsed the metadata already. Currently when fetching all
        metadata from Galaxy, pulp_ansible takes about 380 seconds
        (6+ min). This means that the actual Changeset content
        downloading starts 380 seconds later than it could. At the
        heart of the problem, the fetching+parsing of the metadata
        is not part of the stream processing.
        The additions/removals can be any iterable (like a generator)
        and by using ChangeSet.apply() and iterating the returned
        object, the plugin can "turn the crank" while downloading and
        processing the metadata. The ChangeSet.apply_and_drain() is
        just a convenience method.  I don't see how this is a
        limitation of the ChangeSet.
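
A toy sketch of the "turn the crank" idea described above. Everything here is a hypothetical stand-in: toy_apply() only imitates the shape of an apply()-style call that returns an iterable, and fetch_metadata_pages() fakes paged remote metadata. It is not the pulpcore ChangeSet API.

```python
from typing import Iterable, Iterator, List

events: List[str] = []  # records the order in which work actually happens


def fetch_metadata_pages() -> Iterator[List[str]]:
    """Hypothetical stand-in for fetching remote metadata page by page."""
    pages = [["a", "b"], ["c"]]
    for number, page in enumerate(pages, start=1):
        events.append(f"fetch page {number}")
        yield page


def lazy_additions() -> Iterator[str]:
    """Generator of units to add; a page is fetched only when it is needed."""
    for page in fetch_metadata_pages():
        yield from page


def toy_apply(units: Iterable[str]) -> Iterator[str]:
    """Toy stand-in for an apply()-style call: yields one report per unit."""
    for unit in units:
        events.append(f"download {unit}")
        yield unit


reports = toy_apply(lazy_additions())
assert events == []  # nothing has been fetched or downloaded yet

for report in reports:  # iterating the returned object drives the pipeline
    pass

print(events)
```

In this sketch the downloads for the first page are recorded before the second page is fetched, which is the interleaving of metadata fetching and content processing that the paragraph above describes.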


    That is new info for me (and maybe everyone). OK so Changesets
    have two interfaces. apply() and apply_and_drain(). Why do we
    have two interfaces when apply() can support all existing use
    cases (that I know of) and do end-to-end stream processing but
    apply_and_drain() cannot? I see all of our examples (and all of
    our new plugins) using apply_and_drain().
    The ChangeSet.apply() was how I designed (and documented) it.  Not
    sure when/who added the apply_and_drain().  +1 for removing it.
I read through the changeset docs. I think this stream processing thing is still a problem, but perhaps in how we're presenting the Changeset with its arguments. I don't think apply() versus apply_and_drain() is at all related. Regardless of whether you are using apply() or apply_and_drain(), the Changeset requires 'additions' and 'removals' arguments. This sends a clear message to the plugin writer that they need to compute additions and removals. They will fetch the metadata to compute these, which is mostly how the changeset documentation reads. That they could instead pass a generator which fetches the metadata from inside the Changeset is, I feel, non-obvious. I want the high-performing implementation to be the obvious one.
So what about a problem statement like this: "Changesets are presented such that when you call into them you should already have fetched the metadata"?

I'm not sure what is meant by "presented". If this means that we should provide an example of how the ChangeSet can be used by plugins (with large metadata) in such a way that does not require downloading all the metadata first - that sounds like a good idea.


        Do you see the same challenges I do? Are these the right
        problem statements? I think with clear problem statements a
        solution will be easy to see and agree on.


        I'm not convinced that these are actual problems/challenges
        that need to be addressed in the near term.


        Thanks!
        Brian


        _______________________________________________
        Pulp-dev mailing list
        [email protected] <mailto:[email protected]>
        https://www.redhat.com/mailman/listinfo/pulp-dev
        <https://www.redhat.com/mailman/listinfo/pulp-dev>


