Re: [Pulp-dev] Fwd: Re: Changesets Challenges

Jeff Ortel Thu, 12 Apr 2018 08:54:00 -0700


On 04/12/2018 10:01 AM, Brian Bouterse wrote:

On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel <[email protected]<mailto:[email protected]>> wrote:




    On 04/11/2018 03:29 PM, Brian Bouterse wrote:

    I think we should look into this in the near-term. Changing an
    interface on an object used by all plugins will be significantly
    easier, earlier.


    On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel <[email protected]
    <mailto:[email protected]>> wrote:



        On 04/11/2018 10:59 AM, Brian Bouterse wrote:



        On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel
        <[email protected] <mailto:[email protected]>> wrote:





                

                

                

                


            On 04/06/2018 09:15 AM, Brian Bouterse wrote:

            Several plugins have started using the Changesets
            including pulp_ansible, pulp_python, pulp_file, and
            perhaps others. The Changesets provide several distinct
            points of value which are great, but there are two
            challenges I want to bring up. I want to focus only on
            the problem statements first.

            1. There is redundant "differencing" code in all
            plugins. The Changeset interface requires the plugin
            writer to determine what units need to be added and
            those to be removed. This requires all plugin writers
            to write the same non-trivial differencing code over
            and over. For example, you can see the same non-trivial
            differencing code present in pulp_ansible
            
<https://github.com/pulp/pulp_ansible/blob/d0eb9d125f9a6cdc82e2807bcad38749967a1245/pulp_ansible/app/tasks/synchronizing.py#L217-L306>,
            pulp_file
            
<https://github.com/pulp/pulp_file/blob/30afa7cce667b57d8fe66d5fc1fe87fd77029210/pulp_file/app/tasks/synchronizing.py#L114-L193>,
            and pulp_python
            
<https://github.com/pulp/pulp_python/blob/066d33990e64b5781c8419b96acaf2acf1982324/pulp_python/app/tasks/sync.py#L172-L223>.
            Line-wise, this "differencing" code makes up a large
            portion (maybe 50%) of the sync code itself in each plugin.


            Ten lines of trivial set logic hardly seems like a big
            deal but any duplication is worth exploring.

        It's more than ten lines. Take pulp_ansible for example. By
        my count (the linked to section) it's 89 lines, which out of
        306 lines of plugin code for sync is 29% of extra redundant
        code. The other plugins have similar numbers. So with those
        numbers in mind, what do you think?


        I was counting the lines (w/o comments) in find_delta() based
        on the linked code.  Which functions are you counting?


    I was counting the find_delta, build_additions, and
    build_removals methods. Regardless of how the lines are counted,
    that differencing code is the duplication I'm talking about.
    There isn't a way to use the changesets without duplicating that
    differencing code in a plugin.


    The differencing code is limited to find_delta() and perhaps
    build_removals().  Agreed, the line count is less useful than
    specifically identifying duplicate code.  Outside of find_delta(),
    I see similar code (in part because it got copied from file
    plugin) but not seeing actual duplication.  Can you be more specific?

Very similar code or identical code, I think it begs the question whyare we having plugin writer's do this at all? What value are theycreating with it? I don't have a reasonable answer to that question,so the requirement for plugin writer's to write that code brings meback to the problem statement: "plugin writers have redundantdifferencing code when using Changesets". More info on why it isvaluable for the plugin writer to do the differencing code versus theChangesets would be helpful.

The ChangeSet abstraction (and API) is based on following division ofresponsibility:


The plugin  (with an understanding of the remote and its content):
  - Download metadata.
  - Parse metadata
  - Based on the metadata:
    - determine content to be added to the repository.
      - define how artifacts are downloaded.
      - construct content
    - determine content to be removed to the repository.

Core (without understand of specific remote or its content):

- Provide low level API for plugin to affect the changes it hasdetermined need to be made to the repository. This is downloaders,models etc. - Provide high(er) level API for plugin to affect the changes it hasdetermined need to be made to the repository. This is the ChangeSet.


Are you proposing that this is not the correct division?

    So a shorter, simpler problem statement is: "to use the
    changesets plugin writers have to do extra work to compute
    additions and removals parameters".
    This statement ^ is better but still too vague to actually solve. 
    Can we elaborate on specifically what "to do extra work" means?
Sure. Removing that vague language is one way to resolve itsvagueness. Here's a revised problem statement: "to use the changesetsplugin writers have to compute additions and removals parameters".This problem statement would be resolved by a solution that causes theplugin writer to never have to produce these parameters and bereplaced by an interface that would require less effort from a pluginwriter.

I think it's the plugin's responsibility to determine the difference. Aside from that: without an understanding of the metadata and contenttype, how could the ChangeSet do this? What might that looks like?


            2. Plugins can't do end-to-end stream processing. The
            Changesets themselves do stream processing, but when
            you call into changeset.apply_and_drain() you have to
            have fully parsed the metadata already. Currently when
            fetching all metadata from Galaxy, pulp_ansible takes
            about 380 seconds (6+ min). This means that the actual
            Changeset content downloading starts 380 seconds later
            than it could. At the heart of the problem, the
            fetching+parsing of the metadata is not part of the
            stream processing.


            The additions/removals can be any interable (like
            generator) and by using ChangeSet.apply() and iterating
            the returned object, the pluign can "turn the crank"
            while downloading and processing the metadata.  The
            ChangeSet.apply_and_drain() is just a convenience
            method.  I don't see how this is a limitation of the
            ChangeSet.


        That is new info for me (and maybe everyone). OK so
        Changesets have two interfaces. apply() and
        apply_and_drain(). Why do we have two interfaces when
        apply() can support all existing use cases (that I know of)
        and do end-to-end stream processing but apply_and_drain()
        cannot? I see all of our examples (and all of our new
        plugins) using apply_and_drain().


        The ChangeSet.apply() was how I designed (and documented) it.
        Not sure when/who added the apply_and_drain().  +1 for
        removing it.


    I read through the changeset docs. I think this stream processing
    thing is still a problem but perhaps in how we're presenting the
    Changeset with it's arguments. I don't think apply() versus
    apply_and_drain() are at all related. Regardless of if you are
    using apply() or apply_and_drain(), the Changeset requires an
    'additions' and 'removals' arguments. This sends a clear message
    to the plugin writer that they need to compute additions and
    removals. They will fetch the metadata to compute these which is
    mostly how the changeset documentation reads. To know that they
    could present a generator that would correctly allow the metdata
    from inside the Changeset is I feel as non-obvious. I want the
    high-performing implementation to be the obvious one.

    So what about a problem statement like this: "Changesets are
    presented such that when you call into them you should already
    have fetched the metadata"?


    I'm not sure what is meant by "presented".  If this means that we
    should provide an example of how the ChangeSet can be used by
    plugins (with large metadata) in such a way that does not require
    downloading all the metadata first - that sounds like a good idea.

Cool so this is transitioning to ideas for resolution. The solution toadd documentation on how to do this with the existing interface is oneoption. My concern with adding additional docs on how to use thecurrent interface better is that if users choose to follow theexisting docs then they will have the stream processing problem onceagain. To me, this suggests that this new example should actuallyreplace the existing documentation.

Seems like both example would be useful. I'm not convinced that allplugins would benefit from this. For example: the File plugin manifestis small and would likely not benefit from the extra complexity. Forcomplicated plugins (like RPM), can differencing decision be made beforeanalyzing the entire metadata (eg: primary.xml)? Also, it's not clearto me how this would work using the Downloader. Are you suggesting thatthe plugin would parse/process metadata files while they're beingdownloaded? Perhaps a better understanding of the flow to be supportedwould help me understand this.


            Do you see the same challenges I do? Are these the
            right problem statements? I think with clear problem
            statements a solution will be easy to see and agree on.


            I'm not convinced that these are actual
            problems/challenges that need to be addressed in the
            near term.


            Thanks!
            Brian


            _______________________________________________
            Pulp-dev mailing list
            [email protected] <mailto:[email protected]>
            https://www.redhat.com/mailman/listinfo/pulp-dev
            <https://www.redhat.com/mailman/listinfo/pulp-dev>



            _______________________________________________
            Pulp-dev mailing list
            [email protected] <mailto:[email protected]>
            https://www.redhat.com/mailman/listinfo/pulp-dev
            <https://www.redhat.com/mailman/listinfo/pulp-dev>

_______________________________________________
Pulp-dev mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/pulp-dev

Re: [Pulp-dev] Fwd: Re: Changesets Challenges

Reply via email to