On 04/11/2018 03:29 PM, Brian Bouterse wrote:
I think we should look into this in the near-term. Changing an
interface on an object used by all plugins will be significantly
easier, earlier.
On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel <[email protected]
<mailto:[email protected]>> wrote:
On 04/11/2018 10:59 AM, Brian Bouterse wrote:
On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel <[email protected]
<mailto:[email protected]>> wrote:
On 04/06/2018 09:15 AM, Brian Bouterse wrote:
Several plugins have started using the Changesets including
pulp_ansible, pulp_python, pulp_file, and perhaps others.
The Changesets provide several distinct points of value
which are great, but there are two challenges I want to
bring up. I want to focus only on the problem statements first.
1. There is redundant "differencing" code in all plugins.
The Changeset interface requires the plugin writer to
determine what units need to be added and those to be
removed. This requires all plugin writers to write the same
non-trivial differencing code over and over. For example,
you can see the same non-trivial differencing code present
in pulp_ansible
<https://github.com/pulp/pulp_ansible/blob/d0eb9d125f9a6cdc82e2807bcad38749967a1245/pulp_ansible/app/tasks/synchronizing.py#L217-L306>,
pulp_file
<https://github.com/pulp/pulp_file/blob/30afa7cce667b57d8fe66d5fc1fe87fd77029210/pulp_file/app/tasks/synchronizing.py#L114-L193>,
and pulp_python
<https://github.com/pulp/pulp_python/blob/066d33990e64b5781c8419b96acaf2acf1982324/pulp_python/app/tasks/sync.py#L172-L223>.
Line-wise, this "differencing" code makes up a large portion
(maybe 50%) of the sync code itself in each plugin.
Ten lines of trivial set logic hardly seems like a big deal
but any duplication is worth exploring.
It's more than ten lines. Take pulp_ansible for example. By my
count (the linked to section) it's 89 lines, which out of 306
lines of plugin code for sync is 29% of extra redundant code. The
other plugins have similar numbers. So with those numbers in
mind, what do you think?
I was counting the lines (w/o comments) in find_delta() based on
the linked code. Which functions are you counting?
I was counting the find_delta, build_additions, and build_removals
methods. Regardless of how the lines are counted, that differencing
code is the duplication I'm talking about. There isn't a way to use
the changesets without duplicating that differencing code in a plugin.
The differencing code is limited to find_delta() and perhaps
build_removals(). Agreed, the line count is less useful than
specifically identifying duplicate code. Outside of find_delta(), I see
similar code (in part because it got copied from file plugin) but not
seeing actual duplication. Can you be more specific?
So a shorter, simpler problem statement is: "to use the changesets
plugin writers have to do extra work to compute additions and removals
parameters".
This statement ^ is better but still too vague to actually solve. Can we
elaborate on specifically what "to do extra work" means?
2. Plugins can't do end-to-end stream processing. The
Changesets themselves do stream processing, but when you
call into changeset.apply_and_drain() you have to have fully
parsed the metadata already. Currently when fetching all
metadata from Galaxy, pulp_ansible takes about 380 seconds
(6+ min). This means that the actual Changeset content
downloading starts 380 seconds later than it could. At the
heart of the problem, the fetching+parsing of the metadata
is not part of the stream processing.
The additions/removals can be any interable (like generator)
and by using ChangeSet.apply() and iterating the returned
object, the pluign can "turn the crank" while downloading and
processing the metadata. The ChangeSet.apply_and_drain() is
just a convenience method. I don't see how this is a
limitation of the ChangeSet.
That is new info for me (and maybe everyone). OK so Changesets
have two interfaces. apply() and apply_and_drain(). Why do we
have two interfaces when apply() can support all existing use
cases (that I know of) and do end-to-end stream processing but
apply_and_drain() cannot? I see all of our examples (and all of
our new plugins) using apply_and_drain().
The ChangeSet.apply() was how I designed (and documented) it. Not
sure when/who added the apply_and_drain(). +1 for removing it.
I read through the changeset docs. I think this stream processing
thing is still a problem but perhaps in how we're presenting the
Changeset with it's arguments. I don't think apply() versus
apply_and_drain() are at all related. Regardless of if you are using
apply() or apply_and_drain(), the Changeset requires an 'additions'
and 'removals' arguments. This sends a clear message to the plugin
writer that they need to compute additions and removals. They will
fetch the metadata to compute these which is mostly how the changeset
documentation reads. To know that they could present a generator that
would correctly allow the metdata from inside the Changeset is I feel
as non-obvious. I want the high-performing implementation to be the
obvious one.
So what about a problem statement like this: "Changesets are presented
such that when you call into them you should already have fetched the
metadata"?
I'm not sure what is meant by "presented". If this means that we should
provide an example of how the ChangeSet can be used by plugins (with
large metadata) in such a way that does not require downloading all the
metadata first - that sounds like a good idea.
Do you see the same challenges I do? Are these the right
problem statements? I think with clear problem statements a
solution will be easy to see and agree on.
I'm not convinced that these are actual problems/challenges
that need to be addressed in the near term.
Thanks!
Brian
_______________________________________________
Pulp-dev mailing list
[email protected] <mailto:[email protected]>
https://www.redhat.com/mailman/listinfo/pulp-dev
<https://www.redhat.com/mailman/listinfo/pulp-dev>
_______________________________________________
Pulp-dev mailing list
[email protected] <mailto:[email protected]>
https://www.redhat.com/mailman/listinfo/pulp-dev
<https://www.redhat.com/mailman/listinfo/pulp-dev>
_______________________________________________
Pulp-dev mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/pulp-dev