On 04/12/2018 10:01 AM, Brian Bouterse wrote:
On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel <jor...@redhat.com
<mailto:jor...@redhat.com>> wrote:
On 04/11/2018 03:29 PM, Brian Bouterse wrote:
I think we should look into this in the near-term. Changing an
interface on an object used by all plugins will be significantly
easier, earlier.
On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel <jor...@redhat.com
<mailto:jor...@redhat.com>> wrote:
On 04/11/2018 10:59 AM, Brian Bouterse wrote:
On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel
<jor...@redhat.com <mailto:jor...@redhat.com>> wrote:
On 04/06/2018 09:15 AM, Brian Bouterse wrote:
Several plugins have started using the Changesets
including pulp_ansible, pulp_python, pulp_file, and
perhaps others. The Changesets provide several distinct
points of value which are great, but there are two
challenges I want to bring up. I want to focus only on
the problem statements first.
1. There is redundant "differencing" code in all
plugins. The Changeset interface requires the plugin
writer to determine what units need to be added and
those to be removed. This requires all plugin writers
to write the same non-trivial differencing code over
and over. For example, you can see the same non-trivial
differencing code present in pulp_ansible
<https://github.com/pulp/pulp_ansible/blob/d0eb9d125f9a6cdc82e2807bcad38749967a1245/pulp_ansible/app/tasks/synchronizing.py#L217-L306>,
pulp_file
<https://github.com/pulp/pulp_file/blob/30afa7cce667b57d8fe66d5fc1fe87fd77029210/pulp_file/app/tasks/synchronizing.py#L114-L193>,
and pulp_python
<https://github.com/pulp/pulp_python/blob/066d33990e64b5781c8419b96acaf2acf1982324/pulp_python/app/tasks/sync.py#L172-L223>.
Line-wise, this "differencing" code makes up a large
portion (maybe 50%) of the sync code itself in each plugin.
Ten lines of trivial set logic hardly seems like a big
deal but any duplication is worth exploring.
It's more than ten lines. Take pulp_ansible for example. By
my count (the linked to section) it's 89 lines, which out of
306 lines of plugin code for sync is 29% of extra redundant
code. The other plugins have similar numbers. So with those
numbers in mind, what do you think?
I was counting the lines (w/o comments) in find_delta() based
on the linked code. Which functions are you counting?
I was counting the find_delta, build_additions, and
build_removals methods. Regardless of how the lines are counted,
that differencing code is the duplication I'm talking about.
There isn't a way to use the changesets without duplicating that
differencing code in a plugin.
The differencing code is limited to find_delta() and perhaps
build_removals(). Agreed, the line count is less useful than
specifically identifying duplicate code. Outside of find_delta(),
I see similar code (in part because it got copied from file
plugin) but not seeing actual duplication. Can you be more specific?
Very similar code or identical code, I think it begs the question why
are we having plugin writer's do this at all? What value are they
creating with it? I don't have a reasonable answer to that question,
so the requirement for plugin writer's to write that code brings me
back to the problem statement: "plugin writers have redundant
differencing code when using Changesets". More info on why it is
valuable for the plugin writer to do the differencing code versus the
Changesets would be helpful.
The ChangeSet abstraction (and API) is based on following division of
responsibility:
The plugin (with an understanding of the remote and its content):
- Download metadata.
- Parse metadata
- Based on the metadata:
- determine content to be added to the repository.
- define how artifacts are downloaded.
- construct content
- determine content to be removed to the repository.
Core (without understand of specific remote or its content):
- Provide low level API for plugin to affect the changes it has
determined need to be made to the repository. This is downloaders,
models etc.
- Provide high(er) level API for plugin to affect the changes it has
determined need to be made to the repository. This is the ChangeSet.
Are you proposing that this is not the correct division?
So a shorter, simpler problem statement is: "to use the
changesets plugin writers have to do extra work to compute
additions and removals parameters".
This statement ^ is better but still too vague to actually solve.
Can we elaborate on specifically what "to do extra work" means?
Sure. Removing that vague language is one way to resolve its
vagueness. Here's a revised problem statement: "to use the changesets
plugin writers have to compute additions and removals parameters".
This problem statement would be resolved by a solution that causes the
plugin writer to never have to produce these parameters and be
replaced by an interface that would require less effort from a plugin
writer.
I think it's the plugin's responsibility to determine the difference.
Aside from that: without an understanding of the metadata and content
type, how could the ChangeSet do this? What might that looks like?
2. Plugins can't do end-to-end stream processing. The
Changesets themselves do stream processing, but when
you call into changeset.apply_and_drain() you have to
have fully parsed the metadata already. Currently when
fetching all metadata from Galaxy, pulp_ansible takes
about 380 seconds (6+ min). This means that the actual
Changeset content downloading starts 380 seconds later
than it could. At the heart of the problem, the
fetching+parsing of the metadata is not part of the
stream processing.
The additions/removals can be any interable (like
generator) and by using ChangeSet.apply() and iterating
the returned object, the pluign can "turn the crank"
while downloading and processing the metadata. The
ChangeSet.apply_and_drain() is just a convenience
method. I don't see how this is a limitation of the
ChangeSet.
That is new info for me (and maybe everyone). OK so
Changesets have two interfaces. apply() and
apply_and_drain(). Why do we have two interfaces when
apply() can support all existing use cases (that I know of)
and do end-to-end stream processing but apply_and_drain()
cannot? I see all of our examples (and all of our new
plugins) using apply_and_drain().
The ChangeSet.apply() was how I designed (and documented) it.
Not sure when/who added the apply_and_drain(). +1 for
removing it.
I read through the changeset docs. I think this stream processing
thing is still a problem but perhaps in how we're presenting the
Changeset with it's arguments. I don't think apply() versus
apply_and_drain() are at all related. Regardless of if you are
using apply() or apply_and_drain(), the Changeset requires an
'additions' and 'removals' arguments. This sends a clear message
to the plugin writer that they need to compute additions and
removals. They will fetch the metadata to compute these which is
mostly how the changeset documentation reads. To know that they
could present a generator that would correctly allow the metdata
from inside the Changeset is I feel as non-obvious. I want the
high-performing implementation to be the obvious one.
So what about a problem statement like this: "Changesets are
presented such that when you call into them you should already
have fetched the metadata"?
I'm not sure what is meant by "presented". If this means that we
should provide an example of how the ChangeSet can be used by
plugins (with large metadata) in such a way that does not require
downloading all the metadata first - that sounds like a good idea.
Cool so this is transitioning to ideas for resolution. The solution to
add documentation on how to do this with the existing interface is one
option. My concern with adding additional docs on how to use the
current interface better is that if users choose to follow the
existing docs then they will have the stream processing problem once
again. To me, this suggests that this new example should actually
replace the existing documentation.
Seems like both example would be useful. I'm not convinced that all
plugins would benefit from this. For example: the File plugin manifest
is small and would likely not benefit from the extra complexity. For
complicated plugins (like RPM), can differencing decision be made before
analyzing the entire metadata (eg: primary.xml)? Also, it's not clear
to me how this would work using the Downloader. Are you suggesting that
the plugin would parse/process metadata files while they're being
downloaded? Perhaps a better understanding of the flow to be supported
would help me understand this.
Do you see the same challenges I do? Are these the
right problem statements? I think with clear problem
statements a solution will be easy to see and agree on.
I'm not convinced that these are actual
problems/challenges that need to be addressed in the
near term.
Thanks!
Brian
_______________________________________________
Pulp-dev mailing list
Pulp-dev@redhat.com <mailto:Pulp-dev@redhat.com>
https://www.redhat.com/mailman/listinfo/pulp-dev
<https://www.redhat.com/mailman/listinfo/pulp-dev>
_______________________________________________
Pulp-dev mailing list
Pulp-dev@redhat.com <mailto:Pulp-dev@redhat.com>
https://www.redhat.com/mailman/listinfo/pulp-dev
<https://www.redhat.com/mailman/listinfo/pulp-dev>
_______________________________________________
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev