Re: [Pulp-dev] Fwd: Re: Changesets Challenges

2018-04-19 Thread Dennis Kliban
Both of those tickets look good to me. I can clearly see how the proposed
changes will simplify the plugin writing experience.

Thank you both for putting together this plan.

On Thu, Apr 19, 2018 at 2:50 PM, Brian Bouterse  wrote:

> Jeff and I met and we put together two pieces of work which would create a
> declarative interface for a plugin writer to use. This would be used
> instead of the ChangeSet interface by plugin writers. Whether or not to
> continue including the ChangeSet in the plugin API is still being discussed.
>
> There seemed to be interest in offering an interface like this so on
> Monday we will put together a PR so that we can see what it looks like and
> how hard it would be to switch. Look at these stories in the hopes that we
> can groom them and put them on the sprint.
>
> * https://pulp.plan.io/issues/3570
> * https://pulp.plan.io/issues/3582
>
> Our plan is to start on ^ on Monday so if there are questions, ideas, or
> concerns let us know. Once we have something to share, we'll email back to
> this thread. Feel free to comment on the issues directly also.
>
> Thanks,
> Brian & Jeff
>
>
> On Mon, Apr 16, 2018 at 3:10 PM, Dennis Kliban  wrote:
>
>> On Mon, Apr 16, 2018 at 2:13 PM, Dennis Kliban 
>> wrote:
>>
>>> On Mon, Apr 16, 2018 at 12:21 PM, Jeff Ortel  wrote:
>>>
>>>> Thanks for the proposal, Brian.  I also commented on the issue.
>>>>
>>>> On 04/16/2018 09:41 AM, Brian Bouterse wrote:
>>>>
>>>> I wrote up a description of the opportunity I see here [0]. I put a
>>>> high level pro/con analysis below. I would like feedback on (a) if this
>>>> adequately addresses the problem statements, (b) if there are alternatives,
>>>> and (c) does this improve the plugin writer's experience enough to adopt
>>>> this?
>>>>
>>>> pros:
>>>> * significantly less plugin code to write. Compare the Thing example
>>>> code versus the current docs.
>>>>
>>>> +1
>>>>
>>>> * Higher performing with metadata downloading and parsing being
>>>> included in stream processing. This causes syncs for pulp_ansible to start
>>>> 6+ min earlier.
>>>>
>>>>
>>>> This could also be done currently with the ChangeSet as-is.
>>>>
>>>>
>>>> cons:
>>>> * Progress reporting doesn't know how many things it's processing (it's
>>>> a stream). So users would see progress as "X things completed", not "X of
>>>> Y things completed". Y can't be known until just before the stream
>>>> processing completes; otherwise it's not stream processing.
>>>>
>>>>
>>>> I'm not a fan of the SizedIterator either.
>>>> I contemplated this when designing the ChangeSet.  An alternative I
>>>> considered was to report progress like OSTree does.  It reports progress by
>>>> periodically updating the expected TOTAL.  It's better than nothing.


>>> What if we allow plugin writers to optionally provide a total number
>>> when instantiating the ChangeSet? I bet there will be cases where the
>>> number of items in the repository version will be known without having to
>>> fully parse all the metadata. In these cases the progress reporting could
>>> be more informative.
>>>
>>>
>>
>> Here is another idea for progress reporting for stream processing: have
>> ChangeSet create a separate progress report for downloads. The total could
>> be dynamically updated as downloads are scheduled. The complete count can
>> be updated after each successful download.
>>
>> Any limitations in progress reporting are outweighed by the efficiency
>> gained by having plugins always use stream processing. Just imagine not
>> having to wait for the RPM plugin to finish "processing metadata" to start
>> downloading content.
>>
>>
>>>

>>>> [0]: https://pulp.plan.io/issues/3570
>>>>
>>>> Thanks!
>>>> Brian



 On Thu, Apr 12, 2018 at 7:12 PM, Jeff Ortel  wrote:

>
>
> On 04/12/2018 04:00 PM, Brian Bouterse wrote:
>
>
> On Thu, Apr 12, 2018 at 11:53 AM, Jeff Ortel 
> wrote:
>
>>
>>
>> On 04/12/2018 10:01 AM, Brian Bouterse wrote:
>>
>>
>>
>> On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel 
>> wrote:
>>
>>>
>>>
>>> On 04/11/2018 03:29 PM, Brian Bouterse wrote:
>>>
>>> I think we should look into this in the near-term. Changing an
>>> interface on an object used by all plugins will be significantly easier,
>>> earlier.
>>>
>>>
>>> On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel 
>>> wrote:
>>>


 On 04/11/2018 10:59 AM, Brian Bouterse wrote:



 On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel 
 wrote:

>
>
>
>
>
>
>
>
>
>
>
>
> On 04/06/2018 09:15 AM, Brian Bouterse wrote:
>
> 

Re: [Pulp-dev] Fwd: Re: Changesets Challenges

2018-04-19 Thread Brian Bouterse
Jeff and I met and put together two pieces of work which would create a
declarative interface for a plugin writer to use. Plugin writers would use
this instead of the ChangeSet interface. Whether or not to continue
including the ChangeSet in the plugin API is still being discussed.

There seemed to be interest in offering an interface like this, so on Monday
we will put together a PR so that we can see what it looks like and how
hard it would be to switch. Please look at these stories so that we can
groom them and put them on the sprint.

* https://pulp.plan.io/issues/3570
* https://pulp.plan.io/issues/3582

Our plan is to start on the stories above on Monday, so if there are
questions, ideas, or concerns, let us know. Once we have something to share,
we'll email back to this thread. Feel free to comment on the issues directly
as well.

Thanks,
Brian & Jeff


On Mon, Apr 16, 2018 at 3:10 PM, Dennis Kliban  wrote:

> On Mon, Apr 16, 2018 at 2:13 PM, Dennis Kliban  wrote:
>
>> On Mon, Apr 16, 2018 at 12:21 PM, Jeff Ortel  wrote:
>>
>>> Thanks for the proposal, Brian.  I also commented on the issue.
>>>
>>> On 04/16/2018 09:41 AM, Brian Bouterse wrote:
>>>
>>> I wrote up a description of the opportunity I see here [0]. I put a high
>>> level pro/con analysis below. I would like feedback on (a) if this
>>> adequately addresses the problem statements, (b) if there are alternatives,
>>> and (c) does this improve the plugin writer's experience enough to adopt
>>> this?
>>>
>>> pros:
>>> * significantly less plugin code to write. Compare the Thing example
>>> code versus the current docs.
>>>
>>> +1
>>>
>>> * Higher performing with metadata downloading and parsing being included
>>> in stream processing. This causes syncs for pulp_ansible to start 6+ min
>>> earlier.
>>>
>>>
>>> This could also be done currently with the ChangeSet as-is.
>>>
>>>
>>> cons:
>>> * Progress reporting doesn't know how many things it's processing (it's
>>> a stream). So users would see progress as "X things completed", not "X of
>>> Y things completed". Y can't be known until just before the stream
>>> processing completes otherwise it's not stream processing.
>>>
>>>
>>> I'm not a fan of the SizedIterator either.
>>> I contemplated this when designing the ChangeSet.  An alternative I
>>> considered was to report progress like OSTree does.  It reports progress by
>>> periodically updating the expected TOTAL.  It's better than nothing.
>>>
>>>
>> What if we allow plugin writers to optionally provide a total number when
>> instantiating the ChangeSet? I bet there will be cases where the number of
>> items in the repository version will be known without having to fully parse
>> all the metadata. In these cases the progress reporting could be more
>> informative.
>>
>>
>
> Here is another idea for progress reporting for stream processing: have
> ChangeSet create a separate progress report for downloads. The total could
> be dynamically updated as downloads are scheduled. The complete count can
> be updated after each successful download.
>
> Any limitations in progress reporting are outweighed by the efficiency
> gained by having plugins always use stream processing. Just imagine not
> having to wait for the RPM plugin to finish "processing metadata" to start
> downloading content.
>
>
>>
>>>
>>> [0]: https://pulp.plan.io/issues/3570
>>>
>>> Thanks!
>>> Brian
>>>
>>>
>>>
>>> On Thu, Apr 12, 2018 at 7:12 PM, Jeff Ortel  wrote:
>>>


 On 04/12/2018 04:00 PM, Brian Bouterse wrote:


 On Thu, Apr 12, 2018 at 11:53 AM, Jeff Ortel  wrote:

>
>
> On 04/12/2018 10:01 AM, Brian Bouterse wrote:
>
>
>
> On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel  wrote:
>
>>
>>
>> On 04/11/2018 03:29 PM, Brian Bouterse wrote:
>>
>> I think we should look into this in the near-term. Changing an
>> interface on an object used by all plugins will be significantly easier,
>> earlier.
>>
>>
>> On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel 
>> wrote:
>>
>>>
>>>
>>> On 04/11/2018 10:59 AM, Brian Bouterse wrote:
>>>
>>>
>>>
>>> On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel 
>>> wrote:
>>>












 On 04/06/2018 09:15 AM, Brian Bouterse wrote:

 Several plugins have started using the Changesets including
 pulp_ansible, pulp_python, pulp_file, and perhaps others. The Changesets
 provide several distinct points of value which are great, but there are
 two challenges I want to bring up. I want to focus only on the problem
 statements first.

 1. There is redundant "differencing" code in all plugins. The

Re: [Pulp-dev] Fwd: Re: Changesets Challenges

2018-04-16 Thread Dennis Kliban
On Mon, Apr 16, 2018 at 2:13 PM, Dennis Kliban  wrote:

> On Mon, Apr 16, 2018 at 12:21 PM, Jeff Ortel  wrote:
>
>> Thanks for the proposal, Brian.  I also commented on the issue.
>>
>> On 04/16/2018 09:41 AM, Brian Bouterse wrote:
>>
>> I wrote up a description of the opportunity I see here [0]. I put a high
>> level pro/con analysis below. I would like feedback on (a) if this
>> adequately addresses the problem statements, (b) if there are alternatives,
>> and (c) does this improve the plugin writer's experience enough to adopt
>> this?
>>
>> pros:
>> * significantly less plugin code to write. Compare the Thing example code
>> versus the current docs.
>>
>> +1
>>
>> * Higher performing with metadata downloading and parsing being included
>> in stream processing. This causes syncs for pulp_ansible to start 6+ min
>> earlier.
>>
>>
>> This could also be done currently with the ChangeSet as-is.
>>
>>
>> cons:
>> * Progress reporting doesn't know how many things it's processing (it's a
>> stream). So users would see progress as "X things completed", not "X of Y
>> things completed". Y can't be known until just before the stream processing
>> completes otherwise it's not stream processing.
>>
>>
>> I'm not a fan of the SizedIterator either.
>> I contemplated this when designing the ChangeSet.  An alternative I
>> considered was to report progress like OSTree does.  It reports progress by
>> periodically updating the expected TOTAL.  It's better than nothing.
>>
>>
> What if we allow plugin writers to optionally provide a total number when
> instantiating the ChangeSet? I bet there will be cases where the number of
> items in the repository version will be known without having to fully parse
> all the metadata. In these cases the progress reporting could be more
> informative.
>
>

Here is another idea for progress reporting for stream processing: have
ChangeSet create a separate progress report for downloads. The total could
be dynamically updated as downloads are scheduled. The completed count can
be updated after each successful download.

Any limitations in progress reporting are outweighed by the efficiency
gained by having plugins always use stream processing. Just imagine not
having to wait for the RPM plugin to finish "processing metadata" to start
downloading content.
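
To make the download-progress idea concrete, here is a minimal sketch in
plain Python. DownloadProgress and its methods are hypothetical names used
only for illustration and are not part of the Pulp plugin API. It assumes
the total simply grows each time a download is scheduled and the completed
count grows after each successful download.

```python
from dataclasses import dataclass


@dataclass
class DownloadProgress:
    """Hypothetical progress report whose total is not known up front."""

    total: int = 0  # downloads scheduled so far; grows while the stream runs
    done: int = 0   # downloads that finished successfully

    def scheduled(self) -> None:
        """Record that one more download has been scheduled."""
        self.total += 1

    def completed(self) -> None:
        """Record that one download finished successfully."""
        self.done += 1

    def message(self) -> str:
        return f"{self.done} of {self.total} downloads completed"


progress = DownloadProgress()
for _ in range(3):
    progress.scheduled()   # three downloads scheduled so far
progress.completed()       # one of them has finished
print(progress.message())
```

A user would see the total change while the sync runs, which matches the
OSTree-style reporting mentioned earlier in the thread.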


>
>>
>> [0]: https://pulp.plan.io/issues/3570
>>
>> Thanks!
>> Brian
>>
>>
>>
>> On Thu, Apr 12, 2018 at 7:12 PM, Jeff Ortel  wrote:
>>
>>>
>>>
>>> On 04/12/2018 04:00 PM, Brian Bouterse wrote:
>>>
>>>
>>> On Thu, Apr 12, 2018 at 11:53 AM, Jeff Ortel  wrote:
>>>


 On 04/12/2018 10:01 AM, Brian Bouterse wrote:



 On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel  wrote:

>
>
> On 04/11/2018 03:29 PM, Brian Bouterse wrote:
>
> I think we should look into this in the near-term. Changing an
> interface on an object used by all plugins will be significantly easier,
> earlier.
>
>
> On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel 
> wrote:
>
>>
>>
>> On 04/11/2018 10:59 AM, Brian Bouterse wrote:
>>
>>
>>
>> On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel 
>> wrote:
>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 04/06/2018 09:15 AM, Brian Bouterse wrote:
>>>
>>> Several plugins have started using the Changesets including
>>> pulp_ansible, pulp_python, pulp_file, and perhaps others. The Changesets
>>> provide several distinct points of value which are great, but there are
>>> two challenges I want to bring up. I want to focus only on the problem
>>> statements first.
>>>
>>> 1. There is redundant "differencing" code in all plugins. The
>>> Changeset interface requires the plugin writer to determine what units
>>> need to be added and those to be removed. This requires all plugin
>>> writers to write the same non-trivial differencing code over and over.
>>> For example, you can see the same non-trivial differencing code present
>>> in pulp_ansible, pulp_file, and pulp_python. Line-wise, this
>>> "differencing" code makes up a large portion (maybe 50%) of the sync
>>> code itself in each plugin.
>>>
>>>
>>> Ten lines of trivial set logic hardly seems like a big deal but any
>>> duplication is worth exploring.
>>>
>> 

Re: [Pulp-dev] Fwd: Re: Changesets Challenges

2018-04-16 Thread Dennis Kliban
On Mon, Apr 16, 2018 at 12:21 PM, Jeff Ortel  wrote:

> Thanks for the proposal, Brian.  I also commented on the issue.
>
> On 04/16/2018 09:41 AM, Brian Bouterse wrote:
>
> I wrote up a description of the opportunity I see here [0]. I put a high
> level pro/con analysis below. I would like feedback on (a) if this
> adequately addresses the problem statements, (b) if there are alternatives,
> and (c) does this improve the plugin writer's experience enough to adopt
> this?
>
> pros:
> * significantly less plugin code to write. Compare the Thing example code
> versus the current docs.
>
> +1
>
> * Higher performing with metadata downloading and parsing being included
> in stream processing. This causes syncs for pulp_ansible to start 6+ min
> earlier.
>
>
> This could also be done currently with the ChangeSet as-is.
>
>
> cons:
> * Progress reporting doesn't know how many things it's processing (it's a
> stream). So users would see progress as "X things completed", not "X of Y
> things completed". Y can't be known until just before the stream processing
> completes otherwise it's not stream processing.
>
>
> I'm not a fan of the SizedIterator either.
> I contemplated this when designing the ChangeSet.  An alternative I
> considered was to report progress like OSTree does.  It reports progress by
> periodically updating the expected TOTAL.  It's better than nothing.
>
>
What if we allow plugin writers to optionally provide a total number when
instantiating the ChangeSet? I bet there will be cases where the number of
items in the repository version will be known without having to fully parse
all the metadata. In these cases the progress reporting could be more
informative.
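
A minimal sketch of what an optional total could look like from the
reporting side. The function name progress_message and its parameters are
assumptions made for illustration; nothing here is an existing ChangeSet
argument. When the plugin writer supplies a total, the message uses the
"X of Y" form, and otherwise it falls back to "X things completed".

```python
from typing import Optional


def progress_message(done: int, total: Optional[int] = None) -> str:
    """Format progress, including the total only when one was supplied."""
    if total is None:
        # Pure stream processing: the final count is not known yet.
        return f"{done} things completed"
    return f"{done} of {total} things completed"


print(progress_message(5))
print(progress_message(5, total=20))
```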


>
>
> [0]: https://pulp.plan.io/issues/3570
>
> Thanks!
> Brian
>
>
>
> On Thu, Apr 12, 2018 at 7:12 PM, Jeff Ortel  wrote:
>
>>
>>
>> On 04/12/2018 04:00 PM, Brian Bouterse wrote:
>>
>>
>> On Thu, Apr 12, 2018 at 11:53 AM, Jeff Ortel  wrote:
>>
>>>
>>>
>>> On 04/12/2018 10:01 AM, Brian Bouterse wrote:
>>>
>>>
>>>
>>> On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel  wrote:
>>>


 On 04/11/2018 03:29 PM, Brian Bouterse wrote:

 I think we should look into this in the near-term. Changing an
 interface on an object used by all plugins will be significantly easier,
 earlier.


 On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel  wrote:

>
>
> On 04/11/2018 10:59 AM, Brian Bouterse wrote:
>
>
>
> On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel 
> wrote:
>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On 04/06/2018 09:15 AM, Brian Bouterse wrote:
>>
>> Several plugins have started using the Changesets including
>> pulp_ansible, pulp_python, pulp_file, and perhaps others. The Changesets
>> provide several distinct points of value which are great, but there are
>> two challenges I want to bring up. I want to focus only on the problem
>> statements first.
>>
>> 1. There is redundant "differencing" code in all plugins. The
>> Changeset interface requires the plugin writer to determine what units
>> need to be added and those to be removed. This requires all plugin
>> writers to write the same non-trivial differencing code over and over.
>> For example, you can see the same non-trivial differencing code present
>> in pulp_ansible, pulp_file, and pulp_python. Line-wise, this
>> "differencing" code makes up a large portion (maybe 50%) of the sync
>> code itself in each plugin.
>>
>>
>> Ten lines of trivial set logic hardly seems like a big deal but any
>> duplication is worth exploring.
>>
> It's more than ten lines. Take pulp_ansible for example. By my count
> (the linked to section) it's 89 lines, which out of 306 lines of plugin
> code for sync is 29% of extra redundant code. The other plugins have
> similar numbers. So with those numbers in mind, what do you think?
>
>
> I was counting the lines (w/o comments) in find_delta() based on the
> linked code.  Which functions are you counting?
>

 I was counting the find_delta, build_additions, and build_removals
 methods. Regardless of how the lines are counted, that differencing code is
 the duplication I'm talking about. There isn't a way to use the changesets
 without duplicating that differencing code 

Re: [Pulp-dev] Fwd: Re: Changesets Challenges

2018-04-16 Thread Jeff Ortel

Thanks for the proposal, Brian.  I also commented on the issue.

On 04/16/2018 09:41 AM, Brian Bouterse wrote:
I wrote up a description of the opportunity I see here [0]. I put a
high level pro/con analysis below. I would like feedback on (a) if
this adequately addresses the problem statements, (b) if there are
alternatives, and (c) does this improve the plugin writer's experience
enough to adopt this?


pros:
* significantly less plugin code to write. Compare the Thing example 
code versus the current docs.

+1

* Higher performing with metadata downloading and parsing being
included in stream processing. This causes syncs for pulp_ansible to
start 6+ min earlier.


This could also be done currently with the ChangeSet as-is.



cons:
* Progress reporting doesn't know how many things it's processing
(it's a stream). So users would see progress as "X things completed",
not "X of Y things completed". Y can't be known until just before the
stream processing completes; otherwise it's not stream processing.


I'm not a fan of the SizedIterator either.
I contemplated this when designing the ChangeSet.  An alternative I 
considered was to report progress like OSTree does.  It reports progress 
by periodically updating the expected TOTAL.  It's better than nothing.




[0]: https://pulp.plan.io/issues/3570

Thanks!
Brian



On Thu, Apr 12, 2018 at 7:12 PM, Jeff Ortel wrote:




On 04/12/2018 04:00 PM, Brian Bouterse wrote:


On Thu, Apr 12, 2018 at 11:53 AM, Jeff Ortel wrote:



On 04/12/2018 10:01 AM, Brian Bouterse wrote:



On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel wrote:



On 04/11/2018 03:29 PM, Brian Bouterse wrote:

I think we should look into this in the near-term.
Changing an interface on an object used by all plugins
will be significantly easier, earlier.


On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel wrote:



On 04/11/2018 10:59 AM, Brian Bouterse wrote:



On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel wrote:














On 04/06/2018 09:15 AM, Brian Bouterse wrote:

Several plugins have started using the
Changesets including pulp_ansible,
pulp_python, pulp_file, and perhaps others.
The Changesets provide several distinct
points of value which are great, but there
are two challenges I want to bring up. I want
to focus only on the problem statements first.

1. There is redundant "differencing" code in
all plugins. The Changeset interface requires
the plugin writer to determine what units
need to be added and those to be removed.
This requires all plugin writers to write the
same non-trivial differencing code over and
over. For example, you can see the same
non-trivial differencing code present in
pulp_ansible, pulp_file, and pulp_python.
Line-wise, this "differencing" code makes up
a large portion (maybe 50%) of the sync code
itself in each plugin.


Ten lines of trivial set logic hardly seems
like a big deal but any duplication is worth
exploring.

It's more than ten lines. Take pulp_ansible for
example. By my count (the linked to section) it's
89 lines, which out of 306 lines of plugin code
for sync is 29% of extra redundant code. The other
plugins have similar numbers. So with those
numbers in mind, what do you think?


I was counting the lines (w/o comments) in
find_delta() based on the linked code. Which
functions are you counting?


I was counting the find_delta, build_additions, and
build_removals 

Re: [Pulp-dev] Fwd: Re: Changesets Challenges

2018-04-16 Thread Brian Bouterse
I wrote up a description of the opportunity I see here [0]. I put a
high-level pro/con analysis below. I would like feedback on (a) whether this
adequately addresses the problem statements, (b) whether there are
alternatives, and (c) whether this improves the plugin writer's experience
enough to adopt it.

pros:
* Significantly less plugin code to write. Compare the Thing example code
versus the current docs.
* Higher performing, with metadata downloading and parsing being included in
stream processing. This causes syncs for pulp_ansible to start 6+ min
earlier.

cons:
* Progress reporting doesn't know how many things it's processing (it's a
stream). So users would see progress as "X things completed", not "X of Y
things completed". Y can't be known until just before the stream processing
completes; otherwise it's not stream processing.

[0]: https://pulp.plan.io/issues/3570

Thanks!
Brian



On Thu, Apr 12, 2018 at 7:12 PM, Jeff Ortel  wrote:

>
>
> On 04/12/2018 04:00 PM, Brian Bouterse wrote:
>
>
> On Thu, Apr 12, 2018 at 11:53 AM, Jeff Ortel  wrote:
>
>>
>>
>> On 04/12/2018 10:01 AM, Brian Bouterse wrote:
>>
>>
>>
>> On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel  wrote:
>>
>>>
>>>
>>> On 04/11/2018 03:29 PM, Brian Bouterse wrote:
>>>
>>> I think we should look into this in the near-term. Changing an interface
>>> on an object used by all plugins will be significantly easier, earlier.
>>>
>>>
>>> On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel  wrote:
>>>


 On 04/11/2018 10:59 AM, Brian Bouterse wrote:



 On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel  wrote:

>
>
>
>
>
>
>
>
>
>
>
>
> On 04/06/2018 09:15 AM, Brian Bouterse wrote:
>
> Several plugins have started using the Changesets including
> pulp_ansible, pulp_python, pulp_file, and perhaps others. The Changesets
> provide several distinct points of value which are great, but there are
> two challenges I want to bring up. I want to focus only on the problem
> statements first.
>
> 1. There is redundant "differencing" code in all plugins. The
> Changeset interface requires the plugin writer to determine what units
> need to be added and those to be removed. This requires all plugin
> writers to write the same non-trivial differencing code over and over.
> For example, you can see the same non-trivial differencing code present
> in pulp_ansible, pulp_file, and pulp_python. Line-wise, this
> "differencing" code makes up a large portion (maybe 50%) of the sync
> code itself in each plugin.
>
>
> Ten lines of trivial set logic hardly seems like a big deal but any
> duplication is worth exploring.
>
>>>> It's more than ten lines. Take pulp_ansible for example. By my count
>>>> (the linked to section) it's 89 lines, which out of 306 lines of plugin
>>>> code for sync is 29% of extra redundant code. The other plugins have
>>>> similar numbers. So with those numbers in mind, what do you think?
>>>>
>>>>
>>>> I was counting the lines (w/o comments) in find_delta() based on the
>>>> linked code.  Which functions are you counting?

>>>
>>> I was counting the find_delta, build_additions, and build_removals
>>> methods. Regardless of how the lines are counted, that differencing code is
>>> the duplication I'm talking about. There isn't a way to use the changesets
>>> without duplicating that differencing code in a plugin.
>>>
>>>
>>> The differencing code is limited to find_delta() and perhaps
>>> build_removals().  Agreed, the line count is less useful than specifically
>>> identifying duplicate code.  Outside of find_delta(), I see similar code
>>> (in part because it got copied from the file plugin) but I am not seeing
>>> actual duplication.  Can you be more specific?
>>>
>>
>> Whether the code is very similar or identical, I think it raises the
>> question: why are we having plugin writers do this at all? What value are
>> they creating with it? I don't have a reasonable answer to that question,
>> so the requirement for plugin writers to write that code brings me back to
>> the problem statement: "plugin writers have redundant differencing code
>> when using Changesets". More info on why it is valuable for the plugin
>> writer, rather than the Changesets, to do the differencing would be helpful.
>>
>>
>> The ChangeSet abstraction (and API) is based on the following division of
>> responsibility:
>>
>> The 

Re: [Pulp-dev] Fwd: Re: Changesets Challenges

2018-04-15 Thread Jeff Ortel



On 04/11/2018 10:59 AM, Brian Bouterse wrote:



On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel wrote:















On 04/06/2018 09:15 AM, Brian Bouterse wrote:

Several plugins have started using the Changesets including
pulp_ansible, pulp_python, pulp_file, and perhaps others. The
Changesets provide several distinct points of value which are
great, but there are two challenges I want to bring up. I want to
focus only on the problem statements first.

1. There is redundant "differencing" code in all plugins. The
Changeset interface requires the plugin writer to determine what
units need to be added and those to be removed. This requires all
plugin writers to write the same non-trivial differencing code
over and over. For example, you can see the same non-trivial
differencing code present in pulp_ansible, pulp_file, and pulp_python.
Line-wise, this "differencing" code makes up a large portion
(maybe 50%) of the sync code itself in each plugin.


Ten lines of trivial set logic hardly seems like a big deal but
any duplication is worth exploring.

It's more than ten lines. Take pulp_ansible for example. By my count 
(the linked to section) it's 89 lines, which out of 306 lines of 
plugin code for sync is 29% of extra redundant code. The other plugins 
have similar numbers. So with those numbers in mind, what do you think?


I was counting the lines (w/o comments) in find_delta() based on the 
linked code.  Which functions are you counting?






2. Plugins can't do end-to-end stream processing. The Changesets
themselves do stream processing, but when you call into
changeset.apply_and_drain() you have to have fully parsed the
metadata already. Currently when fetching all metadata from
Galaxy, pulp_ansible takes about 380 seconds (6+ min). This means
that the actual Changeset content downloading starts 380 seconds
later than it could. At the heart of the problem, the
fetching+parsing of the metadata is not part of the stream
processing.


The additions/removals can be any iterable (like a generator) and
by using ChangeSet.apply() and iterating the returned object, the
plugin can "turn the crank" while downloading and processing the
metadata.  The ChangeSet.apply_and_drain() is just a convenience
method.  I don't see how this is a limitation of the ChangeSet.
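
The "turn the crank" behaviour can be sketched with ordinary Python
generators. This is only an illustration of the pattern: apply_stream and
parse_metadata are hypothetical stand-ins, not ChangeSet.apply() or any
real plugin function, and the "download" is just a log entry. The point is
that when the additions are a generator, each metadata page is parsed only
when the consumer asks for the next item, so parsing and downloading
interleave.

```python
from typing import Iterable, Iterator, List

events: List[str] = []  # records the order in which work happens


def parse_metadata(pages: Iterable[List[str]]) -> Iterator[str]:
    """Lazily yield content names, one metadata page at a time."""
    for number, page in enumerate(pages):
        events.append(f"page {number} parsed")
        yield from page


def apply_stream(additions: Iterable[str]) -> Iterator[str]:
    """Hypothetical streaming apply: handle each item as it arrives."""
    for name in additions:
        events.append(f"{name} downloaded")
        yield name  # one "report" per processed item


pages = [["a", "b"], ["c"]]
for report in apply_stream(parse_metadata(pages)):
    pass  # iterating the returned object is what drives the pipeline

print(events)
```

The event log shows the first download happening before the second page of
metadata is parsed. A drain-style convenience wrapper would amount to the
for loop above; whether end-to-end streaming happens depends on whether the
additions passed in are lazy or a fully built list.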


That is new info for me (and maybe everyone). OK, so Changesets have
two interfaces: apply() and apply_and_drain(). Why do we have two
interfaces when apply() can support all existing use cases (that I
know of) and do end-to-end stream processing but apply_and_drain()
cannot? I see all of our examples (and all of our new plugins) using
apply_and_drain().


The ChangeSet.apply() was how I designed (and documented) it.  Not sure 
when/who added the apply_and_drain().  +1 for removing it.






Do you see the same challenges I do? Are these the right problem
statements? I think with clear problem statements a solution will
be easy to see and agree on.


I'm not convinced that these are actual problems/challenges that
need to be addressed in the near term.



Thanks!
Brian


___
Pulp-dev mailing list
Pulp-dev@redhat.com 
https://www.redhat.com/mailman/listinfo/pulp-dev











Re: [Pulp-dev] Fwd: Re: Changesets Challenges

2018-04-15 Thread Brian Bouterse
On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel  wrote:

>
>
>
>
>
>
>
>
>
>
>
>
> On 04/06/2018 09:15 AM, Brian Bouterse wrote:
>
> Several plugins have started using the Changesets including pulp_ansible,
> pulp_python, pulp_file, and perhaps others. The Changesets provide several
> distinct points of value which are great, but there are two challenges I
> want to bring up. I want to focus only on the problem statements first.
>
> 1. There is redundant "differencing" code in all plugins. The Changeset
> interface requires the plugin writer to determine what units need to be
> added and those to be removed. This requires all plugin writers to write
> the same non-trivial differencing code over and over. For example, you can
> see the same non-trivial differencing code present in pulp_ansible,
> pulp_file, and pulp_python.
> Line-wise, this "differencing" code makes up a large portion (maybe 50%) of
> the sync code itself in each plugin.
>
>
> Ten lines of trivial set logic hardly seems like a big deal but any
> duplication is worth exploring.
>
It's more than ten lines. Take pulp_ansible for example. By my count (the
linked to section) it's 89 lines, which out of 306 lines of plugin code for
sync is 29% of extra redundant code. The other plugins have similar
numbers. So with those numbers in mind, what do you think?
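
For readers who have not seen the plugins, the core of the differencing
under discussion is a pair of set differences. The sketch below borrows the
name find_delta from the thread, but its signature and the use of plain
string keys are assumptions made for illustration; the real plugin methods
also build the objects to add and remove, which is omitted here.

```python
from typing import Set, Tuple


def find_delta(remote: Set[str], local: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (additions, removals) needed to make local match remote."""
    additions = remote - local  # in the remote metadata, missing locally
    removals = local - remote   # present locally, gone from the remote
    return additions, removals


remote_keys = {"a-1.0", "b-2.0", "c-3.0"}  # hypothetical content keys
local_keys = {"b-2.0", "d-0.9"}
additions, removals = find_delta(remote_keys, local_keys)
print(sorted(additions), sorted(removals))
```

Because this logic depends only on content keys, it looks the same in every
plugin, which is the duplication the problem statement refers to.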

>
>
> 2. Plugins can't do end-to-end stream processing. The Changesets
> themselves do stream processing, but when you call into
> changeset.apply_and_drain() you have to have fully parsed the metadata
> already. Currently when fetching all metadata from Galaxy, pulp_ansible
> takes about 380 seconds (6+ min). This means that the actual Changeset
> content downloading starts 380 seconds later than it could. At the heart of
> the problem, the fetching+parsing of the metadata is not part of the stream
> processing.
>
>
> The additions/removals can be any iterable (like a generator) and by using
> ChangeSet.apply() and iterating the returned object, the plugin can "turn
> the crank" while downloading and processing the metadata.  The
> ChangeSet.apply_and_drain() is just a convenience method.  I don't see how
> this is a limitation of the ChangeSet.
>

That is new info for me (and maybe everyone). OK, so Changesets have two
interfaces: apply() and apply_and_drain(). Why do we have two interfaces
when apply() can support all existing use cases (that I know of) and do
end-to-end stream processing but apply_and_drain() cannot? I see all of our
examples (and all of our new plugins) using apply_and_drain().


>
>
> Do you see the same challenges I do? Are these the right problem
> statements? I think with clear problem statements a solution will be easy
> to see and agree on.
>
>
> I'm not convinced that these are actual problems/challenges that need to
> be addressed in the near term.
>
>
> Thanks!
> Brian
>
>
>
>
>
>
>


Re: [Pulp-dev] Fwd: Re: Changesets Challenges

2018-04-12 Thread Jeff Ortel



On 04/12/2018 04:00 PM, Brian Bouterse wrote:


On Thu, Apr 12, 2018 at 11:53 AM, Jeff Ortel wrote:




On 04/12/2018 10:01 AM, Brian Bouterse wrote:



On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel wrote:



On 04/11/2018 03:29 PM, Brian Bouterse wrote:

I think we should look into this in the near-term. Changing
an interface on an object used by all plugins will be
significantly easier, earlier.


On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel wrote:



On 04/11/2018 10:59 AM, Brian Bouterse wrote:



On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel wrote:


On 04/06/2018 09:15 AM, Brian Bouterse wrote:

Several plugins have started using the Changesets
including pulp_ansible, pulp_python, pulp_file,
and perhaps others. The Changesets provide several
distinct points of value which are great, but
there are two challenges I want to bring up. I
want to focus only on the problem statements first.

1. There is redundant "differencing" code in all
plugins. The Changeset interface requires the
plugin writer to determine what units need to be
added and those to be removed. This requires all
plugin writers to write the same non-trivial
differencing code over and over. For example, you
can see the same non-trivial differencing code
present in pulp_ansible, pulp_file, and pulp_python.
Line-wise, this "differencing" code makes up a
large portion (maybe 50%) of the sync code itself
in each plugin.


Ten lines of trivial set logic hardly seems like a
big deal but any duplication is worth exploring.

It's more than ten lines. Take pulp_ansible for
example. By my count (the linked to section) it's 89
lines, which out of 306 lines of plugin code for sync
is 29% of extra redundant code. The other plugins have
similar numbers. So with those numbers in mind, what do
you think?


I was counting the lines (w/o comments) in find_delta()
based on the linked code. Which functions are you counting?


I was counting the find_delta, build_additions, and
build_removals methods. Regardless of how the lines are
counted, that differencing code is the duplication I'm
talking about. There isn't a way to use the changesets
without duplicating that differencing code in a plugin.


The differencing code is limited to find_delta() and perhaps
build_removals().  Agreed, the line count is less useful than
specifically identifying duplicate code.  Outside of
find_delta(), I see similar code (in part because it got
copied from file plugin) but not seeing actual duplication. 
Can you be more specific?


Whether it is very similar code or identical code, I think it raises
the question: why are we having plugin writers do this at all? What
value are they creating with it? I don't have a reasonable answer to
that question, so the requirement for plugin writers to write that
code brings me back to the problem statement: "plugin writers
have redundant differencing code when using Changesets". More
info on why it is valuable for the plugin writer to do the
differencing rather than the Changesets would be helpful.


The ChangeSet abstraction (and API) is based on the following division
of responsibility:

The plugin (with an understanding of the remote and its content):
  - Download metadata.
  - Parse metadata.
  - Based on the metadata:
    - determine content to be added to the repository.
      - define how artifacts are downloaded.
      - construct content.
    - determine content to be removed from the repository.

Core (without an understanding of the specific remote or its content):
  - Provide low level API for plugin to affect the changes it has

Re: [Pulp-dev] Fwd: Re: Changesets Challenges

2018-04-12 Thread Brian Bouterse
On Thu, Apr 12, 2018 at 11:53 AM, Jeff Ortel  wrote:

>
>
> On 04/12/2018 10:01 AM, Brian Bouterse wrote:
>
>
>
> On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel  wrote:
>
>>
>>
>> On 04/11/2018 03:29 PM, Brian Bouterse wrote:
>>
>> I think we should look into this in the near-term. Changing an interface
>> on an object used by all plugins will be significantly easier, earlier.
>>
>>
>> On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel  wrote:
>>
>>>
>>>
>>> On 04/11/2018 10:59 AM, Brian Bouterse wrote:
>>>
>>>
>>>
>>> On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel  wrote:
>>>

 On 04/06/2018 09:15 AM, Brian Bouterse wrote:

 Several plugins have started using the Changesets including
 pulp_ansible, pulp_python, pulp_file, and perhaps others. The Changesets
 provide several distinct points of value which are great, but there are two
 challenges I want to bring up. I want to focus only on the problem
 statements first.

 1. There is redundant "differencing" code in all plugins. The Changeset
 interface requires the plugin writer to determine what units need to be
 added and those to be removed. This requires all plugin writers to write
 the same non-trivial differencing code over and over. For example, you can
 see the same non-trivial differencing code present in pulp_ansible,
 pulp_file, and pulp_python.
 Line-wise, this "differencing" code makes up a large portion (maybe 50%) of
 the sync code itself in each plugin.


 Ten lines of trivial set logic hardly seems like a big deal but any
 duplication is worth exploring.

>>> It's more than ten lines. Take pulp_ansible for example. By my count
>>> (the linked to section) it's 89 lines, which out of 306 lines of plugin
>>> code for sync is 29% of extra redundant code. The other plugins have
>>> similar numbers. So with those numbers in mind, what do you think?
>>>
>>>
>>> I was counting the lines (w/o comments) in find_delta() based on the
>>> linked code.  Which functions are you counting?
>>>
>>
>> I was counting the find_delta, build_additions, and build_removals
>> methods. Regardless of how the lines are counted, that differencing code is
>> the duplication I'm talking about. There isn't a way to use the changesets
>> without duplicating that differencing code in a plugin.
>>
>>
>> The differencing code is limited to find_delta() and perhaps
>> build_removals().  Agreed, the line count is less useful than specifically
>> identifying duplicate code.  Outside of find_delta(), I see similar code
>> (in part because it got copied from file plugin) but not seeing actual
>> duplication.  Can you be more specific?
>>
>
> Whether it is very similar code or identical code, I think it raises the
> question: why are we having plugin writers do this at all? What value are
> they creating with it? I don't have a reasonable answer to that question,
> so the requirement for plugin writers to write that code brings me back to
> the problem statement: "plugin writers have redundant differencing code
> when using Changesets". More info on why it is valuable for the plugin
> writer to do the differencing rather than the Changesets would be helpful.
>
>
> The ChangeSet abstraction (and API) is based on the following division of
> responsibility:
>
> The plugin (with an understanding of the remote and its content):
>   - Download metadata.
>   - Parse metadata.
>   - Based on the metadata:
>     - determine content to be added to the repository.
>       - define how artifacts are downloaded.
>       - construct content.
>     - determine content to be removed from the repository.
>
> Core (without an understanding of the specific remote or its content):
>   - Provide a low-level API for the plugin to effect the changes it has
>     determined need to be made to the repository. This is downloaders,
>     models, etc.
>   - Provide a high(er)-level API for the plugin to effect the changes it
>     has determined need to be made to the repository. This is the ChangeSet.
>
> Are you proposing that this is not the correct division?
>

Yes I believe these problem statements suggest we should adjust the plugin
writer's responsibilities when interacting with the Changesets in two
specific ways. It's not exactly the language you used, but I believe the
following two responsibilities could be moved into the Changesets entirely:

- determining if any given Artifact or Content unit is already present in
Pulp (aka computing what needs 

Re: [Pulp-dev] Fwd: Re: Changesets Challenges

2018-04-12 Thread Jeff Ortel



On 04/12/2018 10:01 AM, Brian Bouterse wrote:



On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel wrote:




On 04/11/2018 03:29 PM, Brian Bouterse wrote:

I think we should look into this in the near-term. Changing an
interface on an object used by all plugins will be significantly
easier, earlier.


On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel wrote:



On 04/11/2018 10:59 AM, Brian Bouterse wrote:



On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel wrote:

On 04/06/2018 09:15 AM, Brian Bouterse wrote:

Several plugins have started using the Changesets
including pulp_ansible, pulp_python, pulp_file, and
perhaps others. The Changesets provide several distinct
points of value which are great, but there are two
challenges I want to bring up. I want to focus only on
the problem statements first.

1. There is redundant "differencing" code in all
plugins. The Changeset interface requires the plugin
writer to determine what units need to be added and
those to be removed. This requires all plugin writers
to write the same non-trivial differencing code over
and over. For example, you can see the same non-trivial
differencing code present in pulp_ansible, pulp_file,
and pulp_python.
Line-wise, this "differencing" code makes up a large
portion (maybe 50%) of the sync code itself in each plugin.


Ten lines of trivial set logic hardly seems like a big
deal but any duplication is worth exploring.

It's more than ten lines. Take pulp_ansible for example. By
my count (the linked to section) it's 89 lines, which out of
306 lines of plugin code for sync is 29% of extra redundant
code. The other plugins have similar numbers. So with those
numbers in mind, what do you think?


I was counting the lines (w/o comments) in find_delta() based
on the linked code.  Which functions are you counting?


I was counting the find_delta, build_additions, and
build_removals methods. Regardless of how the lines are counted,
that differencing code is the duplication I'm talking about.
There isn't a way to use the changesets without duplicating that
differencing code in a plugin.


The differencing code is limited to find_delta() and perhaps
build_removals().  Agreed, the line count is less useful than
specifically identifying duplicate code.  Outside of find_delta(),
I see similar code (in part because it got copied from file
plugin) but not seeing actual duplication.  Can you be more specific?


Whether it is very similar code or identical code, I think it raises
the question: why are we having plugin writers do this at all? What
value are they creating with it? I don't have a reasonable answer to
that question, so the requirement for plugin writers to write that code
brings me back to the problem statement: "plugin writers have redundant
differencing code when using Changesets". More info on why it is
valuable for the plugin writer to do the differencing rather than the
Changesets would be helpful.


The ChangeSet abstraction (and API) is based on the following division of
responsibility:

The plugin (with an understanding of the remote and its content):
  - Download metadata.
  - Parse metadata.
  - Based on the metadata:
    - determine content to be added to the repository.
      - define how artifacts are downloaded.
      - construct content.
    - determine content to be removed from the repository.

Core (without an understanding of the specific remote or its content):
  - Provide a low-level API for the plugin to effect the changes it has
    determined need to be made to the repository. This is downloaders,
    models, etc.
  - Provide a high(er)-level API for the plugin to effect the changes it
    has determined need to be made to the repository. This is the ChangeSet.


Are you proposing that this is not the correct division?





So a shorter, simpler problem statement is: "to use the
changesets plugin writers have to do extra work to compute
additions and removals parameters".


This statement ^ is better but still too vague to actually solve. 
 

Re: [Pulp-dev] Fwd: Re: Changesets Challenges

2018-04-12 Thread Brian Bouterse
On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel  wrote:

>
>
> On 04/11/2018 03:29 PM, Brian Bouterse wrote:
>
> I think we should look into this in the near-term. Changing an interface
> on an object used by all plugins will be significantly easier, earlier.
>
>
> On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel  wrote:
>
>>
>>
>> On 04/11/2018 10:59 AM, Brian Bouterse wrote:
>>
>>
>>
>> On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel  wrote:
>>
>>>
>>> On 04/06/2018 09:15 AM, Brian Bouterse wrote:
>>>
>>> Several plugins have started using the Changesets including
>>> pulp_ansible, pulp_python, pulp_file, and perhaps others. The Changesets
>>> provide several distinct points of value which are great, but there are two
>>> challenges I want to bring up. I want to focus only on the problem
>>> statements first.
>>>
>>> 1. There is redundant "differencing" code in all plugins. The Changeset
>>> interface requires the plugin writer to determine what units need to be
>>> added and those to be removed. This requires all plugin writers to write
>>> the same non-trivial differencing code over and over. For example, you can
>>> see the same non-trivial differencing code present in pulp_ansible,
>>> pulp_file, and pulp_python.
>>> Line-wise, this "differencing" code makes up a large portion (maybe 50%) of
>>> the sync code itself in each plugin.
>>>
>>>
>>> Ten lines of trivial set logic hardly seems like a big deal but any
>>> duplication is worth exploring.
>>>
>> It's more than ten lines. Take pulp_ansible for example. By my count (the
>> linked to section) it's 89 lines, which out of 306 lines of plugin code for
>> sync is 29% of extra redundant code. The other plugins have similar
>> numbers. So with those numbers in mind, what do you think?
>>
>>
>> I was counting the lines (w/o comments) in find_delta() based on the
>> linked code.  Which functions are you counting?
>>
>
> I was counting the find_delta, build_additions, and build_removals
> methods. Regardless of how the lines are counted, that differencing code is
> the duplication I'm talking about. There isn't a way to use the changesets
> without duplicating that differencing code in a plugin.
>
>
> The differencing code is limited to find_delta() and perhaps
> build_removals().  Agreed, the line count is less useful than specifically
> identifying duplicate code.  Outside of find_delta(), I see similar code
> (in part because it got copied from file plugin) but not seeing actual
> duplication.  Can you be more specific?
>

Whether it is very similar code or identical code, I think it raises the
question: why are we having plugin writers do this at all? What value are
they creating with it? I don't have a reasonable answer to that question,
so the requirement for plugin writers to write that code brings me back to
the problem statement: "plugin writers have redundant differencing code
when using Changesets". More info on why it is valuable for the plugin
writer to do the differencing rather than the Changesets would be helpful.


>
>
> So a shorter, simpler problem statement is: "to use the changesets plugin
> writers have to do extra work to compute additions and removals parameters".
>
>
> This statement ^ is better but still too vague to actually solve.  Can we
> elaborate on specifically what "to do extra work" means?
>

Sure. Removing that vague language is one way to resolve its vagueness.
Here's a revised problem statement: "to use the changesets plugin writers
have to compute additions and removals parameters". This problem statement
would be resolved by a solution in which the plugin writer never has to
produce these parameters and instead uses an interface that requires less
effort from the plugin writer.
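
To make the idea concrete, here is a rough sketch of what such an interface
could look like. It is not the design from any ticket or PR; the names
declared_content() and sync_declared() and the string keys are hypothetical,
and a plain set stands in for the repository. The plugin only declares what
the remote contains, and the core side works out additions and removals.

```python
from typing import Iterable, Iterator, Set, Tuple


def declared_content() -> Iterator[str]:
    """Plugin side (hypothetical): only describe what the remote contains."""
    yield from ["a-1.0", "b-2.0", "c-3.0"]


def sync_declared(
    declared: Iterable[str], repository: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Core side (hypothetical): derive additions and removals in one pass."""
    seen: Set[str] = set()
    additions: Set[str] = set()
    for key in declared:
        seen.add(key)
        if key not in repository:
            # Additions can be acted on as soon as each key is seen.
            additions.add(key)
    # Removals are only known once the declared stream is exhausted.
    removals = repository - seen
    return additions, removals


additions, removals = sync_declared(declared_content(), {"b-2.0", "d-4.0"})
print(sorted(additions))  # ['a-1.0', 'c-3.0']
print(sorted(removals))   # ['d-4.0']
```

Note that in this sketch the plugin never builds the additions or removals
itself, which is the point of the revised problem statement.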


>
>
>
>>
>>
>>>
>>> 2. Plugins can't do end-to-end stream processing. The Changesets
>>> themselves do stream processing, but when you call into
>>> changeset.apply_and_drain() you have to have fully parsed the metadata
>>> already. Currently when fetching all metadata from Galaxy, pulp_ansible
>>> takes about 380 seconds (6+ min). This means that the actual Changeset
>>> content downloading starts 380 seconds later than it could. At the heart of
>>> the problem, the fetching+parsing of the metadata is not part of the stream
>>> processing.
>>>
>>>
>>> The additions/removals can be any iterable (like a generator) and by
>>> using ChangeSet.apply() and iterating the returned object, the plugin can
>>> "turn the crank" while downloading and processing the 

Re: [Pulp-dev] Fwd: Re: Changesets Challenges

2018-04-11 Thread Brian Bouterse
I think we should look into this in the near-term. Changing an interface on
an object used by all plugins will be significantly easier, earlier.


On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel  wrote:

>
>
> On 04/11/2018 10:59 AM, Brian Bouterse wrote:
>
>
>
> On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel  wrote:
>
>>
>> On 04/06/2018 09:15 AM, Brian Bouterse wrote:
>>
>> Several plugins have started using the Changesets including pulp_ansible,
>> pulp_python, pulp_file, and perhaps others. The Changesets provide several
>> distinct points of value which are great, but there are two challenges I
>> want to bring up. I want to focus only on the problem statements first.
>>
>> 1. There is redundant "differencing" code in all plugins. The Changeset
>> interface requires the plugin writer to determine what units need to be
>> added and those to be removed. This requires all plugin writers to write
>> the same non-trivial differencing code over and over. For example, you can
>> see the same non-trivial differencing code present in pulp_ansible,
>> pulp_file, and pulp_python.
>> Line-wise, this "differencing" code makes up a large portion (maybe 50%) of
>> the sync code itself in each plugin.
>>
>>
>> Ten lines of trivial set logic hardly seems like a big deal but any
>> duplication is worth exploring.
>>
> It's more than ten lines. Take pulp_ansible for example. By my count (the
> linked to section) it's 89 lines, which out of 306 lines of plugin code for
> sync is 29% of extra redundant code. The other plugins have similar
> numbers. So with those numbers in mind, what do you think?
>
>
> I was counting the lines (w/o comments) in find_delta() based on the
> linked code.  Which functions are you counting?
>

I was counting the find_delta, build_additions, and build_removals methods.
Regardless of how the lines are counted, that differencing code is the
duplication I'm talking about. There isn't a way to use the changesets
without duplicating that differencing code in a plugin.

So a shorter, simpler problem statement is: "to use the changesets plugin
writers have to do extra work to compute additions and removals parameters".


>
>
>>
>> 2. Plugins can't do end-to-end stream processing. The Changesets
>> themselves do stream processing, but when you call into
>> changeset.apply_and_drain() you have to have fully parsed the metadata
>> already. Currently when fetching all metadata from Galaxy, pulp_ansible
>> takes about 380 seconds (6+ min). This means that the actual Changeset
>> content downloading starts 380 seconds later than it could. At the heart of
>> the problem, the fetching+parsing of the metadata is not part of the stream
>> processing.
>>
>>
>> The additions/removals can be any iterable (like a generator) and by using
>> ChangeSet.apply() and iterating the returned object, the plugin can "turn
>> the crank" while downloading and processing the metadata.  The
>> ChangeSet.apply_and_drain() is just a convenience method.  I don't see how
>> this is a limitation of the ChangeSet.
>>
>
> That is new info for me (and maybe everyone). OK so Changesets have two
> interfaces. apply() and apply_and_drain(). Why do we have two interfaces
> when apply() can support all existing use cases (that I know of) and do
> end-to-end stream processing but apply_and_drain() cannot? I see all of our
> examples (and all of our new plugins) using apply_and_drain().
>
>
> The ChangeSet.apply() was how I designed (and documented) it.  Not sure
> when/who added the apply_and_drain().  +1 for removing it.
>

I read through the changeset docs. I think this stream processing thing is
still a problem, but perhaps in how we're presenting the Changeset with its
arguments. I don't think apply() versus apply_and_drain() is related at all.
Regardless of whether you are using apply() or apply_and_drain(), the
Changeset requires 'additions' and 'removals' arguments. This sends a
clear message to the plugin writer that they need to compute additions and
removals. They will fetch the metadata to compute these, which is mostly how
the changeset documentation reads. That they could instead pass a generator,
so that the metadata is fetched and parsed from inside the Changeset, is I
feel non-obvious. I want the high-performing implementation to be the
obvious one.

So what about a problem statement like this: "Changesets are presented such
that when you call into them you should already have fetched the metadata"?


>
>
>
>>
>>
>> Do you see the same challenges I do? Are these the 

[Pulp-dev] Fwd: Re: Changesets Challenges

2018-04-10 Thread Jeff Ortel

On 04/06/2018 09:15 AM, Brian Bouterse wrote:
Several plugins have started using the Changesets including 
pulp_ansible, pulp_python, pulp_file, and perhaps others. The 
Changesets provide several distinct points of value which are great, 
but there are two challenges I want to bring up. I want to focus only 
on the problem statements first.


1. There is redundant "differencing" code in all plugins. The 
Changeset interface requires the plugin writer to determine what units 
need to be added and those to be removed. This requires all plugin 
writers to write the same non-trivial differencing code over and over. 
For example, you can see the same non-trivial differencing code 
present in pulp_ansible, pulp_file, and pulp_python.
Line-wise, this "differencing" code makes up a large portion (maybe 
50%) of the sync code itself in each plugin.


Ten lines of trivial set logic hardly seems like a big deal but any 
duplication is worth exploring.
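
For reference, the kind of set logic being referred to can be sketched
roughly as follows. This is only an illustration, not the code from any of
the plugins: the signature and the string keys are assumptions, and only the
name find_delta comes from this thread.

```python
from typing import Iterable, Set, Tuple


def find_delta(
    remote_keys: Iterable[str], local_keys: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (additions, removals) as plain set differences."""
    remote = set(remote_keys)
    local = set(local_keys)
    additions = remote - local  # in the remote metadata, not yet in the repository
    removals = local - remote   # in the repository, no longer in the remote metadata
    return additions, removals


additions, removals = find_delta(
    remote_keys=["a-1.0", "b-2.0", "c-3.0"],
    local_keys=["b-2.0", "d-4.0"],
)
print(sorted(additions))  # ['a-1.0', 'c-3.0']
print(sorted(removals))   # ['d-4.0']
```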




2. Plugins can't do end-to-end stream processing. The Changesets 
themselves do stream processing, but when you call into 
changeset.apply_and_drain() you have to have fully parsed the metadata 
already. Currently when fetching all metadata from Galaxy, 
pulp_ansible takes about 380 seconds (6+ min). This means that the 
actual Changeset content downloading starts 380 seconds later than it 
could. At the heart of the problem, the fetching+parsing of the 
metadata is not part of the stream processing.


The additions/removals can be any iterable (like a generator) and by 
using ChangeSet.apply() and iterating the returned object, the plugin 
can "turn the crank" while downloading and processing the metadata.  The 
ChangeSet.apply_and_drain() is just a convenience method.  I don't see 
how this is a limitation of the ChangeSet.
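
A toy illustration of the difference, under loud assumptions: parse_metadata()
and apply() below are hypothetical stand-ins written for this example, not the
real ChangeSet API, and "fetching" and "downloading" are just log entries. It
only shows how passing a generator lets metadata processing interleave with
content processing, whereas materializing a list first does not.

```python
from typing import Iterable, Iterator, List


def parse_metadata(pages: int, log: List[str]) -> Iterator[str]:
    """Hypothetical metadata fetch: yields one content key per page."""
    for n in range(pages):
        log.append(f"fetch page {n}")
        yield f"unit-{n}"


def apply(additions: Iterable[str], log: List[str]) -> Iterator[str]:
    """Toy stand-in for a streaming apply(): handles additions as they arrive."""
    for key in additions:
        log.append(f"download {key}")
        yield key


def sync(pages: int, eager: bool) -> List[str]:
    log: List[str] = []
    additions: Iterable[str] = parse_metadata(pages, log)
    if eager:
        # Fully parse the metadata before any content work starts.
        additions = list(additions)
    for _ in apply(additions, log):
        pass  # "turn the crank" by iterating the returned object
    return log


print(sync(2, eager=True))
# ['fetch page 0', 'fetch page 1', 'download unit-0', 'download unit-1']
print(sync(2, eager=False))
# ['fetch page 0', 'download unit-0', 'fetch page 1', 'download unit-1']
```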




Do you see the same challenges I do? Are these the right problem 
statements? I think with clear problem statements a solution will be 
easy to see and agree on.


I'm not convinced that these are actual problems/challenges that need to 
be addressed in the near term.




Thanks!
Brian

