[openstack-dev] [Nova] SR-IOV IRC meeting and sub-team - passing the torch

2016-04-18 Thread Nikola Đipanov
Hi team,

As I'll be focusing on different things going forward, I was wondering
if someone from the group of people who were normally working in this
area would want to step up and take over the sub-team IRC meeting.

It is normally not a massive overhead, mostly tracking ongoing efforts
and patches and making sure there's a list of things that need reviews,
so don't be shy :)

Cheers,
Nikola



Re: [openstack-dev] [nova] bug 1550250: "delete_on_termination" flag: how to proceed?

2016-03-29 Thread Nikola Đipanov
On 03/29/2016 09:14 AM, Markus Zoeller wrote:
> The discussion around Bug 1550250 [1] doesn't show a clear path on how
> to proceed. In short, it's about if we want to keep the flag
> "delete_on_termination" for volumes which get migrated. Right now that
> flag gets reset during the migration process. The patch [2] is stalled
> by the missing decision. Would be great if we could come to a conclusion.
> 

I have left more details in the gerrit comments, but in short: I changed
my mind after a bit of back and forth with YaoZheng (the author of the
fix), and I think we should accept the fix.
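
For anyone skimming the bug: the gist of the fix being discussed is simply to
carry the flag over when the attachment is re-created during the migration,
instead of letting it default back to False. A rough, illustrative sketch in
Python (not the actual patch - names and structure are made up):

    # Illustrative only: when a volume attachment is re-created as part of
    # a migration/swap, the new block device mapping should inherit the
    # user's flag rather than silently resetting it.
    def swap_volume_bdm(old_bdm, new_volume_id):
        new_bdm = dict(old_bdm)                 # start from the existing mapping
        new_bdm['volume_id'] = new_volume_id
        # The fix boils down to not dropping this:
        new_bdm['delete_on_termination'] = old_bdm['delete_on_termination']
        return new_bdm

    old = {'volume_id': 'vol-old', 'delete_on_termination': True}
    assert swap_volume_bdm(old, 'vol-new')['delete_on_termination'] is True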

As for this being an API breaking change - this has come up several
times before, and I don't think it is. We need to be able to fix broken
behavior without requiring microversion changes, as long as they don't
change the response format. Of course we should consider each case on
its own based on how big the change is, but we should err on the side
of allowing broken behavior to be fixed without adding (useless) versions.

I don't think it's reasonable for Nova to strive to be bug compatible,
and this has been agreed on several other occasions. It seems to come up
from time to time so we may want to codify it somehow.

Thanks,
N.

> References:
> [1] https://bugs.launchpad.net/nova/+bug/1550250
> [2] https://review.openstack.org/#/c/288433/
> 
> Regards, Markus Zoeller (markus_z)
> 
> 
> 




Re: [openstack-dev] [nova] NUMA + SR-IOV

2016-03-24 Thread Nikola Đipanov
On 03/24/2016 04:18 PM, Sergey Nikitin wrote:
> 
> Hi, folks.
> 
> I want to start a discussion about NUMA + SR-IOV environment. I have a
> two-sockets server. It has two NUMA nodes and only one SR-IOV PCI
> device. This device is associated with the first NUMA node. I booted a
> set of VMs with SR-IOV support. Each of these VMs was booted on the
> first NUMA node. As I understand it happened for better performance (VM
> should be booted in NUMA node which has PCI device for this VM) [1]. 
> 
> But this behavior leaves my 2-sockets machines half-populated. What if I
> don't care about SR-IOV performance? I just want every VM from *any* of
> NUMA nodes to use this single SR-IOV PCI device.
> 
> But I can't do it because of behavior of numa_topology_filter. In this
> filter we want to know if current host has required PCI device [2]. But
> we want to have this device *only* in some numa cell on this host. It is
> hardcoded here [3]. If we do *not* pass variable "cells" to the method
> support_requests() [4] we will boot VM on the current host, if it has
> required PCI device *on host* (maybe not in the same NUMA node). 
> 
> So my question is:
> Is it correct that we *always* want to boot VM in NUMA node associated
> with requested PCI device and user has no choice?
> Or should we give a choice to the user and let him boot a VM with PCI
> device, associated with another NUMA node?
> 

This has come up before, and the fact that it keeps coming up tells me
that we should probably do something about it.

Potentially it makes sense to be lax by default unless the user specifies
that they want to make sure that the device is on the same NUMA node,
but that is not backwards compatible.

It also does not make sense, IMHO, to ask the user to specify that they
don't care: unless you know there is a problem (and users have nowhere
near enough information to tell), there is no reason for you to specify
anything - it's just not sensible UI.
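
To make the behaviour in question concrete: the difference boils down to
whether the PCI pools are filtered by the instance's NUMA cells before the
device counts are checked. A rough sketch, a simplified stand-in for
support_requests() [4] - not the real nova/pci/stats.py code:

    # Strict vs. lax matching of PCI devices to NUMA cells - sketch only.
    def supports_requests(pools, requests, numa_cells=None):
        if numa_cells is not None:
            # Strict (current) behaviour: only devices on the instance's
            # NUMA cells count.
            pools = [p for p in pools if p['numa_node'] in numa_cells]
        free = sum(p['count'] for p in pools)
        return free >= sum(r['count'] for r in requests)

    pools = [{'numa_node': 0, 'count': 1}]      # one SR-IOV device on node 0
    requests = [{'count': 1}]
    print(supports_requests(pools, requests, numa_cells=[1]))   # False - strict
    print(supports_requests(pools, requests))                   # True  - lax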

My 0.02 cents.

N.

> 
> [1]
> https://specs.openstack.org/openstack/nova-specs/specs/kilo/implemented/input-output-based-numa-scheduling.html
> [2]
> https://github.com/openstack/nova/blob/master/nova/scheduler/filters/numa_topology_filter.py#L85
> [3]
> https://github.com/openstack/nova/blob/master/nova/virt/hardware.py#L1246-L1247
> [4] https://github.com/openstack/nova/blob/master/nova/pci/stats.py#L277




Re: [openstack-dev] [Nova] Grant FFE to "Host-state level locking" BP

2016-03-04 Thread Nikola Đipanov
On 03/04/2016 02:06 PM, John Garbutt wrote:
> tl;dr
> As on IRC, I don't think this should get an FFE this cycle.
> 
> On 4 March 2016 at 10:56, Nikola Đipanov <ndipa...@redhat.com> wrote:
>> Hi,
>>
>> The actual BP that links to the approved spec is here: [1] and 2
>> outstanding patches are [2][3].
>>
>> Apart from the usual empathy-inspired reasons to allow this (code's been
>> up for a while, yet only had real review on the last day etc.) which are
>> not related to the technical merit of the work, there is also the fact
>> that two initial patches that add locking around updates of the
>> in-memory host map ([4] and [5]) have already been merged.
>>
>> They add the overhead of locking to the scheduler, but without the final
>> work they don't provide any benefits (races will not be detected,
>> without [2]).
> 
> We could land a patch to drop the synchronized decorators, but it
> seemed like it might still help (the possibly theoretical issue?) of
> two greenlets competing decrementing the same resource counts.
> 
>> I don't have any numbers on this but the result is likely that we made
>> things worse, for the sake of adhering to random and made-up dates.
> 
> For details on the reasons behind our process, please see:
> http://docs.openstack.org/developer/nova/process.html
> 
>> With
>> this in mind I think it only makes sense to do our best to merge the 2
>> outstanding patches.
> 
> Looking at the feature freeze exception criteria:
> https://wiki.openstack.org/wiki/FeatureFreeze
> 
> The code is not ready to merge right now, so it's hard to assess the
> risk of merging it, and hard to assess how long it will take to merge.
> It seems medium-ish risk, given the existing patches.
> 
> We have had 2 FFEs, just for things that were +W'd but not yet merged when
> we cut mitaka-3. They are all merged now.
> 
> Time is much tighter this cycle than usual. We also seem to have less
> reviewers doing reviews than normal for this point in the cycle, and a
> much bigger backlog of bug fixes to review. We only have about 7 more
> working days between now and tagging RC1, at which point master opens
> for Newton, and these patches are free to merge again.
> 
> While this is useful, it's not a regression. It would help us detect
> races in the scheduler sooner. It does not feel release critical.
> 

Thanks for the response John,

If we take "release critical" to mean "Nova not able to start VMs if we
don't have this", then no - it's not release critical.

But it does mean that people consuming releases will not get to use this
and consequently find and report bugs for another 6 months.

On a more personal note - this is the second thing that I was involved
with this cycle that got accepted, only to get half merged over a random
deadline. The other one being [1], which was just integration work that
would make a lot of other work that went in this cycle (in both Nova and
Neutron) usable. Again - the result is, we have the code in tree, but no
one can use it and test it.

Even if I try to keep my personal feelings out of this - I still feel
that this is a massive waste that we are happy to accept for practically
0 gain.

N.

[1]
https://blueprints.launchpad.net/openstack/?searchtext=sriov-pf-passthrough-neutron-port



[openstack-dev] [Nova] Grant FFE to "Host-state level locking" BP

2016-03-04 Thread Nikola Đipanov
Hi,

The actual BP that links to the approved spec is here: [1] and 2
outstanding patches are [2][3].

Apart from the usual empathy-inspired reasons to allow this (code's been
up for a while, yet only had real review on the last day etc.) which are
not related to the technical merit of the work, there is also the fact
that two initial patches that add locking around updates of the
in-memory host map ([4] and [5]) have already been merged.

They add the overhead of locking to the scheduler, but without the final
work they don't provide any benefits (races will not be detected,
without [2]).

I don't have any numbers on this but the result is likely that we made
things worse, for the sake of adhering to random and made-up dates. With
this in mind I think it only makes sense to do our best to merge the 2
outstanding patches.

Cheers,
N.

[1]
https://blueprints.launchpad.net/openstack/?searchtext=host-state-level-locking
[2] https://review.openstack.org/#/c/262938/
[3] https://review.openstack.org/#/c/262939/

[4] https://review.openstack.org/#/c/259891/
[5] https://review.openstack.org/#/c/259892/



Re: [openstack-dev] [Nova] Re-booting the SR-IOV meeting

2016-02-29 Thread Nikola Đipanov
On 02/26/2016 10:35 PM, Shinobu Kinjo wrote:
> Hello,
> 
> Thank you for your message.
> Is there any list of only SR-IOV related bugfixes? If there is any
> pointer, that would be very useful.
> 

Hi - so I think that one of the main goals of the meeting tomorrow will
be to figure out a list of fixes we want to have reviewed and hopefully
merged in Mitaka.

I don't think there is a straightforward and exhaustive way to find these
(one trick I use is to look for changes that touch a certain file, so
everything under nova/pci/ would be suspect, but there may be more).
This is why I think it would be good to meet and figure this out together.

N.

> Cheers,
> Shinobu
> 
> 
> On Fri, Feb 26, 2016 at 11:58 PM, Nikola Đipanov <ndipa...@redhat.com> wrote:
>> Hello folks,
>>
>> We are closing in on Mitaka-3 and it would be good to have a meeting and
>> see which SR-IOV related bugfixes we want to try to land for Mitaka.
>>
>> The next meeting slot is next week on Tuesday (March 1st) at 13:00 UTC,
>> so if there are any bugs you would like to discuss - that would be a
>> good place.
>>
>> The meeting times haven't changed - but there were not a lot of
>> interested parties for the last couple of meetings, so I thought I
>> should send an email to remind folks that the meeting is still on :)
>>
>> Talk to you then,
>>
>> Cheers,
>> N.
>>
> 
> 
> 




[openstack-dev] [Nova] Re-booting the SR-IOV meeting

2016-02-26 Thread Nikola Đipanov
Hello folks,

We are closing in on Mitaka-3 and it would be good to have a meeting and
see which SR-IOV related bugfixes we want to try to land for Mitaka.

The next meeting slot is next week on Tuesday (March 1st) at 13:00 UTC,
so if there are any bugs you would like to discuss - that would be a
good place.

The meeting times haven't changed - but there were not a lot of
interested parties for the last couple of meetings, so I thought I
should send an email to remind folks that the meeting is still on :)

Talk to you then,

Cheers,
N.



Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-18 Thread Nikola Đipanov
On 02/15/2016 09:27 AM, Sylvain Bauza wrote:
> 
> 
> Le 15/02/2016 06:21, Cheng, Yingxin a écrit :
>>
>> Hi,
>>
>>  
>>
>> I’ve uploaded a prototype https://review.openstack.org/#/c/280047/
>> to demonstrate its design goals
>> in accuracy, performance, reliability and compatibility improvements.
>> It will also be an Austin Summit Session if elected:
>> https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316
>>
>>
>>  
>>
>> I want to gather opinions about this idea:
>>
>> 1. Is this feature possible to be accepted in the Newton release?
>>
> 
> Such feature requires a spec file to be written
> http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged
> 
> Ideally, I'd like to see your below ideas written in that spec file so
> it would be the best way to discuss on the design.
> 
> 

I really cannot help but protest this!

There is actual code posted, and we go back and ask people to write
documents without even bothering to look at the code. That makes no
sense to me!

I'll go and comment on the proposed code:

https://review.openstack.org/#/c/280047/

Which has infinitely more information about the idea than a random text
document.

>> 2. Suggestions to improve its design and compatibility.
>>
> 
> I don't want to go into details here (that's rather the goal of the spec
> for that), but my biggest concerns would be when reviewing the spec :
>  - how this can meet the OpenStack mission statement (ie. ubiquitous
> solution that would be easy to install and massively scalable)
>  - how this can be integrated with the existing (filters, weighers) to
> provide a clean and simple path for operators to upgrade
>  - how this can be supporting rolling upgrades (old computes sending
> updates to new scheduler)
>  - how can we test it
>  - can we have the feature optional for operators
> 

This is precisely how we make sure there is no innovation happening in
Nova ever.

Not all of the above have to be answered for the idea to have technical
merit and be useful to some users. We should be happy to have feature
branches like this available for people to try out and use and iterate
on before we slam developers with our "you need to be this tall to ride"
list.

N.

> 
>> 3. Possibilities to integrate with resource-provider bp series: I know
>> resource-provider is the major direction of Nova scheduler, and there
>> will be fundamental changes in the future, especially according to the
>> bp
>> https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
>> However, this prototype proposes a much faster and compatible way to
>> make schedule decisions based on scheduler caches. The in-memory
>> decisions are made at the same speed with the caching scheduler, but
>> the caches are kept consistent with compute nodes as quickly as
>> possible without db refreshing.
>>
>>  
>>
> 
> That's the key point, thanks for noticing our priorities. So, you know
> that our resource modeling is drastically subject to change in Mitaka
> and Newton. That is the new game, so I'd love to see how you plan to
> interact with that.
> Ideally, I'd appreciate if Jay Pipes, Chris Dent and you could share
> your ideas because all of you are having great ideas to improve a
> current frustrating solution.
> 
> -Sylvain
> 
> 
>> Here is the detailed design of the mentioned prototype:
>>
>>  
>>
>> >>
>>
>> Background:
>>
>> The host state cache maintained by host manager is the scheduler
>> resource view during schedule decision making. It is updated whenever
>> a request is received[1], and all the compute node records are
>> retrieved from db every time. There are several problems in this
>> update model, proven in experiments[3]:
>>
>> 1. Performance: The scheduler performance is largely affected by db
>> access in retrieving compute node records. The db block time of a
>> single request is 355ms in average in the deployment of 3 compute
>> nodes, compared with only 3ms in in-memory decision-making. Imagine
>> there could be at most 1k nodes, even 10k nodes in the future.
>>
>> 2. Race conditions: This is not only a parallel-scheduler problem, but
>> also a problem using only one scheduler. The detailed analysis of
>> one-scheduler-problem is located in bug analysis[2]. In short, there
>> is a gap between the scheduler makes a decision in host state cache
>> and the
>>
>> compute node updates its in-db resource record according to that
>> decision in resource tracker. A recent scheduler resource consumption
>> in cache can be lost and overwritten by compute node data because of
>> it, resulting in cache inconsistency and unexpected retries. In a
>> one-scheduler experiment using a 3-node deployment, there were 7 retries
>> out of 31 concurrent schedule requests recorded, resulting in 22.6%
>> extra performance overhead.
>>
>> 3. Parallel scheduler support: The design of filter scheduler leads to
>> 

Re: [openstack-dev] [nova] Better tests for nova scheduler(esp. race conditions)?

2016-01-27 Thread Nikola Đipanov
Top posting since better scheduler testing just got brought up during
the midcycle meetup, so it might be useful to re-kindle this thread.

Sean (Dague) brought up that there is some infrastructure already that
could help us do what you propose below, but work may be needed to make
it viable for proper resource accounting tests.

Yingxin - in case you are still interested in doing some of this stuff,
we can discuss here or on IRC.

Thanks,
Nikola

On 12/15/2015 03:33 AM, Cheng, Yingxin wrote:
> 
>> -Original Message-
>> From: Nikola Đipanov [mailto:ndipa...@redhat.com]
>> Sent: Monday, December 14, 2015 11:11 PM
>> To: OpenStack Development Mailing List (not for usage questions)
>> Subject: Re: [openstack-dev] [nova] Better tests for nova scheduler(esp. race
>> conditions)?
>>
>> On 12/14/2015 08:20 AM, Cheng, Yingxin wrote:
>>> Hi All,
>>>
>>>
>>>
>>> When I was looking at bugs related to race conditions of scheduler
>>> [1-3], it feels like nova scheduler lacks sanity checks of schedule
>>> decisions according to different situations. We cannot even make sure
>>> that some fixes successfully mitigate race conditions to an acceptable
>>> scale. For example, there is no easy way to test whether server-group
>>> race conditions still exists after a fix for bug[1], or to make sure
>>> that after scheduling there will be no violations of allocation ratios
>>> reported by bug[2], or to test that the retry rate is acceptable in
>>> various corner cases proposed by bug[3]. And there will be much more
>>> in this list.
>>>
>>>
>>>
>>> So I'm asking whether there is a plan to add those tests in the
>>> future, or is there a design exist to simplify writing and executing
>>> those kinds of tests? I'm thinking of using fake databases and fake
>>> interfaces to isolate the entire scheduler service, so that we can
>>> easily build up a disposable environment with all kinds of fake
>>> resources and fake compute nodes to test scheduler behaviors. It is
>>> even a good way to test whether scheduler is capable to scale to 10k
>>> nodes without setting up 10k real compute nodes.
>>>
>>
>> This would be a useful effort - however do not assume that this is going to 
>> be an
>> easy task. Even in the paragraph above, you fail to take into account that in
>> order to test the scheduling you also need to run all compute services since
>> claims work like a kind of 2 phase commit where a scheduling decision gets
>> checked on the destination compute host (through Claims logic), which 
>> involves
>> locking in each compute process.
>>
> 
> Yes, the final goal is to test the entire scheduling process including 2PC. 
> As the scheduler is still in the process of being decoupled, some parts such
> as the RT and retry mechanism are highly coupled with nova, thus IMO it is
> not a good idea to include them at this stage. I'll try to isolate the
> filter scheduler as the first step, and hope the community supports that.
> 
> 
>>>
>>>
>>> I'm also interested in the bp[4] to reduce scheduler race conditions
>>> in green-thread level. I think it is a good start point in solving the
>>> huge racing problem of nova scheduler, and I really wish I could help on 
>>> that.
>>>
>>
>> I proposed said blueprint but am very unlikely to have any time to work on 
>> it this
>> cycle, so feel free to take a stab at it. I'd be more than happy to 
>> prioritize any
>> reviews related to the above BP.
>>
>> Thanks for your interest in this
>>
>> N.
>>
> 
> Many thanks Nikola! I'm still looking at the claim logic and trying to find
> a way to merge it with the scheduler host state; I will upload patches as
> soon as I figure it out.
> 
> 
>>>
>>>
>>>
>>>
>>> [1] https://bugs.launchpad.net/nova/+bug/1423648
>>>
>>> [2] https://bugs.launchpad.net/nova/+bug/1370207
>>>
>>> [3] https://bugs.launchpad.net/nova/+bug/1341420
>>>
>>> [4]
>>> https://blueprints.launchpad.net/nova/+spec/host-state-level-locking
>>>
>>>
>>>
>>>
>>>
>>> Regards,
>>>
>>> -Yingxin
>>>
> 
> 
> 
> Regards,
> -Yingxin
> 
> 




[openstack-dev] [Nova] PCI CI is down

2016-01-20 Thread Nikola Đipanov
Hey Nova,

It seems a bug [1] sneaked in that made the PCI CI jobs fail 100% of
the time so they got turned off.

The fix for the bug should be making its way through the queue soon, but it
was hinted on the review that there may be further problems. I'd like to
help fix these issues ASAP as the regression seems fairly fresh, but
debugging is hard since the CI is offline (instead of just non-voting)
so I can't really access any logs.

It'd be great if someone from the team helping out with the Intel PCI CI
would bring it back online as a non-voting job so that we can have
feedback which will surely help us fix it more quickly.

Cheers,
Nikola

[1] https://bugs.launchpad.net/nova/+bug/1535367



Re: [openstack-dev] [Nova] PCI CI is down

2016-01-20 Thread Nikola Đipanov
On 01/20/2016 03:03 PM, Lenny Verkhovsky wrote:
> Nikola,
> I guess Mellanox CI (running, failing, non-voting) can assist on this
> issue, since we found it and opened a bug.
> 

Ah - I was not aware that they use the same hw/configuration. Will take
a look!

Thanks,
N.

> 
> -Original Message-
> From: Nikola Đipanov [mailto:ndipa...@redhat.com] 
> Sent: Wednesday, January 20, 2016 4:41 PM
> To: OpenStack Development Mailing List <openstack-dev@lists.openstack.org>
> Subject: [openstack-dev] [Nova] PCI CI is down
> 
> Hey Nova,
> 
> It seems a bug [1] sneaked in that made the PCI CI jobs fail 100% of the
> time so they got turned off.
> 
> The fix for the bug should be making its way through the queue soon, but it was
> hinted on the review that there may be further problems. I'd like to help fix 
> these issues ASAP as the regression seems fairly fresh, but debugging is hard 
> since the CI is offline (instead of just non-voting) so I can't really access 
> any logs.
> 
> It'd be great if someone from the team helping out with the Intel PCI CI 
> would bring it back online as a non-voting job so that we can have feedback 
> which will surely help us fix it more quickly.
> 
> Cheers,
> Nikola
> 
> [1] https://bugs.launchpad.net/nova/+bug/1535367
> 
> 




Re: [openstack-dev] [nova] Better tests for nova scheduler(esp. race conditions)?

2015-12-15 Thread Nikola Đipanov
On 12/15/2015 03:33 AM, Cheng, Yingxin wrote:
> 
>> -Original Message-
>> From: Nikola Đipanov [mailto:ndipa...@redhat.com]
>> Sent: Monday, December 14, 2015 11:11 PM
>> To: OpenStack Development Mailing List (not for usage questions)
>> Subject: Re: [openstack-dev] [nova] Better tests for nova scheduler(esp. race
>> conditions)?
>>
>> On 12/14/2015 08:20 AM, Cheng, Yingxin wrote:
>>> Hi All,
>>>
>>>
>>>
>>> When I was looking at bugs related to race conditions of scheduler
>>> [1-3], it feels like nova scheduler lacks sanity checks of schedule
>>> decisions according to different situations. We cannot even make sure
>>> that some fixes successfully mitigate race conditions to an acceptable
>>> scale. For example, there is no easy way to test whether server-group
>>> race conditions still exists after a fix for bug[1], or to make sure
>>> that after scheduling there will be no violations of allocation ratios
>>> reported by bug[2], or to test that the retry rate is acceptable in
>>> various corner cases proposed by bug[3]. And there will be much more
>>> in this list.
>>>
>>>
>>>
>>> So I'm asking whether there is a plan to add those tests in the
>>> future, or is there a design exist to simplify writing and executing
>>> those kinds of tests? I'm thinking of using fake databases and fake
>>> interfaces to isolate the entire scheduler service, so that we can
>>> easily build up a disposable environment with all kinds of fake
>>> resources and fake compute nodes to test scheduler behaviors. It is
>>> even a good way to test whether scheduler is capable to scale to 10k
>>> nodes without setting up 10k real compute nodes.
>>>
>>
>> This would be a useful effort - however do not assume that this is going to 
>> be an
>> easy task. Even in the paragraph above, you fail to take into account that in
>> order to test the scheduling you also need to run all compute services since
>> claims work like a kind of 2 phase commit where a scheduling decision gets
>> checked on the destination compute host (through Claims logic), which 
>> involves
>> locking in each compute process.
>>
> 
> Yes, the final goal is to test the entire scheduling process including 2PC. 
> As the scheduler is still in the process of being decoupled, some parts such
> as the RT and retry mechanism are highly coupled with nova, thus IMO it is
> not a good idea to include them at this stage. I'll try to isolate the
> filter scheduler as the first step, and hope the community supports that.
> 
> 
>>>
>>>
>>> I'm also interested in the bp[4] to reduce scheduler race conditions
>>> in green-thread level. I think it is a good start point in solving the
>>> huge racing problem of nova scheduler, and I really wish I could help on 
>>> that.
>>>
>>
>> I proposed said blueprint but am very unlikely to have any time to work on 
>> it this
>> cycle, so feel free to take a stab at it. I'd be more than happy to 
>> prioritize any
>> reviews related to the above BP.
>>
>> Thanks for your interest in this
>>
>> N.
>>
> 
> Many thanks Nikola! I'm still looking at the claim logic and trying to find
> a way to merge it with the scheduler host state; I will upload patches as
> soon as I figure it out.
> 

Great!

Note that that step is not necessary - and indeed it may not be the best
place to start. We already have code duplication between the claims and
(what has only recently been renamed) consume_from_request, so removing
it is a nice-to-have but really not directly related to fixing the races.

Also after Sylvain's work here https://review.openstack.org/#/c/191251/
it will be trickier to do as the scheduler side now uses the RequestSpec
object instead of Instance, which is not sent over to compute nodes.

I'd personally leave that for last.

N.

> 
>>>
>>>
>>>
>>>
>>> [1] https://bugs.launchpad.net/nova/+bug/1423648
>>>
>>> [2] https://bugs.launchpad.net/nova/+bug/1370207
>>>
>>> [3] https://bugs.launchpad.net/nova/+bug/1341420
>>>
>>> [4]
>>> https://blueprints.launchpad.net/nova/+spec/host-state-level-locking
>>>
>>>
>>>
>>>
>>>
>>> Regards,
>>>
>>> -Yingxin
>>>
> 
> 
> 
> Regards,
> -Yingxin
> 
> 




Re: [openstack-dev] [nova] Better tests for nova scheduler(esp. race conditions)?

2015-12-14 Thread Nikola Đipanov
On 12/14/2015 08:20 AM, Cheng, Yingxin wrote:
> Hi All,
> 
>  
> 
> When I was looking at bugs related to race conditions of scheduler
> [1-3], it feels like nova scheduler lacks sanity checks of schedule
> decisions according to different situations. We cannot even make sure
> that some fixes successfully mitigate race conditions to an acceptable
> scale. For example, there is no easy way to test whether server-group
> race conditions still exists after a fix for bug[1], or to make sure
> that after scheduling there will be no violations of allocation ratios
> reported by bug[2], or to test that the retry rate is acceptable in
> various corner cases proposed by bug[3]. And there will be much more in
> this list.
> 
>  
> 
> So I'm asking whether there is a plan to add those tests in the future,
> or is there a design exist to simplify writing and executing those kinds
> of tests? I'm thinking of using fake databases and fake interfaces to
> isolate the entire scheduler service, so that we can easily build up a
> disposable environment with all kinds of fake resources and fake compute
> nodes to test scheduler behaviors. It is even a good way to test whether
> scheduler is capable to scale to 10k nodes without setting up 10k real
> compute nodes.
>

This would be a useful effort - however do not assume that this is going
to be an easy task. Even in the paragraph above, you fail to take into
account that in order to test the scheduling you also need to run all
compute services since claims work like a kind of 2 phase commit where a
scheduling decision gets checked on the destination compute host
(through Claims logic), which involves locking in each compute process.
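
To put the 2 phase commit analogy into (very simplified) code - this is just
the shape of the interaction, not Nova's actual Claims/resource tracker code:

    import threading

    class FakeComputeNode:
        """Toy model of the compute-side claim (the 'second phase')."""
        def __init__(self, free_ram_mb):
            self.free_ram_mb = free_ram_mb
            self._lock = threading.Lock()   # stands in for the RT semaphore

        def claim(self, ram_mb):
            # The scheduler already picked this host (first phase); the claim
            # re-checks and commits that decision under a lock, or aborts,
            # which is what triggers a reschedule/retry.
            with self._lock:
                if self.free_ram_mb < ram_mb:
                    raise Exception('ComputeResourcesUnavailable')
                self.free_ram_mb -= ram_mb

    node = FakeComputeNode(free_ram_mb=2048)
    node.claim(1024)    # succeeds
    node.claim(2048)    # raises - the scheduler's view was stale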

>  
> 
> I'm also interested in the bp[4] to reduce scheduler race conditions in
> green-thread level. I think it is a good start point in solving the huge
> racing problem of nova scheduler, and I really wish I could help on that.
> 

I proposed said blueprint but am very unlikely to have any time to work
on it this cycle, so feel free to take a stab at it. I'd be more than
happy to prioritize any reviews related to the above BP.

Thanks for your interest in this

N.

>  
> 
>  
> 
> [1] https://bugs.launchpad.net/nova/+bug/1423648
> 
> [2] https://bugs.launchpad.net/nova/+bug/1370207
> 
> [3] https://bugs.launchpad.net/nova/+bug/1341420
> 
> [4] https://blueprints.launchpad.net/nova/+spec/host-state-level-locking
> 
>  
> 
>  
> 
> Regards,
> 
> -Yingxin
> 
>  
> 
> 
> 
> 




Re: [openstack-dev] [oslo][messaging] configurable ack-then-process (at least/most once) behavior

2015-12-01 Thread Nikola Đipanov
On 11/30/2015 01:28 PM, Bogdan Dobrelya wrote:
> Hello.
> Please let's make this change [0] happen to the Oslo messaging.
> This is reasonable, straightforward and backwards compatible change. And
> it is required for OpenStack applications - see [1] - to implement a
> sane HA. The only thing left is to cover this change by unit tests.
> 
> [0] https://review.openstack.org/229186
> [1]
> http://lists.openstack.org/pipermail/openstack-dev/2015-October/076217.html
> 

I've also looked into doing something like this for a use case very
similar to Mistral a few months back, and my investigation came to a
similar conclusion to what Mehdi (sileht) commented on the above patch -
you can't do this because it changes the semantics of current cast and
call methods based on how the server is declared without the client
being aware - that's a bad way to design an API.

What I came up with as a possible design back then, in case you want to
use oslo.messaging to dispatch async tasks that should be done at least
once (for example to be safe against worker crashing), is add a new
method to the oslo.messaging client interface.

For example - we may want to call it something like ensure() or similar
so that it is clear what the semantics are, and we want to be careful to
not tie its semantics to the AMQP model too much.
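
Purely to illustrate what I mean by a separate method - ensure() is *not* an
existing oslo.messaging API; the name, signature and stub client below are
made up:

    # Hypothetical sketch: at-least-once delivery as something the caller
    # opts into explicitly, instead of a server-side option that silently
    # changes what cast()/call() mean.
    class FakeRPCClient(object):
        def cast(self, ctxt, method, **kwargs):
            print('at-most-once, fire and forget:', method, kwargs)

        def call(self, ctxt, method, **kwargs):
            print('blocking rpc call:', method, kwargs)

        def ensure(self, ctxt, method, **kwargs):
            # The message would only be acked after the worker has finished
            # processing it, so a crashed worker means redelivery.
            print('at-least-once, ack-after-process:', method, kwargs)

    client = FakeRPCClient()
    client.ensure({}, 'run_task', task_id=42)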

I haven't looked into how it can be implemented in depth, but it would
surely be more than your above patch, as you would need to evolve
several interfaces in oslo.messaging to make this happen.

I'm not an oslo.messaging maintainer, and am actually a total bystander
in this matter so feel free to disregard this as an irrelevant opinion,
however maybe Mehdi and some of the oslo.messaging folks will comment
further.

N.



Re: [openstack-dev] [Nova] [Neutron] SR-IOV subteam

2015-11-12 Thread Nikola Đipanov
Top posting since I wanted to just add the [Neutron] tag to the subject
as I imagine there are a few folks in Neutron-land who will be
interested in this.

We had the first meeting this week [1] and there were some cross project
topics mentioned (especially around scheduling) so feel free to review
and comment.

[1]
http://eavesdrop.openstack.org/meetings/sriov/2015/sriov.2015-11-10-13.09.log.html

On 11/10/2015 01:42 AM, Nikola Đipanov wrote:
> On 11/04/2015 07:56 AM, Moshe Levi wrote:
>> Maybe we can use the pci-passthrough meeting slot
>> http://eavesdrop.openstack.org/#PCI_Passthrough_Meeting
>> It's been a long time since we had a meeting.
>>
> 
> I think that slot works well (at least for me). I'd maybe change the
> cadence to bi-weekly in the beginning and see if we need to increase it
> as the cycle progresses.
> 
> Here's the patch proposing the said changes:
> 
> https://review.openstack.org/243382
> 
> On 11/09/2015 06:33 PM, Beliveau, Ludovic wrote:
>> Is there a meeting planned for this week ?
>>
>> Thanks,
>> /ludovic
> 
> Why not - let's have it today 13:00 UTC as the above patch suggests and
> chat more on there.
> 
> Thanks,
> N.
> 
> 




Re: [openstack-dev] [Nova] SR-IOV subteam

2015-11-09 Thread Nikola Đipanov
On 11/04/2015 07:56 AM, Moshe Levi wrote:
> Maybe we can use the pci-passthrough meeting slot
> http://eavesdrop.openstack.org/#PCI_Passthrough_Meeting
> It's been a long time since we had a meeting.
> 

I think that slot works well (at least for me). I'd maybe change the
cadence to bi-weekly in the beginning and see if we need to increase it
as the cycle progresses.

Here's the patch proposing the said changes:

https://review.openstack.org/243382

On 11/09/2015 06:33 PM, Beliveau, Ludovic wrote:
> Is there a meeting planned for this week ?
>
> Thanks,
> /ludovic

Why not - let's have it today 13:00 UTC as the above patch suggests and
chat more on there.

Thanks,
N.



Re: [openstack-dev] [nova] Proposal to add Alex Xu to nova-core

2015-11-09 Thread Nikola Đipanov
On 11/06/2015 03:32 PM, John Garbutt wrote:
> Hi,
> 
> I propose we add Alex Xu[1] to nova-core.
> 

+1




Re: [openstack-dev] [nova] Proposal to add Sylvain Bauza to nova-core

2015-11-09 Thread Nikola Đipanov
On 11/06/2015 03:32 PM, John Garbutt wrote:
> Hi,
> 
> I propose we add Sylvain Bauza[1] to nova-core.
> 

+1




[openstack-dev] [Nova] SR-IOV subteam

2015-11-02 Thread Nikola Đipanov
Hello Nova,

Looking at Mitaka specs, but also during the Tokyo design summit
sessions, we've seen several discussions and requests for enhancements
to the Nova SR-IOV functionality.

It has been brought up during the Summit that we may want to organize as
a subteam to track all of the efforts better and make sure we get all
the expert reviews on stuff more quickly.

I have already added an entry on the subteams page [1] and on the
reviews etherpad for Mitaka [2]. We may also want to have a meeting
slot. As I am out for the week, I'll let others propose a time for it
(that will hopefully work for all interested parties and their
timezones) and we can take it from there next week.

As always - comments and suggestions much appreciated.

Many thanks,
Nikola

[1] https://wiki.openstack.org/wiki/Nova#Nova_subteams
[2] https://etherpad.openstack.org/p/mitaka-nova-priorities-tracking



Re: [openstack-dev] [Nova] Migration state machine proposal.

2015-10-22 Thread Nikola Đipanov
On 10/21/2015 10:17 PM, Joshua Harlow wrote:
> Question on some things seen in the below paste.
> 
> What is with 'finished' -> 'reverted' and 'finished' -> 'confirmed'?
> 
> Why does it jump over 'reverting' or 'confirming'? Should it?
> 
> The other question is the difference between 'failed' and 'error' in the
> first diagram, any idea on why/how these are semantically different? The
> difference between 'done' and 'finished' are also in my mind
> semantically confusing.
> 
> Overall I'm very much inclined to have three state machines (one for
> each type), vs the mix-mash of all three into one state machine (which
> causes the confusion around states in the first diagram in that paste).
> 

So the problem here is that they (as you point out) grew organically,
and we are exposing these through the API. We need to keep them, and I
see this BP as simply documenting them with automaton thrown in for its
validation and documenting features.

So - we _do not_ want to change these. Think of them as information for
human consumption.

What we may want to do is add an additional field (called state instead
of status maybe), that we can use to re-boot states, and define better
state machines that are easier to write tooling against. This is a
separate effort, that will surely need a spec and a discussion to get
the states right.

That's what we (or at least I) were talking about.
N.

> Josh
> 
> Tang Chen wrote:
>> Hi,
>>
>> Please help to take a look at this problem. I was trying to raise it in
>> the spec discussion.
>> But since we don't need a spec on this problem, so I want to discuss it
>> here.
>> It is about what the new state machine will be.
>>
>> http://paste.openstack.org/show/476954/
>>
>> Thanks.
>>
>>
>>
>>
> 




Re: [openstack-dev] [Nova] Migration state machine proposal.

2015-10-19 Thread Nikola Đipanov
On 10/19/2015 11:13 AM, Tang Chen wrote:
> Hi, all,
> 
> If you don't mind, how about approve the BP, and I can start this work.
> 

This is IMHO the biggest drawback of the current spec process (as I've
written before).

There is no reason why you should doubt that this particular spec will
get approved, yet due to the combination of extremely limited review
bandwidth and very aggressive deadlines, there is a good chance your
spec will miss the Mitaka release purely on process grounds.

This makes you even further dissuaded from actually putting in the
development effort.

N.

> Thanks. 
> 
> 
> On 10/15/2015 04:53 PM, Tang Chen wrote:
>> Hi all,
>>
>> The spec is now available here:
>> https://review.openstack.org/#/c/235169/
>>
>> Please help to review.
>>
>> Thanks.
>>
>> On 10/14/2015 10:05 AM, Tang Chen wrote:
>>> Hi, all,
>>>
>>> Please help to review this BP.
>>>
>>> https://blueprints.launchpad.net/nova/+spec/live-migration-state-machine
>>>
>>>
>>> Currently, the migration_status field in Migration object is
>>> indicating the
>>> status of migration process. But in the current code, it is represented
>>> by pure string, like 'migrating', 'finished', and so on.
>>>
>>> The strings could be confusing to different developers, e.g. there are 3
>>> statuses representing the migration process is over successfully:
>>> 'finished', 'completed' and 'done'.
>>> And 2 for migration in process: 'running' and 'migrating'.
>>>
>>> So I think we should use constants or enum for these statuses.
>>>
>>>
>>> Furthermore, Nikola has proposed to create a state machine for the
>>> statuses,
>>> which is part of another abandoned BP. And this is also the work I'd
>>> like to go
>>> on with. Please refer to:
>>> https://review.openstack.org/#/c/197668/
>>> 
>>> https://review.openstack.org/#/c/197669/
>>> 
>>>
>>>
>>> Another proposal is: introduce a new member named "state" into Migration.
>>> Use a state machine to handle this Migration.state, and leave
>>> migration_status
>>> field a descriptive human readable free-form.
>>>
>>>
>>> So how do you think ?
>>>
>>> Thanks.
>>>
>>>
>>
>>
>>
> 
> 
> 
> 




Re: [openstack-dev] [Nova] Migration state machine proposal.

2015-10-14 Thread Nikola Đipanov
On 10/14/2015 04:29 AM, Tang Chen wrote:
>>
>> On Wed, Oct 14, 2015 at 10:05 AM, Tang Chen wrote:
>>
>> Hi, all,
>>
>> Please help to review this BP.
>>
>> https://blueprints.launchpad.net/nova/+spec/live-migration-state-machine
>>
>>
>> Currently, the migration_status field in Migration object is
>> indicating the
>> status of migration process. But in the current code, it is
>> represented
>> by pure string, like 'migrating', 'finished', and so on.
>>
>> The strings could be confusing to different developers, e.g. there
>> are 3
>> statuses representing the migration process is over successfully:
>> 'finished', 'completed' and 'done'.
>> And 2 for migration in process: 'running' and 'migrating'.
>>
>> So I think we should use constants or enum for these statuses.
>>
>>
>> Furthermore, Nikola has proposed to create a state machine for the
>> statuses,
>> which is part of another abandoned BP. And this is also the work
>> I'd like to go
>> on with. Please refer to:
>> https://review.openstack.org/#/c/197668/
>> https://review.openstack.org/#/c/197669/
>>

This is IMHO a worthwhile effort on its own. I'd like to see it use a
defined state machine in addition to being a simple enum so that
transitions are clearly defined as well.
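
Something along these lines is what I have in mind - a sketch using the
automaton library with a deliberately simplified set of states and events
(the real set needs to come out of the spec discussion):

    from automaton import machines

    m = machines.FiniteMachine()
    for state in ('queued', 'migrating', 'finished', 'error'):
        m.add_state(state)
    m.add_transition('queued', 'migrating', 'start')
    m.add_transition('migrating', 'finished', 'complete')
    m.add_transition('migrating', 'error', 'fail')
    m.default_start_state = 'queued'

    m.initialize()
    m.process_event('start')
    m.process_event('complete')
    print(m.current_state)    # 'finished'
    m.process_event('fail')   # blows up - invalid transition, which is
                              # exactly the validation we'd get for free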

>>
>> Another proposal is: introduce a new member named "state" into
>> Migration.
>> Use a state machine to handle this Migration.state, and leave
>> migration_status
>> field a descriptive human readable free-form.
>>

This is a separate effort IMHO - we should do both if possible.

>
> On 10/14/2015 11:14 AM, Zhenyu Zheng wrote:
>> I think it will be better if you can submit a spec for your proposal,
>> it will be easier for people to give comment.
>
> OK, will submit one soon.

If you plan to just enumerate the possible states - that should not
require a spec. Adding automaton in the mix, and especially adding a new
'state' field probably does deserve some discussion so in that case feel
free to write up a spec.

N.




Re: [openstack-dev] [nova][mistral] Automatic evacuation as a long running task

2015-10-12 Thread Nikola Đipanov
On 10/06/2015 04:34 PM, Matthew Booth wrote:
> Hi, Roman,
> 
> Evacuated has been on my radar for a while and this post has prodded me
> to take a look at the code. I think it's worth starting by explaining
> the problems in the current solution. Nova client is currently
> responsible for doing this evacuate. It does:
>



> 
> I believe we can solve this problem, but I think that without fixing
> single-instance evacuate we're just pushing the problem around (or
> creating new places for it to live). I would base the robustness of my
> implementation on a single principle:
> 
>   An instance has a single owner, which is exclusively responsible for
> rebuilding it.
> 
> In outline, I would redefine the evacuate process to do:
> 
> API:
> 1. Call the scheduler to get a destination for the evacuate if none was
> given.
> 2. Atomically update instance.host to this destination, and task state
> to rebuilding.
> 

We can't do this because of resource tracking - the host switch has to
be done after the claim is done which can happen only on the target
compute, otherwise we don't track the resources properly (*).

That does not invalidate your more general point which is that we need a
way to make sure that started evacuations can be picked up and resumed
in case of any failures along the way (even a rebuild failure of the
target host that may have failed during the process).

There is some work that dansmith did [1], and I later built upon some of
that work [2]. I think our assumption was that we would use the migration record
for this, which _I think_ gives us all the stuff you talk about further
below, apart of course from there being a need for an external task to
actually see the evacuation through to the end. I think this is in-line
with most HA design proposals, where we make sure our control plane is
redundant while we really don't care about individual compute nodes
(apart from the instances they host).

I am also not sure that leaving the actual building of the instance up
to a periodic task is a good choice if we want to minimize downtime
which seem to me to be the point of the instance HA proposals.

N.

(*) We could "solve" this by checking instance.task_state for example, but
IMHO we shouldn't go down that route as it becomes way more difficult to
reason about resource tracking once you introduce one more free variable.

[1]
https://github.com/openstack/nova/blob/02b7e64b29dd707c637ea7026d337e5cb196f337/nova/compute/api.py#L3303
[2]
https://github.com/openstack/nova/blob/02b7e64b29dd707c637ea7026d337e5cb196f337/nova/compute/manager.py#L2702
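
For illustration, the compute-side sweep Matt describes below (quoted) could
be shaped roughly like this - a sketch only, not Nova code, and it assumes
the API step has already flipped instance.host and set the task state:

    REBUILDING = 'rebuilding'

    class EvacuateSweeper(object):
        """Periodic task: resume rebuilds this host owns but isn't running."""
        def __init__(self, host, db, rebuild_fn):
            self.host = host
            self.db = db                   # stand-in for the instance store
            self.rebuild_fn = rebuild_fn   # stand-in for the actual rebuild
            self.in_progress = set()

        def run_periodic(self, context):
            for inst in self.db.instances_on_host(context, self.host):
                if (inst['task_state'] == REBUILDING
                        and inst['uuid'] not in self.in_progress):
                    # Lost message, restarted compute, etc. - just resume.
                    self.in_progress.add(inst['uuid'])
                    self.rebuild_fn(context, inst)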

> Compute:
> 3. Rebuild the instance.
> 
> This would be supported by a periodic task on the compute host which
> looks for rebuilding instances assigned to this host which aren't
> currently rebuilding, and kicks off a rebuild for them. This would cover
> the compute going down during a rebuild, or the api going down before
> messaging the compute.
> 
> Implementing this gives us several things:
> 
> 1. The list instances, evacuate all instances process becomes
> idempotent, because as soon as the evacuate is initiated, the instance
> is removed from the source host.
> 2. We get automatic recovery of failure of the target compute. Because
> we atomically moved the instance to the target compute immediately, if
> the target compute also has to be evacuated, our instance won't fall
> through the gap.
> 3. We don't need an additional place for the code to run, because it
> will run on the compute. All the work has to be done by the compute
> anyway. By farming the evacuates out directly and immediately to the
> target compute we reduce both overhead and complexity.
> 
> The coordination becomes very simple. If we've run the nova client
> evacuation anywhere at least once, the actual evacuations are now
> Somebody Else's Problem (to quote h2g2), and will complete eventually. As
> evacuation in any case involves a forced change of owner it requires
> fencing of the source and implies an external agent such as pacemaker.
> The nova client evacuation can run in pacemaker.
> 
> Matt
> 
> On Fri, Oct 2, 2015 at 2:05 PM, Roman Dobosz wrote:
> 
> Hi all,
> 
> The case of automatic evacuation (or resurrection currently), is a topic
> which surfaces once in a while, but it isn't yet fully supported by
> OpenStack and/or by the cluster services. There was some attempts to
> bring the feature into OpenStack, however it turns out it cannot be
> easily integrated with. On the other hand evacuation may be executed
> from the outside using Nova client or Nova API calls for evacuation
> initiation.
> 
> I did some research regarding the ways how it could be designed, based
> on Russel Bryant blog post[1] as a starting point. Apart from it, I've
> also taken high availability and reliability into consideration when
> designing the solution.
> 
> Together with coworker, we did first 

Re: [openstack-dev] [nova] how to address boot from volume failures

2015-10-01 Thread Nikola Đipanov
On 09/30/2015 10:45 PM, Andrew Laski wrote:
> On 09/30/15 at 05:03pm, Sean Dague wrote:
>> Today we attempted to branch devstack and grenade for liberty, and are
>> currently blocked because in liberty with openstack client and
>> novaclient, it's not possible to boot a server from volume using just
>> the volume id.
>>
>> That's because of this change in novaclient -
>> https://review.openstack.org/#/c/221525/
>>
>> That was done to resolve the issue that strong schema validation in Nova
>> started rejecting the kinds of calls that novaclient was making for boot
>> from volume, because the bdm 1 and 2 code was sharing common code and
>> got a bit tangled up. So 3 bdm 2 params were being sent on every request.
>>
>> However, https://review.openstack.org/#/c/221525/ removed the ==1 code
>> path. If you pass in just {"vda": "$volume_id"} the code falls through,
>> volume id is lost, and nothing is booted. This is how the devstack
>> exercises and osc recommends booting from volume. I expect other people
>> might be doing that as well.
>>
>> There seem to be a few options going forward:
>>
>> 1) fix the client without a revert
>>
>> This would bring back a ==1 code path, which is basically just setting
>> volume_id, and move on. This means that until people upgrade their
>> client they loose access to this function on the server.
>>
>> 2) revert the client and loosen up schema validation
>>
>> If we revert the client to the old code, we also need to accept the fact
>> that novaclient has been sending 3 extra parameters to this API call
>> since as long as people can remember. We'd need a nova schema relax to
>> let those in and just accept that people are going to pass those.
>>
>> 3) fix osc and novaclient cli to not use this code path. This will also
>> require everyone upgrades both of those to not explode in the common
>> case of specifying boot from volume on the command line.
>>
>> I slightly lean towards #2 on a compatibility front, but it's a chunk of
>> change at this point in the cycle, so I don't think there is a clear win
>> path. It would be good to collect opinions here. The bug tracking this
>> is - https://bugs.launchpad.net/python-openstackclient/+bug/1501435
> 
> I have a slight preference for #1.  Nova is not buggy here, novaclient is,
> so I think we should contain the fix there.
> 

+1 - this is obviously a client bug

> Is using the v2 API an option?  That should also allow the 3 extra
> parameters mentioned in #2.
> 

This could be a short term solution I guess, but long term we want to be
testing the code that is there to stay so really we want to fix the
client ASAP.
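
For reference, the "==1 code path" from option #1 boils down to something
like this (an illustrative sketch, not the actual novaclient code):

    # When the mapping is nothing but a volume id (e.g. {"vda": "$volume_id"}),
    # just set volume_id and move on instead of falling through and losing it.
    def parse_block_device_mapping(device, spec):
        fields = spec.split(':')
        if len(fields) == 1:
            return {'device_name': device, 'volume_id': fields[0]}
        # longer specs (id:type:size:delete-on-terminate) keep their
        # existing parsing - out of scope for this sketch
        raise NotImplementedError()

    print(parse_block_device_mapping('vda', 'my-volume-id'))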

N.




Re: [openstack-dev] Scheduler hints, API and Objects

2015-09-04 Thread Nikola Đipanov
On 09/04/2015 02:31 PM, Sylvain Bauza wrote:
> 
> 
> On 04/09/2015 14:57, Nikola Đipanov wrote:
>> On 06/25/2015 04:50 PM, Monty Taylor wrote:
>>> On 06/25/2015 10:22 AM, Andrew Laski wrote:
>>>> I have been growing concerned recently with some attempts to formalize
>>>> scheduler hints, both with API validation and Nova objects defining
>>>> them, and want to air those concerns and see if others agree or can
>>>> help
>>>> me see why I shouldn't worry.
>>>>
>>>> Starting with the API I think the strict input validation that's being
>>>> done, as seen in
>>>> http://git.openstack.org/cgit/openstack/nova/tree/nova/api/openstack/compute/schemas/v3/scheduler_hints.py?id=53677ebba6c86bd02ae80867028ed5f21b1299da,
>>>>
>>>> is unnecessary, and potentially problematic.
>>>>
>>>> One problem is that it doesn't indicate anything useful for a client.
>>>> The schema indicates that there are hints available but can make no
>>>> claim about whether or not they're actually enabled.  So while a
>>>> microversion bump would typically indicate a new feature available
>>>> to an
>>>> end user, in the case of a new scheduler hint a microversion bump
>>>> really
>>>> indicates nothing at all.  It does ensure that if a scheduler hint is
>>>> used that it's spelled properly and the data type passed is correct,
>>>> but
>>>> that's primarily useful because there is no feedback mechanism to
>>>> indicate an invalid or unused scheduler hint.  I think the API
>>>> schema is
>>>> a poor proxy for that deficiency.
>>>>
>>>> Since the exposure of a hint means nothing as far as its usefulness, I
>>>> don't think we should be codifying them as part of our API schema at
>>>> this time.  At some point I imagine we'll evolve a more useful API for
>>>> passing information to the scheduler as part of a request, and when
>>>> that
>>>> happens I don't think needing to support a myriad of meaningless hints
>>>> in older API versions is going to be desirable.
>>> I totally agree.
>>>
>>> If hints are to become an object, then need to be _real_ resources that
>>> can be listed, and that have structured metadata that has an API.
>>> Flavors are a great example of this. From an end user perspective, I can
>>> ask the cloud what flavors exist, those flavors tell me information that
>>> I can use to make a decision, and I can pass in a reference to those
>>> things. If I pass in an invalid flavor, I get a meaningful error
>>> message.
>>>
>>>> Finally, at this time I'm not sure we should take the stance that only
>>>> in-tree scheduler hints are supported.  While I completely agree with
>>>> the desire to expose things in cross-cloud ways as we've done and are
>>>> looking to do with flavor and image properties I think scheduling is an
>>>> area where we want to allow some flexibility for deployers to write and
>>>> expose scheduling capabilities that meet their specific needs.  Over
>>>> time I hope we will get to a place where some standardization can
>>>> happen, but I don't think locking in the current scheduling hints is
>>>> the
>>>> way forward for that.  I would love to hear from multi-cloud users here
>>>> and get some input on whether that's crazy and they are expecting
>>>> benefits from validation on the current scheduler hints.
>>> As a multi-cloud user, I do not use scheduler hints because there is no
>>> API to discover that they exist, and also no shared sense of semantics.
>>> (I know a flavor that claims 8G of RAM will give me, you guessed it, 8G
>>> of ram) So I consider scheduler hints currently to be COMPLETE vendor
>>> lock-in and/or only things to be used by private cloud folks who are
>>> also admins of their clouds.
>>>
>>> I would not touch them with a 10 foot pole until such a time as there is
>>> an actual API for listing, describing and selecting them.
>>>
>>> I would suggest that if we make one of those, we should quickly
>>> formalize meanings of fields - so that cloud can have specific hints
>>> that seem like cloud content - but that the way I learn about them is
>>> the same, and if there are two hints that do the same thing I can expect
>>> them to look the same in two different clouds.
>>>
>> So this kind of 

Re: [openstack-dev] Scheduler hints, API and Objects

2015-09-04 Thread Nikola Đipanov
On 06/25/2015 04:50 PM, Monty Taylor wrote:
> On 06/25/2015 10:22 AM, Andrew Laski wrote:
>> I have been growing concerned recently with some attempts to formalize
>> scheduler hints, both with API validation and Nova objects defining
>> them, and want to air those concerns and see if others agree or can help
>> me see why I shouldn't worry.
>>
>> Starting with the API I think the strict input validation that's being
>> done, as seen in
>> http://git.openstack.org/cgit/openstack/nova/tree/nova/api/openstack/compute/schemas/v3/scheduler_hints.py?id=53677ebba6c86bd02ae80867028ed5f21b1299da,
>> is unnecessary, and potentially problematic.
>>
>> One problem is that it doesn't indicate anything useful for a client. 
>> The schema indicates that there are hints available but can make no
>> claim about whether or not they're actually enabled.  So while a
>> microversion bump would typically indicate a new feature available to an
>> end user, in the case of a new scheduler hint a microversion bump really
>> indicates nothing at all.  It does ensure that if a scheduler hint is
>> used that it's spelled properly and the data type passed is correct, but
>> that's primarily useful because there is no feedback mechanism to
>> indicate an invalid or unused scheduler hint.  I think the API schema is
>> a poor proxy for that deficiency.
>>
>> Since the exposure of a hint means nothing as far as its usefulness, I
>> don't think we should be codifying them as part of our API schema at
>> this time.  At some point I imagine we'll evolve a more useful API for
>> passing information to the scheduler as part of a request, and when that
>> happens I don't think needing to support a myriad of meaningless hints
>> in older API versions is going to be desirable.
> 
> I totally agree.
> 
> If hints are to become an object, then need to be _real_ resources that
> can be listed, and that have structured metadata that has an API.
> Flavors are a great example of this. From an end user perspective, I can
> ask the cloud what flavors exist, those flavors tell me information that
> I can use to make a decision, and I can pass in a reference to those
> things. If I pass in an invalid flavor, I get a meaningful error message.
> 
>> Finally, at this time I'm not sure we should take the stance that only
>> in-tree scheduler hints are supported.  While I completely agree with
>> the desire to expose things in cross-cloud ways as we've done and are
>> looking to do with flavor and image properties I think scheduling is an
>> area where we want to allow some flexibility for deployers to write and
>> expose scheduling capabilities that meet their specific needs.  Over
>> time I hope we will get to a place where some standardization can
>> happen, but I don't think locking in the current scheduling hints is the
>> way forward for that.  I would love to hear from multi-cloud users here
>> and get some input on whether that's crazy and they are expecting
>> benefits from validation on the current scheduler hints.
> 
> As a multi-cloud user, I do not use scheduler hints because there is no
> API to discover that they exist, and also no shared sense of semantics.
> (I know a flavor that claims 8G of RAM will give me, you guessed it, 8G
> of ram) So I consider scheduler hints currently to be COMPLETE vendor
> lock-in and/or only things to be used by private cloud folks who are
> also admins of their clouds.
> 
> I would not touch them with a 10 foot pole until such a time as there is
> an actual API for listing, describing and selecting them.
> 
> I would suggest that if we make one of those, we should quickly
> formalize meanings of fields - so that cloud can have specific hints
> that seem like cloud content - but that the way I learn about them is
> the same, and if there are two hints that do the same thing I can expect
> them to look the same in two different clouds.
> 

So this kind of argumentation keeps confusing me, TBH. Unless I am
misunderstanding some basic things about how Nova works, the above argument
cleanly applies to flavors as well. Flavor '42' is not going to be the
same thing across clouds, and that's not where it ends: once you throw
in extra_specs, in particular those related to PCI devices and NUMA/CPU
pinning features, there is really no discoverability there whatsoever (*).

What I am trying to get to is not whether this is right or wrong, but to
point out the fact that Flavors are simply not a good abstraction that
can have reasonable meaning "across cloud boundaries" (i.e. different
Nova deployments), at least the way they are implemented at the moment.
We should not pretend that they are, or try to demonize useful code
making use of them, but rather come up with a better abstraction that can
have reasonable meaning across different deployments.
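
To make that concrete, a sketch of the sort of extra_specs I mean (the
keys are the usual libvirt-driver ones, the values are made up for
illustration) - none of which is any more discoverable across clouds than
a scheduler hint is:

    # Illustrative flavor extra_specs - PCI and NUMA/CPU pinning behaviour
    # hidden behind keys a user on another cloud has no way to discover:
    extra_specs = {
        "hw:cpu_policy": "dedicated",         # pin guest vCPUs to host pCPUs
        "hw:numa_nodes": "2",                 # spread the guest over 2 NUMA nodes
        "pci_passthrough:alias": "niantic:1", # request a PCI device by alias
    }

    # For comparison, a scheduler hint as it appears in a boot request body:
    hints = {"os:scheduler_hints": {"group": "<server-group-uuid>"}}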

I think this is what Andrew was hinting at when he said that scheduling
is an area that cannot reasonably be standardized in this way.

I recently spoke to John briefly about this and got 

Re: [openstack-dev] [cinder][nova] snapshot and cloning for NFS backend

2015-08-26 Thread Nikola Đipanov
On 07/28/2015 09:00 AM, Kekane, Abhishek wrote:
 Hi Devs,
 
  
 
 There is an NFS backend driver for cinder, which supports only limited
 volume handling features. Specifically, snapshot and cloning
 features are missing.
 
  
 
 Eric Harney has proposed a feature of NFS driver snapshot [1][2][3],
 which was approved on Dec 2014 but not implemented yet.
 
  
 
 [1] blueprint https://blueprints.launchpad.net/cinder/+spec/nfs-snapshots
 
 [2] cinder-specs https://review.openstack.org/#/c/133074/  - merged for
 Kilo but moved to Liberty
 
 [3] implementation https://review.openstack.org/#/c/147186/  - WIP
 
  
 
 As of now [4] nova patch is a blocker for this feature.
 
 I have tested this feature by applying [4] nova patch and it is working
 as per expectation.
 
  
 
 [4] https://review.openstack.org/#/c/149037/
 

so [4] is actually related to the following bug (it is linked on the
review):

https://bugs.launchpad.net/nova/+bug/1416132

The proposed patch is, as was discussed in some detail on the review,
not the right approach for several reasons.

I have added a comment on the bug [1] outlining what I think is the
right solution here, however - it is far from a trivial change.

Let me know if the comment on the bug makes sense and if I need to add
more information.

I will try to devote some time to fixing this, as I believe this is
causing us a lot more problems in the gate on an ongoing basis (see [2]
for example), but the discussion in the bug should be enough to get
anyone else who may want to pick it up on the right path to making progress!

Best,
N.

[1] https://bugs.launchpad.net/nova/+bug/1416132/comments/8
[2] https://bugs.launchpad.net/nova/+bug/1445021


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] should we allow overcommit for a single VM?

2015-08-18 Thread Nikola Đipanov
On 08/17/2015 08:22 PM, Chris Friesen wrote:
 
 I tried bringing this up on the irc channel, but nobody took the bait.
 Hopefully this will generate some discussion.
 
 I just filed bug 1485631.  Nikola suggested one way of handling it, but
 there are some complications that I thought I should highlight so we're
 all on the same page.
 
 The basic question is, if a host has X CPUs in total for VMs, and a
 single instance wants X+1 vCPUs, should we allow it?  (Regardless of
 overcommit ratio.)  There is also an equivalent question for RAM.
 
 Currently we have two different answers depending on whether numa
 topology is involved or not.  Should we change one of them to make it
 consistent with the other?  If so, a) which one should we change, and b)
 how would we do that given that it results in a user-visible behaviour
 change?  (Maybe a microversion, even though the actual API doesn't
 change, just whether the request passes the scheduler filter or not?)
 

I would say that the correct behavior is what the NUMA fitting logic does,
which is to not allow an instance to over-commit against itself, and we
should fix normal (non-NUMA) over-commit to match. Allowing an instance to
over-commit against itself does not make a lot of sense; however, it is
not something that is likely to happen that often in real-world usage -
I would imagine operators are unlikely to create flavors larger than their
compute hosts.
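
To spell out the difference (numbers and names purely illustrative, this
is not the actual filter/claims code):

    host_pcpus = 16              # physical CPUs the host exposes for guests
    cpu_allocation_ratio = 16.0  # the usual overcommit knob
    flavor_vcpus = 20            # a single instance bigger than the host

    # Non-NUMA path today: only the aggregate limit is checked, so a single
    # oversized instance slips through.
    passes_core_filter = flavor_vcpus <= host_pcpus * cpu_allocation_ratio  # True

    # NUMA fitting path (and, I would argue, the correct behaviour): an
    # instance is never allowed to over-commit against itself.
    passes_numa_fit = flavor_vcpus <= host_pcpus                            # False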

I am not sure that this has anything to do with the API, though. This is
mostly a Nova internal implementation detail. Any nova deployment can
fail to boot an instance for any number of reasons, and this does not
affect the API response of the actual boot request.

Hope it helps,
N.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] Device names supplied to the boot request

2015-07-16 Thread Nikola Đipanov
On 07/16/2015 05:47 PM, Nikola Đipanov wrote:
 
 Also, not being able to specify device names would make it impossible to
 implement certain features that EC2 API can provide, such as overriding
 the image block devices without significant effort.
 

I forgot to add links that explain this in more detail [1][2]

[1] https://review.openstack.org/#/c/190324/
[2] https://bugs.launchpad.net/nova/+bug/1370250




__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] Device names supplied to the boot request

2015-07-16 Thread Nikola Đipanov
On 07/16/2015 11:24 AM, Sean Dague wrote:
 On 07/15/2015 01:41 PM, Andrew Laski wrote:
 On 07/15/15 at 12:19pm, Matt Riedemann wrote:
 snip
 The other part of the discussion is around the API changes, not just
 for libvirt, but having a microversion that removes the device from
 the request so it's no longer optional and doesn't provide some false
 sense that it works properly all of the time.  We talked about this in
 the nova channel yesterday and I think the thinking was we wanted to
 get agreement on dropping that with a microversion before moving
 forward with the libvirt change you have to ignore the requested
 device name.

 From what I recall, this was supposed to really only work reliably for
 xen but now it actually might not, and would need to be tested again.
 Seems we could start by checking the xen CI to see if it is running
 the test_minimum_basic scenario test or anything in
 test_attach_volume.py in Tempest.

 This doesn't really work reliably for xen either, depending on what is
 being done.  For the xenapi driver Nova converts the device name
 provided into an integer based on the trailing letter, so 'vde' becomes
 4, and asks xen to mount the device based on that int.  Xen does honor
 that integer request so you'll get an 'e' device, but you could be
 asking for hde and get an xvde or vice versa.
 
 So this sounds like it's basically not working today. For Linux guests
 it really can't work without custom in guest code anyway, given how
 device enumeration works.
 
 That feels to me like we remove it from the API with a microversion, and
 when we do that just comment that trying to use this before that
 microversion is highly unreliable (possibly dangerous) and may just
 cause tears.
 

The problem with outright banning it is that we still have to support
people who want to use the older version, meaning all of the code would
have to support it indefinitely (3.0 is not even on the horizon). Given
the shady gains, I can't help but feel that this is needless complexity.

Also, not being able to specify device names would make it impossible to
implement certain features that the EC2 API can provide, such as overriding
the image block devices, without significant effort.

 ...
 
 On a slight tangent, probably a better way to provide mount stability to
 the guest is with FS labels. libvirt is already labeling the filesystems
 it creates, and xenserver probably could as well. The infra folks ran
 into an issue yesterday
 http://status.openstack.org//elastic-recheck/#1475012 where using that
 info was their fix.
 

I think the reason device_names are exposed in the API is that that was
the quickest way to provide a sort of an ID of a block device attached
to a certain instance that further API calls can then act upon.

 It's not the same thing as deterministic devices, but deterministic
 devices really aren't a thing on first boot unless you have guest agent
 code, or only boot with one disk and hot plug the rest carefully.
 Neither are really fun answers.
 
   -Sean
 


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] Device names supplied to the boot request

2015-07-16 Thread Nikola Đipanov
On 07/16/2015 06:35 PM, Matt Riedemann wrote:
 
 
 On 7/16/2015 11:47 AM, Nikola Đipanov wrote:
 On 07/16/2015 11:24 AM, Sean Dague wrote:
 On 07/15/2015 01:41 PM, Andrew Laski wrote:
 On 07/15/15 at 12:19pm, Matt Riedemann wrote:
 snip
 The other part of the discussion is around the API changes, not just
 for libvirt, but having a microversion that removes the device from
 the request so it's no longer optional and doesn't provide some false
 sense that it works properly all of the time.  We talked about this in
 the nova channel yesterday and I think the thinking was we wanted to
 get agreement on dropping that with a microversion before moving
 forward with the libvirt change you have to ignore the requested
 device name.

  From what I recall, this was supposed to really only work reliably
 for
 xen but now it actually might not, and would need to be tested again.
 Seems we could start by checking the xen CI to see if it is running
 the test_minimum_basic scenario test or anything in
 test_attach_volume.py in Tempest.

 This doesn't really work reliably for xen either, depending on what is
 being done.  For the xenapi driver Nova converts the device name
 provided into an integer based on the trailing letter, so 'vde' becomes
 4, and asks xen to mount the device based on that int.  Xen does honor
 that integer request so you'll get an 'e' device, but you could be
 asking for hde and get an xvde or vice versa.

 So this sounds like it's basically not working today. For Linux guests
 it really can't work without custom in guest code anyway, given how
 device enumeration works.

 That feels to me like we remove it from the API with a microversion, and
 when we do that just comment that trying to use this before that
 microversion is highly unreliable (possibly dangerous) and may just
 cause tears.


 The problem with outright banning it is that we still have to support
 people who want to use the older version meaning all of the code would
 have to support it indefinitely (3.0 is not even on the horizon), given
 the shady gains, I can't help but feel that this is needless complexity.
 
 Huh?  That's what the microversion in the v2.1 API is for - we add a
 microversion that drops support for the device name in the API request,
 if you're using a version of the API before that we log a warning that
 it's unreliable and probably shouldn't be used.  With the microversion
 you're opting in to using it.
 

So are you saying that we don't have to support actually persisting the
user-supplied device names for requests that ask for a version < N? If so,
then my change can be accompanied by a version bump and we're good to go.

If we have to support both, and somehow notify the compute that it should
persist the requested device names some of the time, then I am very much
against that.

IMHO microversions should not be used for fixing utter brokenness; it
should just be fixed. Keeping bug compatibility is not something we
should do IMHO, but that's a different discussion.


 Also, not being able to specify device names would make it impossible to
 implement certain features that EC2 API can provide, such as overriding
 the image block devices without significant effort.
 
 Huh? (x2)  With your change you're ignoring the requested device name
 anyway, so how does this matter?  Also, the ec2 API is moving out of
 tree so do we care what that means for the openstack compute API?


Please look at the patch and the bug I link in the follow up email
(copied here for your convenience). It should be clearer then which
features cannot possibly work [1][2].

As for supporting the EC2 API - I don't know the answer to that; if we
decide we don't care about it, that's cool with me. Even without that
as a consideration, I still think the currently proposed patch is the best
way forward.

[1] https://review.openstack.org/#/c/190324/
[2] https://bugs.launchpad.net/nova/+bug/1370250


 ...

 On a slight tangent, probably a better way to provide mount stability to
 the guest is with FS labels. libvirt is already labeling the filesystems
 it creates, and xenserver probably could as well. The infra folks ran
 into an issue yesterday
 http://status.openstack.org//elastic-recheck/#1475012 where using that
 info was their fix.


 I think the reason device_names are exposed in the API is that that was
 the quickest way to provide a sort of an ID of a block device attached
 to a certain instance that further API calls can then act upon.

 It's not the same thing as deterministic devices, but deterministic
 devices really aren't a thing on first boot unless you have guest agent
 code, or only boot with one disk and hot plug the rest carefully.
 Neither are really fun answers.

 -Sean



 __

 OpenStack Development Mailing List (not for usage questions)
 Unsubscribe:
 openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
 http://lists.openstack.org/cgi-bin

[openstack-dev] [Nova] Device names supplied to the boot request

2015-07-15 Thread Nikola Đipanov
I'll keep this email brief since this has been a well known issue for
some time now.

Problem: libvirt can't honour device names specified at boot for any
volumes requested as part of block_device_mapping. What we currently do,
in case they do get specified, is persist them as-is, so that we can
return them from the API, even though libvirt can't honour them (this
leads to a number of issues when we do rely on the data in the DB; a
very common one comes up when attaching further devices, which follow-up
patches to [1] try to address).
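
For clarity, this is roughly the kind of request we are talking about
(values are placeholders, the exact field list is from memory):

    # A block_device_mapping_v2 entry in a boot request - the device_name
    # below is the part libvirt cannot actually honour:
    bdm_v2 = [{
        "uuid": "<volume-uuid>",
        "source_type": "volume",
        "destination_type": "volume",
        "boot_index": -1,             # not the boot device
        "device_name": "/dev/vdb",    # requested name, persisted as-is today
    }]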

There is a proposed patch [1] that will make libvirt disregard what was
passed and persist the values it defaults to and can actually honour. This
seems contentious because it will change the API behaviour (instance show
will potentially return device names other than the ones requested).

My take on this is that this is broken and we should fix it. All other
ways to fix it, namely:

  * reject the request in the API if libvirt is the driver (we can't
really know where the request will end up, and blocking in the API is
bad; plus we would still have to keep backwards compatibility for a long
time, which means the bug is not really solved, we just have more code
for bugs to fester in)
  * fail the request at the scheduler level (very disruptive, and the
question is how do we tell users that this is a legit change - we can't
really bump the API version for a compute change)

are way more disruptive for little gain.

  * There is one more thing we could do that hasn't been discussed - we
could store requested_device_name, and always return that from the API.
This too adds needless complexity IMO.

I think the patch in [1] is a pragmatic solution to a long standing
issue that only changes the API behaviour for an already broken
interaction. I'd like to avoid needless complexity if it gives us nothing.

It would be awesome to get some discussion around this and hopefully get
some resolution to this long standing issue. Do let me know if more
information/clarification is required.

[1] https://review.openstack.org/#/c/189632/

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] schedule instance based on CPU frequency ?

2015-06-30 Thread Nikola Đipanov
On 06/30/2015 07:42 AM, ChangBo Guo wrote:
 CPU frequency is an important performance parameter; currently nova
 drivers just report cpu_info without the frequency. We store the compute
 node cpu_info in the database in the compute_nodes.cpu_info column, so we
 can add the frequency easily.

 The main usage of CPU frequency I can think of is scheduling to meet the
 needs of applications which require high frequency - add a frequency-based
 filter? If we need this, I would like to propose a spec for it.
 

Would it be possible to give more details on the type of app that will
have this _specific_ requirement?

I don't think I have all the details in my head, but it seems to me that
the frequency of the hypervisor CPU is just not something that carries
enough information for users about how most applications will perform. I
would imagine they would either want the fastest or some specialized
HW for specific applications.

 
 There are two steps to leverage CPU frequency:
 1. report the CPU frequency and record the value; nova hypervisor-show
 will include the value.

 2. filter compute nodes based on CPU frequency:
 add a new scheduler filter to do that.

 Before I start on this work, I would like your input.

 Do we need to leverage CPU frequency in Nova?
 If yes, do we need a new filter, or can we leverage an existing filter to
 use the frequency?
 

Personally I don't think we do - but I may not understand what problem
this is trying to solve.

But even if we do - the most important thing IMHO would be _how_ to
expose it to users (do we allow them to request a minimum frequency, or
a specific one or something else). API contract is extremely important
here because we want to make sure that we are exposing the right
semantics for users - as we would want this to be usable by as big a
group of people as possible.

If it's just about having a high performance tier - can we do this with
host aggregates and flavors? These are the questions we want to answer
first IMHO.
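
For the record, a rough sketch of what I mean (using novaclient's Python
API; the names, flavor sizes and the cpu_tier key are made up for
illustration, and this assumes the AggregateInstanceExtraSpecsFilter is
enabled on the scheduler):

    from novaclient import client

    # Credentials are placeholders - fill in for your cloud.
    nova = client.Client("2", "admin", "secret", "admin",
                         "http://keystone:5000/v2.0")

    # Tag a set of high-frequency hosts with an aggregate...
    agg = nova.aggregates.create("high-freq", None)
    nova.aggregates.add_host(agg, "compute-fast-01")
    nova.aggregates.set_metadata(agg, {"cpu_tier": "high"})

    # ... and tie a flavor to that aggregate via extra_specs.
    flavor = nova.flavors.create("m1.fast", 4096, 2, 20)  # name, ram MB, vcpus, disk GB
    flavor.set_keys({"aggregate_instance_extra_specs:cpu_tier": "high"})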

N.

 -- 
 ChangBo Guo(gcb)
 
 
 __
 OpenStack Development Mailing List (not for usage questions)
 Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] The unbearable lightness of specs

2015-06-29 Thread Nikola Đipanov
Top-posting since I am writing this as a summary email, with some (very)
rough proposals on improvements going forward (*)

* Specs have a number of positives that we should not discount:
 ** Absolutely necessary to sign off on the idea and direction before
writing code
 ** Serve as a way for operators to give feedback
 ** Serve as documentation once the work has landed (provided that they
are kept up to date)

* The current process has a number of shortcomings too (some of these are my
own comments, and some are thoughts that people brought up which I
incorporated):
 ** Approval process that creates a bottleneck on a (needlessly?) small
team of people
 ** Tools and a review culture that do not work well for the kind of
communication that needs to happen
 ** Requesting the same format and process for all proposed work, causing
delays where they are not necessary and in turn exacerbating the load on
the spec-core team. Some problems require a lot less written design
discussion than others, but we treat them all the same. Also, due to the
general design of Nova, it is significantly harder to make some
design decisions without looking at the code too, which the current spec
process discourages.
 ** Coupling the spec review process to a particular release - this has
a number of drawbacks that are probably worth their own email, some of
which are technical in nature and some of which are social. (A good
point was also made that this makes the already poor tooling even worse
as the previous discussion is lost)

We should also take into account the history behind the current Nova process,
and that it was meant to also give people more confidence about the
prospects of their code landing in a certain release. This might be
something we want to consider in parallel with figuring out the changes in
the release cycles that are also happening.

Going forward - some ideas on first steps we could take to improve
(purely my own, not a digest from the thread):

* Default to no spec, and be clear on what grounds we are asking for
one. Currently this is hard to do, in part I believe because posting a spec
in Gerrit carries far more weight than just opening a BP in Launchpad.
One idea could be to have a BP repository (that gets mirrored in LP
maybe) that requires only a subset of the info, and require 2 cores (or a
certain number of contributors) to vote negatively before a full-blown
spec is required.
* Consider specs approved indefinitely when they are merged, and if
they miss a release - no big deal, but reserve the right to block the
patches should circumstances change. Do release planning separately.
* Start to talk about improvements to tooling. I feel it has been our
(OpenStack) tendency to stick to what we know even when it's clear that
the tools are sub-par for the job. The integrated release dictates a lot
of that, and it might be time to start those discussions.

N.

(*) I feel more discussions on this list could benefit from one

On 06/24/2015 01:42 PM, Daniel P. Berrange wrote:
 On Wed, Jun 24, 2015 at 11:28:59AM +0100, Nikola Đipanov wrote:
 Hey Nova,

 I'll cut to the chase and keep this email short for brevity and clarity:

 Specs don't work! They do nothing to facilitate good design happening,
 if anything they prevent it. The process layered on top with only a
 minority (!) of cores being able to approve them, yet they are a prereq
 of getting any work done, makes sure that the absolute minimum that
 people can get away with will be proposed. This in turn goes and
 guarantees that no good design collaboration will happen. To add insult
 to injury, Gerrit and our spec template are a horrible tool for
 discussing design. Also the spec format itself works for only a small
 subset of design problems Nova development is faced with.
 
 I'd like to see some actual evidence to back up a sweeping statement
 such as Specs don't work. They do nothing to facilitate good design happening,
 if anything they prevent it.
 
 Comparing Nova today, with Nova before specs were introduced, I think
 that specs have had a massive positive impact on the amount of good
 design and critique that is happening.
 
 Before specs, the average blueprint had no more than 3 lines of text
 in its description. Occasionally a blueprint would link to a wiki
 page or google doc with some design information, but that was very
 much the exception.
 
 When I was reviewing features in Nova before specs came along, I spent
 a lot of time just trying to figure out what on earth the code was
 actually attempting to address, because there was rarely any statement
 of the problem being addressed, or any explanation of the design that
 motivated the code.  This made life hard for reviewers trying to figure
 out if the code was acceptable to merge.  It is pretty bad for contributors
 trying to implement new features too, as they could spend weeks or months
 writing and submitting code, only to be rejected at the end because the
 (lack of any design discussions) meant they missed some

Re: [openstack-dev] [Nova] The unbearable lightness of specs

2015-06-29 Thread Nikola Đipanov
On 06/29/2015 11:32 AM, Thierry Carrez wrote:
 Nikola Đipanov wrote:
 It's not only about education - I think Gerrit is the wrong medium to
 have a design discussion and do design work. Maybe you disagree as you
 seem to imply that it worked well in some cases?

 I've recently seen on more than a few cases how a spec review can
 easily spiral into a collection of random comments that are hard to put
 together in a coherent discussion that you could call design work.

 If you throw in the expectation of approval into the mix, I think it
 basically causes the opposite of good design collaboration to happen.
 
 On Gerrit not being the right tool for specs...
 
 Using code review tools to iterate on specs creates two issues:
 
 * Minor comments
 Line-by-line code review tools are excellent for reviewing the
 correctness of lines of code. When switching to specs, you retain some
 of that review correctness of all lines mindset and tend to spot
 mistakes in the details more than mistakes in the general idea. That, in
 turn, results in -1 votes that don't really mean the same thing.
 
 * Extra process
 Code review tools are designed to produce final versions of documents.
 For specs we use a template to enforce a minimal amount of details, but
 those are already too much for most small features. To solve that issue,
 we end up having to binary-decide when something is significant enough
 to warrant a full spec. As with any line in the sand, the process end up
 being too much for things that are just beyond the line, and too little
 for things that are just before.
 
 IMHO the ideal tool would allow you to start with a very basic
 description of what feature you want to push. Then a discussion can
 start, and the spec can be refined to answer new questions or detail
 the already-sketched-out answers. Simple features can be approved really
 quickly using a one-sentence spec, while more complex features will
 develop into a full-fledged detailed document before they get approved.
 One size definitely doesn't fit all. And the discussion-based review
 (opposed to line-by-line review) discourages nitpicking on style.
 
 You *can* do this with Gerrit: discourage detail review + encourage idea
 review, and start small and develop the document in future patchsets
 as-needed. It's just not really encouraging that behavior for the job,
 and the overhead for simple features still means we can't track smallish
 features with it. As we introduce new tools we might switch the feature
 approval process to something else. In the mean time, my suggestion
 would be to use smaller templates, start small and go into details only
 if needed, and discourage nitpicking -1s.
 

I fully agree with the above FWIW.

This is *exactly* what I hinted at in the summary email, when I
suggested a BP repository, with a problem statement patch that could
then potentially evolve into a full blown spec if needed.

I feel that Gerrit is bad at keeping an easily review-able history of a
discussion even for code reviews, and this problem is worse for written
text (as you point out), so looking at other tools might be useful at
some point.

N.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] The unbearable lightness of specs

2015-06-29 Thread Nikola Đipanov
On 06/26/2015 08:15 PM, Tim Bell wrote:
 
 Limiting those who give input to the people who can analyse python and 
 determine the impacts of the change has significant risks. Many of those 
 running OpenStack clouds can give their feedback as part of the specs 
 process. While this may not be as fully structured as you would like, 
 ignoring the input from those who are running clouds when proposing a change 
 is likely to cause problems later on. 
 
 The specs process was developed jointly to allow exactly this kind of early 
 input ... people writing the code wanted input from those who were using this 
 code to deliver new functions and improvements to the end users of the cloud. 
 No problem to discuss how to improve the process but it is important to allow 
 all the people affected by a change to be involved in the solution and 
 contribute, not just the ones writing the code.
 

These are very valid points. Input from users/deployers is extremely
important. One of the main points of the agile way of producing
software is about shortening the feedback loop by producing working code
to comment on, as opposed to defining requirements fully before writing
code.

I think that in the case of certain problems, having as much information
up front and solid feedback from the operator community is very
valuable, but I also feel that there are cases where, after a point,
prototyping can give better results (partly due to the nature of the
tools we use and our reviewing culture, as you mention above).

N.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] The unbearable lightness of specs

2015-06-26 Thread Nikola Đipanov
On 06/25/2015 05:39 PM, Tim Bell wrote:
 
 
 On 25/06/15 09:49, Thierry Carrez thie...@openstack.org wrote:
 
 Maxim Nestratov wrote:
 24.06.2015 20:21, Daniel P. Berrange wrote:
 On Wed, Jun 24, 2015 at 04:46:57PM +, Michael Krotscheck wrote:
 First: Overhead
 - 1 week for vacation
 - 1 week for holidays.
 - 4 weeks for feature freeze.
 - 4 weeks of pre-summit roadmap planning.
 - 1 week of summit.
 Remaining: 15 weeks.

 Second: Writing, discussing, and landing the spec.
 Remaining: 9 weeks.

 Third: Role conflicts and internal overhead.
 Remaining time: 4.5 weeks

 Writing the code:
 Remaining time: 3.5 weeks.

 The last step: Getting the cores to agree with your approach.
 Remaining time: -0.5 weeks.
 The problem is how long it takes.
 [...]

 At a minimum I'd like to see the specs review  approval completely
 de-couple from the development cycle. There is really no compelling
 reason why design reviews have to be put in a box against a specific
 release. In doing so we create a big crunch at the start of each cycle,
 which is what we're particularly suffering under this week and last.
 We should be happy to review and approve specs at any time whatsoever,
 and allow approval to last for at least 1 year (with caveat that it
 can be revoked if something in nova changes to invalidate a design
 decision).
 Absolutely agree. There is no use in waiting for another cycle to start
 if you missed deadline for your spec in current cycle. Why not to review
 specs and approve them setting next release cycle milestone and allow
 people to start coding and get code review for next release cycle?

 I totally agree that there is no reason to tie specs drafting, review 
 approval to the development cycle. In fact, most project teams don't.

 Now, Michael's example is a bit unrealistic -- cross-project specs
 aren't tied to release cycle at all, and you can certainly work on them
 during the 4 weeks of feature freeze or 4 weeks of pre-summit roadmap
 planning.

 I would even argue that those 8 weeks are the ideal time to draft and
 get early reviews on a spec : you can use the design summit at the end
 of them to close the deal if it still needs discussion, and start
 working on code the week after.
 
 The operator community has also been generally positive on the specs
 process. It has allowed a possibility for people without python skills to
 give input on the overall approach being taken rather than needing to
 review code. The approach, after all, from an operator mid cycle meetup
 (blueprints on blueprints) combined with the nova specs proposal.
 
 I’ve certainly had a few specs where the approach needed in-depth
 discussion (one I remember clearly was the re-assign a project spec) and
 to have waited till the code was written would have been a waste.
 

No doubt doing design prior to coding is extremely useful in a lot of
cases, as is having documents/artifacts of that process in a well-known
place. Some problems don't need that much design outside of the code
itself, though.

This is what I was referring to elsewhere on the thread when I said we
are coupling together the process of designing a feature with its
approval for a release, release planning etc., and then blanket-applying
it to everything that resembles a feature.

 One of the problems that I’ve seen is with specs etiquette where people -1
 because they have a question. This is a question of education rather than
 a fundamental issue with the process.
 

It's not only about education - I think Gerrit is the wrong medium to
have a design discussion and do design work. Maybe you disagree, as you
seem to imply that it worked well in some cases?

I've recently seen in more than a few cases how a spec review can
easily spiral into a collection of random comments that are hard to put
together into a coherent discussion that you could call design work.

If you throw the expectation of approval into the mix, I think it
basically causes the opposite of good design collaboration to happen.

N.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] The unbearable lightness of specs

2015-06-26 Thread Nikola Đipanov
On 06/25/2015 09:46 PM, Joe Gordon wrote:
 On Thu, Jun 25, 2015 at 1:39 AM, Nikola Đipanov ndipa...@redhat.com
  
 
 
  As someone who does a lot of spec reviews, I take +1s from the right
  people (not always nova-cores) to mean a lot, so much that I regularly
  will simply skim the spec myself before +2ing it. If a subject matter
  expert who I trust +1s a spec, that is usually all I need.
 
  * +1/-1s from the right people have a lot of power on specs. So the
  review burden isn't just on the people with '+W' power.  We may not have
  done a great job of making this point clear.
  * There are many subject matter experts outside nova-core who's vote
  means a lot. For example PTL's of other projects the spec impacts.
 
 
 This is exactly the kind of cognitive dissonance I find hard to not get
 upset about :)
 
 Code is what matters ultimately - the devil _is_ in the details, and I
 can bet you that it is extremely unlikely that a PTL of any other
 project is also going to go and do a detailed review of a feature branch
 in Nova, and have the understanding of the state of the surrounding
 codebase needed to do it properly. That's what's up to the nova
 reviewers to deal with. I too wish Nova code was in a place where it was
 possible to just do architecture, but I think we all agree it's
 nowhere near that.
 
 
 This goes back to the point(s) that was brought up over and over again
 in this thread. I guess we have to agree to disagree.
 
 I think saying 'code' is what ultimately matters is misleading.  'Code'
 is the implementation of an idea. If an idea is bad, so what if the code
 is good?
 
 I wouldn't ask the PTL of say Keystone to review the implementation of
 some idea in nova, but I do want their opinion on an idea that impacts
 how nova and keystone interact. Why make someone write a bunch of code,
 only to find out that the design itself was fundamentally flawed and
 they have to go back to the drawing board and throw everything out. On
 top of that now the reviewers has to mentally decouple the idea and the
 code (unless the feature has good documentation explaining that -- sort
 of like a spec).
 
 That being said, I do think there is definitely room for improvement.
 

It really goes both ways - it's important to state the problem and the
general direction the implementation plans to take, but anything more
than that is a distraction from getting to a prototype that will tell us
more about whether the design was in fact fundamentally flawed. We have
examples of that in tree right now - stuff was written and re-written
and got better. Specs will never remove the need for that, and we should
not try to make them.

Also, throwing code away is a bit of a straw man: good code will almost
never be thrown out entirely - some bits of it, sure (that's software) -
you may change an interface to a class, or a DB schema detail here and
there, but if your code is written to be modular and does not leak
abstractions, you'll end up keeping huge bits of it through rewrites.
We need more of that!

 
 With all due respect to you Joe (and I do have a lot of respect for you)
 - I can't get behind how Nova specs puts process and documents over
 working and maintainable code. I will never be able to get behind that!
 
 
 
 So what are you proposing ultimately?  It sounds like the broad
 consensus here is: specs have made things better, but there is room for
 improvement (this is my opinion as well). Are you saying just drop specs
 all together? Because based on the discussion here, there isn't anything
 near consensus for doing that. So if we aren't going to just revert to
 how things were before specs, what do you think we should do?
 

I will follow up with a more detailed email, but in short - acknowledge
that some problems are fundamentally different than others, decide what
kind of work absolutely requires an up front discussion (API seems like
a solid candidate) and drop the blanket requirement for a detailed spec
for any work (do still require a problem statement though, maybe in a
lighter format as part of the branch).

A lot of it comes back to our release mechanism too, and is definitely
something we need to work on incrementally.

N.


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] The unbearable lightness of specs

2015-06-25 Thread Nikola Đipanov
On 06/24/2015 10:17 PM, Joe Gordon wrote:
 
 
 On Wed, Jun 24, 2015 at 11:42 AM, Kashyap Chamarthy kcham...@redhat.com
 mailto:kcham...@redhat.com wrote:
 
 On Wed, Jun 24, 2015 at 10:02:27AM -0500, Matt Riedemann wrote:
 
 
  On 6/24/2015 9:09 AM, Kashyap Chamarthy wrote:
  On Wed, Jun 24, 2015 at 02:51:38PM +0100, Nikola Đipanov wrote:
  On 06/24/2015 02:33 PM, Matt Riedemann wrote:
 
 [. . .]
 
  This is one of the _baffling_ aspects -- that a so-called super core
  has to approve specs with *no* obvious valid reasons.  As Jay Pipes
  mentioned once, this indeed seems like a vestigial remnant from old
  times.
  
  FWIW, I agree with others on this thread, Nova should get rid of this
  specific senseless non-process.  At least a couple of cycles ago.
 
  Specs were only added a couple of cycles ago... :)  And they were added to
  fill a gap, which has already been pointed out in this thread.  So if we
  remove them without a replacement for that gap, we regress.
 
 Oops, I didn't mean to say that Specs as a concept should be gone.
  Sorry for the poor phrasing.
 
 My question was answred by Joe Gordon with this review:
 
 https://review.openstack.org/#/c/184912/
 
 
 
 A bit more context:
 
 We discussed the very issue of adjusting the review rules for nova-specs
  to give all cores +2 power. But in the end we decided not to.
 

I was expecting to also read a why here, since I was not at the summit.

 As someone who does a lot of spec reviews, I take +1s from the right
 people (not always nova-cores) to mean a lot, so much that I regularly
 will simply skim the spec myself before +2ing it. If a subject matter
 expert who I trust +1s a spec, that is usually all I need. 
 
 * +1/-1s from the right people have a lot of power on specs. So the
 review burden isn't just on the people with '+W' power.  We may not have
 done a great job of making this point clear.
 * There are many subject matter experts outside nova-core who's vote
 means a lot. For example PTL's of other projects the spec impacts.


This is exactly the kind of cognitive dissonance I find hard to not get
upset about :)

Code is what matters ultimately - the devil _is_ in the details, and I
can bet you that it is extremely unlikely that a PTL of any other
project is also going to go and do a detailed review of a feature branch
in Nova, and have the understanding of the state of the surrounding
codebase needed to do it properly. That's up to the nova
reviewers to deal with. I too wish Nova code was in a place where it was
possible to just do architecture, but I think we all agree it's
nowhere near that.

With all due respect to you Joe (and I do have a lot of respect for you)
- I can't get behind how Nova specs puts process and documents over
working and maintainable code. I will never be able to get behind that!

I honestly think Nova is today worse off than it could have been, just
because of that mindset. You can't process away the hard things in
coding, sorry.

N.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova][Keystone] The unbearable lightness of specs

2015-06-25 Thread Nikola Đipanov
On 06/24/2015 09:00 PM, Adam Young wrote:
 On 06/24/2015 12:25 PM, Daniel P. Berrange wrote:
 Which happened repeatedly. You could say that
 the first patch submitted to the code repository should simply be a doc
 file addition, that describes the feature proposal and we should discuss
 that before then submitting code patches, but then that's essentially
 just the specs again, but with the spec doc in the main nova.git instead
 of nova-specs.git.
 
 Something like this, yes.   I do not like the fact that the spec and the
 code are likely to be out of sync, and that the target audience for the
 spec after the feature is implemented is vanishingly small. We should
 put the effort into docs that is currently going in to specs.
 
 But, I stand by what I said before: Gerrit is not the right tool for
 design, and specs are correspondingly owned by one person.  I think it
 is the approval part that really bugs me; the pedantry is its defining
 feature.  These are details much better hashed out in the code itself.
 
 Specs prevent code from being written.  If you think too much code is
 written, then, yes, you will like specs. If, on the other hand, you
 think that things should be implemented and tested before being posted
 to the central repo, then specs are not nearly as valuable as end user
 docs.  I think and design in Code, not in specs.  There are too many
 details that you don't discover until you actually write the code, and
 thus the specs often do not reflect the reality of the implementation
 anyway.
 

I would add here that getting said written and tested code into the
hands of users is also extremely important; that's how you get the
information on where to go next, and how real bugs get found. You want
to do that as soon as possible!

We heavily de-prioritize that with the specs process.

N.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [Nova] The unbearable lightness of specs

2015-06-24 Thread Nikola Đipanov
Hey Nova,

I'll cut to the chase and keep this email short for brevity and clarity:

Specs don't work! They do nothing to facilitate good design happening,
if anything they prevent it. The process layered on top with only a
minority (!) of cores being able to approve them, yet they are a prereq
of getting any work done, makes sure that the absolute minimum that
people can get away with will be proposed. This in turn goes and
guarantees that no good design collaboration will happen. To add insult
to injury, Gerrit and our spec template are a horrible tool for
discussing design. Also the spec format itself works for only a small
subset of design problems Nova development is faced with.

That's only a subset of the problems. Some more, you ask? OK. No clear
guidelines as to what needs a spec, which defaults to everything does.
And a spec is the absolute worst way to judge the validity of some of
the things that do require one.

Examples of the above are everywhere if you care to look for them, but
some that I've hit _this week_ are [1] (a spec for a quick and dirty fix?!
really?!), [2] (a spec stuck waiting for a single person to comment on
something that is an implementation detail, and to make matters worse the
spec is for a bug fix) and [3] (see how ill-suited the format is for a
discussion, plus complaints about grammar and spelling instead of the actual
point being made).

Nova's problem is not that it's big, it's that it's big _and_ tightly
coupled. This means no one can be trusted to navigate the mess
successfully, so we add the process to stop them. What we should be
doing is fixing the mess, and the very process is preventing that.

Don't take my word for it - ask the Gantt subteam who have been trying
to figure out the scheduler interface for almost 4 cycles now. Folks
doing Cells might have a comment on this too.

The good news is that we have a lot of stuff in place already to help us
reduce this massive coupling of everything. We have versioned objects,
we have versioned RPC. Our internal APIs are terrible, and provide no
isolation, but we have the means to iterate and figure it out.

I don't expect these process issues will get solved quickly. If it were
up to me I'd drop the whole thing, but I understand that that's not how
it's done.

I do hope this makes people think, discuss and move things in the
direction of facilitating quality software development instead of
outright preventing it. I'll follow up with some ideas on how to go
forward once a few people have commented back.

N.

PS - Before you ask - splitting out virt drivers will relieve the
pressure but won't fix the tight coupling of everything else in Nova.

[1] https://review.openstack.org/#/c/84048/
[2] https://review.openstack.org/#/c/193576/
[3] https://review.openstack.org/#/c/165838/

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] The unbearable lightness of specs

2015-06-24 Thread Nikola Đipanov
On 06/24/2015 01:42 PM, Daniel P. Berrange wrote:
 On Wed, Jun 24, 2015 at 11:28:59AM +0100, Nikola Đipanov wrote:
 Hey Nova,

 I'll cut to the chase and keep this email short for brevity and clarity:

 Specs don't work! They do nothing to facilitate good design happening,
 if anything they prevent it. The process layered on top with only a
 minority (!) of cores being able to approve them, yet they are a prereq
 of getting any work done, makes sure that the absolute minimum that
 people can get away with will be proposed. This in turn goes and
 guarantees that no good design collaboration will happen. To add insult
 to injury, Gerrit and our spec template are a horrible tool for
 discussing design. Also the spec format itself works for only a small
 subset of design problems Nova development is faced with.
 
 I'd like to see some actual evidence to back up a sweeping statement
 such as Specs don't work. They do nothing to facilitate good design happening,
 if anything they prevent it.
 
 Comparing Nova today, with Nova before specs were introduced, I think
 that specs have had a massive positive impact on the amount of good
 design and critique that is happening.
 
 Before specs, the average blueprint had no more than 3 lines of text
 in its description. Occasionally a blueprint would link to a wiki
 page or google doc with some design information, but that was very
 much the exception.
 
 When I was reviewing features in Nova before specs came along, I spent
 a lot of time just trying to figure out what on earth the code was
 actually attempting to address, because there was rarely any statement
 of the problem being addressed, or any explanation of the design that
 motivated the code.  This made life hard for reviewers trying to figure
 out if the code was acceptable to merge.  It is pretty bad for contributors
 trying to implement new features too, as they could spend weeks or months
 writing and submitting code, only to be rejected at the end because the
 (lack of any design discussions) meant they missed some aspect of the
 problem, which in turn meant all their work was in vain. That was a colossal
 waste of everyone's time and resulted in some of the very horrible code
 impl decisions we're still living with today.
 

Yes there are good reasons to have a place to discuss implementation
before doing it, especially if you are new to the project.

The behemoth we have now is not that. It's miles away as you point out
below.

 That's only a subset of problems. Some more, you ask? OK. No clear
 guidelines as to what needs a spec, that defaults to everything does.
 And spec being the absolute worst way to judge the validity of some of
 the things that do require them.

 Examples of the above are everywhere if you care to look for them but
 some that I've hit _this week_ are [1] (spec for a quick and dirty fix?!
 really?!) [2] (spec stuck waiting for a single person to comment
 something that is an implementation detail, and to make matter worse the
 spec is for a bug fix) [3] (see how ill suited the format is for a
 discussion + complaints about grammar and spelling instead of actual
 point being made).

 Nova's problem is not that it's big, it's that it's big _and_ tightly
 coupled. This means no one can be trusted to navigate the mess
 successfully, so we add the process to stop them. What we should be
 doing is fixing the mess, and the very process is preventing that.

 Don't take my word for it - ask the Gantt subteam who have been trying
 to figure out the scheduler interface for almost 4 cycles now. Folks
 doing Cells might have a comment on this too.

 The good news is that we have a lot of stuff in place already to help us
 reduce this massive coupling of everything. We have versioned objects,
 we have versioned RPC. Our internal APIs are terrible, and provide no
 isolation, but we have the means to iterate and figure it out.

 I don't expect this process issues will get solved quickly, If it were
 up to me I'd drop the whole thing, but I understand that it's not how
 it's done.

 I do hope this makes people think, discuss and move things into the
 direction of facilitating quality software development instead of
 outright preventing it. I'll follow up with some ideas on how to go
 forward once a few people have commented back.
 
 I will agree that the specs process has a number of flaws - in particular
 I think we've treated it as too much of a rigid process, resulting in it
 being very bureaucratic. In particular I think we are missing an ability to
 be pragmatic in decisions about the level of design required and whether
 specs are required. The idea of allowing blueprints without specs in
 some cases was an attempt to address this, but I don't feel it has been
 very successful - there is still too much stuff being forced through
 the specs process unnecessarily imho.
 
 I've repeatedly stated that the fact that we created an even smaller
 clique of people to approve specs (nova-drivers which is a tiny

Re: [openstack-dev] [Nova] The unbearable lightness of specs

2015-06-24 Thread Nikola Đipanov
On 06/24/2015 03:08 PM, Dan Smith wrote:
 Why do cores need approved specs for example - and indeed for many of us
 - it's just a dance we do. I refuse to believe that a core can be
 trusted to approve patches but not to write any code other than a bugfix
 without a written document explaining themselves, and then have a yet
 more exclusive group of super cores approve that. It makes no sense.
 Document it - sure. Discuss on ML/patches - by all means, but this is
 just senseless.
 
 I completely disagree with this (and find it offensive). Cores are not
 gods. They review things and try to do their best to keep quality high.
 However, that does not mean that they can single-handedly design and
 implement a large complex feature on their own without feedback. It's
 the same reason that cores need other cores to review their code.


So I disagree that Gerrit/specs is a good way to do this. Also - cores
have no problem figuring out how to validate designs without the
rigidity of the process; they know each other personally, as well as all
the other people (domain experts, stakeholders) who could help.

Given unlimited resources, sure - but there is a ton of stuff cores/long-term
lurkers could be doing without having to go through the same rigid
process.

 As a core, I rarely get patches in without iterating at least once due
 to feedback, and I certainly don't land blueprints without scrutiny from
 others. To me, cores having their code and specs reviewed is not a
 dance we do. Is that your main complaint? That you, a core, have to
 have your specs reviewed?
 

Code - by all means. Be prepared to rewrite everything at least once (I
have on several occasions).

But refactoring work, for example (which cores as maintainers do), just
makes no sense to push through this rigid process, which produces
release-planning artifacts as part of its output. It's wasteful. Guess what
falls into this category? All the tech debt we've been raving about the
previous cycle.

 Next - why do priority features need an approved spec? We all know we
 want to do it, just design it up on an etherpad/wiki/trello/whatever if
 needed, write code and discuss there.
 
 Because review of the design is important?
 

I feel this discussion is not very productive at this point.

Of course documentation is good and review of design is important - no
one is questioning that!

I am complaining about the fact that we try to stick all of this into a
rigid process that also tries to help with 10 other things, and thus
creates a massive bottleneck that ends up being a downward spiral.

N.



Re: [openstack-dev] [Nova] The unbearable lightness of specs

2015-06-24 Thread Nikola Đipanov
On 06/24/2015 02:33 PM, Matt Riedemann wrote:
 
 
 On 6/24/2015 8:23 AM, Sahid Orentino Ferdjaoui wrote:
 On Wed, Jun 24, 2015 at 11:28:59AM +0100, Nikola Đipanov wrote:
 Hey Nova,

 I'll cut to the chase and keep this email short for brevity and clarity:

 Specs don't work! They do nothing to facilitate good design happening,
 if anything they prevent it. The process layered on top with only a
 minority (!) of cores being able to approve them, yet they are a prereq
 of getting any work done, makes sure that the absolute minimum that
 people can get away with will be proposed. This in turn goes and
 guarantees that no good design collaboration will happen. To add insult
 to injury, Gerrit and our spec template are a horrible tool for
 discussing design. Also the spec format itself works for only a small
 subset of design problems Nova development is faced with.

 I do not consider specs don't work, personnaly I refer myself to this
 relatively good documentation [1] instead of to dig in code to
 remember how work a feature early introduced.

 I guess we have some efforts to do about the level of details we want
 before a spec is approved. We should just consider the general
 idea/design, options introduced, API changed and keep in mind the
 contributors who will implement the feature can/have to update it
 during the developpement phase.

 [1] http://specs.openstack.org/openstack/nova-specs/specs/kilo/

 s.


 
 I agree completely. The nicely rendered feature docs which is a
 byproduct of the specs process in gerrit is a great part of it. So when
 someone is trying to use a new feature or trying to fix a bug in said
 feature 1-2 years later and trying to understand the big picture idea,
 they can refer to the original design spec - assuming it was accurate at
 the time that the code was actually merged. Like you said, it's
 important to keep the specs up to date based on what was actually
 approved in the code.
 

Of course documentation is good. Make that kind of docs a requirement
for merging a feature, by all means.

But the approval process we have now is just backwards. Its only result
is preventing useful work from getting done.

In addition to what Daniel mentioned elsewhere:

Why do cores need approved specs for example - and indeed for many of us
- it's just a dance we do. I refuse to believe that a core can be
trusted to approve patches but not to write any code other than a bugfix
without a written document explaining themselves, and then have a yet
more exclusive group of super cores approve that. It makes no sense.
Document it - sure. Discuss on ML/patches - by all means, but this is
just senseless.

Next - why do priority features need an approved spec? We all know we
want to do it, just design it up on an etherpad/wiki/trello/whatever if
needed, write code and discuss there.

Instead we try to shoehorn PM, design docs, design discussion, release
planning, and a kitchen sink into a rigid inflexible process.

Docs - YES, process over anything - No, thanks!

N.



Re: [openstack-dev] [Nova] The unbearable lightness of specs

2015-06-24 Thread Nikola Đipanov
On 06/24/2015 04:42 PM, Andrew Laski wrote:
 On 06/24/15 at 02:38pm, Nikola Đipanov wrote:
 On 06/24/2015 01:42 PM, Daniel P. Berrange wrote:
 On Wed, Jun 24, 2015 at 11:28:59AM +0100, Nikola Đipanov wrote:
 Hey Nova,

 I'll cut to the chase and keep this email short for brevity and
 clarity:

 Specs don't work! They do nothing to facilitate good design happening,
 if anything they prevent it. The process layered on top with only a
 minority (!) of cores being able to approve them, yet they are a prereq
 of getting any work done, makes sure that the absolute minimum that
 people can get away with will be proposed. This in turn goes and
 guarantees that no good design collaboration will happen. To add insult
 to injury, Gerrit and our spec template are a horrible tool for
 discussing design. Also the spec format itself works for only a small
 subset of design problems Nova development is faced with.

 I'd like to see some actual evidence to backup a sweeping statement
 as Specs dont work. They do nothing to facilitate good design
 happening,
 if anything they prevent it.

 Comparing Nova today, with Nova before specs were introduced, I think
 that specs have had a massive positive impact on the amount of good
 design and critique that is happening.

 Before specs, the average blueprint had no more than 3 lines of text
 in its description. Occassionally a blueprint would link to a wiki
 page or google doc with some design information, but that was very
 much the exception.

 When I was reviewing features in Nova before specs came along, I spent
 alot of time just trying to figure out what on earth the code was
 actually attempting to address, because there was rarely any statement
 of the problem being addressed, or any explanation of the design that
 motivated the code.  This made life hard for reviewers trying to figure
 out if the code was acceptable to merge.  It is pretty bad for
 contributors
 trying to implement new features too, as they could spend weeks or
 months
 writing and submitting code, only to be rejected at the end because the
 (lack of any design discussions) meant they missed some aspect of the
 problem which in turn meant all their work was in vain. That was a
 collosal
 waste of everyone's time and resulted in some of the very horrible code
 impl decisions we're still living with today.


 Yes there are good reasons to have a place to discuss implementation
 before doing it, especially if you are new to the project.
 
 This has been a huge boon.  And I don't think the benefits are realized
 only for developers new to the project.
 
 It seems to me that the challenge going forward is how to keep this
 benefit while reducing/eliminating some of the drawbacks.  But dropping
 specs without a replacement would eliminate both the benefit and the
 drawbacks.
 

I can agree with this. But...

The problem is that changing the process can really only be done two times a
year. This is nowhere near flexible enough, and once a bad process is in
place we seem to stick to it almost as if it were a public API :) to Nova
development.

This lack of flexibility due to communication overhead is why we should
err on the side of as little formal process as possible, which is the
opposite of what we've been doing.


 The behemoth we have now is not that. It's miles away as you point out
 below.

 That's only a subset of problems. Some more, you ask? OK. No clear
 guidelines as to what needs a spec, that defaults to everything does.
 And spec being the absolute worst way to judge the validity of some of
 the things that do require them.

 Examples of the above are everywhere if you care to look for them but
 some that I've hit _this week_ are [1] (spec for a quick and dirty
 fix?!
 really?!) [2] (spec stuck waiting for a single person to comment
 something that is an implementation detail, and to make matter worse
 the
 spec is for a bug fix) [3] (see how ill suited the format is for a
 discussion + complaints about grammar and spelling instead of actual
 point being made).

 Nova's problem is not that it's big, it's that it's big _and_ tightly
 coupled. This means no one can be trusted to navigate the mess
 successfully, so we add the process to stop them. What we should be
 doing is fixing the mess, and the very process is preventing that.

 Don't take my word for it - ask the Gantt subteam who have been trying
 to figure out the scheduler interface for almost 4 cycles now. Folks
 doing Cells might have a comment on this too.

 The good news is that we have a lot of stuff in place already to
 help us
 reduce this massive coupling of everything. We have versioned objects,
 we have versioned RPC. Our internal APIs are terrible, and provide no
 isolation, but we have the means to iterate and figure it out.

 I don't expect this process issues will get solved quickly, If it were
 up to me I'd drop the whole thing, but I understand that it's not how
 it's done.

 I do hope this makes people think, discuss and move things

Re: [openstack-dev] [nova] Availability of device names for operations with volumes and BDM and other features.

2015-06-11 Thread Nikola Đipanov
On 06/02/2015 01:39 PM, Alexandre Levine wrote:
 Thank you Nikola.
 
 We'll be adding the required tickets and will follow your reviews,
 however the person working primarily on this subject (Feodor Tersin) is
 out for his vacation for a couple of weeks so some of our responses
 might be delayed until then. Still we'll try to do whatever can be done
 without him at the time being.
 
 Best regards,
   Alex Levine
 

Hi guys - I have some fixes posted [1]

As said below - reviews, and especially testing would be greatly
appreciated (even before they are merged).

NB: there is still some work needed to fix [2] on your side even if we
decide that my proposed approach is something we want to go with. See
comments on the bug and the related proposed patch for more information.

Thanks,
N.

[1]
https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bug/1370250,n,z
[2] https://bugs.launchpad.net/nova/+bug/1370250

 
 On 5/29/15 11:32 PM, Nikola Đipanov wrote:
 On 05/29/2015 12:55 AM, Feodor Tersin wrote:
 Nicola, i would add some words to Alexandre repsonse.

 We (standalone ec2api project guys) have filed some bugs (the main is
 [1]), but we don't know how to fix them since the way Nova's device
 names are moved on is unclear for us. Neither BP, nor wiki you've
 mentioned above don't explain what was happened with device names in
 images.
 Other bug which we filed by results of bdm v2 implementation [2] was
 resolved, but the fix returns only two devices (even if more than two
 volumes are defined in the image) instead of to write device names to
 the image and to return full bdm.

 I hope you will clarify this question (Alexandre referred to the patch
 with explicit elimination of device names for images).

 Also you mentioned that we can still use bdm v1. We do it now for
 instance launch, but we would like to switch to v2 to use new features
 like blank volumes which are provided by AWS as well. However v2 based
 launch has a suspicious feature which i asked about in ML [3], but no
 one answered me. It would be great if you clarify that question too.

 [1] https://bugs.launchpad.net/nova/+bug/1370177
 [2] https://bugs.launchpad.net/nova/+bug/1370265
 [3]
 http://lists.openstack.org/pipermail/openstack-dev/2015-May/063769.html

 Hey Feodor and Alexandre - Thanks for the detailed information!

 I have already commented on some of the bugs above and provided a small
 patch that I think fixes one bit of it. As described on the bug - device
 names might be a bit trickier, but I hope to have something posted next
 week.

 Help with testing (while patches are in review) would be hugely
 appreciated!

 On 05/28/2015 02:24 PM, Alexandre Levine wrote:
 1. RunInstance. Change parameters of devices during instance booting
 from image. In Grizzly it worked so we could specify changed BDM in
 parameters, it overwrote in nova DB the one coming from image and then
 started the instance with new parameters. The only key for addressing
 devices in this use case is the very device name. And now we don't have
 it for the volumes in BDM coming from the image, because nova stopped
 putting this information into the image.

 http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_RunInstances.html

 2. Devices names for Xen-backed instances to work fully. We should be
 able to specify required device names during initial instance creation,
 they should be stored into an image when the instance is shapshotted, we
 can fetch info and change parameters of such volume during subsequent
 operations, and the device names inside the instance should be named
 exactly.

 3. DescribeInstances and DescribeInstanceAttributes to return BDM with
 device names ideally corresponding to actual device naming in instance.

 http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstances.html


 http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstanceAttribute.html


 4. DescribeImages and DescribeImageAttributes to return BDM with device
 names ideally corresponding to the ones in instance before snapshotting.

 http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeImages.html


 http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeImageAttribute.html

 I think all of the above is pretty much covered by
 https://bugs.launchpad.net/nova/+bug/1370177

 5. AttachVolume with the specified device name.

 http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_AttachVolume.html

 6. ModifyInstanceAttribute with BDM as parameter.

 http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_ModifyInstanceAttribute.html


 7. ModifyImageAttribute with BDM as parameter.

 http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_ModifyImageAttribute.html


 I am not sure about these 3 cases - would it be possible to actually
 report bugs for them as I don't think I have enough information this way.

 N

Re: [openstack-dev] [nova][scheduler] Updating Our Concept of Resources

2015-06-03 Thread Nikola Đipanov
On 06/03/2015 02:13 PM, John Garbutt wrote:
 On 3 June 2015 at 13:53, Ed Leafe e...@leafe.com wrote:
 On Jun 2, 2015, at 5:58 AM, Alexis Lee alex...@hp.com wrote:

 If you allocate all the memory of a box to high-mem instances, you may
 not be billing for all the CPU and disk which are now unusable. That's
 why flavors were introduced, afaik, and it's still a valid need.

 So we had a very good discussion at the weekly IRC meeting for the 
 Scheduler, and we agreed to follow that up here on the ML. One thing that 
 came up, noted in the quote above, is that I gave the impression in my first 
 email that I thought flavors were useless. I think I did a better job in the 
 original blog post of explaining that flavors are a great way to handle the 
 sane division of a resource like a compute node. The issue I have with 
 flavors is that we seem to be locked into the everything that can be 
 requested has to fit into the flavor, and that really doesn't make sense.

 Another concern was from the cloud provider's POV, which makes a flavor a 
 convenient way of packaging cloud resources for sale. The customer can 
 simply say give me one of these to specify a complex combination of 
 virtualized resources. That's great, but it means that there has to be a 
 flavor for every possible permutation of resources. If you restricted 
 flavors to only represent the sane ways of dividing up compute nodes, any 
 other features could be add-ons to the request. Something like ordering a 
 pizza: offer the customer a fixed choice of sizes, but then let them specify 
 any toppings in whatever combination they want. That's certainly more sane 
 than presenting them with a menu with hundreds of pizza flavors, each 
 representing a different size/topping combination.
 
 I feel there is a lot to be said for treating consumable resources
 very separately to free options.
 
 For example grouping the vCPUs into sockets can be free in terms of
 capacity planning, so is a valid optional add on (assuming you are not
 doing some level of pinning to match that).
 
 For things where you are trying to find a specific compute node, that
 kind of attribute has clear capacity planning concerns, and is likely
 to have a specific cost associated with it. So we need to make sure
 its clear how that cost concept can be layered on top of the Nova API.
 For example os_type often changes the cost, and is implemented on
 top of flavors using a combination of protected image properties on
 glance and the way snapshots inherit image properties.
 
 I totally agree the scheduler doesn't have to know anything about
 flavors though. We should push them out to request validation in the
 Nova API. This can be considered part of cleaning up the scheduler API.

 This idea was also discussed and seemed to get a lot of support. Basically, 
 it means that by the time the request hits the scheduler, there is no 
 flavor anymore; instead, the scheduler gets a request for so much RAM, so 
 much disk, etc., and these amounts have already been validated at the API 
 layer. So a customer requests a flavor just like they do now, and the API 
 has the responsibility to verify that the flavor is valid, but then 
 unpacks the flavor into its components and passes that on to compute. The 
 end result is the same, but there would be no more need to store flavors 
 anywhere but the front end. This has the added benefit of eliminating the 
 problem with new flavors being propagated down to cells, since they would no 
 longer need to have to translate what flavor X means. Don Dugger 
 volunteered to write up a spec for removing flavors from the scheduler.

 
 +1 for Nova translating the incoming request to a resource request
 the scheduler understands, given the resources it knows about.
 
 I would look at scoping that to compute resources, so its easier to
 add volume and network into that request at a later date.
 

I also agree with this pretty much completely. I feel that the single
thing that made some of the scheduler discussions drag on for months is
our lack of willingness to bite off the big chunk, which is coming up with
a solid API to the scheduler.

Starting from nouns and verbs - it definitely seems like a good idea to
pass in the _requested_ resources to a scheduler that knows about
_available_ resources. [1] seems like an excellent start.
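
To make that concrete - a rough sketch of what an unpacked request could
look like by the time it reaches the scheduler (the field names here are
purely illustrative, not the actual resource object format from [1]):

    request = {
        'memory_mb': 2048,    # amounts unpacked from the flavor by the API
        'vcpus': 2,
        'root_gb': 20,
        'ephemeral_gb': 0,
        # "free" add-ons that don't affect capacity would ride alongside
        # instead of being baked into hundreds of flavor permutations
        'hw:cpu_sockets': 2,
    }

The scheduler then only ever deals in amounts of resources it knows how
to track, and never sees the flavor itself.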

I seem to remember Jay discussing at one point that not all of the
things we want the scheduler to know about make sense to be modelled as
resources (running instances for example) and it made a lot of sense to
me, but it seems like it's the kind of thing that would be the easiest
to figure out once you see the code (I also don't see it mentioned in
[1] but I assume Jay dropped it to keep the scope of that BP manageable).

N.

[1]
https://review.openstack.org/#/c/184534/1/specs/liberty/approved/resource-objects.rst

PS. I feel that exactly this type of work such as figuring out an API
for a component, would get done way quicker if 

Re: [openstack-dev] [nova] RequestSpec object and Instance model

2015-06-03 Thread Nikola Đipanov
On 06/02/2015 03:14 PM, Sylvain Bauza wrote:
 Hi,
 
 Currently working on implementing the RequestSpec object BP [1], I had
 some cool comments on my change here :
 https://review.openstack.org/#/c/145528/12/nova/objects/request_spec.py,cm
 
 Since we didn't discussed on how to persist that RequestSpec object, I
 think the comment is valuable.
 
 For the moment, the only agreed spec for persisting the object that we
 have is [2] but there is also a corollar here that means that we would
 have to persist more than the current fields
 https://review.openstack.org/#/c/169901/3/specs/liberty/approved/add-buildrequest-obj.rst,cm
 
 
 So, there are 2 possibilities :
  #1, we only persist the RequestSpec for the sole Scheduler and in that
 case, we can leave as it is - only a few fields from Instance are stored
  #2, we consider that RequestSpec can be used for more than just the
 Scheduler, and then we need to make sure that we will have all the
 instance fields then.
 

So these are the two possibilities if we agree that we need to make progress
on the spec as it is defined and merged now. What I was complaining about
yesterday is that we don't seem to have done enough high-level
investigation into this stuff before embarking on writing a set of specs
that, due to their format, obscure the problems we are actually
trying to solve.

Work around the scheduler touches on a lot of issues that have only
recently been noticed. While I am all for the incremental approach, it
seems silly to completely disregard the issues we already know about. We
should have a high level overview of the problems we know we want to
solve, and then come up with an incremental way of solving them, but not
without keeping an eye on the big picture at all times.

An ad-hoc list of individual issues that we know about and should be
trying to solve (in no particular order) that all seem related to the
data model design problem we are trying to take a stab at here:

1/ RequestSpec is an unversioned dict even though it's the central piece
of a placement request for the scheduler
2/ There are scheduler_hints that are needed throughout the lifecycle of
an instance but are never persisted so are lost after boot
3/ We have the Migration objects that are used both for resource
tracking for instances being migrated, and as an indication of an
instance being moved, but are not used in all the places we need this
kind of book keeping (live migration, rebuild)
4/ Evacuate (an orchestrated rebuild) is especially problematic because
it usually involves failure modes, which are difficult to identify and
handle properly without a consistently used data model.
5/ Some of the recently added constraints that influence resource
tracking (NUMA, CPU pinning) cannot simply be calculated from the flavor
on the fly when tracking resources, but need to be persisted after a
successful claim as they are dependent on the state of the host at that
very moment (see [1])
6/ Related to the previous one - there is data related to the instance,
in addition to the flavor, that needs to follow the '_old' and '_new'
pattern (i.e. the values related to both the source and destination host
need to be persisted during a migration/resize/live migration)
7/ The issues cells v2 folks are hitting (mentioned above) where they
don't want to have any Instances in the top level cell but still need to
persist stuff.
8/ Issues with having no access to individual instance UUIDs in the
scheduler, but a lot of data access for more complex filtering revolves
around it being present.

Most of the above have individual bugs that I can try to find and link
here too.

[1] https://bugs.launchpad.net/nova/+bug/1417667

The overall theme of all the above is (to paraphrase alaski from IRC)
how to organize the big blob of data that is an instance in all of its
possible states, in such a way that it makes sense, nothing is missing,
there is as little duplication as possible, and the access patterns of
different services that require different bits can work without massive
overhead.

 
 I'm not strongly opiniated on that, I maybe consider that #2 is probably
 the best option but there is a tie in my mind. Help me figuring out
 what's the best option.
 

If we want to keep things moving forward on this particular BP - I'd go
with adding the RequestSpec object and make sure the code that uses it
is migrated. I believe that spike alone will leave us with a much better
idea of the problem.

In addition - writing a high level spec/wiki that we can refer back to
in individual BPs and see how they solve it would be massively helpful too.

N.

 -Sylvain
 
 [1] :
 http://specs.openstack.org/openstack/nova-specs/specs/liberty/approved/request-spec-object.html
 
 [2] :
 http://specs.openstack.org/openstack/nova-specs/specs/liberty/approved/persist-request-spec.html
 
 
 

Re: [openstack-dev] [all][infra][tc][ptl] Scaling up code review process (subdir cores)

2015-06-03 Thread Nikola Đipanov
On 06/03/2015 02:43 PM, Boris Pavlovic wrote:
 
 I don't believe even my self, because I am human and I make mistakes. 
 My goal on the PTL position is to make such process that stops human
 mistakes before they land in master. In other words  everything should be
 automated and pre not post checked. 
 

I used to believe exactly this some time ago - but I don't anymore. Lack
of bugs is not what makes good software (tho you don't want too many of
them :) ).

Focusing on bugs and automation to avoid them is misguided, and so is
the idea that code review is there to spot bugs before they land in
tree. Code reviewers should make sure that the abstractions are solid,
the code is modular, readable and maintainable - exactly the stuff
machines (still?) can't do (*).

This was one of the arguments against doing exactly what you propose in
Nova - we want the same (high?) level of reviews in all parts of the
code, and strong familiarity with the whole.

But I think it's failing - Nova is just too big - and there are not
enough skilled people to do the work without a massive scope reduction.

I am not sure how to fix it TBH (tho my gut feeling says we should
loosen not tighten the constraints).

N.

(*) Machines can run automated tests to find bugs, but tests are also
software that needs reviewing, maintaining and testing... so you want to
make sure you spend your finite resources catching the right kind of bugs.



Re: [openstack-dev] [all][infra][tc][ptl] Scaling up code review process (subdir cores)

2015-06-03 Thread Nikola Đipanov
On 06/03/2015 05:57 PM, John Garbutt wrote:
 +1 to ttx and Jame's points on trust and relationships, indeed
 referencing the summit session that ttx mentioned:
 https://etherpad.openstack.org/p/liberty-cross-project-in-team-scaling
 
 On 3 June 2015 at 16:01, Nikola Đipanov ndipa...@redhat.com wrote:
 On 06/03/2015 02:43 PM, Boris Pavlovic wrote:

 I don't believe even my self, because I am human and I make mistakes.
 My goal on the PTL position is to make such process that stops human
 mistakes before they land in master. In other words  everything should be
 automated and pre not post checked.


 I used to believe exactly this some time ago - but I don't anymore. Lack
 of bugs is not what makes good software (tho you don't want too many of
 them :) ).

 Focusing on bugs and automation to avoid them is misguided
 
 Before we did this, most times I pulled master, I was unable to boot a
 VM. Its nice we fixed that.
 History has proven that anything we don't test gets broken very quickly.
 

Obviously I am not against automated tests. The point I tried to
(poorly) make is that the reason to use CI should not be to end bugs
forever, since that is never going to happen, and that it also has a cost
which needs to be considered.

All of this is actually off topic.

N.




Re: [openstack-dev] [nova] Availability of device names for operations with volumes and BDM and other features.

2015-05-29 Thread Nikola Đipanov
On 05/29/2015 12:55 AM, Feodor Tersin wrote:
 Nicola, i would add some words to Alexandre repsonse.
 
 We (standalone ec2api project guys) have filed some bugs (the main is
 [1]), but we don't know how to fix them since the way Nova's device
 names are moved on is unclear for us. Neither BP, nor wiki you've
 mentioned above don't explain what was happened with device names in images.
 Other bug which we filed by results of bdm v2 implementation [2] was
 resolved, but the fix returns only two devices (even if more than two
 volumes are defined in the image) instead of to write device names to
 the image and to return full bdm.
 
 I hope you will clarify this question (Alexandre referred to the patch
 with explicit elimination of device names for images).
 
 Also you mentioned that we can still use bdm v1. We do it now for
 instance launch, but we would like to switch to v2 to use new features
 like blank volumes which are provided by AWS as well. However v2 based
 launch has a suspicious feature which i asked about in ML [3], but no
 one answered me. It would be great if you clarify that question too.
 
 [1] https://bugs.launchpad.net/nova/+bug/1370177
 [2] https://bugs.launchpad.net/nova/+bug/1370265
 [3] http://lists.openstack.org/pipermail/openstack-dev/2015-May/063769.html
 

Hey Feodor and Alexandre - Thanks for the detailed information!

I have already commented on some of the bugs above and provided a small
patch that I think fixes one bit of it. As described on the bug - device
names might be a bit trickier, but I hope to have something posted next
week.

Help with testing (while patches are in review) would be hugely appreciated!

On 05/28/2015 02:24 PM, Alexandre Levine wrote:
 1. RunInstance. Change parameters of devices during instance booting
 from image. In Grizzly it worked so we could specify changed BDM in
 parameters, it overwrote in nova DB the one coming from image and then
 started the instance with new parameters. The only key for addressing
 devices in this use case is the very device name. And now we don't have
 it for the volumes in BDM coming from the image, because nova stopped
 putting this information into the image.

http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_RunInstances.html

 2. Devices names for Xen-backed instances to work fully. We should be
 able to specify required device names during initial instance creation,
 they should be stored into an image when the instance is shapshotted, we
 can fetch info and change parameters of such volume during subsequent
 operations, and the device names inside the instance should be named
 exactly.

 3. DescribeInstances and DescribeInstanceAttributes to return BDM with
 device names ideally corresponding to actual device naming in instance.

http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstances.html


http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstanceAttribute.html


 4. DescribeImages and DescribeImageAttributes to return BDM with device
 names ideally corresponding to the ones in instance before snapshotting.

http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeImages.html


http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeImageAttribute.html


I think all of the above is pretty much covered by
https://bugs.launchpad.net/nova/+bug/1370177


 5. AttachVolume with the specified device name.

http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_AttachVolume.html

 6. ModifyInstanceAttribute with BDM as parameter.

http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_ModifyInstanceAttribute.html


 7. ModifyImageAttribute with BDM as parameter.

http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_ModifyImageAttribute.html

I am not sure about these 3 cases - would it be possible to actually
report bugs for them as I don't think I have enough information this way.

N.



Re: [openstack-dev] [nova] Availability of device names for operations with volumes and BDM and other features.

2015-05-27 Thread Nikola Đipanov
On 05/27/2015 09:47 AM, Alexandre Levine wrote:
 Hi all,
 
 I'd like to bring up this matter again, although it was at some extent
 discussed during the recent summit.
 
 The problem arises from the fact that the functionality exposing device
 names for usage through public APIs is deteriorating in nova. It's being
 deliberately removed because as I understand, it doesn't universally and
 consistently work in all of the backends. It happens  since IceHouse and
 introduction of bdm v2. The following very recent review is one of the
 ongoing efforts in this direction:
 https://review.openstack.org/#/c/185438/
 

I've abandoned the change as it is clear we need to discuss how to go
about this some more.

But first let me try to give a bit more detailed explanation and
background on what the deal is with device names. Supplying device names
that will be honoured by the guests is really only possible with Xen PV
guests (meaning the guest needs to be running a PV-enabled kernel and
drivers).

Back in Havana, when we were working on [1] (see [2] for more details),
the basic idea was that we would still accept device names, because
removing them from the public API is not likely to happen (mostly
because of EC2 compatibility), but in the case of the libvirt driver we
would treat them as hints only and provide our own (by mostly
replicating the logic libvirt uses to order devices [3]). We also
allowed for device names not to be specified by the user, as this is
really what anyone not using the EC2 API should be doing (users of the
EC2 API do, however, need to be aware of the fact that the name may not
be honoured).

[1]
https://blueprints.launchpad.net/nova/+spec/improve-block-device-handling
[2] https://wiki.openstack.org/wiki/BlockDeviceConfig
[3]
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/blockinfo.py

 The reason for my concern is that EC2 API have some important cases
 relying on this information (some of them have no workarounds). Namely:
 1. Change of parameters set by image for instance booting.
 2. Showing instance's devices information by euca2ools.
 3. Providing additional volumes for instance booting
 4. Attaching volume
 etc...
 

So based on the above - it seems to me that you think we are removing
the information about device names completely. That's not the case -
currently it is simply not mandatory for the Nova boot API call (it was
never mandatory for volume attach afaict) - you can still pass it in,
though libvirt may not honour it. It will still be tracked by the Nova
DB and available for users to refer to.
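
As a quick illustration (IDs and names invented, exact client syntax may
vary by release), both of these are accepted today - with the libvirt
driver the device name in the first call is only a hint and may be
silently overridden:

    nova boot test-vm --flavor m1.small --image IMAGE_UUID \
        --block-device id=VOL_UUID,source=volume,dest=volume,device=vdb,bootindex=1

    nova boot test-vm --flavor m1.small --image IMAGE_UUID \
        --block-device id=VOL_UUID,source=volume,dest=volume,bootindex=1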

 Related to device names and additional related features we have troubles
 with now:
 1. All device name related features

As I said - they are not removed; in addition, you can still completely
disregard the BDMv2 syntax, as Nova should transparently handle the
old-style syntax when passed in (actually, since BDM info is stored with
images when snapshotting and it may have been v1 syntax, it is likely
that we will never remove this support). If you are seeing some bugs
related to this - please report them.
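
To make the old/new distinction concrete, here is a rough sketch (values
invented, field list trimmed) of the same volume expressed both ways - a
legacy block_device_mapping entry and its block_device_mapping_v2
equivalent, which Nova converts between internally:

    # legacy (v1) syntax
    {'device_name': '/dev/vdb',
     'volume_id': 'VOL_UUID',
     'delete_on_termination': False}

    # v2 syntax
    {'source_type': 'volume',
     'destination_type': 'volume',
     'uuid': 'VOL_UUID',
     'boot_index': 1,
     'device_name': '/dev/vdb',   # optional, and only a hint for libvirt
     'delete_on_termination': False}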

 2. Modification of deleteOnTermination flag

I don't have enough details on this but if some behaviour has changed
when using the old syntax - it is likely a bug so please report it.

 3. Modification of parameters for instance booting

Again - I am not sure what this is related to exactly - but none of the
parameters have changed really (only new ones were added). It would be
good to get more information on this (preferably a bug report).

 4. deleteOnTermination and size of volume aren't stored into instance
 snapshots now.
 

This does sound like a bug - and hopefully an easy to fix one.

 Discussions during the summit on the matter were complicated because
 nobody present really understood in details why and what is happening
 with this functionality in nova. It was decided though, that overall
 direction would be to add necessary features or restore them unless
 there is something really showstopping:
 https://etherpad.openstack.org/p/YVR-nova-contributor-meetup
 
 As I understand, Nikola Depanov is the one working on the matter for
 some time obviously is the best person who can help to resolve the
 situation. Nikola, if possible, could you help with it and clarify the
 issue.
 
 My suggestion, based on my limited knowledge at the moment, still is to
 restore back or add all of the necessary APIs and provide tickets or
 known issues for the cases where the functionality is suffering from the
 backend limitations.
 
 Please let me know what you think.
 

As explained above - nothing was intentionally removed, and if something
broke - it's a bug that we should fix, so I urge the team behind the EC2
API on stackforge to report those, and I will try to at least look into
them, if not fix them. We might want to have a tag for EC2 related bugs
in LP (I seem to remember there being such a thing before).

Device names though are not something we can easily resolve without
having the users 

Re: [openstack-dev] [nova] usefulness of device parameter at volumeAttachment

2015-05-26 Thread Nikola Đipanov
On 05/26/2015 11:13 AM, Daniel P. Berrange wrote:
 On Sat, May 23, 2015 at 11:00:32AM +0200, Géza Gémes wrote:
 Hi,

 When someone calls nova volume-attach or the block-device-mapping parameter
 at boot, it is possible to specify a device name for the guest. However I
 couldn't find any guest OS which would honor this. E.g. with libvirt/kvm, if
 the guest has two virtio disks already (vda and vdb), specifying vdf would
 be ignored and the disk will be attached as vdc in the guest.
 I propose to deprecate this option and at boot where it is not optional to
 accept only auto as an option.
 
 This was a design mistake in the original API which we can't now remove
 without breaking backcompatibility. While it is still supported, many
 hypervisors will completely ignore it, so we discourage people from ever
 using it. Just allow the hypervisor and/or guest OS to pick the device
 name
 


I have just proposed this yesterday:

https://review.openstack.org/#/c/185438/

Removing it from the API is a little bit trickier because it will
require a backwards incompatible version bump (this is allowed and all
good), but is probably the right thing to do in the long run. I will see
if I can propose a patch for this.
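
For illustration (IDs invented) - this is the call in question; with
libvirt/kvm the device argument is a hint at best, so 'auto' is the
safer thing to pass today:

    nova volume-attach SERVER_UUID VOLUME_UUID /dev/vdf
    nova volume-attach SERVER_UUID VOLUME_UUID auto

The first form may still show up as vdc inside the guest, while the
second simply lets the hypervisor and guest pick the name.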

N.



Re: [openstack-dev] [nova] Proposal to add Melanie Witt to nova-core

2015-04-30 Thread Nikola Đipanov
On 04/30/2015 12:30 PM, John Garbutt wrote:
 Hi,
 
 I propose we add Melanie to nova-core.
 
 She has been consistently doing great quality code reviews[1],
 alongside a wide array of other really valuable contributions to the
 Nova project.
 
 Please respond with comments, +1s, or objections within one week.
 

+1

N.



Re: [openstack-dev] [Nova][QA] Use case of Nova to boot VM from volume only

2015-04-03 Thread Nikola Đipanov
On 04/03/2015 07:33 AM, GHANSHYAM MANN wrote:
 Hi,
 
 This is regarding bug - https://bugs.launchpad.net/tempest/+bug/1436314
 
 When Nova is configured to boot VM from volume only, current Tempest
 integration tests will fails. We are not sure how feasible and valid
 this configuration is for Nova and what are the use cases of this.
 
 Before starting support this in Tempest, we would like to get some
 feedback and use cases about such configuration on Nova side.
 
 I am trying it by setting max_local_block_device to 0 and tests fails
 with 400 error[1]. Is that the right configuration to make boot from
 volume only or something else there on Nova config.
 
 1- ERROR (BadRequest): Block Device Mapping is Invalid: You specified
 more local devices than the limit allows (HTTP 400) (Request-ID:
 req-3ef100c7-b5c5-4a2d-a5da-8344726336e2)
 

Hey - I've responded on the bug. Let me know if the clarification makes
sense to you.
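
For anyone else following along, this is the setup I understand is being
described - a sketch only, treat the exact option semantics as an
assumption and see the bug for the authoritative details:

    # nova.conf
    [DEFAULT]
    max_local_block_device = 0

    # boot with the image copied into a new volume instead of a local disk
    nova boot test-vm --flavor m1.small \
        --block-device source=image,id=IMAGE_UUID,dest=volume,size=10,bootindex=0,shutdown=remove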

Cheers,
Nikola




Re: [openstack-dev] [nova] Revert objects: introduce numa topology limits objects

2015-03-24 Thread Nikola Đipanov
On 03/23/2015 02:06 PM, Dan Smith wrote:
 I am really sorry it got in as I have -1ed it several times for the same
 reason (I _really_ hate using the -2 hammer - we're all adults here
 after all).
 
 I guess that I should take some blame as a reviewer on that patch, but
 only after this mail do I read some of your comments as fundamentally
 opposed. The one that really articulates it wasn't a new vote so it
 stood out even less. IMHO, -2 is precisely for This shouldn't land as
 it is so would have been completely appropriate for this situation.
 It's a meaningful signal and has nothing to do with the age of the
 participants.
 
 My reasoning for it is quite simple and is outlined in the revert patch
 commit message:

   https://review.openstack.org/#/c/166767/

 The reason for bringing this up on the email thread is that as a result
 we need to downgrade the RPC that has technically been released (k-3).

 Let me know what you think.
 
 I don't think we should revert it. Doing so will be quite messy. I think
 we have a couple of options:
 
 1. Leave it as-is. Especially since we are able to synthesize the old
 call when necessary, it seems clear that we haven't lost any information
 here. We deal with it, roll forward and fix it in L.
 
 2. We add to the object, essentially deprecating the ratio fields that
 you feel are problematic, and pass the data that you really want. That
 way we have a small window of compatibility that we can drop after we
 snap kilo.
 
 #1 requires no work now, but more work later; #2 requires quite a bit of
 work now, which might be scary, but makes life easier in the long run.
 
 Given where we are, and since I don't really see this as a
 sky-is-falling sort of thing, I think I'd err on the side of caution and
 go with #1. A flat-out revert either requires us to ban an RPC version
 (something we've never done, AFAIK) or just flat out roll back time and
 pretend it never happened.
 

Thanks for taking a look.

Yes, agreed - it is probably better to focus on actual bugs that impact
customers at this point.

I will abandon the reverts, and work on proposing the fix-up for L.

Cheers,
Nikola




Re: [openstack-dev] [nova] Block Device Mapping is Invalid error

2015-03-18 Thread Nikola Đipanov
On 03/16/2015 03:55 PM, aburluka wrote:
 Hello Nova!
 
 I'd like to ask community to help me with some unclear things. I'm
 currently working on adding persistent storage support into a parallels
 driver.
 
 I'm trying to start VM.
 
 nova boot test-vm --flavor m1.medium --image centos-vm-32 --nic
 net-id=c3f40e33-d535-4217-916b-1450b8cd3987 --block-device
 id=26b7b917-2794-452a-95e5-2efb2ca6e32d,bus=sata,source=volume,bootindex=1
 
 Got an error:
 ERROR (BadRequest): Block Device Mapping is Invalid: Boot sequence for
 the instance and image/block device mapping combination is not valid.
 (HTTP 400) (Request-ID: req-454a512c-c9c0-4f01-a4c8-dd0df0c2e052)
 
 
 nova/api/openstack/compute/servers.py
 def create(self, req, body)
 Has such body arg:
 {u'server':
 {u'name': u'test-vm',
  u'imageRef': u'b9349d54-6fd3-4c09-94f5-8d1d5c5ada5c',
  u'block_device_mapping_v2': [{u'disk_bus': u'sata',
u'source_type': u'volume',
u'boot_index': u'1',
u'uuid':
 u'26b7b917-2794-452a-95e5-2efb2ca6e32d'}],
  u'flavorRef': u'3',
  u'max_count': 1,
  u'min_count': 1,
  u'networks': [{u'uuid': u'c3f40e33-d535-4217-916b-1450b8cd3987'}],
  'scheduler_hints': {}
 }
 }
 

The reason you get this error is that there is no block device
mapping with boot_index 0. This is for somewhat historical reasons -
when the new block device mapping syntax (v2, see [1]) was
introduced, the idea was to stop special-casing images and treat them
as just another block device. Still, most of the driver code
special-cases the image field, so this block device is not really used
internally, but it is checked for in the API when we try to validate the
boot sequence passed.

In order for this to work properly, we added code in the
python-novaclient to add a (somewhat useless) block device entry (see
commit [2]) so that the DB is used consistently and the validation passes.

[1] https://wiki.openstack.org/wiki/BlockDeviceConfig
[2] https://review.openstack.org/#/c/46537/1

 Such block device mapping leads to bad boot indexes list.
 I've tried to watch this argument while executing similiar command with
 kvm hypervisor on Juno RDO and get something like in body:
 
 {u'server': {u'name': u'test-vm',
  u'imageRef': u'78ad3d84-a165-42bb-93c0-a4ad1f1ddefc',
  u'block_device_mapping_v2': [{u'source_type': u'image',
u'destination_type': u'local',
u'boot_index': 0,
u'delete_on_termination': True,
u'uuid':
 u'78ad3d84-a165-42bb-93c0-a4ad1f1ddefc'},
 
  {u'disk_bus': u'sata',
   u'source_type': u'volume',
   u'boot_index': u'1',
   u'uuid':
 u'57a27723-65a6-472d-a67d-a551d7dc8405'}],
  u'flavorRef': u'3',
  u'max_count': 1,
  u'min_count': 1,
  'scheduler_hints': {}}}
 

The telling sign here was that you used RDO to test.

I spent some time looking at this, and the actual problem here is a line
of code that was removed from python-novaclient not too long ago; it is
still present in the RDO Juno novaclient, and it is what actually makes
this work for you.

The offending commit that breaks this for you and does not exist in the
RDO-shipped client is:

https://review.openstack.org/#/c/153203/

This basically removes the code that would add an image bdm if there are
other block devices specified. This is indeed a bug in master, but it is
not as simple as reverting the offending commit in novaclient, as
it was part of a separate bug fix [3].

Based on that, I suspect that pointing the older (RDO Juno) client at a
Nova that contains the fix for [3] will also exhibit issues.

Actually, there is (at the time of this writing) still code in the Nova
API that expects and special-cases the case that the above commit removes [4].

[3] https://bugs.launchpad.net/nova/+bug/1377958
[4]
https://github.com/openstack/nova/blob/4b1951622e4b7fcee5ef86396620e91b4b5fa1a1/nova/compute/api.py#L733

 Can you answer next questions please:
 1) Does the first version miss an 'source_type': 'image' arg?
 2) Where should and image block_device be added to this arg? Does it
 come from novaclient or is it added by some callback or decorator?
 

I think both questions are answered above. The question that we want to
answer is how to fix it and make sure that it does not regress as easily
in the future.

I have created a bug for this:

https://bugs.launchpad.net/nova/+bug/1433609

so we can continue the discussion there.

N.



Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-10 Thread Nikola Đipanov
On 03/06/2015 03:19 PM, Attila Fazekas wrote:
 Looks like we need some kind of _per compute node_ mutex in the critical 
 section,
 multiple scheduler MAY be able to schedule to two compute node at same time,
 but not for scheduling to the same compute node.
 
 If we don't want to introduce another required component or
 reinvent the wheel there are some possible trick with the existing globally 
 visible
 components like with the RDMS.
 
 `Randomized` destination choose is recommended in most of the possible 
 solutions,
 alternatives are much more complex.
 
 One SQL example:
 
 * Add `sched_cnt`, defaul=0, Integer field; to a hypervisors related table.
 
 When the scheduler picks one (or multiple) node, he needs to verify is the 
 node(s) are 
 still good before sending the message to the n-cpu.
 
 It can be done by re-reading the ONLY the picked hypervisor(s) related data.
 with `LOCK IN SHARE MODE`.
 If the destination hyper-visors still OK:
 
 Increase the sched_cnt value exactly by 1,
 test is the UPDATE really update the required number of rows,
 the WHERE part needs to contain the previous value.
 
 You also need to update the resource usage on the hypervisor,
  by the expected cost of the new vms.
 
 If at least one selected node was ok, the transaction can be COMMITed.
 If you were able to COMMIT the transaction, the relevant messages 
  can be sent.
 
 The whole process needs to be repeated with the items which did not passed the
 post verification.
 
 If a message sending failed, `act like` migrating the vm to another host.
 
 If multiple scheduler tries to pick multiple different host in different 
 order,
 it can lead to a DEADLOCK situation.
 Solution: Try to have all scheduler to acquire to Shared RW locks in the same 
 order,
 at the end.
 
 Galera multi-writer (Active-Active) implication:
 As always, retry on deadlock. 
 
 n-sch + n-cpu crash at the same time:
 * If the scheduling is not finished properly, it might be fixed manually,
 or we need to solve which still alive scheduler instance is 
 responsible for fixing the particular scheduling..
 

So if I am reading the above correctly - you are basically proposing to
move claims to the scheduler: we would atomically check whether anything
changed since the time we picked the host, using UPDATE .. WHERE with
LOCK IN SHARE MODE (assuming REPEATABLE READ is the isolation level in
use), and then update the usage - i.e. do the claim in the same
transaction.
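
Translated into SQL, the claim would look roughly like this (table and
column names are just for illustration; sched_cnt is the new field
proposed above):

    BEGIN;
    SELECT sched_cnt, memory_mb_used FROM compute_nodes
        WHERE id = 42 LOCK IN SHARE MODE;
    -- re-check the picked host against the row we just read; if it is
    -- still good, bump the counter and consume the resources:
    UPDATE compute_nodes
        SET sched_cnt = sched_cnt + 1,
            memory_mb_used = memory_mb_used + 2048
        WHERE id = 42 AND sched_cnt = 7;  -- 7 being the value read above
    -- 0 rows updated means another scheduler raced us: pick again;
    -- otherwise COMMIT and only then send the message to the compute host
    COMMIT;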

The issue here is that we still have a window between sending the
message and it getting picked up by the compute host (or timing out), or
the instance outright failing, so we will definitely need to ack/nack
the claim in some way on the compute side.

I believe something like this has come up before under the umbrella term
of moving claims to the scheduler, and was discussed in some detail at
the latest Nova mid-cycle meetup, but the only artifacts I could find
were a few lines on the etherpad Sylvain pointed me to [1], which I am
copying here:


* White board the scheduler service interface
 ** note: this design won't change the existing way/logic of reconciling
nova db != hypervisor view
 ** gantt should just return claim ids, not entire claim objects
 ** claims are acked as being in use via the resource tracker updates
from nova-compute
 ** we still need scheduler retries for exceptional situations (admins
doing things outside openstack, hardware changes / failures)
 ** retry logic in conductor? probably a separate item/spec


As you can see - not much to go on (but that is material for a separate
thread that I may start soon).

The problem I have with this particular approach is that while it claims
to fix some of the races (and probably does), it does so by 1) turning
the current scheduling mechanism on its head and 2) not giving any
thought to the trade-offs that it makes. For example, we may get
more correct scheduling in the general case, and the correctness will not
be affected by the number of workers, but how does the fact that we now
do locking DB access on every request fare against the retry mechanism
for some of the more common usage patterns? What is the increased
overhead of calling back to the scheduler to confirm the claim? In the
end - how do we even measure that we are going in the right direction
with the new design?

I personally think that different workloads will have different needs
from the scheduler in terms of response times and tolerance to failure,
and that we need to design for that. So, as an example, a cloud operator
with very simple scheduling requirements may want to go for the
no-locking approach and optimize for response times, allowing a small
number of instances to fail under high load/utilization due to retries,
while others with more complicated scheduling requirements, or less
tolerance for data inconsistency, might want to trade response times for
locking claims in the scheduler. Some similar trade-offs and
how to 

Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-06 Thread Nikola Đipanov
On 03/06/2015 01:56 AM, Rui Chen wrote:
 Thank you very much for in-depth discussion about this topic, @Nikola
 and @Sylvain.
 
 I agree that we should solve the technical debt firstly, and then make
 the scheduler better.
 

That was not necessarily my point.

I would be happy to see work on how to make the scheduler less volatile
when run in parallel, but the solution must acknowledge the eventually
(or never really) consistent nature of the data scheduler has to operate
on (in its current design - there is also the possibility of offering
an alternative design).

I'd say that fixing the technical debt that is aimed at splitting the
scheduler out of Nova is a mostly orthogonal effort.

There have been several proposals in the past for how to make the
scheduler horizontally scalable and improve its performance. One that I
remember from the Atlanta summit time-frame was the work done by Boris
and his team [1] (they actually did some profiling and based their work
on the bottlenecks they found). There are also some nice ideas in the
bug lifeless filed [2] since this behaviour particularly impacts ironic.

N.

[1] https://blueprints.launchpad.net/nova/+spec/no-db-scheduler
[2] https://bugs.launchpad.net/nova/+bug/1341420


 Best Regards.
 
 2015-03-05 21:12 GMT+08:00 Sylvain Bauza sba...@redhat.com
 mailto:sba...@redhat.com:
 
 
 Le 05/03/2015 13:00, Nikola Đipanov a écrit :
 
 On 03/04/2015 09:23 AM, Sylvain Bauza wrote:
 
 Le 04/03/2015 04:51, Rui Chen a écrit :
 
 Hi all,
 
 I want to make it easy to launch a bunch of scheduler
 processes on a
 host, multiple scheduler workers will make use of
 multiple processors
 of host and enhance the performance of nova-scheduler.
 
 I had registered a blueprint and commit a patch to
 implement it.
 
 https://blueprints.launchpad.__net/nova/+spec/scheduler-__multiple-workers-support
 
 https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support
 
 This patch had applied in our performance environment
 and pass some
 test cases, like: concurrent booting multiple instances,
 currently we
 didn't find inconsistent issue.
 
 IMO, nova-scheduler should been scaled horizontally on
 easily way, the
 multiple workers should been supported as an out of box
 feature.
 
 Please feel free to discuss this feature, thanks.
 
 
 As I said when reviewing your patch, I think the problem is
 not just
 making sure that the scheduler is thread-safe, it's more
 about how the
 Scheduler is accounting resources and providing a retry if those
 consumed resources are higher than what's available.
 
 Here, the main problem is that two workers can actually
 consume two
 distinct resources on the same HostState object. In that
 case, the
 HostState object is decremented by the number of taken
 resources (modulo
 what means a resource which is not an Integer...) for both,
 but nowhere
 in that section, it does check that it overrides the
 resource usage. As
 I said, it's not just about decorating a semaphore, it's
 more about
 rethinking how the Scheduler is managing its resources.
 
 
 That's why I'm -1 on your patch until [1] gets merged. Once
 this BP will
 be implemented, we will have a set of classes for managing
 heterogeneous
 types of resouces and consume them, so it would be quite
 easy to provide
 a check against them in the consume_from_instance() method.
 
 I feel that the above explanation does not give the full picture in
 addition to being factually incorrect in several places. I have
 come to
 realize that the current behaviour of the scheduler is subtle enough
 that just reading the code is not enough to understand all the edge
 cases that can come up. The evidence being that it trips up even
 people
 that have spent significant time working on the code.
 
 It is also important to consider the design choices in terms of
 tradeoffs that they were trying to make.
 
 So here are some facts about the way Nova does scheduling of
 instances
 to compute hosts, considering the amount of resources requested
 by the
 flavor (we will try to put the facts into a bigger picture later):
 
 * Scheduler receives request to chose hosts for one or more
 instances.
 * Upon every request

Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-05 Thread Nikola Đipanov
On 03/04/2015 09:23 AM, Sylvain Bauza wrote:
 
 Le 04/03/2015 04:51, Rui Chen a écrit :
 Hi all,

 I want to make it easy to launch a bunch of scheduler processes on a
 host, multiple scheduler workers will make use of multiple processors
 of host and enhance the performance of nova-scheduler.

 I had registered a blueprint and commit a patch to implement it.
 https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support

 This patch had applied in our performance environment and pass some
 test cases, like: concurrent booting multiple instances, currently we
 didn't find inconsistent issue.

 IMO, nova-scheduler should been scaled horizontally on easily way, the
 multiple workers should been supported as an out of box feature.

 Please feel free to discuss this feature, thanks.
 
 
 As I said when reviewing your patch, I think the problem is not just
 making sure that the scheduler is thread-safe, it's more about how the
 Scheduler is accounting resources and providing a retry if those
 consumed resources are higher than what's available.
 
 Here, the main problem is that two workers can actually consume two
 distinct resources on the same HostState object. In that case, the
 HostState object is decremented by the number of taken resources (modulo
 what means a resource which is not an Integer...) for both, but nowhere
 in that section, it does check that it overrides the resource usage. As
 I said, it's not just about decorating a semaphore, it's more about
 rethinking how the Scheduler is managing its resources.
 
 
 That's why I'm -1 on your patch until [1] gets merged. Once this BP will
 be implemented, we will have a set of classes for managing heterogeneous
 types of resouces and consume them, so it would be quite easy to provide
 a check against them in the consume_from_instance() method.
 

I feel that the above explanation does not give the full picture in
addition to being factually incorrect in several places. I have come to
realize that the current behaviour of the scheduler is subtle enough
that just reading the code is not enough to understand all the edge
cases that can come up. The evidence being that it trips up even people
that have spent significant time working on the code.

It is also important to consider the design choices in terms of
tradeoffs that they were trying to make.

So here are some facts about the way Nova does scheduling of instances
to compute hosts, considering the amount of resources requested by the
flavor (we will try to put the facts into a bigger picture later):

* Scheduler receives request to choose hosts for one or more instances.
* Upon every request (_not_ for every instance as there may be several
instances in a request) the scheduler learns the state of the resources
on all compute nodes from the central DB. This state may be inaccurate
(meaning out of date).
* Compute resources are updated by each compute host periodically. This
is done by updating the row in the DB.
* The wall-clock time difference between the scheduler deciding to
schedule an instance, and the resource consumption being reflected in
the data the scheduler learns from the DB can be arbitrarily long (due
to load on the compute nodes and latency of message arrival).
* To cope with the above, there is a concept of retrying the request
that fails on a certain compute node due to the scheduling decision
being made with data stale at the moment of build, by default we will
retry 3 times before giving up.
* When running multiple instances, decisions are made in a loop, and an
internal in-memory view of the resources gets updated (the widely
misunderstood consume_from_instance method is used for this), so as to
keep subsequent decisions as accurate as possible. As was described
above, this is all thrown away once the request is finished.
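
To illustrate the last two points, here is a deliberately simplified sketch
of that flow - the class, the RAM-only filtering/weighing and the numbers
are just illustrative stand-ins, not the actual scheduler code:

class HostState(object):
    def __init__(self, name, free_ram_mb):
        self.name = name
        self.free_ram_mb = free_ram_mb

    def consume_from_instance(self, ram_mb):
        # Only the *in-memory* view is updated - nothing is written back.
        self.free_ram_mb -= ram_mb


def select_hosts(host_states, ram_mb_per_instance, num_instances):
    """Pick one host per instance from a single (possibly stale) snapshot."""
    selected = []
    for _ in range(num_instances):
        candidates = [h for h in host_states
                      if h.free_ram_mb >= ram_mb_per_instance]
        if not candidates:
            raise RuntimeError("no valid host found")
        best = max(candidates, key=lambda h: h.free_ram_mb)  # trivial weigher
        best.consume_from_instance(ram_mb_per_instance)
        selected.append(best.name)
    # The snapshot is discarded once the request finishes - the durable
    # accounting happens later, in the claim on each chosen compute node.
    return selected

# e.g. select_hosts([HostState("cn1", 4096), HostState("cn2", 8192)], 2048, 3)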

Now that we understand the above, we can start to consider what changes
when we introduce several concurrent scheduler processes.

Several cases come to mind:
* Concurrent requests will no longer be serialized on reading the state
of all hosts (due to how eventlet interacts with mysql driver).
* In the presence of a single request for a large number of instances
there is going to be a drift in accuracy of the decisions made by other
schedulers as they will not have accounted for any of the instances
until they actually get claimed on their respective hosts.

All of the above limitations will likely not pose a problem under normal
load and usage, but issues can start appearing when nodes are
close to full or when there is heavy load. Also this changes drastically
based on how we actually choose to utilize hosts (see a very interesting
Ironic bug [1]).

Whether any of the above matters to users depends heavily on their
use-case though. This is why I feel we should be providing more information.

Finally - I think it is important to accept that the scheduler service
will always have to operate under the assumptions of stale data, and

Re: [openstack-dev] [nova] Plans to fix numa_topology related issues with migration/resize/evacuate

2015-03-04 Thread Nikola Đipanov
On 03/04/2015 03:17 PM, Wensley, Barton wrote:
 Hi,
 
 I have been exercising the numa topology related features in kilo (cpu 
 pinning, numa topology, huge pages) and have seen that there are issues
 when an operation moves an instance between compute nodes. In summary,
 the numa_topology is not recalculated for the destination node, which 
 results in the instance running with the wrong topology (or even 
 failing to run if the topology isn't supported on the destination). 
 This impacts live migration, cold migration, resize and evacuate.
 
 I have spent some time over the last couple weeks and have a working 
 fix for these issues that I would like to push upstream. The fix for
 cold migration and resize is the most straightfoward, so I plan to
 start there.
 

First of all thanks for all the hard work on this. Some comments on the
proposed changes bellow - but as usual it's best to see the code :)

 At a high level, here is what I have done to fix cold migrate and 
 resize:
 - Add the source_numa_topology and dest_numa_topology to the migration 
   object and migrations table.

Migration has access to the instance, and thus access to the current
topology. Also it seems that we actually always load the instance when
we query for migrations in the resource tracker.

Also - it might be better to have something akin to 'new_' flavor for
the new topology, so we can store both in the instance_extra table, which
would be slightly more consistent.

Again - best to see the code first.
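
Very roughly something like this - a hand-wavy sketch only, where the
new_numa_topology field name is made up and not a proposed design:

from nova.objects import base, fields

class InstanceExtraSketch(base.NovaObject):
    # Mirrors how flavor/new_flavor already sit next to each other during
    # a resize; both fields would live in the instance_extra table.
    fields = {
        'numa_topology': fields.ObjectField('InstanceNUMATopology',
                                            nullable=True),
        'new_numa_topology': fields.ObjectField('InstanceNUMATopology',
                                                nullable=True),
    }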

 - When a resize_claim is done, store the claimed numa topology in the
   dest_numa_topology in the migration record. Also store the current 
   numa topology as the source_numa_topology in the migration record.
 - Use the source_numa_topology and dest_numa_topology from the 
   migration record in the resource accounting when referencing 
   migration claims as appropriate. This is done for claims, dropped 
   claims and the resource audit.
 - Set the numa_topology in the instance after the cold migration/resize
   is finished to the dest_numa_topology from the migration object - 
   done in finish_resize RPC on the destination compute to match where 
   the rest of the resources for the instance are updated (there is a 
   call to _set_instance_info here that sets the memory, vcpus, disk 
   space, etc... for the migrated instance).
 - Set the numa_topology in the instance if the cold migration/resize is 
   reverted to the source_numa_topology from the migration object - 
   done in finish_revert_resize RPC on the source compute.
 
 I would appreciate any comments on my approach. I plan to start
 submitting the code for this against bug 1417667 - I will split it
 into several chunks to make it easier to review.
 

All of the above sounds relatively reasonable overall.

I'd like to hear from Jay, Sylvain and other scheduler devs on how they
see this impacting some of the planned blueprints like the RequestSpec
one [1]

Also note that this will require completely fixing the NUMA filter as
well: I've proposed a way to do it here [2]

N.

[1] https://blueprints.launchpad.net/nova/+spec/request-spec-object
[2] https://review.openstack.org/160484

 Fixing live migration was significantly more effort - I'll start a
 different thread on that once I have feedback on the above approach.
 
 Thanks,
 
 Bart Wensley, Member of Technical Staff, Wind River
 
 
 __
 OpenStack Development Mailing List (not for usage questions)
 Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Outcome of the nova FFE meeting for Kilo

2015-02-23 Thread Nikola Đipanov
On 02/20/2015 11:33 PM, Sourabh Patwardhan (sopatwar) wrote:
 Nova core reviewers,
 
 May I request an FFE for Cisco VIF driver:
 https://review.openstack.org/#/c/157616/
 
 This is a small isolated change similar to the vhostuser / open contrail
 vif drivers for which FFE has been granted.
 
 Thanks,
 Sourabh
 
 

Hey Sourabh,

Sorry that you didn't get any responses sooner.

Actually the FFEs get decided by a subset of the Nova core team called
the nova-drivers. You can see it briefly mentioned here [1]. (*)

You can see an etherpad that hosts the minutes of the meeting where
the nova-drivers decided on FFEs at [2], which may give you more
insight into why your BP did not make the cut.

Once again - apologies for any poor experience you may have had trying
to contribute to Nova.

N.

[1] https://wiki.openstack.org/wiki/Nova
[2] https://etherpad.openstack.org/p/kilo-nova-ffe-requests

(*) On a side note this is an example of us not making the process and
the players clear to our contributors, so we should probably try to
document the role of the nova-drivers better at the very least.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Ubuntu, qemu, NUMA support

2015-02-22 Thread Nikola Đipanov
On 02/23/2015 06:17 AM, Tom Fifield wrote:
 Also, we currently assume that qemu can pin to NUMA nodes.  This is an
 invalid assumption since this was only added as of qemu 2.1, and there
 only if it's compiled with NUMA support.  At the very least we should
 have a version check, but if Ubuntu doesn't fix things then maybe we
 should actually verify the functionality first before trying to use it.

 I've opened a bug to track this issue:
 https://bugs.launchpad.net/nova/+bug/1422775
 
 This bug might still be worthwhile, as quite a few folks will likely
 stick with Trusty for Kilo. Though, did you by chance check the flag
 status of the package in the Ubuntu Cloud Archive? It packages a
 different Qemu (ver 2.2) to the main repo ...
 

Hey,

I've responded to the bug too (tl; dr - IMHO we should be failing the
instance request).

It might be better to move any discussion that ensues there so that it's
in one place.

Cheers for reporting it though!
N.


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Question about force_host skip filters

2015-02-17 Thread Nikola Đipanov
On 02/17/2015 04:59 PM, Chris Friesen wrote:
 On 02/16/2015 01:17 AM, Nikola Đipanov wrote:
 On 02/14/2015 08:25 AM, Alex Xu wrote:
 
 Agree with Nikola, the claim already checking that. And instance booting
 must be failed if there isn't pci device. But I still think it should go
 through the filters, because in the future we may move the claim into
 the scheduler. And we needn't any new options, I didn't see there is any
 behavior changed.


 I think that it's not as simple as just re-running all the filters. When
 we want to force a host - there are certain things we may want to
 disregard (like aggregates? affinity?) that the admin de-facto overrides
 by saying they want a specific host, and there are things we definitely
 need to re-run to set the limits and for the request to even make sense
 (like NUMA, PCI, maybe some others).

 So what I am thinking is that we need a subset of filters that we flag
 as - we need to re-run this even for force-host, and then run them on
 every request.
 
 Yeah, that makes sense.  Also, I think that flag should be an attribute
 of the filter itself, so that people adding new filters don't need to
 also add the filter to a list somewhere.
 

This is basically what I had in mind - definitely a filter property!

N.


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] Feature Freeze Exception Request (DRBD for Nova) WAS: Re: [nova] Request Spec Freeze Exception (DRBD for Nova)

2015-02-16 Thread Nikola Đipanov
Re-titling the email so that it does not get missed as it does not have
the right subject line.

I looked at the code and it is quite straightforward - some small nits
inline but other than that - no reason to keep it out.

This is the first contribution to Nova by Philipp and his team, and
their experience navigating our bureaucracy-meritocracy seems far from a
happy one (from what I could gather on IRC at least) - one more reason
to not keep this feature out.

Thanks,
N.

On 02/16/2015 03:06 PM, Philipp Marek wrote:
 Hi all,
 
 Nikola just told me that I need an FFE for the code as well.
 
 Here it is: please grant a FFE for 
 
 https://review.openstack.org/#/c/149244/
 
 which is the code for the spec at
 
 https://review.openstack.org/#/c/134153/
 
 
 Regards,
 
 Phil
 


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Feature Freeze Exception Request (DRBD for Nova) WAS: Re: [nova] Request Spec Freeze Exception (DRBD for Nova)

2015-02-16 Thread Nikola Đipanov
On 02/16/2015 03:27 PM, Nikola Đipanov wrote:
 Re-titling the email so that it does not get missed as it does not have
 the right subject line.
 

Ugh - as Daniel pointed out - the spec is not actually approved so
please disregard this email - I missed that bit.

Although - I still stand by the following paragraph:

 
 This is the first contribution to Nova by Philipp and his team, and
 their experience navigating our bureaucracy-meritocracy seems far from a
 happy one (from what I could gather on IRC at least) - one more reason
 to not keep this feature out.
 




__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Question about force_host skip filters

2015-02-15 Thread Nikola Đipanov
On 02/14/2015 08:25 AM, Alex Xu wrote:
 
 
 2015-02-14 1:41 GMT+08:00 Nikola Đipanov ndipa...@redhat.com
 mailto:ndipa...@redhat.com:
 
 On 02/12/2015 04:10 PM, Chris Friesen wrote:
  On 02/12/2015 03:44 AM, Sylvain Bauza wrote:
 
  Any action done by the operator is always more important than what the
  Scheduler
  could decide. So, in an emergency situation, the operator wants to
  force a
  migration to an host, we need to accept it and do it, even if it
  doesn't match
  what the Scheduler could decide (and could violate any policy)
 
  That's a *force* action, so please leave the operator decide.
 
  Are we suggesting that the operator would/should only ever specify a
  specific host if the situation is an emergency?
 
  If not, then perhaps it would make sense to have it go through the
  scheduler filters even if a host is specified.  We could then have a
  --force flag that would proceed anyways even if the filters don't 
 match.
 
  There are some cases (provider networks or PCI passthrough for example)
  where it really makes no sense to try and run an instance on a compute
  node that wouldn't pass the scheduler filters.  Maybe it would make the
  most sense to specify a list of which filters to override while still
  using the others.
 
 
 Actually this kind of already happens on the compute node when doing
 claims. Even if we do force the host, the claim will fail on the compute
 node and we will end up with a consistent scheduling.
 
 
 
 Agree with Nikola, the claim already checking that. And instance booting
 must be failed if there isn't pci device. But I still think it should go
 through the filters, because in the future we may move the claim into
 the scheduler. And we needn't any new options, I didn't see there is any
 behavior changed.
 

I think that it's not as simple as just re-running all the filters. When
we want to force a host - there are certain things we may want to
disregard (like aggregates? affinity?) that the admin de-facto overrides
by saying they want a specific host, and there are things we definitely
need to re-run to set the limits and for the request to even make sense
(like NUMA, PCI, maybe some others).

So what I am thinking is that we need a subset of filters that we flag
as "re-run this even for force-host", and then run them on
every request.
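
Roughly something like this - the run_on_forced_host attribute name is
made up here, just to show the shape of the idea:

from nova.scheduler import filters

class NUMATopologyFilter(filters.BaseHostFilter):
    # Made-up flag: tells the filter handler to run this filter even when
    # the request forces a specific host, since it also has to set limits.
    run_on_forced_host = True

    def host_passes(self, host_state, filter_properties):
        # the existing NUMA fitting / limit-setting logic stays as is
        return True

# and the filter handler would then do something like:
#     filters_to_run = [f for f in enabled_filters
#                       if not force_hosts or f.run_on_forced_host]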

thoughts?

N.

 
 
 This sadly breaks down for stuff that needs to use limits, as limits
 won't be set by the filters.
 
 Jay had a BP before to move limits onto compute nodes, which would solve
 this issue, as you would not need to run the filters at all - all the
 stuff would be known to the compute host that could then easily say
 nice of you to want this here, but it ain't happening.
 
 It will also likely need a check in the retry logic to make sure we
 don't hit the host 'retry' number of times.
 
 N.
 
 
 __
 OpenStack Development Mailing List (not for usage questions)
 Unsubscribe:
 openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
 http://openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 
 
 
 
 __
 OpenStack Development Mailing List (not for usage questions)
 Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Question about force_host skip filters

2015-02-13 Thread Nikola Đipanov
On 02/12/2015 04:10 PM, Chris Friesen wrote:
 On 02/12/2015 03:44 AM, Sylvain Bauza wrote:
 
 Any action done by the operator is always more important than what the
 Scheduler
 could decide. So, in an emergency situation, the operator wants to
 force a
 migration to an host, we need to accept it and do it, even if it
 doesn't match
 what the Scheduler could decide (and could violate any policy)

 That's a *force* action, so please leave the operator decide.
 
 Are we suggesting that the operator would/should only ever specify a
 specific host if the situation is an emergency?
 
 If not, then perhaps it would make sense to have it go through the
 scheduler filters even if a host is specified.  We could then have a
 --force flag that would proceed anyways even if the filters don't match.
 
 There are some cases (provider networks or PCI passthrough for example)
 where it really makes no sense to try and run an instance on a compute
 node that wouldn't pass the scheduler filters.  Maybe it would make the
 most sense to specify a list of which filters to override while still
 using the others.
 

Actually this kind of already happens on the compute node when doing
claims. Even if we do force the host, the claim will fail on the compute
node and we will end up with a consistent scheduling.

This sadly breaks down for stuff that needs to use limits, as limits
won't be set by the filters.

Jay had a BP before to move limits onto compute nodes, which would solve
this issue, as you would not need to run the filters at all - all the
stuff would be known to the compute host, which could then easily say
"nice of you to want this here, but it ain't happening".

It will also likely need a check in the retry logic to make sure we
don't hit the host 'retry' number of times.

N.


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][tc] Lets keep our community open, lets fight for it

2015-02-12 Thread Nikola Đipanov
On 02/11/2015 06:20 PM, Clint Byrum wrote:
 Excerpts from Nikola Đipanov's message of 2015-02-11 05:26:47 -0800:
 On 02/11/2015 02:13 PM, Sean Dague wrote:

 If core team members start dropping off external IRC where they are
 communicating across corporate boundaries, then the local tribal effects
 start taking over. You get people start talking about the upstream as
 them. The moment we get into us vs. them, we've got a problem.
 Especially when the upstream project is them.


 A lot of assumptions being presented as fact here.

 I believe the technical term for the above is 'slippery slope fallacy'.

 
 I don't see that fallacy, though it could descend into that if people
 keep pushing in that direction. Where I think Sean did a nice job
 stopping short of the slippery slope is that he only identified the step
 that is happening _now_, not the next step.
 
 I tend to agree that right now, if core team members are not talking
 on IRC to other core members in the open, whether inside or outside
 corporate boundaries, then we do see an us vs. them mentality happen.
 It's not I think thats the next step. I have personally seen that
 happening and will work hard to stop it. I think Sean has probably seen
 his share of it too,  as that is what he described in detail without
 publicly shaming anyone or any company (well done Sean).
 

There are several things I don't agree with in Sean's email, but this
one strikes me as particularly annoying, and potentially dangerous. You
also reinforce it in your reply.

Both of you seem to imply that there is a "right way" to do OpenStack
and be core, beyond just following the development process. The notion
is annoying because it leads to the exclusivity that Flavio complains about,
and is making our community a worse place for that. Different people who
can be valuable contributors, have wildly different (to name only a
few): personal styles of working, obligations to their own employer,
obligations to their family, level of command of the English language,
possibility to travel to remote parts of the world, possibility to cross
boarders without additional strain on time and finances, possibility to
engage in a real-time written discussion, possibility to engage in a
real time discussion in person in a language that is not their own in a
room full of native speakers of the used language, possibility to engage
in real-time discussions effectively. Need I go on...

Not only does your and Sean's argument not acknowledge these differences
that can easily lead to exclusion of valuable contributors - you
actually go as far as to say that unless everyone does it the right
way, the community will be worse for it, and try to back it up with
made up stuff like local tribe effects (really?! We are talking about
adult professional people here).

So yes, there is an "us" and "them" - but the divide is not where you
think it is. This is why I believe an argument like this dropped smack
in the middle of a discussion like the one Flavio started is deeply
toxic, all fallacies aside.

 We can and _must_ do much better than this on this mailing list! Let's
 drag the discussion level back up!
 
 I'm certain we can always improve, and I appreciate you taking the time
 to have a Gandalf moment to stop the Balrog of fallacy from  entering
 this thread. We seriously can't let the discussion slip down that
 slope.. oh wait.
 

LOL on the LOTR reference (I look nothing like Gandalf though I may
dress like that sometimes). I hope I explained what I meant when I said
that this kind of argument really has no place in a discussion about
making the community more open by nurturing open communication.

 That said, I do want us to talk about uncomfortable things when
 necessary. I think this thread is not something where it will be entirely
 productive to stay 100% positive throughout. We might just have to use
 some negative language along side our positive suggestions to make sure
 people have an efficient way to measure their own behavior.


By all means - I only wish there would be more level-headed discussion
about the negatives around here.

N.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Feature Freeze Exception Request (libvirt vhostuser vif driver)

2015-02-11 Thread Nikola Đipanov
On 02/09/2015 11:04 AM, Czesnowicz, Przemyslaw wrote:
 Hi,  
 
  
 
 I would like to request FFE for vhostuser vif driver.
 
  
 
 2 reviews :
 https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bp/libvirt-vif-vhost-user,n,z
 
  
 
 BP: https://blueprints.launchpad.net/nova/+spec/libvirt-vif-vhost-user
 
 Spec: https://review.openstack.org/138736
 
  
 
 Blueprint was approved but it’s status was changed because of FF.
 
 Vhostuser is a Qemu feature that allows fastpath into the VM for
 userspace vSwitches.
 
 The changes are small and mostly contained to libvirt driver.
 
 Vhostuser support was proposed for Juno by Snabb switch guys but didn’t
 make it,
 
 this implementation supports their usecase as well .


The patches are really non-invasive, and extremely contained, and have
had several reviews. I cannot come up with a good reason to keep it out.

This is also interesting for the NFV use-cases (mainly) so I'd like to
see it happen on that account too - and thus will be happy to sponsor it.

N.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][tc] Lets keep our community open, lets fight for it

2015-02-11 Thread Nikola Đipanov
+ Inf for writing this Flavio!

Only some observations below.

On 02/11/2015 10:55 AM, Flavio Percoco wrote:
 Greetings all,
 
 During the last two cycles, I've had the feeling that some of the
 things I love the most about this community are degrading and moving
 to a state that I personally disagree with. With the hope of seeing
 these things improve, I'm taking the time today to share one of my
 concerns.
 
 Since I believe we all work with good faith and we *all* should assume
 such when it comes to things happening in our community, I won't make
 names and I won't point fingers - yes, I don't have enough fingers to
 point based on the info I have. People that fall into the groups I'll
 mention below know that I'm talking to them.
 
 This email is dedicated to the openness of our community/project.
 
 ## Keep discussions open
 
 I don't believe there's anything wrong about kicking off some
 discussions in private channels about specs/bugs. I don't believe
 there's anything wrong in having calls to speed up some discussions.
 HOWEVER, I believe it's *completely* wrong to consider those private
 discussions sufficient. If you have had that kind of private
 discussions, if you've discussed a spec privately and right after you
 went upstream and said: This has been discussed in a call and it's
 good to go, I beg you to stop for 2 seconds and reconsider that. I
 don't believe you were able to fit all the community in that call and
 that you had enough consensus.
 
 Furthermore, you should consider that having private conversations, at
 the very end, doesn't help with speeding up discussions. We've a
 community of people who *care* about the project they're working on.
 This means that whenever they see something that doesn't make much
 sense, they'll chime in and ask for clarification. If there was a
 private discussion on that topic, you'll have to provide the details
 of such discussion and bring that person up to date, which means the
 discussion will basically start again... from scratch.
 
 ## Mailing List vs IRC Channel
 
 I get it, our mailing list is freaking busy, keeping up with it is
 hard and time consuming and that leads to lots of IRC discussions. I
 don't think there's anything wrong with that but I believe it's wrong
 to expect *EVERYONE* to be in the IRC channel when those discussions
 happen.
 
 If you are discussing something on IRC that requires the attention of
 most of your project's community, I highly recommend you to use the
 mailing list as oppose to pinging everyone independently and fighting
 with time zones. Using IRC bouncers as a replacement for something
 that should go to the mailing list is absurd. Please, use the mailing
 list and don't be afraid of having a bigger community chiming in in
 your discussion.  *THAT'S A GOOD THING*
 
 Changes, specs, APIs, etc. Everything is good for the mailing list.
 We've fought hard to make this community grow, why shouldn't we take
 advantage of it?
 

I think the above 2 are somewhat intertwined with another trend in the
community I've personally noticed towards the end of the Juno cycle,
that I also strongly believe needs to DIAFF.

The idea that it is possible to manage an open source community using
methods similar to those commonly used for managing subordinates in a
corporate hierarchy.

There are other (somewhat less) horrible examples around, and they all
came about as a (IMHO knee jerk) response to explosive growth, and they
all need to stop.

I urge people who are seen as leaders in their respective projects to
stop and think the next time they want to propose a policy change or a
process - ask yourself "Is there an OSS project that does something
similar successfully, or have I seen this from our old PM?" and then not
propose it unless the answer is clearly that this will help the distributed
workflow of an OSS community.

On 02/11/2015 11:29 AM, Thierry Carrez wrote:
 This is the point where my good faith assumption skill falls short.
 Seriously, don't get me wrong but: WHAT IN THE ACTUAL F**K?

 THERE IS ABSOLUTELY NOTHING PRIVATE FOR CORE REVIEWERS* TO
 DISCUSS.

 If anything core reviewers should be the ones *FORCING* - it seems
 that *encouraging* doesn't have the same effect anymore - *OPENNESS* in
 order to include other non-core members in those discussions.

 Remember that the core flag is granted because of the reviews that
 person has provided and because that individual *WANTS* to be part of
 it. It's not a prize for people. In fact, I consider core reviewers to
 be volunteers and their job is infinitely thanked.

 +1000

 Core reviewing has always been designed to be a duty, not a badge. There
 has been a trend toward making it a badge, with some companies giving
 bonuses to core reviewers, and HP making +2 pins and throwing +2
 parties. I think that's a significant mistake and complained about it,
 but then my influence only goes that far.

 The problem with special rights (like +2) is that if you don't 

Re: [openstack-dev] [all][tc] Lets keep our community open, lets fight for it

2015-02-11 Thread Nikola Đipanov
On 02/11/2015 02:13 PM, Sean Dague wrote:
 
 If core team members start dropping off external IRC where they are
 communicating across corporate boundaries, then the local tribal effects
 start taking over. You get people start talking about the upstream as
 them. The moment we get into us vs. them, we've got a problem.
 Especially when the upstream project is them.
 

A lot of assumptions being presented as fact here.

I believe the technical term for the above is 'slippery slope fallacy'.

We can and _must_ do much better than this on this mailing list! Let's
drag the discussion level back up!

N.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Feature Freeze Exception Request for Quiesce boot from volume instances

2015-02-10 Thread Nikola Đipanov
On 02/10/2015 11:12 AM, Daniel P. Berrange wrote:
 On Fri, Feb 06, 2015 at 10:20:04PM +, Tomoki Sekiyama wrote:
 Hello,

 I'd like to request a feature freeze exception for the change
   https://review.openstack.org/#/c/138795/ .

 This patch makes live volume-boot instance snapshots consistent by
 quiescing instances before snapshotting. Quiescing for image-boot
 instances are already merged in the libvirt driver, and this is a
 complementary part for volume-boot instances.


 Nikola Dipanov and Daniel Berrange actively reviewed the patch and I hope
 it is ready now (+1 from Nikola with a comment that he is waiting for the
 FFE process at this point so no +2s).
 Please consider approving this FFE.
 
 I'm happy to sponsor this one having given it multiple reviews
 
 You could probably even argue this feature is in fact a bug fix since
 it fixes the problem of inconsistent snapshots which can result in guest
 application data corruption in the worst case.
 

I will sponsor it too - it basically one patch that is ready to merge
and has had a number of reviews.

N.


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Need nova-specs core reviews for Scheduler spec

2015-02-06 Thread Nikola Đipanov
On 02/06/2015 02:15 PM, Ed Leafe wrote:
 At the mid-cycle we discussed the last spec for the scheduler cleanup: 
 Isolate Scheduler DB for Instances
 
 https://review.openstack.org/#/c/138444/
 
 There was a lot of great feedback from those discussions, and that has been 
 incorporated into the spec. It has been re-reviewed by most of the scheduler 
 team with several +1s, but we really need the cores to approve it so we can 
 move ahead with the patches.
 

Hey Ed,

I've left a comment on the spec - basically I don't think this is an
approach we should take.

Since I was not at the midcycle, I am sorry the discussions happened so
close to the FF freeze, and there was not enough time to get broader
feedback from the community in time.

Best,
N.


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Nominating Melanie Witt for python-novaclient-core

2015-01-29 Thread Nikola Đipanov
On 01/27/2015 11:41 PM, Michael Still wrote:
 Greetings,
 
 I would like to nominate Melanie Witt for the python-novaclient-core team.
 
 (What is python-novaclient-core? Its a new group which will contain
 all of nova-core as well as anyone else we think should have core
 reviewer powers on just the python-novaclient code).
 
 Melanie has been involved with nova for a long time now. She does
 solid reviews in python-novaclient, and at least two current
 nova-cores have suggested her as ready for core review powers on that
 repository.
 
 Please respond with +1s or any concerns.
 
 References:
 
 
 https://review.openstack.org/#/q/project:openstack/python-novaclient+reviewer:%22melanie+witt+%253Cmelwitt%2540yahoo-inc.com%253E%22,n,z
 
 As a reminder, we use the voting process outlined at
 https://wiki.openstack.org/wiki/Nova/CoreTeam to add members to our
 core team.
 
 Thanks,
 Michael
 

+1

N.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][gate][stable] How eventlet 0.16.1 broke the gate

2015-01-15 Thread Nikola Đipanov
On 01/15/2015 10:35 AM, Joe Gordon wrote:

 So how could we have avoided this problem? By capping stable branch
 requirements so we only have to worry about uncapped dependencies on
 master. Capping stable branches has been previous discussed but no
 action has been taken. So going forward I propose we pin all
 requirements, including transitive, on stable branches. This way the
 release of new dependencies cannot automatically break stable branches
 and thus break grenade on master.
 

This is an absolute must IMHO, including transitive dependencies,
because if they are not capped - they can cause other issues, like bringing
in additional deps a stable release is not even supposed to have, among
all the usual issues.

The problem as I understand it is that this breaks how we do upgrade
testing in the gate, AKA the grenade job (all in a single VM, installing
everything from pip). IMHO this is broken and needs to be fixed ASAP, if
capping breaks it.

N.


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] Spring cleaning nova-core

2014-12-07 Thread Nikola Đipanov
On 12/07/2014 06:02 PM, Jay Pipes wrote:
 On 12/07/2014 04:19 AM, Michael Still wrote:
 On Sun, Dec 7, 2014 at 7:03 PM, Gary Kotton gkot...@vmware.com wrote:
 On 12/6/14, 7:42 PM, Jay Pipes jaypi...@gmail.com wrote:

 [snip]

 -1 on pixelbeat, since he's been active in reviews on
 various things AFAICT in the last 60-90 days and seems to be still a
 considerate reviewer in various areas.

 I agree -1 for Padraig

 I'm going to be honest and say I'm confused here.

 We've always said we expect cores to maintain an average of two
 reviews per day. That's not new, nor a rule created by me. Padraig is
 a great guy, but has been working on other things -- he's done 60
 reviews in the last 60 days -- which is about half of what we expect
 from a core.

 Are we talking about removing the two reviews a day requirement? If
 so, how do we balance that with the widespread complaints that core
 isn't keeping up with its workload? We could add more people to core,
 but there is also a maximum practical size to the group if we're going
 to keep everyone on the same page, especially when the less active
 cores don't generally turn up to our IRC meetings and are therefore
 more expensive to keep up to date.

 How can we say we are doing our best to keep up with the incoming
 review workload if all reviewers aren't doing at least the minimum
 level of reviews?
 
 Personally, I care more about the quality of reviews than the quantity.
 That said, I understand that we have a small number of core reviewers
 relative to the number of open reviews in Nova (~650-700 open reviews
 most days) and agree with Dan Smith that 2 reviews per day doesn't sound
 like too much of a hurdle for core reviewers.
 
 The reason I think it's important to keep Padraig as a core is that he
 has done considerate, thoughtful code reviews, albeit in a smaller
 quantity. By saying we only look at the number of reviews in our
 estimation of keeping contributors on the core team, we are
 incentivizing the wrong behaviour, IMO. We should be pushing that the
 thought that goes into reviews is more important than the sheer number
 of reviews.
 
 Is it critical that we get more eyeballs reviewing code? Yes, absolutely
 it is. Is it critical that we get more reviews from core reviewers as
 well as non-core reviewers. Yes, absolutely.
 
 Bottom line, we need to balance between quality and quantity, and
 kicking out a core reviewer who has quality code reviews because they
 don't have that many of them sends the wrong message, IMO.
 

I could not *possibly* agree more with everything Jay wrote above!

Quality should always win! And 2 reviews a day is a nice approximation
of what is expected but we should not have any number as a hard
requirement. It's lazy (in addition to sending the wrong message) and we
_need_ to be better than that!

Slightly off-topic - since we're so into numbers - Russell's statistics
were at one point showing the ratio between reviews given and reviews
received. I tend to be wary of people reviewing without writing any code
themselves, as they tend to lose touch with the actual constraints the
code is written under in different parts of Nova. This is especially
important when reviewing larger feature branches or more complicated
refactoring (a big part of what we want to prioritize in Kilo).

As any number - that one is also never going to tell the whole story,
and should not ever become a hard rule - but I for one would be
interested to see it.

N.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] Spring cleaning nova-core

2014-12-05 Thread Nikola Đipanov
On 12/05/2014 01:05 AM, Michael Still wrote:
 One of the things that happens over time is that some of our core
 reviewers move on to other projects. This is a normal and healthy
 thing, especially as nova continues to spin out projects into other
 parts of OpenStack.
 
 However, it is important that our core reviewers be active, as it
 keeps them up to date with the current ways we approach development in
 Nova. I am therefore removing some no longer sufficiently active cores
 from the nova-core group.
 
 I’d like to thank the following people for their contributions over the years:
 
 * cbehrens: Chris Behrens
 * vishvananda: Vishvananda Ishaya
 * dan-prince: Dan Prince
 * belliott: Brian Elliott
 * p-draigbrady: Padraig Brady
 

I am personally -1 on Padraig and Vish, especially Padraig. As one of
the coreutils maintainers, his contribution to Nova is invaluable,
regardless of whatever metrics applied to his reviews made him appear on
this list (hint - quality should really be the only one). Removing him
from core will probably not affect that, but I personally definitely
trust him not to vote +2 on the stuff he is not in touch with, and view
his +2s when I see them as a sign of thorough reviews. Also he has not
exactly been inactive lately by any measure.

Vish has not been active for some time now, but he is still on IRC and in the
community (as opposed to Chris for example), so I am not sure why we would do
this now.

N.


 I’d love to see any of these cores return if they find their available
 time for code reviews increases.
 
 Thanks,
 Michael
 


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] NUMA Cells

2014-12-04 Thread Nikola Đipanov
On 12/04/2014 05:30 AM, Michael Still wrote:
 Hi,
 
 so just having read a bunch of the libvirt driver numa code, I have a
 concern. At first I thought it was a little thing, but I am starting
 to think its more of a big deal...
 
 We use the term cells to describe numa cells. However, that term has
 a specific meaning in nova, and I worry that overloading the term is
 confusing.
 
 (Yes, I know the numa people had it first, but hey).
 
 So, what do people think about trying to move the numa code to use
 something like numa cell or numacell based on context?
 

Seeing that "node" is also not exactly unambiguous in this space - I am
fine with either numanode or numacell, with a slight
preference for numacell.

A small issue will be renaming it in objects though - as this will
require adding a new field for use in Kilo while still remaining
backwards compatible with Juno, resulting in even more compatibility
code (we already added some for the slightly different data format). The
name is quite clear from the context there, but we would use it like:

  for cell in numa_topology.cells:
      # awesome algo here with cell :(

but if we were to rename it just in places where it's used to:

  for numacell in numa_topology.cells:
      # awesome algo here with numacell :)

We would achieve a lot of the disambiguation without renaming object
attributes (but not really make it future proof).

Thoughts?

N.


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Consistency, efficiency, and safety of NovaObject.save()

2014-11-13 Thread Nikola Đipanov
On 11/13/2014 02:45 AM, Dan Smith wrote:
 I’m not sure if I’m seeing the second SELECT here either but I’m less
 familiar with what I’m looking at. compute_node_update() does the
 one SELECT as we said, then it doesn’t look like
 self._from_db_object() would emit any further SQL specific to that
 row.
 
 I don't think you're missing anything. I don't see anything in that
 object code, or the other db/sqlalchemy/api.py code that looks like a
 second select. Perhaps he was referring to two *queries*, being the
 initial select and the following update?
 

FWIW - I think an example Matt was giving me yesterday was block devices
where we have:

@require_context
def block_device_mapping_update(context, bdm_id, values, legacy=True):
    _scrub_empty_str_values(values, ['volume_size'])
    values = _from_legacy_values(values, legacy, allow_updates=True)
    query = _block_device_mapping_get_query(context).filter_by(id=bdm_id)
    query.update(values)   # first query: Query.update() emits an UPDATE
    return query.first()   # second query: .first() emits a SELECT for the row

which gets called from object save()

N.


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Taking a break..

2014-10-27 Thread Nikola Đipanov
On 10/22/2014 07:37 PM, Chris Behrens wrote:
 Hey all,
 
 Just wanted to drop a quick note to say that I decided to leave Rackspace to 
 pursue another opportunity. My last day was last Friday. I won’t have much 
 time for OpenStack, but I’m going to continue to hang out in the channels. 
 Having been involved in the project since day 1, I’m going to find it 
 difficult to fully walk away. I really don’t know how much I’ll continue to 
 stay involved. I am completely burned out on nova. However, I’d really like 
 to see versioned objects broken out into oslo and Ironic synced with nova’s 
 object advancements. So, if I work on anything, it’ll probably be related to 
 that.
 
 Cells will be left in a lot of capable hands. I have shared some thoughts 
 with people on how I think we can proceed to make it ‘the way’ in nova. I’m 
 going to work on documenting some of this in an etherpad so the thoughts 
 aren’t lost.
 
 Anyway, it’s been fun… the project has grown like crazy! Keep on trucking... 
 And while I won’t be active much, don’t be afraid to ping me!
 

Thanks for all the hard work and best of luck in the new chapter Chris!

I will definitely take you up on the ping offer as I will need a -2
removed soon :)

N.

 - Chris
 
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] Cells conversation starter

2014-10-21 Thread Nikola Đipanov
On 10/20/2014 08:00 PM, Andrew Laski wrote:
 One of the big goals for the Kilo cycle by users and developers of the
 cells functionality within Nova is to get it to a point where it can be
 considered a first class citizen of Nova.  Ultimately I think this comes
 down to getting it tested by default in Nova jobs, and making it easy
 for developers to work with.  But there's a lot of work to get there. 
 In order to raise awareness of this effort, and get the conversation
 started on a few things, I've summarized a little bit about cells and
 this effort below.
 
 
 Goals:
 
 Testing of a single cell setup in the gate.
 Feature parity.
 Make cells the default implementation.  Developers write code once and
 it works for  cells.
 
 Ultimately the goal is to improve maintainability of a large feature
 within the Nova code base.


Thanks for the write-up Andrew! Some thoughts/questions below. Looking
forward to the discussion on some of these topics, and would be happy to
review the code once we get to that point.

 
 Feature gaps:
 
 Host aggregates
 Security groups
 Server groups
 
 
 Shortcomings:
 
 Flavor syncing
 This needs to be addressed now.
 
 Cells scheduling/rescheduling
 Instances can not currently move between cells
 These two won't affect the default one cell setup so they will be
 addressed later.
 
 
 What does cells do:
 
 Schedule an instance to a cell based on flavor slots available.
 Proxy API requests to the proper cell.
 Keep a copy of instance data at the global level for quick retrieval.
 Sync data up from a child cell to keep the global level up to date.
 
 
 Simplifying assumptions:
 
 Cells will be treated as a two level tree structure.
 

Are we thinking of making this official by removing the code that
actually allows cells to be a tree of depth N? I am not sure whether doing so
would be a win; the generality does complicate the RPC/Messaging/State code
a bit, though, and if it's not being used - even though it is a nice
generalization - why keep it around?

 
 Plan:
 
 Fix flavor breakage in child cell which causes boot tests to fail.
 Currently the libvirt driver needs flavor.extra_specs which is not
 synced to the child cell.  Some options are to sync flavor and extra
 specs to child cell db, or pass full data with the request.
 https://review.openstack.org/#/c/126620/1 offers a means of passing full
 data with the request.
 
 Determine proper switches to turn off Tempest tests for features that
 don't work with the goal of getting a voting job.  Once this is in place
 we can move towards feature parity and work on internal refactorings.
 
 Work towards adding parity for host aggregates, security groups, and
 server groups.  They should be made to work in a single cell setup, but
 the solution should not preclude them from being used in multiple
 cells.  There needs to be some discussion as to whether a host aggregate
 or server group is a global concept or per cell concept.
 

Have there been any previous discussions on this topic? If so I'd really
like to read up on those to make sure I understand the pros and cons
before the summit session.

 Work towards merging compute/api.py and compute/cells_api.py so that
 developers only need to make changes/additions in once place.  The goal
 is for as much as possible to be hidden by the RPC layer, which will
 determine whether a call goes to a compute/conductor/cell.
 
 For syncing data between cells, look at using objects to handle the
 logic of writing data to the cell/parent and then syncing the data to
 the other.
 

Some of that work has been done already, although in a somewhat ad-hoc
fashion. Were you thinking of extending objects to support this natively
(whatever that means), or do we continue to inline the code in the
existing object methods?

 A potential migration scenario is to consider a non cells setup to be a
 child cell and converting to cells will mean setting up a parent cell
 and linking them.  There are periodic tasks in place to sync data up
 from a child already, but a manual kick off mechanism will need to be
 added.
 
 
 Future plans:
 
 Something that has been considered, but is out of scope for now, is that
 the parent/api cell doesn't need the same data model as the child cell. 
 Since the majority of what it does is act as a cache for API requests,
 it does not need all the data that a cell needs and what data it does
 need could be stored in a form that's optimized for reads.
 
 
 Thoughts?
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Pulling nova/virt/hardware.py into nova/objects/

2014-10-21 Thread Nikola Đipanov
On 10/20/2014 07:38 PM, Jay Pipes wrote:
 Hi Dan, Dan, Nikola, all Nova devs,
 
 OK, so in reviewing Dan B's patch series that refactors the virt
 driver's get_available_resource() method [1], I am stuck between two
 concerns. I like (love even) much of the refactoring work involved in
 Dan's patches. They replace a whole bunch of our nested dicts that are
 used in the resource tracker with real objects -- and this is something
 I've been harping on for months that really hinders developer's
 understanding of Nova's internals.
 
 However, all of the object classes that Dan B has introduced have been
 unversioned objects -- i.e. they have not derived from
 nova.objects.base.NovaObject. This means that these objects cannot be
 sent over the wire via an RPC API call. In practical terms, this issue
 has not yet reared its head, because the resource tracker still sends a
 dictified JSON representation of the object's fields directly over the
 wire, in the same format as Icehouse, therefore there have been no
 breakages in RPC API compatibility.
 
 The problems with having all these objects not modelled by deriving from
 nova.objects.base.NovaObject are two-fold:
 
  * The object's fields/schema cannot be changed -- or rather, cannot be
 changed without introducing upgrade problems.
  * The objects introduce a different way of serializing the object
 contents than is used in nova/objects -- it's not that much different,
 but it's different, and only has not caused a problem because the
 serialization routines are not yet being used to transfer data over the
 wire
 
 So, what to do? Clearly, I think the nova/virt/hardware.py objects are
 badly needed. However, one of (the top?) priorities of the Nova project
 is upgradeability, and by not deriving from
 nova.objects.base.NovaObject, these nova.virt.hardware objects are
 putting that mission in jeopardy, IMO.
 
 My proposal is that before we go and approve any BPs or patches that add
 to nova/virt/hardware.py, we first put together a patch series that
 moves the object models in nova/virt/hardware.py to being full-fledged
 objects in nova/objects/*
 

I think that we should have both in some cases, and although it makes
sense to have them only as objects in others, having them as separate
classes for some and not others may be confusing.

So when does it make sense to have them as separate classes? Well,
basically whenever there is a need for driver-agnostic logic that will
be used outside of the driver (scheduler/claims/API/...). Can this stuff
go in objects? Technically yes, but objects are really not a good place
for such logic as they may already be trying to solve too much (data
versioning and downgrading when there is a multi-version cloud running,
database access for compute, and there are at least 2 more features
considered to be part of objects - cells integration and schema data
migrations).

Take CPU pinning as an example [1] - none of that logic would benefit
from living in the NovaObject child class itself, and it would make the
class quite bloated. Having it in a separate module that objects can call
into is definitely beneficial, while we should definitely stay with
objects for versioning/backporting support. So I say in a number of
cases we need both.

Both is exactly what I did for NUMA, with the exception of the compute
node side (we are hoping to start the json blob cleanup in K so I did
not concern myself with it for the sake of getting things done, but we
will need it). This is what I am doing now with CPU pinning.

The question I did not touch upon is what kind of interface that leaves
poor Nova developers with. Having everything as objects would allow us
to write things like (in the CPU pinning case):

  instance.cpu_pinning = compute.cpu_pinning.get_pinning_for_instance(
 instance)

Pretty slick, no? While keeping it completely separate would make us do
things like

  cpu_pinning = compute.cpu_pinning.topology_from_obj()
  if cpu_pinning:
instance_pinning = cpu_pinning.get_pinning_for_instance(
instance.cpu_pinning.topology_from_obj())
instance.cpu_pinning = objects.InstanceCPUPinning.obj_from_topology(
instance_pinning)

Way less slick, but it can be easily fixed with a level of indirection.
Note that the above holds only once everything is objectified - until
then, we pretty much *have* to have both.
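
To make that level of indirection concrete, something along these lines
(a rough sketch only - InstanceCPUPinning, get_pinning_for_instance()
and topology_from_obj() are illustrative names, not the final
hardware.py/objects API) would hide the conversion dance behind a single
classmethod:

    from nova.objects import base
    from nova.objects import fields


    class InstanceCPUPinning(base.NovaObject):
        # Version 1.0: Initial version
        VERSION = '1.0'

        fields = {
            # A richer field type would be wanted for real; a plain
            # string keeps the sketch free of extra assumptions.
            'pinning': fields.StringField(nullable=True),
        }

        @classmethod
        def obj_from_topology(cls, topology):
            # Wrap the plain hardware.py result in a versioned object.
            return cls(pinning=str(topology))

        @classmethod
        def pin_instance_on_host(cls, compute, instance):
            # Keep the obj <-> topology dance in one place...
            host_pinning = compute.cpu_pinning.topology_from_obj()
            if not host_pinning:
                return None
            inst_pinning = host_pinning.get_pinning_for_instance(
                instance.cpu_pinning.topology_from_obj())
            return cls.obj_from_topology(inst_pinning)

...so that the call site collapses back to something close to the slick
version:

    instance.cpu_pinning = objects.InstanceCPUPinning.pin_instance_on_host(
        compute, instance)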

So to sum up - what I think we should do is:

1) Don't bloat the object code with low level stuff
2) Do have objects for versioning everything
3) Make nice APIs that developers can enjoy (after we've converted all
the code to use objects).

N.

[1] https://review.openstack.org/#/c/128738/4/nova/virt/hardware.py

 Thoughts?
 
 -jay
 
 [1]
 https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bp/virt-driver-get-available-resources-object,n,z
 


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org

Re: [openstack-dev] [nova] Nova Project Priorities for Kilo

2014-10-08 Thread Nikola Đipanov
On 10/07/2014 01:01 AM, Joe Gordon wrote:
 Hi all,
 
 One of the outcomes of the nova midcyle meetup was to pick several
 things for nova, as a team, to prioritize in Kilo. More background on
 this can be found at [0]
 
 We are now collecting ideas for project priorities on this etherpad [1],
 with the goal of discussing and finalizing the list or priorities for
 Kilo at the summit.
 
 [0] 
 http://docs.openstack.org/developer/nova/devref/kilo.blueprints.html#project-priorities
 [1] https://etherpad.openstack.org/p/kilo-nova-priorities
 

Slightly off topic:

Jay and Sylvain, can we maybe move the discussion that started there to
a different etherpad that we can link, so that we don't hijack the pad
completely, but still have that discussion as I think we have very
similar ideas - we should just agree on implementation.

Thanks,
N.

 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] What's holding nova development back?

2014-09-15 Thread Nikola Đipanov
On 09/13/2014 11:07 PM, Michael Still wrote:
 Just an observation from the last week or so...
 
 The biggest problem nova faces at the moment isn't code review latency.
 Our biggest problem is failing to fix our bugs so that the gate is
 reliable. The number of rechecks we've done in the last week to try and
 land code is truly startling.
 

This is exactly what I was saying in my ranty email from 2 weeks ago
[1]. Debt is everywhere and, like any debt, it is unlikely to go away on
its own.


 I know that some people are focused by their employers on feature work,
 but those features aren't going to land in a world in which we have to
 hand walk everything through the gate.
 

The thing is that - without doing work on the code - you cannot know
where the real issues are. You cannot look at a codebase as big as Nova
and say, hmmm looks like we need to fix the resource tracker. You can
know that only if you are neck-deep in the stuff. And then you need to
agree on what is really bad and what is just distasteful, and then focus
the efforts on that. None of the things we've put in place (specs, the
way we do and organize code review and bugs) acknowledge or help this
part of the development process.

I tried to explain this in my previous ranty email [1] but I guess I
failed due to ranting :) so let me try again: the Nova team needs to act
as a development team.

We are not in a place (yet?) where we can just oversee the addition of
features based on whether they are appropriate for our use case. We have
to work together on a set of important things to get Nova to where we
think it needs to be and make sure we get it done - by actually doing
it! (*)

However - I don't think freezing development of features for a cycle is
a viable option - this is just not how software in the real world gets
done. It will likely be the worst possible thing we can do, no matter
how appealing it seems to us as developers.

But we do need to be extremely strict on what we let in, and under which
conditions! As I mentioned to sdague on IRC the other day (yes, I am
quoting myself :) ): Not all features are the same - there are
features that are better, that are coded better, and are integrated
better - we should be wanting those features always! Then there are
features that are a net negative on the code - we should *never* want
those features. And then there are features in the middle - we may want
to cut those or push them back depending on a number of things that are
important. Things like: code quality, can it fit withing the current
constraints, can we let it in like that, or some work needs to happen
first. Things which we haven't been really good at considering
previously IMHO.

But you can't really judge that unless you are actively developing Nova
yourself, and have a tighter grip on the proposed code than what our
current process gives.

Peace!
N.

[1]
http://lists.openstack.org/pipermail/openstack-dev/2014-September/044722.html

(*) The only effort like this going on at the moment in Nova is the
Objects work done by dansmith (even though there are several others
proposed) - I will let the readers judge how much of an impact it was in
only 2 short cycles, from just a single effort.

 Michael
 
 
 -- 
 Rackspace Australia
 
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] What's holding nova development back?

2014-09-15 Thread Nikola Đipanov
On 09/14/2014 12:27 AM, Boris Pavlovic wrote:
 Michael, 
 
 I am so glad that you started this topic.
 I really like idea of  of taking a pause with features and concentrating
 on improvement of current code base. 
 
 Even if the 1 k open bugs https://bugs.launchpad.net/nova are vital
 issue, there are other things that could be addressed to improve Nova
 team throughput. 
 
 Like it was said in another thread: Nova code is current too big and
 complex to be understand by one person.
 It produces 2 issues: 
 A) There is hard to find person who can observer full project and make
 global architecture decisions including work on cross projects interactions
 (So project doesn't have straight direction of development)
 B) It's really hard to find cores, and current cores are under too heavy
 load (because of project complexity)
 
 I believe that whole current Nova functionality can be implemented in
 much simpler manner.

Just a brief comment on the sentence above.

This is a common thing to hear from coders, and is very rarely rooted in
reality IMHO. Nova does _a lot_ of things. Saying that given an
exhaustive list of features it has, we can implement them in a much
simpler manner is completely disregarding all the complexity of building
software that works within real world constraints.

 Basically, complexity was added during the process of adding a lot of
 features for years, that didn't perfectly fit to architecture of Nova. 
 And there wasn't much work on refactoring the architecture to cleanup
 these features. 
 

I agree with this of course - fixing architectural flaws is important
and needs to be an ongoing part of the process, as I mention in my other
mail to the thread. Halting all other development is not the way to do
it though.

N.

 So maybe it's proper time to think about what, why and how we are
 doing. 
 That will allows us to find simpler solutions for current functionality. 
 
 
 Best regards,
 Boris Pavlovic 
 
 
 On Sun, Sep 14, 2014 at 1:07 AM, Michael Still mi...@stillhq.com wrote:
 
 Just an observation from the last week or so...
 
 The biggest problem nova faces at the moment isn't code review
 latency. Our biggest problem is failing to fix our bugs so that the
 gate is reliable. The number of rechecks we've done in the last week
 to try and land code is truly startling.
 
 I know that some people are focused by their employers on feature
 work, but those features aren't going to land in a world in which we
 have to hand walk everything through the gate.
 
 Michael
 
 
 -- 
 Rackspace Australia
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 
 
 
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova][FFE] Feature freeze exception for virt-driver-numa-placement

2014-09-05 Thread Nikola Đipanov
On 09/04/2014 07:42 PM, Murray, Paul (HP Cloud) wrote:
 
 Anyway, not enough to -1 it, but enough to at least say something.
 
 .. but I do not want to get into the discussion about software testing
 here, not the place really.
 
 However, I do think it is very harmful to respond to FFE request with
 such blanket statements and generalizations, if only for the message it
 sends to the contributors (that we really care more about upholding our
 own myths as a community than users and features).
 
 I believe you brought this up as one of your justifications for the FFE.
 When I read your statement it does sound as though you want to put
 experimental code in at the final release. I am sure that is not what
 you had in mind, but I am also sure you can also understand Sean's point
 of view. His point is clear and pertinent to your request.
 
 As the person responsible for Nova in HP I will be interested to see how
 it operates in practice. I can assure you we will do extensive testing
 on it before it goes into the wild and we will not put it into practice
 if we are not happy.
 
 That is awesome and we as a project are lucky to have that! I would not
 want things put into practice that users can't use or see huge flaws with.
 
 I can't help but read this as you being OK with the feature going ahead,
 though :).
 
 Actually, let’s say I have no particular objection. Just thought Sean’s
 point is worth noting.
 
 Now, if this had been done as an extensible resource I could easily
 decouple deploying it from all the bug fixes that come through with the
 release. But that’s another matter…
 

Quick response so as not to hijack the thread:

I think we all agree on the benefits of having resources you can turn
off and on at will.

The current implementation of it, however, has some glaring drawbacks -
discussed in detail on other threads and heavily on IRC - that made it
impossible for me to base my work on it, hence we need to rethink how to
get there.

 
 Paul
 
 
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] Feature Freeze Exception process for Juno

2014-09-05 Thread Nikola Đipanov
On 09/04/2014 10:25 PM, Solly Ross wrote:
 Anyway, I think it would be useful to have some sort of page where people
 could say I'm an SME in X, ask me for reviews and then patch submitters 
 could go
 and say, oh, I need an someone to review my patch about storage backends, 
 let me
 ask sross.
 

This is a good point - I've been thinking along similar lines: we
really could have a huge win in terms of the review experience by
building a tool (maybe a social-network-looking one :)) that relates
reviews to the people able to do them, visualizes reviewer karma and
other things that can help make code submissions and reviews more
human friendly.

Dan seems to dismiss the idea of improved tooling as something that can
only get us so far, but I am not convinced. However - this will
require even more manpower and we are already ridiculously short on that
so...

N.


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova][FFE] Feature freeze exception for virt-driver-numa-placement

2014-09-05 Thread Nikola Đipanov
Since this did not get an 'Approved' as of yet, I want to make sure that
this is not because of the number of sponsors. 2 core members have already
sponsored it, and as per [1] cores can sponsor their own FFEs, so that's 3.

N.

[1]
http://lists.openstack.org/pipermail/openstack-dev/2014-September/044669.html

On 09/04/2014 01:58 PM, Nikola Đipanov wrote:
 Hi team,
 
 I am requesting the exception for the feature from the subject (find
 specs at [1] and outstanding changes at [2]).
 
 Some reasons why we may want to grant it:
 
 First of all all patches have been approved in time and just lost the
 gate race.
 
 Rejecting it makes little sense really, as it has been commented on by a
 good chunk of the core team, most of the invasive stuff (db migrations
 for example) has already merged, and the few parts that may seem
 contentious have either been discussed and agreed upon [3], or can
 easily be addressed in subsequent bug fixes.
 
 It would be very beneficial to merge it so that we actually get real
 testing on the feature ASAP (scheduling features are not tested in the
 gate so we need to rely on downstream/3rd party/user testing for those).
 
 Thanks,
 
 Nikola
 
 [1]
 http://git.openstack.org/cgit/openstack/nova-specs/tree/specs/juno/virt-driver-numa-placement.rst
 [2]
 https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bp/virt-driver-numa-placement,n,z
 [3] https://review.openstack.org/#/c/111782/
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] Feature Freeze Exception process for Juno

2014-09-04 Thread Nikola Đipanov
On 09/04/2014 02:07 AM, Joe Gordon wrote:
 
 
 
  On Wed, Sep 3, 2014 at 2:50 AM, Nikola Đipanov ndipa...@redhat.com wrote:
 
 On 09/02/2014 09:23 PM, Michael Still wrote:
   On Tue, Sep 2, 2014 at 1:40 PM, Nikola Đipanov ndipa...@redhat.com wrote:
  On 09/02/2014 08:16 PM, Michael Still wrote:
  Hi.
 
  We're soon to hit feature freeze, as discussed in Thierry's recent
  email. I'd like to outline the process for requesting a freeze
  exception:
 
  * your code must already be up for review
  * your blueprint must have an approved spec
  * you need three (3) sponsoring cores for an exception to be
 granted
 
  Can core reviewers who have features up for review have this number
  lowered to two (2) sponsoring cores, as they in reality then need
 four
  (4) cores (since they themselves are one (1) core but cannot really
  vote) making it an order of magnitude more difficult for them to hit
  this checkbox?
 
  That's a lot of numbers in that there paragraph.
 
  Let me re-phrase your question... Can a core sponsor an exception they
  themselves propose? I don't have a problem with someone doing that,
  but you need to remember that does reduce the number of people who
  have agreed to review the code for that exception.
 
 
 Michael has correctly picked up on a hint of snark in my email, so let
 me explain where I was going with that:
 
 The reason many features including my own may not make the FF is not
 because there was not enough buy in from the core team (let's be
 completely honest - I have 3+ other core members working for the same
 company that are by nature of things easier to convince), but because of
 any of the following:
 
 
 I find the statement about having multiple cores at the same company
 very concerning. To quote Mark McLoughlin, It is assumed that all core
 team members are wearing their upstream hat and aren't there merely to
 represent their employers interests [0]. Your statement appears to be
 in direct conflict with Mark's idea of what core reviewer is, and idea
 that IMHO is one of the basic tenants of OpenStack development.
 

This is of course taking my words completely out of context - I was
making a point of how arbitrary changing the number of reviewers needed
is, and how it completely misses the real issues IMHO.

I have no interest in continuing this particular debate further, and
would appreciate it if people could refrain from resorting to such
straw-man type arguments, as they can be very damaging to the overall
level of conversation we need to maintain.

 [0] http://lists.openstack.org/pipermail/openstack-dev/2013-July/012073.html
 
  
 
 
 * Crippling technical debt in some of the key parts of the code
 * that we have not been acknowledging as such for a long time
 * which leads to proposed code being arbitrarily delayed once it makes
 the glaring flaws in the underlying infra apparent
 * and that specs process has been completely and utterly useless in
 helping uncover (not that process itself is useless, it is very useful
 for other things)
 
 I am almost positive we can turn this rather dire situation around
 easily in a matter of months, but we need to start doing it! It will not
 happen through pinning arbitrary numbers to arbitrary processes.
 
 
 Nova is big and complex enough that I don't think any one person is able
 to identify what we need to work on to make things better. That is one
 of the reasons why I have the project priorities patch [1] up. I would
 like to see nova as a team discuss and come up with what we think we
 need to focus on to get us back on track.
 
 
 [1] https://review.openstack.org/#/c/112733/
 

Yes - I was thinking along similar lines to what you propose on that
patch; too bad if the above sentence came across as implying I had some
kind of cowboy one-man crusade in mind :) - that is totally not what I meant.

We need strong consensus on what is important for the project, and we
need hands behind that (both hackers and reviewers). Having a good chunk
of core devs not actually writing critical bits of code is a bad sign IMHO.

I have some additions to your list of priorities which I will add as
comments on the review above (with some other comments of my own), and
we can discuss from there - sorry I missed this! I will likely do that
instead of spamming further with another email as the baseline seems
sufficiently similar to where I stand.

 
 
 I will follow up with a more detailed email about what I believe we are
 missing, once the FF settles and I have applied some soothing creme to
 my burnout wounds, but currently my sentiment is:
 
 Contributing features to Nova nowadays SUCKS!!1 (even as a core
 reviewer) We _have_ to change that!
 
 
 Yes, I can agree with you

[openstack-dev] [Nova][FFE] Feature freeze exception for virt-driver-numa-placement

2014-09-04 Thread Nikola Đipanov
Hi team,

I am requesting the exception for the feature from the subject (find
specs at [1] and outstanding changes at [2]).

Some reasons why we may want to grant it:

First of all, all patches have been approved in time and just lost the
gate race.

Rejecting it makes little sense really, as it has been commented on by a
good chunk of the core team, most of the invasive stuff (db migrations
for example) has already merged, and the few parts that may seem
contentious have either been discussed and agreed upon [3], or can
easily be addressed in subsequent bug fixes.

It would be very beneficial to merge it so that we actually get real
testing on the feature ASAP (scheduling features are not tested in the
gate so we need to rely on downstream/3rd party/user testing for those).

Thanks,

Nikola

[1]
http://git.openstack.org/cgit/openstack/nova-specs/tree/specs/juno/virt-driver-numa-placement.rst
[2]
https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bp/virt-driver-numa-placement,n,z
[3] https://review.openstack.org/#/c/111782/

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] Feature Freeze Exception process for Juno

2014-09-04 Thread Nikola Đipanov
On 09/04/2014 03:36 AM, Dean Troyer wrote:
  On Wed, Sep 3, 2014 at 7:07 PM, Joe Gordon joe.gord...@gmail.com wrote:
 
  On Wed, Sep 3, 2014 at 2:50 AM, Nikola Đipanov ndipa...@redhat.com wrote:
 
 The reason many features including my own may not make the FF is not
 because there was not enough buy in from the core team (let's be
 completely honest - I have 3+ other core members working for the
 same
 company that are by nature of things easier to convince), but
 because of
 any of the following:
 
 
 I find the statement about having multiple cores at the same company
 very concerning. To quote Mark McLoughlin, It is assumed that all
 core team members are wearing their upstream hat and aren't there
 merely to represent their employers interests [0]. Your statement
 appears to be in direct conflict with Mark's idea of what core
 reviewer is, and idea that IMHO is one of the basic tenants of
 OpenStack development.
 
 
 FWIW I read Nikola's 'by nature of things' statement to be more of a
 representation of the higher-bandwith communication and relationships
 with co-workers rather than for the company.  I hope my reading is not
 wrong.
 

Thanks for not reading too much into that sentence - yes, this is quite
close to what I meant, and I used it to make a point about how I think we
are focusing on the wrong thing (as already mentioned in the direct
response to Joe).

N.

 I know a while back some of the things I was trying to land in multiple
 projects really benefited from having both the relationships and
 high-bandwidth communication to 4 PTLs, three of whom were in the same
 room at the time.
 
 There is the perception problem, exactly what Mark also wrote about,
 when that happens off-line, and I think it is our responsibility (those
 advocating the reviews, and those responding to them) to note the
 outcome of those discussions on the record somewhere, IMO preferably in
 Gerrit.
 
 dt
 
 -- 
 
 Dean Troyer
 dtro...@gmail.com
 
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova][FFE] Feature freeze exception for virt-driver-numa-placement

2014-09-04 Thread Nikola Đipanov
On 09/04/2014 02:31 PM, Sean Dague wrote:
 On 09/04/2014 07:58 AM, Nikola Đipanov wrote:
 Hi team,

 I am requesting the exception for the feature from the subject (find
 specs at [1] and outstanding changes at [2]).

 Some reasons why we may want to grant it:

 First of all all patches have been approved in time and just lost the
 gate race.

 Rejecting it makes little sense really, as it has been commented on by a
 good chunk of the core team, most of the invasive stuff (db migrations
 for example) has already merged, and the few parts that may seem
 contentious have either been discussed and agreed upon [3], or can
 easily be addressed in subsequent bug fixes.

 It would be very beneficial to merge it so that we actually get real
 testing on the feature ASAP (scheduling features are not tested in the
 gate so we need to rely on downstream/3rd party/user testing for those).
 
 This statement bugs me. It seems kind of backwards to say we should
 merge a thing that we don't have a good upstream test plan on and put it
 in a release so that the testing will happen only in the downstream case.
 

The objective reality is that many other things have not had upstream
testing for a long time (anything that requires more than 1 compute node
in Nova for example, and any scheduling feature - as I mention clearly
above), so not sure how that is backwards from any reasonable point.

Thanks to folks using them, it is still kept working and bugs get fixed.
Getting features into the hands of users is extremely important...

 Anyway, not enough to -1 it, but enough to at least say something.
 

.. but I do not want to get into the discussion about software testing
here, not the place really.

However, I do think it is very harmful to respond to FFE request with
such blanket statements and generalizations, if only for the message it
sends to the contributors (that we really care more about upholding our
own myths as a community than users and features).

N.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] FFE request serial-ports

2014-09-04 Thread Nikola Đipanov
On 09/04/2014 02:42 PM, Sahid Orentino Ferdjaoui wrote:
 Hello,
 
 I would like to request a FFE for 4 changesets to complete the
 blueprint serial-ports.
 
 Topic on gerrit:
   
 https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bp/serial-ports,n,z
 
 Blueprint on launchpad.net:
   https://blueprints.launchpad.net/nova/+spec/serial-ports
 
 They have already been approved but didn't get enough time to be merged
 by the gate.
 
 Sponsored by:
 Daniel Berrange
 Nikola Dipanov
 

This is also one of the ones that simply lost the gate race in the end,
and I've reviewed several iterations of it, so +1 from me.

N.

 s.
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [vmware][nova][FFE] vmware-spawn-refactor

2014-09-04 Thread Nikola Đipanov
On 09/04/2014 03:46 PM, Daniel P. Berrange wrote:
 On Thu, Sep 04, 2014 at 02:09:26PM +0100, Matthew Booth wrote:
 I'd like to request a FFE for the remaining changes from
 vmware-spawn-refactor. They are:

 https://review.openstack.org/#/c/109754/
 https://review.openstack.org/#/c/109755/
 https://review.openstack.org/#/c/114817/
 https://review.openstack.org/#/c/117467/
 https://review.openstack.org/#/c/117283/

 https://review.openstack.org/#/c/98322/

 All but the last had +A, and were in the gate at the time it was closed.
 The last had not yet been approved, but is ready for core review. It has
 recently had some orthogonal changes split out to simplify it
 considerably. It is largely a code motion patch, and has been given +1
 by VMware CI multiple times.
 
 They're all internal to the VMWare driver, have multiple ACKs from VMWare
 maintainers as well as core, so don't require extra review time. So I think
 it is reasonable request.
 
 ACK, I'll sponsor it.
 

+1 here - I've already looked at a number of those.

N.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova][FFE] Feature freeze exception for virt-driver-numa-placement

2014-09-04 Thread Nikola Đipanov
On 09/04/2014 04:51 PM, Murray, Paul (HP Cloud) wrote:
 
 On 4 September 2014 14:07, Nikola Đipanov ndipa...@redhat.com wrote:
 On 09/04/2014 02:31 PM, Sean Dague wrote:
 On 09/04/2014 07:58 AM, Nikola Đipanov wrote:
 Hi team,
 
 I am requesting the exception for the feature from the subject (find
 specs at [1] and outstanding changes at [2]).
 
 Some reasons why we may want to grant it:
 
 First of all all patches have been approved in time and just lost the
 gate race.
 
 Rejecting it makes little sense really, as it has been commented on by a
 good chunk of the core team, most of the invasive stuff (db migrations
 for example) has already merged, and the few parts that may seem
 contentious have either been discussed and agreed upon [3], or can
 easily be addressed in subsequent bug fixes.
 
 It would be very beneficial to merge it so that we actually get real
 testing on the feature ASAP (scheduling features are not tested in the
 gate so we need to rely on downstream/3rd party/user testing for those).
 
 This statement bugs me. It seems kind of backwards to say we should
 merge a thing that we don't have a good upstream test plan on and put it
 in a release so that the testing will happen only in the downstream case.
 
 The objective reality is that many other things have not had upstream
 testing for a long time (anything that requires more than 1 compute node
 in Nova for example, and any scheduling feature - as I mention clearly
 above), so not sure how that is backwards from any reasonable point.
 
 Thanks to folks using them, it is still kept working and bugs get fixed.
 Getting features into the hands of users is extremely important...
 
 Anyway, not enough to -1 it, but enough to at least say something.
 
 .. but I do not want to get into the discussion about software testing
 here, not the place really.
 
 However, I do think it is very harmful to respond to FFE request with
 such blanket statements and generalizations, if only for the message it
 sends to the contributors (that we really care more about upholding our
 own myths as a community than users and features).
 
 I believe you brought this up as one of your justifications for the FFE.
 When I read your statement it does sound as though you want to put
 experimental code in at the final release. I am sure that is not what
 you had in mind, but I am also sure you can also understand Sean's point
 of view. His point is clear and pertinent to your request.
 
 As the person responsible for Nova in HP I will be interested to see how
 it operates in practice. I can assure you we will do extensive testing
 on it before it goes into the wild and we will not put it into practice
 if we are not happy.

That is awesome and we as a project are lucky to have that! I would not
want things put into practice that users can't use or see huge flaws with.

I can't help but read this as you being OK with the feature going ahead,
though :).

N.

  
 
 Paul
 
 Paul Murray
 Nova Technical Lead, HP Cloud
 +44 117 312 9309
 
 Hewlett-Packard Limited registered Office: Cain Road, Bracknell, Berks
 RG12 1HN Registered No: 690597 England. The contents of this message and
 any attachments to it are confidential and may be legally privileged. If
 you have received this message in error, you should delete it from your
 system immediately and advise the sender. To any recipient of this
 message within HP, unless otherwise stated you should consider this
 message and attachments as HP CONFIDENTIAL.
 
 
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] requesting an FFE for SRIOV

2014-09-04 Thread Nikola Đipanov
On 09/04/2014 05:16 PM, Dan Smith wrote:
 The main sr-iov patches have gone through lots of code reviews, manual
 rebasing, etc. Now we have some critical refactoring work on the
 existing infra to get it ready. All the code for refactoring and sr-iov
 is up for review.  
 
 I've been doing a lot of work on this recently, and plan to see it
 through if possible.
 
 So, I'll be a sponsor.
 
 In the meeting russellb said he would as well. I think he's tied up
 today, so I'm proxying him in here :)
 
 --Dan
 

I've already looked at some of this, and some of the work is based on
the work I did for the NUMA blueprint (that Dan contributed to quite a
bit as well) so I'd be happy to make sure this lands too.

N.

 
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] Feature Freeze Exception process for Juno

2014-09-04 Thread Nikola Đipanov
On 09/04/2014 11:23 AM, John Garbutt wrote:
 Sorry for another top post, but I like how Nikola has pulled this
 problem apart, and wanted to respond directly to his response.
 

Thanks John - I'm glad this caught your eye.

 On 3 September 2014 10:50, Nikola Đipanov ndipa...@redhat.com wrote:
 The reason many features including my own may not make the FF is not
 because there was not enough buy in from the core team (let's be
 completely honest - I have 3+ other core members working for the same
 company that are by nature of things easier to convince), but because of
 any of the following:

 * Crippling technical debt in some of the key parts of the code
 
 +1
 
 We have problems that need solving.
 
 One of the ideas behind the slots proposal is to encourage work on
 the urgent technical debt, before related features are even approved.
 

As I stated before, my issue with slots was more about the fact that
they seem to me like a needlessly elaborate process to communicate a
simple list of important things we focus on, not about what they were
trying to accomplish, which I fully support.

Not sure where I stand on it still, but if it will get us closer to
fixing stuff we really need to fix, then I won't argue with it.

 * that we have not been acknowledging as such for a long time
 
 -1
 
 We keep saying thats cool, but we have to fix/finish XXX first.
 
 But... we have been very bad at:
 * remembering that, and recording that
 * actually fixing those problems
 

This seems to me to tie in with prioritizing important work, like jog0
is proposing in [1]. I am not sure if just prioritizing it will work,
though - we've had blueprints before that we all agreed were high
priority that got delayed and punted mostly because of a lack of hands
to do the work. I am not sure how to solve this problem.

Even with what danpb is proposing on a parallel thread with the drivers
- this is still work that needs to be done on the core.

Objects is a good example of how this can work, but we must not forget
that it had a strong backing and several highly skilled people working
on it. This is the part prioritizing won't solve.

[1] https://review.openstack.org/#/c/112733/

 * which leads to proposed code being arbitrarily delayed once it makes
 the glaring flaws in the underlying infra apparent
 
 Sometimes we only spot this stuff in code reviews, where you throw up
 reading all the code around the change, and see all the extra
 complexity being added to a fragile bit of the code, and well, then
 you really don't want to be the person who clicks approve on that.
 
 We need to track this stuff better. Every time it happens, we should
 try make a not to go back there and do more tidy ups.
 

+1 - absolutely - we definitely lack the "grumpy developer who goes in
and fixes stuff" mentality!

 * and that specs process has been completely and utterly useless in
 helping uncover (not that process itself is useless, it is very useful
 for other things)
 
 Yeah, it hasn't helped for this.
 
 I don't think we should do this, but I keep thinking about making
 specs two step:
 * write generally direction doc
 * go write the code, maybe upload as WIP
 * write the documentation part of the spec
 * get docs merged before any code
 

I would say that we need to keep the spec approval as lightweight as
possible so that we can make sure we get to the details (where the devil
resides) sooner rather than later... so maybe a 2-phase process along
the lines of:

* This feature makes sense for Nova and it is proposed in a reasonable
manner (merge quickly into a /tentative dir)
* This now looks like a coherent whole with the POC code, so move the
spec to an /approved dir and work on the details of the code, OR we need
to fix some issues first - let's see how we can do that and still not
make the life of the feature proposer a living hell just for hitting a
snag.

 I am almost positive we can turn this rather dire situation around
 easily in a matter of months, but we need to start doing it! It will not
 happen through pinning arbitrary numbers to arbitrary processes.
 
 +1
 
 This is ongoing, but there are some major things, I feel we should
 stop and fix in kilo.
 
 ...and that will make getting features in much worse for a little
 while, but it will be much better on the other side.
 

I really do hope so because, all things considered, I don't think Nova
is in a horrible state - but we need to work on it _now_. Many of these
issues have been known for a long time and are just piling up.

 I will follow up with a more detailed email about what I believe we are
 missing, once the FF settles and I have applied some soothing creme to
 my burnout wounds
 
 Awesome, please catch up with jogo who was also trying to build this
 list. I would love to continue to contribute to that too.
 

Yes - as already said in my response to jog0 - I missed his proposal and
will comment there.

 Might be working moving into here:
 https://etherpad.openstack.org/p/kilo-nova-summit-topics
 
 The idea was/is to use

Re: [openstack-dev] [Nova] Feature Freeze Exception process for Juno

2014-09-03 Thread Nikola Đipanov
On 09/02/2014 09:23 PM, Michael Still wrote:
 On Tue, Sep 2, 2014 at 1:40 PM, Nikola Đipanov ndipa...@redhat.com wrote:
 On 09/02/2014 08:16 PM, Michael Still wrote:
 Hi.

 We're soon to hit feature freeze, as discussed in Thierry's recent
 email. I'd like to outline the process for requesting a freeze
 exception:

 * your code must already be up for review
 * your blueprint must have an approved spec
 * you need three (3) sponsoring cores for an exception to be granted

 Can core reviewers who have features up for review have this number
 lowered to two (2) sponsoring cores, as they in reality then need four
 (4) cores (since they themselves are one (1) core but cannot really
 vote) making it an order of magnitude more difficult for them to hit
 this checkbox?
 
 That's a lot of numbers in that there paragraph.
 
 Let me re-phrase your question... Can a core sponsor an exception they
 themselves propose? I don't have a problem with someone doing that,
 but you need to remember that does reduce the number of people who
 have agreed to review the code for that exception.
 

Michael has correctly picked up on a hint of snark in my email, so let
me explain where I was going with that:

The reason many features including my own may not make the FF is not
because there was not enough buy in from the core team (let's be
completely honest - I have 3+ other core members working for the same
company that are by nature of things easier to convince), but because of
any of the following:

* Crippling technical debt in some of the key parts of the code
* that we have not been acknowledging as such for a long time
* which leads to proposed code being arbitrarily delayed once it makes
the glaring flaws in the underlying infra apparent
* and that specs process has been completely and utterly useless in
helping uncover (not that process itself is useless, it is very useful
for other things)

I am almost positive we can turn this rather dire situation around
easily in a matter of months, but we need to start doing it! It will not
happen through pinning arbitrary numbers to arbitrary processes.

I will follow up with a more detailed email about what I believe we are
missing, once the FF settles and I have applied some soothing creme to
my burnout wounds, but currently my sentiment is:

Contributing features to Nova nowadays SUCKS!!1 (even as a core
reviewer) We _have_ to change that!

N.

 Michael
 
 * exceptions must be granted before midnight, Friday this week
 (September 5) UTC
 * the exception is valid until midnight Friday next week
 (September 12) UTC when all exceptions expire

 For reference, our rc1 drops on approximately 25 September, so the
 exception period needs to be short to maximise stabilization time.

 John Garbutt and I will both be granting exceptions, to maximise our
 timezone coverage. We will grant exceptions as they come in and gather
 the required number of cores, although I have also carved some time
 out in the nova IRC meeting this week for people to discuss specific
 exception requests.

 Michael



 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 
 
 


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] Feature Freeze Exception process for Juno

2014-09-02 Thread Nikola Đipanov
On 09/02/2014 08:16 PM, Michael Still wrote:
 Hi.
 
 We're soon to hit feature freeze, as discussed in Thierry's recent
 email. I'd like to outline the process for requesting a freeze
 exception:
 
 * your code must already be up for review
 * your blueprint must have an approved spec
 * you need three (3) sponsoring cores for an exception to be granted

Can core reviewers who have features up for review have this number
lowered to two (2) sponsoring cores, as they in reality then need four
(4) cores (since they themselves are one (1) core but cannot really
vote) making it an order of magnitude more difficult for them to hit
this checkbox?

Thanks,
N.

 * exceptions must be granted before midnight, Friday this week
 (September 5) UTC
 * the exception is valid until midnight Friday next week
 (September 12) UTC when all exceptions expire
 
 For reference, our rc1 drops on approximately 25 September, so the
 exception period needs to be short to maximise stabilization time.
 
 John Garbutt and I will both be granting exceptions, to maximise our
 timezone coverage. We will grant exceptions as they come in and gather
 the required number of cores, although I have also carved some time
 out in the nova IRC meeting this week for people to discuss specific
 exception requests.
 
 Michael
 


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

