Re: [Openstack-operators] [nova][cinder] Is there interest in an admin-api to refresh volume connection info?

2017-09-14 Thread Matt Riedemann

On 9/13/2017 10:31 AM, Morgenstern, Chad wrote:

I have been studying how to perform failover operations with Cinder's --failover. 
Nova is not aware of the failover event. Being able to refresh the connection 
state, particularly for Nova, would come in very handy in admin-level 
DR scenarios.

I'm attaching the blog I wrote on the subject:
http://netapp.io/2017/08/09/cinder-cheesecake-things-to-consider/


I'm not aware of the host failover feature in Cinder, but you're correct 
that there is no event listener on the nova side for this happening.


You could build something like this into the os-server-external-events 
API in Nova:


https://developer.openstack.org/api-ref/compute/#create-external-events-os-server-external-events

That is used today for Cinder to trigger a swap volume or volume extend 
operation in Nova. It would require a microversion and spec in nova, but 
it doesn't seem that hard to do, depending on what you need to do on the 
nova side. With the new 3.27 Cinder attachments APIs, it seems it could 
just be an attachment delete/create operation, but would we also need to 
disconnect old/connect new connections in os-brick?
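
To illustrate the idea, here is a rough sketch of what sending such an event
might look like (the "volume-connection-changed" event name is hypothetical
and does not exist today; the endpoint, token and microversion handling are
simplified placeholders):

# Sketch only: a hypothetical "volume-connection-changed" external event.
# Nova does not support this event name today; "volume-extended"
# (microversion 2.51) is the closest existing example of Cinder notifying
# Nova through this API.
import requests

NOVA_ENDPOINT = "http://controller:8774/v2.1"  # placeholder endpoint
TOKEN = "<admin-or-service-token>"             # placeholder token

def send_volume_connection_event(server_uuid, volume_id):
    body = {
        "events": [{
            "name": "volume-connection-changed",  # hypothetical event name
            "server_uuid": server_uuid,
            "tag": volume_id,  # the tag would identify the affected volume/BDM
        }]
    }
    resp = requests.post(
        NOVA_ENDPOINT + "/os-server-external-events",
        json=body,
        headers={
            "X-Auth-Token": TOKEN,
            # a new event would need its own microversion; 2.51 is what
            # volume-extended uses today
            "X-OpenStack-Nova-API-Version": "2.51",
        },
    )
    resp.raise_for_status()
    return resp.json()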


--

Thanks,

Matt



Re: [Openstack-operators] [nova] Cinder cross_az_attach=False changes/fixes

2017-09-14 Thread Sylvain Bauza
On Tue, Jun 6, 2017 at 9:45 PM, Sam Morrison  wrote:

> Hi Matt,
>
> Just looking into this,
>
> > On 1 Jun 2017, at 9:08 am, Matt Riedemann  wrote:
> >
> > This is a request for any operators out there that configure nova to set:
> >
> > [cinder]
> > cross_az_attach=False
> >
> > To check out these two bug fixes:
> >
> > 1. https://review.openstack.org/#/c/366724/
> >
> > This is a case where nova is creating the volume during boot from volume
> and providing an AZ to cinder during the volume create request. Today we
> just pass the instance.availability_zone which is None if the instance was
> created without an AZ set. It's unclear to me if that causes the volume
> creation to fail (someone in IRC was showing the volume going into ERROR
> state while Nova was waiting for it to be available), but I think it will
> cause the later attach to fail here [1] because the instance AZ (defaults
> to None) and volume AZ (defaults to nova) may not match. I'm still looking
> for more details on the actual failure in that one though.
> >
> > The proposed fix in this case is pass the AZ associated with any host
> aggregate that the instance is in.
>
> If cross_az_attach is false, won’t it always result in the instance AZ
> being None, as it won’t be on a host yet?
> I haven’t traced back the code fully, so I'm not sure whether an instance gets
> scheduled onto a host and then the volume create call happens, or whether they
> happen in parallel, etc. (in the boot from volume case).
>
>
Sorry for resurrecting an old thread, but we recently discussed the
AZ relationship between Nova and Cinder at the PTG and I wanted to clarify
a couple of things.


> When cross_az_attach is false:
> If a user does a boot from volume (create new volume) and specifies an AZ
> then I would expect the instance and the volume to be created in the
> specified AZ.
>

I agree, that looks like the right behaviour to me.
I would also add that if Nova is configured to assign an AZ by default (using
the default_schedule_zone option), then that behaviour has to be enforced too.
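
For reference, a minimal sketch of the two nova.conf options in play here
(the AZ name is only an example, not a recommendation):

# nova.conf (sketch) -- the two options being discussed; "az1" is only an
# example AZ name
[DEFAULT]
# If set, instances created without an explicit AZ get this one assigned.
default_schedule_zone = az1

[cinder]
# Refuse volume attach/boot-from-volume when the volume's AZ does not
# match the instance's AZ.
cross_az_attach = False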


> If the AZ doesn’t exist in cinder or nova I would expect it to fail.
>
>
I agree.

> If a user doesn’t specify an AZ I would expect that the instance and the
> volume are in the same AZ.
>

That's where I disagree. If no AZ was specified by the time the instance
was created, OR if Nova wasn't configured to assign an AZ by default to each
instance, then Nova will pick any AZ and honestly won't care which AZ the
instance is in. In other words, by a transient relationship, the instance
will have an AZ because it will be hosted on a compute node that is part of
an AZ (or, by default, one named by the default_availability_zone option),
but that doesn't mean that the instance will stay in that AZ forever, since
there was no formal contract that expressed a specific AZ. Consequently,
when an instance is migrated, it could land on a host which is not *in the
same AZ*.

In that case, if the instance is not specifically tied to an AZ, I don't see
a reason why we should ask Cinder to honor that AZ, since the volume and the
instance could end up with different AZ names after a move operation.


> If there isn’t a common AZ between cinder and nova I would expect it to
> fail.
>
>
>
I'd prefer to have the same behaviour as if cross_az_attach were set
to True, i.e. not providing an AZ in our call to Cinder.


> >
> > 2. https://review.openstack.org/#/c/469675/
> >
> > This is similar, but rather than checking the AZ when we're on the
> compute and the instance has a host, we're in the API and doing a boot from
> volume where an existing volume is provided during server create. By
> default, the volume's AZ is going to be 'nova'. The code doing the check
> here is getting the AZ for the instance, and since the instance isn't on a
> host yet, it's not in any aggregate, so the only AZ we can get is from the
> server create request itself. If an AZ isn't provided during the server
> create request, then we're comparing instance.availability_zone (None) to
> volume['availability_zone'] ("nova") and that results in a 400.
> >
> > My proposed fix is in the case of BFV checks from the API, we default
> the AZ if one wasn't requested when comparing against the volume. By
> default this is going to compare "nova" for nova and "nova" for cinder,
> since CONF.default_availability_zone is "nova" by default in both projects.
> >
>
> Is this an alternative approach? Just trying to get my head around this
> all.
>
>
Same as what I wrote above. If the user didn't specify an AZ and Nova
isn't configured to assign a default AZ, then Nova shouldn't care which
Cinder AZ the instance can be attached to. Since the instance's AZ can
change, that contract would be broken in case of a move operation.


Thanks,
> Sam
>
>
> > --
> >
> > I'm requesting help from any operators that are setting
> cross_az_attach=False because I have to imagine your users have run into
> this and 

Re: [Openstack-operators] [nova][cinder] Is there interest in an admin-api to refresh volume connection info?

2017-09-14 Thread Matt Riedemann

On 9/13/2017 9:52 AM, Arne Wiebalck wrote:



On 13 Sep 2017, at 16:52, Matt Riedemann  wrote:

On 9/13/2017 3:24 AM, Arne Wiebalck wrote:

I’m reviving this thread to check whether the suggestion to address
potentially stale connection data via an admin command (or a scheduled
task) made it into the planning for one of the upcoming releases.


It hasn't, but we're at the PTG this week so I can throw it on the list of 
topics.



That’d be great, thanks!

--
Arne Wiebalck
CERN IT



We talked about this at the PTG today, notes are in the Cinder etherpad 
[1], search for "API to refresh volume connection info".


We agreed that we don't need a new admin-level API. We are already 
processing block device info in several operations that involve 
rebuilding the VM, such as cold migrate/resize, stop/start, 
suspend/resume, and rebuild. There is already a flag in those code paths 
to refresh the connection information for each block device mapping. We 
agreed to just change those code paths to set that flag to True so the 
connection information is refreshed per BDM. We also said we wouldn't 
fail the operation if the refresh fails for whatever reason, like if 
Cinder fails. This would not be a backportable change and will have a 
release note, but it's much more automatic than adding an entirely new 
API. If you want/need to refresh volume connection information without 
disruption, you'd have to live migrate the server instance.


[1] https://etherpad.openstack.org/p/cinder-ptg-queens
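
For anyone wondering what "refresh the connection information per BDM"
amounts to, here is a rough conceptual sketch (this is not Nova's actual
code; the endpoint, token and connector values are placeholders):

# Conceptual sketch of "refresh connection info per BDM" -- NOT Nova's
# actual implementation. Endpoint, token and connector are placeholders.
import requests

CINDER_ENDPOINT = "http://controller:8776/v3/<project_id>"  # placeholder
TOKEN = "<service-token>"                                   # placeholder

def refresh_connection_info(bdms, connector):
    """bdms: dicts with 'volume_id' and 'connection_info' keys.
    connector: this compute host's connector properties (ip, iqn, host, ...).
    """
    for bdm in bdms:
        try:
            resp = requests.post(
                "%s/volumes/%s/action" % (CINDER_ENDPOINT, bdm["volume_id"]),
                json={"os-initialize_connection": {"connector": connector}},
                headers={"X-Auth-Token": TOKEN},
            )
            resp.raise_for_status()
            bdm["connection_info"] = resp.json()["connection_info"]
        except Exception:
            # Per the agreement above: a failed refresh (e.g. Cinder being
            # unavailable) should not fail the surrounding operation.
            continue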

--

Thanks,

Matt



[Openstack-operators] MTU on Provider Networks

2017-09-14 Thread John Petrini
Hi List,

We are running Mitaka and having an MTU issue. Instances that we launch on
our provider network use jumbo frames (9000 MTU). There is a Layer 2 link
between the OpenStack switches and our core. This link uses an MTU of
1500.

Up until recently this MTU mismatch had not been an issue because none of
our systems were sending large enough packets to cause a problem. Recently
we've begun implementing a SIP device that sends very large packets,
sometimes even over 9000 bytes, which requires fragmentation.

What we found in our troubleshooting is that when large packets originate
from our network to an instance in OpenStack they are fragmented (as
expected). Once these packets reach the qbr-XX port, iptables
defragments the packet and forwards it to the tap interface unfragmented.
If we set the MTU on the tap interface to 1500 it will refragment the
packet before forwarding it to the instance.

A similar issue happens in the other direction. Large packets originating from
the OpenStack instance are fragmented (we set the MTU of the interface in
the instance to 1500, so this is expected), but once the packets reach
the qbr-XX interface iptables defragments them again. If we set
the MTU of the qvb-XX interface to 1500 the packet is refragmented.

So, long story short: if we set the instance MTU to 1500 and the
qbr-XX and qvb-XX ports on the compute node to a 1500 MTU, the
packets remain fragmented and are able to traverse the network.

So the question becomes: can we modify the default MTU of our provider
networks so that instances created on these networks receive a 1500 MTU
from DHCP and the ports on the compute node are also configured with a 1500
MTU?

I've been looking at the following neutron config option in
/etc/neutron/plugins/ml2/ml2_conf.ini:

physical_network_mtus = physnet1:9000,providernet:9000

Documentation on this setting is not very clear. Will adjusting this to
1500 for providernet accomplish what we need?
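
For clarity, this is the kind of change I'm thinking of (a sketch only,
untested, just to illustrate the question):

# /etc/neutron/plugins/ml2/ml2_conf.ini (sketch, untested)
[ml2]
# Per-physical-network MTU: networks on providernet would get a 1500 MTU
# (which the DHCP agent can then advertise to instances), while physnet1
# keeps jumbo frames. I'm not sure whether this changes the MTU of
# already-existing networks or only newly created ones.
physical_network_mtus = physnet1:9000,providernet:1500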

Thank You,

John Petrini


Re: [Openstack-operators] [openstack-dev] [nova][ironic] Concerns over rigid resource class-only ironic scheduling

2017-09-14 Thread melanie witt

On Thu, 14 Sep 2017 11:15:26 -0600, Ed Leafe wrote:

On Sep 14, 2017, at 10:30 AM, melanie witt  wrote:


I was thinking, if it's possible to assign more than one resource class to an Ironic 
node, maybe you could get similar behavior to the old non-exact filters. So if you have 
an oddball config, you could tag it as multiple resource classes that it's "close 
enough" to for a match. But I'm not sure whether it's possible for an Ironic node to 
be tagged with more than one resource class.


On the placement side, having an ironic node with two resource classes, such as 
RC1 and RC2, would mean that the ResourceProvider (the ironic node) would have 
two inventory records: one for RC1, and another for RC2. When a request for a 
flavor specifying one of these classes is handled, only that class’s 
inventory would be consumed. Placement would think that the node still had the 
other resource class available, and would include it if another request for 
that class were received, which would then fail as the node is already in use.


Okay, so it's not possible to have one Ironic node with one inventory 
record be classified as two different resource classes. Never mind that 
idea then.


Thanks for pointing that out.

-melanie






Re: [Openstack-operators] [openstack-dev] [nova][ironic] Concerns over rigid resource class-only ironic scheduling

2017-09-14 Thread Ed Leafe
On Sep 14, 2017, at 10:30 AM, melanie witt  wrote:
> 
> I was thinking, if it's possible to assign more than one resource class to an 
> Ironic node, maybe you could get similar behavior to the old non-exact 
> filters. So if you have an oddball config, you could tag it as multiple 
> resource classes that it's "close enough" to for a match. But I'm not sure 
> whether it's possible for an Ironic node to be tagged with more than one 
> resource class.

On the placement side, having an ironic node with two resource classes, such as 
RC1 and RC2, would mean that the ResourceProvider (the ironic node) would have 
two inventory records: one for RC1, and another for RC2. When a request for a 
flavor specifying one of these classes is handled, only that class’s 
inventory would be consumed. Placement would think that the node still had the 
other resource class available, and would include it if another request for 
that class were received, which would then fail as the node is already in use.
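
As a sketch of what that would look like against the placement API (the
endpoint, token, generation handling and the CUSTOM_RC1/CUSTOM_RC2 class
names are all made up for illustration):

# Sketch: one ironic node (resource provider) reporting two custom resource
# classes ends up with two independently consumable inventory records.
# (The CUSTOM_* classes would have to be created in placement first.)
import requests

PLACEMENT_ENDPOINT = "http://controller:8778"  # placeholder
TOKEN = "<admin-token>"                        # placeholder

def set_two_class_inventory(rp_uuid, generation):
    body = {
        "resource_provider_generation": generation,
        "inventories": {
            # Both records have total=1, but placement consumes them
            # independently: allocating CUSTOM_RC1 still leaves CUSTOM_RC2
            # looking "available" even though the node is already in use.
            "CUSTOM_RC1": {"total": 1},
            "CUSTOM_RC2": {"total": 1},
        },
    }
    resp = requests.put(
        "%s/resource_providers/%s/inventories" % (PLACEMENT_ENDPOINT, rp_uuid),
        json=body,
        headers={"X-Auth-Token": TOKEN,
                 "OpenStack-API-Version": "placement 1.10"},
    )
    resp.raise_for_status()
    return resp.json()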

-- Ed Leafe








Re: [Openstack-operators] [openstack-dev] [nova][ironic] Concerns over rigid resource class-only ironic scheduling

2017-09-14 Thread melanie witt

On Thu, 7 Sep 2017 14:57:24 -0500, Matt Riedemann wrote:

Some more background information is in the ironic spec here:

https://review.openstack.org/#/c/500429/

Also, be aware of these release notes for Pike related to baremetal 
scheduling:


http://docs-draft.openstack.org/77/501477/1/check/gate-nova-releasenotes/1dc7513//releasenotes/build/html/unreleased.html#id2 



In Pike, nova is using a combination of VCPU/MEMORY_MB/DISK_GB resource 
class amounts from the flavor during scheduling as it always has, but it 
will also check for the custom resource_class which comes from the 
ironic node. The custom resource class is optional in Pike but will be a 
hard requirement in Queens, or at least that was the plan. The idea 
being that long-term we'd stop consulting VCPU/MEMORY_MB/DISK_GB from 
the flavor during scheduling and just use the atomic node.resource_class 
since we want to allocate a nova instance to an entire ironic node, and 
this is also why the Exact* filters were used too.


There are more details on using custom resource classes for scheduling 
here:


https://specs.openstack.org/openstack/nova-specs/specs/pike/approved/custom-resource-classes-in-flavors.html 
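
As a sketch of what that looks like on the flavor side (per the spec linked
above, as I understand it; the flavor/resource class names, credentials and
endpoint below are made up, so double-check the exact property names against
the spec before relying on this):

# Tag a flavor with a custom resource class matching the ironic node's
# resource_class, and zero out the standard classes so scheduling is done
# purely on the custom class. Names and credentials are placeholders.
from keystoneauth1 import loading, session
from novaclient import client as nova_client

loader = loading.get_plugin_loader("password")
auth = loader.load_from_options(
    auth_url="http://controller:5000/v3",
    username="admin", password="secret", project_name="admin",
    user_domain_name="Default", project_domain_name="Default")
nova = nova_client.Client("2.53", session=session.Session(auth=auth))

# Assumes the ironic node has been tagged, e.g.:
#   openstack baremetal node set --resource-class baremetal-gold <node>
flavor = nova.flavors.find(name="baremetal-gold")  # assumed flavor name
flavor.set_keys({
    "resources:CUSTOM_BAREMETAL_GOLD": "1",  # maps from "baremetal-gold"
    "resources:VCPU": "0",
    "resources:MEMORY_MB": "0",
    "resources:DISK_GB": "0",
})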



Nisha is raising the question of whether we're making incorrect assumptions 
about how people are using nova/ironic: they want to use the non-Exact filters 
for VCPU/MEMORY_MB/DISK_GB, which as far as I have ever heard is not 
recommended/supported upstream, since it can lead to resource tracking issues 
in Nova that eventually cause scheduling failures, because the scheduler 
thinks a node is available for more than one instance when it really isn't.


This came up in the Nova PTG room yesterday and I wanted to reply on the 
thread with what I understood about it, for those who weren't in the 
session. In general, it's recommended to use the exact filters (1 flavor 
per Ironic node hardware config) as there's no concept of partially 
claiming a baremetal node.


But, with the old non-exact filters, you _could_ get away with creating 
fewer flavors than you have hardware configs and get "fuzzy matching" on 
Ironic nodes, to get nodes whose configs are "close enough" but not 
exact. This might be helpful in situations where you have some oddball 
configs you don't want to have separate flavors for.
I was thinking, if it's possible to assign more than one resource class 
to an Ironic node, maybe you could get similar behavior to the old 
non-exact filters. So if you have an oddball config, you could tag it as 
multiple resource classes that it's "close enough" to for a match. But 
I'm not sure whether it's possible for an Ironic node to be tagged with 
more than one resource class.
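
For reference, the "exact filters" mentioned above would be enabled roughly
like this in nova.conf (a sketch using Pike-era option names; the rest of
the filter list is only an example, not a recommendation):

# nova.conf on the scheduler host (sketch)
[filter_scheduler]
enabled_filters = RetryFilter,AvailabilityZoneFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ExactRamFilter,ExactCoreFilter,ExactDiskFilter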


-melanie



[Openstack-operators] [scientific] s/WG/SIG/g

2017-09-14 Thread Blair Bethwaite
Hi all,

If you happen to have been following along with recent discussions
about introducing OpenStack SIGs then this won't come as a surprise.
PS: the openstack-sig mailing list has been minted - get on it!

The meta-SIG is now looking for existing WGs who wish to convert to
SIGs, see 
http://lists.openstack.org/pipermail/openstack-sigs/2017-July/22.html.
The Scientific-WG is a good candidate for this because, at our core
(as I see it), we've never really been about bounded, task-oriented
goals, but have been more of an open community of OpenStack
operators/architects/users. At any point we may have groups working on
particular goals, e.g., the OpenStack HPC book, performance
benchmarking/troubleshooting, integration architectures, and so on;
these groups could in future be spun out into their own WGs if
warranted.

What does this mean, practically? Essentially we just do some renaming
here and there and move our mailing list discussions to
openstack-s...@lists.openstack.org.

We've already discussed this in the past couple of meetings and so far
had no objections, so we're planning to move ahead with it soon. The
intention of this thread is to canvass broader input.

-- 
Cheers,
~Blairo

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators