Re: [vdsm] Fwd: Bonding, bridges and ifcfg

2012-12-10 Thread Antoni Segura Puimedon


- Original Message -
 From: Alon Bar-Lev alo...@redhat.com
 To: Antoni Segura Puimedon asegu...@redhat.com
 Cc: vdsm-devel@lists.fedorahosted.org
 Sent: Monday, December 10, 2012 2:07:38 PM
 Subject: Re: [vdsm] Fwd: Bonding, bridges and ifcfg
 
 Hi,
 
 Just to make sure... working in non-persistent mode will eliminate
 these kinds of issues, right?
Yes, I'm quite sure that working directly with the kernel through netlink
or the ip tools would not exhibit the issues mentioned here.
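
For illustration only, here is a minimal sketch of the kind of non-persistent
configuration Toni refers to, driving the bonding driver directly through sysfs
and iproute2 instead of ifcfg files. The interface names, the mode and the
helper itself are assumptions for the example, not vdsm code:

# Minimal sketch: build a bond without touching ifcfg files (run as root).
# bond0, eth2, eth3 and the 802.3ad mode are example values.
import subprocess


def write_sysfs(path, value):
    """Write a value to a sysfs attribute."""
    with open(path, "w") as f:
        f.write(value)


def create_bond(bond="bond0", slaves=("eth2", "eth3"), mode="802.3ad"):
    # Create the bond device and pick its mode before enslaving anything.
    write_sysfs("/sys/class/net/bonding_masters", "+%s" % bond)
    write_sysfs("/sys/class/net/%s/bonding/mode" % bond, mode)
    for nic in slaves:
        # A nic must be down before the bonding driver will enslave it.
        subprocess.check_call(["ip", "link", "set", nic, "down"])
        write_sysfs("/sys/class/net/%s/bonding/slaves" % bond, "+%s" % nic)
    subprocess.check_call(["ip", "link", "set", bond, "up"])


if __name__ == "__main__":
    create_bond()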
 
 Alon
 
 - Original Message -
  From: Antoni Segura Puimedon asegu...@redhat.com
  To: vdsm-devel@lists.fedorahosted.org
  Sent: Monday, December 10, 2012 2:24:11 PM
  Subject: [vdsm] Fwd: Bonding, bridges and ifcfg
  
  Hello everybody,
  
  We found some unexpected behavior with bonds and we'd like to discuss it.
  Please, read the forwarded messages.
  
  Best,
  
  Toni
  
  - Forwarded Message -
   From: Dan Kenigsberg dan...@redhat.com
   To: Antoni Segura Puimedon asegu...@redhat.com
   Cc: Livnat Peer lp...@redhat.com, Igor Lvovsky
   ilvov...@redhat.com
   Sent: Monday, December 10, 2012 1:03:48 PM
   Subject: Re: Bonding, ifcfg and luck
   
   On Mon, Dec 10, 2012 at 06:47:58AM -0500, Antoni Segura Puimedon
   wrote:
Hi all,

I discussed this briefly with Livnat over the phone and mentioned it to Dan.
The issue we have is that, if I understand our current configNetwork
correctly, it could very well be that it works by means of good design with
a side dish of luck.

I'll explain myself:
By design, as documented in
http://www.kernel.org/doc/Documentation/networking/bonding.txt, all slaves
of bond0 have the same MAC address (HWaddr) as bond0 for all modes except
TLB and ALB, which require a unique MAC address for each slave.

Thus, all operations on the slave interfaces after they are added to the
bond (except in TLB and ALB modes) that rely on ifcfg will fail with a
message like "Device eth3 has different MAC address than expected,
ignoring.", and no ifup/ifdown will be performed.
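
To make that failure mode concrete, here is a small sketch of the comparison
that trips the network scripts up: the HWADDR recorded in the ifcfg file still
holds the permanent address, while the kernel now reports the bond's address
for the enslaved nic. The ifcfg path and the parsing below are simplified
assumptions, not the actual initscripts code:

# Sketch: detect the HWADDR mismatch that makes ifup/ifdown skip an
# enslaved nic. Paths and parsing are simplified for illustration.
import re


def ifcfg_hwaddr(nic):
    """Return the HWADDR= value from the nic's ifcfg file, if any."""
    path = "/etc/sysconfig/network-scripts/ifcfg-%s" % nic
    with open(path) as f:
        for line in f:
            m = re.match(r"HWADDR=([0-9a-fA-F:]+)", line.strip())
            if m:
                return m.group(1).lower()
    return None


def current_addr(nic):
    with open("/sys/class/net/%s/address" % nic) as f:
        return f.read().strip().lower()


def would_be_skipped(nic):
    expected = ifcfg_hwaddr(nic)
    return expected is not None and expected != current_addr(nic)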

Until now we had not noticed this, because we were completely ignoring
errors in ifdown and ifup, but http://gerrit.ovirt.org/#/c/8415/ shed light
on the matter. As you can see in the following example (bonding mode 4), the
behavior is just as documented:

[root@rhel64 ~]# cat /sys/class/net/eth*/address
52:54:00:a2:b4:50
52:54:00:3f:9b:28
52:54:00:51:50:49
52:54:00:ac:32:1b  <-
[root@rhel64 ~]# echo +eth2 > /sys/class/net/bond0/bonding/slaves
[root@rhel64 ~]# echo +eth3 > /sys/class/net/bond0/bonding/slaves
[root@rhel64 ~]# cat /sys/class/net/eth*/address
52:54:00:a2:b4:50
52:54:00:3f:9b:28
52:54:00:51:50:49
52:54:00:51:50:49  <-
[root@rhel64 ~]# echo -eth3 > /sys/class/net/bond0/bonding/slaves
[root@rhel64 ~]# cat /sys/class/net/eth*/address
52:54:00:a2:b4:50
52:54:00:3f:9b:28
52:54:00:51:50:49
52:54:00:ac:32:1b  <-

Obviously, this means that, for example, when we add a bridge on top of a
bond, the ifdown/ifup of the bond slaves will be completely fruitless
(although luckily that doesn't prevent them from working).
   
   
    Sorry, this is not obvious to me.
    When we change something in a nic, we first take it down (which breaks it
    away from the bond), change it, and then take it up again (and back to
    the bond).
   
   I did not understand which flow of configuration leads us to the
   unexpected mac error. I hope that we can circumvent it.
   
   

To solve this issue in the ifcfg-based operation we could either:
- Continue ignoring these issues, and either not do ifup/ifdown for bonding
  slaves or catch the specific error and ignore it.
   
   That's reasonable, for a hack.
   
- Modify the ifcfg files of the slaves after they are enslaved to reflect
  the MAC addr of /sys/class/net/bond0/address. Modify the ifcfg files
  after the bond is destroyed to reflect their own addresses as in
  /sys/class/net/ethx/address.
   
    I do not understand this solution at all... Fixing initscripts to expect
    the permanent mac address instead of the bond's one makes more sense to
    me. (/proc/net/bonding/bond0 has "Permanent HW addr:".)
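
As a sketch of that suggestion, the permanent address of each slave can be
read back from /proc/net/bonding/<bond> instead of trusting the rewritten
address in /sys/class/net/<nic>/address. The parsing below assumes the usual
layout of that file and is only an illustration:

# Sketch: map each slave of a bond to its permanent MAC address, parsed
# from the "Slave Interface:" / "Permanent HW addr:" lines of
# /proc/net/bonding/bond0.
def permanent_addrs(bond="bond0"):
    addrs = {}
    slave = None
    with open("/proc/net/bonding/%s" % bond) as f:
        for line in f:
            line = line.strip()
            if line.startswith("Slave Interface:"):
                slave = line.split(":", 1)[1].strip()
            elif line.startswith("Permanent HW addr:") and slave:
                addrs[slave] = line.split(":", 1)[1].strip()
                slave = None
    return addrs


if __name__ == "__main__":
    print(permanent_addrs())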
   

Livnat pointed out to me that this behavior can be a problem for the anti
mac-spoofing rules that we add to iptables, as they rely on the
device-macaddr identity to work and, obviously, in most bonding modes that
identity is broken unless the device's macaddr is the one chosen for the
bond.
   
    Right. I suppose we can open a bug about it: in-guest bond does not work
    with mac-no-spoofing. I have a vague memory of discussing this with lpeer
    a few months back, but it somehow slipped my mind.
   

Re: [vdsm] Fwd: Bonding, bridges and ifcfg

2012-12-10 Thread Dan Kenigsberg
On Mon, Dec 10, 2012 at 08:07:38AM -0500, Alon Bar-Lev wrote:
 Hi,
 
 Just to make sure... working in non-persistent mode will eliminate these
 kinds of issues, right?


No. It would eliminate the need to debug initscripts, but it would require
of the vdsm developer an intimate familiarity with kernel quirks.

We'd have fewer building blocks and less chance of incompatibility, but we
would need to reimplement (some of) the logic within the ifup script.

 
 Alon
 

Re: [vdsm] Fwd: Bonding, bridges and ifcfg

2012-12-10 Thread Alon Bar-Lev


- Original Message -
 From: Dan Kenigsberg dan...@redhat.com
 To: Alon Bar-Lev alo...@redhat.com
 Cc: Antoni Segura Puimedon asegu...@redhat.com, 
 vdsm-devel@lists.fedorahosted.org
 Sent: Monday, December 10, 2012 3:16:21 PM
 Subject: Re: [vdsm] Fwd: Bonding, bridges and ifcfg
 
 On Mon, Dec 10, 2012 at 08:07:38AM -0500, Alon Bar-Lev wrote:
  Hi,
  
   Just to make sure... working in non-persistent mode will eliminate
   these kinds of issues, right?
 
 
  No. It would eliminate the need to debug initscripts, but it would require
  of the vdsm developer an intimate familiarity with kernel quirks.
  
  We'd have fewer building blocks and less chance of incompatibility, but we
  would need to reimplement (some of) the logic within the ifup script.

Sure, you need to reimplement the ifup and ifdown functionality, as you would
not use these...

You will not have fewer building blocks unless you break the fedora/redhat
border; actually, if you go non-persistent you will have fewer of them and be
more portable, as you have only one kernel (Linux) to support.

A vdsm developer already requires an intimate familiarity with the kernel;
see the bonding example quoted earlier in this thread. It is just that, even
with intimate familiarity with the kernel, working via primitive tools like
the rhel/fedora network scripts only makes it harder to produce the desired
outcome, compared to having full control over the process and the result.

 
  

Re: [vdsm] RFC: New Storage API

2012-12-10 Thread Adam Litke
On Thu, Dec 06, 2012 at 11:52:01AM -0500, Saggi Mizrahi wrote:
 
 
 - Original Message -
  From: Shu Ming shum...@linux.vnet.ibm.com
  To: Saggi Mizrahi smizr...@redhat.com
  Cc: VDSM Project Development vdsm-devel@lists.fedorahosted.org, 
  engine-devel engine-de...@ovirt.org
  Sent: Thursday, December 6, 2012 11:02:02 AM
  Subject: Re: [vdsm] RFC: New Storage API
  
  Saggi,
  
  Thanks for sharing your thought and I get some comments below.
  
  
  Saggi Mizrahi:
    I've been throwing a lot of bits out about the new storage API and
    I think it's time to talk a bit.
    I will purposefully try to keep implementation details away and
    concentrate on how the API looks and how you use it.
  
    The first major change is in terminology: there is no longer a storage
    domain but a storage repository.
    This change is made because so many things are already called
    domain in the system, and this will make things less confusing for
    newcomers with a libvirt background.
   
    One other change is that repositories no longer have a UUID.
    The UUID was only used in the pool members manifest and is no
    longer needed.
  
  
    connectStorageRepository(repoId, repoFormat, connectionParameters={}):
    repoId - a transient name that will be used to refer to the connected
    domain; it is not persisted and doesn't have to be the same across the
    cluster.
    repoFormat - similar to what used to be type (e.g. localfs-1.0,
    nfs-3.4, clvm-1.2).
    connectionParameters - format specific, and will be used to tell VDSM
    how to connect to the repo.
  
  
   Where does repoID come from? I think repoID doesn't exist before
   connectStorageRepository() returns.  Isn't repoID a return value of
   connectStorageRepository()?
 No, repoIDs are no longer part of the domain, they are just a transient 
 handle.
 The user can put whatever it wants there as long as it isn't already taken by 
 another currently connected domain.
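
As a purely hypothetical usage sketch of the proposed calls (there is no such
client binding yet, so a trivial stub stands in for whatever RPC layer ends up
exposing the API; the NFS parameters are invented):

# Hypothetical client-side usage of the proposed repository calls.
class FakeVdsmClient(object):
    def connectStorageRepository(self, repoId, repoFormat,
                                 connectionParameters=None):
        print("connect %s as %s: %s" % (repoId, repoFormat,
                                        connectionParameters))

    def disconnectStorageRepository(self, repoId):
        print("disconnect %s" % repoId)


client = FakeVdsmClient()
repo_id = "export-nfs"  # transient handle, chosen by the caller
client.connectStorageRepository(
    repoId=repo_id,
    repoFormat="nfs-3.4",  # a format string in the style of the RFC examples
    connectionParameters={"server": "nas.example.com",
                          "export": "/exports/ovirt"},
)
# ... image operations against repo_id would go here ...
client.disconnectStorageRepository(repo_id)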
  
  
   disconnectStorageRepository(self, repoId)
  
  
    In the new API there are only images; some images are mutable and
    some are not.
    Mutable images are also called VirtualDisks;
    immutable images are also called Snapshots.
   
    There are no explicit templates: you can create as many images as
    you want from any snapshot.
  
   There are 4 major image operations:
  
  
    createVirtualDisk(targetRepoId, size, baseSnapshotId=None,
                      userData={}, options={}):
   
    targetRepoId - ID of a connected repo where the disk will be created
    size - the size of the image you wish to create
    baseSnapshotId - the ID of the snapshot you want to base the new
    virtual disk on
    userData - optional data that will be attached to the new VD; it could
    be anything that the user desires.
    options - options to modify VDSM's default behavior
   
    returns the id of the new VD
  
   I think we will also need a function to check if a VirtualDisk is
   based on a specific snapshot.
   Like: isSnapshotOf(virtualDiskId, baseSnapshotID)
 No, the design is that volume dependencies are an implementation detail.
 There is no reason for you to know that an image is physically a snapshot of 
 another.
 Logical snapshots, template information, and any other information can be set 
 by the user by using the userData field available for every image.
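
As a sketch of what that could look like in practice, here is one possible way
a manager might encode its own logical metadata in the opaque userData blob.
The JSON convention and field names are invented for illustration; per the
proposal, vdsm would store the blob without ever reading it:

# Sketch: a manager-side convention for the opaque userData blob.
import json

user_data = json.dumps({
    "name": "golden-image-f17",
    "kind": "template",          # a manager-level notion, not a vdsm one
    "owner": "admin@internal",
    "logical_parent": None,
})

# The blob would then be passed as-is to the proposed calls, for example
# createSnapshot(targetRepoId, baseVirtualDiskId, userData=user_data).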

Statements like this make me start to worry about your userData concept.  It's a
sign of a bad API if the user needs to invent a custom metadata scheme for
itself.  This reminds me of the abomination that is the 'custom' property in the
vm definition today.

   createSnapshot(targetRepoId, baseVirtualDiskId,
   userData={}, options={}):
    targetRepoId - the ID of a connected repo where the new snapshot
    will be created and where the original image exists as well.
   size - The size of the image you wish to create
   baseVirtualDisk - the ID of a mutable image (Virtual Disk) you want
   to snapshot
   userData - optional data that will be attached to the new Snapshot,
   could be anything that the user desires.
    options - options to modify VDSM's default behavior
  
   returns the id of the new Snapshot
  
   copyImage(targetRepoId, imageId, baseImageId=None, userData={},
   options={})
   targetRepoId - The ID of a connected repo where the new image will
   be created
   imageId - The image you wish to copy
    baseImageId - if specified, the new image will contain only the
    diff between imageId and baseImageId.
   If None, the new image will contain all the bits of
   imageId. This can be used to copy partial parts of
   images for export.
   userData - optional data that will be attached to the new image,
   could be anything that the user desires.
    options - options to modify VDSM's default behavior
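
As a hypothetical illustration of the partial-copy semantics above (the stub
below merely stands in for the proposed call, and the repo and image IDs are
invented):

# Hypothetical illustration of copyImage's baseImageId parameter: with a
# base given, the new image holds only the diff, e.g. for incremental export.
def copyImage(targetRepoId, imageId, baseImageId=None, userData=None,
              options=None):
    # Stub standing in for the proposed vdsm call.
    return "new-image-in-%s" % targetRepoId


export_repo = "backup-nfs"                                # a connected repo
full_copy = copyImage(export_repo, "snap-v2")                    # all bits
incremental = copyImage(export_repo, "snap-v2", baseImageId="snap-v1")  # diff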
  
   Does this function mean that we can copy the image from one repository
   to another repository? Does it cover the semantics of storage migration,
   storage backup, storage 

Re: [vdsm] RFC: New Storage API

2012-12-10 Thread Saggi Mizrahi


- Original Message -
 From: Adam Litke a...@us.ibm.com
 To: Saggi Mizrahi smizr...@redhat.com
 Cc: Deepak C Shetty deepa...@linux.vnet.ibm.com, engine-devel 
 engine-de...@ovirt.org, VDSM Project
 Development vdsm-devel@lists.fedorahosted.org
 Sent: Monday, December 10, 2012 1:49:31 PM
 Subject: Re: [vdsm] RFC: New Storage API
 
 On Fri, Dec 07, 2012 at 02:53:41PM -0500, Saggi Mizrahi wrote:
 
 snip
 
    1) Can you provide more info on why there is an exception for 'lvm
    based block domain'? It's not coming out clearly.
   File based domains are responsible for syncing up object
   manipulation (creation\deletion).
   The backend is responsible for making sure it all works, either by
   having a single writer (NFS) or having its own locking mechanism
   (gluster).
   In our LVM based domains VDSM is responsible for basic object
   manipulation.
   The current design uses an approach where there is a single host
   responsible for object creation\deletion; it is the
   SRM\SDM\SPM\S?M.
   If we ever find a way to make it fully clustered without a big hit
   in performance, the S?M requirement will be removed from that type
   of domain.
 
 I would like to see us maintain a LOCALFS domain as well.  For this,
 we would
 also need SRM, correct?
No, why?
 
 --
 Adam Litke a...@us.ibm.com
 IBM Linux Technology Center
 
 


Re: [vdsm] RFC: New Storage API

2012-12-10 Thread Adam Litke
On Mon, Dec 10, 2012 at 02:03:09PM -0500, Saggi Mizrahi wrote:
 
 
 - Original Message -
  From: Adam Litke a...@us.ibm.com
  To: Saggi Mizrahi smizr...@redhat.com
  Cc: Deepak C Shetty deepa...@linux.vnet.ibm.com, engine-devel 
  engine-de...@ovirt.org, VDSM Project
  Development vdsm-devel@lists.fedorahosted.org
  Sent: Monday, December 10, 2012 1:49:31 PM
  Subject: Re: [vdsm] RFC: New Storage API
  
  On Fri, Dec 07, 2012 at 02:53:41PM -0500, Saggi Mizrahi wrote:
  
  snip
  
 1) Can you provide more info on why there is an exception for 'lvm
 based block domain'? It's not coming out clearly.
    File based domains are responsible for syncing up object
    manipulation (creation\deletion).
    The backend is responsible for making sure it all works, either by
    having a single writer (NFS) or having its own locking mechanism
    (gluster).
    In our LVM based domains VDSM is responsible for basic object
    manipulation.
    The current design uses an approach where there is a single host
    responsible for object creation\deletion; it is the
    SRM\SDM\SPM\S?M.
    If we ever find a way to make it fully clustered without a big hit
    in performance, the S?M requirement will be removed from that type
    of domain.
  
  I would like to see us maintain a LOCALFS domain as well.  For this,
  we would
  also need SRM, correct?
 No, why?

Sorry, nevermind.  I was thinking of a scenario with multiple clients talking to
a single vdsm and making sure they don't stomp on one another.  This is
probably not something we are going to care about though.

-- 
Adam Litke a...@us.ibm.com
IBM Linux Technology Center



Re: [vdsm] RFC: New Storage API

2012-12-10 Thread Adam Litke
On Mon, Dec 10, 2012 at 03:36:23PM -0500, Saggi Mizrahi wrote:
  Statements like this make me start to worry about your userData
  concept.  It's a
  sign of a bad API if the user needs to invent a custom metadata
  scheme for
  itself.  This reminds me of the abomination that is the 'custom'
  property in the
  vm definition today.
 In one sentence: If VDSM doesn't care about it, VDSM doesn't manage it.
 
 userData being a void* is quite common and I don't understand why you would
 think it's a sign of a bad API.
 Furthermore, giving the user choice about how to represent its own metadata
 and what fields it wants to keep seems reasonable to me.
 Especially given the fact that VDSM never reads it.
 
 The reason we are pulling away from the current system of VDSM understanding
 the extra data is that it ties that data to VDSM's on-disk format.
 VDSM's on-disk format has to be very stable because of clusters with multiple
 VDSM versions.
 Furthermore, since this is actually manager data, it has to be tied to the
 manager's backward-compatibility lifetime as well.
 Having it be opaque to VDSM ties it to only one, simpler, support lifetime
 instead of two.
 
 I guess you are implying that it will make it problematic for multiple users 
 to read userData left by another user because the formats might not be 
 compatible.
 The solution is that all parties interested in using VDSM storage agree on 
 format, and common fields, and supportability, and all the other things that 
 choosing a supporting *something* entails.
 This is, however, out of the scope of VDSM. When the time comes I think how 
 the userData blob is actually parsed and what fields it keeps should be 
 discussed on ovirt-devel or engine-devel.
 
 The crux of the issue is that VDSM manages only what it cares about, and the
 user can't modify that directly.
 This is done because everything we expose we commit to.
 If you want any information persisted like:
 - Human readable name (in whatever encoding)
 - Is this a template or a snapshot
 - What user owns this image
 
 You can just put it in the userData.
 VDSM is not going to impose what encoding you use.
 It's not going to decide if you represent your users as IDs or names or ldap
 queries or public keys.
 It's not going to decide if you have explicit templates or not.
 It's not going to decide if you care what the logical image chain is.
 It's not going to decide anything that is out of its scope.
 No format is future proof, no selection of fields will be good for any 
 situation.
 I'd much rather it be someone else's problem when any of them need to be 
 changed.
 They have currently been VDSM's problem, and it has been hell to maintain.

In general, I actually agree with most of this.  What I want to avoid is pushing
things that should actually be a part of the API into this userData blob.  We do
want to keep the API as simple as possible to give vdsm flexibility.  If, over
time, we find that users are always using userData to work around something
missing in the API, this could be a really good sign that the API needs
extension.

-- 
Adam Litke a...@us.ibm.com
IBM Linux Technology Center



Re: [vdsm] RFC: New Storage API

2012-12-10 Thread Saggi Mizrahi


- Original Message -
 From: Adam Litke a...@us.ibm.com
 To: Saggi Mizrahi smizr...@redhat.com
 Cc: Shu Ming shum...@linux.vnet.ibm.com, engine-devel 
 engine-de...@ovirt.org, VDSM Project Development
 vdsm-devel@lists.fedorahosted.org
 Sent: Monday, December 10, 2012 4:47:46 PM
 Subject: Re: [vdsm] RFC: New Storage API
 
  In general, I actually agree with most of this.  What I want to avoid is
  pushing things that should actually be a part of the API into this userData
  blob.  We do want to keep the API as simple as possible to give vdsm
  flexibility.  If, over time, we find that users are always using userData
  to work around something missing in the API, this could be a really good
  sign that the API needs extension.
I was actually contemplating this for quite a while.
If, while you create an image, the reply is lost or VDSM is unable to know
whether the operation was committed, the user will have no way of knowing
what the new image ID is.
To solve this it is recommended that the manager put some sort of task-related
information in the userData.
If the operation ever finishes in an ambiguous state, the user just reads the
userData from any image it doesn't recognize or whose state it is unsure of.

This is a flow that every client will have to have.
So why not just add it to the API?
Because I don't want to impose how this information gets generated, what the
content of that data is, or how unique it has to be.
Since VDSM doesn't use it for anything, I don't feel like I need to figure
this out.
I am all for simplicity, but simplicity is kind of an abstract concept. Having
it be a blob is in some respects the simplest thing you can do.
Just saying "I have a field, put whatever you want in it" is simple to convey,
but it does require more work on the user's side to figure out what to do
with it.
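
For illustration, here is a sketch of that correlation flow under invented
names: createVirtualDisk, listImages and getImageUserData are placeholders for
whatever calls end up existing, and the JSON convention is the manager's own:

# Sketch: tag each creation request with a manager-generated ID inside
# userData, so images whose creation ended ambiguously can be recognized.
import json
import uuid


def create_disk_with_tag(client, repo_id, size):
    task_id = str(uuid.uuid4())
    user_data = json.dumps({"task": task_id})
    try:
        image_id = client.createVirtualDisk(repo_id, size, userData=user_data)
        return task_id, image_id
    except Exception:
        # e.g. the RPC reply was lost: the image may or may not exist;
        # keep task_id and resolve it later with recover_task().
        return task_id, None


def recover_task(client, repo_id, task_id):
    """Return the image created under task_id, if the operation did commit."""
    for image_id in client.listImages(repo_id):
        data = json.loads(client.getImageUserData(repo_id, image_id) or "{}")
        if data.get("task") == task_id:
            return image_id
    return None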

All that being said, I do think that the format, fields and how to use them 
should be defined so different users can communicate and synchronize.
It's also important that you don't reinvent the wheel for every flow in every 
client.
I'm just saying that it's not in the scope of VDSM.
It should be done as a standard that all users of VDSM agree to conform to.
It's the same way that a 

Re: [vdsm] RFC: New Storage API

2012-12-10 Thread Shu Ming

2012-12-11 4:36, Saggi Mizrahi:


In one sentence: If VDSM doesn't care about it, VDSM doesn't manage it.

userData being a void* is quite common and I don't understand why you would
think it's a sign of a bad API.
Furthermore, giving the user choice about how to represent its own metadata
and what fields it wants to keep seems reasonable to me.
Especially given the fact that VDSM never reads it.

The reason we are pulling away from the current system of VDSM understanding
the extra data is that it ties that data to VDSM's on-disk format.
VDSM's on-disk format has to be very stable because of clusters with multiple
VDSM versions.
Furthermore, since this is actually manager data, it has to be tied to the
manager's backward-compatibility lifetime as well.
Having it be opaque to VDSM ties it to only one, simpler, support lifetime
instead of two.


Making userData opaque gives flexibility to the management
applications.  To me, opaque userData can have at least two types. The
first is userData for runtime only.  The second is userData
expected to be persisted to the metadata disk. For the first type,
the management applications can store their own data structures like
temporary task states, VDSM query caches, etc. After the VDSM 
Re: [vdsm] Fwd: Bonding, bridges and ifcfg

2012-12-10 Thread Mark Wu

On 12/10/2012 08:24 PM, Antoni Segura Puimedon wrote:

Hello everybody,

We found some unexpected behavior with bonds and we'd like to discuss it.
Please, read the forwarded messages.

Best,

Toni

- Forwarded Message -

From: Dan Kenigsberg dan...@redhat.com
To: Antoni Segura Puimedon asegu...@redhat.com
Cc: Livnat Peer lp...@redhat.com, Igor Lvovsky ilvov...@redhat.com
Sent: Monday, December 10, 2012 1:03:48 PM
Subject: Re: Bonding, ifcfg and luck

On Mon, Dec 10, 2012 at 06:47:58AM -0500, Antoni Segura Puimedon
wrote:

Hi all,

I discussed this briefly with Livnat over the phone and mentioned
it to Dan.
The issue we have is that, if I understand our current configNetwork
correctly, it could very well be that it works by means of good design with
a side dish of luck.

I'll explain myself:
By design, as documented in
http://www.kernel.org/doc/Documentation/networking/bonding.txt, all slaves
of bond0 have the same MAC address (HWaddr) as bond0 for all modes except
TLB and ALB, which require a unique MAC address for each slave.

Thus, all operations on the slave interfaces after they are added to the
bond (except in TLB and ALB modes) that rely on ifcfg will fail with a
message like "Device eth3 has different MAC address than expected,
ignoring.", and no ifup/ifdown will be performed.

Until now we had not noticed this, because we were completely ignoring
errors in ifdown and ifup, but http://gerrit.ovirt.org/#/c/8415/ shed light
on the matter. As you can see in the following example (bonding mode 4), the
behavior is just as documented:

 [root@rhel64 ~]# cat /sys/class/net/eth*/address
 52:54:00:a2:b4:50
 52:54:00:3f:9b:28
 52:54:00:51:50:49
 52:54:00:ac:32:1b  <-
 [root@rhel64 ~]# echo +eth2 > /sys/class/net/bond0/bonding/slaves
 [root@rhel64 ~]# echo +eth3 > /sys/class/net/bond0/bonding/slaves
 [root@rhel64 ~]# cat /sys/class/net/eth*/address
 52:54:00:a2:b4:50
 52:54:00:3f:9b:28
 52:54:00:51:50:49
 52:54:00:51:50:49  <-
 [root@rhel64 ~]# echo -eth3 > /sys/class/net/bond0/bonding/slaves
 [root@rhel64 ~]# cat /sys/class/net/eth*/address
 52:54:00:a2:b4:50
 52:54:00:3f:9b:28
 52:54:00:51:50:49
 52:54:00:ac:32:1b  <-

Obviously, this means that, for example, when we add a bridge on top of a
bond, the ifdown/ifup of the bond slaves will be completely fruitless
(although luckily that doesn't prevent them from working).


Sorry, this is not obvious to me.
When we change something in a nic, we first take it down (which breaks it
away from the bond), change it, and then take it up again (and back to
the bond).

I did not understand which flow of configuration leads us to the
unexpected mac error. I hope that we can circumvent it.
I have the same question.  The warning message should only be seen when
you run ifup on the bonding device or on a slave which is already up;
otherwise the slave nic's mac address should hold its own permanent mac
address.  If the bonding was down before, you shouldn't see this message,
because the nic is not enslaved.




To solve this issue in the ifcfg-based operation we could either:
- Continue ignoring these issues, and either not do ifup/ifdown for bonding
   slaves or catch the specific error and ignore it.

That's reasonable, for a hack.


- Modify the ifcfg files of the slaves after they are enslaved to reflect
   the MAC addr of /sys/class/net/bond0/address. Modify the ifcfg files
   after the bond is destroyed to reflect their own addresses as in
   /sys/class/net/ethx/address.

I do not understand this solution at all... Fixing initscripts to expect
the permanent mac address instead of the bond's one makes more sense to
me. (/proc/net/bonding/bond0 has "Permanent HW addr:".)


Livnat pointed out to me that this behavior can be a problem for the anti
mac-spoofing rules that we add to iptables, as they rely on the
device-macaddr identity to work and, obviously, in most bonding modes that
identity is broken unless the device's macaddr is the one chosen for the
bond.

Right. I suppose we can open a bug about it: in-guest bond does not
work
with mac-no-spoofing. I have a vague memory of discussing this with
lpeer a few months back, but it somehow slipped my mind.
I am not sure why we need bonding inside the guest.  Bonding can provide
link failover and bandwidth aggregation, and we can achieve that by setting
up bonding on the host.  If we really need it, we could work around it by
defining a netfilter for each vm which allows the traffic from all the mac
addresses belonging to it.  To be exact, we could also make qemu generate an
event to libvirt on a guest vif's mac address change, and then libvirt could
validate the change and update its data accordingly.
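
For illustration, here is a sketch of the per-VM filtering Mark describes,
accepting traffic from any of the MAC addresses that belong to the guest so
that an in-guest bond's rewritten MACs still pass. The chain name, tap device
name, MAC list and the choice of iptables are all assumptions for the example:

# Sketch: per-VM chain that allows a set of source MACs on the VM's tap
# device and drops everything else (run as root).
import subprocess


def allow_vm_macs(tap="vnet0",
                  macs=("52:54:00:a2:b4:50", "52:54:00:3f:9b:28")):
    chain = "VM-%s-MAC" % tap
    subprocess.check_call(["iptables", "-N", chain])
    # Send everything arriving from the VM's tap device through its chain.
    subprocess.check_call(["iptables", "-I", "FORWARD",
                           "-m", "physdev", "--physdev-in", tap, "-j", chain])
    for mac in macs:
        subprocess.check_call(["iptables", "-A", chain,
                               "-m", "mac", "--mac-source", mac,
                               "-j", "RETURN"])
    # Frames from any unknown MAC are dropped.
    subprocess.check_call(["iptables", "-A", chain, "-j", "DROP"])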






Well, I think that is all for this issue. We should discuss which is the
best approach for this before we move on with patches that account for
ifup/ifdown return information.

Best,

Toni



Re: [vdsm] Fwd: Bonding, bridges and ifcfg

2012-12-10 Thread Livnat Peer
On 11/12/12 07:42, Mark Wu wrote:
 I am not sure why we need bonding inside the guest.  Bonding can provide
 link failover and bandwidth aggregation, and we can achieve that by setting
 up bonding on the host.  If we really need it, we could work around it by
 defining a netfilter for each vm which allows the traffic from all the mac
 addresses belonging to it.  To be exact, we could also make qemu generate
 an event to libvirt on a guest vif's mac address change, and then libvirt
 could validate the change and update its data accordingly.
 

Network configuration in the guest is not something vdsm does (at least
not today) but our