Re: [vdsm] Fwd: Bonding, bridges and ifcfg
- Original Message -
From: Alon Bar-Lev alo...@redhat.com
To: Antoni Segura Puimedon asegu...@redhat.com
Cc: vdsm-devel@lists.fedorahosted.org
Sent: Monday, December 10, 2012 2:07:38 PM
Subject: Re: [vdsm] Fwd: Bonding, bridges and ifcfg

> Hi,
>
> Just to make sure... working in non-persistent mode will eliminate these
> kinds of issues, right?

Yes, I'm quite sure that working directly with the kernel through netlink or
the ip tools would not exhibit the issues mentioned here.

> Alon
>
> - Original Message -
> From: Antoni Segura Puimedon asegu...@redhat.com
> To: vdsm-devel@lists.fedorahosted.org
> Sent: Monday, December 10, 2012 2:24:11 PM
> Subject: [vdsm] Fwd: Bonding, bridges and ifcfg
>
> Hello everybody,
>
> We found some unexpected behavior with bonds and we'd like to discuss it.
> Please, read the forwarded messages.
>
> Best,
> Toni
>
> - Forwarded Message -
> From: Dan Kenigsberg dan...@redhat.com
> To: Antoni Segura Puimedon asegu...@redhat.com
> Cc: Livnat Peer lp...@redhat.com, Igor Lvovsky ilvov...@redhat.com
> Sent: Monday, December 10, 2012 1:03:48 PM
> Subject: Re: Bonding, ifcfg and luck
>
> On Mon, Dec 10, 2012 at 06:47:58AM -0500, Antoni Segura Puimedon wrote:
> > Hi all,
> >
> > I discussed this briefly with Livnat over the phone and mentioned it to
> > Dan. The issue that we have is that, if I understand our current
> > configNetwork correctly, it could very well be that it works by means of
> > good design with a side dish of luck. I'll explain myself:
> >
> > By design, as documented in
> > http://www.kernel.org/doc/Documentation/networking/bonding.txt, all
> > slaves of bond0 have the same MAC address (HWaddr) as bond0 for all
> > modes except TLB and ALB, which require a unique MAC address for each
> > slave.
> >
> > Thus, all ifcfg-based operations on the slave interfaces after they are
> > added to the bond (except in TLB and ALB modes) will fail with a message
> > like "Device eth3 has different MAC address than expected, ignoring.",
> > and no ifup/ifdown will be performed. Until now we were not noticing
> > this, because we were completely ignoring errors in ifdown and ifup, but
> > http://gerrit.ovirt.org/#/c/8415/ shed light on the matter.
> >
> > As you can see in the following example (bonding mode 4), the behavior
> > is just as documented:
> >
> >     [root@rhel64 ~]# cat /sys/class/net/eth*/address
> >     52:54:00:a2:b4:50
> >     52:54:00:3f:9b:28
> >     52:54:00:51:50:49
> >     52:54:00:ac:32:1b
> >     [root@rhel64 ~]# echo +eth2 > /sys/class/net/bond0/bonding/slaves
> >     [root@rhel64 ~]# echo +eth3 > /sys/class/net/bond0/bonding/slaves
> >     [root@rhel64 ~]# cat /sys/class/net/eth*/address
> >     52:54:00:a2:b4:50
> >     52:54:00:3f:9b:28
> >     52:54:00:51:50:49
> >     52:54:00:51:50:49
> >     [root@rhel64 ~]# echo -eth3 > /sys/class/net/bond0/bonding/slaves
> >     [root@rhel64 ~]# cat /sys/class/net/eth*/address
> >     52:54:00:a2:b4:50
> >     52:54:00:3f:9b:28
> >     52:54:00:51:50:49
> >     52:54:00:ac:32:1b
> >
> > Obviously, this means that, for example, when we add a bridge on top of
> > a bond, the ifdown/ifup of the bond slaves will be completely fruitless
> > (although luckily that doesn't prevent them from working).
>
> Sorry, this is not obvious to me. When we change something in a nic, we
> first take it down (which breaks it away from the bond), change it, and
> then take it up again (and back to the bond). I did not understand which
> flow of configuration leads us to the unexpected-mac error. I hope that we
> can circumvent it.
>
> > To solve this issue on the ifcfg-based operation we could either:
> >
> > - Continue ignoring these issues, and either not do ifup/ifdown for
> >   bonding slaves or catch the specific error and ignore it.
>
> That's reasonable, for a hack.
>
> > - Modify the ifcfg files of the slaves after they are enslaved to
> >   reflect the MAC addr of /sys/class/net/bond0/address, and modify the
> >   ifcfg files after the bond is destroyed to reflect their own addresses
> >   as in /sys/class/net/ethX/address.
>
> I do not understand this solution at all... Fixing initscripts to expect
> the permanent mac address instead of the bond's one makes more sense to
> me. (/proc/net/bonding/bond0 has "Permanent HW addr:")
>
> > Livnat made me note that this behavior can be a problem for the anti
> > mac-spoofing rules that we add to iptables, as they rely on the identity
> > device <-> macaddr to work and, obviously, in most bonding modes that is
> > broken unless the device's macaddr is the one chosen for the bond.
>
> Right. I suppose we can open a bug about it: in-guest bond does not work
> with mac-no-spoofing. I have a vague memory of discussing this with lpeer
> a few months back, but it somehow slipped my mind.
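A minimal sketch, assuming the /proc/net/bonding layout quoted above, of how
the permanent MAC that Dan mentions could be recovered per slave (illustrative
Python, not vdsm code):

    def permanent_macs(bond):
        """Map each slave of `bond` to its permanent MAC address,
        parsed from /proc/net/bonding/<bond>."""
        macs, slave = {}, None
        with open('/proc/net/bonding/%s' % bond) as f:
            for line in f:
                if line.startswith('Slave Interface:'):
                    slave = line.split(':', 1)[1].strip()
                elif line.startswith('Permanent HW addr:') and slave:
                    macs[slave] = line.split(':', 1)[1].strip()
        return macs

    def current_mac(nic):
        with open('/sys/class/net/%s/address' % nic) as f:
            return f.read().strip()

    # For an enslaved nic, current_mac(nic) shows the bond's address, while
    # permanent_macs('bond0').get(nic) keeps the hardware one -- the value an
    # initscripts HWADDR check would want to compare against.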
Re: [vdsm] Fwd: Bonding, bridges and ifcfg
On Mon, Dec 10, 2012 at 08:07:38AM -0500, Alon Bar-Lev wrote:
> Hi,
>
> Just to make sure... working in non-persistent mode will eliminate these
> kinds of issues, right?

No. It would eliminate the need to debug initscripts, but it would require of
the vdsm developer an intimate acquaintance with kernel quirks. We'd have
fewer building blocks and less of a chance for incompatibility, but we would
need to reimplement (some of) the logic within the ifup script.

> Alon
>
> - Original Message -
> From: Antoni Segura Puimedon asegu...@redhat.com
> To: vdsm-devel@lists.fedorahosted.org
> Sent: Monday, December 10, 2012 2:24:11 PM
> Subject: [vdsm] Fwd: Bonding, bridges and ifcfg
>
> [snip]
Re: [vdsm] Fwd: Bonding, bridges and ifcfg
- Original Message -
From: Dan Kenigsberg dan...@redhat.com
To: Alon Bar-Lev alo...@redhat.com
Cc: Antoni Segura Puimedon asegu...@redhat.com, vdsm-devel@lists.fedorahosted.org
Sent: Monday, December 10, 2012 3:16:21 PM
Subject: Re: [vdsm] Fwd: Bonding, bridges and ifcfg

> On Mon, Dec 10, 2012 at 08:07:38AM -0500, Alon Bar-Lev wrote:
> > Hi,
> >
> > Just to make sure... working in non-persistent mode will eliminate these
> > kinds of issues, right?
>
> No. It would eliminate the need to debug initscripts, but it would require
> of the vdsm developer an intimate acquaintance with kernel quirks. We'd
> have fewer building blocks and less of a chance for incompatibility, but we
> would need to reimplement (some of) the logic within the ifup script.

Sure, you would need to reimplement ifup and ifdown functionality, as you
would not use these... You will not have fewer building blocks if you break
the fedora/redhat border; actually, if you go non-persistent you will have
fewer of them and be more portable, as you have one kernel (Linux) to
support.

The vdsm developer [should] already be required to have an intimate
acquaintance with the kernel; see below for one example. It is just that even
with intimate knowledge of the kernel, working via primitive tools like the
rhel/fedora network scripts only makes it harder to produce the desired
outcome, compared to having full control over the process and the result.

Alon

> [snip]
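A minimal sketch of the non-persistent route under discussion, mirroring the
sysfs writes from the quoted shell example (my illustration; real code would
also bring links down/up and handle errors):

    def _write(path, value):
        with open(path, 'w') as f:
            f.write(value)

    def enslave(bond, nic):
        """Equivalent of: echo +<nic> > /sys/class/net/<bond>/bonding/slaves"""
        _write('/sys/class/net/%s/bonding/slaves' % bond, '+%s' % nic)

    def release(bond, nic):
        """Equivalent of: echo -<nic> > /sys/class/net/<bond>/bonding/slaves"""
        _write('/sys/class/net/%s/bonding/slaves' % bond, '-%s' % nic)

    # Nothing here touches ifcfg files: the configuration lives only in the
    # kernel and vanishes on reboot, which is exactly the trade-off Dan and
    # Alon are weighing.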
Re: [vdsm] RFC: New Storage API
On Thu, Dec 06, 2012 at 11:52:01AM -0500, Saggi Mizrahi wrote:
> - Original Message -
> From: Shu Ming shum...@linux.vnet.ibm.com
> To: Saggi Mizrahi smizr...@redhat.com
> Cc: VDSM Project Development vdsm-devel@lists.fedorahosted.org, engine-devel engine-de...@ovirt.org
> Sent: Thursday, December 6, 2012 11:02:02 AM
> Subject: Re: [vdsm] RFC: New Storage API
>
> > Saggi,
> >
> > Thanks for sharing your thoughts; I have some comments below.
> >
> > Saggi Mizrahi:
> > > I've been throwing a lot of bits out about the new storage API and I
> > > think it's time to talk a bit. I will purposefully try to keep
> > > implementation details away and concentrate on how the API looks and
> > > how you use it.
> > >
> > > The first major change is in terminology: there is no longer a storage
> > > domain but a storage repository. This change is made because so many
> > > things are already called "domain" in the system, and it will make
> > > things less confusing for newcomers with a libvirt background.
> > >
> > > One other change is that repositories no longer have a UUID. The UUID
> > > was only used in the pool members manifest and is no longer needed.
> > >
> > >     connectStorageRepository(repoId, repoFormat, connectionParameters={}):
> > >       repoId - a transient name that will be used to refer to the
> > >                connected domain; it is not persisted and doesn't have
> > >                to be the same across the cluster
> > >       repoFormat - similar to what used to be "type" (e.g. localfs-1.0,
> > >                    nfs-3.4, clvm-1.2)
> > >       connectionParameters - format specific; will be used to tell VDSM
> > >                              how to connect to the repo
> >
> > Where does repoId come from? I think repoId doesn't exist before
> > connectStorageRepository() returns. Isn't repoId a return value of
> > connectStorageRepository()?
>
> No, repoIds are no longer part of the domain; they are just a transient
> handle. The user can put whatever it wants there as long as it isn't
> already taken by another currently connected domain.
>
> > >     disconnectStorageRepository(self, repoId)
> > >
> > > In the new API there are only images; some images are mutable and some
> > > are not. Mutable images are also called VirtualDisks; immutable images
> > > are also called Snapshots. There are no explicit templates: you can
> > > create as many images as you want from any snapshot. There are 4 major
> > > image operations:
> > >
> > >     createVirtualDisk(targetRepoId, size, baseSnapshotId=None,
> > >                       userData={}, options={}):
> > >       targetRepoId - ID of a connected repo where the disk will be
> > >                      created
> > >       size - the size of the image you wish to create
> > >       baseSnapshotId - the ID of the snapshot you want to base the new
> > >                        virtual disk on
> > >       userData - optional data that will be attached to the new VD;
> > >                  could be anything that the user desires
> > >       options - options to modify VDSM's default behavior
> > >       returns the ID of the new VD
> >
> > I think we will also need a function to check if a VirtualDisk is based
> > on a specific snapshot, like:
> >
> >     isSnapshotOf(virtualDiskId, baseSnapshotID)
>
> No, the design is that volume dependencies are an implementation detail.
> There is no reason for you to know that an image is physically a snapshot
> of another. Logical snapshots, template information, and any other
> information can be set by the user by using the userData field available
> for every image.

Statements like this make me start to worry about your userData concept. It's
a sign of a bad API if the user needs to invent a custom metadata scheme for
itself. This reminds me of the abomination that is the 'custom' property in
the vm definition today.

> > >     createSnapshot(targetRepoId, baseVirtualDiskId, userData={},
> > >                    options={}):
> > >       targetRepoId - the ID of a connected repo where the new snapshot
> > >                      will be created and where the original image
> > >                      exists as well
> > >       size - the size of the image you wish to create
> > >       baseVirtualDisk - the ID of a mutable image (Virtual Disk) you
> > >                         want to snapshot
> > >       userData - optional data that will be attached to the new
> > >                  Snapshot; could be anything that the user desires
> > >       options - options to modify VDSM's default behavior
> > >       returns the ID of the new Snapshot
> > >
> > >     copyImage(targetRepoId, imageId, baseImageId=None, userData={},
> > >               options={}):
> > >       targetRepoId - the ID of a connected repo where the new image
> > >                      will be created
> > >       imageId - the image you wish to copy
> > >       baseImageId - if specified, the new image will contain only the
> > >                     diff between the image and the base image; if None,
> > >                     the new image will contain all the bits of imageId.
> > >                     This can be used to copy partial parts of images
> > >                     for export.
> > >       userData - optional data that will be attached to the new image;
> > >                  could be anything that the user desires
> > >       options - options to modify VDSM's default behavior
> >
> > Does this function mean that we can copy the image from one repository
> > to another repository? Does it cover the semantics of storage migration,
> > storage backup, storage
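A sketch of a client flow over the proposed verbs. Since this is an RFC, the
`vdsm` proxy object and the connection parameters below are assumptions made
purely for illustration:

    # `vdsm` stands for a hypothetical client proxy exposing the RFC verbs.
    repo_id = 'myrepo'  # transient handle chosen by the caller, not persisted

    vdsm.connectStorageRepository(repo_id, 'nfs-3.4',
                                  {'server': 'filer',          # assumed keys:
                                   'export': '/vol/images'})   # format specific

    # Create a blank 20 GiB virtual disk, snapshot it, then clone the snapshot.
    disk_id = vdsm.createVirtualDisk(repo_id, size=20 * 2**30,
                                     userData={'name': 'my-disk'})
    snap_id = vdsm.createSnapshot(repo_id, disk_id,
                                  userData={'name': 'my-disk-snap'})
    clone_id = vdsm.copyImage(repo_id, snap_id)  # baseImageId=None: full copy

    vdsm.disconnectStorageRepository(repo_id)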
Re: [vdsm] RFC: New Storage API
- Original Message -
From: Adam Litke a...@us.ibm.com
To: Saggi Mizrahi smizr...@redhat.com
Cc: Deepak C Shetty deepa...@linux.vnet.ibm.com, engine-devel engine-de...@ovirt.org, VDSM Project Development vdsm-devel@lists.fedorahosted.org
Sent: Monday, December 10, 2012 1:49:31 PM
Subject: Re: [vdsm] RFC: New Storage API

> On Fri, Dec 07, 2012 at 02:53:41PM -0500, Saggi Mizrahi wrote:
> [snip]
> > > 1) Can you provide more info on why there is an exception for 'lvm
> > > based block domain'. It's not coming out clearly.
> >
> > File based domains are responsible for syncing up object manipulation
> > (creation\deletion). The backend is responsible for making sure it all
> > works, either by having a single writer (NFS) or by having its own
> > locking mechanism (gluster). In our LVM based domains, VDSM is
> > responsible for basic object manipulation. The current design uses an
> > approach where there is a single host responsible for object
> > creation\deletion: the SRM\SDM\SPM\S?M. If we ever find a way to make it
> > fully clustered without a big hit in performance, the S?M requirement
> > will be removed from that type of domain.
>
> I would like to see us maintain a LOCALFS domain as well. For this, we
> would also need SRM, correct?

No, why?
Re: [vdsm] RFC: New Storage API
On Mon, Dec 10, 2012 at 02:03:09PM -0500, Saggi Mizrahi wrote:
> [snip]
> > I would like to see us maintain a LOCALFS domain as well. For this, we
> > would also need SRM, correct?
>
> No, why?

Sorry, never mind. I was thinking of a scenario with multiple clients talking
to a single vdsm and making sure they don't stomp on one another. This is
probably not something we are going to care about, though.

--
Adam Litke a...@us.ibm.com
IBM Linux Technology Center
Re: [vdsm] RFC: New Storage API
On Mon, Dec 10, 2012 at 03:36:23PM -0500, Saggi Mizrahi wrote:
> > Statements like this make me start to worry about your userData concept.
> > It's a sign of a bad API if the user needs to invent a custom metadata
> > scheme for itself. This reminds me of the abomination that is the
> > 'custom' property in the vm definition today.
>
> In one sentence: if VDSM doesn't care about it, VDSM doesn't manage it.
>
> userData being a void* is quite common, and I don't understand why you
> would think it's a sign of a bad API. Furthermore, giving the user a choice
> about how to represent its own metadata and what fields it wants to keep
> seems reasonable to me, especially given the fact that VDSM never reads it.
>
> The reason we are pulling away from the current system of VDSM
> understanding the extra data is that it ties that data to VDSM's on-disk
> format. VDSM's on-disk format has to be very stable because of clusters
> with multiple VDSM versions. Furthermore, since this is actually manager
> data, it has to be tied to the manager's backward-compatibility lifetime as
> well. Having it be opaque to VDSM ties it to only one, simpler, support
> lifetime instead of two.
>
> I guess you are implying that it will be problematic for multiple users to
> read userData left by another user because the formats might not be
> compatible. The solution is that all parties interested in using VDSM
> storage agree on format, common fields, supportability, and all the other
> things that choosing a supported *something* entails. This is, however, out
> of the scope of VDSM. When the time comes, I think how the userData blob is
> actually parsed and what fields it keeps should be discussed on ovirt-devel
> or engine-devel.
>
> The crux of the issue is that VDSM manages only what it cares about, and
> the user can't modify that directly. This is done because everything we
> expose we commit to. If you want any information persisted, like:
> - a human-readable name (in whatever encoding)
> - whether this is a template or a snapshot
> - what user owns this image
> you can just put it in the userData. VDSM is not going to impose what
> encoding you use. It's not going to decide whether you represent your users
> as IDs or names or LDAP queries or public keys. It's not going to decide
> whether you have explicit templates or not. It's not going to decide
> whether you care what the logical image chain is. It's not going to decide
> anything that is out of its scope.
>
> No format is future proof; no selection of fields will be good for every
> situation. I'd much rather it be someone else's problem when any of them
> need to be changed. They have currently been VDSM's problem, and it has
> been hell to maintain.

In general, I actually agree with most of this. What I want to avoid is
pushing things that should actually be a part of the API into this userData
blob. We do want to keep the API as simple as possible to give vdsm
flexibility. If, over time, we find that users are always using userData to
work around something missing in the API, this could be a really good sign
that the API needs extension.

--
Adam Litke a...@us.ibm.com
IBM Linux Technology Center
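A sketch of what an opaque userData payload might look like on the manager
side. The field names and the choice of JSON are assumptions, since the
thread deliberately leaves the format for the managers to agree on:

    import json

    # Metadata that matters to the manager but not to VDSM. VDSM stores and
    # returns it verbatim; it never parses or validates it.
    user_data = {
        'name': u'golden-image',        # human-readable name, any encoding
        'kind': 'template',             # template vs. snapshot: manager concept
        'owner': 'admin@example.com',   # ownership model: manager's choice
    }

    # Attached on creation, e.g.:
    #   vdsm.createVirtualDisk(repo_id, size, userData=user_data)
    # If managers prefer a single string blob, they might agree on JSON:
    blob = json.dumps(user_data)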
Re: [vdsm] RFC: New Storage API
- Original Message -
From: Adam Litke a...@us.ibm.com
To: Saggi Mizrahi smizr...@redhat.com
Cc: Shu Ming shum...@linux.vnet.ibm.com, engine-devel engine-de...@ovirt.org, VDSM Project Development vdsm-devel@lists.fedorahosted.org
Sent: Monday, December 10, 2012 4:47:46 PM
Subject: Re: [vdsm] RFC: New Storage API

> On Mon, Dec 10, 2012 at 03:36:23PM -0500, Saggi Mizrahi wrote:
> [snip]
>
> In general, I actually agree with most of this. What I want to avoid is
> pushing things that should actually be a part of the API into this userData
> blob. We do want to keep the API as simple as possible to give vdsm
> flexibility. If, over time, we find that users are always using userData to
> work around something missing in the API, this could be a really good sign
> that the API needs extension.

I was actually contemplating this for quite a while.
If, while you create an image, the reply is lost or VDSM is unable to know
whether the operation was committed, the user will have no way of knowing
what the new image ID is. To solve this, it is recommended that the manager
put some sort of task-related information in the userData. If the operation
ever finishes in an ambiguous state, the user just reads the userData from
any images it doesn't know about or is unsure about.

This is a flow that every client will have to have, so why not just add it to
the API? Because I don't want to impose how this information gets generated,
what the content of that data is, or how unique it has to be. Since VDSM
doesn't use it for anything, I don't feel like I need to figure this out.

I am all for simplicity, but simplicity is kind of an abstract concept.
Having it be a blob is, in some aspects, the simplest thing you can do. Just
saying "I have a field, put whatever in it" is simple to convey, but it does
require more work on the user's side to figure out what to do with it.

All that being said, I do think that the format, the fields, and how to use
them should be defined so that different users can communicate and
synchronize. It's also important that you don't reinvent the wheel for every
flow in every client. I'm just saying that it's not in the scope of VDSM. It
should be done as a standard that all users of VDSM agree to conform to.
It's the same way that a
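A sketch of that recovery flow. Here `vdsm` is the same hypothetical client
proxy as in the earlier sketch, and `listImages`, `getImageUserData`, and the
`ConnectionLost` exception are invented stand-ins, since the thread does not
pin these verbs down:

    import uuid

    class ConnectionLost(Exception):
        """Hypothetical: the reply was lost, so the outcome is unknown."""

    task_tag = str(uuid.uuid4())  # manager-generated marker for this operation

    try:
        disk_id = vdsm.createVirtualDisk(repo_id, size=10 * 2**30,
                                         userData={'task': task_tag})
    except ConnectionLost:
        disk_id = None
        # Scan images we are unsure about and look for our marker.
        for img in vdsm.listImages(repo_id):              # hypothetical verb
            data = vdsm.getImageUserData(repo_id, img)    # hypothetical verb
            if data.get('task') == task_tag:
                disk_id = img  # the create committed after all
                break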
Re: [vdsm] RFC: New Storage API
2012-12-11 4:36, Saggi Mizrahi:
> - Original Message -
> From: Adam Litke a...@us.ibm.com
> To: Saggi Mizrahi smizr...@redhat.com
> Cc: Shu Ming shum...@linux.vnet.ibm.com, engine-devel engine-de...@ovirt.org, VDSM Project Development vdsm-devel@lists.fedorahosted.org
> Sent: Monday, December 10, 2012 1:39:51 PM
> Subject: Re: [vdsm] RFC: New Storage API
>
> [snip]
> The reason we are pulling away from the current system of VDSM
> understanding the extra data is that it ties that data to VDSM's on-disk
> format. VDSM's on-disk format has to be very stable because of clusters
> with multiple VDSM versions. Furthermore, since this is actually manager
> data, it has to be tied to the manager's backward-compatibility lifetime as
> well. Having it be opaque to VDSM ties it to only one, simpler, support
> lifetime instead of two.

Making userData opaque gives flexibility to the management applications. To
me, opaque userData can have at least two types. The first is userData for
runtime only; the second is userData expected to be persisted to the metadata
disk. For the first type, the management applications can store their own
data structures, like temporary task states, VDSM query caches, etc. After
the VDSM
Re: [vdsm] Fwd: Bonding, bridges and ifcfg
On 12/10/2012 08:24 PM, Antoni Segura Puimedon wrote:
> Hello everybody,
>
> We found some unexpected behavior with bonds and we'd like to discuss it.
> Please, read the forwarded messages.
>
> Best,
> Toni
>
> - Forwarded Message -
> From: Dan Kenigsberg dan...@redhat.com
> To: Antoni Segura Puimedon asegu...@redhat.com
> Cc: Livnat Peer lp...@redhat.com, Igor Lvovsky ilvov...@redhat.com
> Sent: Monday, December 10, 2012 1:03:48 PM
> Subject: Re: Bonding, ifcfg and luck
>
> [snip]
>
> Sorry, this is not obvious to me. When we change something in a nic, we
> first take it down (which breaks it away from the bond), change it, and
> then take it up again (and back to the bond). I did not understand which
> flow of configuration leads us to the unexpected-mac error. I hope that we
> can circumvent it.

I get the same question. The warning message should only be seen when you run
ifup on the bonding or one of its slaves while it is already up; otherwise
the slave nic's mac address should hold its own permanent mac address. If the
bonding was down before, you shouldn't see this message, because the nic is
not enslaved.

> > To solve this issue on the ifcfg-based operation we could either:
> >
> > - Continue ignoring these issues, and either not do ifup/ifdown for
> >   bonding slaves or catch the specific error and ignore it.
>
> That's reasonable, for a hack.
>
> > - Modify the ifcfg files of the slaves after they are enslaved to
> >   reflect the MAC addr of /sys/class/net/bond0/address, and modify the
> >   ifcfg files after the bond is destroyed to reflect their own addresses
> >   as in /sys/class/net/ethX/address.
>
> I do not understand this solution at all... Fixing initscripts to expect
> the permanent mac address instead of the bond's one makes more sense to me.
> (/proc/net/bonding/bond0 has "Permanent HW addr:")
>
> > Livnat made me note that this behavior can be a problem for the anti
> > mac-spoofing rules that we add to iptables, as they rely on the identity
> > device <-> macaddr to work and, obviously, in most bonding modes that is
> > broken unless the device's macaddr is the one chosen for the bond.
>
> Right. I suppose we can open a bug about it: in-guest bond does not work
> with mac-no-spoofing. I have a vague memory of discussing this with lpeer
> a few months back, but it somehow slipped my mind.

I am not sure why we need bonding inside the guest. Bonding can provide link
failover and bandwidth aggregation, and we can achieve that by setting up
bonding on the host. If we really need it, we could work around it by
defining a netfilter for each vm which allows the traffic from all the mac
addresses belonging to it. To be exact, we could also make qemu generate an
event to libvirt on a guest vif's mac address change, and then libvirt could
validate the change and update its data accordingly.

> Well, I think that is all for this issue. We should discuss which is the
> best approach for this before we move on with patches that account for
> ifup/ifdown return information.
>
> Best,
> Toni
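A sketch of the per-vm filter idea Mark describes, expressed through libvirt's
nwfilter mechanism. The filter name and MAC list are invented for
illustration, and whether this fits vdsm's deployment is exactly what is
being debated:

    import libvirt

    # Accept outbound traffic from every MAC known to belong to the vm,
    # instead of pinning the vif to a single spoof-checked address.
    allowed_macs = ['52:54:00:a2:b4:50', '52:54:00:3f:9b:28']  # example values

    rules = '\n'.join(
        "  <rule action='accept' direction='out'>\n"
        "    <mac match='yes' srcmacaddr='%s'/>\n"
        "  </rule>" % mac
        for mac in allowed_macs)
    xml = ("<filter name='vm-allowed-macs' chain='mac'>\n"
           "%s\n"
           "  <rule action='drop' direction='out'/>\n"  # drop everything else
           "</filter>" % rules)

    conn = libvirt.open('qemu:///system')
    conn.nwfilterDefineXML(xml)  # define (or redefine) the filter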
Re: [vdsm] Fwd: Bonding, bridges and ifcfg
On 11/12/12 07:42, Mark Wu wrote:
> [snip]
>
> I am not sure why we need bonding inside the guest. Bonding can provide
> link failover and bandwidth aggregation, and we can achieve that by setting
> up bonding on the host. If we really need it, we could work around it by
> defining a netfilter for each vm which allows the traffic from all the mac
> addresses belonging to it. To be exact, we could also make qemu generate an
> event to libvirt on a guest vif's mac address change, and then libvirt
> could validate the change and update its data accordingly.

Network configuration in the guest is not something vdsm does (at least not
today) but our