Re: [RFC net-next] iavf: refactor plan proposal
On Tue, Mar 09, 2021 at 09:11:46PM -0800, Jesse Brandeburg wrote:
> Leon Romanovsky wrote:
>
> > > 3) Plan is to make the "new" iavf driver the default iavf once
> > >    extensive regression testing can be completed.
> > >   a. Current proposal is to make CONFIG_IAVF have a sub-option
> > >      CONFIG_IAVF_V2 that lets the user adopt the new code,
> > >      without changing the config for existing users or breaking
> > >      them.
> >
> > I don't think that .config options are considered ABIs, so it is unclear
> > what you mean by saying "disrupting current users". Instead of the
> > complication written above, do like any other driver does: perform your
> > testing, submit the code and switch to the new code at the same time.
>
> Because this VF driver runs on multiple hardware PFs (they all expose
> the same VF device ID) the testing matrix is quite huge and will take
> us a while to get through it. We aim to avoid making users' lives hard
> by having CONFIG_IAVF=m become a surprise new code base behind the back
> of the user.

Don't you already test your patches against that testing DB? Like Jakub
said, do incremental changes and it will be much saner for everyone.

>
> I've always thought that the .config options *are* a sort of ABI,
> because when you do "make oldconfig" it tries to pick up your previous
> configuration and if, for instance, a driver changes its Kconfig name,
> it will not pick up the old value of the old driver Kconfig name for
> the new build, and will either default or ask the user. The way we're
> proposing I think will allow the old driver to stay default until the
> user answers Y to the "new option" for the new, iecm based code.

I understand the rationale, but no - .config is not ABI at all. There
are three types of "users" who are messing with configs:
1. Distro people
2. Kernel developers
3. "Experts" who want/need to rebuild the kernel

All of them are expected to be proficient enough to handle changes in
CONFIG_* land.

In your proposal you are trying to solve the non-existent problem of
users who build their own kernel but are dumb enough not to understand
what they are doing.

We are removing/adding/renaming CONFIG_* all the time, this is no
different.

>
> > > [1]
> > > https://lore.kernel.org/netdev/20200824173306.3178343-1-anthony.l.ngu...@intel.com/
> >
> > Please don't introduce module parameters in new code.
>
> Thanks, we certainly won't. :-)
> I'm not sure why you commented about module parameters, but the above
> link is to the previous submission for a new driver that uses some
> common code as a module (iecm) for a new device driver (idpf) we had
> sent. The point of this email was to solicit feedback and give notice
> about doing a complicated refactor/replace where we end up re-using
> iecm for the new version of the iavf code, with the intent to be up
> front and working with the community throughout the process. Because of
> the complexity, we want to do the right thing the first time so we can
> avoid a restart/redesign.

I commented simply because it jumped out at me when I looked at the
patches in that link. It was general enough to write here; the rest of
my comments are too specific and are better posted as replies to the
patches themselves.

Thanks

>
> Thanks,
> Jesse
Re: [RFC net-next] iavf: refactor plan proposal
Jakub Kicinski wrote:

> On Mon, 8 Mar 2021 16:28:58 -0800 Jesse Brandeburg wrote:
> > Hello,
> >
> > We plan to refactor the iavf module and would appreciate community and
> > maintainer feedback on our plans. We want to do this to realize the
> > usefulness of the common code module for multiple drivers. This
> > proposal aims to avoid disrupting current users.
> >
> > The steps we plan are something like:
> > 1) Continue upstreaming of the iecm module (common module) and
> >    the initial feature set for the idpf driver[1] utilizing iecm.
>
> Oh, that's still going? there wasn't any revision for such a long time
> I deleted my notes :-o

Argh! Sorry about the delay. These proposed driver changes impacted
progress on this patch series; we should have done a better job
communicating what was going on.

> > We are looking to make sure that the mode of our refactoring will meet
> > the community's expectations. Any advice or feedback is appreciated.
>
> Sounds like a slow, drawn out process painful to everyone involved.
>
> The driver is upstream. My humble preference is that Intel sends small
> logical changes we can review, and preserve a meaningful git history.

We are attempting to make it as painless and quick as possible. With
that said, I see your point and am driving some internal discussions to
see what we can do differently.

The primary reason for the proposed plan is the code reuse model we've
chosen. With the change to the common module, the new iavf is
significantly different, and replacing the old avf base with the new
one would take many unnecessary intermediate steps that would be thrown
away at the end. The end design will use the code from the common
module, with hooks to get device-specific implementation where
necessary.

After putting the new-avf code in place, we can update the iavf with
new functionality which is already present in the common module.

Thanks,
Jesse
Re: [RFC net-next] iavf: refactor plan proposal
Leon Romanovsky wrote:

> > 3) Plan is to make the "new" iavf driver the default iavf once
> >    extensive regression testing can be completed.
> >   a. Current proposal is to make CONFIG_IAVF have a sub-option
> >      CONFIG_IAVF_V2 that lets the user adopt the new code,
> >      without changing the config for existing users or breaking
> >      them.
>
> I don't think that .config options are considered ABIs, so it is unclear
> what you mean by saying "disrupting current users". Instead of the
> complication written above, do like any other driver does: perform your
> testing, submit the code and switch to the new code at the same time.

Because this VF driver runs on multiple hardware PFs (they all expose
the same VF device ID) the testing matrix is quite huge and will take
us a while to get through it. We aim to avoid making users' lives hard
by having CONFIG_IAVF=m become a surprise new code base behind the back
of the user.

I've always thought that the .config options *are* a sort of ABI,
because when you do "make oldconfig" it tries to pick up your previous
configuration and if, for instance, a driver changes its Kconfig name,
it will not pick up the old value of the old driver Kconfig name for
the new build, and will either default or ask the user. The way we're
proposing I think will allow the old driver to stay default until the
user answers Y to the "new option" for the new, iecm based code.

> > [1]
> > https://lore.kernel.org/netdev/20200824173306.3178343-1-anthony.l.ngu...@intel.com/
>
> Please don't introduce module parameters in new code.

Thanks, we certainly won't. :-)
I'm not sure why you commented about module parameters, but the above
link is to the previous submission for a new driver that uses some
common code as a module (iecm) for a new device driver (idpf) we had
sent. The point of this email was to solicit feedback and give notice
about doing a complicated refactor/replace where we end up re-using
iecm for the new version of the iavf code, with the intent to be up
front and working with the community throughout the process. Because of
the complexity, we want to do the right thing the first time so we can
avoid a restart/redesign.

Thanks,
Jesse
Re: [RFC net-next] iavf: refactor plan proposal
On Mon, 8 Mar 2021 16:28:58 -0800 Jesse Brandeburg wrote:
> Hello,
>
> We plan to refactor the iavf module and would appreciate community and
> maintainer feedback on our plans. We want to do this to realize the
> usefulness of the common code module for multiple drivers. This
> proposal aims to avoid disrupting current users.
>
> The steps we plan are something like:
> 1) Continue upstreaming of the iecm module (common module) and
>    the initial feature set for the idpf driver[1] utilizing iecm.

Oh, that's still going? there wasn't any revision for such a long time
I deleted my notes :-o

> 2) Introduce the refactored iavf code as a "new" iavf driver with the
>    same device ID, but Kconfig default to =n to enable testing.
>   a. Make this exclusive so if someone opts in to "new" iavf,
>      then it disables the original iavf (?)
>   b. If we do make it exclusive in Kconfig can we use the same
>      name?
> 3) Plan is to make the "new" iavf driver the default iavf once
>    extensive regression testing can be completed.
>   a. Current proposal is to make CONFIG_IAVF have a sub-option
>      CONFIG_IAVF_V2 that lets the user adopt the new code,
>      without changing the config for existing users or breaking
>      them.
>
> We are looking to make sure that the mode of our refactoring will meet
> the community's expectations. Any advice or feedback is appreciated.

Sounds like a slow, drawn out process painful to everyone involved.

The driver is upstream. My humble preference is that Intel sends small
logical changes we can review, and preserve a meaningful git history.
Re: [RFC net-next] iavf: refactor plan proposal
On Mon, Mar 08, 2021 at 04:28:58PM -0800, Jesse Brandeburg wrote:
> Hello,
>
> We plan to refactor the iavf module and would appreciate community and
> maintainer feedback on our plans. We want to do this to realize the
> usefulness of the common code module for multiple drivers. This
> proposal aims to avoid disrupting current users.
>
> The steps we plan are something like:
> 1) Continue upstreaming of the iecm module (common module) and
>    the initial feature set for the idpf driver[1] utilizing iecm.
> 2) Introduce the refactored iavf code as a "new" iavf driver with the
>    same device ID, but Kconfig default to =n to enable testing.
>   a. Make this exclusive so if someone opts in to "new" iavf,
>      then it disables the original iavf (?)
>   b. If we do make it exclusive in Kconfig can we use the same
>      name?
> 3) Plan is to make the "new" iavf driver the default iavf once
>    extensive regression testing can be completed.
>   a. Current proposal is to make CONFIG_IAVF have a sub-option
>      CONFIG_IAVF_V2 that lets the user adopt the new code,
>      without changing the config for existing users or breaking
>      them.

I don't think that .config options are considered ABIs, so it is unclear
what you mean by saying "disrupting current users". Instead of the
complication written above, do like any other driver does: perform your
testing, submit the code and switch to the new code at the same time.

>
> We are looking to make sure that the mode of our refactoring will meet
> the community's expectations. Any advice or feedback is appreciated.
>
> Thanks,
> Jesse, Alice, Alan
>
> [1]
> https://lore.kernel.org/netdev/20200824173306.3178343-1-anthony.l.ngu...@intel.com/

Please don't introduce module parameters in new code.

Thanks
[RFC net-next] iavf: refactor plan proposal
Hello,

We plan to refactor the iavf module and would appreciate community and
maintainer feedback on our plans. We want to do this to realize the
usefulness of the common code module for multiple drivers. This
proposal aims to avoid disrupting current users.

The steps we plan are something like:
1) Continue upstreaming of the iecm module (common module) and
   the initial feature set for the idpf driver[1] utilizing iecm.
2) Introduce the refactored iavf code as a "new" iavf driver with the
   same device ID, but Kconfig default to =n to enable testing.
  a. Make this exclusive so if someone opts in to "new" iavf,
     then it disables the original iavf (?)
  b. If we do make it exclusive in Kconfig can we use the same
     name?
3) Plan is to make the "new" iavf driver the default iavf once
   extensive regression testing can be completed.
  a. Current proposal is to make CONFIG_IAVF have a sub-option
     CONFIG_IAVF_V2 that lets the user adopt the new code,
     without changing the config for existing users or breaking
     them.

We are looking to make sure that the mode of our refactoring will meet
the community's expectations. Any advice or feedback is appreciated.

Thanks,
Jesse, Alice, Alan

[1]
https://lore.kernel.org/netdev/20200824173306.3178343-1-anthony.l.ngu...@intel.com/
Proposal for a new protocol family - AF_MCTP
Hi all,

I'm currently working on implementing support for the Management
Controller Transport Protocol (MCTP). Briefly, MCTP is a protocol for
intra-system communication between a management controller (typically a
BMC), and the devices it manages. If you're after the full details, the
DMTF have a specification (DSP0236) up at:

  https://www.dmtf.org/standards/pmci

In short, this involves adding a new protocol / address family
("AF_MCTP"), the supporting types for a sockets API, and netlink
protocol definitions.

I'm currently at the design & prototyping stage - so no patches to send
just yet! However, if you're super keen, you can have a review of the
design outline for the OpenBMC project, up at:

  https://github.com/jk-ozlabs/openbmc-docs/blob/mctp/designs/mctp/mctp-kernel.md

If you'd like to send feedback on any aspects of that, I'm keen to hear
it. You can either respond to me via email, or participate in the
gerrit review of that document, which is at:

  https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/40514

Otherwise, if you prefer to review as code instead, I'll be sending
patches to netdev once we've done a few passes of the design doc with
the OpenBMC community.

linux-can folks: the structure of MCTP is a little similar to CAN, and
I've been referring to net/can/ a little for the mctp implementation,
hence including the list here. If you have any particular hindsight
from your work, I'd be keen to hear about it too.

Cheers,

Jeremy
Re: [PATCH net-next 1/2] net/smc: send ISM devices with unique chid in CLC proposal
On Sat, 03 Oct 2020 17:05:38 -0700 (PDT) David Miller wrote:
> Series applied, but could you send a proper patch series in the future
> with a "[PATCH 0/N] ..." header posting? It must explain what the
> patch series does at a high level, how it is doing it, and why it is
> doing it that way.
>
> Thank you.

Hi Dave,

not sure what went wrong but I sent the header posting along with the
patches, see
https://lists.openwall.net/netdev/2020/10/02/197

--
Karsten Graul
Re: [PATCH net-next 1/2] net/smc: send ISM devices with unique chid in CLC proposal
From: Karsten Graul
Date: Fri, 2 Oct 2020 17:09:26 +0200

> When building a CLC proposal message then the list of ISM devices does
> not need to contain multiple devices that have the same chid value,
> all these devices use the same function at the end.
> Improve smc_find_ism_v2_device_clnt() to collect only ISM devices that
> have unique chid values.
>
> Signed-off-by: Karsten Graul

Series applied, but could you send a proper patch series in the future
with a "[PATCH 0/N] ..." header posting? It must explain what the
patch series does at a high level, how it is doing it, and why it is
doing it that way.

Thank you.
[PATCH net-next 1/2] net/smc: send ISM devices with unique chid in CLC proposal
When building a CLC proposal message then the list of ISM devices does
not need to contain multiple devices that have the same chid value,
all these devices use the same function at the end.
Improve smc_find_ism_v2_device_clnt() to collect only ISM devices that
have unique chid values.

Signed-off-by: Karsten Graul
---
 net/smc/af_smc.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index e874d0e6267f..670e802a73cb 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -599,6 +599,18 @@ static int smc_find_ism_device(struct smc_sock *smc, struct smc_init_info *ini)
 	return 0;
 }
 
+/* is chid unique for the ism devices that are already determined? */
+static bool smc_find_ism_v2_is_unique_chid(u16 chid, struct smc_init_info *ini,
+					   int cnt)
+{
+	int i = (!ini->ism_dev[0]) ? 1 : 0;
+
+	for (; i < cnt; i++)
+		if (ini->ism_chid[i] == chid)
+			return false;
+	return true;
+}
+
 /* determine possible V2 ISM devices (either without PNETID or with PNETID plus
  * PNETID matching net_device)
  */
@@ -608,6 +620,7 @@ static int smc_find_ism_v2_device_clnt(struct smc_sock *smc,
 	int rc = SMC_CLC_DECL_NOSMCDDEV;
 	struct smcd_dev *smcd;
 	int i = 1;
+	u16 chid;
 
 	if (smcd_indicated(ini->smc_type_v1))
 		rc = 0;		/* already initialized for V1 */
@@ -615,10 +628,13 @@ static int smc_find_ism_v2_device_clnt(struct smc_sock *smc,
 	list_for_each_entry(smcd, &smcd_dev_list.list, list) {
 		if (smcd->going_away || smcd == ini->ism_dev[0])
 			continue;
+		chid = smc_ism_get_chid(smcd);
+		if (!smc_find_ism_v2_is_unique_chid(chid, ini, i))
+			continue;
 		if (!smc_pnet_is_pnetid_set(smcd->pnetid) ||
 		    smc_pnet_is_ndev_pnetid(sock_net(&smc->sk), smcd->pnetid)) {
 			ini->ism_dev[i] = smcd;
-			ini->ism_chid[i] = smc_ism_get_chid(ini->ism_dev[i]);
+			ini->ism_chid[i] = chid;
 			ini->is_smcd = true;
 			rc = 0;
 			i++;
-- 
2.17.1
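
Editorial note: the filtering idea in this patch can be tried outside the
kernel. The following is only a minimal userspace sketch, assuming a plain
array of chid values instead of the smcd device list. MAX_ISM_DEVS,
chid_is_unique() and collect_unique_chids() are hypothetical names that do
not exist in net/smc, and the sketch leaves out the patch's handling of
slot 0, which is reserved for the V1 device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capacity for this sketch, not taken from the SMC code. */
#define MAX_ISM_DEVS 8

/* Return true if chid does not yet appear in chids[0..cnt). */
static bool chid_is_unique(const uint16_t *chids, size_t cnt, uint16_t chid)
{
	for (size_t i = 0; i < cnt; i++)
		if (chids[i] == chid)
			return false;
	return true;
}

/*
 * Copy the first occurrence of every distinct chid from in[] to out[],
 * preserving order. out[] must have room for MAX_ISM_DEVS entries.
 * Returns the number of entries written.
 */
static size_t collect_unique_chids(const uint16_t *in, size_t n, uint16_t *out)
{
	size_t cnt = 0;

	for (size_t i = 0; i < n && cnt < MAX_ISM_DEVS; i++) {
		if (!chid_is_unique(out, cnt, in[i]))
			continue;	/* a device with this chid is already listed */
		out[cnt++] = in[i];
	}
	assert(cnt <= MAX_ISM_DEVS);
	return cnt;
}
```

As in the patch, the lookup is a linear scan over the entries collected so
far, which seems reasonable while the list stays as short as a CLC proposal
allows.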
[PATCH net-next 10/14] net/smc: build and send V2 CLC proposal
From: Ursula Braun

The new format of an SMCD V2 CLC proposal is introduced, and
building and checking of SMCD V2 CLC proposals is adapted
accordingly.

Signed-off-by: Ursula Braun
Signed-off-by: Karsten Graul
---
 net/smc/af_smc.c  |   2 +-
 net/smc/smc.h     |   6 ++
 net/smc/smc_clc.c | 171 --
 net/smc/smc_clc.h |  73 ++--
 4 files changed, 210 insertions(+), 42 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 1d01a01c7fd5..10374673f75f 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1301,7 +1301,7 @@ static void smc_find_ism_device_serv(struct smc_sock *new_smc,
 	if (!smcd_indicated(pclc->hdr.typev1))
 		goto not_found;
 	ini->is_smcd = true; /* prepare ISM check */
-	ini->ism_peer_gid[0] = pclc_smcd->gid;
+	ini->ism_peer_gid[0] = ntohll(pclc_smcd->ism.gid);
 	if (smc_find_ism_device(new_smc, ini))
 		goto not_found;
 	if (!smc_listen_ism_init(new_smc, ini))
diff --git a/net/smc/smc.h b/net/smc/smc.h
index 0b9c904e2282..a1e480a3ec43 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -20,6 +20,7 @@
 
 #define SMC_V1		1		/* SMC version V1 */
 #define SMC_V2		2		/* SMC version V2 */
+#define SMC_RELEASE	0
 
 #define SMCPROTO_SMC		0	/* SMC protocol, IPv4 */
 #define SMCPROTO_SMC6		1	/* SMC protocol, IPv6 */
@@ -28,6 +29,8 @@
 					 * devices
 					 */
 
+#define SMC_MAX_EID_LEN		32
+
 extern struct proto smc_proto;
 extern struct proto smc_proto6;
 
@@ -251,6 +254,9 @@ extern struct workqueue_struct *smc_close_wq;	/* wq for close work */
 
 extern u8 local_systemid[SMC_SYSTEMID_LEN];	/* unique system identifier */
 
+#define ntohll(x) be64_to_cpu(x)
+#define htonll(x) cpu_to_be64(x)
+
 /* convert an u32 value into network byte order, store it into a 3 byte field */
 static inline void hton24(u8 *net, u32 host)
 {
diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c
index 26f1cdd35cb1..037c92a0c2b9 100644
--- a/net/smc/smc_clc.c
+++ b/net/smc/smc_clc.c
@@ -34,12 +34,52 @@ static const char SMC_EYECATCHER[4] = {'\xe2', '\xd4', '\xc3', '\xd9'};
 /* eye catcher "SMCD" EBCDIC for CLC messages */
 static const char SMCD_EYECATCHER[4] = {'\xe2', '\xd4', '\xc3', '\xc4'};
 
+/* check arriving CLC proposal */
+static bool smc_clc_msg_prop_valid(struct smc_clc_msg_proposal *pclc)
+{
+	struct smc_clc_msg_proposal_prefix *pclc_prfx;
+	struct smc_clc_smcd_v2_extension *smcd_v2_ext;
+	struct smc_clc_msg_hdr *hdr = &pclc->hdr;
+	struct smc_clc_v2_extension *v2_ext;
+
+	v2_ext = smc_get_clc_v2_ext(pclc);
+	pclc_prfx = smc_clc_proposal_get_prefix(pclc);
+	if (hdr->version == SMC_V1) {
+		if (hdr->typev1 == SMC_TYPE_N)
+			return false;
+		if (ntohs(hdr->length) !=
+			sizeof(*pclc) + ntohs(pclc->iparea_offset) +
+			sizeof(*pclc_prfx) +
+			pclc_prfx->ipv6_prefixes_cnt *
+				sizeof(struct smc_clc_ipv6_prefix) +
+			sizeof(struct smc_clc_msg_trail))
+			return false;
+	} else {
+		if (ntohs(hdr->length) !=
+			sizeof(*pclc) +
+			sizeof(struct smc_clc_msg_smcd) +
+			(hdr->typev1 != SMC_TYPE_N ?
+				sizeof(*pclc_prfx) +
+				pclc_prfx->ipv6_prefixes_cnt *
+				sizeof(struct smc_clc_ipv6_prefix) : 0) +
+			(hdr->typev2 != SMC_TYPE_N ?
+				sizeof(*v2_ext) +
+				v2_ext->hdr.eid_cnt * SMC_MAX_EID_LEN : 0) +
+			(smcd_indicated(hdr->typev2) ?
+				sizeof(*smcd_v2_ext) + v2_ext->hdr.ism_gid_cnt *
+					sizeof(struct smc_clc_smcd_gid_chid) :
+				0) +
+			sizeof(struct smc_clc_msg_trail))
+			return false;
+	}
+	return true;
+}
+
 /* check if received message has a correct header length and contains valid
  * heading and trailing eyecatchers
  */
 static bool smc_clc_msg_hdr_valid(struct smc_clc_msg_hdr *clcm, bool check_trl)
 {
-	struct smc_clc_msg_proposal_prefix *pclc_prfx;
 	struct smc_clc_msg_accept_confirm *clc;
 	struct smc_clc_msg_proposal *pclc;
 	struct smc_clc_msg_decline *dclc;
@@ -51,13 +91,7 @@ static bool smc_clc_msg_hdr_valid(struct smc_clc_msg_hdr *clcm, bool check_trl)
 
 	switch (clcm->type) {
 	case SMC_C
Re: Yet another ethernet PHY LED control proposal
Hi!

> I have been thinking about another way to implement ABI for HW control
> of ethernet PHY connected LEDs.
>
> This proposal is inspired by the fact that for some time there is a
> movement in the kernel to do transparent HW offloading of things (DSA
> is an example of that).

And it is a good proposal.

> So currently we have the `netdev` trigger. When this is enabled for a
> LED, new files will appear in that LED's sysfs directory:
>   - `device_name` where user is supposed to write interface name
>   - `link` if set to 1, the LED will be ON if the interface is linked
>   - `rx` if set to 1, the LED will blink on receive event
>   - `tx` if set to 1, the LED will blink on transmit event
>   - `interval` specifies duration of the LED blink
>
> Now what is interesting is that almost all combinations of link/rx/tx
> settings are offloadable to a Marvell PHY! (Not to all LEDs, though...)
>
> So what if we abandoned the idea of a `hw` trigger, and instead just
> allowed a LED trigger to be offloadable, if that specific LED supports
> it?
>
> For the HW mode for different speed we can just expand the `link` sysfs
> file ABI, so that if user writes a specific speed to this file, instead
> of just "1", the LED will be on if the interface is linked on that
> specific speed. Or maybe another sysfs file could be used for "light on
> N mbps" setting...
>
> Afterwards we can figure out other possible modes.
>
> What do you think?

If this can be implemented (and it probably can) it is the best
solution :-).

Best regards,
								Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Yet another ethernet PHY LED control proposal
Hello,

I have been thinking about another way to implement ABI for HW control
of ethernet PHY connected LEDs.

This proposal is inspired by the fact that for some time there is a
movement in the kernel to do transparent HW offloading of things (DSA
is an example of that).

So currently we have the `netdev` trigger. When this is enabled for a
LED, new files will appear in that LED's sysfs directory:
  - `device_name` where user is supposed to write interface name
  - `link` if set to 1, the LED will be ON if the interface is linked
  - `rx` if set to 1, the LED will blink on receive event
  - `tx` if set to 1, the LED will blink on transmit event
  - `interval` specifies duration of the LED blink

Now what is interesting is that almost all combinations of link/rx/tx
settings are offloadable to a Marvell PHY! (Not to all LEDs, though...)

So what if we abandoned the idea of a `hw` trigger, and instead just
allowed a LED trigger to be offloadable, if that specific LED supports
it?

For the HW mode for different speed we can just expand the `link` sysfs
file ABI, so that if user writes a specific speed to this file, instead
of just "1", the LED will be on if the interface is linked on that
specific speed. Or maybe another sysfs file could be used for "light on
N mbps" setting...

Afterwards we can figure out other possible modes.

What do you think?

Marek
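
Editorial note: one way to picture "allow a LED trigger to be offloadable,
if that specific LED supports it" is a per-LED table of the link/rx/tx
combinations the hardware can drive by itself, consulted whenever the
trigger settings change. The sketch below is only an illustration under
that assumption. struct led_hw_caps, led_can_offload() and the TRIG_*
flags are hypothetical names, not an existing kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flags mirroring the netdev trigger's link/rx/tx sysfs files. */
enum {
	TRIG_LINK = 1u << 0,
	TRIG_RX   = 1u << 1,
	TRIG_TX   = 1u << 2,
};

/*
 * Hypothetical per-LED description: every combination of the flags above
 * that this LED's hardware can drive without software help.
 */
struct led_hw_caps {
	const unsigned int *modes;
	size_t n_modes;
};

/*
 * Return true if the wanted link/rx/tx combination can be handed to the
 * hardware. When this returns false the trigger would keep blinking the
 * LED from software, exactly as it does today.
 */
static bool led_can_offload(const struct led_hw_caps *caps, unsigned int wanted)
{
	assert(caps != NULL);

	for (size_t i = 0; i < caps->n_modes; i++)
		if (caps->modes[i] == wanted)
			return true;
	return false;
}
```

An exact-match table is used here because the mail notes that almost all,
but not all, combinations are offloadable, and not on every LED. A real
driver might well describe this differently.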
[PATCH net-next 03/10] net/smc: dynamic allocation of CLC proposal buffer
From: Ursula Braun

Reduce stack size for smc_listen_work() and smc_clc_send_proposal()
by dynamic allocation of the CLC buffer to be received or sent.

Signed-off-by: Ursula Braun
Signed-off-by: Karsten Graul
---
 net/smc/af_smc.c  | 13 +--
 net/smc/smc_clc.c | 88 +++
 net/smc/smc_clc.h | 15
 3 files changed, 67 insertions(+), 49 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 8f6472f4ae21..00e2a4ce0131 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1276,10 +1276,10 @@ static void smc_listen_work(struct work_struct *work)
 						smc_listen_work);
 	struct socket *newclcsock = new_smc->clcsock;
 	struct smc_clc_msg_accept_confirm cclc;
+	struct smc_clc_msg_proposal_area *buf;
 	struct smc_clc_msg_proposal *pclc;
 	struct smc_init_info ini = {0};
 	bool ism_supported = false;
-	u8 buf[SMC_CLC_MAX_LEN];
 	int rc = 0;
 
 	if (new_smc->listen_smc->sk.sk_state != SMC_LISTEN)
@@ -1301,8 +1301,13 @@ static void smc_listen_work(struct work_struct *work)
 	/* do inband token exchange -
 	 * wait for and receive SMC Proposal CLC message
 	 */
-	pclc = (struct smc_clc_msg_proposal *)&buf;
-	rc = smc_clc_wait_msg(new_smc, pclc, SMC_CLC_MAX_LEN,
+	buf = kzalloc(sizeof(*buf), GFP_KERNEL);
+	if (!buf) {
+		rc = SMC_CLC_DECL_MEM;
+		goto out_decl;
+	}
+	pclc = (struct smc_clc_msg_proposal *)buf;
+	rc = smc_clc_wait_msg(new_smc, pclc, sizeof(*buf),
 			      SMC_CLC_PROPOSAL, CLC_WAIT_TIME);
 	if (rc)
 		goto out_decl;
@@ -1382,6 +1387,7 @@ static void smc_listen_work(struct work_struct *work)
 	}
 
 	/* finish worker */
+	kfree(buf);
 	if (!ism_supported) {
 		rc = smc_listen_rdma_finish(new_smc, &cclc,
 					    ini.first_contact_local);
@@ -1397,6 +1403,7 @@ static void smc_listen_work(struct work_struct *work)
 	mutex_unlock(&smc_server_lgr_pending);
 out_decl:
 	smc_listen_decline(new_smc, rc, ini.first_contact_local);
+	kfree(buf);
 }
 
 static void smc_tcp_listen_work(struct work_struct *work)
diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c
index c30fad120089..0c8e74faf5ca 100644
--- a/net/smc/smc_clc.c
+++ b/net/smc/smc_clc.c
@@ -153,7 +153,6 @@ static int smc_clc_prfx_set(struct socket *clcsock,
 	struct sockaddr_in *addr;
 	int rc = -ENOENT;
 
-	memset(prop, 0, sizeof(*prop));
 	if (!dst) {
 		rc = -ENOTCONN;
 		goto out;
@@ -412,76 +411,89 @@ int smc_clc_send_decline(struct smc_sock *smc, u32 peer_diag_info)
 int smc_clc_send_proposal(struct smc_sock *smc, int smc_type,
 			  struct smc_init_info *ini)
 {
-	struct smc_clc_ipv6_prefix ipv6_prfx[SMC_CLC_MAX_V6_PREFIX];
-	struct smc_clc_msg_proposal_prefix pclc_prfx;
-	struct smc_clc_msg_smcd pclc_smcd;
-	struct smc_clc_msg_proposal pclc;
-	struct smc_clc_msg_trail trl;
+	struct smc_clc_msg_proposal_prefix *pclc_prfx;
+	struct smc_clc_msg_proposal *pclc_base;
+	struct smc_clc_msg_proposal_area *pclc;
+	struct smc_clc_ipv6_prefix *ipv6_prfx;
+	struct smc_clc_msg_smcd *pclc_smcd;
+	struct smc_clc_msg_trail *trl;
 	int len, i, plen, rc;
 	int reason_code = 0;
 	struct kvec vec[5];
 	struct msghdr msg;
 
+	pclc = kzalloc(sizeof(*pclc), GFP_KERNEL);
+	if (!pclc)
+		return -ENOMEM;
+
+	pclc_base = &pclc->pclc_base;
+	pclc_smcd = &pclc->pclc_smcd;
+	pclc_prfx = &pclc->pclc_prfx;
+	ipv6_prfx = pclc->pclc_prfx_ipv6;
+	trl = &pclc->pclc_trl;
+
 	/* retrieve ip prefixes for CLC proposal msg */
-	rc = smc_clc_prfx_set(smc->clcsock, &pclc_prfx, ipv6_prfx);
-	if (rc)
+	rc = smc_clc_prfx_set(smc->clcsock, pclc_prfx, ipv6_prfx);
+	if (rc) {
+		kfree(pclc);
 		return SMC_CLC_DECL_CNFERR; /* configuration error */
+	}
 
 	/* send SMC Proposal CLC message */
-	plen = sizeof(pclc) + sizeof(pclc_prfx) +
-	       (pclc_prfx.ipv6_prefixes_cnt * sizeof(ipv6_prfx[0])) +
-	       sizeof(trl);
-	memset(&pclc, 0, sizeof(pclc));
-	memcpy(pclc.hdr.eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER));
-	pclc.hdr.type = SMC_CLC_PROPOSAL;
-	pclc.hdr.version = SMC_V1;		/* SMC version */
-	pclc.hdr.path = smc_type;
+	plen = sizeof(*pclc_base) + sizeof(*pclc_prfx) +
+	       (pclc_prfx->ipv6_prefixes_cnt * sizeof(ipv6_prfx[0])) +
+	       sizeof(*trl);
+	memcpy(pclc_base->hdr.eyecatcher, SMC_EYECATCHER,
+	       sizeof(SMC_EYECATCHER));
+	pclc_ba
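
Editorial note: the patch above is cut off in this archive. The pattern it
applies (move a large on-stack message buffer to the heap, zero it at
allocation time, and release it on every exit path) can be sketched in
userspace as follows. This is only an illustration: calloc()/free() stand
in for kzalloc()/kfree(), and MSG_AREA_LEN, struct msg_area and
handle_msg() are hypothetical names, not part of the SMC code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical size, standing in for the CLC buffer length. */
#define MSG_AREA_LEN 1024

struct msg_area {
	unsigned char data[MSG_AREA_LEN];
};

/*
 * Copy a message into a heap-allocated work area instead of a large
 * on-stack array. Returns 0 on success or a negative errno value.
 */
static int handle_msg(const unsigned char *src, size_t len)
{
	struct msg_area *buf;
	int rc = 0;

	assert(src != NULL);

	buf = calloc(1, sizeof(*buf));	/* zeroed, like kzalloc() */
	if (!buf)
		return -ENOMEM;

	if (len > sizeof(buf->data)) {
		rc = -EMSGSIZE;
		goto out;
	}
	memcpy(buf->data, src, len);
	/* ... parse buf->data here ... */
out:
	free(buf);	/* single release point for every exit after allocation */
	return rc;
}
```

Because the area is zeroed when it is allocated, a separate memset() of its
parts is not needed, which matches the memset() calls the patch removes.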
Re: [RFC] bonding driver terminology change proposal
On Thu, Jul 16, 2020 at 1:43 AM Jarod Wilson wrote:
>
> On Wed, Jul 15, 2020 at 11:18 PM Andrew Lunn wrote:
...
> > I really think that before we consider changes like this, somebody
> > needs to work on git tooling, so that it knows when mass renames have
> > happened, and can do the same sort of renames when cherry-picking
> > across the flag day. Without that, people trying to maintain stable
> > kernels are going to be very unhappy.
>
> I'm not familiar enough with git's internals to have a clue where to
> begin for something like that, but I suspect you're right. Doing
> blanket renames in stable branches sounds like a terrible idea, even
> if it would circumvent the cherry-pick issues. I guess now is as good
> a time as any to start poking around at git's internals...

I haven't forgotten about this, just been tied up with other work. I
spent a bit of time getting lost in git's internals, and the best idea
I've had suggested to me is some sort of cherry-pick hook that executes
an external script to massage variables back to old names for -stable
backporting. Could live somewhere in-tree, and maintainers would have to
know about it, but it would be reasonably painless.

Ideally, I was thinking a semantic patch to filter the backported patch
through, but haven't yet spent enough time playing with coccinelle to
know if that's actually a viable idea, since it's designed to run on C
code, not a patch, as I understand it. Worst-case, it'd be a shell
script doing some awk/sed/whatever.

--
Jarod Wilson
ja...@redhat.com
Re: [RFC] bonding driver terminology change proposal
On Thu, 16 Jul 2020 11:59:47 -0700 (PDT) David Miller wrote:
> From: Jarod Wilson
> Date: Wed, 15 Jul 2020 23:06:55 -0400
>
> > On Mon, Jul 13, 2020 at 9:00 PM David Miller wrote:
> >>
> >> From: Michal Kubecek
> >> Date: Tue, 14 Jul 2020 00:00:16 +0200
> >>
> >> > Could we, please, avoid breaking existing userspace tools and scripts?
> >>
> >> I will not let UAPI breakage, don't worry.
> >
> > Seeking some clarification here. Does the output of
> > /proc/net/bonding/ fall under that umbrella as well?
>
> Yes, anything user facing must not break.
>

For iproute2, I would like better wording on the command parameters
(but accept the old names so as not to break scripts). The old names
can be highlighted as being for compatibility only, or removed from the
manual and usage text.

Internally, variable names and function names can change in iproute2
since the internal APIs are not considered part of the user API.
Re: [RFC] bonding driver terminology change proposal
From: Jarod Wilson
Date: Wed, 15 Jul 2020 23:06:55 -0400

> On Mon, Jul 13, 2020 at 9:00 PM David Miller wrote:
>>
>> From: Michal Kubecek
>> Date: Tue, 14 Jul 2020 00:00:16 +0200
>>
>> > Could we, please, avoid breaking existing userspace tools and scripts?
>>
>> I will not let UAPI breakage, don't worry.
>
> Seeking some clarification here. Does the output of
> /proc/net/bonding/ fall under that umbrella as well?

Yes, anything user facing must not break.
Re: [RFC] bonding driver terminology change proposal
On Wed, Jul 15, 2020 at 11:18 PM Andrew Lunn wrote:
>
> On Wed, Jul 15, 2020 at 11:04:16PM -0400, Jarod Wilson wrote:
> > On Mon, Jul 13, 2020 at 8:26 PM Andrew Lunn wrote:
> > >
> > > Hi Jarod
> > >
> > > Do you have this change scripted? Could you apply the script to v5.4
> > > and then cherry-pick the 8 bonding fixes that exist in v5.4.51. How
> > > many result in conflicts?
> > >
> > > Could you do the same with v4.19...v4.19.132, which has 20 fixes.
> > >
> > > This will give us an idea of the maintenance overhead such a change is
> > > going to cause, and how good git is at figuring out this sort of
> > > thing.
> >
> > Okay, I have some fugly bash scripts that use sed to do the majority
> > of the work here, save some manual bits done to add duplicate
> > interfaces w/new names and some aliases, and everything is compiling
> > and functions in a basic smoke test here.
> >
> > Summary on the 5.4 git cherry-pick conflict resolution after applying
> > changes: not that good. 7 of the 8 bonding fixes in the 5.4 stable
> > branch required fixing when straight cherry-picking. Dumping the
> > patches, running a sed script over them, and then git am'ing them
> > works pretty well though.
>
> Hi Jarod
>
> That is what I was expecting.
>
> I really think that before we consider changes like this, somebody
> needs to work on git tooling, so that it knows when mass renames have
> happened, and can do the same sort of renames when cherry-picking
> across the flag day. Without that, people trying to maintain stable
> kernels are going to be very unhappy.

I'm not familiar enough with git's internals to have a clue where to
begin for something like that, but I suspect you're right. Doing
blanket renames in stable branches sounds like a terrible idea, even
if it would circumvent the cherry-pick issues. I guess now is as good
a time as any to start poking around at git's internals...

--
Jarod Wilson
ja...@redhat.com
Re: [RFC] bonding driver terminology change proposal
On Wed, Jul 15, 2020 at 11:04:16PM -0400, Jarod Wilson wrote:
> On Mon, Jul 13, 2020 at 8:26 PM Andrew Lunn wrote:
> >
> > Hi Jarod
> >
> > Do you have this change scripted? Could you apply the script to v5.4
> > and then cherry-pick the 8 bonding fixes that exist in v5.4.51. How
> > many result in conflicts?
> >
> > Could you do the same with v4.19...v4.19.132, which has 20 fixes.
> >
> > This will give us an idea of the maintenance overhead such a change is
> > going to cause, and how good git is at figuring out this sort of
> > thing.
>
> Okay, I have some fugly bash scripts that use sed to do the majority
> of the work here, save some manual bits done to add duplicate
> interfaces w/new names and some aliases, and everything is compiling
> and functions in a basic smoke test here.
>
> Summary on the 5.4 git cherry-pick conflict resolution after applying
> changes: not that good. 7 of the 8 bonding fixes in the 5.4 stable
> branch required fixing when straight cherry-picking. Dumping the
> patches, running a sed script over them, and then git am'ing them
> works pretty well though.

Hi Jarod

That is what I was expecting.

I really think that before we consider changes like this, somebody needs to work on git tooling, so that it knows when mass renames have happened, and can do the same sort of renames when cherry-picking across the flag day. Without that, people trying to maintain stable kernels are going to be very unhappy.

Andrew
Re: [RFC] bonding driver terminology change proposal
On Mon, Jul 13, 2020 at 9:00 PM David Miller wrote:
>
> From: Michal Kubecek
> Date: Tue, 14 Jul 2020 00:00:16 +0200
>
> > Could we, please, avoid breaking existing userspace tools and scripts?
>
> I will not allow UAPI breakage, don't worry.

Seeking some clarification here. Does the output of /proc/net/bonding/ fall under that umbrella as well? I'm sure there are people that do parse it for monitoring, and thus I assume that it does, but want to be certain. I think this is the only remaining thing I need to address in a local test conversion build.

--
Jarod Wilson
ja...@redhat.com
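To make the monitoring concern concrete, here is a minimal Python sketch of the kind of parser people point at this output. The sample text is an abbreviated approximation modelled on the example in the kernel's bonding documentation, not a capture from a real system, and the function name is hypothetical.

```python
# Abbreviated, illustrative sample; real output carries more fields.
SAMPLE = """\
Bonding Mode: load balancing (round-robin)
MII Status: up

Slave Interface: eth1
MII Status: up
Link Failure Count: 1

Slave Interface: eth0
MII Status: down
Link Failure Count: 3
"""


def parse_member_status(text):
    """Map each member interface to its MII status, keyed on the field labels."""
    status = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or no "label: value" pair
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            status[current] = value
    return status


print(parse_member_status(SAMPLE))
# Renaming the label makes the same parser silently find no members.
print(parse_member_status(SAMPLE.replace("Slave Interface", "Port Interface")))
```

A script like this keys on the literal label text, so changing that text breaks it without raising any error, which is presumably why the proc output gets treated like any other user-facing interface.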
Re: [RFC] bonding driver terminology change proposal
On Mon, Jul 13, 2020 at 8:26 PM Andrew Lunn wrote:
>
> Hi Jarod
>
> Do you have this change scripted? Could you apply the script to v5.4
> and then cherry-pick the 8 bonding fixes that exist in v5.4.51. How
> many result in conflicts?
>
> Could you do the same with v4.19...v4.19.132, which has 20 fixes.
>
> This will give us an idea of the maintenance overhead such a change is
> going to cause, and how good git is at figuring out this sort of
> thing.

Okay, I have some fugly bash scripts that use sed to do the majority of the work here, save some manual bits done to add duplicate interfaces w/new names and some aliases, and everything is compiling and functions in a basic smoke test here.

Summary on the 5.4 git cherry-pick conflict resolution after applying changes: not that good. 7 of the 8 bonding fixes in the 5.4 stable branch required fixing when straight cherry-picking. Dumping the patches, running a sed script over them, and then git am'ing them works pretty well though.

I didn't try 4.19 (yet?), I assume it'll just be more of the same.

--
Jarod Wilson
ja...@redhat.com
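The "dump the patches, run a script over them, then git am" flow could be sketched roughly as below, in Python rather than sed. This is an assumption-laden illustration: the actual scripts were not posted, the rename table simply uses the bundle/cable pair floated in the RFC, the helper names are hypothetical, and the hunk is made up rather than taken from a real stable fix.

```python
import re

# Hypothetical rename table (the pair floated in the RFC); the real sed scripts were not posted.
RENAMES = {"master": "bundle", "slave": "cable"}

# Skip "enslave" so names like bond_enslave and ifenslave are left for manual handling.
PATTERN = re.compile(r"master|(?<!en)slave", re.IGNORECASE)


def match_case(old, new):
    """Give the replacement the same capitalisation as the text it replaces."""
    if old.isupper():
        return new.upper()
    if old[0].isupper():
        return new.capitalize()
    return new


def rename_patch(patch_text):
    """Rewrite every line of a patch so its hunks match an already-renamed tree."""
    return PATTERN.sub(
        lambda m: match_case(m.group(0), RENAMES[m.group(0).lower()]),
        patch_text,
    )


# Made-up hunk for illustration, not a real stable fix.
PATCH = """\
@@ -1,3 +1,3 @@
 struct slave *slave = bond_first_slave(bond);
-if (!slave)
+if (!slave || !SLAVE_AD_INFO(slave))
     return bond_enslave(bond_dev, dev);
"""

print(rename_patch(PATCH))
```

Context, removed and added lines are all rewritten, since the context has to match the renamed tree for the hunk to apply. The rewritten text would then be saved and handed to git am; anything the pattern deliberately skips still needs a manual pass.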
Re: [RFC] bonding driver terminology change proposal
On Wed, Jul 15, 2020 at 8:57 AM Edward Cree wrote:
>
> Once again, the opinions below are my own and definitely do not
> represent anything my employer would be seen dead in the same
> room as.
>
> On 13/07/2020 23:41, Stephen Hemminger wrote:
> > As far as userspace, maybe keep the old APIs but provide deprecation nags.
> Why would you need to deprecate the old APIs?
> If the user echoes 'slave' into some sysfs file (or whatever), that
> indicates that they don't have any problem with using the word.
> So there's no reason to ever remove that support — its _mere
> existence_ isn't problematic for anyone not actively seeking to be
> offended.
> Which I think is more evidence that this change is not motivated by
> practical concerns but by a kind of performative ritual purity.
>
> This is dumb. I suspect you all, including Jarod, know that this
> is dumb, but you're either going along with it or keeping your
> head down in the hope that it will all blow over and you can go
> back to normal. Unfortunately, it doesn't work like that; the
> activists who push this stuff are never satisfied; making
> concessions to them results not in peace but in further demands;
> and just as the corporations today are caving to the current
> demands for fear of being singled out by the mob, so they will
> cave again to the next round of demands, and you'll be back in
> the same position, trying to deal with bosses wanting you to
> break uAPI without even a technical reason.
> And next time around, the mob will be bolder and the bosses more
> pliant, because by giving in this time we'll have signalled that
> we're weak and easily dominated. I would advise anyone still in
> doubt of this point to read Kipling's poem "Dane-geld".
> And we'll all be left wondering why kernel development is so
> soulless and joyless that no-one, of _any_ colour, aspires to
> become a kernel hacker any more.
>
> It's not too late to stop the crazy, if we all just stop
> pretending it's sane.

No, it isn't a practical code concern motivating this change, it's actually quite impractical from a code standpoint and has no technical merit. I understand your position, but having seen many emotional responses to issues surrounding this, I think it's a worthwhile effort that many people actually do appreciate. Even if I'm not personally offended by the terminology, as a white male, I don't think I possess the life experiences to downplay the negative impact ongoing use of terms like "slave" might have on people that are actual descendants of slavery. Embracing and helping move forward social change seems like a responsible thing to do here, as long as we can do it without breaking the kernel and UAPI.

--
Jarod Wilson
ja...@redhat.com
Re: [RFC] bonding driver terminology change proposal
Once again, the opinions below are my own and definitely do not
represent anything my employer would be seen dead in the same
room as.

On 13/07/2020 23:41, Stephen Hemminger wrote:
> As far as userspace, maybe keep the old APIs but provide deprecation nags.
Why would you need to deprecate the old APIs?
If the user echoes 'slave' into some sysfs file (or whatever), that
indicates that they don't have any problem with using the word.
So there's no reason to ever remove that support — its _mere
existence_ isn't problematic for anyone not actively seeking to be
offended.
Which I think is more evidence that this change is not motivated by
practical concerns but by a kind of performative ritual purity.

This is dumb. I suspect you all, including Jarod, know that this
is dumb, but you're either going along with it or keeping your
head down in the hope that it will all blow over and you can go
back to normal. Unfortunately, it doesn't work like that; the
activists who push this stuff are never satisfied; making
concessions to them results not in peace but in further demands;
and just as the corporations today are caving to the current
demands for fear of being singled out by the mob, so they will
cave again to the next round of demands, and you'll be back in
the same position, trying to deal with bosses wanting you to
break uAPI without even a technical reason.
And next time around, the mob will be bolder and the bosses more
pliant, because by giving in this time we'll have signalled that
we're weak and easily dominated. I would advise anyone still in
doubt of this point to read Kipling's poem "Dane-geld".
And we'll all be left wondering why kernel development is so
soulless and joyless that no-one, of _any_ colour, aspires to
become a kernel hacker any more.

It's not too late to stop the crazy, if we all just stop
pretending it's sane.

-ed
Re: [RFC] bonding driver terminology change proposal
On Tue, Jul 14, 2020 at 4:39 PM Marcelo Ricardo Leitner wrote: > > On Tue, Jul 14, 2020 at 09:17:48PM +0200, Toke Høiland-Jørgensen wrote: > > Jarod Wilson writes: > > > > > As part of an effort to help enact social change, Red Hat is > > > committing to efforts to eliminate any problematic terminology from > > > any of the software that it ships and supports. Front and center for > > > me personally in that effort is the bonding driver's use of the terms > > > master and slave, and to a lesser extent, bond and bonding, due to > > > bondage being another term for slavery. Most people in computer > > > science understand these terms aren't intended to be offensive or > > > oppressive, and have well understood meanings in computing, but > > > nonetheless, they still present an open wound, and a barrier for > > > participation and inclusion to some. > > > > > > To start out with, I'd like to attempt to eliminate as much of the use > > > of master and slave in the bonding driver as possible. For the most > > > part, I think this can be done without breaking UAPI, but may require > > > changes to anything accessing bond info via proc or sysfs. > > > > > > My initial thought was to rename master to aggregator and slaves to > > > ports, but... that gets really messy with the existing 802.3ad bonding > > > code using both extensively already. I've given thought to a number of > > > other possible combinations, but the one that I'm liking the most is > > > master -> bundle and slave -> cable, for a number of reasons. I'd > > > considered cable and wire, as a cable is a grouping of individual > > > wires, but we're grouping together cables, really -- each bonded > > > ethernet interface has a cable connected, so a bundle of cables makes > > > sense visually and figuratively. Additionally, it's a swap made easier > > > in the codebase by master and bundle and slave and cable having the > > > same number of characters, respectively. 
Granted though, "bundle" > > > doesn't suggest "runs the show" the way "master" or something like > > > maybe "director" or "parent" does, but those lack the visual aspect > > > present with a bundle of cables. Using parent/child could work too > > > though, it's perhaps closer to the master/slave terminology currently > > > in use as far as literal meaning. > > > > I've always thought of it as a "bond device" which has other netdevs as > > "components" (as in 'things that are part of'). So maybe > > "main"/"component" or something to that effect? > > Same here, and it's pretty much like how I see the bridge as well. > "bridge device" and "legs". I did toy with the idea of "torso" or "thorax" for the bond aggregate device and "legs" for the bond components, but at this point, I guess it's mostly bikeshedding, the bigger issue is "how messy would it be?". I've scripted most of the changes, but not all of them. Still working on it... :) -- Jarod Wilson ja...@redhat.com
Re: [RFC] bonding driver terminology change proposal
On Tue, Jul 14, 2020 at 09:17:48PM +0200, Toke Høiland-Jørgensen wrote: > Jarod Wilson writes: > > > As part of an effort to help enact social change, Red Hat is > > committing to efforts to eliminate any problematic terminology from > > any of the software that it ships and supports. Front and center for > > me personally in that effort is the bonding driver's use of the terms > > master and slave, and to a lesser extent, bond and bonding, due to > > bondage being another term for slavery. Most people in computer > > science understand these terms aren't intended to be offensive or > > oppressive, and have well understood meanings in computing, but > > nonetheless, they still present an open wound, and a barrier for > > participation and inclusion to some. > > > > To start out with, I'd like to attempt to eliminate as much of the use > > of master and slave in the bonding driver as possible. For the most > > part, I think this can be done without breaking UAPI, but may require > > changes to anything accessing bond info via proc or sysfs. > > > > My initial thought was to rename master to aggregator and slaves to > > ports, but... that gets really messy with the existing 802.3ad bonding > > code using both extensively already. I've given thought to a number of > > other possible combinations, but the one that I'm liking the most is > > master -> bundle and slave -> cable, for a number of reasons. I'd > > considered cable and wire, as a cable is a grouping of individual > > wires, but we're grouping together cables, really -- each bonded > > ethernet interface has a cable connected, so a bundle of cables makes > > sense visually and figuratively. Additionally, it's a swap made easier > > in the codebase by master and bundle and slave and cable having the > > same number of characters, respectively. 
Granted though, "bundle" > > doesn't suggest "runs the show" the way "master" or something like > > maybe "director" or "parent" does, but those lack the visual aspect > > present with a bundle of cables. Using parent/child could work too > > though, it's perhaps closer to the master/slave terminology currently > > in use as far as literal meaning. > > I've always thought of it as a "bond device" which has other netdevs as > "components" (as in 'things that are part of'). So maybe > "main"/"component" or something to that effect? Same here, and it's pretty much like how I see the bridge as well. "bridge device" and "legs". Marcelo
Re: [RFC] bonding driver terminology change proposal
Jarod Wilson writes: > As part of an effort to help enact social change, Red Hat is > committing to efforts to eliminate any problematic terminology from > any of the software that it ships and supports. Front and center for > me personally in that effort is the bonding driver's use of the terms > master and slave, and to a lesser extent, bond and bonding, due to > bondage being another term for slavery. Most people in computer > science understand these terms aren't intended to be offensive or > oppressive, and have well understood meanings in computing, but > nonetheless, they still present an open wound, and a barrier for > participation and inclusion to some. > > To start out with, I'd like to attempt to eliminate as much of the use > of master and slave in the bonding driver as possible. For the most > part, I think this can be done without breaking UAPI, but may require > changes to anything accessing bond info via proc or sysfs. > > My initial thought was to rename master to aggregator and slaves to > ports, but... that gets really messy with the existing 802.3ad bonding > code using both extensively already. I've given thought to a number of > other possible combinations, but the one that I'm liking the most is > master -> bundle and slave -> cable, for a number of reasons. I'd > considered cable and wire, as a cable is a grouping of individual > wires, but we're grouping together cables, really -- each bonded > ethernet interface has a cable connected, so a bundle of cables makes > sense visually and figuratively. Additionally, it's a swap made easier > in the codebase by master and bundle and slave and cable having the > same number of characters, respectively. Granted though, "bundle" > doesn't suggest "runs the show" the way "master" or something like > maybe "director" or "parent" does, but those lack the visual aspect > present with a bundle of cables. 
Using parent/child could work too > though, it's perhaps closer to the master/slave terminology currently > in use as far as literal meaning. I've always thought of it as a "bond device" which has other netdevs as "components" (as in 'things that are part of'). So maybe "main"/"component" or something to that effect? -Toke
Re: [RFC] bonding driver terminology change proposal
On Mon, Jul 13, 2020 at 8:55 PM Jay Vosburgh wrote: > > Stephen Hemminger wrote: > > >On Tue, 14 Jul 2020 00:00:16 +0200 > >Michal Kubecek wrote: > > > >> On Mon, Jul 13, 2020 at 02:51:39PM -0400, Jarod Wilson wrote: > >> > To start out with, I'd like to attempt to eliminate as much of the use > >> > of master and slave in the bonding driver as possible. For the most > >> > part, I think this can be done without breaking UAPI, but may require > >> > changes to anything accessing bond info via proc or sysfs. > >> > >> Could we, please, avoid breaking existing userspace tools and scripts? > >> Massive code churn is one thing and we could certainly bite the bullet > >> and live with it (even if I'm still not convinced it would be as great > >> idea as some present it) but trading theoretical offense for real and > >> palpable harm to existing users is something completely different. > >> > >> Or is "don't break userspace" no longer the "first commandment" of linux > >> kernel development? > >> > >> Michal Kubecek > > > >Please consider using same wording as current standard for link aggregration. > >Current version is 802.1AX and it uses the terms: > > Multiplexer / Aggregator > > Well, 802.1AX only defines LACP, and the bonding driver does > more than just LACP. Also, Multiplexer, in 802.1AX, is a function of > various components, e.g., each Aggregator has a Multiplexer, as do other > components. > > As "channel bonding" is a long-established term of art, I don't > see an issue with something like "bond" and "port," which parallels the > bridge / port terminology. I did look at aggregator and port as options, but the overlap with the bonding 802.3ad code would mean first reworking a bunch of that code to free up those terms for more general bonding use. 
I think "bonding" should be okay to keep around as well, and am kind of on the fence with "master", since master of ceremonies, master's degrees, master keys, etc are all similar enough to what a master device in a bond represents, and the main objectionable language is primarily "slave". One option would be to rename "port" to "laggport" or "adport" or something like that in the 802.3ad code, and then make use of "port" in place of slave (which mirrors what's done in the team driver).

> [...]
> >As far as userspace, maybe keep the old APIs but provide deprecation nags.
> >And don't document the old API values.
>
> Unless the community stance on not breaking user space has
> changed, the extant APIs must be maintained. In the context of bonding,
> this would include "ip link" command line arguments, sysfs and procfs
> interfaces, as well as netlink attribute names. There are also exported
> kernel APIs that bonding utilizes, netdev_master_upper_dev_link, et al.

To some people, this could be a case that warranted breaking UAPIs. In an ideal world, that would be nice, but obviously, breaking the world to get there isn't good either, so I think maintaining them all is hopefully still understandable.

> Additionally, just to be absolutely clear, is the proposal here
> intending to undertake a rather significant search and replace of the
> text strings "master" and "slave" within the bonding driver source?
> This in addition to whatever API changes end up being done. If so, then
> I would also like to know the answer to Andrew's question regarding
> patch conflicts in order to gauge the future maintenance cost.

Correct, this would be full search-and-replace, with minor tweaks here and there -- bond_enslave -> bond_connect or something like that, since bond_encable wouldn't make sense, and replacing references to ifenslave in the code isn't helpful, since ifenslave is still going to be called ifenslave.
As of yet, no, I don't have this scripted, but I can certainly give that a go. I'm not terribly familiar with coccinelle, and if that would be the way to script it, or if a simple bash/perl/whatever script would suffice. -- Jarod Wilson ja...@redhat.com
Re: [RFC] bonding driver terminology change proposal
On Mon, Jul 13, 2020 at 6:00 PM Michal Kubecek wrote:
>
> On Mon, Jul 13, 2020 at 02:51:39PM -0400, Jarod Wilson wrote:
> > To start out with, I'd like to attempt to eliminate as much of the use
> > of master and slave in the bonding driver as possible. For the most
> > part, I think this can be done without breaking UAPI, but may require
> > changes to anything accessing bond info via proc or sysfs.
>
> Could we, please, avoid breaking existing userspace tools and scripts?
> Massive code churn is one thing and we could certainly bite the bullet
> and live with it (even if I'm still not convinced it would be as great
> an idea as some present it) but trading theoretical offense for real and
> palpable harm to existing users is something completely different.
>
> Or is "don't break userspace" no longer the "first commandment" of linux
> kernel development?

Definitely looking to minimize breakage here, and it sounds like it'll be to the point of "none", or this won't fly. I think this may require having "legacy" aliases for certain interfaces and the like, to both provide a less problematic interface name as the new default, but prevent breaking any existing setups.

--
Jarod Wilson
ja...@redhat.com
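The "legacy alias" idea can be sketched in a few lines of Python. This only illustrates the pattern (one handler, a new default name, and the old name still resolving to it); the class, the attribute name "ports" and the handler are all hypothetical and say nothing about how the driver would actually implement it.

```python
class AttributeTable:
    """Attribute registry where legacy names resolve to the canonical entry."""

    def __init__(self):
        self._handlers = {}
        self._aliases = {}

    def register(self, name, handler, legacy=()):
        # The canonical name owns the handler; legacy names only point at it.
        self._handlers[name] = handler
        for old in legacy:
            self._aliases[old] = name

    def lookup(self, name):
        # Accept either spelling, so existing callers keep working.
        return self._handlers[self._aliases.get(name, name)]

    def names(self):
        # Only canonical names are advertised as the default.
        return sorted(self._handlers)


table = AttributeTable()
# "ports" is a made-up new name used purely for illustration.
table.register("ports", lambda: ["eth0", "eth1"], legacy=("slaves",))

print(table.lookup("slaves")() == table.lookup("ports")())
print(table.names())
```

Because both names reach the same handler, an existing script using the old name sees identical behaviour, while listings and documentation can show only the new one.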
Re: [RFC] bonding driver terminology change proposal
On Mon, Jul 13, 2020 at 5:36 PM Eric Dumazet wrote: > > On 7/13/20 11:51 AM, Jarod Wilson wrote: > > As part of an effort to help enact social change, Red Hat is > > committing to efforts to eliminate any problematic terminology from > > any of the software that it ships and supports. Front and center for > > me personally in that effort is the bonding driver's use of the terms > > master and slave, and to a lesser extent, bond and bonding, due to > > bondage being another term for slavery. Most people in computer > > science understand these terms aren't intended to be offensive or > > oppressive, and have well understood meanings in computing, but > > nonetheless, they still present an open wound, and a barrier for > > participation and inclusion to some. > > > > To start out with, I'd like to attempt to eliminate as much of the use > > of master and slave in the bonding driver as possible. For the most > > part, I think this can be done without breaking UAPI, but may require > > changes to anything accessing bond info via proc or sysfs. > > > > My initial thought was to rename master to aggregator and slaves to > > ports, but... that gets really messy with the existing 802.3ad bonding > > code using both extensively already. I've given thought to a number of > > other possible combinations, but the one that I'm liking the most is > > master -> bundle and slave -> cable, for a number of reasons. I'd > > considered cable and wire, as a cable is a grouping of individual > > wires, but we're grouping together cables, really -- each bonded > > ethernet interface has a cable connected, so a bundle of cables makes > > sense visually and figuratively. Additionally, it's a swap made easier > > in the codebase by master and bundle and slave and cable having the > > same number of characters, respectively. 
Granted though, "bundle" > > doesn't suggest "runs the show" the way "master" or something like > > maybe "director" or "parent" does, but those lack the visual aspect > > present with a bundle of cables. Using parent/child could work too > > though, it's perhaps closer to the master/slave terminology currently > > in use as far as literal meaning. > > > > So... Thoughts? > > > > So you considered : aggregator/ports, bundle/cable. > > I thought about cord/strand, since this is less likely to be used already in > networking land > (like worker, thread, fiber, or wire ...) > > Although a cord with two strands is probably not very common :/ I'd also thought about cable and wire, since there are multiple physical wires inside an ethernet cable, but you typically connect one cable per port, so a bundle of cables seemed to make more sense. :) I also had a few other ideas I played with, including a bundle of pipes and a pipework of pipes (which is apparently a thing, but not very common either, outside of maybe plumbers?). -- Jarod Wilson ja...@redhat.com
Re: [RFC] bonding driver terminology change proposal
From: Michal Kubecek
Date: Tue, 14 Jul 2020 00:00:16 +0200

> Could we, please, avoid breaking existing userspace tools and scripts?

I will not allow UAPI breakage, don't worry.
Re: [RFC] bonding driver terminology change proposal
Stephen Hemminger wrote:

>On Tue, 14 Jul 2020 00:00:16 +0200
>Michal Kubecek wrote:
>
>> On Mon, Jul 13, 2020 at 02:51:39PM -0400, Jarod Wilson wrote:
>> > To start out with, I'd like to attempt to eliminate as much of the use
>> > of master and slave in the bonding driver as possible. For the most
>> > part, I think this can be done without breaking UAPI, but may require
>> > changes to anything accessing bond info via proc or sysfs.
>>
>> Could we, please, avoid breaking existing userspace tools and scripts?
>> Massive code churn is one thing and we could certainly bite the bullet
>> and live with it (even if I'm still not convinced it would be as great
>> an idea as some present it) but trading theoretical offense for real and
>> palpable harm to existing users is something completely different.
>>
>> Or is "don't break userspace" no longer the "first commandment" of linux
>> kernel development?
>>
>> Michal Kubecek
>
>Please consider using same wording as current standard for link aggregation.
>Current version is 802.1AX and it uses the terms:
> Multiplexer / Aggregator

Well, 802.1AX only defines LACP, and the bonding driver does more than just LACP. Also, Multiplexer, in 802.1AX, is a function of various components, e.g., each Aggregator has a Multiplexer, as do other components.

As "channel bonding" is a long-established term of art, I don't see an issue with something like "bond" and "port," which parallels the bridge / port terminology.

[...]

>As far as userspace, maybe keep the old APIs but provide deprecation nags.
>And don't document the old API values.

Unless the community stance on not breaking user space has changed, the extant APIs must be maintained. In the context of bonding, this would include "ip link" command line arguments, sysfs and procfs interfaces, as well as netlink attribute names. There are also exported kernel APIs that bonding utilizes, netdev_master_upper_dev_link, et al.

Additionally, just to be absolutely clear, is the proposal here intending to undertake a rather significant search and replace of the text strings "master" and "slave" within the bonding driver source? This in addition to whatever API changes end up being done. If so, then I would also like to know the answer to Andrew's question regarding patch conflicts in order to gauge the future maintenance cost.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
Re: [RFC] bonding driver terminology change proposal
From: Jarod Wilson
Date: Mon, 13 Jul 2020 14:51:39 -0400

> To start out with, I'd like to attempt to eliminate as much of the use
> of master and slave in the bonding driver as possible. For the most
> part, I think this can be done without breaking UAPI, but may require
> changes to anything accessing bond info via proc or sysfs.

You can change what you want internally to the driver in order to meet this objective, but I am positively sure that external facing UAPI has to be retained.
Re: [RFC] bonding driver terminology change proposal
Hi Jarod

Do you have this change scripted? Could you apply the script to v5.4 and then cherry-pick the 8 bonding fixes that exist in v5.4.51. How many result in conflicts?

Could you do the same with v4.19...v4.19.132, which has 20 fixes.

This will give us an idea of the maintenance overhead such a change is going to cause, and how good git is at figuring out this sort of thing.

Andrew
Re: [RFC] bonding driver terminology change proposal
On Mon, Jul 13, 2020 at 03:41:18PM -0700, Stephen Hemminger wrote:
> On Tue, 14 Jul 2020 00:00:16 +0200
> Michal Kubecek wrote:
>
> > On Mon, Jul 13, 2020 at 02:51:39PM -0400, Jarod Wilson wrote:
> > > To start out with, I'd like to attempt to eliminate as much of the use
> > > of master and slave in the bonding driver as possible. For the most
> > > part, I think this can be done without breaking UAPI, but may require
> > > changes to anything accessing bond info via proc or sysfs.
> >
> > Could we, please, avoid breaking existing userspace tools and scripts?
> > Massive code churn is one thing and we could certainly bite the bullet
> > and live with it (even if I'm still not convinced it would be as great
> > an idea as some present it) but trading theoretical offense for real and
> > palpable harm to existing users is something completely different.
> >
> > Or is "don't break userspace" no longer the "first commandment" of linux
> > kernel development?
> >
> > Michal Kubecek
>
> Please consider using same wording as current standard for link aggregation.
> Current version is 802.1AX and it uses the terms:
> Multiplexer / Aggregator

But both of these are replacements for "master", right?

> As far as userspace, maybe keep the old APIs but provide deprecation nags.
> And don't document the old API values.

I'm not a fan of nagging users. And even less of a fan of undocumented keyword and value aliases.

Michal
Re: [RFC] bonding driver terminology change proposal
On Tue, 14 Jul 2020 00:00:16 +0200
Michal Kubecek wrote:

> On Mon, Jul 13, 2020 at 02:51:39PM -0400, Jarod Wilson wrote:
> > To start out with, I'd like to attempt to eliminate as much of the use
> > of master and slave in the bonding driver as possible. For the most
> > part, I think this can be done without breaking UAPI, but may require
> > changes to anything accessing bond info via proc or sysfs.
>
> Could we, please, avoid breaking existing userspace tools and scripts?
> Massive code churn is one thing and we could certainly bite the bullet
> and live with it (even if I'm still not convinced it would be as great
> an idea as some present it) but trading theoretical offense for real and
> palpable harm to existing users is something completely different.
>
> Or is "don't break userspace" no longer the "first commandment" of linux
> kernel development?
>
> Michal Kubecek

Please consider using same wording as current standard for link aggregation.
Current version is 802.1AX and it uses the terms:
 Multiplexer / Aggregator

There are no uses of master or slave in 802.1AX standard.

As far as userspace, maybe keep the old APIs but provide deprecation nags.
And don't document the old API values.
Re: [RFC] bonding driver terminology change proposal
On Mon, Jul 13, 2020 at 02:51:39PM -0400, Jarod Wilson wrote:
> To start out with, I'd like to attempt to eliminate as much of the use
> of master and slave in the bonding driver as possible. For the most
> part, I think this can be done without breaking UAPI, but may require
> changes to anything accessing bond info via proc or sysfs.

Could we, please, avoid breaking existing userspace tools and scripts? Massive code churn is one thing and we could certainly bite the bullet and live with it (even if I'm still not convinced it would be as great an idea as some present it) but trading theoretical offense for real and palpable harm to existing users is something completely different.

Or is "don't break userspace" no longer the "first commandment" of linux kernel development?

Michal Kubecek
Re: [RFC] bonding driver terminology change proposal
On 7/13/20 11:51 AM, Jarod Wilson wrote: > As part of an effort to help enact social change, Red Hat is > committing to efforts to eliminate any problematic terminology from > any of the software that it ships and supports. Front and center for > me personally in that effort is the bonding driver's use of the terms > master and slave, and to a lesser extent, bond and bonding, due to > bondage being another term for slavery. Most people in computer > science understand these terms aren't intended to be offensive or > oppressive, and have well understood meanings in computing, but > nonetheless, they still present an open wound, and a barrier for > participation and inclusion to some. > > To start out with, I'd like to attempt to eliminate as much of the use > of master and slave in the bonding driver as possible. For the most > part, I think this can be done without breaking UAPI, but may require > changes to anything accessing bond info via proc or sysfs. > > My initial thought was to rename master to aggregator and slaves to > ports, but... that gets really messy with the existing 802.3ad bonding > code using both extensively already. I've given thought to a number of > other possible combinations, but the one that I'm liking the most is > master -> bundle and slave -> cable, for a number of reasons. I'd > considered cable and wire, as a cable is a grouping of individual > wires, but we're grouping together cables, really -- each bonded > ethernet interface has a cable connected, so a bundle of cables makes > sense visually and figuratively. Additionally, it's a swap made easier > in the codebase by master and bundle and slave and cable having the > same number of characters, respectively. Granted though, "bundle" > doesn't suggest "runs the show" the way "master" or something like > maybe "director" or "parent" does, but those lack the visual aspect > present with a bundle of cables. 
Using parent/child could work too > though, it's perhaps closer to the master/slave terminology currently > in use as far as literal meaning. > > So... Thoughts? > So you considered: aggregator/ports, bundle/cable. I thought about cord/strand, since these terms are less likely to be in use already in networking land (unlike worker, thread, fiber, or wire ...). Although a cord with two strands is probably not very common :/
[RFC] bonding driver terminology change proposal
As part of an effort to help enact social change, Red Hat is committing to efforts to eliminate any problematic terminology from any of the software that it ships and supports. Front and center for me personally in that effort is the bonding driver's use of the terms master and slave, and to a lesser extent, bond and bonding, due to bondage being another term for slavery. Most people in computer science understand these terms aren't intended to be offensive or oppressive, and have well understood meanings in computing, but nonetheless, they still present an open wound, and a barrier for participation and inclusion to some. To start out with, I'd like to attempt to eliminate as much of the use of master and slave in the bonding driver as possible. For the most part, I think this can be done without breaking UAPI, but may require changes to anything accessing bond info via proc or sysfs. My initial thought was to rename master to aggregator and slaves to ports, but... that gets really messy with the existing 802.3ad bonding code using both extensively already. I've given thought to a number of other possible combinations, but the one that I'm liking the most is master -> bundle and slave -> cable, for a number of reasons. I'd considered cable and wire, as a cable is a grouping of individual wires, but we're grouping together cables, really -- each bonded ethernet interface has a cable connected, so a bundle of cables makes sense visually and figuratively. Additionally, it's a swap made easier in the codebase by master and bundle and slave and cable having the same number of characters, respectively. Granted though, "bundle" doesn't suggest "runs the show" the way "master" or something like maybe "director" or "parent" does, but those lack the visual aspect present with a bundle of cables. Using parent/child could work too though, it's perhaps closer to the master/slave terminology currently in use as far as literal meaning. So... Thoughts? 
For reference, a work-in-progress adaptation from master/slave to bundle/cable has a diffstat that is currently summarized as: 37 files changed, 2607 insertions(+), 2571 deletions(-) -- Jarod Wilson ja...@redhat.com
Re: Proposal: r8152 firmware patching framework
(Narrowing the recipient list for now) On Tue, Sep 3, 2019 at 3:50 PM David Miller wrote: > > From: Prashant Malani > Date: Tue, 3 Sep 2019 14:32:01 -0700 > > > I've moved David to the TO list to hopefully get his suggestions and > > guidance about how to design this in a upstream-compatible way. > > I am not an expert in this area so please do not solicit my opinion. Noted. My apologies. > > Thank you.
Re: Proposal: r8152 firmware patching framework
From: Prashant Malani Date: Tue, 3 Sep 2019 14:32:01 -0700 > I've moved David to the TO list to hopefully get his suggestions and > guidance about how to design this in a upstream-compatible way. I am not an expert in this area so please do not solicit my opinion. Thank you.
Re: Proposal: r8152 firmware patching framework
Hi Bambi, Thank you for your response. We'd be more than happy to assist in working out a solution that would be acceptable by the upstream maintainers. I think having a maintainable and safe way to deploy firmware fixes would be much appreciated by hardware users as well as upstream devs, and certainly more manageable than big static byte-arrays in the source code! I've moved David to the TO list to hopefully get his suggestions and guidance about how to design this in a upstream-compatible way. I'd be happy to implement it too (I feel this can occur concurrent to Hayes' upstreaming efforts). David, could you kindly advise the best way to incorporate deploying these firmware patches? This change link gives an idea of what we're dealing with: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1417953 My original strawman is to just have a simple firmware format like so: ... The driver code can have parts to deal with each section in an appropriate fashion (e.g is each data entry a word or a byte? does this section have a key which needs to be written to a certain register etc.) We'd be grateful if you can offer your advice about best practices (or suggestions about who might be a good reviewer), so that we can have a design in place before sending out any patches. Thanks and best regards, -Prashant On Tue, Sep 3, 2019 at 2:01 AM Bambi Yeh wrote: > > Hi Prashant: > > We will try to implement your requests. > Based on our experience, upstream reviewer often reject our modification if > they have any concern. > Do you think you can talk to them about this idea and see if they will accept > it or not? > Or if you can help on this after we submit it? > > Also, Hayes is now updating our current upstream driver and it goes back and > forth for a while. > So we will need some time to finish it and the target schedule to have your > request done is in the end of this month. > > Thank you very much. 
> > Best Regards, > Bambi Yeh > > -Original Message- > From: Hayes Wang > Sent: Monday, September 2, 2019 2:31 PM > To: Amber Chen ; Prashant Malani > > Cc: David Miller ; netdev@vger.kernel.org; Bambi Yeh > ; Ryankao ; Jackc > ; Albertk ; marcoc...@google.com; > nic_swsd ; Grant Grundler > Subject: RE: Proposal: r8152 firmware patching framework > > Prashant Malani > > > > > > (Adding a few more Realtek folks) > > > > > > Friendly ping. Any thoughts / feedback, Realtek folks (and others) ? > > > > > >> On Thu, Aug 29, 2019 at 11:40 AM Prashant Malani > > wrote: > > >> > > >> Hi, > > >> > > >> The r8152 driver source code distributed by Realtek (on > > >> www.realtek.com) contains firmware patches. This involves binary > > >> byte-arrays being written byte/word-wise to the hardware memory > > >> Example: grund...@chromium.org (cc-ed) has an experimental patch > > which > > >> includes the firmware patching code which was distributed with the > > >> Realtek source : > > >> > > https://chromium-review.googlesource.com/c/chromiumos/third_party/kern > > el > > /+/1417953 > > >> > > >> It would be nice to have a way to incorporate these firmware fixes > > >> into the upstream code. Since having indecipherable byte-arrays is > > >> not possible upstream, I propose the following: > > >> - We use the assistance of Realtek to come up with a format which > > >> the firmware patch files can follow (this can be documented in the > > >> comments). > > >> - A real simple format could look like this: > > >> + > > >> > > ... > N > > >... > > >>+ The driver would be able to understand how to > > >> parse each section (e.g is each data entry a byte or a word?) > > >> > > >> - We use request_firmware() to load the firmware, parse it and > > >> write the data to the relevant registers. > > I plan to finish the patches which I am going to submit, first. Then, I could > focus on this. However, I don't think I would start this quickly. 
There are > many preparations and they would take me a lot of time. > > Best Regards, > Hayes > >
RE: Proposal: r8152 firmware patching framework
Hi Prashant: We will try to implement your requests. Based on our experience, upstream reviewer often reject our modification if they have any concern. Do you think you can talk to them about this idea and see if they will accept it or not? Or if you can help on this after we submit it? Also, Hayes is now updating our current upstream driver and it goes back and forth for a while. So we will need some time to finish it and the target schedule to have your request done is in the end of this month. Thank you very much. Best Regards, Bambi Yeh -Original Message- From: Hayes Wang Sent: Monday, September 2, 2019 2:31 PM To: Amber Chen ; Prashant Malani Cc: David Miller ; netdev@vger.kernel.org; Bambi Yeh ; Ryankao ; Jackc ; Albertk ; marcoc...@google.com; nic_swsd ; Grant Grundler Subject: RE: Proposal: r8152 firmware patching framework Prashant Malani > > > > (Adding a few more Realtek folks) > > > > Friendly ping. Any thoughts / feedback, Realtek folks (and others) ? > > > >> On Thu, Aug 29, 2019 at 11:40 AM Prashant Malani > wrote: > >> > >> Hi, > >> > >> The r8152 driver source code distributed by Realtek (on > >> www.realtek.com) contains firmware patches. This involves binary > >> byte-arrays being written byte/word-wise to the hardware memory > >> Example: grund...@chromium.org (cc-ed) has an experimental patch > which > >> includes the firmware patching code which was distributed with the > >> Realtek source : > >> > https://chromium-review.googlesource.com/c/chromiumos/third_party/kern > el > /+/1417953 > >> > >> It would be nice to have a way to incorporate these firmware fixes > >> into the upstream code. Since having indecipherable byte-arrays is > >> not possible upstream, I propose the following: > >> - We use the assistance of Realtek to come up with a format which > >> the firmware patch files can follow (this can be documented in the > >> comments). > >> - A real simple format could look like this: > >> + > >> > ... N > >... 
> >>+ The driver would be able to understand how to > >> parse each section (e.g is each data entry a byte or a word?) > >> > >> - We use request_firmware() to load the firmware, parse it and > >> write the data to the relevant registers. I plan to finish the patches which I am going to submit, first. Then, I could focus on this. However, I don't think I would start this quickly. There are many preparations and they would take me a lot of time. Best Regards, Hayes
RE: Proposal: r8152 firmware patching framework
Prashant Malani > > > > (Adding a few more Realtek folks) > > > > Friendly ping. Any thoughts / feedback, Realtek folks (and others) ? > > > >> On Thu, Aug 29, 2019 at 11:40 AM Prashant Malani > wrote: > >> > >> Hi, > >> > >> The r8152 driver source code distributed by Realtek (on > >> www.realtek.com) contains firmware patches. This involves binary > >> byte-arrays being written byte/word-wise to the hardware memory > >> Example: grund...@chromium.org (cc-ed) has an experimental patch > which > >> includes the firmware patching code which was distributed with the > >> Realtek source : > >> > https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel > /+/1417953 > >> > >> It would be nice to have a way to incorporate these firmware fixes > >> into the upstream code. Since having indecipherable byte-arrays is not > >> possible upstream, I propose the following: > >> - We use the assistance of Realtek to come up with a format which the > >> firmware patch files can follow (this can be documented in the > >> comments). > >> - A real simple format could look like this: > >> + > >> > ... >... > >>+ The driver would be able to understand how to parse > >> each section (e.g is each data entry a byte or a word?) > >> > >> - We use request_firmware() to load the firmware, parse it and write > >> the data to the relevant registers. I plan to finish the patches which I am going to submit, first. Then, I could focus on this. However, I don't think I would start this quickly. There are many preparations and they would take me a lot of time. Best Regards, Hayes
Re: Proposal: r8152 firmware patching framework
+ acct mgr, Stephen > Prashant Malani wrote on Aug 31, 2019, at 6:24 AM: > > (Adding a few more Realtek folks) > > Friendly ping. Any thoughts / feedback, Realtek folks (and others) ? > >> On Thu, Aug 29, 2019 at 11:40 AM Prashant Malani >> wrote: >> >> Hi, >> >> The r8152 driver source code distributed by Realtek (on >> www.realtek.com) contains firmware patches. This involves binary >> byte-arrays being written byte/word-wise to the hardware memory >> Example: grund...@chromium.org (cc-ed) has an experimental patch which >> includes the firmware patching code which was distributed with the >> Realtek source : >> https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1417953 >> >> It would be nice to have a way to incorporate these firmware fixes >> into the upstream code. Since having indecipherable byte-arrays is not >> possible upstream, I propose the following: >> - We use the assistance of Realtek to come up with a format which the >> firmware patch files can follow (this can be documented in the >> comments). >> - A real simple format could look like this: >> + >> .. >>+ The driver would be able to understand how to parse >> each section (e.g is each data entry a byte or a word?) >> >> - We use request_firmware() to load the firmware, parse it and write >> the data to the relevant registers. >> >> I'm unfamiliar with what the preferred method of firmware patching is, >> so I hope the maintainers can help suggest the best path forward. >> >> As an aside: It would be great if Realtek could publish a list of >> fixes that the firmware patches implement (I think a list on the >> driver download page on the Realtek website would be an excellent >> starting point). >> >> Thanks and Best regards, >> >> -Prashant
Re: Proposal: r8152 firmware patching framework
(Adding a few more Realtek folks) Friendly ping. Any thoughts / feedback, Realtek folks (and others) ? On Thu, Aug 29, 2019 at 11:40 AM Prashant Malani wrote: > > Hi, > > The r8152 driver source code distributed by Realtek (on > www.realtek.com) contains firmware patches. This involves binary > byte-arrays being written byte/word-wise to the hardware memory > Example: grund...@chromium.org (cc-ed) has an experimental patch which > includes the firmware patching code which was distributed with the > Realtek source : > https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1417953 > > It would be nice to have a way to incorporate these firmware fixes > into the upstream code. Since having indecipherable byte-arrays is not > possible upstream, I propose the following: > - We use the assistance of Realtek to come up with a format which the > firmware patch files can follow (this can be documented in the > comments). >- A real simple format could look like this: >+ > .. > + The driver would be able to understand how to parse > each section (e.g is each data entry a byte or a word?) > > - We use request_firmware() to load the firmware, parse it and write > the data to the relevant registers. > > I'm unfamiliar with what the preferred method of firmware patching is, > so I hope the maintainers can help suggest the best path forward. > > As an aside: It would be great if Realtek could publish a list of > fixes that the firmware patches implement (I think a list on the > driver download page on the Realtek website would be an excellent > starting point). > > Thanks and Best regards, > > -Prashant
Proposal: r8152 firmware patching framework
Hi, The r8152 driver source code distributed by Realtek (on www.realtek.com) contains firmware patches. This involves binary byte-arrays being written byte/word-wise to the hardware memory. Example: grund...@chromium.org (cc-ed) has an experimental patch which includes the firmware patching code that was distributed with the Realtek source: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1417953 It would be nice to have a way to incorporate these firmware fixes into the upstream code. Since having indecipherable byte-arrays is not possible upstream, I propose the following: - We use the assistance of Realtek to come up with a format which the firmware patch files can follow (this can be documented in the comments). - A really simple format could look like this: + .. + The driver would be able to understand how to parse each section (e.g. is each data entry a byte or a word?) - We use request_firmware() to load the firmware, parse it and write the data to the relevant registers. I'm unfamiliar with what the preferred method of firmware patching is, so I hope the maintainers can help suggest the best path forward. As an aside: It would be great if Realtek could publish a list of fixes that the firmware patches implement (I think a list on the driver download page on the Realtek website would be an excellent starting point). Thanks and Best regards, -Prashant
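A minimal sketch of the sectioned-format idea above, written as plain userspace C so it can be compiled and run outside the kernel. The section layout from the proposal did not survive in the archive, so the layout used here (a 1-byte section id, a 1-byte entry width, a little-endian 16-bit entry count, then the entries) is purely an assumption for illustration, as are the names fw_section, fw_parse_section and fw_entry. In a driver the blob would come from request_firmware() and each entry would be written to a device register instead of being returned to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical section header: NOT Realtek's format, only an illustration. */
struct fw_section {
	uint8_t id;          /* which part of the chip this section patches */
	uint8_t width;       /* bytes per data entry: 1 (byte) or 2 (word) */
	uint16_t count;      /* number of data entries that follow */
	const uint8_t *data; /* points into the caller's blob, not copied */
};

/*
 * Parse one section starting at blob[*off]. On success, fill *sec, advance
 * *off past the section and return 0. Return -1 if the header or the
 * payload is truncated or the entry width is not 1 or 2.
 */
static int fw_parse_section(const uint8_t *blob, size_t len, size_t *off,
			    struct fw_section *sec)
{
	size_t pos = *off;
	size_t payload;

	if (pos > len || len - pos < 4)
		return -1;

	sec->id = blob[pos];
	sec->width = blob[pos + 1];
	sec->count = (uint16_t)(blob[pos + 2] | (blob[pos + 3] << 8));
	if (sec->width != 1 && sec->width != 2)
		return -1;

	payload = (size_t)sec->count * sec->width;
	if (len - pos - 4 < payload)
		return -1;

	sec->data = blob + pos + 4;
	*off = pos + 4 + payload;
	return 0;
}

/* Read entry i of a section, honouring the per-section entry width. */
static uint16_t fw_entry(const struct fw_section *sec, uint16_t i)
{
	const uint8_t *p = sec->data + (size_t)i * sec->width;

	if (sec->width == 1)
		return p[0];
	return (uint16_t)(p[0] | (p[1] << 8));
}
```

Carrying the entry width in the header is one way to answer the "byte or word?" question per section without hard-coding it in the driver. Whether the real format should do that is for Realtek and the maintainers to decide.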
Re: [RFC v2] vsock: proposal to support multiple transports at runtime
On Mon, Aug 19, 2019 at 02:09:11PM +0100, Stefan Hajnoczi wrote: > On Thu, Jun 06, 2019 at 12:09:12PM +0200, Stefano Garzarella wrote: > > > > Hi all, > > this is a v2 of a proposal addressing the comments made by Dexuan, Stefan, > > and Jorgen. > > > > v1: https://www.spinics.net/lists/netdev/msg570274.html > > > > > > > > We can define two types of transport that we have to handle at the same time > > (e.g. in a nested VM we would have both types of transport running > > together): > > > > - 'host->guest' transport, it runs in the host and it is used to communicate > > with the guests of a specific hypervisor (KVM, VMWare or Hyper-V). It also > > runs in the guest who has nested guests, to communicate with them. > > > > [Phase 2] > > We can support multiple 'host->guest' transport running at the same time, > > but on x86 only one hypervisor uses VMX at any given time. > > > > - 'guest->host' transport, it runs in the guest and it is used to > > communicate > > with the host. > > > > > > The main goal is to find a way to decide what transport use in these cases: > > 1. connect() / sendto() > > > >a. use the 'host->guest' transport, if the destination is the guest > > (dest_cid > VMADDR_CID_HOST). > > > > [Phase 2] > > In order to support multiple 'host->guest' transports running at the > > same > > time, we should assign CIDs uniquely across all transports. In this > > way, > > a packet generated by the host side will get directed to the > > appropriate > > transport based on the CID. > > > >b. use the 'guest->host' transport, if the destination is the host or the > > hypervisor. > > (dest_cid == VMADDR_CID_HOST || dest_cid == VMADDR_CID_HYPERVISOR) > > > > > > 2. listen() / recvfrom() > > > >a. use the 'host->guest' transport, if the socket is bound to > > VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no > > 'guest->host' transport. > > We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order to > > address this case. 
> > > > [Phase 2] > > We can support network namespaces to create independent AF_VSOCK > > addressing domains: > > - could be used to partition VMs between hypervisors or at a finer > > granularity; > > - could be used to isolate host applications from guest applications > > using the same ports with CID_ANY; > > > >b. use the 'guest->host' transport, if the socket is bound to local CID > > different from the VMADDR_CID_HOST (guest CID get with > > IOCTL_VM_SOCKETS_GET_LOCAL_CID), or it is bound to VMADDR_CID_ANY (to > > be > > backward compatible). > > Also in this case, we could define a new VMADDR_CID_LISTEN_FROM_HOST. > > > >c. shared port space between transports > > For incoming requests or packets, we should be able to choose which > > transport use, looking at the 'port' requested. > > > > - stream sockets already support shared port space between transports > > (one port can be assigned to only one transport) > > > > [Phase 2] > > - datagram sockets will support it, but for now VMCI transport is the > > default transport for any host side datagram socket (KVM and Hyper-V > > do not yet support datagrams sockets) > > > > We will make the loading of af_vsock.ko independent of the transports to > > allow to: > >- create a AF_VSOCK socket without any loaded transports; > >- listen on a socket (e.g. bound to VMADDR_CID_ANY) without any loaded > > transports; > > > > Hopefully, we could move MODULE_ALIAS_NETPROTO(PF_VSOCK) from the > > vmci_transport.ko to the af_vsock.ko. > > [Jorgen will check if this will impact the existing VMware products] > > > > Notes: > >- For Hyper-V sockets, the host can only be Windows. No changes should > > be required on the Windows host to support the changes on this > > proposal. > > > >- Communication between guests are not allowed on any transports, so we > > can > > drop packets sent from a guest to another guest (dest_cid > > > VMADDR_CID_HOST) if the 'host->guest' transport is not available. >
Re: [RFC v2] vsock: proposal to support multiple transports at runtime
On Thu, Jun 06, 2019 at 12:09:12PM +0200, Stefano Garzarella wrote: > > Hi all, > this is a v2 of a proposal addressing the comments made by Dexuan, Stefan, > and Jorgen. > > v1: https://www.spinics.net/lists/netdev/msg570274.html > > > > We can define two types of transport that we have to handle at the same time > (e.g. in a nested VM we would have both types of transport running together): > > - 'host->guest' transport, it runs in the host and it is used to communicate > with the guests of a specific hypervisor (KVM, VMWare or Hyper-V). It also > runs in the guest who has nested guests, to communicate with them. > > [Phase 2] > We can support multiple 'host->guest' transport running at the same time, > but on x86 only one hypervisor uses VMX at any given time. > > - 'guest->host' transport, it runs in the guest and it is used to communicate > with the host. > > > The main goal is to find a way to decide what transport use in these cases: > 1. connect() / sendto() > >a. use the 'host->guest' transport, if the destination is the guest > (dest_cid > VMADDR_CID_HOST). > > [Phase 2] > In order to support multiple 'host->guest' transports running at the > same > time, we should assign CIDs uniquely across all transports. In this way, > a packet generated by the host side will get directed to the appropriate > transport based on the CID. > >b. use the 'guest->host' transport, if the destination is the host or the > hypervisor. > (dest_cid == VMADDR_CID_HOST || dest_cid == VMADDR_CID_HYPERVISOR) > > > 2. listen() / recvfrom() > >a. use the 'host->guest' transport, if the socket is bound to > VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no > 'guest->host' transport. > We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order to > address this case. 
> > [Phase 2] > We can support network namespaces to create independent AF_VSOCK > addressing domains: > - could be used to partition VMs between hypervisors or at a finer >granularity; > - could be used to isolate host applications from guest applications >using the same ports with CID_ANY; > >b. use the 'guest->host' transport, if the socket is bound to local CID > different from the VMADDR_CID_HOST (guest CID get with > IOCTL_VM_SOCKETS_GET_LOCAL_CID), or it is bound to VMADDR_CID_ANY (to be > backward compatible). > Also in this case, we could define a new VMADDR_CID_LISTEN_FROM_HOST. > >c. shared port space between transports > For incoming requests or packets, we should be able to choose which > transport use, looking at the 'port' requested. > > - stream sockets already support shared port space between transports > (one port can be assigned to only one transport) > > [Phase 2] > - datagram sockets will support it, but for now VMCI transport is the > default transport for any host side datagram socket (KVM and Hyper-V > do not yet support datagrams sockets) > > We will make the loading of af_vsock.ko independent of the transports to > allow to: >- create a AF_VSOCK socket without any loaded transports; >- listen on a socket (e.g. bound to VMADDR_CID_ANY) without any loaded > transports; > > Hopefully, we could move MODULE_ALIAS_NETPROTO(PF_VSOCK) from the > vmci_transport.ko to the af_vsock.ko. > [Jorgen will check if this will impact the existing VMware products] > > Notes: >- For Hyper-V sockets, the host can only be Windows. No changes should > be required on the Windows host to support the changes on this proposal. > >- Communication between guests are not allowed on any transports, so we can > drop packets sent from a guest to another guest (dest_cid > > VMADDR_CID_HOST) if the 'host->guest' transport is not available. 
> >- [Phase 2] tag used to identify things that can be done at a later stage, > but that should be taken into account during this design. > >- Namespace support will be developed in [Phase 2] or in a separate > project. > > > > Comments and suggestions are welcome. > I'll be on PTO for next two weeks, so sorry in advance if I'll answer later. > > If we agree on this proposal, when I get back, I'll start working on the code > to get a first PATCH RFC. Stefano, I've reviewed your proposal and it looks good for solving nested virtualization. The tricky implementation details will be supporting listen sockets, especially with VMADDR_CID_ANY so they can be accessed from both transports. Stefan
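The selection rules in the quoted proposal (cases 1.a, 1.b, 2.a and 2.b) can be sketched as two small decision functions. This is a userspace illustration only, not code from af_vsock. The enum, the function names and the have_h2g/have_g2h flags are hypothetical, and the CID constants are defined locally with the values from <linux/vm_sockets.h> so the block stands alone.

```c
#include <stdbool.h>
#include <stdint.h>

/* Well-known CIDs, same values as in <linux/vm_sockets.h>. */
#define VMADDR_CID_ANY        (-1U)
#define VMADDR_CID_HYPERVISOR 0U
#define VMADDR_CID_HOST       2U

/* Hypothetical names for the two transport types in the proposal. */
enum sketch_transport {
	SKETCH_NONE, /* no suitable transport: fail or drop */
	SKETCH_H2G,  /* 'host->guest' transport */
	SKETCH_G2H   /* 'guest->host' transport */
};

/* Case 1: connect() / sendto(), decided by the destination CID. */
static enum sketch_transport
pick_connect_transport(uint32_t dest_cid, bool have_h2g, bool have_g2h)
{
	/* 1.b: the destination is the host or the hypervisor. */
	if (dest_cid == VMADDR_CID_HOST || dest_cid == VMADDR_CID_HYPERVISOR)
		return have_g2h ? SKETCH_G2H : SKETCH_NONE;

	/*
	 * 1.a: the destination is a guest. Without a 'host->guest'
	 * transport this would be guest-to-guest traffic, which the
	 * notes in the proposal say is to be dropped.
	 */
	if (dest_cid > VMADDR_CID_HOST)
		return have_h2g ? SKETCH_H2G : SKETCH_NONE;

	/* CID 1 is not covered by the proposal; reject it in this sketch. */
	return SKETCH_NONE;
}

/* Case 2: listen() / recvfrom(), decided by the CID the socket is bound to. */
static enum sketch_transport
pick_listen_transport(uint32_t bound_cid, bool have_g2h)
{
	/* 2.a: bound to the host CID. */
	if (bound_cid == VMADDR_CID_HOST)
		return SKETCH_H2G;

	/* 2.a / 2.b: CID_ANY prefers 'guest->host' for backward compatibility. */
	if (bound_cid == VMADDR_CID_ANY)
		return have_g2h ? SKETCH_G2H : SKETCH_H2G;

	/* 2.b: bound to a local (guest) CID other than the host CID. */
	return SKETCH_G2H;
}
```

The sketch picks exactly one transport per socket, so it does not solve the case Stefan raises, where a listener bound to VMADDR_CID_ANY should be reachable from both transports. It only makes the written rules concrete.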
Re: [RFC] longer netdev names proposal
On Fri, Jun 28, 2019 at 03:55:53PM +0200, Jiri Pirko wrote: > Fri, Jun 28, 2019 at 03:14:01PM CEST, and...@lunn.ch wrote: > > > >What is your user case for having multiple IFLA_ALT_NAME for the same > >IFLA_NAME? > > I don't know about specific usecase for having more. Perhaps Michal > does. One use case that comes to mind is the "predictable names" implemented by udev/systemd, which can be based on different naming schemes (bus address, BIOS numbering, MAC address etc.), and it's not always obvious which scheme is going to be used. I have even seen several times that one scheme was used during system installation and another in the installed system, so that the network configuration created by the installer did not work. For block devices, current practice is not to rename the device and only create multiple symlinks based on different naming schemes (by id, by uuid, by label, etc.). With support for multiple altnames, we could also identify the network device in different ways (all applicable ones). Michal
Re: [RFC] longer netdev names proposal
Fri, Jun 28, 2019 at 05:44:47PM CEST, step...@networkplumber.org wrote: >On Fri, 28 Jun 2019 15:55:53 +0200 >Jiri Pirko wrote: > >> Fri, Jun 28, 2019 at 03:14:01PM CEST, and...@lunn.ch wrote: >> >On Fri, Jun 28, 2019 at 01:12:16PM +0200, Jiri Pirko wrote: >> >> Thu, Jun 27, 2019 at 09:20:41PM CEST, step...@networkplumber.org wrote: >> >> >On Thu, 27 Jun 2019 20:39:48 +0200 >> >> >Michal Kubecek wrote: >> >> > >> >> >> > >> >> >> > $ ip li set dev enp3s0 alias "Onboard Ethernet" >> >> >> > # ip link show "Onboard Ethernet" >> >> >> > Device "Onboard Ethernet" does not exist. >> >> >> > >> >> >> > So it does not really appear to be an alias, it is a label. To be >> >> >> > truly useful, it needs to be more than a label, it needs to be a real >> >> >> > alias which you can use. >> >> >> >> >> >> That's exactly what I meant: to be really useful, one should be able to >> >> >> use the alias(es) for setting device options, for adding routes, in >> >> >> netfilter rules etc. >> >> >> >> >> >> Michal >> >> > >> >> >The kernel doesn't enforce uniqueness of alias. >> >> >Also current kernel RTM_GETLINK doesn't do filter by alias (easily >> >> >fixed). >> >> > >> >> >If it did, then handling it in iproute would be something like: >> >> >> >> I think that it is desired for kernel to work with "real alias" as a >> >> handle. Userspace could either pass ifindex, IFLA_NAME or "real alias". >> >> Userspace mapping like you did here might be perhaps okay for iproute2, >> >> but I think that we need something and easy to use for all. >> >> >> >> Let's call it "altname". Get would return: >> >> >> >> IFLA_NAME eth0 >> >> IFLA_ALT_NAME_LIST >> >>IFLA_ALT_NAME eth0 >> >>IFLA_ALT_NAME somethingelse >> >>IFLA_ALT_NAME somenamethatisreallylong >> > >> >Hi Jiri >> > >> >What is your user case for having multiple IFLA_ALT_NAME for the same >> >IFLA_NAME? >> >> I don't know about specific usecase for having more. Perhaps Michal >> does. 
>> >> From the implementation perspective it is handy to have the ifname as >> the first alt name in kernel, so the userspace would just pass >> IFLA_ALT_NAME always. Also for avoiding name collisions etc. > >I like the alternate name proposal. The kernel would have to impose >uniqueness. >Does alt_name have to be unique across both regular and alt_name? Yes. That is my idea. To have one big hashtable to contain them all. >Having multiple names list seems less interesting but it could be useful. Yeah. Okay, I'm going to jump on this.
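For illustration only, the "one big hashtable" idea could be sketched in C along these lines. This is a minimal userspace sketch, not the kernel implementation: `name_table`, `nt_add`, `nt_lookup` and the bucket count are hypothetical names chosen here, and the only property it tries to show is that primary names and altnames share one namespace, so a name already taken in either role is refused.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NT_BUCKETS 64            /* hypothetical table size, power of two */

struct nt_entry {
	struct nt_entry *next;
	int ifindex;             /* device this name resolves to */
	char *name;              /* primary name or altname, any length */
};

struct name_table {
	struct nt_entry *bucket[NT_BUCKETS];
};

/* djb2-style string hash, reduced to a bucket index */
static unsigned int nt_hash(const char *s)
{
	unsigned int h = 5381;

	while (*s)
		h = h * 33 + (unsigned char)*s++;
	return h & (NT_BUCKETS - 1);
}

/* Resolve any name (primary or alt) to an ifindex; 0 means "no such device". */
static int nt_lookup(const struct name_table *t, const char *name)
{
	const struct nt_entry *e;

	for (e = t->bucket[nt_hash(name)]; e; e = e->next)
		if (strcmp(e->name, name) == 0)
			return e->ifindex;
	return 0;
}

/*
 * Register a name for a device. Returns 0 on success, -1 if the name is
 * already taken by any device (as a primary name or as an altname) or if
 * allocation fails.
 */
static int nt_add(struct name_table *t, const char *name, int ifindex)
{
	struct nt_entry *e;
	unsigned int h;

	assert(ifindex > 0);
	if (nt_lookup(t, name))
		return -1;

	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->name = malloc(strlen(name) + 1);
	if (!e->name) {
		free(e);
		return -1;
	}
	strcpy(e->name, name);
	e->ifindex = ifindex;
	h = nt_hash(name);
	e->next = t->bucket[h];
	t->bucket[h] = e;
	return 0;
}
```

With a table like this, registering "eth0" and "somenamethatisreallylong" for the same ifindex makes both resolve to one device, while a second device asking for either name is rejected. Removal and locking are left out of the sketch.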
Re: [RFC] longer netdev names proposal
On Fri, 28 Jun 2019 15:55:53 +0200 Jiri Pirko wrote: > Fri, Jun 28, 2019 at 03:14:01PM CEST, and...@lunn.ch wrote: > >On Fri, Jun 28, 2019 at 01:12:16PM +0200, Jiri Pirko wrote: > >> Thu, Jun 27, 2019 at 09:20:41PM CEST, step...@networkplumber.org wrote: > >> >On Thu, 27 Jun 2019 20:39:48 +0200 > >> >Michal Kubecek wrote: > >> > > >> >> > > >> >> > $ ip li set dev enp3s0 alias "Onboard Ethernet" > >> >> > # ip link show "Onboard Ethernet" > >> >> > Device "Onboard Ethernet" does not exist. > >> >> > > >> >> > So it does not really appear to be an alias, it is a label. To be > >> >> > truly useful, it needs to be more than a label, it needs to be a real > >> >> > alias which you can use. > >> >> > >> >> That's exactly what I meant: to be really useful, one should be able to > >> >> use the alias(es) for setting device options, for adding routes, in > >> >> netfilter rules etc. > >> >> > >> >> Michal > >> > > >> >The kernel doesn't enforce uniqueness of alias. > >> >Also current kernel RTM_GETLINK doesn't do filter by alias (easily fixed). > >> > > >> >If it did, then handling it in iproute would be something like: > >> > >> I think that it is desired for kernel to work with "real alias" as a > >> handle. Userspace could either pass ifindex, IFLA_NAME or "real alias". > >> Userspace mapping like you did here might be perhaps okay for iproute2, > >> but I think that we need something and easy to use for all. > >> > >> Let's call it "altname". Get would return: > >> > >> IFLA_NAME eth0 > >> IFLA_ALT_NAME_LIST > >> IFLA_ALT_NAME eth0 > >>IFLA_ALT_NAME somethingelse > >>IFLA_ALT_NAME somenamethatisreallylong > > > >Hi Jiri > > > >What is your user case for having multiple IFLA_ALT_NAME for the same > >IFLA_NAME? > > I don't know about specific usecase for having more. Perhaps Michal > does. > > From the implementation perspective it is handy to have the ifname as > the first alt name in kernel, so the userspace would just pass > IFLA_ALT_NAME always. 
> Also for avoiding name collisions etc.

I like the alternate name proposal. The kernel would have to impose
uniqueness.
Does alt_name have to be unique across both regular and alt_name?

Having multiple names list seems less interesting but it could be useful.
Re: [RFC] longer netdev names proposal
Fri, Jun 28, 2019 at 03:14:01PM CEST, and...@lunn.ch wrote: >On Fri, Jun 28, 2019 at 01:12:16PM +0200, Jiri Pirko wrote: >> Thu, Jun 27, 2019 at 09:20:41PM CEST, step...@networkplumber.org wrote: >> >On Thu, 27 Jun 2019 20:39:48 +0200 >> >Michal Kubecek wrote: >> > >> >> > >> >> > $ ip li set dev enp3s0 alias "Onboard Ethernet" >> >> > # ip link show "Onboard Ethernet" >> >> > Device "Onboard Ethernet" does not exist. >> >> > >> >> > So it does not really appear to be an alias, it is a label. To be >> >> > truly useful, it needs to be more than a label, it needs to be a real >> >> > alias which you can use. >> >> >> >> That's exactly what I meant: to be really useful, one should be able to >> >> use the alias(es) for setting device options, for adding routes, in >> >> netfilter rules etc. >> >> >> >> Michal >> > >> >The kernel doesn't enforce uniqueness of alias. >> >Also current kernel RTM_GETLINK doesn't do filter by alias (easily fixed). >> > >> >If it did, then handling it in iproute would be something like: >> >> I think that it is desired for kernel to work with "real alias" as a >> handle. Userspace could either pass ifindex, IFLA_NAME or "real alias". >> Userspace mapping like you did here might be perhaps okay for iproute2, >> but I think that we need something and easy to use for all. >> >> Let's call it "altname". Get would return: >> >> IFLA_NAME eth0 >> IFLA_ALT_NAME_LIST >>IFLA_ALT_NAME eth0 >>IFLA_ALT_NAME somethingelse >>IFLA_ALT_NAME somenamethatisreallylong > >Hi Jiri > >What is your user case for having multiple IFLA_ALT_NAME for the same >IFLA_NAME? I don't know about specific usecase for having more. Perhaps Michal does. >From the implementation perspective it is handy to have the ifname as the first alt name in kernel, so the userspace would just pass IFLA_ALT_NAME always. Also for avoiding name collisions etc. > > Thanks > Andrew >
Re: [RFC] longer netdev names proposal
On Fri, Jun 28, 2019 at 01:12:16PM +0200, Jiri Pirko wrote: > Thu, Jun 27, 2019 at 09:20:41PM CEST, step...@networkplumber.org wrote: > >On Thu, 27 Jun 2019 20:39:48 +0200 > >Michal Kubecek wrote: > > > >> > > >> > $ ip li set dev enp3s0 alias "Onboard Ethernet" > >> > # ip link show "Onboard Ethernet" > >> > Device "Onboard Ethernet" does not exist. > >> > > >> > So it does not really appear to be an alias, it is a label. To be > >> > truly useful, it needs to be more than a label, it needs to be a real > >> > alias which you can use. > >> > >> That's exactly what I meant: to be really useful, one should be able to > >> use the alias(es) for setting device options, for adding routes, in > >> netfilter rules etc. > >> > >> Michal > > > >The kernel doesn't enforce uniqueness of alias. > >Also current kernel RTM_GETLINK doesn't do filter by alias (easily fixed). > > > >If it did, then handling it in iproute would be something like: > > I think that it is desired for kernel to work with "real alias" as a > handle. Userspace could either pass ifindex, IFLA_NAME or "real alias". > Userspace mapping like you did here might be perhaps okay for iproute2, > but I think that we need something and easy to use for all. > > Let's call it "altname". Get would return: > > IFLA_NAME eth0 > IFLA_ALT_NAME_LIST >IFLA_ALT_NAME eth0 >IFLA_ALT_NAME somethingelse >IFLA_ALT_NAME somenamethatisreallylong Hi Jiri What is your user case for having multiple IFLA_ALT_NAME for the same IFLA_NAME? Thanks Andrew
Re: [RFC] longer netdev names proposal
Fri, Jun 28, 2019 at 01:42:12PM CEST, mkube...@suse.cz wrote: >On Fri, Jun 28, 2019 at 01:12:16PM +0200, Jiri Pirko wrote: >> >> I think that it is desired for kernel to work with "real alias" as a >> handle. Userspace could either pass ifindex, IFLA_NAME or "real alias". >> Userspace mapping like you did here might be perhaps okay for iproute2, >> but I think that we need something and easy to use for all. >> >> Let's call it "altname". Get would return: >> >> IFLA_NAME eth0 >> IFLA_ALT_NAME_LIST >>IFLA_ALT_NAME eth0 >>IFLA_ALT_NAME somethingelse >>IFLA_ALT_NAME somenamethatisreallylong >> >> then userspace would pass with a request (get/set/del): >> IFLA_ALT_NAME eth0/somethingelse/somenamethatisreallylong >> or >> IFLA_NAME eth0 if it is talking with older kernel >> >> Then following would do exactly the same: >> ip link set eth0 addr 11:22:33:44:55:66 >> ip link set somethingelse addr 11:22:33:44:55:66 >> ip link set somenamethatisreallylong addr 11:22:33:44:55:66 > >Yes, this sounds nice. > >> We would have to figure out the iproute2 iface to add/del altnames: >> ip link add eth0 altname somethingelse >> ip link del eth0 altname somethingelse >> this might be also: >> ip link del somethingelse altname somethingelse > >This would be a bit confusing, IMHO, as so far > > ip link add $name ... > >always means we want to add or delete new device $name which would not >be the case here. How about the other way around: > > ip link add somethingelse altname_for eth0 > >(preferrably with a better keyword than "altname_for" :-) ). Or maybe > > ip altname add somethingelse dev eth0 > ip altname del somethingelse dev eth0 Yeah, I like this. Let's see how it will work during the implementation.
Re: [RFC] longer netdev names proposal
On Fri, Jun 28, 2019 at 01:12:16PM +0200, Jiri Pirko wrote:
>
> I think that it is desired for kernel to work with "real alias" as a
> handle. Userspace could either pass ifindex, IFLA_NAME or "real alias".
> Userspace mapping like you did here might be perhaps okay for iproute2,
> but I think that we need something and easy to use for all.
>
> Let's call it "altname". Get would return:
>
> IFLA_NAME eth0
> IFLA_ALT_NAME_LIST
>    IFLA_ALT_NAME eth0
>    IFLA_ALT_NAME somethingelse
>    IFLA_ALT_NAME somenamethatisreallylong
>
> then userspace would pass with a request (get/set/del):
> IFLA_ALT_NAME eth0/somethingelse/somenamethatisreallylong
> or
> IFLA_NAME eth0 if it is talking with older kernel
>
> Then following would do exactly the same:
> ip link set eth0 addr 11:22:33:44:55:66
> ip link set somethingelse addr 11:22:33:44:55:66
> ip link set somenamethatisreallylong addr 11:22:33:44:55:66

Yes, this sounds nice.

> We would have to figure out the iproute2 iface to add/del altnames:
> ip link add eth0 altname somethingelse
> ip link del eth0 altname somethingelse
> this might be also:
> ip link del somethingelse altname somethingelse

This would be a bit confusing, IMHO, as so far

  ip link add $name ...

always means we want to add or delete new device $name which would not
be the case here. How about the other way around:

  ip link add somethingelse altname_for eth0

(preferably with a better keyword than "altname_for" :-) ). Or maybe

  ip altname add somethingelse dev eth0
  ip altname del somethingelse dev eth0

Michal
Re: [RFC] longer netdev names proposal
Thu, Jun 27, 2019 at 09:20:41PM CEST, step...@networkplumber.org wrote:
>On Thu, 27 Jun 2019 20:39:48 +0200
>Michal Kubecek wrote:
>
>> >
>> > $ ip li set dev enp3s0 alias "Onboard Ethernet"
>> > # ip link show "Onboard Ethernet"
>> > Device "Onboard Ethernet" does not exist.
>> >
>> > So it does not really appear to be an alias, it is a label. To be
>> > truly useful, it needs to be more than a label, it needs to be a real
>> > alias which you can use.
>>
>> That's exactly what I meant: to be really useful, one should be able to
>> use the alias(es) for setting device options, for adding routes, in
>> netfilter rules etc.
>>
>> Michal
>
>The kernel doesn't enforce uniqueness of alias.
>Also current kernel RTM_GETLINK doesn't do filter by alias (easily fixed).
>
>If it did, then handling it in iproute would be something like:

I think that it is desired for kernel to work with "real alias" as a
handle. Userspace could either pass ifindex, IFLA_NAME or "real alias".
Userspace mapping like you did here might be perhaps okay for iproute2,
but I think that we need something easy to use for all.

Let's call it "altname". Get would return:

IFLA_NAME eth0
IFLA_ALT_NAME_LIST
   IFLA_ALT_NAME eth0
   IFLA_ALT_NAME somethingelse
   IFLA_ALT_NAME somenamethatisreallylong

then userspace would pass with a request (get/set/del):
IFLA_ALT_NAME eth0/somethingelse/somenamethatisreallylong
or
IFLA_NAME eth0 if it is talking with older kernel

Then the following would all do exactly the same:
ip link set eth0 addr 11:22:33:44:55:66
ip link set somethingelse addr 11:22:33:44:55:66
ip link set somenamethatisreallylong addr 11:22:33:44:55:66

We would have to figure out the iproute2 iface to add/del altnames:
ip link add eth0 altname somethingelse
ip link del eth0 altname somethingelse
this might be also:
ip link del somethingelse altname somethingelse

How does this sound?
Re: [RFC] longer netdev names proposal
Thu, Jun 27, 2019 at 09:35:27PM CEST, d...@redhat.com wrote:
>On Thu, 2019-06-27 at 12:20 -0700, Stephen Hemminger wrote:
>> On Thu, 27 Jun 2019 20:39:48 +0200
>> Michal Kubecek wrote:
>>
>> > > $ ip li set dev enp3s0 alias "Onboard Ethernet"
>> > > # ip link show "Onboard Ethernet"
>> > > Device "Onboard Ethernet" does not exist.
>> > >
>> > > So it does not really appear to be an alias, it is a label. To be
>> > > truly useful, it needs to be more than a label, it needs to be a
>> > > real
>> > > alias which you can use.
>> >
>> > That's exactly what I meant: to be really useful, one should be
>> > able to
>> > use the alias(es) for setting device options, for adding routes, in
>> > netfilter rules etc.
>> >
>> > Michal
>>
>> The kernel doesn't enforce uniqueness of alias.
>
>Can we even enforce unique aliases/labels? Given that the kernel hasn't
>enforced that in the past there's a good possibility of breaking stuff
>if it started. (unfortunately)

Correct. I think that Michal's idea to introduce "real aliases" is very
interesting. However, the existing "alias" as we have it does not seem
right to be used, also because of the UAPI: IFLA_IFALIAS carries a
single value, while "real aliases" need a nested array.

[...]
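To make the "nested array" point a bit more concrete, here is a rough, self-contained C sketch of how a list of altnames could be laid out as one container attribute holding one attribute per name. It only imitates the general netlink attribute convention (16-bit length, 16-bit type, payload padded to 4 bytes, host byte order) and does not use the kernel headers. The type numbers `SK_ALT_NAME_LIST` and `SK_ALT_NAME` and all function names are made up for the example; the thread does not assign any.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute types; the thread does not assign numbers. */
enum {
	SK_ALT_NAME_LIST = 1,   /* container attribute */
	SK_ALT_NAME      = 2,   /* one NUL-terminated name inside it */
};

#define SK_ALIGN(len)  (((size_t)(len) + 3) & ~(size_t)3)  /* pad to 4 bytes */
#define SK_HDRLEN      ((size_t)4)                         /* u16 len + u16 type */

/* Write a 4-byte attribute header at buf + off. */
static void sk_put_hdr(unsigned char *buf, size_t off, uint16_t len,
		       uint16_t type)
{
	memcpy(buf + off, &len, sizeof(len));
	memcpy(buf + off + 2, &type, sizeof(type));
}

/*
 * Encode names[0..n-1] as one SK_ALT_NAME_LIST attribute containing one
 * SK_ALT_NAME attribute per name. Returns the number of bytes used, or 0
 * if buf is too small.
 */
static size_t sk_put_altnames(unsigned char *buf, size_t size,
			      const char *const *names, size_t n)
{
	size_t off = SK_HDRLEN;   /* reserve room for the container header */
	size_t i;

	assert(buf && names);
	if (size < SK_HDRLEN)
		return 0;
	for (i = 0; i < n; i++) {
		size_t payload = strlen(names[i]) + 1;
		size_t len = SK_HDRLEN + payload;

		if (len > UINT16_MAX || off + SK_ALIGN(len) > size)
			return 0;
		sk_put_hdr(buf, off, (uint16_t)len, SK_ALT_NAME);
		memcpy(buf + off + SK_HDRLEN, names[i], payload);
		/* zero the padding so the buffer content is deterministic */
		memset(buf + off + len, 0, SK_ALIGN(len) - len);
		off += SK_ALIGN(len);
	}
	if (off > UINT16_MAX)
		return 0;
	sk_put_hdr(buf, 0, (uint16_t)off, SK_ALT_NAME_LIST);
	return off;
}

/* Count SK_ALT_NAME entries in an encoded list, the way a receiver would walk it. */
static size_t sk_count_altnames(const unsigned char *buf, size_t total)
{
	size_t off = SK_HDRLEN, count = 0;

	while (off + SK_HDRLEN <= total) {
		uint16_t len, type;

		memcpy(&len, buf + off, sizeof(len));
		memcpy(&type, buf + off + 2, sizeof(type));
		if (len < SK_HDRLEN || off + len > total)
			break;
		if (type == SK_ALT_NAME)
			count++;
		off += SK_ALIGN(len);
	}
	return count;
}
```

The contrast with IFLA_IFALIAS is that a single string attribute has room for exactly one value, whereas the container above can carry any number of names and the receiver simply walks it.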
Re: [RFC] longer netdev names proposal
Thu, Jun 27, 2019 at 07:14:31PM CEST, dsah...@gmail.com wrote: >On 6/27/19 3:43 AM, Jiri Pirko wrote: >> Hi all. >> >> In the past, there was repeatedly discussed the IFNAMSIZ (16) limit for >> netdevice name length. Now when we have PF and VF representors >> with port names like "pfXvfY", it became quite common to hit this limit: >> 0123456789012345 >> enp131s0f1npf0vf6 >> enp131s0f1npf0vf22 > >QinQ (stacked vlans) is another example. There are more usecases for this, yes. > >> >> Since IFLA_NAME is just a string, I though it might be possible to use >> it to carry longer names as it is. However, the userspace tools, like >> iproute2, are doing checks before print out. So for example in output of >> "ip addr" when IFLA_NAME is longer than IFNAMSIZE, the netdevice is >> completely avoided. >> >> So here is a proposal that might work: >> 1) Add a new attribute IFLA_NAME_EXT that could carry names longer than >>IFNAMSIZE, say 64 bytes. The max size should be only defined in kernel, >>user should be prepared for any string size. >> 2) Add a file in sysfs that would indicate that NAME_EXT is supported by >>the kernel. > >no sysfs files. > >Johannes added infrastructure to retrieve the policy. That is a more >flexible and robust option for determining what the kernel supports. Sure, udev can query rtnetlink. I just proposed it as an option, anyway, it's implementation detail. > > >> 3) Udev is going to look for the sysfs indication file. In case when >>kernel supports long names, it will do rename to longer name, setting >>IFLA_NAME_EXT. If not, it does what it does now - fail. 
>> 4) There are two cases that can happen during rename: >>A) The name is shorter than IFNAMSIZ >> -> both IFLA_NAME and IFLA_NAME_EXT would contain the same string: >> original IFLA_NAME = eth0 >> original IFLA_NAME_EXT = eth0 >> renamed IFLA_NAME = enp5s0f1npf0vf1 >> renamed IFLA_NAME_EXT = enp5s0f1npf0vf1 >>B) The name is longer tha IFNAMSIZ >> -> IFLA_NAME would contain the original one, IFLA_NAME_EXT would >> contain the new one: >> original IFLA_NAME = eth0 >> original IFLA_NAME_EXT = eth0 >> renamed IFLA_NAME = eth0 >> renamed IFLA_NAME_EXT = enp131s0f1npf0vf22 > >so kernel side there will be 2 names for the same net_device? Yes. However, updated tools (which would be eventually all) are going to show only the ext one. > >> >> This would allow the old tools to work with "eth0" and the new >> tools would work with "enp131s0f1npf0vf22". In sysfs, there would >> be symlink from one name to another. > >I would prefer a solution that does not rely on sysfs hooks. Please note that this /sys/class/net/ifacename dirs are already created. What I propose is to have symlink from ext to the short name or vice versa. The solution really does not "rely" on this... > >> >> Also, there might be a warning added to kernel if someone works >> with IFLA_NAME that the userspace tool should be upgraded. > >that seems like spam and confusion for the first few years of a new api. Spam? warn_once? > >> >> Eventually, only IFLA_NAME_EXT is going to be used by everyone. >> >> I'm aware there are other places where similar new attribute >> would have to be introduced too (ip rule for example). >> I'm not saying this is a simple work. >> >> Question is what to do with the ioctl api (get ifindex etc). I would >> probably leave it as is and push tools to use rtnetlink instead. > >The ioctl API is going to be a limiter here. ifconfig is still quite >prevalent and net-snmp still uses ioctl (as just 2 common examples). 
>snmp showing one set of names and rtnetlink s/w showing another is going >to be really confusing. I don't see other way though, do you? The ioctl names are unextendable :/
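For reference, the two rename cases (A and B) from the proposal quoted above could be sketched in C roughly as follows. This is only a model of the described behaviour, not kernel code: `sk_netdev`, `sk_rename` and `sk_fits_legacy` are hypothetical names, `SK_IFNAMSIZ` mirrors the kernel's IFNAMSIZ of 16, and the 64-byte limit is the "say 64 bytes" figure from the proposal, taken here as an assumption.

```c
#include <assert.h>
#include <string.h>

#define SK_IFNAMSIZ     16   /* mirrors the kernel's IFNAMSIZ, NUL included */
#define SK_NAME_EXT_MAX 64   /* "say 64 bytes" in the proposal; an assumption */

struct sk_netdev {
	char name[SK_IFNAMSIZ];          /* what IFLA_NAME and ioctl would see */
	char name_ext[SK_NAME_EXT_MAX];  /* what IFLA_NAME_EXT would carry */
};

/* A name fits the legacy field only if it leaves room for the NUL. */
static int sk_fits_legacy(const char *name)
{
	return strlen(name) < SK_IFNAMSIZ;
}

/*
 * Apply the two rename cases from the proposal.
 * Returns 0 on success, -1 if new_name is empty or too long for name_ext.
 */
static int sk_rename(struct sk_netdev *dev, const char *new_name)
{
	size_t len;

	assert(dev && new_name);
	len = strlen(new_name);
	if (len == 0 || len >= SK_NAME_EXT_MAX)
		return -1;

	/* Case A: short name, both fields carry the same string. */
	if (sk_fits_legacy(new_name))
		memcpy(dev->name, new_name, len + 1);
	/* Case B: long name, the legacy field keeps its previous value. */

	memcpy(dev->name_ext, new_name, len + 1);
	return 0;
}
```

Starting from a device named "eth0", renaming to "enp5s0f1npf0vf1" (15 characters) updates both fields, while renaming to "enp131s0f1npf0vf22" (18 characters) leaves the legacy name at "eth0" and only changes the extended one, which is exactly the split that old ioctl-based tools and new rtnetlink-based tools would then see.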
Re: [RFC] longer netdev names proposal
On Thu, 2019-06-27 at 12:20 -0700, Stephen Hemminger wrote: > On Thu, 27 Jun 2019 20:39:48 +0200 > Michal Kubecek wrote: > > > > $ ip li set dev enp3s0 alias "Onboard Ethernet" > > > # ip link show "Onboard Ethernet" > > > Device "Onboard Ethernet" does not exist. > > > > > > So it does not really appear to be an alias, it is a label. To be > > > truly useful, it needs to be more than a label, it needs to be a > > > real > > > alias which you can use. > > > > That's exactly what I meant: to be really useful, one should be > > able to > > use the alias(es) for setting device options, for adding routes, in > > netfilter rules etc. > > > > Michal > > The kernel doesn't enforce uniqueness of alias. Can we even enforce unique aliases/labels? Given that the kernel hasn't enforced that in the past there's a good possibility of breaking stuff if it started. (unfortunately) Dan > Also current kernel RTM_GETLINK doesn't do filter by alias (easily > fixed). > > If it did, then handling it in iproute would be something like: > > diff --git a/lib/ll_map.c b/lib/ll_map.c > index e0ed54bf77c9..c798ba542224 100644 > --- a/lib/ll_map.c > +++ b/lib/ll_map.c > @@ -26,15 +26,18 @@ > struct ll_cache { > struct hlist_node idx_hash; > struct hlist_node name_hash; > + struct hlist_node alias_hash; > unsignedflags; > unsignedindex; > unsigned short type; > - charname[]; > + char*alias; > + charname[IFNAMSIZ]; > }; > > #define IDXMAP_SIZE 1024 > static struct hlist_head idx_head[IDXMAP_SIZE]; > static struct hlist_head name_head[IDXMAP_SIZE]; > +static struct hlist_head alias_head[IDXMAP_SIZE]; > > static struct ll_cache *ll_get_by_index(unsigned index) > { > @@ -77,10 +80,26 @@ static struct ll_cache *ll_get_by_name(const char > *name) > return NULL; > } > > +static struct ll_cache *ll_get_by_alias(const char *alias) > +{ > + struct hlist_node *n; > + unsigned h = namehash(alias) & (IDXMAP_SIZE - 1); > + > + hlist_for_each(n, &alias_head[h]) { > + struct ll_cache *im > + = container_of(n, 
struct ll_cache, alias_hash); > + > + if (strcmp(im->alias, alias) == 0) > + return im; > + } > + > + return NULL; > +} > + > int ll_remember_index(struct nlmsghdr *n, void *arg) > { > unsigned int h; > - const char *ifname; > + const char *ifname, *ifalias; > struct ifinfomsg *ifi = NLMSG_DATA(n); > struct ll_cache *im; > struct rtattr *tb[IFLA_MAX+1]; > @@ -96,6 +115,10 @@ int ll_remember_index(struct nlmsghdr *n, void > *arg) > if (im) { > hlist_del(&im->name_hash); > hlist_del(&im->idx_hash); > + if (im->alias) { > + hlist_del(&im->alias_hash); > + free(im->alias); > + } > free(im); > } > return 0; > @@ -106,6 +129,8 @@ int ll_remember_index(struct nlmsghdr *n, void > *arg) > if (ifname == NULL) > return 0; > > + ifalias = tb[IFLA_IFALIAS] ? rta_getattr_str(tb[IFLA_IFALIAS]) > : NULL; > + > if (im) { > /* change to existing entry */ > if (strcmp(im->name, ifname) != 0) { > @@ -114,6 +139,14 @@ int ll_remember_index(struct nlmsghdr *n, void > *arg) > hlist_add_head(&im->name_hash, &name_head[h]); > } > > + if (im->alias) { > + hlist_del(&im->alias_hash); > + if (ifalias) { > + h = namehash(ifalias) & (IDXMAP_SIZE - > 1); > + hlist_add_head(&im->alias_hash, > &alias_head[h]); > + } > + } > + > im->flags = ifi->ifi_flags; > return 0; > } > @@ -132,6 +165,12 @@ int ll_remember_index(struct nlmsghdr *n, void > *arg) > h = namehash(ifname) & (IDXMAP_SIZE - 1); > hlist_add_head(&im->name_hash, &name_head[h]); > > + if (ifalias) { > + im->alias = strdup(ifalias); > + h = namehash(ifalias) & (IDXMAP_SIZE - 1); > + hlist_add_head(&im->alias_hash, &alias_head[h]); > + } > + > return 0; > } > > @@ -152,7 +191,7 @@ static unsigned int ll_idx_a2n(const char *name) > return idx; > } > > -static int ll_link_get(const char *name, int index) > +static int ll_link_get(const char *name, const char *alias, int > index) > { > struct { > struct nlmsghdr n; > @@ -176,6 +215,9 @@ static int ll_link_get(const char *name, int > index) > if (name) > addattr_l(&req.n, sizeof(req), 
IFLA_IFNAME, name, > strlen(name) + 1); > + if (alias) > + addattr_l(&req.n, sizeof(req), IFLA_IF
Re: [RFC] longer netdev names proposal
On Thu, 27 Jun 2019 20:39:48 +0200
Michal Kubecek wrote:

> >
> > $ ip li set dev enp3s0 alias "Onboard Ethernet"
> > # ip link show "Onboard Ethernet"
> > Device "Onboard Ethernet" does not exist.
> >
> > So it does not really appear to be an alias, it is a label. To be
> > truly useful, it needs to be more than a label, it needs to be a real
> > alias which you can use.
>
> That's exactly what I meant: to be really useful, one should be able to
> use the alias(es) for setting device options, for adding routes, in
> netfilter rules etc.
>
> Michal

The kernel doesn't enforce uniqueness of alias.
Also current kernel RTM_GETLINK doesn't do filter by alias (easily fixed).

If it did, then handling it in iproute would be something like:

diff --git a/lib/ll_map.c b/lib/ll_map.c
index e0ed54bf77c9..c798ba542224 100644
--- a/lib/ll_map.c
+++ b/lib/ll_map.c
@@ -26,15 +26,18 @@
 struct ll_cache {
 	struct hlist_node idx_hash;
 	struct hlist_node name_hash;
+	struct hlist_node alias_hash;
 	unsigned	flags;
 	unsigned	index;
 	unsigned short	type;
-	char		name[];
+	char		*alias;
+	char		name[IFNAMSIZ];
 };
 
 #define IDXMAP_SIZE	1024
 static struct hlist_head idx_head[IDXMAP_SIZE];
 static struct hlist_head name_head[IDXMAP_SIZE];
+static struct hlist_head alias_head[IDXMAP_SIZE];
 
 static struct ll_cache *ll_get_by_index(unsigned index)
 {
@@ -77,10 +80,26 @@ static struct ll_cache *ll_get_by_name(const char *name)
 	return NULL;
 }
 
+static struct ll_cache *ll_get_by_alias(const char *alias)
+{
+	struct hlist_node *n;
+	unsigned h = namehash(alias) & (IDXMAP_SIZE - 1);
+
+	hlist_for_each(n, &alias_head[h]) {
+		struct ll_cache *im
+			= container_of(n, struct ll_cache, alias_hash);
+
+		if (strcmp(im->alias, alias) == 0)
+			return im;
+	}
+
+	return NULL;
+}
+
 int ll_remember_index(struct nlmsghdr *n, void *arg)
 {
 	unsigned int h;
-	const char *ifname;
+	const char *ifname, *ifalias;
 	struct ifinfomsg *ifi = NLMSG_DATA(n);
 	struct ll_cache *im;
 	struct rtattr *tb[IFLA_MAX+1];
@@ -96,6 +115,10 @@ int ll_remember_index(struct nlmsghdr *n, void *arg)
 		if (im) {
 			hlist_del(&im->name_hash);
 			hlist_del(&im->idx_hash);
+			if (im->alias) {
+				hlist_del(&im->alias_hash);
+				free(im->alias);
+			}
 			free(im);
 		}
 		return 0;
@@ -106,6 +129,8 @@ int ll_remember_index(struct nlmsghdr *n, void *arg)
 	if (ifname == NULL)
 		return 0;
 
+	ifalias = tb[IFLA_IFALIAS] ? rta_getattr_str(tb[IFLA_IFALIAS]) : NULL;
+
 	if (im) {
 		/* change to existing entry */
 		if (strcmp(im->name, ifname) != 0) {
@@ -114,6 +139,14 @@ int ll_remember_index(struct nlmsghdr *n, void *arg)
 			hlist_add_head(&im->name_hash, &name_head[h]);
 		}
 
+		if (im->alias) {
+			hlist_del(&im->alias_hash);
+			if (ifalias) {
+				h = namehash(ifalias) & (IDXMAP_SIZE - 1);
+				hlist_add_head(&im->alias_hash, &alias_head[h]);
+			}
+		}
+
 		im->flags = ifi->ifi_flags;
 		return 0;
 	}
@@ -132,6 +165,12 @@ int ll_remember_index(struct nlmsghdr *n, void *arg)
 	h = namehash(ifname) & (IDXMAP_SIZE - 1);
 	hlist_add_head(&im->name_hash, &name_head[h]);
 
+	if (ifalias) {
+		im->alias = strdup(ifalias);
+		h = namehash(ifalias) & (IDXMAP_SIZE - 1);
+		hlist_add_head(&im->alias_hash, &alias_head[h]);
+	}
+
 	return 0;
 }
 
@@ -152,7 +191,7 @@ static unsigned int ll_idx_a2n(const char *name)
 	return idx;
 }
 
-static int ll_link_get(const char *name, int index)
+static int ll_link_get(const char *name, const char *alias, int index)
 {
 	struct {
 		struct nlmsghdr		n;
@@ -176,6 +215,9 @@ static int ll_link_get(const char *name, int index)
 	if (name)
 		addattr_l(&req.n, sizeof(req), IFLA_IFNAME, name,
 			  strlen(name) + 1);
+	if (alias)
+		addattr_l(&req.n, sizeof(req), IFLA_IFALIAS, alias,
+			  strlen(alias) + 1);
 
 	if (rtnl_talk_suppress_rtnl_errmsg(&rth, &req.n, &answer) < 0)
 		goto out;
@@ -206,7 +248,7 @@ const char *ll_index_to_name(unsigned int idx)
 	if (im)
 		return im->name;
 
-	if (ll_link_get(NULL, idx) == idx) {
+	if (ll_link_get(NULL, NULL, idx) == idx) {
 		im = ll_get
Re: [RFC] longer netdev names proposal
On Thu, Jun 27, 2019 at 08:35:38PM +0200, Andrew Lunn wrote: > On Thu, Jun 27, 2019 at 11:23:05AM -0700, Stephen Hemminger wrote: > > On Thu, 27 Jun 2019 20:08:03 +0200 Michal Kubecek wrote: > > > > > It often feels as a deficiency that unlike block devices where we can > > > keep one name and create multiple symlinks based on different naming > > > schemes, network devices can have only one name. There are aliases but > > > AFAIK they are only used (and can be only used) for SNMP. IMHO this > > > limitation is part of the mess that left us with so-called "predictable > > > names" which are in practice neither persistent nor predictable. > > > > > > So perhaps we could introduce actual aliases (or altnames or whatever we > > > would call them) for network devices that could be used to identify > > > a network device whenever both kernel and userspace tool supports them. > > > Old (and ancient) tools would have to use the one canonical name limited > > > to current IFNAMSIZ, new tools would allow using any alias which could > > > be longer. > > > > That is already there in current network model. > > # ip li set dev eno1 alias 'Onboard Ethernet' > > # ip li show dev eno1 > > 2: eno1: mtu 1500 qdisc mq state UP mode > > DEFAULT group default qlen 1000 > > link/ether ac:1f:6b:74:38:c0 brd ff:ff:ff:ff:ff:ff > > alias Onboard Ethernet > > $ ip li set dev enp3s0 alias "Onboard Ethernet" > # ip link show "Onboard Ethernet" > Device "Onboard Ethernet" does not exist. > > So it does not really appear to be an alias, it is a label. To be > truly useful, it needs to be more than a label, it needs to be a real > alias which you can use. That's exactly what I meant: to be really useful, one should be able to use the alias(es) for setting device options, for adding routes, in netfilter rules etc. Michal
Re: [RFC] longer netdev names proposal
On Thu, Jun 27, 2019 at 11:23:05AM -0700, Stephen Hemminger wrote: > On Thu, 27 Jun 2019 20:08:03 +0200 > Michal Kubecek wrote: > > > It often feels as a deficiency that unlike block devices where we can > > keep one name and create multiple symlinks based on different naming > > schemes, network devices can have only one name. There are aliases but > > AFAIK they are only used (and can be only used) for SNMP. IMHO this > > limitation is part of the mess that left us with so-called "predictable > > names" which are in practice neither persistent nor predictable. > > > > So perhaps we could introduce actual aliases (or altnames or whatever we > > would call them) for network devices that could be used to identify > > a network device whenever both kernel and userspace tool supports them. > > Old (and ancient) tools would have to use the one canonical name limited > > to current IFNAMSIZ, new tools would allow using any alias which could > > be longer. > > > > Michal > > > That is already there in current network model. > # ip li set dev eno1 alias 'Onboard Ethernet' > # ip li show dev eno1 > 2: eno1: mtu 1500 qdisc mq state UP mode > DEFAULT group default qlen 1000 > link/ether ac:1f:6b:74:38:c0 brd ff:ff:ff:ff:ff:ff > alias Onboard Ethernet Hi Stephen $ ip li set dev enp3s0 alias "Onboard Ethernet" # ip link show "Onboard Ethernet" Device "Onboard Ethernet" does not exist. So it does not really appear to be an alias, it is a label. To be truly useful, it needs to be more than a label, it needs to be a real alias which you can use. Andrew
Re: [RFC] longer netdev names proposal
On Thu, 27 Jun 2019 20:08:03 +0200 Michal Kubecek wrote: > It often feels as a deficiency that unlike block devices where we can > keep one name and create multiple symlinks based on different naming > schemes, network devices can have only one name. There are aliases but > AFAIK they are only used (and can be only used) for SNMP. IMHO this > limitation is part of the mess that left us with so-called "predictable > names" which are in practice neither persistent nor predictable. > > So perhaps we could introduce actual aliases (or altnames or whatever we > would call them) for network devices that could be used to identify > a network device whenever both kernel and userspace tool supports them. > Old (and ancient) tools would have to use the one canonical name limited > to current IFNAMSIZ, new tools would allow using any alias which could > be longer. > > Michal That is already there in current network model. # ip li set dev eno1 alias 'Onboard Ethernet' # ip li show dev eno1 2: eno1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:74:38:c0 brd ff:ff:ff:ff:ff:ff alias Onboard Ethernet
Re: [RFC] longer netdev names proposal
On Thu, Jun 27, 2019 at 11:14:31AM -0600, David Ahern wrote: > > 4) There are two cases that can happen during rename: > >A) The name is shorter than IFNAMSIZ > > -> both IFLA_NAME and IFLA_NAME_EXT would contain the same string: > > original IFLA_NAME = eth0 > > original IFLA_NAME_EXT = eth0 > > renamed IFLA_NAME = enp5s0f1npf0vf1 > > renamed IFLA_NAME_EXT = enp5s0f1npf0vf1 > >B) The name is longer tha IFNAMSIZ > > -> IFLA_NAME would contain the original one, IFLA_NAME_EXT would > > contain the new one: > > original IFLA_NAME = eth0 > > original IFLA_NAME_EXT = eth0 > > renamed IFLA_NAME = eth0 > > renamed IFLA_NAME_EXT = enp131s0f1npf0vf22 > > so kernel side there will be 2 names for the same net_device? It often feels as a deficiency that unlike block devices where we can keep one name and create multiple symlinks based on different naming schemes, network devices can have only one name. There are aliases but AFAIK they are only used (and can be only used) for SNMP. IMHO this limitation is part of the mess that left us with so-called "predictable names" which are in practice neither persistent nor predictable. So perhaps we could introduce actual aliases (or altnames or whatever we would call them) for network devices that could be used to identify a network device whenever both kernel and userspace tool supports them. Old (and ancient) tools would have to use the one canonical name limited to current IFNAMSIZ, new tools would allow using any alias which could be longer. Michal
Re: [RFC] longer netdev names proposal
On Thu, 27 Jun 2019 10:48:08 -0700 Jakub Kicinski wrote: > On Thu, 27 Jun 2019 11:43:27 +0200, Jiri Pirko wrote: > > Hi all. > > > > In the past, there was repeatedly discussed the IFNAMSIZ (16) limit for > > netdevice name length. Now when we have PF and VF representors > > with port names like "pfXvfY", it became quite common to hit this limit: > > 0123456789012345 > > enp131s0f1npf0vf6 > > enp131s0f1npf0vf22 > > > > Since IFLA_NAME is just a string, I though it might be possible to use > > it to carry longer names as it is. However, the userspace tools, like > > iproute2, are doing checks before print out. So for example in output of > > "ip addr" when IFLA_NAME is longer than IFNAMSIZE, the netdevice is > > completely avoided. > > > > So here is a proposal that might work: > > 1) Add a new attribute IFLA_NAME_EXT that could carry names longer than > >IFNAMSIZE, say 64 bytes. The max size should be only defined in kernel, > >user should be prepared for any string size. > > 2) Add a file in sysfs that would indicate that NAME_EXT is supported by > >the kernel. > > 3) Udev is going to look for the sysfs indication file. In case when > >kernel supports long names, it will do rename to longer name, setting > >IFLA_NAME_EXT. If not, it does what it does now - fail. > > 4) There are two cases that can happen during rename: > >A) The name is shorter than IFNAMSIZ > > -> both IFLA_NAME and IFLA_NAME_EXT would contain the same string: > > original IFLA_NAME = eth0 > > original IFLA_NAME_EXT = eth0 > > renamed IFLA_NAME = enp5s0f1npf0vf1 > > renamed IFLA_NAME_EXT = enp5s0f1npf0vf1 > >B) The name is longer tha IFNAMSIZ > > -> IFLA_NAME would contain the original one, IFLA_NAME_EXT would > > contain the new one: > > original IFLA_NAME = eth0 > > original IFLA_NAME_EXT = eth0 > > renamed IFLA_NAME = eth0 > > renamed IFLA_NAME_EXT = enp131s0f1npf0vf22 > > I think B is the only way, A risks duplicate IFLA_NAMEs over ioctl, > right? 
And maybe there is some crazy application out there which > mixes netlink and ioctl. > > I guess it's not worse than status quo, given that today renames > will fail and we will either get truncated names or eth0s.. > > > This would allow the old tools to work with "eth0" and the new > > tools would work with "enp131s0f1npf0vf22". In sysfs, there would > > be symlink from one name to another. > > > > Also, there might be a warning added to kernel if someone works > > with IFLA_NAME that the userspace tool should be upgraded. > > > > Eventually, only IFLA_NAME_EXT is going to be used by everyone. > > > > I'm aware there are other places where similar new attribute > > would have to be introduced too (ip rule for example). > > I'm not saying this is a simple work. > > > > Question is what to do with the ioctl api (get ifindex etc). I would > > probably leave it as is and push tools to use rtnetlink instead. > > > > Any ideas why this would not work? Any ideas how to solve this > > differently? > > Since we'd have to update all user space to make use of the new names > I'd be tempted to move to a more structured device identification. > > 5: enp131s0f1npf0vf6: ... > > vs: > > 5: eth5 (parent enp131s0f1 pf 0 vf 6 peer X*): ... > > * ;) > > And allow filtering/selection of device based on more attributes than > just name and ifindex. In practice in container workloads, for example, > the names are already very much insufficient to identify the device. > Refocusing on attributes is probably a big effort and not that practical > for traditional CLI users? IDK > > Anyway, IMHO your scheme is strictly better than status quo. Or Cisco style naming ;-) Ethernet0/0 There is a better solution for human use already. the field ifalias allows arbitrary values and hooked into SNMP. Why not have userspace fill in this field with something by default?
Re: [RFC] longer netdev names proposal
On Thu, 27 Jun 2019 11:43:27 +0200, Jiri Pirko wrote:
> Hi all.
>
> In the past, there was repeatedly discussed the IFNAMSIZ (16) limit for
> netdevice name length. Now when we have PF and VF representors
> with port names like "pfXvfY", it became quite common to hit this limit:
> 0123456789012345
> enp131s0f1npf0vf6
> enp131s0f1npf0vf22
>
> Since IFLA_NAME is just a string, I thought it might be possible to use
> it to carry longer names as it is. However, the userspace tools, like
> iproute2, are doing checks before print out. So for example in output of
> "ip addr" when IFLA_NAME is longer than IFNAMSIZ, the netdevice is
> completely avoided.
>
> So here is a proposal that might work:
> 1) Add a new attribute IFLA_NAME_EXT that could carry names longer than
>    IFNAMSIZ, say 64 bytes. The max size should be only defined in kernel,
>    user should be prepared for any string size.
> 2) Add a file in sysfs that would indicate that NAME_EXT is supported by
>    the kernel.
> 3) Udev is going to look for the sysfs indication file. In case when
>    kernel supports long names, it will do rename to longer name, setting
>    IFLA_NAME_EXT. If not, it does what it does now - fail.
> 4) There are two cases that can happen during rename:
>    A) The name is shorter than IFNAMSIZ
>       -> both IFLA_NAME and IFLA_NAME_EXT would contain the same string:
>          original IFLA_NAME     = eth0
>          original IFLA_NAME_EXT = eth0
>          renamed  IFLA_NAME     = enp5s0f1npf0vf1
>          renamed  IFLA_NAME_EXT = enp5s0f1npf0vf1
>    B) The name is longer than IFNAMSIZ
>       -> IFLA_NAME would contain the original one, IFLA_NAME_EXT would
>          contain the new one:
>          original IFLA_NAME     = eth0
>          original IFLA_NAME_EXT = eth0
>          renamed  IFLA_NAME     = eth0
>          renamed  IFLA_NAME_EXT = enp131s0f1npf0vf22

I think B is the only way, A risks duplicate IFLA_NAMEs over ioctl,
right? And maybe there is some crazy application out there which
mixes netlink and ioctl.

I guess it's not worse than status quo, given that today renames
will fail and we will either get truncated names or eth0s..

> This would allow the old tools to work with "eth0" and the new
> tools would work with "enp131s0f1npf0vf22". In sysfs, there would
> be symlink from one name to another.
>
> Also, there might be a warning added to kernel if someone works
> with IFLA_NAME that the userspace tool should be upgraded.
>
> Eventually, only IFLA_NAME_EXT is going to be used by everyone.
>
> I'm aware there are other places where similar new attribute
> would have to be introduced too (ip rule for example).
> I'm not saying this is a simple work.
>
> Question is what to do with the ioctl api (get ifindex etc). I would
> probably leave it as is and push tools to use rtnetlink instead.
>
> Any ideas why this would not work? Any ideas how to solve this
> differently?

Since we'd have to update all user space to make use of the new names
I'd be tempted to move to a more structured device identification.

5: enp131s0f1npf0vf6: ...

vs:

5: eth5 (parent enp131s0f1 pf 0 vf 6 peer X*): ...

* ;)

And allow filtering/selection of device based on more attributes than
just name and ifindex. In practice in container workloads, for example,
the names are already very much insufficient to identify the device.
Refocusing on attributes is probably a big effort and not that practical
for traditional CLI users? IDK

Anyway, IMHO your scheme is strictly better than status quo.
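The "structured device identification" suggested above amounts to selecting a device by attributes instead of by an encoded name. A minimal sketch of that idea follows. The attribute names (`parent`, `pf`, `vf`) come from the example line in the message, but the `Netdev` record and the `select` helper are hypothetical and do not correspond to any existing rtnetlink or iproute2 interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Netdev:
    """Hypothetical device record: short name plus structured attributes."""
    ifindex: int
    name: str
    parent: Optional[str] = None
    pf: Optional[int] = None
    vf: Optional[int] = None


def select(devices, **criteria):
    """Return the devices whose attributes match every given criterion."""
    return [
        dev for dev in devices
        if all(getattr(dev, key) == value for key, value in criteria.items())
    ]


devices = [
    Netdev(ifindex=2, name="eth0"),
    Netdev(ifindex=5, name="eth5", parent="enp131s0f1", pf=0, vf=6),
    Netdev(ifindex=6, name="eth6", parent="enp131s0f1", pf=0, vf=22),
]
```

With this model, `select(devices, parent="enp131s0f1", pf=0, vf=6)` picks out the representor without the information having to fit into a 15-character name.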
Re: [RFC] longer netdev names proposal
On 6/27/19 3:43 AM, Jiri Pirko wrote:
> Hi all.
>
> In the past, there was repeatedly discussed the IFNAMSIZ (16) limit for
> netdevice name length. Now when we have PF and VF representors
> with port names like "pfXvfY", it became quite common to hit this limit:
> 0123456789012345
> enp131s0f1npf0vf6
> enp131s0f1npf0vf22

QinQ (stacked vlans) is another example.

>
> Since IFLA_NAME is just a string, I thought it might be possible to use
> it to carry longer names as it is. However, the userspace tools, like
> iproute2, are doing checks before print out. So for example in output of
> "ip addr" when IFLA_NAME is longer than IFNAMSIZ, the netdevice is
> completely avoided.
>
> So here is a proposal that might work:
> 1) Add a new attribute IFLA_NAME_EXT that could carry names longer than
>    IFNAMSIZ, say 64 bytes. The max size should be only defined in kernel,
>    user should be prepared for any string size.
> 2) Add a file in sysfs that would indicate that NAME_EXT is supported by
>    the kernel.

no sysfs files.

Johannes added infrastructure to retrieve the policy. That is a more
flexible and robust option for determining what the kernel supports.

> 3) Udev is going to look for the sysfs indication file. In case when
>    kernel supports long names, it will do rename to longer name, setting
>    IFLA_NAME_EXT. If not, it does what it does now - fail.
> 4) There are two cases that can happen during rename:
>    A) The name is shorter than IFNAMSIZ
>       -> both IFLA_NAME and IFLA_NAME_EXT would contain the same string:
>          original IFLA_NAME     = eth0
>          original IFLA_NAME_EXT = eth0
>          renamed  IFLA_NAME     = enp5s0f1npf0vf1
>          renamed  IFLA_NAME_EXT = enp5s0f1npf0vf1
>    B) The name is longer than IFNAMSIZ
>       -> IFLA_NAME would contain the original one, IFLA_NAME_EXT would
>          contain the new one:
>          original IFLA_NAME     = eth0
>          original IFLA_NAME_EXT = eth0
>          renamed  IFLA_NAME     = eth0
>          renamed  IFLA_NAME_EXT = enp131s0f1npf0vf22

so kernel side there will be 2 names for the same net_device?

>
> This would allow the old tools to work with "eth0" and the new
> tools would work with "enp131s0f1npf0vf22". In sysfs, there would
> be symlink from one name to another.

I would prefer a solution that does not rely on sysfs hooks.

>
> Also, there might be a warning added to kernel if someone works
> with IFLA_NAME that the userspace tool should be upgraded.

that seems like spam and confusion for the first few years of a new api.

>
> Eventually, only IFLA_NAME_EXT is going to be used by everyone.
>
> I'm aware there are other places where similar new attribute
> would have to be introduced too (ip rule for example).
> I'm not saying this is a simple work.
>
> Question is what to do with the ioctl api (get ifindex etc). I would
> probably leave it as is and push tools to use rtnetlink instead.

The ioctl API is going to be a limiter here. ifconfig is still quite
prevalent and net-snmp still uses ioctl (as just 2 common examples).
snmp showing one set of names and rtnetlink s/w showing another is going
to be really confusing.
Re: [RFC] longer netdev names proposal
On Thu, 2019-06-27 at 08:29 -0700, Stephen Hemminger wrote: > On Thu, 27 Jun 2019 11:43:27 +0200 > Jiri Pirko wrote: > > > Hi all. > > > > In the past, there was repeatedly discussed the IFNAMSIZ (16) limit > > for > > netdevice name length. Now when we have PF and VF representors > > with port names like "pfXvfY", it became quite common to hit this > > limit: > > 0123456789012345 > > enp131s0f1npf0vf6 > > enp131s0f1npf0vf22 > > > > Since IFLA_NAME is just a string, I though it might be possible to > > use > > it to carry longer names as it is. However, the userspace tools, > > like > > iproute2, are doing checks before print out. So for example in > > output of > > "ip addr" when IFLA_NAME is longer than IFNAMSIZE, the netdevice is > > completely avoided. > > > > So here is a proposal that might work: > > 1) Add a new attribute IFLA_NAME_EXT that could carry names longer > > than > >IFNAMSIZE, say 64 bytes. The max size should be only defined in > > kernel, > >user should be prepared for any string size. > > 2) Add a file in sysfs that would indicate that NAME_EXT is > > supported by > >the kernel. > > 3) Udev is going to look for the sysfs indication file. In case > > when > >kernel supports long names, it will do rename to longer name, > > setting > >IFLA_NAME_EXT. If not, it does what it does now - fail. 
> > 4) There are two cases that can happen during rename: > >A) The name is shorter than IFNAMSIZ > > -> both IFLA_NAME and IFLA_NAME_EXT would contain the same > > string: > > original IFLA_NAME = eth0 > > original IFLA_NAME_EXT = eth0 > > renamed IFLA_NAME = enp5s0f1npf0vf1 > > renamed IFLA_NAME_EXT = enp5s0f1npf0vf1 > >B) The name is longer tha IFNAMSIZ > > -> IFLA_NAME would contain the original one, IFLA_NAME_EXT > > would > > contain the new one: > > original IFLA_NAME = eth0 > > original IFLA_NAME_EXT = eth0 > > renamed IFLA_NAME = eth0 > > renamed IFLA_NAME_EXT = enp131s0f1npf0vf22 It makes me a bit uncomfortable to allow IFLA_NAME and IFLA_NAME_EXT to be completely different. That sounds like a big source of confusion and debugging problems in production. Dan > > This would allow the old tools to work with "eth0" and the new > > tools would work with "enp131s0f1npf0vf22". In sysfs, there would > > be symlink from one name to another. > > > > Also, there might be a warning added to kernel if someone works > > with IFLA_NAME that the userspace tool should be upgraded. > > > > Eventually, only IFLA_NAME_EXT is going to be used by everyone. > > > > I'm aware there are other places where similar new attribute > > would have to be introduced too (ip rule for example). > > I'm not saying this is a simple work. > > > > Question is what to do with the ioctl api (get ifindex etc). I > > would > > probably leave it as is and push tools to use rtnetlink instead. > > > > Any ideas why this would not work? Any ideas how to solve this > > differently? > > > > Thanks! > > > > Jiri > > > > I looked into this in the past, but then rejected it because > there are so many tools that use names, not just iproute2. > Plus long names are very user unfriendly.
Re: [RFC] longer netdev names proposal
On Thu, 27 Jun 2019 11:43:27 +0200 Jiri Pirko wrote: > Hi all. > > In the past, there was repeatedly discussed the IFNAMSIZ (16) limit for > netdevice name length. Now when we have PF and VF representors > with port names like "pfXvfY", it became quite common to hit this limit: > 0123456789012345 > enp131s0f1npf0vf6 > enp131s0f1npf0vf22 > > Since IFLA_NAME is just a string, I though it might be possible to use > it to carry longer names as it is. However, the userspace tools, like > iproute2, are doing checks before print out. So for example in output of > "ip addr" when IFLA_NAME is longer than IFNAMSIZE, the netdevice is > completely avoided. > > So here is a proposal that might work: > 1) Add a new attribute IFLA_NAME_EXT that could carry names longer than >IFNAMSIZE, say 64 bytes. The max size should be only defined in kernel, >user should be prepared for any string size. > 2) Add a file in sysfs that would indicate that NAME_EXT is supported by >the kernel. > 3) Udev is going to look for the sysfs indication file. In case when >kernel supports long names, it will do rename to longer name, setting >IFLA_NAME_EXT. If not, it does what it does now - fail. > 4) There are two cases that can happen during rename: >A) The name is shorter than IFNAMSIZ > -> both IFLA_NAME and IFLA_NAME_EXT would contain the same string: > original IFLA_NAME = eth0 > original IFLA_NAME_EXT = eth0 > renamed IFLA_NAME = enp5s0f1npf0vf1 > renamed IFLA_NAME_EXT = enp5s0f1npf0vf1 >B) The name is longer tha IFNAMSIZ > -> IFLA_NAME would contain the original one, IFLA_NAME_EXT would > contain the new one: > original IFLA_NAME = eth0 > original IFLA_NAME_EXT = eth0 > renamed IFLA_NAME = eth0 > renamed IFLA_NAME_EXT = enp131s0f1npf0vf22 > > This would allow the old tools to work with "eth0" and the new > tools would work with "enp131s0f1npf0vf22". In sysfs, there would > be symlink from one name to another. 
> > Also, there might be a warning added to kernel if someone works > with IFLA_NAME that the userspace tool should be upgraded. > > Eventually, only IFLA_NAME_EXT is going to be used by everyone. > > I'm aware there are other places where similar new attribute > would have to be introduced too (ip rule for example). > I'm not saying this is a simple work. > > Question is what to do with the ioctl api (get ifindex etc). I would > probably leave it as is and push tools to use rtnetlink instead. > > Any ideas why this would not work? Any ideas how to solve this > differently? > > Thanks! > > Jiri > I looked into this in the past, but then rejected it because there are so many tools that use names, not just iproute2. Plus long names are very user unfriendly.
[RFC] longer netdev names proposal
Hi all.

The IFNAMSIZ (16) limit on netdevice name length has been discussed
repeatedly in the past. Now that we have PF and VF representors
with port names like "pfXvfY", it has become quite common to hit this
limit:
0123456789012345
enp131s0f1npf0vf6
enp131s0f1npf0vf22

Since IFLA_NAME is just a string, I thought it might be possible to use
it to carry longer names as it is. However, userspace tools such as
iproute2 check the length before printing. For example, in the output of
"ip addr", a netdevice whose IFLA_NAME is longer than IFNAMSIZ is
omitted entirely.

So here is a proposal that might work:
1) Add a new attribute IFLA_NAME_EXT that could carry names longer than
   IFNAMSIZ, say 64 bytes. The max size should be defined only in the
   kernel; userspace should be prepared for any string size.
2) Add a file in sysfs that would indicate that NAME_EXT is supported by
   the kernel.
3) Udev is going to look for the sysfs indication file. If the kernel
   supports long names, it will rename to the longer name, setting
   IFLA_NAME_EXT. If not, it does what it does now - fail.
4) There are two cases that can happen during rename:
   A) The name is shorter than IFNAMSIZ
      -> both IFLA_NAME and IFLA_NAME_EXT would contain the same string:
         original IFLA_NAME     = eth0
         original IFLA_NAME_EXT = eth0
         renamed  IFLA_NAME     = enp5s0f1npf0vf1
         renamed  IFLA_NAME_EXT = enp5s0f1npf0vf1
   B) The name is longer than IFNAMSIZ
      -> IFLA_NAME would contain the original one, IFLA_NAME_EXT would
         contain the new one:
         original IFLA_NAME     = eth0
         original IFLA_NAME_EXT = eth0
         renamed  IFLA_NAME     = eth0
         renamed  IFLA_NAME_EXT = enp131s0f1npf0vf22

This would allow the old tools to work with "eth0" and the new
tools would work with "enp131s0f1npf0vf22". In sysfs, there would
be a symlink from one name to the other.

Also, a warning might be added to the kernel, telling anyone who works
with IFLA_NAME that the userspace tool should be upgraded.

Eventually, only IFLA_NAME_EXT is going to be used by everyone.

I'm aware there are other places where a similar new attribute
would have to be introduced too (ip rule, for example).
I'm not saying this is simple work.

The question is what to do with the ioctl API (get ifindex etc). I would
probably leave it as is and push tools to use rtnetlink instead.

Any ideas why this would not work? Any ideas how to solve this
differently?

Thanks!

Jiri
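The two rename cases in point 4 of the proposal reduce to a single length check. The sketch below models that rule only; it is not kernel code, and the function name `rename` is made up for illustration. It assumes that "shorter than IFNAMSIZ" means strictly fewer than 16 bytes, because the fixed-size name buffer also has to hold the trailing NUL.

```python
IFNAMSIZ = 16  # size of the legacy name buffer, including the trailing NUL


def rename(current_name, new_name):
    """Return (IFLA_NAME, IFLA_NAME_EXT) after a rename, per cases A and B."""
    if len(new_name.encode()) < IFNAMSIZ:
        # Case A: the new name fits, so both attributes carry it.
        return new_name, new_name
    # Case B: too long for the legacy attribute; IFLA_NAME keeps the old
    # name and only IFLA_NAME_EXT carries the new one.
    return current_name, new_name
```

Under this rule an old tool always sees a name that fits its buffer, at the cost of the two attributes diverging in case B, which is the point several replies in the thread raise.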
[RFC v2] vsock: proposal to support multiple transports at runtime
Hi all,
this is a v2 of a proposal addressing the comments made by Dexuan, Stefan,
and Jorgen.

v1: https://www.spinics.net/lists/netdev/msg570274.html

We can define two types of transport that we have to handle at the same
time (e.g. in a nested VM we would have both types of transport running
together):

- 'host->guest' transport: it runs in the host and it is used to
  communicate with the guests of a specific hypervisor (KVM, VMware or
  Hyper-V). It also runs in a guest that has nested guests, to
  communicate with them.

  [Phase 2]
  We can support multiple 'host->guest' transports running at the same
  time, but on x86 only one hypervisor uses VMX at any given time.

- 'guest->host' transport: it runs in the guest and it is used to
  communicate with the host.

The main goal is to find a way to decide which transport to use in these
cases:

1. connect() / sendto()

   a. use the 'host->guest' transport, if the destination is the guest
      (dest_cid > VMADDR_CID_HOST).

      [Phase 2]
      In order to support multiple 'host->guest' transports running at
      the same time, we should assign CIDs uniquely across all
      transports. In this way, a packet generated by the host side will
      get directed to the appropriate transport based on the CID.

   b. use the 'guest->host' transport, if the destination is the host or
      the hypervisor.
      (dest_cid == VMADDR_CID_HOST || dest_cid == VMADDR_CID_HYPERVISOR)

2. listen() / recvfrom()

   a. use the 'host->guest' transport, if the socket is bound to
      VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no
      'guest->host' transport.
      We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order to
      address this case.

      [Phase 2]
      We can support network namespaces to create independent AF_VSOCK
      addressing domains:
      - could be used to partition VMs between hypervisors or at a finer
        granularity;
      - could be used to isolate host applications from guest
        applications using the same ports with CID_ANY;

   b. use the 'guest->host' transport, if the socket is bound to a local
      CID different from VMADDR_CID_HOST (the guest CID obtained with
      IOCTL_VM_SOCKETS_GET_LOCAL_CID), or it is bound to VMADDR_CID_ANY
      (to be backward compatible).
      Also in this case, we could define a new
      VMADDR_CID_LISTEN_FROM_HOST.

   c. shared port space between transports
      For incoming requests or packets, we should be able to choose which
      transport to use by looking at the 'port' requested.

      - stream sockets already support shared port space between
        transports (one port can be assigned to only one transport)

      [Phase 2]
      - datagram sockets will support it, but for now the VMCI transport
        is the default transport for any host side datagram socket (KVM
        and Hyper-V do not yet support datagram sockets)

We will make the loading of af_vsock.ko independent of the transports, so
that it is possible to:
- create an AF_VSOCK socket without any loaded transports;
- listen on a socket (e.g. bound to VMADDR_CID_ANY) without any loaded
  transports;

Hopefully, we could move MODULE_ALIAS_NETPROTO(PF_VSOCK) from
vmci_transport.ko to af_vsock.ko.
[Jorgen will check if this will impact the existing VMware products]

Notes:
- For Hyper-V sockets, the host can only be Windows. No changes should
  be required on the Windows host to support the changes in this
  proposal.
- Communication between guests is not allowed on any transport, so we
  can drop packets sent from a guest to another guest
  (dest_cid > VMADDR_CID_HOST) if the 'host->guest' transport is not
  available.
- The [Phase 2] tag identifies things that can be done at a later stage,
  but that should be taken into account during this design.
- Namespace support will be developed in [Phase 2] or in a separate
  project.

Comments and suggestions are welcome.
I'll be on PTO for the next two weeks, so sorry in advance if I answer
late.

If we agree on this proposal, when I get back, I'll start working on the
code to get a first PATCH RFC.

Cheers,
Stefano
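Rules 1 and 2 of the proposal can be written down as two small decision functions. This is a sketch of the rules as stated in the message, not of the eventual af_vsock implementation. The CID constants mirror the values in linux/vm_sockets.h; the function names and the string labels are hypothetical. One assumption goes beyond the text: VMADDR_CID_ANY and the reserved CID 1 are rejected as connect() destinations, since the proposal only describes real guest, host and hypervisor destinations.

```python
VMADDR_CID_ANY = 0xFFFFFFFF   # -1U in linux/vm_sockets.h
VMADDR_CID_HYPERVISOR = 0
VMADDR_CID_HOST = 2

HOST_TO_GUEST = "host->guest"
GUEST_TO_HOST = "guest->host"


def transport_for_connect(dest_cid):
    """Rule 1: choose the transport for connect() / sendto()."""
    if dest_cid in (VMADDR_CID_HOST, VMADDR_CID_HYPERVISOR):
        return GUEST_TO_HOST                      # rule 1.b
    if VMADDR_CID_HOST < dest_cid < VMADDR_CID_ANY:
        return HOST_TO_GUEST                      # rule 1.a
    # Assumption: CID_ANY and the reserved CID 1 are not valid destinations.
    raise ValueError(f"unsupported destination CID {dest_cid}")


def transport_for_listen(bound_cid, guest_to_host_loaded):
    """Rule 2: choose the transport for listen() / recvfrom()."""
    if bound_cid == VMADDR_CID_HOST:
        return HOST_TO_GUEST                      # rule 2.a
    if bound_cid == VMADDR_CID_ANY:
        # 2.b keeps the old behaviour when a guest->host transport exists,
        # 2.a applies when there is none.
        return GUEST_TO_HOST if guest_to_host_loaded else HOST_TO_GUEST
    return GUEST_TO_HOST                          # rule 2.b: local guest CID
```

The VMADDR_CID_ANY case is the only one that depends on which transports are loaded, which is why the proposal floats the explicit VMADDR_CID_LISTEN_FROM_GUEST and VMADDR_CID_LISTEN_FROM_HOST values.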
Re: [RFC] vsock: proposal to support multiple transports at runtime
On Fri, May 31, 2019 at 09:24:49AM +, Jorgen Hansen wrote: > On 30 May 2019, at 13:19, Stefano Garzarella wrote: > > > > On Tue, May 28, 2019 at 04:01:00PM +, Jorgen Hansen wrote: > >>> On Thu, May 23, 2019 at 04:37:03PM +0100, Stefan Hajnoczi wrote: > >>>> On Tue, May 14, 2019 at 10:15:43AM +0200, Stefano Garzarella wrote: > > > >>>>> > >>>>> > >>>>> 2. listen() / recvfrom() > >>>>> > >>>>>a. use the 'host side transport', if the socket is bound to > >>>>> VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no > >>>>> guest transport. > >>>>> We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order > >>>>> to > >>>>> address this case. > >>>>> If we want to support multiple 'host side transport' running at > >>>>> the > >>>>> same time, we should find a way to allow an application to bound a > >>>>> specific host transport (e.g. adding new > >>>>> VMADDR_CID_LISTEN_FROM_KVM, > >>>>> VMADDR_CID_LISTEN_FROM_VMWARE, VMADDR_CID_LISTEN_FROM_HYPERV) > >>>> > >>>> Hmm...VMADDR_CID_LISTEN_FROM_KVM, VMADDR_CID_LISTEN_FROM_VMWARE, > >>>> VMADDR_CID_LISTEN_FROM_HYPERV isn't very flexible. What if my service > >>>> should only be available to a subset of VMware VMs? > >>> > >>> You're right, it is not very flexible. > >> > >> When I was last looking at this, I was considering a proposal where > >> the incoming traffic would determine which transport to use for > >> CID_ANY in the case of multiple transports. For stream sockets, we > >> already have a shared port space, so if we receive a connection > >> request for < port N, CID_ANY>, that connection would use the > >> transport of the incoming request. The transport could either be a > >> host->guest transport or the guest->host transport. This is a bit > >> harder to do for datagrams since the VSOCK port is decided by the > >> transport itself today. For VMCI, a VMCI datagram handler is allocated > >> for each datagram socket, and the ID of that handler is used as the > >> port. 
So we would potentially have to register the same datagram port > >> with all transports. > > > > So, do you think we should implement a shared port space also for > > datagram sockets? > > Yes, having the two socket types work the same way seems cleaner to me. We > should at least cover it in the design. > Okay, I'll add this point on a v2 of this proposal! > > For now only the VMWare implementation supports the datagram sockets, > > but in the future we could support it also on KVM and HyperV, so I think > > we should consider it in this proposal. > > So for now, it sounds like we could make the VMCI transport the default > transport for any host side datagram socket, then. > Yes, make sense. > >> > >> The use of network namespaces would be complimentary to this, and > >> could be used to partition VMs between hypervisors or at a finer > >> granularity. This could also be used to isolate host applications from > >> guest applications using the same ports with CID_ANY if necessary. > >> > > > > Another point to the netns support, I'll put it in the proposal (or it > > could go in parallel with the multi-transport support). > > > > It should be fine to put in the proposal that we rely on namespaces to > provide this support, but pursue namespaces as a separate project. Sure. I'll send a v2 adding all the points discussed to be sure that we are aligned. Then I'll start working on it if we agree on the proposal. Thanks, Stefano
Re: [RFC] vsock: proposal to support multiple transports at runtime
On 30 May 2019, at 13:19, Stefano Garzarella wrote: > > On Tue, May 28, 2019 at 04:01:00PM +, Jorgen Hansen wrote: >>> On Thu, May 23, 2019 at 04:37:03PM +0100, Stefan Hajnoczi wrote: >>>> On Tue, May 14, 2019 at 10:15:43AM +0200, Stefano Garzarella wrote: > >>>>> >>>>> >>>>> 2. listen() / recvfrom() >>>>> >>>>>a. use the 'host side transport', if the socket is bound to >>>>> VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no >>>>> guest transport. >>>>> We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order to >>>>> address this case. >>>>> If we want to support multiple 'host side transport' running at the >>>>> same time, we should find a way to allow an application to bound a >>>>> specific host transport (e.g. adding new VMADDR_CID_LISTEN_FROM_KVM, >>>>> VMADDR_CID_LISTEN_FROM_VMWARE, VMADDR_CID_LISTEN_FROM_HYPERV) >>>> >>>> Hmm...VMADDR_CID_LISTEN_FROM_KVM, VMADDR_CID_LISTEN_FROM_VMWARE, >>>> VMADDR_CID_LISTEN_FROM_HYPERV isn't very flexible. What if my service >>>> should only be available to a subset of VMware VMs? >>> >>> You're right, it is not very flexible. >> >> When I was last looking at this, I was considering a proposal where >> the incoming traffic would determine which transport to use for >> CID_ANY in the case of multiple transports. For stream sockets, we >> already have a shared port space, so if we receive a connection >> request for < port N, CID_ANY>, that connection would use the >> transport of the incoming request. The transport could either be a >> host->guest transport or the guest->host transport. This is a bit >> harder to do for datagrams since the VSOCK port is decided by the >> transport itself today. For VMCI, a VMCI datagram handler is allocated >> for each datagram socket, and the ID of that handler is used as the >> port. So we would potentially have to register the same datagram port >> with all transports. > > So, do you think we should implement a shared port space also for > datagram sockets? 
Yes, having the two socket types work the same way seems cleaner to me. We should at least cover it in the design. > For now only the VMWare implementation supports the datagram sockets, > but in the future we could support it also on KVM and HyperV, so I think > we should consider it in this proposal. So for now, it sounds like we could make the VMCI transport the default transport for any host side datagram socket, then. >> >> The use of network namespaces would be complimentary to this, and >> could be used to partition VMs between hypervisors or at a finer >> granularity. This could also be used to isolate host applications from >> guest applications using the same ports with CID_ANY if necessary. >> > > Another point to the netns support, I'll put it in the proposal (or it > could go in parallel with the multi-transport support). > It should be fine to put in the proposal that we rely on namespaces to provide this support, but pursue namespaces as a separate project. Thanks, Jorgen
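The quoted idea that incoming traffic determines the transport for a CID_ANY listener, on top of a shared port space, can be illustrated with a toy model. Everything below is hypothetical (`SharedPortSpace`, `listen`, `accept` are invented names); it shows only the bookkeeping: a port has one listener regardless of transport, and each accepted connection inherits the transport its request arrived on.

```python
class SharedPortSpace:
    """Toy model of one stream port space shared by all transports."""

    def __init__(self):
        self._listeners = {}  # port -> listener label

    def listen(self, port, listener):
        # A port belongs to one listener, whichever transport is used later.
        if port in self._listeners:
            raise OSError(f"port {port} already in use")
        self._listeners[port] = listener

    def accept(self, port, incoming_transport):
        """Pair the listener on `port` with the transport of the request."""
        listener = self._listeners[port]
        return listener, incoming_transport
```

In this model a single service bound to port 1234 with CID_ANY can serve a request arriving over the 'guest->host' transport and another arriving over a 'host->guest' transport, without choosing a transport at bind time.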
Re: [RFC] vsock: proposal to support multiple transports at runtime
On Tue, May 28, 2019 at 04:01:00PM +, Jorgen Hansen wrote: > > On Thu, May 23, 2019 at 04:37:03PM +0100, Stefan Hajnoczi wrote: > > > On Tue, May 14, 2019 at 10:15:43AM +0200, Stefano Garzarella wrote: > > > > Hi guys, > > > > I'm currently interested on implement a multi-transport support for > > > > VSOCK in > > > > order to handle nested VMs. > > Thanks for picking this up! > :) > > > > > > > > As Stefan suggested me, I started to look at this discussion: > > > > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org%2Flkml%2F2017%2F8%2F17%2F551&data=02%7C01%7Cjhansen%40vmware.com%7Cc2a340a868bb4525c6d408d6e2905909%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636945506938670252&sdata=kl820ZF1AAOXEyCZYoNPpYmLVyvK3ISr1GT0oDODEn4%3D&reserved=0 > > > > Below I tried to summarize a proposal for a discussion, following the > > > > ideas > > > > from Dexuan, Jorgen, and Stefan. > > > > > > > > > > > > We can define two types of transport that we have to handle at the same > > > > time > > > > (e.g. in a nested VM we would have both types of transport running > > > > together): > > > > > > > > - 'host side transport', it runs in the host and it is used to > > > > communicate with > > > > the guests of a specific hypervisor (KVM, VMWare or HyperV) > > > > > > > > Should we support multiple 'host side transport' running at the same > > > > time? > > > > > > > > - 'guest side transport'. it runs in the guest and it is used to > > > > communicate > > > > with the host transport > > > > > > I find this terminology confusing. Perhaps "host->guest" (your 'host > > > side transport') and "guest->host" (your 'guest side transport') is > > > clearer? > > > > I agree, "host->guest" and "guest->host" are better, I'll use them. > > > > > > > > Or maybe the nested virtualization terminology of L2 transport (your > > > 'host side transport') and L0 transport (your 'guest side transport')? > > > Here we are the L1 guest and L0 is the host and L2 is our nested guest. 
> > > > > > > I'm confused, if L2 is the nested guest, it should be the > > 'guest side transport'. Did I miss anything? > > > > Maybe it is another point to your first proposal :) > > > > > > > > > > > > > > The main goal is to find a way to decide what transport use in these > > > > cases: > > > > 1. connect() / sendto() > > > > > > > > a. use the 'host side transport', if the destination is the guest > > > >(dest_cid > VMADDR_CID_HOST). > > > >If we want to support multiple 'host side transport' running at > > > > the > > > >same time, we should assign CIDs uniquely across all transports. > > > >In this way, a packet generated by the host side will get > > > > directed > > > >to the appropriate transport based on the CID > > > > > > The multiple host side transport case is unlikely to be necessary on x86 > > > where only one hypervisor uses VMX at any given time. But eventually it > > > may happen so it's wise to at least allow it in the design. > > > > > > > Okay, I was in doubt, but I'll keep it in the design. > > > > > > > > > > b. use the 'guest side transport', if the destination is the host > > > >(dest_cid == VMADDR_CID_HOST) > > > > > > Makes sense to me. > > > > > Agreed. With the addition that VMADDR_CID_HYPERVISOR is also routed as > "guest->host/guest side transport". > Yes, I had it in mind, but I forgot to write it in the proposal. > >> > > >> > > >> > 2. listen() / recvfrom() > > > > > >> > a. use the 'host side transport', if the socket is bound to > > > >VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no > > > >guest transport. > > > >We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order > > > > to > > > >address this case. > > > >If we want t
Re: [RFC] vsock: proposal to support multiple transports at runtime
> On Thu, May 23, 2019 at 04:37:03PM +0100, Stefan Hajnoczi wrote: > > On Tue, May 14, 2019 at 10:15:43AM +0200, Stefano Garzarella wrote: > > > Hi guys, > > > I'm currently interested on implement a multi-transport support for VSOCK > > > in > > > order to handle nested VMs. Thanks for picking this up! > > > > > > As Stefan suggested me, I started to look at this discussion: > > > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org%2Flkml%2F2017%2F8%2F17%2F551&data=02%7C01%7Cjhansen%40vmware.com%7Cc2a340a868bb4525c6d408d6e2905909%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636945506938670252&sdata=kl820ZF1AAOXEyCZYoNPpYmLVyvK3ISr1GT0oDODEn4%3D&reserved=0 > > > Below I tried to summarize a proposal for a discussion, following the > > > ideas > > > from Dexuan, Jorgen, and Stefan. > > > > > > > > > We can define two types of transport that we have to handle at the same > > > time > > > (e.g. in a nested VM we would have both types of transport running > > > together): > > > > > > - 'host side transport', it runs in the host and it is used to > > > communicate with > > > the guests of a specific hypervisor (KVM, VMWare or HyperV) > > > > > > Should we support multiple 'host side transport' running at the same > > > time? > > > > > > - 'guest side transport'. it runs in the guest and it is used to > > > communicate > > > with the host transport > > > > I find this terminology confusing. Perhaps "host->guest" (your 'host > > side transport') and "guest->host" (your 'guest side transport') is > > clearer? > > I agree, "host->guest" and "guest->host" are better, I'll use them. > > > > > Or maybe the nested virtualization terminology of L2 transport (your > > 'host side transport') and L0 transport (your 'guest side transport')? > > Here we are the L1 guest and L0 is the host and L2 is our nested guest. > > > > I'm confused, if L2 is the nested guest, it should be the > 'guest side transport'. Did I miss anything? 
> > Maybe it is another point to your first proposal :) > > > > > > > > > > The main goal is to find a way to decide what transport use in these > > > cases: > > > 1. connect() / sendto() > > > > > > a. use the 'host side transport', if the destination is the guest > > >(dest_cid > VMADDR_CID_HOST). > > >If we want to support multiple 'host side transport' running at the > > >same time, we should assign CIDs uniquely across all transports. > > >In this way, a packet generated by the host side will get directed > > >to the appropriate transport based on the CID > > > > The multiple host side transport case is unlikely to be necessary on x86 > > where only one hypervisor uses VMX at any given time. But eventually it > > may happen so it's wise to at least allow it in the design. > > > > Okay, I was in doubt, but I'll keep it in the design. > > > > > > > b. use the 'guest side transport', if the destination is the host > > >(dest_cid == VMADDR_CID_HOST) > > > > Makes sense to me. > > Agreed. With the addition that VMADDR_CID_HYPERVISOR is also routed as "guest->host/guest side transport". >> > >> > >> > 2. listen() / recvfrom() > > > >> > a. use the 'host side transport', if the socket is bound to > > >VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no > > >guest transport. > > >We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order to > > >address this case. > > >If we want to support multiple 'host side transport' running at the > > >same time, we should find a way to allow an application to bound a > > >specific host transport (e.g. adding new > > > VMADDR_CID_LISTEN_FROM_KVM, > > >VMADDR_CID_LISTEN_FROM_VMWARE, VMADDR_CID_LISTEN_FROM_HYPERV) > > > > Hmm...VMADDR_CID_LISTEN_FROM_KVM, VMADDR_CID_LISTEN_FROM_VMWARE, > > VMADDR_CID_LISTEN_FROM_HYPERV isn't very flexible. What if my service > > should only be available to a subset of VMware VMs? > > You're right, it is not very flexible. When I was
Re: [RFC] vsock: proposal to support multiple transports at runtime
On Thu, May 23, 2019 at 04:37:03PM +0100, Stefan Hajnoczi wrote: > On Tue, May 14, 2019 at 10:15:43AM +0200, Stefano Garzarella wrote: > > Hi guys, > > I'm currently interested on implement a multi-transport support for VSOCK in > > order to handle nested VMs. > > > > As Stefan suggested me, I started to look at this discussion: > > https://lkml.org/lkml/2017/8/17/551 > > Below I tried to summarize a proposal for a discussion, following the ideas > > from Dexuan, Jorgen, and Stefan. > > > > > > We can define two types of transport that we have to handle at the same time > > (e.g. in a nested VM we would have both types of transport running > > together): > > > > - 'host side transport', it runs in the host and it is used to communicate > > with > > the guests of a specific hypervisor (KVM, VMWare or HyperV) > > > > Should we support multiple 'host side transport' running at the same time? > > > > - 'guest side transport'. it runs in the guest and it is used to communicate > > with the host transport > > I find this terminology confusing. Perhaps "host->guest" (your 'host > side transport') and "guest->host" (your 'guest side transport') is > clearer? I agree, "host->guest" and "guest->host" are better, I'll use them. > > Or maybe the nested virtualization terminology of L2 transport (your > 'host side transport') and L0 transport (your 'guest side transport')? > Here we are the L1 guest and L0 is the host and L2 is our nested guest. > I'm confused, if L2 is the nested guest, it should be the 'guest side transport'. Did I miss anything? Maybe it is another point to your first proposal :) > > > > > > The main goal is to find a way to decide what transport use in these cases: > > 1. connect() / sendto() > > > > a. use the 'host side transport', if the destination is the guest > >(dest_cid > VMADDR_CID_HOST). > >If we want to support multiple 'host side transport' running at the > >same time, we should assign CIDs uniquely across all transports. 
> >In this way, a packet generated by the host side will get directed > >to the appropriate transport based on the CID > > The multiple host side transport case is unlikely to be necessary on x86 > where only one hypervisor uses VMX at any given time. But eventually it > may happen so it's wise to at least allow it in the design. > Okay, I was in doubt, but I'll keep it in the design. > > > > b. use the 'guest side transport', if the destination is the host > >(dest_cid == VMADDR_CID_HOST) > > Makes sense to me. > > > > > > > 2. listen() / recvfrom() > > > > a. use the 'host side transport', if the socket is bound to > >VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no > >guest transport. > >We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order to > >address this case. > >If we want to support multiple 'host side transport' running at the > >same time, we should find a way to allow an application to bound a > >specific host transport (e.g. adding new VMADDR_CID_LISTEN_FROM_KVM, > >VMADDR_CID_LISTEN_FROM_VMWARE, VMADDR_CID_LISTEN_FROM_HYPERV) > > Hmm...VMADDR_CID_LISTEN_FROM_KVM, VMADDR_CID_LISTEN_FROM_VMWARE, > VMADDR_CID_LISTEN_FROM_HYPERV isn't very flexible. What if my service > should only be available to a subset of VMware VMs? You're right, it is not very flexible. > > Instead it might be more appropriate to use network namespaces to create > independent AF_VSOCK addressing domains. Then you could have two > separate groups of VMware VMs and selectively listen to just one group. > Does AF_VSOCK support network namespace or it could be another improvement to take care? (IIUC is not currently supported) A possible issue that I'm seeing with netns is if they are used for other purpose (e.g. to isolate the network of a VM), we should have multiple instances of the application, one per netns. > > > > b. 
use the 'guest side transport', if the socket is bound to local CID > >different from the VMADDR_CID_HOST (guest CID get with > >IOCTL_VM_SOCKETS_GET_LOCAL_CID), or it is bound to VMADDR_CID_ANY > >(to be backward compatible). > >Also in this case, we could define a new VMADDR_CID_LISTEN_FROM_HOST. >
Re: [RFC] vsock: proposal to support multiple transports at runtime
On Tue, May 14, 2019 at 10:15:43AM +0200, Stefano Garzarella wrote:
> Hi guys,
> I'm currently interested on implement a multi-transport support for VSOCK in
> order to handle nested VMs.
>
> As Stefan suggested me, I started to look at this discussion:
> https://lkml.org/lkml/2017/8/17/551
> Below I tried to summarize a proposal for a discussion, following the ideas
> from Dexuan, Jorgen, and Stefan.
>
>
> We can define two types of transport that we have to handle at the same time
> (e.g. in a nested VM we would have both types of transport running together):
>
> - 'host side transport', it runs in the host and it is used to communicate
>   with the guests of a specific hypervisor (KVM, VMWare or HyperV)
>
>   Should we support multiple 'host side transport' running at the same time?
>
> - 'guest side transport'. it runs in the guest and it is used to communicate
>   with the host transport

I find this terminology confusing. Perhaps "host->guest" (your 'host
side transport') and "guest->host" (your 'guest side transport') is
clearer?

Or maybe the nested virtualization terminology of L2 transport (your
'host side transport') and L0 transport (your 'guest side transport')?
Here we are the L1 guest and L0 is the host and L2 is our nested guest.

>
>
> The main goal is to find a way to decide what transport use in these cases:
> 1. connect() / sendto()
>
>    a. use the 'host side transport', if the destination is the guest
>       (dest_cid > VMADDR_CID_HOST).
>       If we want to support multiple 'host side transport' running at the
>       same time, we should assign CIDs uniquely across all transports.
>       In this way, a packet generated by the host side will get directed
>       to the appropriate transport based on the CID

The multiple host side transport case is unlikely to be necessary on x86
where only one hypervisor uses VMX at any given time. But eventually it
may happen so it's wise to at least allow it in the design.

>
>    b. use the 'guest side transport', if the destination is the host
>       (dest_cid == VMADDR_CID_HOST)

Makes sense to me.

>
>
> 2. listen() / recvfrom()
>
>    a. use the 'host side transport', if the socket is bound to
>       VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no
>       guest transport.
>       We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order to
>       address this case.
>       If we want to support multiple 'host side transport' running at the
>       same time, we should find a way to allow an application to bound a
>       specific host transport (e.g. adding new VMADDR_CID_LISTEN_FROM_KVM,
>       VMADDR_CID_LISTEN_FROM_VMWARE, VMADDR_CID_LISTEN_FROM_HYPERV)

Hmm...VMADDR_CID_LISTEN_FROM_KVM, VMADDR_CID_LISTEN_FROM_VMWARE,
VMADDR_CID_LISTEN_FROM_HYPERV isn't very flexible. What if my service
should only be available to a subset of VMware VMs?

Instead it might be more appropriate to use network namespaces to create
independent AF_VSOCK addressing domains. Then you could have two
separate groups of VMware VMs and selectively listen to just one group.

>
>    b. use the 'guest side transport', if the socket is bound to local CID
>       different from the VMADDR_CID_HOST (guest CID get with
>       IOCTL_VM_SOCKETS_GET_LOCAL_CID), or it is bound to VMADDR_CID_ANY
>       (to be backward compatible).
>       Also in this case, we could define a new VMADDR_CID_LISTEN_FROM_HOST.

Two additional topics:

1. How will loading af_vsock.ko change? In particular, can an
   application create a socket in af_vsock.ko without any loaded
   transport? Can it enter listen state without any loaded transport
   (this seems useful with VMADDR_CID_ANY)?

2. Does your proposed behavior match VMware's existing nested vsock
   semantics?
Re: [RFC] vsock: proposal to support multiple transports at runtime
Hi Dexuan,

On Thu, May 16, 2019 at 09:48:11PM +, Dexuan Cui wrote:
> > From: Stefano Garzarella
> > Sent: Tuesday, May 14, 2019 1:16 AM
> > To: netdev@vger.kernel.org; Stefan Hajnoczi ; Dexuan
> >
> > Hi guys,
> > I'm currently interested on implement a multi-transport support for VSOCK in
> > order to handle nested VMs.
>
> Hi Stefano,
> Thanks for reviving the discussion! :-)
>

You're welcome :)

> I don't know a lot about the details of kvm/vmware sockets, but please let me
> share my understanding about them, and let me also share some details about
> hyper-v sockets, which I think should be the simplest:
>
> 1) For hyper-v sockets, the "host" can only be Windows. We can do nothing on
> the Windows host, and I guess we need to do nothing there.

I agree that for the Windows host we shouldn't change anything.

>
> 2) For hyper-v sockets, I think we only care about Linux guest, and the guest
> can only talk to the host; a guest can not talk to another guest running on
> the same host.

Also for KVM (virtio) a guest can talk only with the host.

>
> 3) On a hyper-v host, if the guest is running kvm/vmware (i.e. nested
> virtualization), I think in the "KVM guest" the Linux hyper-v transport
> driver needs to load so that the guest can talk to the host (I'm not sure
> about "vmware guest" in this case); the "KVM guest" also needs to load the
> kvm transport drivers so that it can talk to its child VMs (I'm not sure
> abut "vmware guest" in this case).

Okay, so since in the "KVM guest" we will have both hyper-v and kvm
transports, we should implement a way to decide what transport use in
the cases that I described in the first email.

>
> 4) On kvm/vmware, if the guest is a Windows guest, I think we can do nothing
> in the guest;

Yes, the driver in Windows guest shouldn't change.

> if the guest is Linux guest, I think the kvm/vmware transport drivers
> should load; if the Linux guest is running kvm/vmware (nested
> virtualization), I think the proper "to child VMs" versions of the
> kvm/vmware transport drivers need to load.

Exactly, and for the KVM side is the vhost-vsock driver. So, as the
point 3, we should support at least two transports running in Linux at
the same time.

Thank you very much to share these information!

Cheers,
Stefano
RE: [RFC] vsock: proposal to support multiple transports at runtime
> From: Stefano Garzarella
> Sent: Tuesday, May 14, 2019 1:16 AM
> To: netdev@vger.kernel.org; Stefan Hajnoczi ; Dexuan
>
> Hi guys,
> I'm currently interested on implement a multi-transport support for VSOCK in
> order to handle nested VMs.

Hi Stefano,
Thanks for reviving the discussion! :-)

I don't know a lot about the details of kvm/vmware sockets, but please let me
share my understanding of them, and let me also share some details about
hyper-v sockets, which I think should be the simplest:

1) For hyper-v sockets, the "host" can only be Windows. We can do nothing on
the Windows host, and I guess we need to do nothing there.

2) For hyper-v sockets, I think we only care about Linux guests, and the guest
can only talk to the host; a guest cannot talk to another guest running on the
same host.

3) On a hyper-v host, if the guest is running kvm/vmware (i.e. nested
virtualization), I think in the "KVM guest" the Linux hyper-v transport driver
needs to load so that the guest can talk to the host (I'm not sure about
"vmware guest" in this case); the "KVM guest" also needs to load the kvm
transport drivers so that it can talk to its child VMs (I'm not sure about
"vmware guest" in this case).

4) On kvm/vmware, if the guest is a Windows guest, I think we can do nothing
in the guest; if the guest is a Linux guest, I think the kvm/vmware transport
drivers should load; if the Linux guest is running kvm/vmware (nested
virtualization), I think the proper "to child VMs" versions of the kvm/vmware
transport drivers need to load.

Thanks,
-- Dexuan
[RFC] vsock: proposal to support multiple transports at runtime
Hi guys,
I'm currently interested in implementing multi-transport support for VSOCK
in order to handle nested VMs.

As Stefan suggested, I started by looking at this discussion:
https://lkml.org/lkml/2017/8/17/551
Below I have tried to summarize a proposal for discussion, following the
ideas from Dexuan, Jorgen, and Stefan.


We can define two types of transport that we have to handle at the same time
(e.g. in a nested VM we would have both types of transport running together):

- 'host side transport': it runs in the host and is used to communicate
  with the guests of a specific hypervisor (KVM, VMware or Hyper-V).

  Should we support multiple 'host side transports' running at the same time?

- 'guest side transport': it runs in the guest and is used to communicate
  with the host transport.


The main goal is to find a way to decide which transport to use in these
cases:

1. connect() / sendto()

   a. use the 'host side transport', if the destination is the guest
      (dest_cid > VMADDR_CID_HOST).
      If we want to support multiple 'host side transports' running at the
      same time, we should assign CIDs uniquely across all transports.
      In this way, a packet generated by the host side will get directed
      to the appropriate transport based on the CID.

   b. use the 'guest side transport', if the destination is the host
      (dest_cid == VMADDR_CID_HOST).


2. listen() / recvfrom()

   a. use the 'host side transport', if the socket is bound to
      VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no
      guest transport.
      We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order to
      address this case.
      If we want to support multiple 'host side transports' running at the
      same time, we should find a way to allow an application to bind to a
      specific host transport (e.g. by adding new VMADDR_CID_LISTEN_FROM_KVM,
      VMADDR_CID_LISTEN_FROM_VMWARE, VMADDR_CID_LISTEN_FROM_HYPERV).

   b. use the 'guest side transport', if the socket is bound to a local CID
      different from VMADDR_CID_HOST (the guest CID obtained with
      IOCTL_VM_SOCKETS_GET_LOCAL_CID), or it is bound to VMADDR_CID_ANY
      (to be backward compatible).
      Also in this case, we could define a new VMADDR_CID_LISTEN_FROM_HOST.

Thanks in advance for your comments and suggestions.

Cheers,
Stefano
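
The selection rules in the proposal above can be sketched roughly as follows. This is only an illustration of the rules as written, not code from af_vsock: the enum, the function names and the `guest_transport_loaded` flag are made up for this sketch, and the two CID values are copied locally (they match VMADDR_CID_HOST and VMADDR_CID_ANY in <linux/vm_sockets.h>) so the snippet builds without kernel headers. CIDs that the proposal does not cover are reported as "none".

```c
#include <stdint.h>

/* Local copies of the values from <linux/vm_sockets.h>. */
#define SKETCH_CID_ANY  UINT32_MAX  /* VMADDR_CID_ANY, i.e. -1U */
#define SKETCH_CID_HOST 2u          /* VMADDR_CID_HOST */

/* Hypothetical result type: which kind of transport would handle the socket. */
enum sketch_transport {
	SKETCH_NONE,        /* case not covered by the proposal */
	SKETCH_HOST_SIDE,   /* 'host side transport' (host->guest) */
	SKETCH_GUEST_SIDE,  /* 'guest side transport' (guest->host) */
};

/* Rule 1: connect() / sendto(), decided by the destination CID only. */
enum sketch_transport sketch_pick_for_connect(uint32_t dest_cid)
{
	if (dest_cid == SKETCH_CID_HOST)
		return SKETCH_GUEST_SIDE;   /* 1.b: destination is the host */
	if (dest_cid > SKETCH_CID_HOST)
		return SKETCH_HOST_SIDE;    /* 1.a: destination is a guest */
	return SKETCH_NONE;                 /* CIDs below HOST: not specified */
}

/* Rule 2: listen() / recvfrom(), decided by the CID the socket is bound to. */
enum sketch_transport sketch_pick_for_listen(uint32_t bound_cid,
					     int guest_transport_loaded)
{
	if (bound_cid == SKETCH_CID_HOST)
		return SKETCH_HOST_SIDE;    /* 2.a: bound to VMADDR_CID_HOST */
	if (bound_cid == SKETCH_CID_ANY)    /* 2.a / 2.b: depends on guest transport */
		return guest_transport_loaded ? SKETCH_GUEST_SIDE
					      : SKETCH_HOST_SIDE;
	/* 2.b: bound to a local guest CID, which needs a guest transport. */
	return guest_transport_loaded ? SKETCH_GUEST_SIDE : SKETCH_NONE;
}
```

Note that with these rules a socket bound to VMADDR_CID_ANY inside a nested guest only ever listens on the guest side, which is the case the proposed VMADDR_CID_LISTEN_FROM_GUEST constant is meant to address.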
Re: kernel tls interface with user space modification proposal
Hi Vakul, +TLS maintainers I suggest you send this to TLS maintainers if you want to get more feedback, and it would be best to tag this as RFC. On 3/5/2019 9:56 AM, Vakul Garg wrote: > Hi > > The present interface of kernel tls with user space has few shortcomings. > > The biggest one is that when we need to add a ciphersuite in kernel tls, then > we need to define new structures for passing cryptographic parameters > required by record layer. > And the user space ssl stack also has to be modified because it tries to use > kernel tls only for a given set of ciphers implemented it it. > As all TLS versions below 1.2 are being deprecated, and with TLS1.3 supporting only 5 ciphersuites based on AES-GCM, AES-CCM and Chacha-Poly. I think that it is safe to go forward based on the existing model for these ciphers, while not supporting any other (older ciphers). If we were to try and support all the available ciphers, then it might make sense to have a generic infrastructure for this. > A better schema could be that if kernel tls support is compiled/enabled in > user space SSL stack, it tries to use it for all record layer ciphers. > If kernel tls does not support a given cipher, then setsockopt fails and SSL > stack can fallback to non-ktls mode for the session and subsequent ones using > same cipher type. > > This would require passing the crypto material in a generic form which is same for all cipher types. > > My proposal is that at the sestsockopt interface, instead of passing discrete > keys/salt/IV etc of certain lengths (which are different for each cipher), we > pass the cipher type and the full keyblock (128 bytes). > Thereafter, kernel tls chops the keyblock into keys/iv/salt which are defined > by the given cipher type. > > (The keyblock is derived by SSL stack from master secret and then segmented > in to keys/IV/salt). > Does this work for TLS1.3? 
> This would keep the interface between ktls and user space software > independent of cipher types supported by kernel tls. > > Further, it is redundant to pass same TLS version, cipher type info in both > Rx and Tx direction. > > I propose that we define an additional setsockopt interface for passing > crypto params in both directions. Additional interfaces double the maintenance effort, and I'm not sure it is interesting to support any of the ciphers besides the once used by TLS1.3. > This setsockopt() would be invoked by SSL stack after handshake is deemed > completed to start record protocol offload in both directions. > > struct tls_rec_prot_info { >unsigned short version; >unsigned short cipher_type; >unsigned char keyblock[128]; >unsigned char tx_seq[8]; >unsigned char rx_seq[8]; >}; > > setsockopt(sock, SOL_TLS, TLS_INFO, &rec_prot_info, sizeof(rec_prot_info)); > > Kindly advise. > > Regards > > Vakul >
[RFC][Proposal] BPF Control MAP
In this proposal I am going to address the lack of a unified user API for
accessing and manipulating BPF system attributes. While this proposal is
generic and will work on any BPF subsystem (eBPF attach points), I will
mostly focus on XDP use cases.

Lately I started working on three different XDP open issues, namely XDP
statistics, XDP redirect and XDP meta-data. The details of these issues
are not really relevant for the sake of this proposal, but all of them share
one common problem: the lack of a unified user interface to manipulate and
access their attributes.

Examples:
1. Query XDP statistics.
2. XDP resource management, Setup XDP-redirect TX resources.
3. Setup and query XDP-metadata - (BTF data structure).

Jesper Brouer explains some of these issues in detail at:
https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org

Yes, I considered netlink, devlink, ethtool, sysctl, etc., but each one of
them has its own drawbacks: they are networking specific and will not serve
the general BPF purpose. What we want is for all of the BPF related knobs to
be present in the BPF user tools: bcc, bpftool and libbpf. Ideally we don't
want these tools to integrate with all the different subsystems' UAPIs,
especially the wide variety of networking UAPIs, and imagine what other
subsystems are going to be using ..

So what seems to be the right path here is a unified BPF
control/configuration user interface, which will hook the caller with the
targeted subsystem.

To be aligned with all existing BPF tools I am going to propose the use of
the BPF syscall (No, not a new BPF syscall command, I am not planning to
reinvent the wheel - "again" -). What I am going to suggest is to use an
already existing API which runs on top of the BPF syscall, the BPF MAPs API,
with just a small tweak.

Enter: BPF control MAP:

A special type of MAP, "BPF_MAP_TYPE_CONTROL". This map will not behave like
other maps in the sense of having a user defined data structure behind it;
we are going to use it just to hook the user with the targeted underlying
subsystem and delegate user commands to it through map operations
(create/update_elem/lookup_elem/etc ...)

Requirements and implementation details:

1) Hook the user with the targeted subsystem:
   - On create map, the user selects the BPF_MAP_TYPE_CONTROL map type and
     sets map_attr.ctrl_type to be the subsystem he wants to access and
     manipulate (KPROBE/CGROUP/SOCKET_FILTER/XDP/etc..).

2) Set and Get operations of a specific BPF subsystem or an object in that
   subsystem (for example a netdev in XDP).
   - The user will use the file descriptor retrieved on map creation to
     access (Set/Get) the BPF subsystem attributes via map update_elem and
     lookup_elem operations. The key will be the object id (example:
     ifindex, or just the type of configuration to access); keys and values
     are subsystem dependent.

3) Iterate through the different attributes/objects of the subsystem.
   Use case: XDP BPF subsystem, get ALL netdevs' XDP attributes/statistics.
   This can be easily achieved with: bpf_map_get_next_key.

Advantages & Motivation:

Why BPF MAP and not just a plain new BPF syscall command or any other
existing UAPI:

0) All BPF users love maps and got used to them, and simply, everything is
   a map: system objects can be keys and their attributes can be values.

1) **BTF** integration: any map (key, value) pair can be described in BTF
   at kernel level and can be attached to the map the user creates. This
   will be a huge advantage for user forward compatibility, and for
   development convenience, as there is no need to copy kernel uapi headers
   on each attribute set update, and it simplifies ABI compatibility. New
   values or attributes can be dumped/parsed in user space with zero effort,
   with no need to constantly update user space tools.

2) BPF maps already laid the groundwork for our requirements as the
   infrastructure, and have the semantics that we are looking for (set/get).

3) Already integrated in user-space tools and libraries such as bcc/libbpf
   and friends; what is missing is just this small tweak (in the kernel) to
   hook one special map type with the underlying BPF subsystems.

Thoughts ?

[Some EXTRAs]

Example use cases (XDP only for now):

1) Query XDP stats of all XDP netdevs:

    xdp_ctrl = bpf_create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type = XDP_STATS);

    while (bpf_map_get_next_key(xdp_ctrl, &ifindex, &next_ifindex) == 0) {
        bpf_map_lookup_elem(xdp_ctrl, &next_ifindex, &stats);
        // we don't even need to know stats format in this case
        btf_pretty_print(xdp_ctrl->btf, &stats);
        ifindex = next_ifindex;
    }

2) Setup XDP tx redirect resources on egress netdev (netdev with no XDP
   program).

    xdp_ctrl = bpf_create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type = XDP_ATTR);

    xdp_attr->command = SETUP_REDIRECT;
    xdp_attr->rings.num = 12;
    xdp_attr->rings.size = 128;
    bpf_map_update_elem(xdp_ctrl, &ifindex, &xdp_attr);

3) Turn On/Off
kernel tls interface with user space modification proposal
Hi

The present interface of kernel tls with user space has a few shortcomings.

The biggest one is that when we need to add a ciphersuite in kernel tls, we
need to define new structures for passing the cryptographic parameters
required by the record layer.
The user space ssl stack also has to be modified, because it tries to use
kernel tls only for a given set of ciphers implemented in it.

A better scheme could be that if kernel tls support is compiled/enabled in
the user space SSL stack, it tries to use it for all record layer ciphers.
If kernel tls does not support a given cipher, then setsockopt fails and the
SSL stack can fall back to non-ktls mode for the session and subsequent ones
using the same cipher type.

This would require passing the crypto material in a generic form which is the
same for all cipher types.

My proposal is that at the setsockopt interface, instead of passing discrete
keys/salt/IV etc. of certain lengths (which are different for each cipher), we
pass the cipher type and the full keyblock (128 bytes).
Thereafter, kernel tls chops the keyblock into the keys/iv/salt which are
defined by the given cipher type.

(The keyblock is derived by the SSL stack from the master secret and then
segmented into keys/IV/salt.)

This would keep the interface between ktls and user space software
independent of the cipher types supported by kernel tls.

Further, it is redundant to pass the same TLS version and cipher type info in
both the Rx and Tx directions.

I propose that we define an additional setsockopt interface for passing
crypto params in both directions.
This setsockopt() would be invoked by the SSL stack after the handshake is
deemed completed, to start record protocol offload in both directions.

    struct tls_rec_prot_info {
        unsigned short version;
        unsigned short cipher_type;
        unsigned char keyblock[128];
        unsigned char tx_seq[8];
        unsigned char rx_seq[8];
    };

    setsockopt(sock, SOL_TLS, TLS_INFO, &rec_prot_info, sizeof(rec_prot_info));

Kindly advise.

Regards

Vakul
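
To illustrate what "chopping the keyblock" could look like for one cipher, here is a minimal sketch for AES-128-GCM under TLS 1.2. It assumes the standard TLS 1.2 key block ordering (client write key, server write key, client write IV, server write IV, with zero-length MAC keys for AEAD ciphers), where the 4-byte write IV is what kernel tls calls the salt. The cipher id, struct and function names are hypothetical and only serve the example; nothing like this exists in kernel tls today, and TLS 1.3 derives its traffic keys differently.

```c
#include <stddef.h>
#include <string.h>

enum {
	SK_KEYBLOCK_LEN = 128,  /* size proposed for the generic keyblock */
	SK_GCM128_KEY_LEN = 16, /* AES-128 key */
	SK_GCM128_SALT_LEN = 4, /* implicit part of the GCM nonce */
};

/* Hypothetical cipher id for this sketch; not a kernel constant. */
#define SK_CIPHER_AES_GCM_128 1

/* Per-direction material extracted from the keyblock. */
struct sk_dir_keys {
	unsigned char key[SK_GCM128_KEY_LEN];
	unsigned char salt[SK_GCM128_SALT_LEN];
};

/*
 * Split a TLS 1.2 key block into client and server write keys and salts.
 * Returns 0 on success, -1 if the cipher is not handled by this sketch,
 * which is where a real implementation would make setsockopt() fail so
 * that the SSL stack can fall back to non-ktls mode.
 */
int sk_split_keyblock(int cipher, const unsigned char kb[SK_KEYBLOCK_LEN],
		      struct sk_dir_keys *client, struct sk_dir_keys *server)
{
	size_t off = 0;

	if (cipher != SK_CIPHER_AES_GCM_128)
		return -1;

	memcpy(client->key, kb + off, SK_GCM128_KEY_LEN);
	off += SK_GCM128_KEY_LEN;
	memcpy(server->key, kb + off, SK_GCM128_KEY_LEN);
	off += SK_GCM128_KEY_LEN;
	memcpy(client->salt, kb + off, SK_GCM128_SALT_LEN);
	off += SK_GCM128_SALT_LEN;
	memcpy(server->salt, kb + off, SK_GCM128_SALT_LEN);

	return 0;
}
```

Only the first 40 bytes of the 128-byte keyblock are consumed for this cipher. Which half is used for Tx and which for Rx depends on whether the socket is the client or the server end, so the interface would also need to convey the endpoint role.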
Re: Security enhancement proposal for kernel TLS
On 08/02/18 05:23 PM, Vakul Garg wrote:
> > I agree that Boris' patch does what you say it does - it sets keys
> > immediately after CCS instead of after FINISHED message. I disagree that
> > the kernel tls implementation currently requires that specific ordering,
> > nor do I think that it should require that ordering.
>
> The current kernel implementation assumes record sequence number to start
> from '0'.
> If keys have to be set after FINISHED message, then record sequence number
> need to be communicated from user space TLS stack to kernel. IIRC, sequence
> number is not part of the interface through which key is transferred.

The setsockopt call struct takes the key, iv, salt, and seqno:

    struct tls12_crypto_info_aes_gcm_128 {
        struct tls_crypto_info info;
        unsigned char iv[TLS_CIPHER_AES_GCM_128_IV_SIZE];
        unsigned char key[TLS_CIPHER_AES_GCM_128_KEY_SIZE];
        unsigned char salt[TLS_CIPHER_AES_GCM_128_SALT_SIZE];
        unsigned char rec_seq[TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE];
    };
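
As a small illustration of the point about rec_seq: a user space stack that installs the keys only after FINISHED could fill that field with its current record sequence number instead of zero. The sketch below assumes the field is 8 bytes wide and holds the 64-bit TLS sequence number in network (big-endian) byte order; the helper name and the size macro are made up for this example and are not part of the kernel uAPI.

```c
#include <stdint.h>

/* Assumed to match TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE. */
#define SK_REC_SEQ_SIZE 8

/*
 * Store a 64-bit record sequence number as a big-endian byte array,
 * suitable for copying into the rec_seq field before the setsockopt() call.
 */
void sk_encode_rec_seq(uint64_t seq, unsigned char out[SK_REC_SEQ_SIZE])
{
	for (int i = SK_REC_SEQ_SIZE - 1; i >= 0; i--) {
		out[i] = (unsigned char)(seq & 0xff);  /* least significant byte last */
		seq >>= 8;
	}
}
```

For example, if exactly one record (the FINISHED message) has already been processed in user space under the new keys, the stack would pass a sequence number of 1 here rather than 0.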
RE: Security enhancement proposal for kernel TLS
> -Original Message- > From: Dave Watson [mailto:davejwat...@fb.com] > Sent: Thursday, August 2, 2018 2:17 AM > To: Vakul Garg > Cc: netdev@vger.kernel.org; Peter Doliwa ; Boris > Pismenny > Subject: Re: Security enhancement proposal for kernel TLS > > On 07/31/18 10:45 AM, Vakul Garg wrote: > > > > IIUC, with the upstream implementation of tls record layer in > > > > kernel, the decryption of tls FINISHED message happens in kernel. > > > > Therefore the keys are already being sent to kernel tls socket > > > > before handshake is > > > completed. > > > > > > This is incorrect. > > > > Let us first reach a common ground on this. > > > > The kernel TLS implementation can decrypt only after setting the keys on > the socket. > > The TLS message 'finished' (which is encrypted) is received after receiving > 'CCS' > > message. After the user space TLS library receives CCS message, it > > sets the keys on kernel TLS socket. Therefore, the next message in the > > socket receive queue which is TLS finished gets decrypted in kernel only. > > > > Please refer to following Boris's patch on openssl. The commit log says: > > " We choose to set this option at the earliest - just after CCS is > > complete". > > I agree that Boris' patch does what you say it does - it sets keys immediately > after CCS instead of after FINISHED message. I disagree that the kernel tls > implementation currently requires that specific ordering, nor do I think that > it > should require that ordering. The current kernel implementation assumes record sequence number to start from '0'. If keys have to be set after FINISHED message, then record sequence number need to be communicated from user space TLS stack to kernel. IIRC, sequence number is not part of the interface through which key is transferred.
Re: Security enhancement proposal for kernel TLS
On 07/31/18 10:45 AM, Vakul Garg wrote: > > > IIUC, with the upstream implementation of tls record layer in kernel, > > > the decryption of tls FINISHED message happens in kernel. Therefore > > > the keys are already being sent to kernel tls socket before handshake is > > completed. > > > > This is incorrect. > > Let us first reach a common ground on this. > > The kernel TLS implementation can decrypt only after setting the keys on the > socket. > The TLS message 'finished' (which is encrypted) is received after receiving > 'CCS' > message. After the user space TLS library receives CCS message, it sets the > keys > on kernel TLS socket. Therefore, the next message in the socket receive queue > which is TLS finished gets decrypted in kernel only. > > Please refer to following Boris's patch on openssl. The commit log says: > " We choose to set this option at the earliest - just after CCS is complete". I agree that Boris' patch does what you say it does - it sets keys immediately after CCS instead of after FINISHED message. I disagree that the kernel tls implementation currently requires that specific ordering, nor do I think that it should require that ordering.
RE: Security enhancement proposal for kernel TLS
> -----Original Message-----
> From: Dave Watson [mailto:davejwat...@fb.com]
> Sent: Tuesday, July 31, 2018 2:46 AM
> To: Vakul Garg
> Cc: netdev@vger.kernel.org; Peter Doliwa; Boris Pismenny
> Subject: Re: Security enhancement proposal for kernel TLS
>
> On 07/30/18 06:31 AM, Vakul Garg wrote:
> > > It's not entirely clear how your TLS handshake daemon works - Why is
> > > it necessary to set the keys in the kernel tls socket before the
> > > handshake is completed?
> >
> > IIUC, with the upstream implementation of tls record layer in kernel,
> > the decryption of tls FINISHED message happens in kernel. Therefore
> > the keys are already being sent to kernel tls socket before handshake is
> > completed.
>
> This is incorrect.

Let us first reach a common ground on this.

The kernel TLS implementation can decrypt only after setting the keys on the socket.
The TLS message 'finished' (which is encrypted) is received after receiving 'CCS' message.
After the user space TLS library receives CCS message, it sets the keys on kernel TLS socket.
Therefore, the next message in the socket receive queue which is TLS finished gets decrypted in kernel only.

Please refer to following Boris's patch on openssl.
The commit log says:
" We choose to set this option at the earliest - just after CCS is complete".

--
commit a01dd062a32c687630b2a860b4bb053008f09ff5
Author: Boris Pismenny
Date: Sun Mar 11 16:18:27 2018 +0200

    ssl: Linux TLS Rx Offload

    This patch adds support for the Linux TLS Rx socket option.
    It completes the previous patch for TLS Tx offload.
    If the socket option is successful, then the receive data-path of the TCP
    socket is implemented by the kernel.
    We choose to set this option at the earliest - just after CCS is complete.
--

The fact that keys are handed over to kernel TLS socket can also be verified by putting a log in tls_sw_recvmsg().

I would stop here for you to confirm my observation first.

Regards.
Vakul

> Currently the kernel TLS implementation decrypts
> everything after you set the keys on the socket. I'm suggesting that you
> don't set the keys on the socket until after the FINISHED message.
>
> > > Or, why do you need to hand off the fd to the client program before
> > > the handshake is completed?
> >
> > The fd is always owned by the client program..
> >
> > In my proposal, the applications poll their own tcp socket using
> > read/recvmsg etc.
> > If they get handshake record, they forward it to the entity running
> > handshake agent.
> > The handshake agent could be a linux daemon or could run on a separate
> > security processor like 'Secure element' or say arm trustzone etc. The
> > applications forward any handshake message it gets backs from
> > handshake agent to the connected tcp socket. Therefore, the
> > applications act as a forwarder of the handshake messages between the
> > peer tls endpoint and handshake agent.
> > The received data messages are absorbed by the applications themselves
> > (bypassing ssl stack completely). Similarly, the applications tx data
> > directly by writing on their socket.
> >
> > > Waiting until after handshake solves both of these issues.
> >
> > The security sensitive check which is 'Wait for handshake to finish
> > completely before accepting data' should not be the onus of the
> > application. We have enough examples in past where application
> > programmers made mistakes in setting up tls correctly. The idea is to
> > isolate tls session setting up from the applications.
>
> It's not clear to me what you gain by putting this 'handshake finished'
> notification in the kernel instead of in the client's tls library - you're
> already forwarding the handshake start notification to the daemon, why can't
> the daemon notify them back in userspace that the handshake is finished?
>
> If you did want to put the notification in the kernel, how would you handle
> poll on the socket, since probably both the handshake daemon and client
> might be polling the socket, but one for control messages and one for data?
>
> The original kernel TLS RFC did split these to two separate sockets, but we
> decided it was too complicated, and that's not how userspace TLS clients
> function today.
>
> Do you have an implementation of this? There are a bunch of tricky corner
> cases here, it might make more sense to have something concrete to discuss.
>
> > Further, as per tls RFC it is ok to piggy
Re: Security enhancement proposal for kernel TLS
On 07/30/18 06:31 AM, Vakul Garg wrote:
> > It's not entirely clear how your TLS handshake daemon works - Why is
> > it necessary to set the keys in the kernel tls socket before the
> > handshake is completed?
>
> IIUC, with the upstream implementation of tls record layer in kernel, the
> decryption of tls FINISHED message happens in kernel. Therefore the keys are
> already being sent to kernel tls socket before handshake is completed.

This is incorrect. Currently the kernel TLS implementation decrypts
everything after you set the keys on the socket. I'm suggesting that
you don't set the keys on the socket until after the FINISHED message.

> > Or, why do you need to hand off the fd to the client program
> > before the handshake is completed?
>
> The fd is always owned by the client program..
>
> In my proposal, the applications poll their own tcp socket using
> read/recvmsg etc.
> If they get handshake record, they forward it to the entity running
> handshake agent.
> The handshake agent could be a linux daemon or could run on a separate
> security processor like 'Secure element' or say arm trustzone etc. The
> applications forward any handshake message it gets backs from handshake
> agent to the connected tcp socket. Therefore, the applications act as a
> forwarder of the handshake messages between the peer tls endpoint and
> handshake agent.
> The received data messages are absorbed by the applications themselves
> (bypassing ssl stack completely). Similarly, the applications tx data
> directly by writing on their socket.
>
> > Waiting until after handshake solves both of these issues.
>
> The security sensitive check which is 'Wait for handshake to finish
> completely before accepting data' should not be the onus of the
> application. We have enough examples in past where application programmers
> made mistakes in setting up tls correctly. The idea is to isolate tls
> session setting up from the applications.

It's not clear to me what you gain by putting this 'handshake
finished' notification in the kernel instead of in the client's tls
library - you're already forwarding the handshake start notification
to the daemon, why can't the daemon notify them back in userspace that
the handshake is finished?

If you did want to put the notification in the kernel, how would you
handle poll on the socket, since probably both the handshake daemon
and client might be polling the socket, but one for control messages
and one for data?

The original kernel TLS RFC did split these to two separate sockets,
but we decided it was too complicated, and that's not how userspace
TLS clients function today.

Do you have an implementation of this? There are a bunch of tricky
corner cases here, it might make more sense to have something concrete
to discuss.

> Further, as per tls RFC it is ok to piggyback the data records after the
> finished handshake message. This is called early data. But then it is the
> responsibility of applications to first complete finished message
> processing before accepting the data records.
>
> The proposal is to disallow application world seeing data records
> before handshake finishes.

You're talking about the TLS 1.3 0-RTT feature, which is indeed an
interesting case. For in-process TLS libraries, it's fairly easy to
punt, and don't set the kernel TLS keys until after the 0-RTT data +
handshake message. For an OOB handshake daemon it might indeed make
more sense to leave the data in kernelspace ... somehow.

> > > - The handshake state should fallback to 'unverified' in case a control
> > >   record is seen again by kernel TLS (e.g. in case of renegotiation, post
> > >   handshake client auth etc).
> >
> > Currently kernel tls sockets return an error unless you explicitly handle
> > the control record for exactly this reason.
>
> IIRC, any kind handshake message post handshake-completion is a problem for
> kernel tls.
> This includes renegotiation, post handshake client-auth etc.
>
> Please correct me if I am wrong.

You are correct, but currently kernel TLS sockets return an error
unless you explicitly handle the control message. This should be
enough already to implement your proposal.
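As an illustration of "explicitly handling the control message": on a kernel TLS socket the record type is reported through ancillary data on recvmsg(), and a receive call that supplies no ancillary buffer fails when a control record is next in the queue. The sketch below assumes the SOL_TLS and TLS_GET_RECORD_TYPE values from linux/tls.h; the helper names record_type_from_ancdata and recv_record are hypothetical. It also assumes that a record arriving without a record-type cmsg can be treated as application data.

```python
import socket

# Values from the Linux uapi headers (linux/socket.h, linux/tls.h).
SOL_TLS = 282
TLS_GET_RECORD_TYPE = 2
TLS_RECORD_TYPE_DATA = 23  # TLS application_data content type


def record_type_from_ancdata(ancdata) -> int:
    """Return the TLS content type carried in recvmsg() ancillary data.

    Assumption: if no record-type cmsg is present, the record is
    application data.
    """
    for level, ctype, data in ancdata:
        if level == SOL_TLS and ctype == TLS_GET_RECORD_TYPE and data:
            return data[0]
    return TLS_RECORD_TYPE_DATA


def recv_record(sock: socket.socket, bufsize: int = 16384):
    """Hypothetical helper: receive one record and report its content type."""
    payload, ancdata, _flags, _addr = sock.recvmsg(bufsize, socket.CMSG_SPACE(1))
    return record_type_from_ancdata(ancdata), payload
```

An application using recv_record() can route anything whose type is not 23 (for example a handshake record, type 22) to whatever is responsible for the handshake, instead of treating it as data.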
RE: Security enhancement proposal for kernel TLS
Sorry for a delayed response.
Kindly see inline.

> -----Original Message-----
> From: Dave Watson [mailto:davejwat...@fb.com]
> Sent: Wednesday, July 25, 2018 9:30 PM
> To: Vakul Garg
> Cc: netdev@vger.kernel.org; Peter Doliwa; Boris Pismenny
> Subject: Re: Security enhancement proposal for kernel TLS
>
> You would probably get more responses if you cc the relevant people.
> Comments inline
>
> On 07/22/18 12:49 PM, Vakul Garg wrote:
> > The kernel based TLS record layer allows the user space world to use a
> > decoupled TLS implementation.
> > The applications need not be linked with TLS stack.
> > The TLS handshake can be done by a TLS daemon on the behalf of
> > applications.
> >
> > Presently, as soon as the handshake process derives keys, it pushes the
> > negotiated keys to kernel TLS .
> > Thereafter the applications can directly read and write data on their TCP
> > socket (without having to use SSL apis).
> >
> > With the current kernel TLS implementation, there is a security problem.
> > Since the kernel TLS socket does not have information about the state
> > of handshake, it allows applications to be able to receive data from the
> > peer TLS endpoint even when the handshake verification has not been
> > completed by the SSL daemon.
> > It is a security problem if applications can receive data if verification
> > of the handshake transcript is not completed (done with processing tls
> > FINISHED message).
> >
> > My proposal:
> >   - Kernel TLS should maintain state of handshake (verified or
> >     unverified).
> >     In un-verified state, data records should not be allowed pass through
> >     to the applications.
> >
> >   - Add a new control interface using which that the user space SSL
> >     stack can tell the TLS socket that handshake has been verified and
> >     DATA records can flow.
> >     In 'unverified' state, only control records should be allowed to pass
> >     and reception DATA record should be pause the receive side record
> >     decryption.
>
> It's not entirely clear how your TLS handshake daemon works - Why is
> it necessary to set the keys in the kernel tls socket before the handshake is
> completed?

IIUC, with the upstream implementation of tls record layer in kernel, the decryption of tls FINISHED message happens in kernel.
Therefore the keys are already being sent to kernel tls socket before handshake is completed.

> Or, why do you need to hand off the fd to the client program
> before the handshake is completed?

The fd is always owned by the client program.

The client program opens up the socket, TCP bind/connect it and then hands it over to SSL stack as a transport handle for exchanging handshake messages.
This is how it works today whether we use kernel TLS or not.
I do not propose to change it.

In my proposal, the applications poll their own tcp socket using read/recvmsg etc.
If they get a handshake record, they forward it to the entity running the handshake agent.
The handshake agent could be a linux daemon or could run on a separate security processor like 'Secure element' or say arm trustzone etc.
The applications forward any handshake message they get back from the handshake agent to the connected tcp socket.
Therefore, the applications act as a forwarder of the handshake messages between the peer tls endpoint and handshake agent.
The received data messages are absorbed by the applications themselves (bypassing ssl stack completely).
Similarly, the applications tx data directly by writing on their socket.

> Waiting until after handshake solves both of these issues.

The security sensitive check which is 'Wait for handshake to finish completely before accepting data' should not be the onus of the application.
We have enough examples in the past where application programmers made mistakes in setting up tls correctly.
The idea is to isolate tls session setting up from the applications.

>
> I'm not aware of any tls libraries that send data before the finished message,
> is there any reason you need to support this?

Sending data records before sending finished message is a protocol error.
A good tls library never does that.
But an attacker can exploit it if applications can receive the data records before handshake is finished.
With current kernel TLS, it is possible to do so.

Further, as per tls RFC it is ok to piggyback the data records after the finished handshake message.
This is called early data.
But then it is the responsibility of applications to first complete finished message processing before accepting the data records.

The proposal is to disallow application world seeing data records before handshake finishes.

>
> >
> > - The handshake state should fallback to 'unverified'
Re: Security enhancement proposal for kernel TLS
You would probably get more responses if you cc the relevant people.
Comments inline

On 07/22/18 12:49 PM, Vakul Garg wrote:
> The kernel based TLS record layer allows the user space world to use a
> decoupled TLS implementation.
> The applications need not be linked with TLS stack.
> The TLS handshake can be done by a TLS daemon on the behalf of applications.
>
> Presently, as soon as the handshake process derives keys, it pushes the
> negotiated keys to kernel TLS .
> Thereafter the applications can directly read and write data on their TCP
> socket (without having to use SSL apis).
>
> With the current kernel TLS implementation, there is a security problem.
> Since the kernel TLS socket does not have information about the state of
> handshake, it allows applications to be able to receive data from the peer
> TLS endpoint even when the handshake verification has not been completed by
> the SSL daemon.
> It is a security problem if applications can receive data if verification of
> the handshake transcript is not completed (done with processing tls FINISHED
> message).
>
> My proposal:
>   - Kernel TLS should maintain state of handshake (verified or
>     unverified).
>     In un-verified state, data records should not be allowed pass through
>     to the applications.
>
>   - Add a new control interface using which that the user space SSL stack
>     can tell the TLS socket that handshake has been verified and DATA
>     records can flow.
>     In 'unverified' state, only control records should be allowed to pass
>     and reception DATA record should be pause the receive side record
>     decryption.

It's not entirely clear how your TLS handshake daemon works - Why is
it necessary to set the keys in the kernel tls socket before the
handshake is completed? Or, why do you need to hand off the fd to the
client program before the handshake is completed? Waiting until after
handshake solves both of these issues.

I'm not aware of any tls libraries that send data before the finished
message, is there any reason you need to support this?

>   - The handshake state should fallback to 'unverified' in case a control
>     record is seen again by kernel TLS (e.g. in case of renegotiation, post
>     handshake client auth etc).

Currently kernel tls sockets return an error unless you explicitly
handle the control record for exactly this reason. If you want an
external daemon to handle control messages after handshake, there
definitely might be some synchronization that would make sense to push
in the kernel. However, with TLS 1.3 removing renegotiation (and
currently reneg is not implemented in kernel tls anyway), there's much
less reason to do so.
Security enhancement proposal for kernel TLS
Hi

The kernel based TLS record layer allows the user space world to use a decoupled TLS implementation.
The applications need not be linked with TLS stack.
The TLS handshake can be done by a TLS daemon on the behalf of applications.

Presently, as soon as the handshake process derives keys, it pushes the negotiated keys to kernel TLS.
Thereafter the applications can directly read and write data on their TCP socket (without having to use SSL apis).

With the current kernel TLS implementation, there is a security problem.
Since the kernel TLS socket does not have information about the state of handshake, it allows applications to receive data from the peer TLS endpoint even when the handshake verification has not been completed by the SSL daemon.
It is a security problem if applications can receive data before verification of the handshake transcript is completed (i.e. before the tls FINISHED message has been processed).

My proposal:
  - Kernel TLS should maintain state of handshake (verified or unverified).
    In the unverified state, data records should not be allowed to pass through to the applications.

  - Add a new control interface through which the user space SSL stack can tell the TLS socket that handshake has been verified and DATA records can flow.
    In 'unverified' state, only control records should be allowed to pass, and reception of a DATA record should pause the receive side record decryption.

  - The handshake state should fall back to 'unverified' in case a control record is seen again by kernel TLS (e.g. in case of renegotiation, post handshake client auth etc).

Kindly comment.

Regards
Vakul
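The three points of the proposal can be read as a small state machine. The following is a user-space model only, written to make the intended behaviour concrete; it is not a kernel interface, and every name in it (HandshakeState, GatedReceiver, on_record, mark_verified) is hypothetical. It assumes that "pause" means records queue up in arrival order behind the first held data record.

```python
from collections import deque
from enum import Enum

CONTENT_TYPE_DATA = 23  # TLS application_data; anything else is a control record


class HandshakeState(Enum):
    UNVERIFIED = "unverified"
    VERIFIED = "verified"


class GatedReceiver:
    """Toy model of the proposed receive-side gate (not kernel code)."""

    def __init__(self):
        self.state = HandshakeState.UNVERIFIED
        self._queue = deque()  # records not yet released, in arrival order

    def on_record(self, content_type, payload):
        """Accept one record; return the records that may be delivered now."""
        self._queue.append((content_type, payload))
        return self._drain()

    def mark_verified(self):
        """Model of the proposed control interface: handshake verified."""
        self.state = HandshakeState.VERIFIED
        return self._drain()

    def _drain(self):
        released = []
        while self._queue:
            content_type, _payload = self._queue[0]
            if content_type == CONTENT_TYPE_DATA:
                if self.state is HandshakeState.UNVERIFIED:
                    break  # pause: data must wait for verification
            else:
                # Any control record puts the gate back into 'unverified'.
                self.state = HandshakeState.UNVERIFIED
            released.append(self._queue.popleft())
        return released
```

In this model a data record that arrives before mark_verified() is held back, and a handshake record that arrives later (renegotiation, post handshake client auth) closes the gate again until the next mark_verified().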