Re: [RFC net-next] iavf: refactor plan proposal

2021-03-09 Thread Leon Romanovsky

On Tue, Mar 09, 2021 at 09:11:46PM -0800, Jesse Brandeburg wrote:
> Leon Romanovsky wrote:
>
> > > 3) Plan is to make the "new" iavf driver the default iavf once
> > >    extensive regression testing can be completed.
> > >    a. Current proposal is to make CONFIG_IAVF have a sub-option
> > >       CONFIG_IAVF_V2 that lets the user adopt the new code,
> > >       without changing the config for existing users or breaking
> > >       them.
> >
> > I don't think that .config options are considered ABIs, so it is unclear
> > what you mean by saying "disrupting current users". Instead of the
> > complication written above, do what any other driver does: perform your
> > testing, submit the code and switch to the new code at the same time.
>
> Because this VF driver runs on multiple hardware PFs (they all expose
> the same VF device ID) the testing matrix is quite huge and will take
> us a while to get through it. We aim to avoid making users' lives hard
> by having CONFIG_IAVF=m become a surprise new code base behind the back
> of the user.

Don't you already test your patches against that testing DB?
Like Jakub said, do incremental changes and it will be much saner for everyone.

>
> I've always thought that the .config options *are* a sort of ABI,
> because when you do "make oldconfig" it tries to pick up your previous
> configuration and if, for instance, a driver changes its Kconfig name,
> it will not pick up the old value of the old driver Kconfig name for
> the new build, and will either use the default or ask the user. The way
> we're proposing I think will allow the old driver to stay default until
> the user answers Y to the "new option" for the new, iecm based code.

I understand the rationale, but no - .config is not ABI at all.
There are three types of "users" who are messing with configs:
1. Distro people
2. Kernel developers
3. "Experts" who want/need to rebuild the kernel

All of them are expected to be proficient enough to handle changes
in CONFIG_* land. In your proposal you are trying to solve a non-existent
problem: users who build their own kernel but are dumb
enough not to understand what they are doing.

We are removing/adding/renaming CONFIG_* options all the time; this is no
different.

>
> > > [1]
> > > https://lore.kernel.org/netdev/20200824173306.3178343-1-anthony.l.ngu...@intel.com/
> >
> > Please don't introduce module parameters in new code.
>
> Thanks, we certainly won't. :-)
> I'm not sure why you commented about module parameters, but the above
> link is to the previous submission for a new driver that uses some
> common code as a module (iecm) for a new device driver (idpf) we had
> sent. The point of this email was to solicit feedback and give notice
> about doing a complicated refactor/replace where we end up re-using
> iecm for the new version of the iavf code, with the intent to be up
> front and working with the community throughout the process. Because of
> the complexity, we want to do the right thing the first time so we can
> avoid a restart/redesign.

I commented simply because it caught my eye when I looked at the
patches in that link. It was general enough to write here; the rest
of my comments are too specific and are better posted as replies
to the patches themselves.

Thanks

>
> Thanks,
>  Jesse

Re: [RFC net-next] iavf: refactor plan proposal

2021-03-09 Thread Jesse Brandeburg

Jakub Kicinski wrote:

> On Mon, 8 Mar 2021 16:28:58 -0800 Jesse Brandeburg wrote:
> > Hello,
> > 
> > We plan to refactor the iavf module and would appreciate community and
> > maintainer feedback on our plans.  We want to do this to realize the
> > usefulness of the common code module for multiple drivers.  This
> > proposal aims to avoid disrupting current users.
> > 
> > The steps we plan are something like:
> > > 1) Continue upstreaming of the iecm module (common module) and
> > >    the initial feature set for the idpf driver[1] utilizing iecm.
> 
> Oh, that's still going? there wasn't any revision for such a long time
> I deleted my notes :-o

Argh! Sorry about the delay. These proposed driver changes impacted
progress on this patch series; we should have done a better job
communicating what was going on.

> > We are looking to make sure that the mode of our refactoring will meet
> > the community's expectations. Any advice or feedback is appreciated.
> 
> Sounds like a slow, drawn out process painful to everyone involved.
> 
> The driver is upstream. My humble preference is that Intel sends small
> logical changes we can review, and preserve a meaningful git history.

We are attempting to make it as painless and quick as possible. With
that said, I see your point and am driving some internal discussions to
see what we can do differently.

The primary reason for the plan proposed is the code reuse model we've
chosen. With the change to the common module, the new iavf is
significantly different and replacing the old avf base with the new
would take many unnecessary intermediate steps that would be thrown
away at the end. The end design will use the code from the common
module with hooks to get device specific implementation where
necessary.  After putting in place the new-avf code we can update the
iavf with new functionality which is already present in the common
module.

Thanks,
 Jesse

Re: [RFC net-next] iavf: refactor plan proposal

2021-03-09 Thread Jesse Brandeburg

Leon Romanovsky wrote:

> > 3) Plan is to make the "new" iavf driver the default iavf once
> >    extensive regression testing can be completed.
> >    a. Current proposal is to make CONFIG_IAVF have a sub-option
> >       CONFIG_IAVF_V2 that lets the user adopt the new code,
> >       without changing the config for existing users or breaking
> >       them.
> 
> I don't think that .config options are considered ABIs, so it is unclear
> what you mean by saying "disrupting current users". Instead of the
> complication written above, do what any other driver does: perform your
> testing, submit the code and switch to the new code at the same time.

Because this VF driver runs on multiple hardware PFs (they all expose
the same VF device ID) the testing matrix is quite huge and will take
us a while to get through it. We aim to avoid making users' lives hard
by having CONFIG_IAVF=m become a surprise new code base behind the back
of the user.

I've always thought that the .config options *are* a sort of ABI,
because when you do "make oldconfig" it tries to pick up your previous
configuration and if, for instance, a driver changes its Kconfig name,
it will not pick up the old value of the old driver Kconfig name for
the new build, and will either use the default or ask the user. The way
we're proposing I think will allow the old driver to stay default until
the user answers Y to the "new option" for the new, iecm based code.

> > [1]
> > https://lore.kernel.org/netdev/20200824173306.3178343-1-anthony.l.ngu...@intel.com/
> 
> Please don't introduce module parameters in new code.

Thanks, we certainly won't. :-)
I'm not sure why you commented about module parameters, but the above
link is to the previous submission for a new driver that uses some
common code as a module (iecm) for a new device driver (idpf) we had
sent. The point of this email was to solicit feedback and give notice
about doing a complicated refactor/replace where we end up re-using
iecm for the new version of the iavf code, with the intent to be up
front and working with the community throughout the process. Because of
the complexity, we want to do the right thing the first time so we can
avoid a restart/redesign.

Thanks,
 Jesse

Re: [RFC net-next] iavf: refactor plan proposal

2021-03-09 Thread Jakub Kicinski

On Mon, 8 Mar 2021 16:28:58 -0800 Jesse Brandeburg wrote:
> Hello,
> 
> We plan to refactor the iavf module and would appreciate community and
> maintainer feedback on our plans.  We want to do this to realize the
> usefulness of the common code module for multiple drivers.  This
> proposal aims to avoid disrupting current users.
> 
> The steps we plan are something like:
> 1) Continue upstreaming of the iecm module (common module) and
>    the initial feature set for the idpf driver[1] utilizing iecm.

Oh, that's still going? there wasn't any revision for such a long time
I deleted my notes :-o

> 2) Introduce the refactored iavf code as a "new" iavf driver with the
>    same device ID, but Kconfig default to =n to enable testing.
>    a. Make this exclusive so if someone opts in to "new" iavf,
>       then it disables the original iavf (?)
>    b. If we do make it exclusive in Kconfig can we use the same
>       name?
> 3) Plan is to make the "new" iavf driver the default iavf once
>    extensive regression testing can be completed.
>    a. Current proposal is to make CONFIG_IAVF have a sub-option
>       CONFIG_IAVF_V2 that lets the user adopt the new code,
>       without changing the config for existing users or breaking
>       them.
> 
> We are looking to make sure that the mode of our refactoring will meet
> the community's expectations. Any advice or feedback is appreciated.

Sounds like a slow, drawn out process painful to everyone involved.

The driver is upstream. My humble preference is that Intel sends small
logical changes we can review, and preserve a meaningful git history.

Re: [RFC net-next] iavf: refactor plan proposal

2021-03-08 Thread Leon Romanovsky

On Mon, Mar 08, 2021 at 04:28:58PM -0800, Jesse Brandeburg wrote:
> Hello,
>
> We plan to refactor the iavf module and would appreciate community and
> maintainer feedback on our plans.  We want to do this to realize the
> usefulness of the common code module for multiple drivers.  This
> proposal aims to avoid disrupting current users.
>
> The steps we plan are something like:
> 1) Continue upstreaming of the iecm module (common module) and
>    the initial feature set for the idpf driver[1] utilizing iecm.
> 2) Introduce the refactored iavf code as a "new" iavf driver with the
>    same device ID, but Kconfig default to =n to enable testing.
>    a. Make this exclusive so if someone opts in to "new" iavf,
>       then it disables the original iavf (?)
>    b. If we do make it exclusive in Kconfig can we use the same
>       name?
> 3) Plan is to make the "new" iavf driver the default iavf once
>    extensive regression testing can be completed.
>    a. Current proposal is to make CONFIG_IAVF have a sub-option
>       CONFIG_IAVF_V2 that lets the user adopt the new code,
>       without changing the config for existing users or breaking
>       them.

I don't think that .config options are considered ABIs, so it is unclear
what you mean by saying "disrupting current users". Instead of the
complication written above, do what any other driver does: perform your
testing, submit the code and switch to the new code at the same time.

>
> We are looking to make sure that the mode of our refactoring will meet
> the community's expectations. Any advice or feedback is appreciated.
>
> Thanks,
> Jesse, Alice, Alan
>
> [1]
> https://lore.kernel.org/netdev/20200824173306.3178343-1-anthony.l.ngu...@intel.com/

Please don't introduce module parameters in new code.

Thanks

[RFC net-next] iavf: refactor plan proposal

2021-03-08 Thread Jesse Brandeburg

Hello,

We plan to refactor the iavf module and would appreciate community and
maintainer feedback on our plans.  We want to do this to realize the
usefulness of the common code module for multiple drivers.  This
proposal aims to avoid disrupting current users.

The steps we plan are something like:
1) Continue upstreaming of the iecm module (common module) and
   the initial feature set for the idpf driver[1] utilizing iecm.
2) Introduce the refactored iavf code as a "new" iavf driver with the
   same device ID, but Kconfig default to =n to enable testing.
   a. Make this exclusive so if someone opts in to "new" iavf,
      then it disables the original iavf (?)
   b. If we do make it exclusive in Kconfig can we use the same
      name?
3) Plan is to make the "new" iavf driver the default iavf once
   extensive regression testing can be completed.
   a. Current proposal is to make CONFIG_IAVF have a sub-option
      CONFIG_IAVF_V2 that lets the user adopt the new code,
      without changing the config for existing users or breaking
      them.

We are looking to make sure that the mode of our refactoring will meet
the community's expectations. Any advice or feedback is appreciated.

Thanks,
Jesse, Alice, Alan

[1]
https://lore.kernel.org/netdev/20200824173306.3178343-1-anthony.l.ngu...@intel.com/

Proposal for a new protocol family - AF_MCTP

2021-02-12 Thread Jeremy Kerr

Hi all,

I'm currently working on implementing support for the Management
Controller Transport Protocol (MCTP). Briefly, MCTP is a protocol for
intra-system communication between a management controller (typically a
BMC), and the devices it manages. If you're after the full details, the
DMTF have a specification (DSP0236) up at:

  https://www.dmtf.org/standards/pmci

In short, this involves adding a new protocol / address family
("AF_MCTP"), the supporting types for a sockets API, and netlink
protocol definitions.

I'm currently at the design & prototyping stage - so no
patches to send just yet! However, if you're super keen, you can have a
review of the design outline for the OpenBMC project, up at:

  https://github.com/jk-ozlabs/openbmc-docs/blob/mctp/designs/mctp/mctp-kernel.md

If you'd like to send feedback on any aspect of that, I'm keen to hear
it. You can either respond to me via email, or participate in the
gerrit review of that document, which is at:

  https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/40514

Otherwise, if you prefer to review as code instead, I'll be sending
patches to netdev once we've done a few passes of the design doc with
the OpenBMC community.

linux-can folks: the structure of MCTP is a little similar to CAN, and
I've been referring to net/can/ a little for the mctp implementation,
hence including the list here. If you have any particular hindsight
from your work, I'd be keen to hear about it too.

Cheers,


Jeremy

Re: [PATCH net-next 1/2] net/smc: send ISM devices with unique chid in CLC proposal

2020-10-04 Thread Karsten Graul

On Sat, 03 Oct 2020 17:05:38 -0700 (PDT)
David Miller  wrote:

> Series applied, but could you send a proper patch series in the future
> with a "[PATCH 0/N] ..." header posting?  It must explain what the
> patch series does at a high level, how it is doing it, and why it is
> doing it that way.
> 
> Thank you.

Hi Dave, not sure what went wrong but I sent the header posting along with
the patches, see https://lists.openwall.net/netdev/2020/10/02/197

-- 
Karsten Graul

Re: [PATCH net-next 1/2] net/smc: send ISM devices with unique chid in CLC proposal

2020-10-03 Thread David Miller

From: Karsten Graul 
Date: Fri,  2 Oct 2020 17:09:26 +0200

> When building a CLC proposal message, the list of ISM devices does
> not need to contain multiple devices that have the same chid value;
> all these devices use the same function in the end.
> Improve smc_find_ism_v2_device_clnt() to collect only ISM devices that
> have unique chid values.
> 
> Signed-off-by: Karsten Graul 

Series applied, but could you send a proper patch series in the future
with a "[PATCH 0/N] ..." header posting?  It must explain what the
patch series does at a high level, how it is doing it, and why it is
doing it that way.

Thank you.

[PATCH net-next 1/2] net/smc: send ISM devices with unique chid in CLC proposal

2020-10-02 Thread Karsten Graul

When building a CLC proposal message, the list of ISM devices does
not need to contain multiple devices that have the same chid value;
all these devices use the same function in the end.
Improve smc_find_ism_v2_device_clnt() to collect only ISM devices that
have unique chid values.

Signed-off-by: Karsten Graul 
---
 net/smc/af_smc.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index e874d0e6267f..670e802a73cb 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -599,6 +599,18 @@ static int smc_find_ism_device(struct smc_sock *smc, struct smc_init_info *ini)
 	return 0;
 }
 
+/* is chid unique for the ism devices that are already determined? */
+static bool smc_find_ism_v2_is_unique_chid(u16 chid, struct smc_init_info *ini,
+					   int cnt)
+{
+	int i = (!ini->ism_dev[0]) ? 1 : 0;
+
+	for (; i < cnt; i++)
+		if (ini->ism_chid[i] == chid)
+			return false;
+	return true;
+}
+
 /* determine possible V2 ISM devices (either without PNETID or with PNETID plus
  * PNETID matching net_device)
  */
@@ -608,6 +620,7 @@ static int smc_find_ism_v2_device_clnt(struct smc_sock *smc,
 	int rc = SMC_CLC_DECL_NOSMCDDEV;
 	struct smcd_dev *smcd;
 	int i = 1;
+	u16 chid;
 
 	if (smcd_indicated(ini->smc_type_v1))
 		rc = 0;		/* already initialized for V1 */
@@ -615,10 +628,13 @@ static int smc_find_ism_v2_device_clnt(struct smc_sock *smc,
 	list_for_each_entry(smcd, &smcd_dev_list.list, list) {
 		if (smcd->going_away || smcd == ini->ism_dev[0])
 			continue;
+		chid = smc_ism_get_chid(smcd);
+		if (!smc_find_ism_v2_is_unique_chid(chid, ini, i))
+			continue;
 		if (!smc_pnet_is_pnetid_set(smcd->pnetid) ||
 		    smc_pnet_is_ndev_pnetid(sock_net(&smc->sk), smcd->pnetid)) {
 			ini->ism_dev[i] = smcd;
-			ini->ism_chid[i] = smc_ism_get_chid(ini->ism_dev[i]);
+			ini->ism_chid[i] = chid;
 			ini->is_smcd = true;
 			rc = 0;
 			i++;
-- 
2.17.1
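
The de-duplication this patch adds can be illustrated outside the kernel. What follows is a minimal userspace sketch, not the kernel code: `struct init_info`, `MAX_DEVS`, `is_unique_chid()` and `collect_unique()` are hypothetical stand-ins for `struct smc_init_info`, the ISM device array limit, and the two functions touched by the diff. Device pointers are reduced to a "slot 0 in use" flag, and the bound on the loop is an addition of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEVS 8 /* hypothetical limit, not the kernel's value */

/* Hypothetical stand-in for the relevant fields of struct smc_init_info. */
struct init_info {
	bool     slot0_used;     /* mirrors "ini->ism_dev[0] != NULL" */
	uint16_t chid[MAX_DEVS]; /* chid of each collected device */
};

/* Is chid unique among the first cnt collected slots?
 * Slot 0 is reserved for the V1 device and is skipped when unused.
 */
static bool is_unique_chid(uint16_t chid, const struct init_info *ini, int cnt)
{
	int i = ini->slot0_used ? 0 : 1;

	for (; i < cnt; i++)
		if (ini->chid[i] == chid)
			return false;
	return true;
}

/* Collect candidate chids into slots 1..MAX_DEVS-1, skipping duplicates.
 * Returns the next free slot index, i.e. 1 + number of devices added.
 */
static int collect_unique(struct init_info *ini, const uint16_t *cand, size_t n)
{
	int i = 1;

	assert(ini && cand);
	for (size_t k = 0; k < n && i < MAX_DEVS; k++) {
		if (!is_unique_chid(cand[k], ini, i))
			continue;
		ini->chid[i++] = cand[k];
	}
	return i;
}
```

As in the patch, the chid is computed once per candidate and checked before the slot is filled, so a duplicate never occupies a slot.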

[PATCH net-next 10/14] net/smc: build and send V2 CLC proposal

2020-09-26 Thread Karsten Graul

From: Ursula Braun 

The new format of an SMCD V2 CLC proposal is introduced, and
building and checking of SMCD V2 CLC proposals is adapted
accordingly.

Signed-off-by: Ursula Braun 
Signed-off-by: Karsten Graul 
---
 net/smc/af_smc.c  |   2 +-
 net/smc/smc.h |   6 ++
 net/smc/smc_clc.c | 171 --
 net/smc/smc_clc.h |  73 ++--
 4 files changed, 210 insertions(+), 42 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 1d01a01c7fd5..10374673f75f 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1301,7 +1301,7 @@ static void smc_find_ism_device_serv(struct smc_sock *new_smc,
if (!smcd_indicated(pclc->hdr.typev1))
goto not_found;
ini->is_smcd = true; /* prepare ISM check */
-   ini->ism_peer_gid[0] = pclc_smcd->gid;
+   ini->ism_peer_gid[0] = ntohll(pclc_smcd->ism.gid);
if (smc_find_ism_device(new_smc, ini))
goto not_found;
if (!smc_listen_ism_init(new_smc, ini))
diff --git a/net/smc/smc.h b/net/smc/smc.h
index 0b9c904e2282..a1e480a3ec43 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -20,6 +20,7 @@
 
 #define SMC_V1 1   /* SMC version V1 */
 #define SMC_V2 2   /* SMC version V2 */
+#define SMC_RELEASE 0
 
 #define SMCPROTO_SMC   0   /* SMC protocol, IPv4 */
 #define SMCPROTO_SMC6  1   /* SMC protocol, IPv6 */
@@ -28,6 +29,8 @@
 * devices
 */
 
+#define SMC_MAX_EID_LEN 32
+
 extern struct proto smc_proto;
 extern struct proto smc_proto6;
 
@@ -251,6 +254,9 @@ extern struct workqueue_struct  *smc_close_wq;  /* wq for close work */
 
 extern u8  local_systemid[SMC_SYSTEMID_LEN]; /* unique system identifier */
 
+#define ntohll(x) be64_to_cpu(x)
+#define htonll(x) cpu_to_be64(x)
+
 /* convert an u32 value into network byte order, store it into a 3 byte field */
 static inline void hton24(u8 *net, u32 host)
 {
diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c
index 26f1cdd35cb1..037c92a0c2b9 100644
--- a/net/smc/smc_clc.c
+++ b/net/smc/smc_clc.c
@@ -34,12 +34,52 @@ static const char SMC_EYECATCHER[4] = {'\xe2', '\xd4', '\xc3', '\xd9'};
 /* eye catcher "SMCD" EBCDIC for CLC messages */
 static const char SMCD_EYECATCHER[4] = {'\xe2', '\xd4', '\xc3', '\xc4'};
 
+/* check arriving CLC proposal */
+static bool smc_clc_msg_prop_valid(struct smc_clc_msg_proposal *pclc)
+{
+   struct smc_clc_msg_proposal_prefix *pclc_prfx;
+   struct smc_clc_smcd_v2_extension *smcd_v2_ext;
+   struct smc_clc_msg_hdr *hdr = &pclc->hdr;
+   struct smc_clc_v2_extension *v2_ext;
+
+   v2_ext = smc_get_clc_v2_ext(pclc);
+   pclc_prfx = smc_clc_proposal_get_prefix(pclc);
+   if (hdr->version == SMC_V1) {
+   if (hdr->typev1 == SMC_TYPE_N)
+   return false;
+   if (ntohs(hdr->length) !=
+   sizeof(*pclc) + ntohs(pclc->iparea_offset) +
+   sizeof(*pclc_prfx) +
+   pclc_prfx->ipv6_prefixes_cnt *
+   sizeof(struct smc_clc_ipv6_prefix) +
+   sizeof(struct smc_clc_msg_trail))
+   return false;
+   } else {
+   if (ntohs(hdr->length) !=
+   sizeof(*pclc) +
+   sizeof(struct smc_clc_msg_smcd) +
+   (hdr->typev1 != SMC_TYPE_N ?
+   sizeof(*pclc_prfx) +
+   pclc_prfx->ipv6_prefixes_cnt *
+   sizeof(struct smc_clc_ipv6_prefix) : 0) +
+   (hdr->typev2 != SMC_TYPE_N ?
+   sizeof(*v2_ext) +
+   v2_ext->hdr.eid_cnt * SMC_MAX_EID_LEN : 0) +
+   (smcd_indicated(hdr->typev2) ?
+   sizeof(*smcd_v2_ext) + v2_ext->hdr.ism_gid_cnt *
+   sizeof(struct smc_clc_smcd_gid_chid) :
+   0) +
+   sizeof(struct smc_clc_msg_trail))
+   return false;
+   }
+   return true;
+}
+
 /* check if received message has a correct header length and contains valid
  * heading and trailing eyecatchers
  */
 static bool smc_clc_msg_hdr_valid(struct smc_clc_msg_hdr *clcm, bool check_trl)
 {
-   struct smc_clc_msg_proposal_prefix *pclc_prfx;
struct smc_clc_msg_accept_confirm *clc;
struct smc_clc_msg_proposal *pclc;
struct smc_clc_msg_decline *dclc;
@@ -51,13 +91,7 @@ static bool smc_clc_msg_hdr_valid(struct smc_clc_msg_hdr *clcm, bool check_trl)
switch (clcm->type) {
case SMC_C

Re: Yet another ethernet PHY LED control proposal

2020-09-14 Thread Pavel Machek

Hi!

> I have been thinking about another way to implement ABI for HW control
> of ethernet PHY connected LEDs.
> 
> This proposal is inspired by the fact that for some time there has been a
> movement in the kernel to do transparent HW offloading of things (DSA
> is an example of that).

And it is a good proposal.

> So currently we have the `netdev` trigger. When this is enabled for a
> LED, new files will appear in that LED's sysfs directory:
>   - `device_name` where user is supposed to write interface name
>   - `link` if set to 1, the LED will be ON if the interface is linked
>   - `rx` if set to 1, the LED will blink on receive event
>   - `tx` if set to 1, the LED will blink on transmit event
>   - `interval` specifies duration of the LED blink
> 
> Now what is interesting is that almost all combinations of link/rx/tx
> settings are offloadable to a Marvell PHY! (Not to all LEDs, though...)
> 
> So what if we abandoned the idea of a `hw` trigger, and instead just
> allowed a LED trigger to be offloadable, if that specific LED supports
> it?
> 
> For the HW mode for different speed we can just expand the `link` sysfs
> file ABI, so that if user writes a specific speed to this file, instead
> of just "1", the LED will be on if the interface is linked on that
> specific speed. Or maybe another sysfs file could be used for "light on
> N mbps" setting...
> 
> Afterwards we can figure out other possible modes.
> 
> What do you think?

If this can be implemented (and it probably can) it is the best
solution :-).

Best regards,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



Yet another ethernet PHY LED control proposal

2020-09-11 Thread Marek Behun

Hello,

I have been thinking about another way to implement ABI for HW control
of ethernet PHY connected LEDs.

This proposal is inspired by the fact that for some time there has been a
movement in the kernel to do transparent HW offloading of things (DSA
is an example of that).

So currently we have the `netdev` trigger. When this is enabled for a
LED, new files will appear in that LED's sysfs directory:
  - `device_name` where user is supposed to write interface name
  - `link` if set to 1, the LED will be ON if the interface is linked
  - `rx` if set to 1, the LED will blink on receive event
  - `tx` if set to 1, the LED will blink on transmit event
  - `interval` specifies duration of the LED blink

Now what is interesting is that almost all combinations of link/rx/tx
settings are offloadable to a Marvell PHY! (Not to all LEDs, though...)

So what if we abandoned the idea of a `hw` trigger, and instead just
allowed a LED trigger to be offloadable, if that specific LED supports
it?

For the HW mode for different speed we can just expand the `link` sysfs
file ABI, so that if user writes a specific speed to this file, instead
of just "1", the LED will be on if the interface is linked on that
specific speed. Or maybe another sysfs file could be used for "light on
N mbps" setting...

Afterwards we can figure out other possible modes.

What do you think?

Marek
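
For readers unfamiliar with the existing `netdev` trigger interface described above, here is a minimal sketch of configuring it from C. The attribute names (`device_name`, `link`, `rx`, `tx`) come from the description above, and `trigger` is the standard LED class attribute used to select a trigger. The helpers `write_attr()` and `led_netdev_setup()` and the constant `IFNAME_MAX` are hypothetical names introduced for illustration. Whether a given combination ends up offloaded to the PHY is exactly what the proposal would leave to the kernel, so nothing here requests offload explicitly.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define IFNAME_MAX 16 /* assumed to match Linux IFNAMSIZ, including the NUL */

/* Write val to <led_dir>/<attr>; returns 0 on success, -1 on error. */
static int write_attr(const char *led_dir, const char *attr, const char *val)
{
	char path[512];
	FILE *f;
	int n, rc = 0;

	assert(led_dir && attr && val);
	n = snprintf(path, sizeof(path), "%s/%s", led_dir, attr);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(val, f) == EOF)
		rc = -1;
	if (fclose(f) != 0)
		rc = -1;
	return rc;
}

/* Select the netdev trigger for one LED and set its link/rx/tx modes. */
static int led_netdev_setup(const char *led_dir, const char *ifname,
			    int link, int rx, int tx)
{
	size_t len = strlen(ifname);

	if (len == 0 || len >= IFNAME_MAX)
		return -1;
	/* The trigger must be selected first; only then do the other files exist. */
	if (write_attr(led_dir, "trigger", "netdev") ||
	    write_attr(led_dir, "device_name", ifname) ||
	    write_attr(led_dir, "link", link ? "1" : "0") ||
	    write_attr(led_dir, "rx", rx ? "1" : "0") ||
	    write_attr(led_dir, "tx", tx ? "1" : "0"))
		return -1;
	return 0;
}
```

On a real system `led_dir` would be a directory under `/sys/class/leds/`, the LED name depends on the board, and the writes need root. Under the proposal the same calls would stay valid; the kernel would decide whether the chosen link/rx/tx combination can run in hardware.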

[PATCH net-next 03/10] net/smc: dynamic allocation of CLC proposal buffer

2020-09-10 Thread Karsten Graul

From: Ursula Braun 

Reduce stack size for smc_listen_work() and smc_clc_send_proposal()
by dynamic allocation of the CLC buffer to be received or sent.

Signed-off-by: Ursula Braun 
Signed-off-by: Karsten Graul 
---
 net/smc/af_smc.c  | 13 +--
 net/smc/smc_clc.c | 88 +++
 net/smc/smc_clc.h | 15 
 3 files changed, 67 insertions(+), 49 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 8f6472f4ae21..00e2a4ce0131 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1276,10 +1276,10 @@ static void smc_listen_work(struct work_struct *work)
smc_listen_work);
struct socket *newclcsock = new_smc->clcsock;
struct smc_clc_msg_accept_confirm cclc;
+   struct smc_clc_msg_proposal_area *buf;
struct smc_clc_msg_proposal *pclc;
struct smc_init_info ini = {0};
bool ism_supported = false;
-   u8 buf[SMC_CLC_MAX_LEN];
int rc = 0;
 
if (new_smc->listen_smc->sk.sk_state != SMC_LISTEN)
@@ -1301,8 +1301,13 @@ static void smc_listen_work(struct work_struct *work)
/* do inband token exchange -
 * wait for and receive SMC Proposal CLC message
 */
-   pclc = (struct smc_clc_msg_proposal *)&buf;
-   rc = smc_clc_wait_msg(new_smc, pclc, SMC_CLC_MAX_LEN,
+   buf = kzalloc(sizeof(*buf), GFP_KERNEL);
+   if (!buf) {
+   rc = SMC_CLC_DECL_MEM;
+   goto out_decl;
+   }
+   pclc = (struct smc_clc_msg_proposal *)buf;
+   rc = smc_clc_wait_msg(new_smc, pclc, sizeof(*buf),
  SMC_CLC_PROPOSAL, CLC_WAIT_TIME);
if (rc)
goto out_decl;
@@ -1382,6 +1387,7 @@ static void smc_listen_work(struct work_struct *work)
}
 
/* finish worker */
+   kfree(buf);
if (!ism_supported) {
rc = smc_listen_rdma_finish(new_smc, &cclc,
ini.first_contact_local);
@@ -1397,6 +1403,7 @@ static void smc_listen_work(struct work_struct *work)
mutex_unlock(&smc_server_lgr_pending);
 out_decl:
smc_listen_decline(new_smc, rc, ini.first_contact_local);
+   kfree(buf);
 }
 
 static void smc_tcp_listen_work(struct work_struct *work)
diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c
index c30fad120089..0c8e74faf5ca 100644
--- a/net/smc/smc_clc.c
+++ b/net/smc/smc_clc.c
@@ -153,7 +153,6 @@ static int smc_clc_prfx_set(struct socket *clcsock,
struct sockaddr_in *addr;
int rc = -ENOENT;
 
-   memset(prop, 0, sizeof(*prop));
if (!dst) {
rc = -ENOTCONN;
goto out;
@@ -412,76 +411,89 @@ int smc_clc_send_decline(struct smc_sock *smc, u32 peer_diag_info)
 int smc_clc_send_proposal(struct smc_sock *smc, int smc_type,
  struct smc_init_info *ini)
 {
-   struct smc_clc_ipv6_prefix ipv6_prfx[SMC_CLC_MAX_V6_PREFIX];
-   struct smc_clc_msg_proposal_prefix pclc_prfx;
-   struct smc_clc_msg_smcd pclc_smcd;
-   struct smc_clc_msg_proposal pclc;
-   struct smc_clc_msg_trail trl;
+   struct smc_clc_msg_proposal_prefix *pclc_prfx;
+   struct smc_clc_msg_proposal *pclc_base;
+   struct smc_clc_msg_proposal_area *pclc;
+   struct smc_clc_ipv6_prefix *ipv6_prfx;
+   struct smc_clc_msg_smcd *pclc_smcd;
+   struct smc_clc_msg_trail *trl;
int len, i, plen, rc;
int reason_code = 0;
struct kvec vec[5];
struct msghdr msg;
 
+   pclc = kzalloc(sizeof(*pclc), GFP_KERNEL);
+   if (!pclc)
+   return -ENOMEM;
+
+   pclc_base = &pclc->pclc_base;
+   pclc_smcd = &pclc->pclc_smcd;
+   pclc_prfx = &pclc->pclc_prfx;
+   ipv6_prfx = pclc->pclc_prfx_ipv6;
+   trl = &pclc->pclc_trl;
+
/* retrieve ip prefixes for CLC proposal msg */
-   rc = smc_clc_prfx_set(smc->clcsock, &pclc_prfx, ipv6_prfx);
-   if (rc)
+   rc = smc_clc_prfx_set(smc->clcsock, pclc_prfx, ipv6_prfx);
+   if (rc) {
+   kfree(pclc);
return SMC_CLC_DECL_CNFERR; /* configuration error */
+   }
 
/* send SMC Proposal CLC message */
-   plen = sizeof(pclc) + sizeof(pclc_prfx) +
-  (pclc_prfx.ipv6_prefixes_cnt * sizeof(ipv6_prfx[0])) +
-  sizeof(trl);
-   memset(&pclc, 0, sizeof(pclc));
-   memcpy(pclc.hdr.eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER));
-   pclc.hdr.type = SMC_CLC_PROPOSAL;
-   pclc.hdr.version = SMC_V1;  /* SMC version */
-   pclc.hdr.path = smc_type;
+   plen = sizeof(*pclc_base) + sizeof(*pclc_prfx) +
+  (pclc_prfx->ipv6_prefixes_cnt * sizeof(ipv6_prfx[0])) +
+  sizeof(*trl);
+   memcpy(pclc_base->hdr.eyecatcher, SMC_EYECATCHER,
+  sizeof(SMC_EYECATCHER));
+   pclc_ba
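
The diff is cut off at this point in the archive, but the pattern named in the commit message can be shown in isolation: message parts that used to be separate on-stack variables become members of one heap-allocated area, and every exit path frees it. The sketch below is plain userspace C with `calloc()` standing in for `kzalloc()`. The struct layout, the sizes, and the names `proposal_area`, `build_proposal()`, `MAX_V6_PREFIX` and `ERR_CONFIG` are hypothetical and are not the SMC definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define MAX_V6_PREFIX 8 /* hypothetical limit */
#define ERR_CONFIG    1 /* hypothetical stand-in for a decline reason code */

/* All parts of the message live in one allocation instead of on the stack. */
struct proposal_area {
	char hdr[8];
	char prefix[16];
	char v6_prefixes[MAX_V6_PREFIX][20];
	char trailer[4];
};

/* Build a message in a single zeroed heap buffer.
 * Returns 0 and sets *len_out, -ENOMEM on allocation failure,
 * or ERR_CONFIG when the prefix data is unusable.
 */
static int build_proposal(int prefix_ok, size_t v6_cnt, size_t *len_out)
{
	struct proposal_area *area;

	assert(len_out);
	area = calloc(1, sizeof(*area)); /* zeroed, so no separate memset */
	if (!area)
		return -ENOMEM;

	if (!prefix_ok || v6_cnt > MAX_V6_PREFIX) {
		free(area); /* early exit must release the buffer too */
		return ERR_CONFIG;
	}

	memcpy(area->hdr, "HEAD", 4);
	memcpy(area->trailer, "TAIL", 4);
	*len_out = sizeof(area->hdr) + sizeof(area->prefix) +
		   v6_cnt * sizeof(area->v6_prefixes[0]) +
		   sizeof(area->trailer);

	/* A real sender would pass the parts to sendmsg() here. */
	free(area);
	return 0;
}
```

Because the buffer comes back zeroed, the explicit `memset()` calls that the stack version needed can go away, which matches the `memset` removals visible in the diff.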

Re: [RFC] bonding driver terminology change proposal

2020-08-12 Thread Jarod Wilson

On Thu, Jul 16, 2020 at 1:43 AM Jarod Wilson  wrote:
>
> On Wed, Jul 15, 2020 at 11:18 PM Andrew Lunn  wrote:
...
> > I really think that before we consider changes like this, somebody
> > needs to work on git tooling, so that it knows when mass renames have
> > happened, and can do the same sort of renames when cherry-picking
> > across the flag day. Without that, people trying to maintain stable
> > kernels are going to be very unhappy.
>
> I'm not familiar enough with git's internals to have a clue where to
> begin for something like that, but I suspect you're right. Doing
> blanket renames in stable branches sounds like a terrible idea, even
> if it would circumvent the cherry-pick issues. I guess now is as good
> a time as any to start poking around at git's internals...

I haven't forgotten about this, just been tied up with other work. I
spent a bit of time getting lost in git's internals, and the best idea
I've had suggested to me is some sort of cherry-pick hook that
executes an external script to massage variables back to old names for
-stable backporting. Could live somewhere in-tree, and maintainers
would have to know about it, but it would be reasonably painless.
Ideally, I was thinking a semantic patch to filter the backported
patch through, but haven't yet spent enough time playing with
coccinelle to know if that's actually a viable idea, since it's
designed to run on C code, not a patch, as I understand it.
Worst-case, it'd be a shell script doing some awk/sed/whatever.

-- 
Jarod Wilson
ja...@redhat.com
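
The worst-case fallback mentioned above, rewriting identifiers in a dumped patch before `git am`, amounts to a whole-identifier substitution over each line of the patch. A minimal sketch in C follows. The thread does not list the actual old and new names, so any names passed in are placeholders, and `rename_ident()` is a hypothetical helper rather than an existing tool or git hook.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* True if c can be part of a C identifier. */
static int is_ident_char(int c)
{
	return isalnum((unsigned char)c) || c == '_';
}

/* Copy src into dst (capacity cap), replacing every whole-identifier
 * occurrence of from with to. Partial matches such as "from_suffix"
 * are left alone. Returns the number of replacements, or -1 if dst
 * is too small.
 */
static int rename_ident(const char *src, const char *from, const char *to,
			char *dst, size_t cap)
{
	size_t flen = strlen(from), tlen = strlen(to), out = 0;
	int count = 0;

	assert(flen > 0);
	if (cap == 0)
		return -1;
	for (const char *p = src; *p; ) {
		int at_start = (p == src) || !is_ident_char(p[-1]);

		if (at_start && strncmp(p, from, flen) == 0 &&
		    !is_ident_char(p[flen])) {
			if (out + tlen >= cap)
				return -1;
			memcpy(dst + out, to, tlen);
			out += tlen;
			p += flen;
			count++;
		} else {
			if (out + 1 >= cap)
				return -1;
			dst[out++] = *p++;
		}
	}
	dst[out] = '\0';
	return count;
}
```

For a backport, `from` would be the new upstream name and `to` the old name still used in the stable branch. The identifier-boundary check is what a plain `sed s/a/b/g` would need `\b` or equivalent for; it keeps longer identifiers that merely contain the name untouched.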

Re: [RFC] bonding driver terminology change proposal

2020-07-16 Thread Stephen Hemminger

On Thu, 16 Jul 2020 11:59:47 -0700 (PDT)
David Miller  wrote:

> From: Jarod Wilson 
> Date: Wed, 15 Jul 2020 23:06:55 -0400
> 
> > On Mon, Jul 13, 2020 at 9:00 PM David Miller  wrote:  
> >>
> >> From: Michal Kubecek 
> >> Date: Tue, 14 Jul 2020 00:00:16 +0200
> >>  
> >> > Could we, please, avoid breaking existing userspace tools and scripts?  
> >>
> >> I will not allow UAPI breakage, don't worry.
> > 
> > Seeking some clarification here. Does the output of
> > /proc/net/bonding/ fall under that umbrella as well?  
> 
> Yes, anything user facing must not break.
> 

For iproute2, I would like better wording for the command
parameters (while still accepting the old names so as not to break scripts).
The old names can be marked as compatibility-only
or removed from the manual page and usage text.

Internally, variable names and function names in iproute2 can change,
since the internal APIs are not considered part of the user API.

Re: [RFC] bonding driver terminology change proposal

2020-07-16 Thread David Miller

From: Jarod Wilson 
Date: Wed, 15 Jul 2020 23:06:55 -0400

> On Mon, Jul 13, 2020 at 9:00 PM David Miller  wrote:
>>
>> From: Michal Kubecek 
>> Date: Tue, 14 Jul 2020 00:00:16 +0200
>>
>> > Could we, please, avoid breaking existing userspace tools and scripts?
>>
>> I will not allow UAPI breakage, don't worry.
> 
> Seeking some clarification here. Does the output of
> /proc/net/bonding/ fall under that umbrella as well?

Yes, anything user facing must not break.

Re: [RFC] bonding driver terminology change proposal

2020-07-15 Thread Jarod Wilson

On Wed, Jul 15, 2020 at 11:18 PM Andrew Lunn  wrote:
>
> On Wed, Jul 15, 2020 at 11:04:16PM -0400, Jarod Wilson wrote:
> > On Mon, Jul 13, 2020 at 8:26 PM Andrew Lunn  wrote:
> > >
> > > Hi Jarod
> > >
> > > Do you have this change scripted? Could you apply the script to v5.4
> > > and then cherry-pick the 8 bonding fixes that exist in v5.4.51. How
> > > many result in conflicts?
> > >
> > > Could you do the same with v4.19...v4.19.132, which has 20 fixes.
> > >
> > > This will give us an idea of the maintenance overhead such a change is
> > > going to cause, and how good git is at figuring out this sort of
> > > thing.
> >
> > Okay, I have some fugly bash scripts that use sed to do the majority
> > of the work here, save some manual bits done to add duplicate
> > > interfaces w/new names and some aliases, and everything compiles
> > > and functions in a basic smoke test here.
> >
> > Summary on the 5.4 git cherry-pick conflict resolution after applying
> > changes: not that good. 7 of the 8 bonding fixes in the 5.4 stable
> > branch required fixing when straight cherry-picking. Dumping the
> > patches, running a sed script over them, and then git am'ing them
> > works pretty well though.
>
> Hi Jarod
>
> That is what I was expecting.
>
> I really think that before we consider changes like this, somebody
> needs to work on git tooling, so that it knows when mass renames have
> happened, and can do the same sort of renames when cherry-picking
> across the flag day. Without that, people trying to maintain stable
> kernels are going to be very unhappy.

I'm not familiar enough with git's internals to have a clue where to
begin for something like that, but I suspect you're right. Doing
blanket renames in stable branches sounds like a terrible idea, even
if it would circumvent the cherry-pick issues. I guess now is as good
a time as any to start poking around at git's internals...

-- 
Jarod Wilson
ja...@redhat.com

Re: [RFC] bonding driver terminology change proposal

2020-07-15 Thread Andrew Lunn

On Wed, Jul 15, 2020 at 11:04:16PM -0400, Jarod Wilson wrote:
> On Mon, Jul 13, 2020 at 8:26 PM Andrew Lunn  wrote:
> >
> > Hi Jarod
> >
> > Do you have this change scripted? Could you apply the script to v5.4
> > and then cherry-pick the 8 bonding fixes that exist in v5.4.51. How
> > many result in conflicts?
> >
> > Could you do the same with v4.19...v4.19.132, which has 20 fixes.
> >
> > This will give us an idea of the maintenance overhead such a change is
> > going to cause, and how good git is at figuring out this sort of
> > thing.
> 
> Okay, I have some fugly bash scripts that use sed to do the majority
> of the work here, save some manual bits done to add duplicate
> interfaces w/new names and some aliases, and everything compiles
> and functions in a basic smoke test here.
> 
> Summary on the 5.4 git cherry-pick conflict resolution after applying
> changes: not that good. 7 of the 8 bonding fixes in the 5.4 stable
> branch required fixing when straight cherry-picking. Dumping the
> patches, running a sed script over them, and then git am'ing them
> works pretty well though.

Hi Jarod

That is what I was expecting.

I really think that before we consider changes like this, somebody
needs to work on git tooling, so that it knows when mass renames have
happened, and can do the same sort of renames when cherry-picking
across the flag day. Without that, people trying to maintain stable
kernels are going to be very unhappy.

 Andrew

Re: [RFC] bonding driver terminology change proposal

2020-07-15 Thread Jarod Wilson

On Mon, Jul 13, 2020 at 9:00 PM David Miller  wrote:
>
> From: Michal Kubecek 
> Date: Tue, 14 Jul 2020 00:00:16 +0200
>
> > Could we, please, avoid breaking existing userspace tools and scripts?
>
> I will not allow UAPI breakage, don't worry.

Seeking some clarification here. Does the output of
/proc/net/bonding/ fall under that umbrella as well? I'm sure
there are people that do parse it for monitoring, and thus I assume
that it does, but want to be certain. I think this is the only
remaining thing I need to address in a local test conversion build.
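
To make the concern concrete, here is a sketch of the kind of monitoring parser that keys on the literal labels in that output. The sample text is an abbreviated illustration of the format, not a full dump, and the function name is made up for the example:

```python
# Abbreviated, illustrative sample of /proc/net/bonding/<bond> output.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth0
MII Status: up

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: down
"""


def member_status(text):
    """Map each member interface to its MII status by matching label text."""
    status = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or no "label: value" pair
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            status[current] = value
    return status
```

On the sample above this returns {'eth0': 'up', 'eth1': 'down'}. If the "Slave Interface" label were renamed in the file, the same parser would silently return an empty dict, which is exactly the kind of breakage a monitoring script would not notice until too late.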

-- 
Jarod Wilson
ja...@redhat.com

Re: [RFC] bonding driver terminology change proposal

2020-07-15 Thread Jarod Wilson

On Mon, Jul 13, 2020 at 8:26 PM Andrew Lunn  wrote:
>
> Hi Jarod
>
> Do you have this change scripted? Could you apply the script to v5.4
> and then cherry-pick the 8 bonding fixes that exist in v5.4.51. How
> many result in conflicts?
>
> Could you do the same with v4.19...v4.19.132, which has 20 fixes.
>
> This will give us an idea of the maintenance overhead such a change is
> going to cause, and how good git is at figuring out this sort of
> thing.

Okay, I have some fugly bash scripts that use sed to do the majority
of the work here, save some manual bits done to add duplicate
interfaces w/new names and some aliases, and everything compiles
and functions in a basic smoke test here.

Summary on the 5.4 git cherry-pick conflict resolution after applying
changes: not that good. 7 of the 8 bonding fixes in the 5.4 stable
branch required fixing when straight cherry-picking. Dumping the
patches, running a sed script over them, and then git am'ing them
works pretty well though. I didn't try 4.19 (yet?), I assume it'll
just be more of the same.
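
For a rough up-front estimate on other branches, one could count how many patches in a series even mention a term being renamed. This is only a heuristic sketch in Python, not the script used above, and the list of old names is an assumption:

```python
import re

# Hypothetical list of old terms that the rename would touch.
OLD_NAMES = ["slave", "master"]
_OLD = re.compile("|".join(map(re.escape, OLD_NAMES)))


def touches_renamed(patch_text):
    """True if any added, removed, or context line mentions an old name."""
    for line in patch_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file headers, not code
        if line[:1] in ("+", "-", " ") and _OLD.search(line):
            return True
    return False


def count_affected(patches):
    """Count the patches likely to need the rename pass before applying."""
    return sum(1 for patch in patches if touches_renamed(patch))
```

A patch flagged here is not guaranteed to conflict, and the substring match is deliberately loose, but it gives a quick idea of how much of a stable series would have to go through the rewrite step.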

-- 
Jarod Wilson
ja...@redhat.com

Re: [RFC] bonding driver terminology change proposal

2020-07-15 Thread Jarod Wilson

On Wed, Jul 15, 2020 at 8:57 AM Edward Cree  wrote:
>
> Once again, the opinions below are my own and definitely do not
>  represent anything my employer would be seen dead in the same
>  room as.
>
> On 13/07/2020 23:41, Stephen Hemminger wrote:
> > As far as userspace goes, maybe keep the old APIs but provide deprecation nags.
> Why would you need to deprecate the old APIs?
> If the user echoes 'slave' into some sysfs file (or whatever), that
>  indicates that they don't have any problem with using the word.
> So there's no reason to ever remove that support -- its _mere
>  existence_ isn't problematic for anyone not actively seeking to be
>  offended.
> Which I think is more evidence that this change is not motivated by
>  practical concerns but by a kind of performative ritual purity.
>
> This is dumb.  I suspect you all, including Jarod, know that this
>  is dumb, but you're either going along with it or keeping your
>  head down in the hope that it will all blow over and you can go
>  back to normal.  Unfortunately, it doesn't work like that; the
>  activists who push this stuff are never satisfied; making
>  concessions to them results not in peace but in further demands;
>  and just as the corporations today are caving to the current
>  demands for fear of being singled out by the mob, so they will
>  cave again to the next round of demands, and you'll be back in
>  the same position, trying to deal with bosses wanting you to
>  break uAPI without even a technical reason.
> And next time around, the mob will be bolder and the bosses more
>  pliant, because by giving in this time we'll have signalled that
>  we're weak and easily dominated.  I would advise anyone still in
>  doubt of this point to read Kipling's poem "Dane-geld".
> And we'll all be left wondering why kernel development is so
>  soulless and joyless that no-one, of _any_ colour, aspires to
>  become a kernel hacker any more.
>
> It's not too late to stop the crazy, if we all just stop
>  pretending it's sane.

No, it isn't a practical code concern motivating this change, it's
actually quite impractical from a code standpoint and has no technical
merit. I understand your position, but having seen many emotional
responses to issues surrounding this, I think it's a worthwhile effort
that many people actually do appreciate. Even if I'm not personally
offended by the terminology, as a white male, I don't think I possess
the life experiences to downplay the negative impact ongoing use of
terms like "slave" might have on people that are actual descendants of
slavery. Embracing and helping move forward social change seems like a
responsible thing to do here, as long as we can do it without breaking
the kernel and UAPI.

-- 
Jarod Wilson
ja...@redhat.com

Re: [RFC] bonding driver terminology change proposal

2020-07-15 Thread Edward Cree

Once again, the opinions below are my own and definitely do not
 represent anything my employer would be seen dead in the same
 room as.

On 13/07/2020 23:41, Stephen Hemminger wrote:
> As far as userspace goes, maybe keep the old APIs but provide deprecation nags.
Why would you need to deprecate the old APIs?
If the user echoes 'slave' into some sysfs file (or whatever), that
 indicates that they don't have any problem with using the word.
So there's no reason to ever remove that support -- its _mere
 existence_ isn't problematic for anyone not actively seeking to be
 offended.
Which I think is more evidence that this change is not motivated by
 practical concerns but by a kind of performative ritual purity.

This is dumb.  I suspect you all, including Jarod, know that this
 is dumb, but you're either going along with it or keeping your
 head down in the hope that it will all blow over and you can go
 back to normal.  Unfortunately, it doesn't work like that; the
 activists who push this stuff are never satisfied; making
 concessions to them results not in peace but in further demands;
 and just as the corporations today are caving to the current
 demands for fear of being singled out by the mob, so they will
 cave again to the next round of demands, and you'll be back in
 the same position, trying to deal with bosses wanting you to
 break uAPI without even a technical reason.
And next time around, the mob will be bolder and the bosses more
 pliant, because by giving in this time we'll have signalled that
 we're weak and easily dominated.  I would advise anyone still in
 doubt of this point to read Kipling's poem "Dane-geld".
And we'll all be left wondering why kernel development is so
 soulless and joyless that no-one, of _any_ colour, aspires to
 become a kernel hacker any more.

It's not too late to stop the crazy, if we all just stop
 pretending it's sane.

-ed

Re: [RFC] bonding driver terminology change proposal

2020-07-14 Thread Jarod Wilson

On Tue, Jul 14, 2020 at 4:39 PM Marcelo Ricardo Leitner
 wrote:
>
> On Tue, Jul 14, 2020 at 09:17:48PM +0200, Toke Høiland-Jørgensen wrote:
> > Jarod Wilson  writes:
> >
> > > As part of an effort to help enact social change, Red Hat is
> > > committing to efforts to eliminate any problematic terminology from
> > > any of the software that it ships and supports. Front and center for
> > > me personally in that effort is the bonding driver's use of the terms
> > > master and slave, and to a lesser extent, bond and bonding, due to
> > > bondage being another term for slavery. Most people in computer
> > > science understand these terms aren't intended to be offensive or
> > > oppressive, and have well understood meanings in computing, but
> > > nonetheless, they still present an open wound, and a barrier for
> > > participation and inclusion to some.
> > >
> > > To start out with, I'd like to attempt to eliminate as much of the use
> > > of master and slave in the bonding driver as possible. For the most
> > > part, I think this can be done without breaking UAPI, but may require
> > > changes to anything accessing bond info via proc or sysfs.
> > >
> > > My initial thought was to rename master to aggregator and slaves to
> > > ports, but... that gets really messy with the existing 802.3ad bonding
> > > code using both extensively already. I've given thought to a number of
> > > other possible combinations, but the one that I'm liking the most is
> > > master -> bundle and slave -> cable, for a number of reasons. I'd
> > > considered cable and wire, as a cable is a grouping of individual
> > > wires, but we're grouping together cables, really -- each bonded
> > > ethernet interface has a cable connected, so a bundle of cables makes
> > > sense visually and figuratively. Additionally, it's a swap made easier
> > > in the codebase by master and bundle and slave and cable having the
> > > same number of characters, respectively. Granted though, "bundle"
> > > doesn't suggest "runs the show" the way "master" or something like
> > > maybe "director" or "parent" does, but those lack the visual aspect
> > > present with a bundle of cables. Using parent/child could work too
> > > though, it's perhaps closer to the master/slave terminology currently
> > > in use as far as literal meaning.
> >
> > I've always thought of it as a "bond device" which has other netdevs as
> > "components" (as in 'things that are part of'). So maybe
> > "main"/"component" or something to that effect?
>
> Same here, and it's pretty much like how I see the bridge as well.
> "bridge device" and "legs".

I did toy with the idea of "torso" or "thorax" for the bond aggregate
device and "legs" for the bond components, but at this point, I guess
it's mostly bikeshedding, the bigger issue is "how messy would it
be?". I've scripted most of the changes, but not all of them. Still
working on it... :)

-- 
Jarod Wilson
ja...@redhat.com

Re: [RFC] bonding driver terminology change proposal

2020-07-14 Thread Marcelo Ricardo Leitner

On Tue, Jul 14, 2020 at 09:17:48PM +0200, Toke Høiland-Jørgensen wrote:
> Jarod Wilson  writes:
> 
> > As part of an effort to help enact social change, Red Hat is
> > committing to efforts to eliminate any problematic terminology from
> > any of the software that it ships and supports. Front and center for
> > me personally in that effort is the bonding driver's use of the terms
> > master and slave, and to a lesser extent, bond and bonding, due to
> > bondage being another term for slavery. Most people in computer
> > science understand these terms aren't intended to be offensive or
> > oppressive, and have well understood meanings in computing, but
> > nonetheless, they still present an open wound, and a barrier for
> > participation and inclusion to some.
> >
> > To start out with, I'd like to attempt to eliminate as much of the use
> > of master and slave in the bonding driver as possible. For the most
> > part, I think this can be done without breaking UAPI, but may require
> > changes to anything accessing bond info via proc or sysfs.
> >
> > My initial thought was to rename master to aggregator and slaves to
> > ports, but... that gets really messy with the existing 802.3ad bonding
> > code using both extensively already. I've given thought to a number of
> > other possible combinations, but the one that I'm liking the most is
> > master -> bundle and slave -> cable, for a number of reasons. I'd
> > considered cable and wire, as a cable is a grouping of individual
> > wires, but we're grouping together cables, really -- each bonded
> > ethernet interface has a cable connected, so a bundle of cables makes
> > sense visually and figuratively. Additionally, it's a swap made easier
> > in the codebase by master and bundle and slave and cable having the
> > same number of characters, respectively. Granted though, "bundle"
> > doesn't suggest "runs the show" the way "master" or something like
> > maybe "director" or "parent" does, but those lack the visual aspect
> > present with a bundle of cables. Using parent/child could work too
> > though, it's perhaps closer to the master/slave terminology currently
> > in use as far as literal meaning.
> 
> I've always thought of it as a "bond device" which has other netdevs as
> "components" (as in 'things that are part of'). So maybe
> "main"/"component" or something to that effect?

Same here, and it's pretty much like how I see the bridge as well.
"bridge device" and "legs".

  Marcelo

Re: [RFC] bonding driver terminology change proposal

2020-07-14 Thread Toke Høiland-Jørgensen

Jarod Wilson  writes:

> As part of an effort to help enact social change, Red Hat is
> committing to efforts to eliminate any problematic terminology from
> any of the software that it ships and supports. Front and center for
> me personally in that effort is the bonding driver's use of the terms
> master and slave, and to a lesser extent, bond and bonding, due to
> bondage being another term for slavery. Most people in computer
> science understand these terms aren't intended to be offensive or
> oppressive, and have well understood meanings in computing, but
> nonetheless, they still present an open wound, and a barrier for
> participation and inclusion to some.
>
> To start out with, I'd like to attempt to eliminate as much of the use
> of master and slave in the bonding driver as possible. For the most
> part, I think this can be done without breaking UAPI, but may require
> changes to anything accessing bond info via proc or sysfs.
>
> My initial thought was to rename master to aggregator and slaves to
> ports, but... that gets really messy with the existing 802.3ad bonding
> code using both extensively already. I've given thought to a number of
> other possible combinations, but the one that I'm liking the most is
> master -> bundle and slave -> cable, for a number of reasons. I'd
> considered cable and wire, as a cable is a grouping of individual
> wires, but we're grouping together cables, really -- each bonded
> ethernet interface has a cable connected, so a bundle of cables makes
> sense visually and figuratively. Additionally, it's a swap made easier
> in the codebase by master and bundle and slave and cable having the
> same number of characters, respectively. Granted though, "bundle"
> doesn't suggest "runs the show" the way "master" or something like
> maybe "director" or "parent" does, but those lack the visual aspect
> present with a bundle of cables. Using parent/child could work too
> though, it's perhaps closer to the master/slave terminology currently
> in use as far as literal meaning.

I've always thought of it as a "bond device" which has other netdevs as
"components" (as in 'things that are part of'). So maybe
"main"/"component" or something to that effect?

-Toke

Re: [RFC] bonding driver terminology change proposal

2020-07-14 Thread Jarod Wilson

On Mon, Jul 13, 2020 at 8:55 PM Jay Vosburgh  wrote:
>
> Stephen Hemminger  wrote:
>
> >On Tue, 14 Jul 2020 00:00:16 +0200
> >Michal Kubecek  wrote:
> >
> >> On Mon, Jul 13, 2020 at 02:51:39PM -0400, Jarod Wilson wrote:
> >> > To start out with, I'd like to attempt to eliminate as much of the use
> >> > of master and slave in the bonding driver as possible. For the most
> >> > part, I think this can be done without breaking UAPI, but may require
> >> > changes to anything accessing bond info via proc or sysfs.
> >>
> >> Could we, please, avoid breaking existing userspace tools and scripts?
> >> Massive code churn is one thing and we could certainly bite the bullet
> >> and live with it (even if I'm still not convinced it would be as great
> >> an idea as some present it) but trading theoretical offense for real and
> >> palpable harm to existing users is something completely different.
> >>
> >> Or is "don't break userspace" no longer the "first commandment" of linux
> >> kernel development?
> >>
> >> Michal Kubecek
> >
> >Please consider using the same wording as the current standard for link aggregation.
> >Current version is 802.1AX and it uses the terms:
> >  Multiplexer /  Aggregator
>
> Well, 802.1AX only defines LACP, and the bonding driver does
> more than just LACP.  Also, Multiplexer, in 802.1AX, is a function of
> various components, e.g., each Aggregator has a Multiplexer, as do other
> components.
>
> As "channel bonding" is a long-established term of art, I don't
> see an issue with something like "bond" and "port," which parallels the
> bridge / port terminology.

I did look at aggregator and port as options, but the overlap with the
bonding 802.3ad code would mean first reworking a bunch of that code
to free up those terms for more general bonding use. I think "bonding"
should be okay to keep around as well, and am kind of on the fence
with "master", since master of ceremonies, master's degrees, master
keys, etc. are all similar enough to what a master device in a bond
represents, and the main objectionable language is primarily "slave".

One option would be to rename "port" to "laggport" or "adport" or
something like that in the 802.3ad code, and then make use of "port"
in place of slave (which mirrors what's done in the team driver).

> [...]
> >As far as userspace goes, maybe keep the old APIs but provide deprecation nags.
> >And don't document the old API values.
>
> Unless the community stance on not breaking user space has
> changed, the extant APIs must be maintained.  In the context of bonding,
> this would include "ip link" command line arguments, sysfs and procfs
> interfaces, as well as netlink attribute names.  There are also exported
> kernel APIs that bonding utilizes, netdev_master_upper_dev_link, et al.

To some people, this could be a case that warranted breaking UAPIs. In
an ideal world, that would be nice, but obviously, breaking the world
to get there isn't good either, so I think maintaining them all is
hopefully still understandable.

> Additionally, just to be absolutely clear, is the proposal here
> intending to undertake a rather significant search and replace of the
> text strings "master" and "slave" within the bonding driver source?
> This in addition to whatever API changes end up being done.  If so, then
> I would also like to know the answer to Andrew's question regarding
> patch conflicts in order to gauge the future maintenance cost.

Correct, this would be full search-and-replace, with minor tweaks here
and there -- bond_enslave -> bond_connect or something like that,
since bond_encable wouldn't make sense, and replacing references to
ifenslave in the code isn't helpful, since ifenslave is still going to
be called ifenslave.

As of yet, no, I don't have this scripted, but I can certainly give
that a go. I'm not terribly familiar with coccinelle, and if that
would be the way to script it, or if a simple bash/perl/whatever
script would suffice.

--
Jarod Wilson
ja...@redhat.com

Re: [RFC] bonding driver terminology change proposal

2020-07-14 Thread Jarod Wilson

On Mon, Jul 13, 2020 at 6:00 PM Michal Kubecek  wrote:
>
> On Mon, Jul 13, 2020 at 02:51:39PM -0400, Jarod Wilson wrote:
> > To start out with, I'd like to attempt to eliminate as much of the use
> > of master and slave in the bonding driver as possible. For the most
> > part, I think this can be done without breaking UAPI, but may require
> > changes to anything accessing bond info via proc or sysfs.
>
> Could we, please, avoid breaking existing userspace tools and scripts?
> Massive code churn is one thing and we could certainly bite the bullet
> and live with it (even if I'm still not convinced it would be as great
> an idea as some present it) but trading theoretical offense for real and
> palpable harm to existing users is something completely different.
>
> Or is "don't break userspace" no longer the "first commandment" of linux
> kernel development?

Definitely looking to minimize breakage here, and it sounds like it'll
be to the point of "none", or this won't fly. I think this may require
having "legacy" aliases for certain interfaces and the like, to both
provide a less problematic interface name as the new default, but
prevent breaking any existing setups.
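
Roughly what I have in mind with "legacy" aliases, as a sketch in Python rather than kernel code. The new names ("ports", "active_port") are placeholders, not a decision; the point is only that old and new names resolve to the same underlying attribute:

```python
# Hypothetical mapping of legacy attribute names to their new defaults.
LEGACY_ALIASES = {
    "slaves": "ports",
    "active_slave": "active_port",
}


class BondAttributes:
    """Attribute store that accepts both legacy and new names."""

    def __init__(self):
        self._values = {}

    @staticmethod
    def canonical(name):
        """Resolve a legacy alias to its new name; new names pass through."""
        return LEGACY_ALIASES.get(name, name)

    def write(self, name, value):
        self._values[self.canonical(name)] = value

    def read(self, name):
        return self._values[self.canonical(name)]
```

An existing script that writes to "slaves" and a new one that reads "ports" would both see the same value, so nothing already deployed has to change.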

-- 
Jarod Wilson
ja...@redhat.com

Re: [RFC] bonding driver terminology change proposal

2020-07-14 Thread Jarod Wilson

On Mon, Jul 13, 2020 at 5:36 PM Eric Dumazet  wrote:
>
> On 7/13/20 11:51 AM, Jarod Wilson wrote:
> > As part of an effort to help enact social change, Red Hat is
> > committing to efforts to eliminate any problematic terminology from
> > any of the software that it ships and supports. Front and center for
> > me personally in that effort is the bonding driver's use of the terms
> > master and slave, and to a lesser extent, bond and bonding, due to
> > bondage being another term for slavery. Most people in computer
> > science understand these terms aren't intended to be offensive or
> > oppressive, and have well understood meanings in computing, but
> > nonetheless, they still present an open wound, and a barrier for
> > participation and inclusion to some.
> >
> > To start out with, I'd like to attempt to eliminate as much of the use
> > of master and slave in the bonding driver as possible. For the most
> > part, I think this can be done without breaking UAPI, but may require
> > changes to anything accessing bond info via proc or sysfs.
> >
> > My initial thought was to rename master to aggregator and slaves to
> > ports, but... that gets really messy with the existing 802.3ad bonding
> > code using both extensively already. I've given thought to a number of
> > other possible combinations, but the one that I'm liking the most is
> > master -> bundle and slave -> cable, for a number of reasons. I'd
> > considered cable and wire, as a cable is a grouping of individual
> > wires, but we're grouping together cables, really -- each bonded
> > ethernet interface has a cable connected, so a bundle of cables makes
> > sense visually and figuratively. Additionally, it's a swap made easier
> > in the codebase by master and bundle and slave and cable having the
> > same number of characters, respectively. Granted though, "bundle"
> > doesn't suggest "runs the show" the way "master" or something like
> > maybe "director" or "parent" does, but those lack the visual aspect
> > present with a bundle of cables. Using parent/child could work too
> > though, it's perhaps closer to the master/slave terminology currently
> > in use as far as literal meaning.
> >
> > So... Thoughts?
> >
>
> So you considered : aggregator/ports, bundle/cable.
>
> I thought about cord/strand, since this is less likely to be used already
> in networking land (like worker, thread, fiber, or wire ...)
>
> Although a cord with two strands is probably not very common :/

I'd also thought about cable and wire, since there are multiple
physical wires inside an ethernet cable, but you typically connect one
cable per port, so a bundle of cables seemed to make more sense. :) I
also had a few other ideas I played with, including a bundle of pipes
and a pipework of pipes (which is apparently a thing, but not very
common either, outside of maybe plumbers?).

-- 
Jarod Wilson
ja...@redhat.com

Re: [RFC] bonding driver terminology change proposal

2020-07-13 Thread David Miller

From: Michal Kubecek 
Date: Tue, 14 Jul 2020 00:00:16 +0200

> Could we, please, avoid breaking existing userspace tools and scripts?

I will not allow UAPI breakage, don't worry.

Re: [RFC] bonding driver terminology change proposal

2020-07-13 Thread Jay Vosburgh

Stephen Hemminger  wrote:

>On Tue, 14 Jul 2020 00:00:16 +0200
>Michal Kubecek  wrote:
>
>> On Mon, Jul 13, 2020 at 02:51:39PM -0400, Jarod Wilson wrote:
>> > To start out with, I'd like to attempt to eliminate as much of the use
>> > of master and slave in the bonding driver as possible. For the most
>> > part, I think this can be done without breaking UAPI, but may require
>> > changes to anything accessing bond info via proc or sysfs.  
>> 
>> Could we, please, avoid breaking existing userspace tools and scripts?
>> Massive code churn is one thing and we could certainly bite the bullet
>> and live with it (even if I'm still not convinced it would be as great
>> an idea as some present it) but trading theoretical offense for real and
>> palpable harm to existing users is something completely different.
>> 
>> Or is "don't break userspace" no longer the "first commandment" of linux
>> kernel development?
>> 
>> Michal Kubecek
>
>Please consider using the same wording as the current standard for link aggregation.
>Current version is 802.1AX and it uses the terms:
>  Multiplexer /  Aggregator

Well, 802.1AX only defines LACP, and the bonding driver does
more than just LACP.  Also, Multiplexer, in 802.1AX, is a function of
various components, e.g., each Aggregator has a Multiplexer, as do other
components.

As "channel bonding" is a long-established term of art, I don't
see an issue with something like "bond" and "port," which parallels the
bridge / port terminology.

[...]
>As far as userspace goes, maybe keep the old APIs but provide deprecation nags.
>And don't document the old API values.

Unless the community stance on not breaking user space has
changed, the extant APIs must be maintained.  In the context of bonding,
this would include "ip link" command line arguments, sysfs and procfs
interfaces, as well as netlink attribute names.  There are also exported
kernel APIs that bonding utilizes, netdev_master_upper_dev_link, et al.

Additionally, just to be absolutely clear, is the proposal here
intending to undertake a rather significant search and replace of the
text strings "master" and "slave" within the bonding driver source?
This in addition to whatever API changes end up being done.  If so, then
I would also like to know the answer to Andrew's question regarding
patch conflicts in order to gauge the future maintenance cost.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [RFC] bonding driver terminology change proposal

2020-07-13 Thread David Miller

From: Jarod Wilson 
Date: Mon, 13 Jul 2020 14:51:39 -0400

> To start out with, I'd like to attempt to eliminate as much of the use
> of master and slave in the bonding driver as possible. For the most
> part, I think this can be done without breaking UAPI, but may require
> changes to anything accessing bond info via proc or sysfs.

You can change what you want internally to the driver in order to meet
this objective, but I am positively sure that external facing UAPI has
to be retained.

Re: [RFC] bonding driver terminology change proposal

2020-07-13 Thread Andrew Lunn

Hi Jarod

Do you have this change scripted? Could you apply the script to v5.4
and then cherry-pick the 8 bonding fixes that exist in v5.4.51. How
many result in conflicts?

Could you do the same with v4.19...v4.19.132, which has 20 fixes.

This will give us an idea of the maintenance overhead such a change is
going to cause, and how good git is at figuring out this sort of
thing.

Andrew

Re: [RFC] bonding driver terminology change proposal

2020-07-13 Thread Michal Kubecek

On Mon, Jul 13, 2020 at 03:41:18PM -0700, Stephen Hemminger wrote:
> On Tue, 14 Jul 2020 00:00:16 +0200
> Michal Kubecek  wrote:
> 
> > On Mon, Jul 13, 2020 at 02:51:39PM -0400, Jarod Wilson wrote:
> > > To start out with, I'd like to attempt to eliminate as much of the use
> > > of master and slave in the bonding driver as possible. For the most
> > > part, I think this can be done without breaking UAPI, but may require
> > > changes to anything accessing bond info via proc or sysfs.  
> > 
> > Could we, please, avoid breaking existing userspace tools and scripts?
> > Massive code churn is one thing and we could certainly bite the bullet
> > and live with it (even if I'm still not convinced it would be as great
> > an idea as some present it) but trading theoretical offense for real and
> > palpable harm to existing users is something completely different.
> > 
> > Or is "don't break userspace" no longer the "first commandment" of linux
> > kernel development?
> > 
> > Michal Kubecek
> 
> Please consider using the same wording as the current standard for link aggregation.
> Current version is 802.1AX and it uses the terms:
>   Multiplexer /  Aggregator

But both of these are replacements for "master", right?

> As far as userspace goes, maybe keep the old APIs but provide deprecation nags.
> And don't document the old API values.

I'm not a fan of nagging users. And even less of a fan of undocumented
keyword and value aliases.

Michal

Re: [RFC] bonding driver terminology change proposal

2020-07-13 Thread Stephen Hemminger

On Tue, 14 Jul 2020 00:00:16 +0200
Michal Kubecek  wrote:

> On Mon, Jul 13, 2020 at 02:51:39PM -0400, Jarod Wilson wrote:
> > To start out with, I'd like to attempt to eliminate as much of the use
> > of master and slave in the bonding driver as possible. For the most
> > part, I think this can be done without breaking UAPI, but may require
> > changes to anything accessing bond info via proc or sysfs.  
> 
> Could we, please, avoid breaking existing userspace tools and scripts?
> Massive code churn is one thing and we could certainly bite the bullet
> and live with it (even if I'm still not convinced it would be as great
> idea as some present it) but trading theoretical offense for real and
> palpable harm to existing users is something completely different.
> 
> Or is "don't break userspace" no longer the "first commandment" of linux
> kernel development?
> 
> Michal Kubecek

Please consider using the same wording as the current standard for link
aggregation. The current version is 802.1AX and it uses the terms:
  Multiplexer /  Aggregator

There are no uses of master or slave in the 802.1AX standard.

As far as userspace goes, maybe keep the old APIs but provide deprecation nags.
And don't document the old API values.
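
As a rough illustration of the "keep the old API but nag" idea, here is a
sketch in Python rather than kernel C. The new names ("ports",
"active_port") are only one of the candidate spellings discussed in this
thread, and the BondAttrs class and resolve() helper are invented for the
example.

```python
import warnings

# Hypothetical old -> new attribute names; the new spellings are assumptions.
ALIASES = {"slaves": "ports", "active_slave": "active_port"}

_warned = set()


def resolve(name: str) -> str:
    """Map a possibly deprecated attribute name to its current name."""
    new = ALIASES.get(name)
    if new is None:
        return name
    if name not in _warned:          # nag only once per old name
        _warned.add(name)
        warnings.warn(f"'{name}' is deprecated, use '{new}'",
                      DeprecationWarning, stacklevel=2)
    return new


class BondAttrs:
    """Toy attribute store that accepts old and new names alike."""

    def __init__(self):
        self._values = {"ports": [], "active_port": None}

    def get(self, name):
        return self._values[resolve(name)]

    def set(self, name, value):
        self._values[resolve(name)] = value
```

Existing scripts that use the old names keep working and see a single
warning per name. Whether a nag is desirable at all is disputed elsewhere
in the thread.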

Re: [RFC] bonding driver terminology change proposal

2020-07-13 Thread Michal Kubecek

On Mon, Jul 13, 2020 at 02:51:39PM -0400, Jarod Wilson wrote:
> To start out with, I'd like to attempt to eliminate as much of the use
> of master and slave in the bonding driver as possible. For the most
> part, I think this can be done without breaking UAPI, but may require
> changes to anything accessing bond info via proc or sysfs.

Could we, please, avoid breaking existing userspace tools and scripts?
Massive code churn is one thing and we could certainly bite the bullet
and live with it (even if I'm still not convinced it would be as great
an idea as some present it) but trading theoretical offense for real and
palpable harm to existing users is something completely different.

Or is "don't break userspace" no longer the "first commandment" of Linux
kernel development?

Michal Kubecek

Re: [RFC] bonding driver terminology change proposal

2020-07-13 Thread Eric Dumazet




On 7/13/20 11:51 AM, Jarod Wilson wrote:
> As part of an effort to help enact social change, Red Hat is
> committing to efforts to eliminate any problematic terminology from
> any of the software that it ships and supports. Front and center for
> me personally in that effort is the bonding driver's use of the terms
> master and slave, and to a lesser extent, bond and bonding, due to
> bondage being another term for slavery. Most people in computer
> science understand these terms aren't intended to be offensive or
> oppressive, and have well understood meanings in computing, but
> nonetheless, they still present an open wound, and a barrier for
> participation and inclusion to some.
> 
> To start out with, I'd like to attempt to eliminate as much of the use
> of master and slave in the bonding driver as possible. For the most
> part, I think this can be done without breaking UAPI, but may require
> changes to anything accessing bond info via proc or sysfs.
> 
> My initial thought was to rename master to aggregator and slaves to
> ports, but... that gets really messy with the existing 802.3ad bonding
> code using both extensively already. I've given thought to a number of
> other possible combinations, but the one that I'm liking the most is
> master -> bundle and slave -> cable, for a number of reasons. I'd
> considered cable and wire, as a cable is a grouping of individual
> wires, but we're grouping together cables, really -- each bonded
> ethernet interface has a cable connected, so a bundle of cables makes
> sense visually and figuratively. Additionally, it's a swap made easier
> in the codebase by master and bundle and slave and cable having the
> same number of characters, respectively. Granted though, "bundle"
> doesn't suggest "runs the show" the way "master" or something like
> maybe "director" or "parent" does, but those lack the visual aspect
> present with a bundle of cables. Using parent/child could work too
> though, it's perhaps closer to the master/slave terminology currently
> in use as far as literal meaning.
> 
> So... Thoughts?
> 

So you considered: aggregator/ports, bundle/cable.

I thought about cord/strand, since this is less likely to be used already
in networking land (like worker, thread, fiber, or wire ...).

Although a cord with two strands is probably not very common :/

[RFC] bonding driver terminology change proposal

2020-07-13 Thread Jarod Wilson

As part of an effort to help enact social change, Red Hat is
committing to efforts to eliminate any problematic terminology from
any of the software that it ships and supports. Front and center for
me personally in that effort is the bonding driver's use of the terms
master and slave, and to a lesser extent, bond and bonding, due to
bondage being another term for slavery. Most people in computer
science understand these terms aren't intended to be offensive or
oppressive, and have well understood meanings in computing, but
nonetheless, they still present an open wound, and a barrier for
participation and inclusion to some.

To start out with, I'd like to attempt to eliminate as much of the use
of master and slave in the bonding driver as possible. For the most
part, I think this can be done without breaking UAPI, but may require
changes to anything accessing bond info via proc or sysfs.

My initial thought was to rename master to aggregator and slaves to
ports, but... that gets really messy with the existing 802.3ad bonding
code using both extensively already. I've given thought to a number of
other possible combinations, but the one that I'm liking the most is
master -> bundle and slave -> cable, for a number of reasons. I'd
considered cable and wire, as a cable is a grouping of individual
wires, but we're grouping together cables, really -- each bonded
ethernet interface has a cable connected, so a bundle of cables makes
sense visually and figuratively. Additionally, it's a swap made easier
in the codebase by master and bundle and slave and cable having the
same number of characters, respectively. Granted though, "bundle"
doesn't suggest "runs the show" the way "master" or something like
maybe "director" or "parent" does, but those lack the visual aspect
present with a bundle of cables. Using parent/child could work too
though, it's perhaps closer to the master/slave terminology currently
in use as far as literal meaning.

So... Thoughts?

For reference, a work-in-progress adaptation from master/slave to
bundle/cable has a diffstat that is currently summarized as:

 37 files changed, 2607 insertions(+), 2571 deletions(-)

-- 
Jarod Wilson
ja...@redhat.com

Re: Proposal: r8152 firmware patching framework

2019-09-03 Thread Prashant Malani

(Narrowing the recipient list for now)

On Tue, Sep 3, 2019 at 3:50 PM David Miller  wrote:
>
> From: Prashant Malani 
> Date: Tue, 3 Sep 2019 14:32:01 -0700
>
> > I've moved David to the TO list to hopefully get his suggestions and
> > guidance about how to design this in a upstream-compatible way.
>
> I am not an expert in this area so please do not solicit my opinion.
Noted. My apologies.
>
> Thank you.

Re: Proposal: r8152 firmware patching framework

2019-09-03 Thread David Miller

From: Prashant Malani 
Date: Tue, 3 Sep 2019 14:32:01 -0700

> I've moved David to the TO list to hopefully get his suggestions and
> guidance about how to design this in a upstream-compatible way.

I am not an expert in this area so please do not solicit my opinion.

Thank you.

Re: Proposal: r8152 firmware patching framework

2019-09-03 Thread Prashant Malani

Hi Bambi,

Thank you for your response. We'd be more than happy to assist in
working out a solution that would be acceptable to the upstream
maintainers.
I think having a maintainable and safe way to deploy firmware fixes
would be much appreciated by hardware users as well as upstream devs,
and certainly more manageable than big static byte-arrays in the
source code!

I've moved David to the TO list to hopefully get his suggestions and
guidance about how to design this in an upstream-compatible way.

I'd be happy to implement it too (I feel this can happen concurrently
with Hayes' upstreaming efforts).

David, could you kindly advise the best way to incorporate deploying
these firmware patches? This change link gives an idea of what we're
dealing with: 
https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1417953

My original strawman is to just have a simple firmware format like so:
...

The driver code can have parts to deal with each section in an
appropriate fashion (e.g. is each data entry a word or a byte? Does
this section have a key which needs to be written to a certain
register? etc.)

We'd be grateful if you can offer your advice about best practices (or
suggestions about who might be a good reviewer), so that we can have a
design in place before sending out any patches.

Thanks and best regards,

-Prashant

On Tue, Sep 3, 2019 at 2:01 AM Bambi Yeh  wrote:
>
> Hi Prashant:
>
> We will try to implement your requests.
> Based on our experience, upstream reviewer often reject our modification if 
> they have any concern.
> Do you think you can talk to them about this idea and see if they will accept 
> it or not?
> Or if you can help on this after we submit it?
>
> Also, Hayes is now updating our current upstream driver and it goes back and 
> forth for a while.
> So we will need some time to finish it and the target schedule to have your 
> request done is in the end of this month.
>
> Thank you very much.
>
> Best Regards,
> Bambi Yeh
>
> -----Original Message-----
> From: Hayes Wang
> Sent: Monday, September 2, 2019 2:31 PM
> To: Amber Chen; Prashant Malani
> Cc: David Miller; netdev@vger.kernel.org; Bambi Yeh; Ryankao; Jackc;
> Albertk; marcoc...@google.com; nic_swsd; Grant Grundler
> Subject: RE: Proposal: r8152 firmware patching framework
>
> Prashant Malani 
> > >
> > > (Adding a few more Realtek folks)
> > >
> > > Friendly ping. Any thoughts / feedback, Realtek folks (and others) ?
> > >
> > >> On Thu, Aug 29, 2019 at 11:40 AM Prashant Malani
> >  wrote:
> > >>
> > >> Hi,
> > >>
> > >> The r8152 driver source code distributed by Realtek (on
> > >> www.realtek.com) contains firmware patches. This involves binary
> > >> byte-arrays being written byte/word-wise to the hardware memory
> > >> Example: grund...@chromium.org (cc-ed) has an experimental patch which
> > >> includes the firmware patching code which was distributed with the
> > >> Realtek source :
> > >> https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1417953
> > >>
> > >> It would be nice to have a way to incorporate these firmware fixes
> > >> into the upstream code. Since having indecipherable byte-arrays is
> > >> not possible upstream, I propose the following:
> > >> - We use the assistance of Realtek to come up with a format which
> > >> the firmware patch files can follow (this can be documented in the
> > >> comments).
> > >>   - A real simple format could look like this:
> > >>   + ... N ...
> > >>   + The driver would be able to understand how to
> > >>     parse each section (e.g. is each data entry a byte or a word?)
> > >> - We use request_firmware() to load the firmware, parse it and
> > >> write the data to the relevant registers.
>
> I plan to finish the patches which I am going to submit, first. Then, I could 
> focus on this. However, I don't think I would start this quickly. There are 
> many preparations and they would take me a lot of time.
>
> Best Regards,
> Hayes
>
>

RE: Proposal: r8152 firmware patching framework

2019-09-03 Thread Bambi Yeh

Hi Prashant:

We will try to implement your requests.
In our experience, upstream reviewers often reject our modifications if
they have any concerns.
Do you think you could talk to them about this idea and see whether they
will accept it?
Or could you help with this after we submit it?

Also, Hayes is currently updating our upstream driver, and that has been
going back and forth for a while.
So we will need some time to finish it; the target for having your
request done is the end of this month.

Thank you very much.

Best Regards,
Bambi Yeh

-----Original Message-----
From: Hayes Wang
Sent: Monday, September 2, 2019 2:31 PM
To: Amber Chen; Prashant Malani
Cc: David Miller; netdev@vger.kernel.org; Bambi Yeh; Ryankao; Jackc;
Albertk; marcoc...@google.com; nic_swsd; Grant Grundler
Subject: RE: Proposal: r8152 firmware patching framework

Prashant Malani  
> >
> > (Adding a few more Realtek folks)
> >
> > Friendly ping. Any thoughts / feedback, Realtek folks (and others) ?
> >
> >> On Thu, Aug 29, 2019 at 11:40 AM Prashant Malani
>  wrote:
> >>
> >> Hi,
> >>
> >> The r8152 driver source code distributed by Realtek (on
> >> www.realtek.com) contains firmware patches. This involves binary 
> >> byte-arrays being written byte/word-wise to the hardware memory
> >> Example: grund...@chromium.org (cc-ed) has an experimental patch which
> >> includes the firmware patching code which was distributed with the
> >> Realtek source :
> >> https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1417953
> >>
> >> It would be nice to have a way to incorporate these firmware fixes 
> >> into the upstream code. Since having indecipherable byte-arrays is 
> >> not possible upstream, I propose the following:
> >> - We use the assistance of Realtek to come up with a format which 
> >> the firmware patch files can follow (this can be documented in the 
> >> comments).
> >>   - A real simple format could look like this:
> >>   + ... N ...
> >>   + The driver would be able to understand how to
> >>     parse each section (e.g. is each data entry a byte or a word?)
> >>
> >> - We use request_firmware() to load the firmware, parse it and 
> >> write the data to the relevant registers.

I plan to finish the patches which I am going to submit, first. Then, I could 
focus on this. However, I don't think I would start this quickly. There are 
many preparations and they would take me a lot of time.

Best Regards,
Hayes

RE: Proposal: r8152 firmware patching framework

2019-09-01 Thread Hayes Wang

Prashant Malani  
> >
> > (Adding a few more Realtek folks)
> >
> > Friendly ping. Any thoughts / feedback, Realtek folks (and others) ?
> >
> >> On Thu, Aug 29, 2019 at 11:40 AM Prashant Malani
>  wrote:
> >>
> >> Hi,
> >>
> >> The r8152 driver source code distributed by Realtek (on
> >> www.realtek.com) contains firmware patches. This involves binary
> >> byte-arrays being written byte/word-wise to the hardware memory
> >> Example: grund...@chromium.org (cc-ed) has an experimental patch which
> >> includes the firmware patching code which was distributed with the
> >> Realtek source :
> >> https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1417953
> >>
> >> It would be nice to have a way to incorporate these firmware fixes
> >> into the upstream code. Since having indecipherable byte-arrays is not
> >> possible upstream, I propose the following:
> >> - We use the assistance of Realtek to come up with a format which the
> >> firmware patch files can follow (this can be documented in the
> >> comments).
> >>   - A real simple format could look like this:
> >>   + ..
> >>   + The driver would be able to understand how to parse
> >>     each section (e.g. is each data entry a byte or a word?)
> >>
> >> - We use request_firmware() to load the firmware, parse it and write
> >> the data to the relevant registers.

I plan to finish the patches which I am going to submit, first. Then,
I could focus on this. However, I don't think I would start this
quickly. There are many preparations and they would take me a lot of
time.

Best Regards,
Hayes

Re: Proposal: r8152 firmware patching framework

2019-08-30 Thread Amber Chen

+ acct mgr, Stephen



> On Aug 31, 2019, at 6:24 AM, Prashant Malani wrote:
> 
> (Adding a few more Realtek folks)
> 
> Friendly ping. Any thoughts / feedback, Realtek folks (and others) ?
> 
>> On Thu, Aug 29, 2019 at 11:40 AM Prashant Malani  
>> wrote:
>> 
>> Hi,
>> 
>> The r8152 driver source code distributed by Realtek (on
>> www.realtek.com) contains firmware patches. This involves binary
>> byte-arrays being written byte/word-wise to the hardware memory
>> Example: grund...@chromium.org (cc-ed) has an experimental patch which
>> includes the firmware patching code which was distributed with the
>> Realtek source :
>> https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1417953
>> 
>> It would be nice to have a way to incorporate these firmware fixes
>> into the upstream code. Since having indecipherable byte-arrays is not
>> possible upstream, I propose the following:
>> - We use the assistance of Realtek to come up with a format which the
>> firmware patch files can follow (this can be documented in the
>> comments).
>>   - A real simple format could look like this:
>>   + ..
>>   + The driver would be able to understand how to parse
>>     each section (e.g. is each data entry a byte or a word?)
>> 
>> - We use request_firmware() to load the firmware, parse it and write
>> the data to the relevant registers.
>> 
>> I'm unfamiliar with what the preferred method of firmware patching is,
>> so I hope the maintainers can help suggest the best path forward.
>> 
>> As an aside: It would be great if Realtek could publish a list of
>> fixes that the firmware patches implement (I think a list on the
>> driver download page on the Realtek website would be an excellent
>> starting point).
>> 
>> Thanks and Best regards,
>> 
>> -Prashant

Re: Proposal: r8152 firmware patching framework

2019-08-30 Thread Prashant Malani

(Adding a few more Realtek folks)

Friendly ping. Any thoughts / feedback, Realtek folks (and others) ?

On Thu, Aug 29, 2019 at 11:40 AM Prashant Malani  wrote:
>
> Hi,
>
> The r8152 driver source code distributed by Realtek (on
> www.realtek.com) contains firmware patches. This involves binary
> byte-arrays being written byte/word-wise to the hardware memory
> Example: grund...@chromium.org (cc-ed) has an experimental patch which
> includes the firmware patching code which was distributed with the
> Realtek source :
> https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1417953
>
> It would be nice to have a way to incorporate these firmware fixes
> into the upstream code. Since having indecipherable byte-arrays is not
> possible upstream, I propose the following:
> - We use the assistance of Realtek to come up with a format which the
> firmware patch files can follow (this can be documented in the
> comments).
>    - A real simple format could look like this:
>    + ..
>    + The driver would be able to understand how to parse
>      each section (e.g. is each data entry a byte or a word?)
>
> - We use request_firmware() to load the firmware, parse it and write
> the data to the relevant registers.
>
> I'm unfamiliar with what the preferred method of firmware patching is,
> so I hope the maintainers can help suggest the best path forward.
>
> As an aside: It would be great if Realtek could publish a list of
> fixes that the firmware patches implement (I think a list on the
> driver download page on the Realtek website would be an excellent
> starting point).
>
> Thanks and Best regards,
>
> -Prashant

Proposal: r8152 firmware patching framework

2019-08-29 Thread Prashant Malani

Hi,

The r8152 driver source code distributed by Realtek (on
www.realtek.com) contains firmware patches. This involves binary
byte-arrays being written byte/word-wise to the hardware memory
Example: grund...@chromium.org (cc-ed) has an experimental patch which
includes the firmware patching code which was distributed with the
Realtek source :
https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1417953

It would be nice to have a way to incorporate these firmware fixes
into the upstream code. Since having indecipherable byte-arrays is not
possible upstream, I propose the following:
- We use the assistance of Realtek to come up with a format which the
firmware patch files can follow (this can be documented in the
comments).
   - A real simple format could look like this:
   + ..
   + The driver would be able to understand how to parse
     each section (e.g. is each data entry a byte or a word?)

- We use request_firmware() to load the firmware, parse it and write
the data to the relevant registers.
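
To make the "sectioned file that the driver parses" idea concrete, a
minimal sketch in Python follows. The layout proposed in this message did
not survive in the archive, so everything here is an assumption made purely
for illustration: the FWP0 magic, the per-section header (type, entry
width, start address, entry count) and the names parse_patch and
build_section are all hypothetical.

```python
import struct
from dataclasses import dataclass
from typing import List

MAGIC = b"FWP0"              # hypothetical file magic
SECTION_HDR = "<BBHH"        # type, entry width, start address, entry count


@dataclass
class Section:
    kind: int                # section type; its meaning is left to the driver
    width: int               # bytes per entry: 1 (byte) or 2 (word)
    addr: int                # first address the entries would be written to
    entries: List[int]


def parse_patch(blob: bytes) -> List[Section]:
    """Split a patch image into sections, validating sizes as we go."""
    if blob[:4] != MAGIC:
        raise ValueError("bad magic")
    off, sections = 4, []
    hdr_len = struct.calcsize(SECTION_HDR)
    while off < len(blob):
        if off + hdr_len > len(blob):
            raise ValueError("truncated section header")
        kind, width, addr, count = struct.unpack_from(SECTION_HDR, blob, off)
        off += hdr_len
        if width not in (1, 2):
            raise ValueError("unsupported entry width")
        end = off + width * count
        if end > len(blob):
            raise ValueError("truncated section data")
        fmt = "<%d%s" % (count, "B" if width == 1 else "H")
        sections.append(Section(kind, width, addr,
                                list(struct.unpack_from(fmt, blob, off))))
        off = end
    return sections


def build_section(kind, width, addr, entries):
    """Helper for the example: serialise one section."""
    fmt = "<%d%s" % (len(entries), "B" if width == 1 else "H")
    return (struct.pack(SECTION_HDR, kind, width, addr, len(entries))
            + struct.pack(fmt, *entries))
```

In a driver, the equivalent checks would run on the buffer returned by
request_firmware() before any register is written, so that a truncated or
malformed file is rejected without touching the hardware.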

I'm unfamiliar with what the preferred method of firmware patching is,
so I hope the maintainers can help suggest the best path forward.

As an aside: It would be great if Realtek could publish a list of
fixes that the firmware patches implement (I think a list on the
driver download page on the Realtek website would be an excellent
starting point).

Thanks and Best regards,

-Prashant

Re: [RFC v2] vsock: proposal to support multiple transports at runtime

2019-08-22 Thread Stefano Garzarella

On Mon, Aug 19, 2019 at 02:09:11PM +0100, Stefan Hajnoczi wrote:
> On Thu, Jun 06, 2019 at 12:09:12PM +0200, Stefano Garzarella wrote:
> > 
> > Hi all,
> > this is a v2 of a proposal addressing the comments made by Dexuan, Stefan,
> > and Jorgen.
> > 
> > v1: https://www.spinics.net/lists/netdev/msg570274.html
> > 
> > 
> > 
> > We can define two types of transport that we have to handle at the same time
> > (e.g. in a nested VM we would have both types of transport running together):
> > 
> > - 'host->guest' transport, it runs in the host and it is used to communicate
> >   with the guests of a specific hypervisor (KVM, VMware or Hyper-V). It also
> >   runs in the guest who has nested guests, to communicate with them.
> > 
> >   [Phase 2]
> >   We can support multiple 'host->guest' transports running at the same time,
> >   but on x86 only one hypervisor uses VMX at any given time.
> > 
> > - 'guest->host' transport, it runs in the guest and it is used to communicate
> >   with the host.
> > 
> > 
> > The main goal is to find a way to decide which transport to use in these cases:
> > 1. connect() / sendto()
> > 
> >    a. use the 'host->guest' transport, if the destination is the guest
> >       (dest_cid > VMADDR_CID_HOST).
> > 
> >       [Phase 2]
> >       In order to support multiple 'host->guest' transports running at the
> >       same time, we should assign CIDs uniquely across all transports. In this
> >       way, a packet generated by the host side will get directed to the
> >       appropriate transport based on the CID.
> > 
> >    b. use the 'guest->host' transport, if the destination is the host or the
> >       hypervisor.
> >       (dest_cid == VMADDR_CID_HOST || dest_cid == VMADDR_CID_HYPERVISOR)
> > 
> > 
> > 2. listen() / recvfrom()
> > 
> >    a. use the 'host->guest' transport, if the socket is bound to
> >       VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no
> >       'guest->host' transport.
> >       We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order to
> >       address this case.
> > 
> >       [Phase 2]
> >       We can support network namespaces to create independent AF_VSOCK
> >       addressing domains:
> >       - could be used to partition VMs between hypervisors or at a finer
> >         granularity;
> >       - could be used to isolate host applications from guest applications
> >         using the same ports with CID_ANY;
> > 
> >    b. use the 'guest->host' transport, if the socket is bound to a local CID
> >       different from VMADDR_CID_HOST (the guest CID obtained with
> >       IOCTL_VM_SOCKETS_GET_LOCAL_CID), or it is bound to VMADDR_CID_ANY (to be
> >       backward compatible).
> >       Also in this case, we could define a new VMADDR_CID_LISTEN_FROM_HOST.
> > 
> >    c. shared port space between transports
> >       For incoming requests or packets, we should be able to choose which
> >       transport to use, looking at the 'port' requested.
> > 
> >       - stream sockets already support shared port space between transports
> >         (one port can be assigned to only one transport)
> > 
> >       [Phase 2]
> >       - datagram sockets will support it, but for now VMCI transport is the
> >         default transport for any host side datagram socket (KVM and Hyper-V
> >         do not yet support datagram sockets)
> > 
> > We will make the loading of af_vsock.ko independent of the transports, to
> > allow us to:
> >    - create an AF_VSOCK socket without any loaded transports;
> >    - listen on a socket (e.g. bound to VMADDR_CID_ANY) without any loaded
> >      transports;
> > 
> > Hopefully, we could move MODULE_ALIAS_NETPROTO(PF_VSOCK) from the
> > vmci_transport.ko to the af_vsock.ko.
> > [Jorgen will check if this will impact the existing VMware products]
> > 
> > Notes:
> >    - For Hyper-V sockets, the host can only be Windows. No changes should
> >      be required on the Windows host to support the changes in this proposal.
> > 
> >    - Communication between guests is not allowed on any transport, so we can
> >      drop packets sent from a guest to another guest (dest_cid >
> >      VMADDR_CID_HOST) if the 'host->guest' transport is not available.
>

Re: [RFC v2] vsock: proposal to support multiple transports at runtime

2019-08-19 Thread Stefan Hajnoczi

On Thu, Jun 06, 2019 at 12:09:12PM +0200, Stefano Garzarella wrote:
> 
> Hi all,
> this is a v2 of a proposal addressing the comments made by Dexuan, Stefan,
> and Jorgen.
> 
> v1: https://www.spinics.net/lists/netdev/msg570274.html
> 
> 
> 
> We can define two types of transport that we have to handle at the same time
> (e.g. in a nested VM we would have both types of transport running together):
> 
> - 'host->guest' transport, it runs in the host and it is used to communicate
>   with the guests of a specific hypervisor (KVM, VMware or Hyper-V). It also
>   runs in the guest who has nested guests, to communicate with them.
> 
>   [Phase 2]
>   We can support multiple 'host->guest' transports running at the same time,
>   but on x86 only one hypervisor uses VMX at any given time.
> 
> - 'guest->host' transport, it runs in the guest and it is used to communicate
>   with the host.
> 
> 
> The main goal is to find a way to decide which transport to use in these cases:
> 1. connect() / sendto()
> 
>    a. use the 'host->guest' transport, if the destination is the guest
>       (dest_cid > VMADDR_CID_HOST).
> 
>       [Phase 2]
>       In order to support multiple 'host->guest' transports running at the
>       same time, we should assign CIDs uniquely across all transports. In this
>       way, a packet generated by the host side will get directed to the
>       appropriate transport based on the CID.
> 
>    b. use the 'guest->host' transport, if the destination is the host or the
>       hypervisor.
>       (dest_cid == VMADDR_CID_HOST || dest_cid == VMADDR_CID_HYPERVISOR)
> 
> 
> 2. listen() / recvfrom()
> 
>    a. use the 'host->guest' transport, if the socket is bound to
>       VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no
>       'guest->host' transport.
>       We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order to
>       address this case.
> 
>       [Phase 2]
>       We can support network namespaces to create independent AF_VSOCK
>       addressing domains:
>       - could be used to partition VMs between hypervisors or at a finer
>         granularity;
>       - could be used to isolate host applications from guest applications
>         using the same ports with CID_ANY;
> 
>    b. use the 'guest->host' transport, if the socket is bound to a local CID
>       different from VMADDR_CID_HOST (the guest CID obtained with
>       IOCTL_VM_SOCKETS_GET_LOCAL_CID), or it is bound to VMADDR_CID_ANY (to be
>       backward compatible).
>       Also in this case, we could define a new VMADDR_CID_LISTEN_FROM_HOST.
> 
>    c. shared port space between transports
>       For incoming requests or packets, we should be able to choose which
>       transport to use, looking at the 'port' requested.
> 
>       - stream sockets already support shared port space between transports
>         (one port can be assigned to only one transport)
> 
>       [Phase 2]
>       - datagram sockets will support it, but for now VMCI transport is the
>         default transport for any host side datagram socket (KVM and Hyper-V
>         do not yet support datagram sockets)
> 
> We will make the loading of af_vsock.ko independent of the transports, to
> allow us to:
>    - create an AF_VSOCK socket without any loaded transports;
>    - listen on a socket (e.g. bound to VMADDR_CID_ANY) without any loaded
>      transports;
> 
> Hopefully, we could move MODULE_ALIAS_NETPROTO(PF_VSOCK) from the
> vmci_transport.ko to the af_vsock.ko.
> [Jorgen will check if this will impact the existing VMware products]
> 
> Notes:
>    - For Hyper-V sockets, the host can only be Windows. No changes should
>      be required on the Windows host to support the changes in this proposal.
> 
>    - Communication between guests is not allowed on any transport, so we can
>      drop packets sent from a guest to another guest (dest_cid >
>      VMADDR_CID_HOST) if the 'host->guest' transport is not available.
> 
>    - [Phase 2] tag used to identify things that can be done at a later stage,
>      but that should be taken into account during this design.
> 
>    - Namespace support will be developed in [Phase 2] or in a separate project.
> 
> 
> 
> Comments and suggestions are welcome.
> I'll be on PTO for the next two weeks, so sorry in advance if I answer late.
> 
> If we agree on this proposal, when I get back, I'll start working on the code
> to get a first PATCH RFC.

Stefano,
I've reviewed your proposal and it looks good for solving nested
virtualization.

The tricky implementation details will be supporting listen sockets,
especially with VMADDR_CID_ANY so they can be accessed from both
transports.

Stefan
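
As a rough illustration of the parts of the proposal quoted above, here is a minimal sketch of rule (b) and of the guest-to-guest drop rule from the notes. It is not the af_vsock implementation: the function names are hypothetical, rule (a) is not modelled, and the two constants are assumed to match the Linux UAPI values.

```python
# Assumed UAPI values: VMADDR_CID_ANY is -1U, VMADDR_CID_HOST is 2.
VMADDR_CID_ANY = 0xFFFFFFFF
VMADDR_CID_HOST = 2


def uses_guest_to_host(bound_cid):
    """Rule (b): the 'guest->host' transport applies when the socket is
    bound to a local CID other than VMADDR_CID_HOST, which also covers
    VMADDR_CID_ANY (kept for backward compatibility)."""
    return bound_cid != VMADDR_CID_HOST


def drop_guest_to_guest(dest_cid, host_to_guest_loaded):
    """Notes: a packet addressed to another guest (dest_cid above
    VMADDR_CID_HOST) can be dropped when no 'host->guest' transport
    is available to deliver it."""
    return dest_cid > VMADDR_CID_HOST and not host_to_guest_loaded
```

A listen socket bound to VMADDR_CID_ANY passes the rule (b) check while possibly being reachable through the other transport as well, which is the tricky case mentioned above.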



RES: PROPOSAL.

2019-07-30 Thread José Luiz Fabris







From: José Luiz Fabris
Sent: Tuesday, July 30, 2019 18:37
To: José Luiz Fabris
Subject: PROPOSAL.

Good Day,
I am Mrs.Margaret Ko May-Yee Leung Deputy Managing Director and Executive 
Director of Chong Hing Bank Limited. I write briefly to seek your collaboration 
in a multi-million transaction with good return for us on participation reply 
to my private email address below. Please before we proceed further, I'd like 
to know your FIRST and LAST
name so I will cross check with what I have on my file before proceeding with 
the details of our proposal.
E-mail: margaretkoleung...@gmail.com for more details send FIRST and LAST name 
to
My private email addreess:  margaretkoleung...@gmail.com Thank you and I look 
forward to hearing from you shortly.

Regards,
Dir. Margaret Ko May-Yee Leung.




INVESTMENT PROPOSAL!

2019-07-25 Thread Sir.Francois Stamm

I got your contact while on a search for a reliable and trustworthy
partner who is to help me co-ordinate a business over there in your
country. I am interested in having an investment in your country based
on long-term business venture that has a good return on investment
[ROI] under your supervision.You will be required to;

[1]. Receive the funds.
[2]. Invest and Manage the funds profitably.

Though am interested in mechanized farm or any other viable business,
and I do not know if your country is a very good market for such
investment, so i needed a very good advice on what kind of investment
has a good return and profitable there in your country that we can
both start apart from the mechanized farming I already have in mind.

If you are interested, kindly contact me
via:-stamfrancoi...@gmail.com,for more details.
Regards,
Francois.

Re: [RFC] longer netdev names proposal

2019-06-28 Thread Michal Kubecek

On Fri, Jun 28, 2019 at 03:55:53PM +0200, Jiri Pirko wrote:
> Fri, Jun 28, 2019 at 03:14:01PM CEST, and...@lunn.ch wrote:
> >
> >What is your user case for having multiple IFLA_ALT_NAME for the same
> >IFLA_NAME?
> 
> I don't know about specific usecase for having more. Perhaps Michal
> does.

One use case that comes to my mind is the "predictable names"
implemented by udev/systemd, which can be based on different naming
schemes (bus address, BIOS numbering, MAC address etc.), and it's not
always obvious which scheme is going to be used. I have even seen
multiple times that one scheme was used during system installation and
another in the installed system, so that the network configuration created
by the installer did not work.

For block devices, current practice is not to rename the device and only
create multiple symlinks based on different naming schemes (by id, by
uuid, by label, etc.). With support for multiple altnames, we could also
identify the network device in different ways (all applicable ones).

Michal
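
To make the use case concrete, here is a small sketch of one device carrying an altname per naming scheme, in the spirit of the block-device symlinks. All device names and the `resolve` helper are hypothetical; this only models the idea and does not talk to the kernel.

```python
# One canonical device name with one altname per naming scheme
# (hypothetical example values).
ALTNAMES = {
    "eth0": {
        "bus address": "enp3s0",
        "BIOS numbering": "eno1",
        "MAC address": "enx001122334455",
    },
}


def resolve(name):
    """Return the canonical device name for a canonical name or any altname."""
    for canonical, by_scheme in ALTNAMES.items():
        if name == canonical or name in by_scheme.values():
            return canonical
    return None
```

With this, a configuration written by an installer that used the BIOS-numbering name would still find the device on a system that prefers the bus-address name.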

Re: [RFC] longer netdev names proposal

2019-06-28 Thread Jiri Pirko

Fri, Jun 28, 2019 at 05:44:47PM CEST, step...@networkplumber.org wrote:
>On Fri, 28 Jun 2019 15:55:53 +0200
>Jiri Pirko  wrote:
>
>> Fri, Jun 28, 2019 at 03:14:01PM CEST, and...@lunn.ch wrote:
>> >On Fri, Jun 28, 2019 at 01:12:16PM +0200, Jiri Pirko wrote:  
>> >> Thu, Jun 27, 2019 at 09:20:41PM CEST, step...@networkplumber.org wrote:  
>> >> >On Thu, 27 Jun 2019 20:39:48 +0200
>> >> >Michal Kubecek  wrote:
>> >> >  
>> >> >> > 
>> >> >> > $ ip li set dev enp3s0 alias "Onboard Ethernet"
>> >> >> > # ip link show "Onboard Ethernet"
>> >> >> > Device "Onboard Ethernet" does not exist.
>> >> >> > 
>> >> >> > So it does not really appear to be an alias, it is a label. To be
>> >> >> > truly useful, it needs to be more than a label, it needs to be a real
>> >> >> > alias which you can use.
>> >> >> 
>> >> >> That's exactly what I meant: to be really useful, one should be able to
>> >> >> use the alias(es) for setting device options, for adding routes, in
>> >> >> netfilter rules etc.
>> >> >> 
>> >> >> Michal  
>> >> >
>> >> >The kernel doesn't enforce uniqueness of alias.
>> >> >Also current kernel RTM_GETLINK doesn't do filter by alias (easily fixed).
>> >> >
>> >> >If it did, then handling it in iproute would be something like:  
>> >> 
>> >> I think that it is desired for kernel to work with "real alias" as a
>> >> handle. Userspace could either pass ifindex, IFLA_NAME or "real alias".
>> >> Userspace mapping like you did here might be perhaps okay for iproute2,
>> >> but I think that we need something and easy to use for all.
>> >> 
>> >> Let's call it "altname". Get would return:
>> >> 
>> >> IFLA_NAME  eth0
>> >> IFLA_ALT_NAME_LIST
>> >>    IFLA_ALT_NAME  eth0
>> >>    IFLA_ALT_NAME  somethingelse
>> >>    IFLA_ALT_NAME  somenamethatisreallylong
>> >
>> >Hi Jiri
>> >
>> >What is your user case for having multiple IFLA_ALT_NAME for the same
>> >IFLA_NAME?  
>> 
>> I don't know about specific usecase for having more. Perhaps Michal
>> does.
>> 
>> From the implementation perspective it is handy to have the ifname as
>> the first alt name in kernel, so the userspace would just pass
>> IFLA_ALT_NAME always. Also for avoiding name collisions etc.
>
>I like the alternate name proposal. The kernel would have to impose uniqueness.
>Does alt_name have to be unique across both regular and alt_name?

Yes. That is my idea. To have one big hashtable to contain them all.


>Having multiple names list seems less interesting but it could be useful.

Yeah. Okay, I'm going to jump on this.
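
A minimal sketch of the "one big hashtable" idea, assuming a single mapping shared by regular names and altnames so that a collision in either direction is rejected. The class and method names are hypothetical and this is not kernel code.

```python
class NameTable:
    """Single table for regular names and altnames (illustrative only)."""

    def __init__(self):
        self._owner = {}  # name -> ifindex, shared by both kinds of name

    def _claim(self, name, ifindex):
        # Uniqueness is enforced across regular names and altnames alike.
        if name in self._owner:
            raise ValueError(f"name {name!r} already in use")
        self._owner[name] = ifindex

    def register(self, ifindex, ifname):
        """Register a device under its regular name."""
        self._claim(ifname, ifindex)

    def add_altname(self, ifindex, altname):
        """Attach an additional name to an already registered device."""
        if ifindex not in self._owner.values():
            raise KeyError(f"unknown ifindex {ifindex}")
        self._claim(altname, ifindex)

    def lookup(self, name):
        """Return the ifindex for any known name, or None."""
        return self._owner.get(name)
```

Because both kinds of name live in the same table, an altname can never shadow another device's regular name, and a lookup does not need to know which kind of name it was given.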

Re: [RFC] longer netdev names proposal

2019-06-28 Thread Stephen Hemminger

On Fri, 28 Jun 2019 15:55:53 +0200
Jiri Pirko  wrote:

> Fri, Jun 28, 2019 at 03:14:01PM CEST, and...@lunn.ch wrote:
> >On Fri, Jun 28, 2019 at 01:12:16PM +0200, Jiri Pirko wrote:  
> >> Thu, Jun 27, 2019 at 09:20:41PM CEST, step...@networkplumber.org wrote:  
> >> >On Thu, 27 Jun 2019 20:39:48 +0200
> >> >Michal Kubecek  wrote:
> >> >  
> >> >> > 
> >> >> > $ ip li set dev enp3s0 alias "Onboard Ethernet"
> >> >> > # ip link show "Onboard Ethernet"
> >> >> > Device "Onboard Ethernet" does not exist.
> >> >> > 
> >> >> > So it does not really appear to be an alias, it is a label. To be
> >> >> > truly useful, it needs to be more than a label, it needs to be a real
> >> >> > alias which you can use.
> >> >> 
> >> >> That's exactly what I meant: to be really useful, one should be able to
> >> >> use the alias(es) for setting device options, for adding routes, in
> >> >> netfilter rules etc.
> >> >> 
> >> >> Michal  
> >> >
> >> >The kernel doesn't enforce uniqueness of alias.
> >> >Also current kernel RTM_GETLINK doesn't do filter by alias (easily fixed).
> >> >
> >> >If it did, then handling it in iproute would be something like:  
> >> 
> >> I think that it is desired for kernel to work with "real alias" as a
> >> handle. Userspace could either pass ifindex, IFLA_NAME or "real alias".
> >> Userspace mapping like you did here might be perhaps okay for iproute2,
> >> but I think that we need something and easy to use for all.
> >> 
> >> Let's call it "altname". Get would return:
> >> 
> >> IFLA_NAME  eth0
> >> IFLA_ALT_NAME_LIST
> >>    IFLA_ALT_NAME  eth0
> >>    IFLA_ALT_NAME  somethingelse
> >>    IFLA_ALT_NAME  somenamethatisreallylong
> >
> >Hi Jiri
> >
> >What is your user case for having multiple IFLA_ALT_NAME for the same
> >IFLA_NAME?  
> 
> I don't know about specific usecase for having more. Perhaps Michal
> does.
> 
> From the implementation perspective it is handy to have the ifname as
> the first alt name in kernel, so the userspace would just pass
> IFLA_ALT_NAME always. Also for avoiding name collisions etc.

I like the alternate name proposal. The kernel would have to impose uniqueness.
Does alt_name have to be unique across both regular and alt_name?
Having multiple names list seems less interesting but it could be useful.

Re: [RFC] longer netdev names proposal

2019-06-28 Thread Jiri Pirko

Fri, Jun 28, 2019 at 03:14:01PM CEST, and...@lunn.ch wrote:
>On Fri, Jun 28, 2019 at 01:12:16PM +0200, Jiri Pirko wrote:
>> Thu, Jun 27, 2019 at 09:20:41PM CEST, step...@networkplumber.org wrote:
>> >On Thu, 27 Jun 2019 20:39:48 +0200
>> >Michal Kubecek  wrote:
>> >
>> >> > 
>> >> > $ ip li set dev enp3s0 alias "Onboard Ethernet"
>> >> > # ip link show "Onboard Ethernet"
>> >> > Device "Onboard Ethernet" does not exist.
>> >> > 
>> >> > So it does not really appear to be an alias, it is a label. To be
>> >> > truly useful, it needs to be more than a label, it needs to be a real
>> >> > alias which you can use.  
>> >> 
>> >> That's exactly what I meant: to be really useful, one should be able to
>> >> use the alias(es) for setting device options, for adding routes, in
>> >> netfilter rules etc.
>> >> 
>> >> Michal
>> >
>> >The kernel doesn't enforce uniqueness of alias.
>> >Also current kernel RTM_GETLINK doesn't do filter by alias (easily fixed).
>> >
>> >If it did, then handling it in iproute would be something like:
>> 
>> I think that it is desired for kernel to work with "real alias" as a
>> handle. Userspace could either pass ifindex, IFLA_NAME or "real alias".
>> Userspace mapping like you did here might be perhaps okay for iproute2,
>> but I think that we need something and easy to use for all.
>> 
>> Let's call it "altname". Get would return:
>> 
>> IFLA_NAME  eth0
>> IFLA_ALT_NAME_LIST
>>    IFLA_ALT_NAME  eth0
>>    IFLA_ALT_NAME  somethingelse
>>    IFLA_ALT_NAME  somenamethatisreallylong
>
>Hi Jiri
>
>What is your user case for having multiple IFLA_ALT_NAME for the same
>IFLA_NAME?

I don't know about specific usecase for having more. Perhaps Michal
does.

From the implementation perspective it is handy to have the ifname as
the first alt name in kernel, so the userspace would just pass
IFLA_ALT_NAME always. Also for avoiding name collisions etc.



>
>   Thanks
>   Andrew
>

Re: [RFC] longer netdev names proposal

2019-06-28 Thread Andrew Lunn

On Fri, Jun 28, 2019 at 01:12:16PM +0200, Jiri Pirko wrote:
> Thu, Jun 27, 2019 at 09:20:41PM CEST, step...@networkplumber.org wrote:
> >On Thu, 27 Jun 2019 20:39:48 +0200
> >Michal Kubecek  wrote:
> >
> >> > 
> >> > $ ip li set dev enp3s0 alias "Onboard Ethernet"
> >> > # ip link show "Onboard Ethernet"
> >> > Device "Onboard Ethernet" does not exist.
> >> > 
> >> > So it does not really appear to be an alias, it is a label. To be
> >> > truly useful, it needs to be more than a label, it needs to be a real
> >> > alias which you can use.  
> >> 
> >> That's exactly what I meant: to be really useful, one should be able to
> >> use the alias(es) for setting device options, for adding routes, in
> >> netfilter rules etc.
> >> 
> >> Michal
> >
> >The kernel doesn't enforce uniqueness of alias.
> >Also current kernel RTM_GETLINK doesn't do filter by alias (easily fixed).
> >
> >If it did, then handling it in iproute would be something like:
> 
> I think that it is desired for kernel to work with "real alias" as a
> handle. Userspace could either pass ifindex, IFLA_NAME or "real alias".
> Userspace mapping like you did here might be perhaps okay for iproute2,
> but I think that we need something and easy to use for all.
> 
> Let's call it "altname". Get would return:
> 
> IFLA_NAME  eth0
> IFLA_ALT_NAME_LIST
>    IFLA_ALT_NAME  eth0
>    IFLA_ALT_NAME  somethingelse
>    IFLA_ALT_NAME  somenamethatisreallylong

Hi Jiri

What is your use case for having multiple IFLA_ALT_NAME for the same
IFLA_NAME?

Thanks
Andrew

Re: [RFC] longer netdev names proposal

2019-06-28 Thread Jiri Pirko

Fri, Jun 28, 2019 at 01:42:12PM CEST, mkube...@suse.cz wrote:
>On Fri, Jun 28, 2019 at 01:12:16PM +0200, Jiri Pirko wrote:
>> 
>> I think that it is desired for kernel to work with "real alias" as a
>> handle. Userspace could either pass ifindex, IFLA_NAME or "real alias".
>> Userspace mapping like you did here might be perhaps okay for iproute2,
>> but I think that we need something and easy to use for all.
>> 
>> Let's call it "altname". Get would return:
>> 
>> IFLA_NAME  eth0
>> IFLA_ALT_NAME_LIST
>>    IFLA_ALT_NAME  eth0
>>    IFLA_ALT_NAME  somethingelse
>>    IFLA_ALT_NAME  somenamethatisreallylong
>> 
>> then userspace would pass with a request (get/set/del):
>> IFLA_ALT_NAME eth0/somethingelse/somenamethatisreallylong
>> or
>> IFLA_NAME eth0 if it is talking with older kernel
>> 
>> Then following would do exactly the same:
>> ip link set eth0 addr 11:22:33:44:55:66
>> ip link set somethingelse addr 11:22:33:44:55:66
>> ip link set somenamethatisreallylong addr 11:22:33:44:55:66
>
>Yes, this sounds nice.
>
>> We would have to figure out the iproute2 iface to add/del altnames:
>> ip link add eth0 altname somethingelse
>> ip link del eth0 altname somethingelse
>>   this might be also:
>>   ip link del somethingelse altname somethingelse
>
>This would be a bit confusing, IMHO, as so far
>
>  ip link add $name ...
>
>always means we want to add or delete new device $name which would not
>be the case here. How about the other way around:
>
>  ip link add somethingelse altname_for eth0
>
>(preferrably with a better keyword than "altname_for" :-) ). Or maybe
>
>  ip altname add somethingelse dev eth0
>  ip altname del somethingelse dev eth0

Yeah, I like this.

Let's see how it will work during the implementation.

Re: [RFC] longer netdev names proposal

2019-06-28 Thread Michal Kubecek

On Fri, Jun 28, 2019 at 01:12:16PM +0200, Jiri Pirko wrote:
> 
> I think that it is desired for kernel to work with "real alias" as a
> handle. Userspace could either pass ifindex, IFLA_NAME or "real alias".
> Userspace mapping like you did here might be perhaps okay for iproute2,
> but I think that we need something and easy to use for all.
> 
> Let's call it "altname". Get would return:
> 
> IFLA_NAME  eth0
> IFLA_ALT_NAME_LIST
>    IFLA_ALT_NAME  eth0
>    IFLA_ALT_NAME  somethingelse
>    IFLA_ALT_NAME  somenamethatisreallylong
> 
> then userspace would pass with a request (get/set/del):
> IFLA_ALT_NAME eth0/somethingelse/somenamethatisreallylong
> or
> IFLA_NAME eth0 if it is talking with older kernel
> 
> Then following would do exactly the same:
> ip link set eth0 addr 11:22:33:44:55:66
> ip link set somethingelse addr 11:22:33:44:55:66
> ip link set somenamethatisreallylong addr 11:22:33:44:55:66

Yes, this sounds nice.

> We would have to figure out the iproute2 iface to add/del altnames:
> ip link add eth0 altname somethingelse
> ip link del eth0 altname somethingelse
>   this might be also:
>   ip link del somethingelse altname somethingelse

This would be a bit confusing, IMHO, as so far

  ip link add $name ...

always means we want to add or delete new device $name which would not
be the case here. How about the other way around:

  ip link add somethingelse altname_for eth0

(preferably with a better keyword than "altname_for" :-) ). Or maybe

  ip altname add somethingelse dev eth0
  ip altname del somethingelse dev eth0

Michal

Re: [RFC] longer netdev names proposal

2019-06-28 Thread Jiri Pirko

Thu, Jun 27, 2019 at 09:20:41PM CEST, step...@networkplumber.org wrote:
>On Thu, 27 Jun 2019 20:39:48 +0200
>Michal Kubecek  wrote:
>
>> > 
>> > $ ip li set dev enp3s0 alias "Onboard Ethernet"
>> > # ip link show "Onboard Ethernet"
>> > Device "Onboard Ethernet" does not exist.
>> > 
>> > So it does not really appear to be an alias, it is a label. To be
>> > truly useful, it needs to be more than a label, it needs to be a real
>> > alias which you can use.  
>> 
>> That's exactly what I meant: to be really useful, one should be able to
>> use the alias(es) for setting device options, for adding routes, in
>> netfilter rules etc.
>> 
>> Michal
>
>The kernel doesn't enforce uniqueness of alias.
>Also current kernel RTM_GETLINK doesn't do filter by alias (easily fixed).
>
>If it did, then handling it in iproute would be something like:

I think it is desirable for the kernel to work with a "real alias" as a
handle. Userspace could pass either ifindex, IFLA_NAME or a "real alias".
A userspace mapping like the one you did here might be okay for iproute2,
but I think that we need something easy to use for all.

Let's call it "altname". Get would return:

IFLA_NAME  eth0
IFLA_ALT_NAME_LIST
   IFLA_ALT_NAME  eth0
   IFLA_ALT_NAME  somethingelse
   IFLA_ALT_NAME  somenamethatisreallylong

then userspace would pass with a request (get/set/del):
IFLA_ALT_NAME eth0/somethingelse/somenamethatisreallylong
or
IFLA_NAME eth0 if it is talking with older kernel

Then the following would all do exactly the same:
ip link set eth0 addr 11:22:33:44:55:66
ip link set somethingelse addr 11:22:33:44:55:66
ip link set somenamethatisreallylong addr 11:22:33:44:55:66

We would have to figure out the iproute2 iface to add/del altnames:
ip link add eth0 altname somethingelse
ip link del eth0 altname somethingelse
  this might be also:
  ip link del somethingelse altname somethingelse

How does this sound?
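
A minimal sketch of the message shapes described above, using plain Python data instead of real netlink attributes. The attribute names are the ones proposed in this thread, the helper names are hypothetical, and the "kernel supports altnames" flag stands in for whatever feature detection would really be used.

```python
def build_get_reply(ifname, altnames):
    """Model a Get reply: IFLA_NAME plus a nested list whose first
    entry is the regular ifname, followed by the extra altnames."""
    extra = [name for name in altnames if name != ifname]
    return {"IFLA_NAME": ifname, "IFLA_ALT_NAME_LIST": [ifname] + extra}


def request_attr(name, kernel_has_altname):
    """Pick the attribute userspace would send with get/set/del.

    With altname support any known name goes into IFLA_ALT_NAME;
    an older kernel only understands the regular name in IFLA_NAME.
    """
    if kernel_has_altname:
        return ("IFLA_ALT_NAME", name)
    return ("IFLA_NAME", name)
```

Since the regular ifname is always the first list entry, userspace that knows about altnames never has to choose between the two attributes.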

Re: [RFC] longer netdev names proposal

2019-06-28 Thread Jiri Pirko

Thu, Jun 27, 2019 at 09:35:27PM CEST, d...@redhat.com wrote:
>On Thu, 2019-06-27 at 12:20 -0700, Stephen Hemminger wrote:
>> On Thu, 27 Jun 2019 20:39:48 +0200
>> Michal Kubecek  wrote:
>> 
>> > > $ ip li set dev enp3s0 alias "Onboard Ethernet"
>> > > # ip link show "Onboard Ethernet"
>> > > Device "Onboard Ethernet" does not exist.
>> > > 
>> > > So it does not really appear to be an alias, it is a label. To be
>> > > truly useful, it needs to be more than a label, it needs to be a
>> > > real
>> > > alias which you can use.  
>> > 
>> > That's exactly what I meant: to be really useful, one should be
>> > able to
>> > use the alias(es) for setting device options, for adding routes, in
>> > netfilter rules etc.
>> > 
>> > Michal
>> 
>> The kernel doesn't enforce uniqueness of alias.
>
>Can we even enforce unique aliases/labels? Given that the kernel hasn't
>enforced that in the past there's a good possibility of breaking stuff
>if it started. (unfortunately)

Correct. I think that Michal's idea to introduce "real aliases" is very
interesting. However, the existing "alias" as we have it does not seem
right for this use, also because of the UAPI: we have IFLA_IFALIAS, which
is a single value, while for "real aliases" we need a nested array.

[...]

Re: [RFC] longer netdev names proposal

2019-06-28 Thread Jiri Pirko

Thu, Jun 27, 2019 at 07:14:31PM CEST, dsah...@gmail.com wrote:
>On 6/27/19 3:43 AM, Jiri Pirko wrote:
>> Hi all.
>> 
>> In the past, the IFNAMSIZ (16) limit for the netdevice name length
>> was discussed repeatedly. Now that we have PF and VF representors
>> with port names like "pfXvfY", it has become quite common to hit this limit:
>> 0123456789012345
>> enp131s0f1npf0vf6
>> enp131s0f1npf0vf22
>
>QinQ (stacked vlans) is another example.

There are more usecases for this, yes.


>
>> 
>> Since IFLA_NAME is just a string, I thought it might be possible to use
>> it to carry longer names as it is. However, userspace tools like
>> iproute2 do checks before printing. So for example, in the output of
>> "ip addr", when IFLA_NAME is longer than IFNAMSIZ, the netdevice is
>> skipped completely.
>> 
>> So here is a proposal that might work:
>> 1) Add a new attribute IFLA_NAME_EXT that could carry names longer than
>>    IFNAMSIZ, say 64 bytes. The max size should be defined only in the
>>    kernel; userspace should be prepared for any string size.
>> 2) Add a file in sysfs that would indicate that NAME_EXT is supported by
>>    the kernel.
>
>no sysfs files.
>
>Johannes added infrastructure to retrieve the policy. That is a more
>flexible and robust option for determining what the kernel supports.

Sure, udev can query rtnetlink. I just proposed it as an option; anyway,
it's an implementation detail.


>
>
>> 3) Udev is going to look for the sysfs indication file. If the
>>    kernel supports long names, it will rename to the longer name, setting
>>    IFLA_NAME_EXT. If not, it does what it does now: fail.
>> 4) There are two cases that can happen during rename:
>>    A) The name is shorter than IFNAMSIZ
>>       -> both IFLA_NAME and IFLA_NAME_EXT would contain the same string:
>>          original IFLA_NAME     = eth0
>>          original IFLA_NAME_EXT = eth0
>>          renamed  IFLA_NAME     = enp5s0f1npf0vf1
>>          renamed  IFLA_NAME_EXT = enp5s0f1npf0vf1
>>    B) The name is longer than IFNAMSIZ
>>       -> IFLA_NAME would contain the original one, IFLA_NAME_EXT would
>>          contain the new one:
>>          original IFLA_NAME     = eth0
>>          original IFLA_NAME_EXT = eth0
>>          renamed  IFLA_NAME     = eth0
>>          renamed  IFLA_NAME_EXT = enp131s0f1npf0vf22
>
>so kernel side there will be 2 names for the same net_device?

Yes. However, updated tools (which would be eventually all) are going to
show only the ext one.
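
The two rename cases can be sketched as follows. This is only a model of the rule quoted above: `rename` is a hypothetical helper, names are assumed to be ASCII, and "shorter than IFNAMSIZ" is read as "fits in the 16-byte buffer together with the terminating NUL".

```python
IFNAMSIZ = 16  # kernel buffer size, including the terminating NUL


def rename(names, new_name):
    """Apply the proposed rename rule to a dict holding IFLA_NAME and
    IFLA_NAME_EXT, returning an updated copy."""
    updated = dict(names)
    if len(new_name) < IFNAMSIZ:
        # Case A: the new name fits, so both attributes carry it.
        updated["IFLA_NAME"] = new_name
    # Case B falls through: IFLA_NAME keeps the original short name.
    updated["IFLA_NAME_EXT"] = new_name
    return updated
```

Old tools would keep addressing the device through IFLA_NAME, while updated tools would read IFLA_NAME_EXT in both cases.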



>
>> 
>> This would allow the old tools to work with "eth0" and the new
>> tools would work with "enp131s0f1npf0vf22". In sysfs, there would
>> be symlink from one name to another.
>
>I would prefer a solution that does not rely on sysfs hooks.

Please note that these /sys/class/net/ifacename dirs are already created.
What I propose is to have a symlink from the ext name to the short name or
vice versa. The solution does not really "rely" on this...


>
>>   
>> Also, there might be a warning added to kernel if someone works
>> with IFLA_NAME that the userspace tool should be upgraded.
>
>that seems like spam and confusion for the first few years of a new api.

Spam? warn_once?


>
>> 
>> Eventually, only IFLA_NAME_EXT is going to be used by everyone.
>> 
>> I'm aware there are other places where similar new attribute
>> would have to be introduced too (ip rule for example).
>> I'm not saying this is a simple work.
>> 
>> Question is what to do with the ioctl api (get ifindex etc). I would
>> probably leave it as is and push tools to use rtnetlink instead.
>
>The ioctl API is going to be a limiter here. ifconfig is still quite
>prevalent and net-snmp still uses ioctl (as just 2 common examples).
>snmp showing one set of names and rtnetlink s/w showing another is going
>to be really confusing.

I don't see any other way though, do you? The ioctl names are not extendable :/

Re: [RFC] longer netdev names proposal

2019-06-27 Thread Dan Williams

On Thu, 2019-06-27 at 12:20 -0700, Stephen Hemminger wrote:
> On Thu, 27 Jun 2019 20:39:48 +0200
> Michal Kubecek  wrote:
> 
> > > $ ip li set dev enp3s0 alias "Onboard Ethernet"
> > > # ip link show "Onboard Ethernet"
> > > Device "Onboard Ethernet" does not exist.
> > > 
> > > So it does not really appear to be an alias, it is a label. To be
> > > truly useful, it needs to be more than a label, it needs to be a
> > > real
> > > alias which you can use.  
> > 
> > That's exactly what I meant: to be really useful, one should be
> > able to
> > use the alias(es) for setting device options, for adding routes, in
> > netfilter rules etc.
> > 
> > Michal
> 
> The kernel doesn't enforce uniqueness of alias.

Can we even enforce unique aliases/labels? Given that the kernel hasn't
enforced that in the past there's a good possibility of breaking stuff
if it started. (unfortunately)

Dan

> Also current kernel RTM_GETLINK doesn't do filter by alias (easily fixed).
> 
> If it did, then handling it in iproute would be something like:
> 
> diff --git a/lib/ll_map.c b/lib/ll_map.c
> index e0ed54bf77c9..c798ba542224 100644
> --- a/lib/ll_map.c
> +++ b/lib/ll_map.c
> @@ -26,15 +26,18 @@
>  struct ll_cache {
>  	struct hlist_node idx_hash;
>  	struct hlist_node name_hash;
> +	struct hlist_node alias_hash;
>  	unsigned	flags;
>  	unsigned	index;
>  	unsigned short	type;
> -	char		name[];
> +	char		*alias;
> +	char		name[IFNAMSIZ];
>  };
>  
>  #define IDXMAP_SIZE  1024
>  static struct hlist_head idx_head[IDXMAP_SIZE];
>  static struct hlist_head name_head[IDXMAP_SIZE];
> +static struct hlist_head alias_head[IDXMAP_SIZE];
>  
>  static struct ll_cache *ll_get_by_index(unsigned index)
>  {
> @@ -77,10 +80,26 @@ static struct ll_cache *ll_get_by_name(const char
> *name)
>   return NULL;
>  }
>  
> +static struct ll_cache *ll_get_by_alias(const char *alias)
> +{
> + struct hlist_node *n;
> + unsigned h = namehash(alias) & (IDXMAP_SIZE - 1);
> +
> + hlist_for_each(n, &alias_head[h]) {
> + struct ll_cache *im
> + = container_of(n, struct ll_cache, alias_hash);
> +
> + if (strcmp(im->alias, alias) == 0)
> + return im;
> + }
> +
> + return NULL;
> +}
> +
>  int ll_remember_index(struct nlmsghdr *n, void *arg)
>  {
>   unsigned int h;
> - const char *ifname;
> + const char *ifname, *ifalias;
>   struct ifinfomsg *ifi = NLMSG_DATA(n);
>   struct ll_cache *im;
>   struct rtattr *tb[IFLA_MAX+1];
> @@ -96,6 +115,10 @@ int ll_remember_index(struct nlmsghdr *n, void
> *arg)
>   if (im) {
>   hlist_del(&im->name_hash);
>   hlist_del(&im->idx_hash);
> + if (im->alias) {
> + hlist_del(&im->alias_hash);
> + free(im->alias);
> + }
>   free(im);
>   }
>   return 0;
> @@ -106,6 +129,8 @@ int ll_remember_index(struct nlmsghdr *n, void
> *arg)
>   if (ifname == NULL)
>   return 0;
>  
> + ifalias = tb[IFLA_IFALIAS] ? rta_getattr_str(tb[IFLA_IFALIAS])
> : NULL;
> +
>   if (im) {
>   /* change to existing entry */
>   if (strcmp(im->name, ifname) != 0) {
> @@ -114,6 +139,14 @@ int ll_remember_index(struct nlmsghdr *n, void
> *arg)
>   hlist_add_head(&im->name_hash, &name_head[h]);
>   }
>  
> + if (im->alias) {
> + hlist_del(&im->alias_hash);
> + if (ifalias) {
> + h = namehash(ifalias) & (IDXMAP_SIZE -
> 1);
> + hlist_add_head(&im->alias_hash,
> &alias_head[h]);
> + }
> + }
> +
>   im->flags = ifi->ifi_flags;
>   return 0;
>   }
> @@ -132,6 +165,12 @@ int ll_remember_index(struct nlmsghdr *n, void
> *arg)
>   h = namehash(ifname) & (IDXMAP_SIZE - 1);
>   hlist_add_head(&im->name_hash, &name_head[h]);
>  
> + if (ifalias) {
> + im->alias = strdup(ifalias);
> + h = namehash(ifalias) & (IDXMAP_SIZE - 1);
> + hlist_add_head(&im->alias_hash, &alias_head[h]);
> + }   
> + 
>   return 0;
>  }
>  
> @@ -152,7 +191,7 @@ static unsigned int ll_idx_a2n(const char *name)
>   return idx;
>  }
>  
> -static int ll_link_get(const char *name, int index)
> +static int ll_link_get(const char *name, const char *alias, int
> index)
>  {
>   struct {
>   struct nlmsghdr n;
> @@ -176,6 +215,9 @@ static int ll_link_get(const char *name, int
> index)
>   if (name)
>   addattr_l(&req.n, sizeof(req), IFLA_IFNAME, name,
> strlen(name) + 1);
> + if (alias)
> + addattr_l(&req.n, sizeof(req), IFLA_IF

Re: [RFC] longer netdev names proposal

2019-06-27 Thread Stephen Hemminger

On Thu, 27 Jun 2019 20:39:48 +0200
Michal Kubecek  wrote:

> > 
> > $ ip li set dev enp3s0 alias "Onboard Ethernet"
> > # ip link show "Onboard Ethernet"
> > Device "Onboard Ethernet" does not exist.
> > 
> > So it does not really appear to be an alias, it is a label. To be
> > truly useful, it needs to be more than a label, it needs to be a real
> > alias which you can use.  
> 
> That's exactly what I meant: to be really useful, one should be able to
> use the alias(es) for setting device options, for adding routes, in
> netfilter rules etc.
> 
> Michal

The kernel doesn't enforce uniqueness of alias.
Also current kernel RTM_GETLINK doesn't do filter by alias (easily fixed).

If it did, then handling it in iproute would be something like:

diff --git a/lib/ll_map.c b/lib/ll_map.c
index e0ed54bf77c9..c798ba542224 100644
--- a/lib/ll_map.c
+++ b/lib/ll_map.c
@@ -26,15 +26,18 @@
 struct ll_cache {
 	struct hlist_node idx_hash;
 	struct hlist_node name_hash;
+	struct hlist_node alias_hash;
 	unsigned	flags;
 	unsigned	index;
 	unsigned short	type;
-	char		name[];
+	char		*alias;
+	char		name[IFNAMSIZ];
 };
 
 #define IDXMAP_SIZE	1024
 static struct hlist_head idx_head[IDXMAP_SIZE];
 static struct hlist_head name_head[IDXMAP_SIZE];
+static struct hlist_head alias_head[IDXMAP_SIZE];
 
 static struct ll_cache *ll_get_by_index(unsigned index)
 {
@@ -77,10 +80,26 @@ static struct ll_cache *ll_get_by_name(const char *name)
return NULL;
 }
 
+static struct ll_cache *ll_get_by_alias(const char *alias)
+{
+   struct hlist_node *n;
+   unsigned h = namehash(alias) & (IDXMAP_SIZE - 1);
+
+   hlist_for_each(n, &alias_head[h]) {
+   struct ll_cache *im
+   = container_of(n, struct ll_cache, alias_hash);
+
+   if (strcmp(im->alias, alias) == 0)
+   return im;
+   }
+
+   return NULL;
+}
+
 int ll_remember_index(struct nlmsghdr *n, void *arg)
 {
unsigned int h;
-   const char *ifname;
+   const char *ifname, *ifalias;
struct ifinfomsg *ifi = NLMSG_DATA(n);
struct ll_cache *im;
struct rtattr *tb[IFLA_MAX+1];
@@ -96,6 +115,10 @@ int ll_remember_index(struct nlmsghdr *n, void *arg)
if (im) {
hlist_del(&im->name_hash);
hlist_del(&im->idx_hash);
+   if (im->alias) {
+   hlist_del(&im->alias_hash);
+   free(im->alias);
+   }
free(im);
}
return 0;
@@ -106,6 +129,8 @@ int ll_remember_index(struct nlmsghdr *n, void *arg)
if (ifname == NULL)
return 0;
 
+   ifalias = tb[IFLA_IFALIAS] ? rta_getattr_str(tb[IFLA_IFALIAS]) : NULL;
+
if (im) {
/* change to existing entry */
if (strcmp(im->name, ifname) != 0) {
@@ -114,6 +139,14 @@ int ll_remember_index(struct nlmsghdr *n, void *arg)
hlist_add_head(&im->name_hash, &name_head[h]);
}
 
+   if (im->alias) {
+   hlist_del(&im->alias_hash);
+   if (ifalias) {
+   h = namehash(ifalias) & (IDXMAP_SIZE - 1);
+   hlist_add_head(&im->alias_hash, &alias_head[h]);
+   }
+   }
+
im->flags = ifi->ifi_flags;
return 0;
}
@@ -132,6 +165,12 @@ int ll_remember_index(struct nlmsghdr *n, void *arg)
h = namehash(ifname) & (IDXMAP_SIZE - 1);
hlist_add_head(&im->name_hash, &name_head[h]);
 
+   if (ifalias) {
+   im->alias = strdup(ifalias);
+   h = namehash(ifalias) & (IDXMAP_SIZE - 1);
+   hlist_add_head(&im->alias_hash, &alias_head[h]);
+   }   
+   
return 0;
 }
 
@@ -152,7 +191,7 @@ static unsigned int ll_idx_a2n(const char *name)
return idx;
 }
 
-static int ll_link_get(const char *name, int index)
+static int ll_link_get(const char *name, const char *alias, int index)
 {
struct {
struct nlmsghdr n;
@@ -176,6 +215,9 @@ static int ll_link_get(const char *name, int index)
if (name)
addattr_l(&req.n, sizeof(req), IFLA_IFNAME, name,
  strlen(name) + 1);
+   if (alias)
+   addattr_l(&req.n, sizeof(req), IFLA_IFALIAS, alias,
+ strlen(alias) + 1);
 
if (rtnl_talk_suppress_rtnl_errmsg(&rth, &req.n, &answer) < 0)
goto out;
@@ -206,7 +248,7 @@ const char *ll_index_to_name(unsigned int idx)
if (im)
return im->name;
 
-   if (ll_link_get(NULL, idx) == idx) {
+   if (ll_link_get(NULL, NULL, idx) == idx) {
im = ll_get

Re: [RFC] longer netdev names proposal

2019-06-27 Thread Michal Kubecek

On Thu, Jun 27, 2019 at 08:35:38PM +0200, Andrew Lunn wrote:
> On Thu, Jun 27, 2019 at 11:23:05AM -0700, Stephen Hemminger wrote:
> > On Thu, 27 Jun 2019 20:08:03 +0200 Michal Kubecek  wrote:
> > 
> > > It often feels as a deficiency that unlike block devices where we can
> > > keep one name and create multiple symlinks based on different naming
> > > schemes, network devices can have only one name. There are aliases but
> > > AFAIK they are only used (and can be only used) for SNMP. IMHO this
> > > limitation is part of the mess that left us with so-called "predictable
> > > names" which are in practice neither persistent nor predictable.
> > > 
> > > So perhaps we could introduce actual aliases (or altnames or whatever we
> > > would call them) for network devices that could be used to identify
> > > a network device whenever both kernel and userspace tool supports them.
> > > Old (and ancient) tools would have to use the one canonical name limited
> > > to current IFNAMSIZ, new tools would allow using any alias which could
> > > be longer.
> >  
> > That is already there in current network model.
> > # ip li set dev eno1 alias 'Onboard Ethernet'
> > # ip li show dev eno1
> > 2: eno1:  mtu 1500 qdisc mq state UP mode 
> > DEFAULT group default qlen 1000
> > link/ether ac:1f:6b:74:38:c0 brd ff:ff:ff:ff:ff:ff
> > alias Onboard Ethernet
> 
> $ ip li set dev enp3s0 alias "Onboard Ethernet"
> # ip link show "Onboard Ethernet"
> Device "Onboard Ethernet" does not exist.
> 
> So it does not really appear to be an alias, it is a label. To be
> truly useful, it needs to be more than a label, it needs to be a real
> alias which you can use.

That's exactly what I meant: to be really useful, one should be able to
use the alias(es) for setting device options, for adding routes, in
netfilter rules etc.

Michal

Re: [RFC] longer netdev names proposal

2019-06-27 Thread Andrew Lunn

On Thu, Jun 27, 2019 at 11:23:05AM -0700, Stephen Hemminger wrote:
> On Thu, 27 Jun 2019 20:08:03 +0200
> Michal Kubecek  wrote:
> 
> > It often feels as a deficiency that unlike block devices where we can
> > keep one name and create multiple symlinks based on different naming
> > schemes, network devices can have only one name. There are aliases but
> > AFAIK they are only used (and can be only used) for SNMP. IMHO this
> > limitation is part of the mess that left us with so-called "predictable
> > names" which are in practice neither persistent nor predictable.
> > 
> > So perhaps we could introduce actual aliases (or altnames or whatever we
> > would call them) for network devices that could be used to identify
> > a network device whenever both kernel and userspace tool supports them.
> > Old (and ancient) tools would have to use the one canonical name limited
> > to current IFNAMSIZ, new tools would allow using any alias which could
> > be longer.
> > 
> > Michal
> 
>  
> That is already there in current network model.
> # ip li set dev eno1 alias 'Onboard Ethernet'
> # ip li show dev eno1
> 2: eno1:  mtu 1500 qdisc mq state UP mode 
> DEFAULT group default qlen 1000
> link/ether ac:1f:6b:74:38:c0 brd ff:ff:ff:ff:ff:ff
> alias Onboard Ethernet

Hi Stephen

$ ip li set dev enp3s0 alias "Onboard Ethernet"
# ip link show "Onboard Ethernet"
Device "Onboard Ethernet" does not exist.

So it does not really appear to be an alias, it is a label. To be
truly useful, it needs to be more than a label, it needs to be a real
alias which you can use.

 Andrew
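
To illustrate the label-versus-alias distinction made above, here is a minimal
sketch of a lookup that accepts either the canonical name or the alias. The
struct and function names (sk_link, sk_lookup) are hypothetical and are not
taken from iproute2 or the kernel; a return value of 0 means "not found",
matching the convention that 0 is never a valid ifindex.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory view of a link: ifindex, name, optional alias. */
struct sk_link {
	unsigned int ifindex;
	const char *name;
	const char *alias;	/* may be NULL when no alias is set */
};

/*
 * Resolve key to an ifindex. Canonical names win over aliases, so an
 * alias can never shadow a real device name. Returns 0 if nothing matches.
 */
static unsigned int sk_lookup(const struct sk_link *links, size_t n,
			      const char *key)
{
	size_t i;

	assert(links != NULL && key != NULL);

	for (i = 0; i < n; i++)
		if (strcmp(links[i].name, key) == 0)
			return links[i].ifindex;

	for (i = 0; i < n; i++)
		if (links[i].alias && strcmp(links[i].alias, key) == 0)
			return links[i].ifindex;

	return 0;
}
```

With the alias treated as a mere label, only the first loop would exist, and a
lookup of "Onboard Ethernet" would fail as in the output above.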

Re: [RFC] longer netdev names proposal

2019-06-27 Thread Stephen Hemminger

On Thu, 27 Jun 2019 20:08:03 +0200
Michal Kubecek  wrote:

> It often feels as a deficiency that unlike block devices where we can
> keep one name and create multiple symlinks based on different naming
> schemes, network devices can have only one name. There are aliases but
> AFAIK they are only used (and can be only used) for SNMP. IMHO this
> limitation is part of the mess that left us with so-called "predictable
> names" which are in practice neither persistent nor predictable.
> 
> So perhaps we could introduce actual aliases (or altnames or whatever we
> would call them) for network devices that could be used to identify
> a network device whenever both kernel and userspace tool supports them.
> Old (and ancient) tools would have to use the one canonical name limited
> to current IFNAMSIZ, new tools would allow using any alias which could
> be longer.
> 
> Michal

 
That is already there in current network model.
# ip li set dev eno1 alias 'Onboard Ethernet'
# ip li show dev eno1
2: eno1:  mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether ac:1f:6b:74:38:c0 brd ff:ff:ff:ff:ff:ff
    alias Onboard Ethernet

Re: [RFC] longer netdev names proposal

2019-06-27 Thread Michal Kubecek

On Thu, Jun 27, 2019 at 11:14:31AM -0600, David Ahern wrote:
> > 4) There are two cases that can happen during rename:
> >A) The name is shorter than IFNAMSIZ
> >   -> both IFLA_NAME and IFLA_NAME_EXT would contain the same string:
> >  original IFLA_NAME = eth0
> >  original IFLA_NAME_EXT = eth0
> >  renamed  IFLA_NAME = enp5s0f1npf0vf1
> >  renamed  IFLA_NAME_EXT = enp5s0f1npf0vf1
> >B) The name is longer tha IFNAMSIZ
> >   -> IFLA_NAME would contain the original one, IFLA_NAME_EXT would 
> >  contain the new one:
> >  original IFLA_NAME = eth0
> >  original IFLA_NAME_EXT = eth0
> >  renamed  IFLA_NAME = eth0
> >  renamed  IFLA_NAME_EXT = enp131s0f1npf0vf22
> 
> so kernel side there will be 2 names for the same net_device?

It often feels like a deficiency that unlike block devices, where we can
keep one name and create multiple symlinks based on different naming
schemes, network devices can have only one name. There are aliases but
AFAIK they are only used (and can only be used) for SNMP. IMHO this
limitation is part of the mess that left us with so-called "predictable
names" which are in practice neither persistent nor predictable.

So perhaps we could introduce actual aliases (or altnames or whatever we
would call them) for network devices that could be used to identify
a network device whenever both the kernel and the userspace tool support them.
Old (and ancient) tools would have to use the one canonical name limited
to the current IFNAMSIZ, new tools would allow using any alias, which could
be longer.

Michal

Re: [RFC] longer netdev names proposal

2019-06-27 Thread Stephen Hemminger

On Thu, 27 Jun 2019 10:48:08 -0700
Jakub Kicinski  wrote:

> On Thu, 27 Jun 2019 11:43:27 +0200, Jiri Pirko wrote:
> > Hi all.
> > 
> > In the past, there was repeatedly discussed the IFNAMSIZ (16) limit for
> > netdevice name length. Now when we have PF and VF representors
> > with port names like "pfXvfY", it became quite common to hit this limit:
> > 0123456789012345
> > enp131s0f1npf0vf6
> > enp131s0f1npf0vf22
> > 
> > Since IFLA_NAME is just a string, I though it might be possible to use
> > it to carry longer names as it is. However, the userspace tools, like
> > iproute2, are doing checks before print out. So for example in output of
> > "ip addr" when IFLA_NAME is longer than IFNAMSIZE, the netdevice is
> > completely avoided.
> > 
> > So here is a proposal that might work:
> > 1) Add a new attribute IFLA_NAME_EXT that could carry names longer than
> >IFNAMSIZE, say 64 bytes. The max size should be only defined in kernel,
> >user should be prepared for any string size.
> > 2) Add a file in sysfs that would indicate that NAME_EXT is supported by
> >the kernel.
> > 3) Udev is going to look for the sysfs indication file. In case when
> >kernel supports long names, it will do rename to longer name, setting
> >IFLA_NAME_EXT. If not, it does what it does now - fail.
> > 4) There are two cases that can happen during rename:
> >A) The name is shorter than IFNAMSIZ  
> >   -> both IFLA_NAME and IFLA_NAME_EXT would contain the same string:
> >  original IFLA_NAME = eth0
> >  original IFLA_NAME_EXT = eth0
> >  renamed  IFLA_NAME = enp5s0f1npf0vf1
> >  renamed  IFLA_NAME_EXT = enp5s0f1npf0vf1
> >B) The name is longer tha IFNAMSIZ  
> >   -> IFLA_NAME would contain the original one, IFLA_NAME_EXT would 
> >  contain the new one:
> >  original IFLA_NAME = eth0
> >  original IFLA_NAME_EXT = eth0
> >  renamed  IFLA_NAME = eth0
> >  renamed  IFLA_NAME_EXT = enp131s0f1npf0vf22  
> 
> I think B is the only way, A risks duplicate IFLA_NAMEs over ioctl,
> right?  And maybe there is some crazy application out there which 
> mixes netlink and ioctl.
> 
> I guess it's not worse than status quo, given that today renames 
> will fail and we will either get truncated names or eth0s..
> 
> > This would allow the old tools to work with "eth0" and the new
> > tools would work with "enp131s0f1npf0vf22". In sysfs, there would
> > be symlink from one name to another.
> >   
> > Also, there might be a warning added to kernel if someone works
> > with IFLA_NAME that the userspace tool should be upgraded.
> > 
> > Eventually, only IFLA_NAME_EXT is going to be used by everyone.
> > 
> > I'm aware there are other places where similar new attribute
> > would have to be introduced too (ip rule for example).
> > I'm not saying this is a simple work.
> > 
> > Question is what to do with the ioctl api (get ifindex etc). I would
> > probably leave it as is and push tools to use rtnetlink instead.
> > 
> > Any ideas why this would not work? Any ideas how to solve this
> > differently?  
> 
> Since we'd have to update all user space to make use of the new names
> I'd be tempted to move to a more structured device identification.
> 
> 5: enp131s0f1npf0vf6:  ...
> 
> vs:
> 
> 5: eth5 (parent enp131s0f1 pf 0 vf 6 peer X*):  ...
> 
> * ;)
> 
> And allow filtering/selection of device based on more attributes than
> just name and ifindex.  In practice in container workloads, for example,
> the names are already very much insufficient to identify the device.
> Refocusing on attributes is probably a big effort and not that practical
> for traditional CLI users?  IDK
> 
> Anyway, IMHO your scheme is strictly better than status quo.

Or Cisco style naming ;-) Ethernet0/0 

There is a better solution for human use already.
The ifalias field allows arbitrary values and is hooked into SNMP.

Why not have userspace fill in this field with something by default?

Re: [RFC] longer netdev names proposal

2019-06-27 Thread Jakub Kicinski

On Thu, 27 Jun 2019 11:43:27 +0200, Jiri Pirko wrote:
> Hi all.
> 
> In the past, there was repeatedly discussed the IFNAMSIZ (16) limit for
> netdevice name length. Now when we have PF and VF representors
> with port names like "pfXvfY", it became quite common to hit this limit:
> 0123456789012345
> enp131s0f1npf0vf6
> enp131s0f1npf0vf22
> 
> Since IFLA_NAME is just a string, I though it might be possible to use
> it to carry longer names as it is. However, the userspace tools, like
> iproute2, are doing checks before print out. So for example in output of
> "ip addr" when IFLA_NAME is longer than IFNAMSIZE, the netdevice is
> completely avoided.
> 
> So here is a proposal that might work:
> 1) Add a new attribute IFLA_NAME_EXT that could carry names longer than
>IFNAMSIZE, say 64 bytes. The max size should be only defined in kernel,
>user should be prepared for any string size.
> 2) Add a file in sysfs that would indicate that NAME_EXT is supported by
>the kernel.
> 3) Udev is going to look for the sysfs indication file. In case when
>kernel supports long names, it will do rename to longer name, setting
>IFLA_NAME_EXT. If not, it does what it does now - fail.
> 4) There are two cases that can happen during rename:
>A) The name is shorter than IFNAMSIZ
>   -> both IFLA_NAME and IFLA_NAME_EXT would contain the same string:  
>  original IFLA_NAME = eth0
>  original IFLA_NAME_EXT = eth0
>  renamed  IFLA_NAME = enp5s0f1npf0vf1
>  renamed  IFLA_NAME_EXT = enp5s0f1npf0vf1
>B) The name is longer tha IFNAMSIZ
>   -> IFLA_NAME would contain the original one, IFLA_NAME_EXT would   
>  contain the new one:
>  original IFLA_NAME = eth0
>  original IFLA_NAME_EXT = eth0
>  renamed  IFLA_NAME = eth0
>  renamed  IFLA_NAME_EXT = enp131s0f1npf0vf22

I think B is the only way, A risks duplicate IFLA_NAMEs over ioctl,
right?  And maybe there is some crazy application out there which 
mixes netlink and ioctl.

I guess it's not worse than status quo, given that today renames 
will fail and we will either get truncated names or eth0s..

> This would allow the old tools to work with "eth0" and the new
> tools would work with "enp131s0f1npf0vf22". In sysfs, there would
> be symlink from one name to another.
>   
> Also, there might be a warning added to kernel if someone works
> with IFLA_NAME that the userspace tool should be upgraded.
> 
> Eventually, only IFLA_NAME_EXT is going to be used by everyone.
> 
> I'm aware there are other places where similar new attribute
> would have to be introduced too (ip rule for example).
> I'm not saying this is a simple work.
> 
> Question is what to do with the ioctl api (get ifindex etc). I would
> probably leave it as is and push tools to use rtnetlink instead.
> 
> Any ideas why this would not work? Any ideas how to solve this
> differently?

Since we'd have to update all user space to make use of the new names
I'd be tempted to move to a more structured device identification.

5: enp131s0f1npf0vf6:  ...

vs:

5: eth5 (parent enp131s0f1 pf 0 vf 6 peer X*):  ...

* ;)

And allow filtering/selection of device based on more attributes than
just name and ifindex.  In practice in container workloads, for example,
the names are already very much insufficient to identify the device.
Refocusing on attributes is probably a big effort and not that practical
for traditional CLI users?  IDK

Anyway, IMHO your scheme is strictly better than status quo.

Re: [RFC] longer netdev names proposal

2019-06-27 Thread David Ahern

On 6/27/19 3:43 AM, Jiri Pirko wrote:
> Hi all.
> 
> In the past, there was repeatedly discussed the IFNAMSIZ (16) limit for
> netdevice name length. Now when we have PF and VF representors
> with port names like "pfXvfY", it became quite common to hit this limit:
> 0123456789012345
> enp131s0f1npf0vf6
> enp131s0f1npf0vf22

QinQ (stacked vlans) is another example.

> 
> Since IFLA_NAME is just a string, I though it might be possible to use
> it to carry longer names as it is. However, the userspace tools, like
> iproute2, are doing checks before print out. So for example in output of
> "ip addr" when IFLA_NAME is longer than IFNAMSIZE, the netdevice is
> completely avoided.
> 
> So here is a proposal that might work:
> 1) Add a new attribute IFLA_NAME_EXT that could carry names longer than
>IFNAMSIZE, say 64 bytes. The max size should be only defined in kernel,
>user should be prepared for any string size.
> 2) Add a file in sysfs that would indicate that NAME_EXT is supported by
>the kernel.

no sysfs files.

Johannes added infrastructure to retrieve the policy. That is a more
flexible and robust option for determining what the kernel supports.


> 3) Udev is going to look for the sysfs indication file. In case when
>kernel supports long names, it will do rename to longer name, setting
>IFLA_NAME_EXT. If not, it does what it does now - fail.
> 4) There are two cases that can happen during rename:
>A) The name is shorter than IFNAMSIZ
>   -> both IFLA_NAME and IFLA_NAME_EXT would contain the same string:
>  original IFLA_NAME = eth0
>  original IFLA_NAME_EXT = eth0
>  renamed  IFLA_NAME = enp5s0f1npf0vf1
>  renamed  IFLA_NAME_EXT = enp5s0f1npf0vf1
>B) The name is longer tha IFNAMSIZ
>   -> IFLA_NAME would contain the original one, IFLA_NAME_EXT would 
>  contain the new one:
>  original IFLA_NAME = eth0
>  original IFLA_NAME_EXT = eth0
>  renamed  IFLA_NAME = eth0
>  renamed  IFLA_NAME_EXT = enp131s0f1npf0vf22

so kernel side there will be 2 names for the same net_device?

> 
> This would allow the old tools to work with "eth0" and the new
> tools would work with "enp131s0f1npf0vf22". In sysfs, there would
> be symlink from one name to another.

I would prefer a solution that does not rely on sysfs hooks.

>   
> Also, there might be a warning added to kernel if someone works
> with IFLA_NAME that the userspace tool should be upgraded.

that seems like spam and confusion for the first few years of a new api.

> 
> Eventually, only IFLA_NAME_EXT is going to be used by everyone.
> 
> I'm aware there are other places where similar new attribute
> would have to be introduced too (ip rule for example).
> I'm not saying this is a simple work.
> 
> Question is what to do with the ioctl api (get ifindex etc). I would
> probably leave it as is and push tools to use rtnetlink instead.

The ioctl API is going to be a limiter here. ifconfig is still quite
prevalent and net-snmp still uses ioctl (as just 2 common examples).
snmp showing one set of names and rtnetlink s/w showing another is going
to be really confusing.

Re: [RFC] longer netdev names proposal

2019-06-27 Thread Dan Williams

On Thu, 2019-06-27 at 08:29 -0700, Stephen Hemminger wrote:
> On Thu, 27 Jun 2019 11:43:27 +0200
> Jiri Pirko  wrote:
> 
> > Hi all.
> > 
> > > In the past, there was repeatedly discussed the IFNAMSIZ (16) limit for
> > > netdevice name length. Now when we have PF and VF representors
> > > with port names like "pfXvfY", it became quite common to hit this limit:
> > > 0123456789012345
> > > enp131s0f1npf0vf6
> > > enp131s0f1npf0vf22
> > >
> > > Since IFLA_NAME is just a string, I though it might be possible to use
> > > it to carry longer names as it is. However, the userspace tools, like
> > > iproute2, are doing checks before print out. So for example in output of
> > > "ip addr" when IFLA_NAME is longer than IFNAMSIZE, the netdevice is
> > > completely avoided.
> > >
> > > So here is a proposal that might work:
> > > 1) Add a new attribute IFLA_NAME_EXT that could carry names longer than
> > >    IFNAMSIZE, say 64 bytes. The max size should be only defined in kernel,
> > >    user should be prepared for any string size.
> > > 2) Add a file in sysfs that would indicate that NAME_EXT is supported by
> > >    the kernel.
> > > 3) Udev is going to look for the sysfs indication file. In case when
> > >    kernel supports long names, it will do rename to longer name, setting
> > >    IFLA_NAME_EXT. If not, it does what it does now - fail.
> > > 4) There are two cases that can happen during rename:
> > >    A) The name is shorter than IFNAMSIZ
> > >       -> both IFLA_NAME and IFLA_NAME_EXT would contain the same string:
> > >          original IFLA_NAME = eth0
> > >          original IFLA_NAME_EXT = eth0
> > >          renamed  IFLA_NAME = enp5s0f1npf0vf1
> > >          renamed  IFLA_NAME_EXT = enp5s0f1npf0vf1
> > >    B) The name is longer tha IFNAMSIZ
> > >       -> IFLA_NAME would contain the original one, IFLA_NAME_EXT would
> > >          contain the new one:
> > >          original IFLA_NAME = eth0
> > >          original IFLA_NAME_EXT = eth0
> > >          renamed  IFLA_NAME = eth0
> > >          renamed  IFLA_NAME_EXT = enp131s0f1npf0vf22

It makes me a bit uncomfortable to allow IFLA_NAME and IFLA_NAME_EXT to
be completely different. That sounds like a big source of confusion and
debugging problems in production.

Dan

> > This would allow the old tools to work with "eth0" and the new
> > tools would work with "enp131s0f1npf0vf22". In sysfs, there would
> > be symlink from one name to another.
> >   
> > Also, there might be a warning added to kernel if someone works
> > with IFLA_NAME that the userspace tool should be upgraded.
> > 
> > Eventually, only IFLA_NAME_EXT is going to be used by everyone.
> > 
> > I'm aware there are other places where similar new attribute
> > would have to be introduced too (ip rule for example).
> > I'm not saying this is a simple work.
> > 
> > > Question is what to do with the ioctl api (get ifindex etc). I would
> > > probably leave it as is and push tools to use rtnetlink instead.
> > 
> > Any ideas why this would not work? Any ideas how to solve this
> > differently?
> > 
> > Thanks!
> > 
> > Jiri
> >  
> 
> I looked into this in the past, but then rejected it because
> there are so many tools that use names, not just iproute2.
> Plus long names are very user unfriendly.

Re: [RFC] longer netdev names proposal

2019-06-27 Thread Stephen Hemminger

On Thu, 27 Jun 2019 11:43:27 +0200
Jiri Pirko  wrote:

> Hi all.
> 
> In the past, there was repeatedly discussed the IFNAMSIZ (16) limit for
> netdevice name length. Now when we have PF and VF representors
> with port names like "pfXvfY", it became quite common to hit this limit:
> 0123456789012345
> enp131s0f1npf0vf6
> enp131s0f1npf0vf22
> 
> Since IFLA_NAME is just a string, I though it might be possible to use
> it to carry longer names as it is. However, the userspace tools, like
> iproute2, are doing checks before print out. So for example in output of
> "ip addr" when IFLA_NAME is longer than IFNAMSIZE, the netdevice is
> completely avoided.
> 
> So here is a proposal that might work:
> 1) Add a new attribute IFLA_NAME_EXT that could carry names longer than
>IFNAMSIZE, say 64 bytes. The max size should be only defined in kernel,
>user should be prepared for any string size.
> 2) Add a file in sysfs that would indicate that NAME_EXT is supported by
>the kernel.
> 3) Udev is going to look for the sysfs indication file. In case when
>kernel supports long names, it will do rename to longer name, setting
>IFLA_NAME_EXT. If not, it does what it does now - fail.
> 4) There are two cases that can happen during rename:
>A) The name is shorter than IFNAMSIZ
>   -> both IFLA_NAME and IFLA_NAME_EXT would contain the same string:  
>  original IFLA_NAME = eth0
>  original IFLA_NAME_EXT = eth0
>  renamed  IFLA_NAME = enp5s0f1npf0vf1
>  renamed  IFLA_NAME_EXT = enp5s0f1npf0vf1
>B) The name is longer tha IFNAMSIZ
>   -> IFLA_NAME would contain the original one, IFLA_NAME_EXT would   
>  contain the new one:
>  original IFLA_NAME = eth0
>  original IFLA_NAME_EXT = eth0
>  renamed  IFLA_NAME = eth0
>  renamed  IFLA_NAME_EXT = enp131s0f1npf0vf22
> 
> This would allow the old tools to work with "eth0" and the new
> tools would work with "enp131s0f1npf0vf22". In sysfs, there would
> be symlink from one name to another.
>   
> Also, there might be a warning added to kernel if someone works
> with IFLA_NAME that the userspace tool should be upgraded.
> 
> Eventually, only IFLA_NAME_EXT is going to be used by everyone.
> 
> I'm aware there are other places where similar new attribute
> would have to be introduced too (ip rule for example).
> I'm not saying this is a simple work.
> 
> Question is what to do with the ioctl api (get ifindex etc). I would
> probably leave it as is and push tools to use rtnetlink instead.
> 
> Any ideas why this would not work? Any ideas how to solve this
> differently?
> 
> Thanks!
> 
> Jiri
>  

I looked into this in the past, but then rejected it because
there are so many tools that use names, not just iproute2.
Plus long names are very user unfriendly.

[RFC] longer netdev names proposal

2019-06-27 Thread Jiri Pirko

Hi all.

The IFNAMSIZ (16) limit on netdevice name length has been discussed
repeatedly in the past. Now that we have PF and VF representors
with port names like "pfXvfY", it has become quite common to hit this limit:
0123456789012345
enp131s0f1npf0vf6
enp131s0f1npf0vf22

Since IFLA_NAME is just a string, I thought it might be possible to use
it to carry longer names as it is. However, userspace tools like
iproute2 do checks before printing. So, for example, in the output of
"ip addr", when IFLA_NAME is longer than IFNAMSIZ, the netdevice is
skipped entirely.
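
As a rough illustration of the kind of length check a userspace tool might
apply, here is a minimal sketch. It is not the actual iproute2 code; the
helper name name_fits_ifnamsiz and the SKETCH_IFNAMSIZ macro are made up for
this example. It assumes a name "fits" when the string plus its terminating
NUL occupies at most IFNAMSIZ (16) bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Same value as IFNAMSIZ in <net/if.h>; redefined to keep the sketch standalone. */
#define SKETCH_IFNAMSIZ 16

/* Hypothetical helper: true if name plus its NUL fits in an IFNAMSIZ buffer. */
static bool name_fits_ifnamsiz(const char *name)
{
	assert(name != NULL);
	return strlen(name) < SKETCH_IFNAMSIZ;
}
```

Under that assumption "enp5s0f1npf0vf1" (15 characters) still fits, while
"enp131s0f1npf0vf6" (17) and "enp131s0f1npf0vf22" (18) do not.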

So here is a proposal that might work:
1) Add a new attribute IFLA_NAME_EXT that could carry names longer than
   IFNAMSIZ, say 64 bytes. The max size should be defined only in the kernel;
   userspace should be prepared for any string size.
2) Add a file in sysfs that would indicate that NAME_EXT is supported by
   the kernel.
3) Udev is going to look for the sysfs indication file. If the
   kernel supports long names, it will rename to the longer name, setting
   IFLA_NAME_EXT. If not, it does what it does now: fail.
4) There are two cases that can happen during rename:
   A) The name is shorter than IFNAMSIZ
      -> both IFLA_NAME and IFLA_NAME_EXT would contain the same string:
         original IFLA_NAME     = eth0
         original IFLA_NAME_EXT = eth0
         renamed  IFLA_NAME     = enp5s0f1npf0vf1
         renamed  IFLA_NAME_EXT = enp5s0f1npf0vf1
   B) The name is longer than IFNAMSIZ
      -> IFLA_NAME would contain the original one, IFLA_NAME_EXT would
         contain the new one:
         original IFLA_NAME     = eth0
         original IFLA_NAME_EXT = eth0
         renamed  IFLA_NAME     = eth0
         renamed  IFLA_NAME_EXT = enp131s0f1npf0vf22
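
The two rename cases can be sketched as follows. This is only a userspace
model of the proposed semantics, not kernel code: struct sketch_netdev,
sketch_rename() and both size macros are hypothetical, and the 64-byte limit
is just the "say 64 bytes" figure from point 1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_IFNAMSIZ     16	/* same value as IFNAMSIZ */
#define SKETCH_NAME_EXT_MAX 64	/* assumed limit for the extended name */

/* Hypothetical model of a device that carries both names. */
struct sketch_netdev {
	char name[SKETCH_IFNAMSIZ];		/* reported as IFLA_NAME */
	char name_ext[SKETCH_NAME_EXT_MAX];	/* reported as IFLA_NAME_EXT */
};

/* Returns 0 on success, -1 if new_name is empty or too long for name_ext. */
static int sketch_rename(struct sketch_netdev *dev, const char *new_name)
{
	size_t len;

	assert(dev != NULL && new_name != NULL);

	len = strlen(new_name);
	if (len == 0 || len >= SKETCH_NAME_EXT_MAX)
		return -1;

	/* Case A: the name fits, so both attributes get the same string. */
	if (len < SKETCH_IFNAMSIZ)
		memcpy(dev->name, new_name, len + 1);
	/* Case B: the name is too long, so dev->name keeps its old value. */

	memcpy(dev->name_ext, new_name, len + 1);
	return 0;
}
```

Starting from a device whose two names are both "eth0", renaming to
"enp131s0f1npf0vf22" leaves name as "eth0" and sets name_ext to the long
string, which is case B.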

This would allow the old tools to work with "eth0" and the new
tools would work with "enp131s0f1npf0vf22". In sysfs, there would
be a symlink from one name to the other.

Also, a warning might be added to the kernel if someone works
with IFLA_NAME, saying that the userspace tool should be upgraded.

Eventually, only IFLA_NAME_EXT is going to be used by everyone.

I'm aware there are other places where a similar new attribute
would have to be introduced too (ip rule for example).
I'm not saying this is simple work.

The question is what to do with the ioctl API (get ifindex etc). I would
probably leave it as is and push tools to use rtnetlink instead.

Any ideas why this would not work? Any ideas how to solve this
differently?

Thanks!

Jiri

[RFC v2] vsock: proposal to support multiple transports at runtime

2019-06-06 Thread Stefano Garzarella



Hi all,
this is a v2 of a proposal addressing the comments made by Dexuan, Stefan,
and Jorgen.

v1: https://www.spinics.net/lists/netdev/msg570274.html



We can define two types of transport that we have to handle at the same time
(e.g. in a nested VM we would have both types of transport running together):

- 'host->guest' transport: it runs in the host and is used to communicate
  with the guests of a specific hypervisor (KVM, VMware or Hyper-V). It also
  runs in a guest that has nested guests, to communicate with them.

  [Phase 2]
  We can support multiple 'host->guest' transports running at the same time,
  but on x86 only one hypervisor uses VMX at any given time.

- 'guest->host' transport: it runs in the guest and is used to communicate
  with the host.


The main goal is to find a way to decide which transport to use in these cases:
1. connect() / sendto()

   a. use the 'host->guest' transport, if the destination is a guest
      (dest_cid > VMADDR_CID_HOST).

      [Phase 2]
      In order to support multiple 'host->guest' transports running at the
      same time, we should assign CIDs uniquely across all transports. In this
      way, a packet generated by the host side will get directed to the
      appropriate transport based on the CID.

   b. use the 'guest->host' transport, if the destination is the host or the
      hypervisor
      (dest_cid == VMADDR_CID_HOST || dest_cid == VMADDR_CID_HYPERVISOR).
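
The connect() / sendto() rule above can be sketched roughly like this. It is a
standalone model rather than af_vsock code: the conn_* names are hypothetical,
and the CID constants only mirror the values of VMADDR_CID_HYPERVISOR (0) and
VMADDR_CID_HOST (2) from <linux/vm_sockets.h>. The sketch does not
special-case VMADDR_CID_ANY as a destination, since the proposal does not
describe that case.

```c
#include <assert.h>
#include <stddef.h>

/* Values mirror <linux/vm_sockets.h>; redefined so the sketch builds anywhere. */
#define CONN_CID_HYPERVISOR 0u
#define CONN_CID_HOST       2u

enum conn_transport { CONN_NONE, CONN_HOST_TO_GUEST, CONN_GUEST_TO_HOST };

/* Hypothetical record of which transport types are currently registered. */
struct conn_loaded {
	int has_h2g;	/* a 'host->guest' transport is available */
	int has_g2h;	/* a 'guest->host' transport is available */
};

/* Pick the transport for an outgoing connection based on the destination CID. */
static enum conn_transport conn_pick(const struct conn_loaded *t,
				     unsigned int dest_cid)
{
	assert(t != NULL);

	/* 1.a: the destination is a guest. */
	if (dest_cid > CONN_CID_HOST)
		return t->has_h2g ? CONN_HOST_TO_GUEST : CONN_NONE;

	/* 1.b: the destination is the host or the hypervisor. */
	if (dest_cid == CONN_CID_HOST || dest_cid == CONN_CID_HYPERVISOR)
		return t->has_g2h ? CONN_GUEST_TO_HOST : CONN_NONE;

	/* CID 1 is reserved and matches neither rule. */
	return CONN_NONE;
}
```

In a nested VM both flags would be set, so a connect to CID 3 would go through
the 'host->guest' transport and a connect to CID 2 through 'guest->host'.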


2. listen() / recvfrom()

   a. use the 'host->guest' transport, if the socket is bound to
      VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no
      'guest->host' transport.
      We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order to
      address this case.

      [Phase 2]
      We can support network namespaces to create independent AF_VSOCK
      addressing domains:
      - could be used to partition VMs between hypervisors or at a finer
        granularity;
      - could be used to isolate host applications from guest applications
        using the same ports with CID_ANY;

   b. use the 'guest->host' transport, if the socket is bound to a local CID
      different from VMADDR_CID_HOST (the guest CID obtained with
      IOCTL_VM_SOCKETS_GET_LOCAL_CID), or it is bound to VMADDR_CID_ANY (to be
      backward compatible).
      Also in this case, we could define a new VMADDR_CID_LISTEN_FROM_HOST.

   c. shared port space between transports
      For incoming requests or packets, we should be able to choose which
      transport to use by looking at the requested 'port'.

      - stream sockets already support a shared port space between transports
        (one port can be assigned to only one transport)

      [Phase 2]
      - datagram sockets will support it, but for now the VMCI transport is the
        default transport for any host side datagram socket (KVM and Hyper-V
        do not yet support datagram sockets)
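
Rules 2.a and 2.b can be modelled with a similar sketch. Again this is not
af_vsock code: the lsn_* names are hypothetical, the constants mirror
VMADDR_CID_HOST (2) and VMADDR_CID_ANY (-1U), and local_cid stands for the
guest CID that IOCTL_VM_SOCKETS_GET_LOCAL_CID would return. The shared port
space of 2.c, where the incoming request picks the transport, is not modelled
here.

```c
#include <assert.h>
#include <stddef.h>

/* Values mirror <linux/vm_sockets.h>; redefined so the sketch builds anywhere. */
#define LSN_CID_HOST 2u
#define LSN_CID_ANY  0xFFFFFFFFu

enum lsn_transport { LSN_NONE, LSN_HOST_TO_GUEST, LSN_GUEST_TO_HOST };

/* Hypothetical record of which transport types are currently registered. */
struct lsn_loaded {
	int has_h2g;	/* a 'host->guest' transport is available */
	int has_g2h;	/* a 'guest->host' transport is available */
};

/* Pick the transport for a listening socket based on the CID it is bound to. */
static enum lsn_transport lsn_pick(const struct lsn_loaded *t,
				   unsigned int bound_cid,
				   unsigned int local_cid)
{
	assert(t != NULL);

	/* 2.a: bound to the host CID. */
	if (bound_cid == LSN_CID_HOST)
		return t->has_h2g ? LSN_HOST_TO_GUEST : LSN_NONE;

	/* 2.b prefers 'guest->host' for CID_ANY; 2.a is the fallback. */
	if (bound_cid == LSN_CID_ANY) {
		if (t->has_g2h)
			return LSN_GUEST_TO_HOST;
		return t->has_h2g ? LSN_HOST_TO_GUEST : LSN_NONE;
	}

	/* 2.b: bound to this guest's own CID. */
	if (bound_cid == local_cid && t->has_g2h)
		return LSN_GUEST_TO_HOST;

	return LSN_NONE;
}
```

The CID_ANY branch encodes the backward-compatible reading: a guest that has
a 'guest->host' transport keeps listening on it, and only a pure host falls
back to 'host->guest'.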

We will make the loading of af_vsock.ko independent of the transports, to
allow us to:
   - create an AF_VSOCK socket without any loaded transports;
   - listen on a socket (e.g. bound to VMADDR_CID_ANY) without any loaded
     transports.

Hopefully, we could move MODULE_ALIAS_NETPROTO(PF_VSOCK) from
vmci_transport.ko to af_vsock.ko.
[Jorgen will check if this will impact the existing VMware products]

Notes:
   - For Hyper-V sockets, the host can only be Windows. No changes should
     be required on the Windows host to support the changes in this proposal.

   - Communication between guests is not allowed on any transport, so we can
     drop packets sent from a guest to another guest (dest_cid >
     VMADDR_CID_HOST) if the 'host->guest' transport is not available.

   - The [Phase 2] tag is used to identify things that can be done at a later
     stage, but that should be taken into account during this design.

   - Namespace support will be developed in [Phase 2] or in a separate project.



Comments and suggestions are welcome.
I'll be on PTO for the next two weeks, so sorry in advance if I answer late.

If we agree on this proposal, when I get back, I'll start working on the code
to get a first PATCH RFC.

Cheers,
Stefano

Re: [RFC] vsock: proposal to support multiple transports at runtime

2019-06-03 Thread Stefano Garzarella

On Fri, May 31, 2019 at 09:24:49AM +, Jorgen Hansen wrote:
> On 30 May 2019, at 13:19, Stefano Garzarella  wrote:
> > 
> > On Tue, May 28, 2019 at 04:01:00PM +, Jorgen Hansen wrote:
> >>> On Thu, May 23, 2019 at 04:37:03PM +0100, Stefan Hajnoczi wrote:
> >>>> On Tue, May 14, 2019 at 10:15:43AM +0200, Stefano Garzarella wrote:
> > 
> >>>>> 
> >>>>> 
> >>>>> 2. listen() / recvfrom()
> >>>>> 
> >>>>>a. use the 'host side transport', if the socket is bound to
> >>>>>   VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no
> >>>>>   guest transport.
> >>>>>   We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order to
> >>>>>   address this case.
> >>>>>   If we want to support multiple 'host side transport' running at the
> >>>>>   same time, we should find a way to allow an application to bound a
> >>>>>   specific host transport (e.g. adding new VMADDR_CID_LISTEN_FROM_KVM,
> >>>>>   VMADDR_CID_LISTEN_FROM_VMWARE, VMADDR_CID_LISTEN_FROM_HYPERV)
> >>>> 
> >>>> Hmm...VMADDR_CID_LISTEN_FROM_KVM, VMADDR_CID_LISTEN_FROM_VMWARE,
> >>>> VMADDR_CID_LISTEN_FROM_HYPERV isn't very flexible.  What if my service
> >>>> should only be available to a subset of VMware VMs?
> >>> 
> >>> You're right, it is not very flexible.
> >> 
> >> When I was last looking at this, I was considering a proposal where
> >> the incoming traffic would determine which transport to use for
> >> CID_ANY in the case of multiple transports. For stream sockets, we
> >> already have a shared port space, so if we receive a connection
> >> request for < port N, CID_ANY>, that connection would use the
> >> transport of the incoming request. The transport could either be a
> >> host->guest transport or the guest->host transport. This is a bit
> >> harder to do for datagrams since the VSOCK port is decided by the
> >> transport itself today. For VMCI, a VMCI datagram handler is allocated
> >> for each datagram socket, and the ID of that handler is used as the
> >> port. So we would potentially have to register the same datagram port
> >> with all transports.
> > 
> > So, do you think we should implement a shared port space also for
> > datagram sockets?
> 
> Yes, having the two socket types work the same way seems cleaner to me. We 
> should at least cover it in the design.
> 

Okay, I'll add this point to a v2 of this proposal!

> > For now only the VMWare implementation supports the datagram sockets,
> > but in the future we could support it also on KVM and HyperV, so I think
> > we should consider it in this proposal.
> 
> So for now, it sounds like we could make the VMCI transport the default 
> transport for any host side datagram socket, then.
> 

Yes, makes sense.

> >> 
> >> The use of network namespaces would be complimentary to this, and
> >> could be used to partition VMs between hypervisors or at a finer
> >> granularity. This could also be used to isolate host applications from
> >> guest applications using the same ports with CID_ANY if necessary.
> >> 
> > 
> > Another point to the netns support, I'll put it in the proposal (or it
> > could go in parallel with the multi-transport support).
> > 
> 
> It should be fine to put in the proposal that we rely on namespaces to 
> provide this support, but pursue namespaces as a separate project.

Sure.

I'll send a v2 adding all the points discussed to be sure that we are
aligned. Then I'll start working on it if we agree on the proposal.

Thanks,
Stefano

Re: [RFC] vsock: proposal to support multiple transports at runtime

2019-05-31 Thread Jorgen Hansen

On 30 May 2019, at 13:19, Stefano Garzarella  wrote:
> 
> On Tue, May 28, 2019 at 04:01:00PM +, Jorgen Hansen wrote:
>>> On Thu, May 23, 2019 at 04:37:03PM +0100, Stefan Hajnoczi wrote:
>>>> On Tue, May 14, 2019 at 10:15:43AM +0200, Stefano Garzarella wrote:
> 
>>>>> 
>>>>> 
>>>>> 2. listen() / recvfrom()
>>>>> 
>>>>>a. use the 'host side transport', if the socket is bound to
>>>>>   VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no
>>>>>   guest transport.
>>>>>   We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order to
>>>>>   address this case.
>>>>>   If we want to support multiple 'host side transport' running at the
>>>>>   same time, we should find a way to allow an application to bound a
>>>>>   specific host transport (e.g. adding new VMADDR_CID_LISTEN_FROM_KVM,
>>>>>   VMADDR_CID_LISTEN_FROM_VMWARE, VMADDR_CID_LISTEN_FROM_HYPERV)
>>>> 
>>>> Hmm...VMADDR_CID_LISTEN_FROM_KVM, VMADDR_CID_LISTEN_FROM_VMWARE,
>>>> VMADDR_CID_LISTEN_FROM_HYPERV isn't very flexible.  What if my service
>>>> should only be available to a subset of VMware VMs?
>>> 
>>> You're right, it is not very flexible.
>> 
>> When I was last looking at this, I was considering a proposal where
>> the incoming traffic would determine which transport to use for
>> CID_ANY in the case of multiple transports. For stream sockets, we
>> already have a shared port space, so if we receive a connection
>> request for < port N, CID_ANY>, that connection would use the
>> transport of the incoming request. The transport could either be a
>> host->guest transport or the guest->host transport. This is a bit
>> harder to do for datagrams since the VSOCK port is decided by the
>> transport itself today. For VMCI, a VMCI datagram handler is allocated
>> for each datagram socket, and the ID of that handler is used as the
>> port. So we would potentially have to register the same datagram port
>> with all transports.
> 
> So, do you think we should implement a shared port space also for
> datagram sockets?

Yes, having the two socket types work the same way seems cleaner to me. We 
should at least cover it in the design.

> For now only the VMWare implementation supports the datagram sockets,
> but in the future we could support it also on KVM and HyperV, so I think
> we should consider it in this proposal.

So for now, it sounds like we could make the VMCI transport the default 
transport for any host side datagram socket, then.
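
A minimal sketch of the two ideas above, under stated assumptions: a single
stream port table shared by every transport, where a connection to a
<port N, CID_ANY> listener inherits the transport the request arrived on, and
VMCI as the default for host side datagram sockets. All names here
(transport_id, listener, accept_stream, default_dgram_transport) are
hypothetical and only illustrate the rule; they are not existing af_vsock
symbols.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transport identifiers, for illustration only. */
enum transport_id { T_NONE, T_VMCI, T_VIRTIO, T_HYPERV };

struct listener {
    uint32_t port;  /* bound to <port, CID_ANY> in the shared stream port space */
};

/* Shared port space: one table consulted regardless of the transport. */
static const struct listener *find_listener(const struct listener *tbl,
                                            size_t n, uint32_t port)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].port == port)
            return &tbl[i];
    return NULL;
}

/* Stream: the new connection inherits the transport the request arrived on. */
static enum transport_id accept_stream(const struct listener *tbl, size_t n,
                                       uint32_t port,
                                       enum transport_id arrived_on)
{
    return find_listener(tbl, n, port) ? arrived_on : T_NONE;
}

/* Datagram: only VMCI implements it today, so it is the host side default. */
static enum transport_id default_dgram_transport(void)
{
    return T_VMCI;
}
```

The listener itself carries no transport: the choice is made per connection,
which is what lets one CID_ANY socket serve several transports at once.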

>> 
>> The use of network namespaces would be complementary to this, and
>> could be used to partition VMs between hypervisors or at a finer
>> granularity. This could also be used to isolate host applications from
>> guest applications using the same ports with CID_ANY if necessary.
>> 
> 
> Another point to the netns support, I'll put it in the proposal (or it
> could go in parallel with the multi-transport support).
> 

It should be fine to put in the proposal that we rely on namespaces to provide 
this support, but pursue namespaces as a separate project.

Thanks,
Jorgen

Re: [RFC] vsock: proposal to support multiple transports at runtime

2019-05-30 Thread Stefano Garzarella

On Tue, May 28, 2019 at 04:01:00PM +, Jorgen Hansen wrote:
> > On Thu, May 23, 2019 at 04:37:03PM +0100, Stefan Hajnoczi wrote:
> > > On Tue, May 14, 2019 at 10:15:43AM +0200, Stefano Garzarella wrote:
> > > > Hi guys,
> > > > I'm currently interested on implement a multi-transport support for VSOCK in
> > > > order to handle nested VMs.
> 
> Thanks for picking this up!
> 

:)

> > > >
> > > > As Stefan suggested me, I started to look at this discussion:
> > > > https://lkml.org/lkml/2017/8/17/551
> > > > Below I tried to summarize a proposal for a discussion, following the ideas
> > > > from Dexuan, Jorgen, and Stefan.
> > > >
> > > >
> > > > We can define two types of transport that we have to handle at the same time
> > > > (e.g. in a nested VM we would have both types of transport running together):
> > > >
> > > > - 'host side transport', it runs in the host and it is used to communicate with
> > > >   the guests of a specific hypervisor (KVM, VMWare or HyperV)
> > > >
> > > >   Should we support multiple 'host side transport' running at the same time?
> > > >
> > > > - 'guest side transport'. it runs in the guest and it is used to communicate
> > > >   with the host transport
> > >
> > > I find this terminology confusing.  Perhaps "host->guest" (your 'host
> > > side transport') and "guest->host" (your 'guest side transport') is
> > > clearer?
> >
> > I agree, "host->guest" and "guest->host" are better, I'll use them.
> >
> > >
> > > Or maybe the nested virtualization terminology of L2 transport (your
> > > 'host side transport') and L0 transport (your 'guest side transport')?
> > > Here we are the L1 guest and L0 is the host and L2 is our nested guest.
> > >
> >
> > I'm confused, if L2 is the nested guest, it should be the
> > 'guest side transport'. Did I miss anything?
> >
> > Maybe it is another point to your first proposal :)
> >
> > > >
> > > >
> > > > The main goal is to find a way to decide what transport use in these cases:
> > > > 1. connect() / sendto()
> > > >
> > > > a. use the 'host side transport', if the destination is the guest
> > > >    (dest_cid > VMADDR_CID_HOST).
> > > >    If we want to support multiple 'host side transport' running at the
> > > >    same time, we should assign CIDs uniquely across all transports.
> > > >    In this way, a packet generated by the host side will get directed
> > > >    to the appropriate transport based on the CID
> > >
> > > The multiple host side transport case is unlikely to be necessary on x86
> > > where only one hypervisor uses VMX at any given time.  But eventually it
> > > may happen so it's wise to at least allow it in the design.
> > >
> >
> > Okay, I was in doubt, but I'll keep it in the design.
> >
> > > >
> > > > b. use the 'guest side transport', if the destination is the host
> > > >    (dest_cid == VMADDR_CID_HOST)
> > >
> > > Makes sense to me.
> > >
> 
> Agreed. With the addition that VMADDR_CID_HYPERVISOR is also routed as
> "guest->host/guest side transport".
> 

Yes, I had it in mind, but I forgot to write it in the proposal.

> > > >
> > > >
> > > > 2. listen() / recvfrom()
> > > >
> > > > a. use the 'host side transport', if the socket is bound to
> > > >    VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no
> > > >    guest transport.
> > > >    We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order to
> > > >    address this case.
> > > >    If we want t

Re: [RFC] vsock: proposal to support multiple transports at runtime

2019-05-28 Thread Jorgen Hansen

> On Thu, May 23, 2019 at 04:37:03PM +0100, Stefan Hajnoczi wrote:
> > On Tue, May 14, 2019 at 10:15:43AM +0200, Stefano Garzarella wrote:
> > > Hi guys,
> > > I'm currently interested on implement a multi-transport support for VSOCK in
> > > order to handle nested VMs.

Thanks for picking this up!

> > >
> > > As Stefan suggested me, I started to look at this discussion:
> > > https://lkml.org/lkml/2017/8/17/551
> > > Below I tried to summarize a proposal for a discussion, following the ideas
> > > from Dexuan, Jorgen, and Stefan.
> > >
> > >
> > > We can define two types of transport that we have to handle at the same time
> > > (e.g. in a nested VM we would have both types of transport running together):
> > >
> > > - 'host side transport', it runs in the host and it is used to communicate with
> > >   the guests of a specific hypervisor (KVM, VMWare or HyperV)
> > >
> > >   Should we support multiple 'host side transport' running at the same time?
> > >
> > > - 'guest side transport'. it runs in the guest and it is used to communicate
> > >   with the host transport
> >
> > I find this terminology confusing.  Perhaps "host->guest" (your 'host
> > side transport') and "guest->host" (your 'guest side transport') is
> > clearer?
>
> I agree, "host->guest" and "guest->host" are better, I'll use them.
>
> >
> > Or maybe the nested virtualization terminology of L2 transport (your
> > 'host side transport') and L0 transport (your 'guest side transport')?
> > Here we are the L1 guest and L0 is the host and L2 is our nested guest.
> >
>
> I'm confused, if L2 is the nested guest, it should be the
> 'guest side transport'. Did I miss anything?
>
> Maybe it is another point to your first proposal :)
>
> > >
> > >
> > > The main goal is to find a way to decide what transport use in these cases:
> > > 1. connect() / sendto()
> > >
> > > a. use the 'host side transport', if the destination is the guest
> > >    (dest_cid > VMADDR_CID_HOST).
> > >    If we want to support multiple 'host side transport' running at the
> > >    same time, we should assign CIDs uniquely across all transports.
> > >    In this way, a packet generated by the host side will get directed
> > >    to the appropriate transport based on the CID
> >
> > The multiple host side transport case is unlikely to be necessary on x86
> > where only one hypervisor uses VMX at any given time.  But eventually it
> > may happen so it's wise to at least allow it in the design.
> >
>
> Okay, I was in doubt, but I'll keep it in the design.
>
> > >
> > > b. use the 'guest side transport', if the destination is the host
> > >    (dest_cid == VMADDR_CID_HOST)
> >
> > Makes sense to me.
> >

Agreed. With the addition that VMADDR_CID_HYPERVISOR is also routed as 
"guest->host/guest side transport".

> > >
> > >
> > > 2. listen() / recvfrom()
> > >
> > > a. use the 'host side transport', if the socket is bound to
> > >    VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no
> > >    guest transport.
> > >    We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order to
> > >    address this case.
> > >    If we want to support multiple 'host side transport' running at the
> > >    same time, we should find a way to allow an application to bound a
> > >    specific host transport (e.g. adding new VMADDR_CID_LISTEN_FROM_KVM,
> > >    VMADDR_CID_LISTEN_FROM_VMWARE, VMADDR_CID_LISTEN_FROM_HYPERV)
> >
> > Hmm...VMADDR_CID_LISTEN_FROM_KVM, VMADDR_CID_LISTEN_FROM_VMWARE,
> > VMADDR_CID_LISTEN_FROM_HYPERV isn't very flexible.  What if my service
> > should only be available to a subset of VMware VMs?
>
> You're right, it is not very flexible.

When I was

Re: [RFC] vsock: proposal to support multiple transports at runtime

2019-05-27 Thread Stefano Garzarella

On Thu, May 23, 2019 at 04:37:03PM +0100, Stefan Hajnoczi wrote:
> On Tue, May 14, 2019 at 10:15:43AM +0200, Stefano Garzarella wrote:
> > Hi guys,
> > I'm currently interested on implement a multi-transport support for VSOCK in
> > order to handle nested VMs.
> > 
> > As Stefan suggested me, I started to look at this discussion:
> > https://lkml.org/lkml/2017/8/17/551
> > Below I tried to summarize a proposal for a discussion, following the ideas
> > from Dexuan, Jorgen, and Stefan.
> > 
> > 
> > We can define two types of transport that we have to handle at the same time
> > (e.g. in a nested VM we would have both types of transport running together):
> >
> > - 'host side transport', it runs in the host and it is used to communicate with
> >   the guests of a specific hypervisor (KVM, VMWare or HyperV)
> > 
> >   Should we support multiple 'host side transport' running at the same time?
> > 
> > - 'guest side transport'. it runs in the guest and it is used to communicate
> >   with the host transport
> 
> I find this terminology confusing.  Perhaps "host->guest" (your 'host
> side transport') and "guest->host" (your 'guest side transport') is
> clearer?

I agree, "host->guest" and "guest->host" are better, I'll use them.

> 
> Or maybe the nested virtualization terminology of L2 transport (your
> 'host side transport') and L0 transport (your 'guest side transport')?
> Here we are the L1 guest and L0 is the host and L2 is our nested guest.
>

I'm confused, if L2 is the nested guest, it should be the
'guest side transport'. Did I miss anything?

Maybe that is another point in favor of your first proposal :)

> > 
> > 
> > The main goal is to find a way to decide what transport use in these cases:
> > 1. connect() / sendto()
> > 
> > a. use the 'host side transport', if the destination is the guest
> >    (dest_cid > VMADDR_CID_HOST).
> >    If we want to support multiple 'host side transport' running at the
> >    same time, we should assign CIDs uniquely across all transports.
> >    In this way, a packet generated by the host side will get directed
> >    to the appropriate transport based on the CID
> 
> The multiple host side transport case is unlikely to be necessary on x86
> where only one hypervisor uses VMX at any given time.  But eventually it
> may happen so it's wise to at least allow it in the design.
> 

Okay, I was in doubt, but I'll keep it in the design.

> > 
> > b. use the 'guest side transport', if the destination is the host
> >    (dest_cid == VMADDR_CID_HOST)
> 
> Makes sense to me.
> 
> > 
> > 
> > 2. listen() / recvfrom()
> > 
> > a. use the 'host side transport', if the socket is bound to
> >    VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no
> >    guest transport.
> >    We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order to
> >    address this case.
> >    If we want to support multiple 'host side transport' running at the
> >    same time, we should find a way to allow an application to bound a
> >    specific host transport (e.g. adding new VMADDR_CID_LISTEN_FROM_KVM,
> >    VMADDR_CID_LISTEN_FROM_VMWARE, VMADDR_CID_LISTEN_FROM_HYPERV)
> 
> Hmm...VMADDR_CID_LISTEN_FROM_KVM, VMADDR_CID_LISTEN_FROM_VMWARE,
> VMADDR_CID_LISTEN_FROM_HYPERV isn't very flexible.  What if my service
> should only be available to a subset of VMware VMs?

You're right, it is not very flexible.

> 
> Instead it might be more appropriate to use network namespaces to create
> independent AF_VSOCK addressing domains.  Then you could have two
> separate groups of VMware VMs and selectively listen to just one group.
> 

Does AF_VSOCK support network namespaces, or would that be another
improvement to take care of? (IIUC it is not currently supported.)

A possible issue that I see with netns: if they are already used for
another purpose (e.g. to isolate the network of a VM), we would need
multiple instances of the application, one per netns.

> > 
> > b. use the 'guest side transport', if the socket is bound to local CID
> >    different from the VMADDR_CID_HOST (guest CID get with
> >    IOCTL_VM_SOCKETS_GET_LOCAL_CID), or it is bound to VMADDR_CID_ANY
> >    (to be backward compatible).
> >    Also in this case, we could define a new VMADDR_CID_LISTEN_FROM_HOST.
>

Re: [RFC] vsock: proposal to support multiple transports at runtime

2019-05-23 Thread Stefan Hajnoczi

On Tue, May 14, 2019 at 10:15:43AM +0200, Stefano Garzarella wrote:
> Hi guys,
> I'm currently interested on implement a multi-transport support for VSOCK in
> order to handle nested VMs.
> 
> As Stefan suggested me, I started to look at this discussion:
> https://lkml.org/lkml/2017/8/17/551
> Below I tried to summarize a proposal for a discussion, following the ideas
> from Dexuan, Jorgen, and Stefan.
> 
> 
> We can define two types of transport that we have to handle at the same time
> (e.g. in a nested VM we would have both types of transport running together):
> 
> - 'host side transport', it runs in the host and it is used to communicate with
>   the guests of a specific hypervisor (KVM, VMWare or HyperV)
> 
>   Should we support multiple 'host side transport' running at the same time?
> 
> - 'guest side transport'. it runs in the guest and it is used to communicate
>   with the host transport

I find this terminology confusing.  Perhaps "host->guest" (your 'host
side transport') and "guest->host" (your 'guest side transport') is
clearer?

Or maybe the nested virtualization terminology of L2 transport (your
'host side transport') and L0 transport (your 'guest side transport')?
Here we are the L1 guest and L0 is the host and L2 is our nested guest.

> 
> 
> The main goal is to find a way to decide what transport use in these cases:
> 1. connect() / sendto()
> 
>   a. use the 'host side transport', if the destination is the guest
>      (dest_cid > VMADDR_CID_HOST).
>      If we want to support multiple 'host side transport' running at the
>      same time, we should assign CIDs uniquely across all transports.
>      In this way, a packet generated by the host side will get directed
>      to the appropriate transport based on the CID

The multiple host side transport case is unlikely to be necessary on x86
where only one hypervisor uses VMX at any given time.  But eventually it
may happen so it's wise to at least allow it in the design.
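
As a rough illustration of "assign CIDs uniquely across all transports", here
is a minimal sketch of a registry in which each host->guest transport claims
the CIDs of its guests, and the destination CID alone selects the transport.
The names (cid_registry, cid_claim, cid_route) and the fixed-size table are
assumptions made for this example, not part of the proposal.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_GUESTS 8

/* 'transport' is an opaque id chosen by the caller (hypothetical). */
struct cid_entry { uint32_t cid; int transport; };

struct cid_registry { struct cid_entry e[MAX_GUESTS]; size_t n; };

/* Returns 0 on success, -1 if the CID is already taken or the table is full. */
static int cid_claim(struct cid_registry *r, uint32_t cid, int transport)
{
    for (size_t i = 0; i < r->n; i++)
        if (r->e[i].cid == cid)
            return -1;
    if (r->n == MAX_GUESTS)
        return -1;
    r->e[r->n++] = (struct cid_entry){ cid, transport };
    return 0;
}

/* Returns the owning transport id, or -1 if no guest has that CID. */
static int cid_route(const struct cid_registry *r, uint32_t dest_cid)
{
    for (size_t i = 0; i < r->n; i++)
        if (r->e[i].cid == dest_cid)
            return r->e[i].transport;
    return -1;
}
```

Rejecting a duplicate claim is what keeps routing by CID unambiguous when more
than one host side transport is loaded.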

> 
>   b. use the 'guest side transport', if the destination is the host
>      (dest_cid == VMADDR_CID_HOST)

Makes sense to me.

> 
> 
> 2. listen() / recvfrom()
> 
>   a. use the 'host side transport', if the socket is bound to
>      VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no
>      guest transport.
>      We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order to
>      address this case.
>      If we want to support multiple 'host side transport' running at the
>      same time, we should find a way to allow an application to bound a
>      specific host transport (e.g. adding new VMADDR_CID_LISTEN_FROM_KVM,
>      VMADDR_CID_LISTEN_FROM_VMWARE, VMADDR_CID_LISTEN_FROM_HYPERV)

Hmm...VMADDR_CID_LISTEN_FROM_KVM, VMADDR_CID_LISTEN_FROM_VMWARE,
VMADDR_CID_LISTEN_FROM_HYPERV isn't very flexible.  What if my service
should only be available to a subset of VMware VMs?

Instead it might be more appropriate to use network namespaces to create
independent AF_VSOCK addressing domains.  Then you could have two
separate groups of VMware VMs and selectively listen to just one group.
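
A minimal sketch of what "independent AF_VSOCK addressing domains" could mean
for a listener, assuming every VM and every listening service carries a
namespace tag. The structures and the service_reachable() helper are
hypothetical; this models the idea only and says nothing about how netns
support would actually be implemented.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: 'netns' is an opaque namespace id. */
struct vm      { uint32_t cid;  uint32_t netns; };
struct service { uint32_t port; uint32_t netns; };

/* A service is visible only to VMs that live in the same namespace. */
static bool service_reachable(const struct service *svc, const struct vm *vm,
                              uint32_t port)
{
    return svc->netns == vm->netns && svc->port == port;
}
```

With this model two groups of VMs can even reuse the same CIDs, since a CID is
only meaningful inside its own namespace.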

> 
>   b. use the 'guest side transport', if the socket is bound to local CID
>      different from the VMADDR_CID_HOST (guest CID get with
>      IOCTL_VM_SOCKETS_GET_LOCAL_CID), or it is bound to VMADDR_CID_ANY
>      (to be backward compatible).
>      Also in this case, we could define a new VMADDR_CID_LISTEN_FROM_HOST.

Two additional topics:

1. How will loading af_vsock.ko change?  In particular, can an
   application create a socket in af_vsock.ko without any loaded
   transport?  Can it enter listen state without any loaded transport
   (this seems useful with VMADDR_CID_ANY)?

2. Does your proposed behavior match VMware's existing nested vsock
   semantics?

Re: [RFC] vsock: proposal to support multiple transports at runtime

2019-05-20 Thread Stefano Garzarella

Hi Dexuan,

On Thu, May 16, 2019 at 09:48:11PM +, Dexuan Cui wrote:
> > From: Stefano Garzarella 
> > Sent: Tuesday, May 14, 2019 1:16 AM
> > To: netdev@vger.kernel.org; Stefan Hajnoczi ; Dexuan
> > 
> > Hi guys,
> > I'm currently interested on implement a multi-transport support for VSOCK in
> > order to handle nested VMs.
> 
> Hi Stefano,
> Thanks for reviving the discussion! :-)
> 

You're welcome :)

> I don't know a lot about the details of kvm/vmware sockets, but please let me
> share my understanding about them, and let me also share some details about
> hyper-v sockets, which I think should be the simplest:
> 
> 1) For hyper-v sockets, the "host" can only be Windows. We can do nothing on the
> Windows host, and I guess we need to do nothing there.

I agree that for the Windows host we shouldn't change anything.

> 
> 2) For hyper-v sockets, I think we only care about Linux guest, and the guest can
> only talk to the host; a guest can not talk to another guest running on the
> same host.

Also for KVM (virtio) a guest can talk only with the host.

> 
> 3) On a hyper-v host, if the guest is running kvm/vmware (i.e. nested virtualization),
> I think in the "KVM guest" the Linux hyper-v transport driver needs to load so that
> the guest can talk to the host (I'm not sure about "vmware guest" in this case);
> the "KVM guest" also needs to load the kvm transport drivers so that it can talk
> to its child VMs (I'm not sure about "vmware guest" in this case).

Okay, so since in the "KVM guest" we will have both hyper-v and kvm
transports, we should implement a way to decide which transport to use in
the cases that I described in the first email.

> 
> 4) On kvm/vmware, if the guest is a Windows guest, I think we can do nothing in
> the guest;

Yes, the driver in Windows guest shouldn't change.

> if the guest is Linux guest, I think the kvm/vmware transport drivers
> should load; if the Linux guest is running kvm/vmware (nested virtualization), I
> think the proper "to child VMs" versions of the kvm/vmware transport drivers
> need to load.

Exactly, and on the KVM side that is the vhost-vsock driver. So, as in
point 3, we should support at least two transports running in Linux at
the same time.

Thank you very much for sharing this information!

Cheers,
Stefano

RE: [RFC] vsock: proposal to support multiple transports at runtime

2019-05-16 Thread Dexuan Cui

> From: Stefano Garzarella 
> Sent: Tuesday, May 14, 2019 1:16 AM
> To: netdev@vger.kernel.org; Stefan Hajnoczi ; Dexuan
> 
> Hi guys,
> I'm currently interested on implement a multi-transport support for VSOCK in
> order to handle nested VMs.

Hi Stefano,
Thanks for reviving the discussion! :-)

I don't know a lot about the details of kvm/vmware sockets, but please let me
share my understanding about them, and let me also share some details about
hyper-v sockets, which I think should be the simplest:

1) For hyper-v sockets, the "host" can only be Windows. We can do nothing on the
Windows host, and I guess we need to do nothing there.

2) For hyper-v sockets, I think we only care about Linux guest, and the guest can
only talk to the host; a guest can not talk to another guest running on the
same host.

3) On a hyper-v host, if the guest is running kvm/vmware (i.e. nested virtualization),
I think in the "KVM guest" the Linux hyper-v transport driver needs to load so that
the guest can talk to the host (I'm not sure about "vmware guest" in this case);
the "KVM guest" also needs to load the kvm transport drivers so that it can talk
to its child VMs (I'm not sure about "vmware guest" in this case).

4) On kvm/vmware, if the guest is a Windows guest, I think we can do nothing in
the guest; if the guest is Linux guest, I think the kvm/vmware transport drivers
should load; if the Linux guest is running kvm/vmware (nested virtualization), I
think the proper "to child VMs" versions of the kvm/vmware transport drivers
need to load. 

Thanks,
-- Dexuan

[RFC] vsock: proposal to support multiple transports at runtime

2019-05-14 Thread Stefano Garzarella

Hi guys,
I'm currently interested in implementing multi-transport support for VSOCK in
order to handle nested VMs.

As Stefan suggested, I started by looking at this discussion:
https://lkml.org/lkml/2017/8/17/551
Below I tried to summarize a proposal for a discussion, following the ideas
from Dexuan, Jorgen, and Stefan.


We can define two types of transport that we have to handle at the same time
(e.g. in a nested VM we would have both types of transport running together):

- 'host side transport': it runs in the host and is used to communicate with
  the guests of a specific hypervisor (KVM, VMWare or HyperV)

  Should we support multiple 'host side transports' running at the same time?

- 'guest side transport': it runs in the guest and is used to communicate
  with the host transport


The main goal is to find a way to decide which transport to use in these cases:
1. connect() / sendto()

a. use the 'host side transport', if the destination is the guest
   (dest_cid > VMADDR_CID_HOST).
   If we want to support multiple 'host side transport' running at the
   same time, we should assign CIDs uniquely across all transports.
   In this way, a packet generated by the host side will get directed
   to the appropriate transport based on the CID

b. use the 'guest side transport', if the destination is the host
   (dest_cid == VMADDR_CID_HOST)


2. listen() / recvfrom()

a. use the 'host side transport', if the socket is bound to
   VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no
   guest transport.
   We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order to
   address this case.
   If we want to support multiple 'host side transports' running at the
   same time, we should find a way to allow an application to bind to a
   specific host transport (e.g. adding new VMADDR_CID_LISTEN_FROM_KVM,
   VMADDR_CID_LISTEN_FROM_VMWARE, VMADDR_CID_LISTEN_FROM_HYPERV)

b. use the 'guest side transport', if the socket is bound to a local CID
   different from VMADDR_CID_HOST (the guest CID obtained with
   IOCTL_VM_SOCKETS_GET_LOCAL_CID), or it is bound to VMADDR_CID_ANY
   (to be backward compatible).
   Also in this case, we could define a new VMADDR_CID_LISTEN_FROM_HOST.
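
The rules in points 1 and 2 can be sketched as two small decision functions.
This is only an illustration under stated assumptions: the enum and function
names are hypothetical, the two CID constants are redefined locally with the
values used by <linux/vm_sockets.h> so that the sketch is standalone, and
cases the rules do not mention (for example a destination CID below
VMADDR_CID_HOST) are reported as "none" rather than guessed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same values as <linux/vm_sockets.h>; redefined to keep this standalone. */
#define VMADDR_CID_ANY  0xFFFFFFFFu
#define VMADDR_CID_HOST 2u

/* Hypothetical names for the two transport roles in the proposal. */
enum vsock_side {
    VSOCK_SIDE_NONE,  /* no rule in the proposal covers this case */
    VSOCK_SIDE_H2G,   /* 'host side transport'  (host->guest) */
    VSOCK_SIDE_G2H,   /* 'guest side transport' (guest->host) */
};

/* Rule 1: connect() / sendto(), keyed on the destination CID. */
static enum vsock_side pick_for_connect(uint32_t dest_cid)
{
    if (dest_cid == VMADDR_CID_ANY)
        return VSOCK_SIDE_NONE;
    if (dest_cid == VMADDR_CID_HOST)
        return VSOCK_SIDE_G2H;              /* 1b */
    if (dest_cid > VMADDR_CID_HOST)
        return VSOCK_SIDE_H2G;              /* 1a */
    return VSOCK_SIDE_NONE;                 /* below HOST: not covered here */
}

/* Rule 2: listen() / recvfrom(), keyed on the CID the socket is bound to. */
static enum vsock_side pick_for_listen(uint32_t bound_cid, bool have_g2h)
{
    if (bound_cid == VMADDR_CID_HOST)
        return VSOCK_SIDE_H2G;              /* 2a */
    if (bound_cid == VMADDR_CID_ANY)        /* 2b if a guest transport exists */
        return have_g2h ? VSOCK_SIDE_G2H : VSOCK_SIDE_H2G;
    return VSOCK_SIDE_G2H;                  /* 2b: bound to the local CID */
}
```

The VMADDR_CID_ANY branch is where backward compatibility is decided: an
existing guest application that binds to VMADDR_CID_ANY keeps talking to its
host as long as a guest side transport is present.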

Thanks in advance for your comments and suggestions.

Cheers,
Stefano

Re: kernel tls interface with user space modification proposal

2019-03-21 Thread Boris Pismenny

Hi Vakul,

+TLS maintainers

I suggest you send this to TLS maintainers if you want to get more 
feedback, and it would be best to tag this as RFC.

On 3/5/2019 9:56 AM, Vakul Garg wrote:
> Hi
> 
> The present interface of kernel tls with user space has few shortcomings.
> 
> The biggest one is that when we need to add a ciphersuite in kernel tls, then 
> we need to define new structures for passing cryptographic parameters 
> required by record layer.
> And the user space ssl stack also has to be modified because it tries to use 
> kernel tls only for a given set of ciphers implemented in it.
> 

As all TLS versions below 1.2 are being deprecated, and with TLS1.3 
supporting only 5 ciphersuites based on AES-GCM, AES-CCM and 
Chacha-Poly, I think that it is safe to go forward based on the existing 
model for these ciphers, while not supporting any other (older) ciphers.

If we were to try and support all the available ciphers, then it might 
make sense to have a generic infrastructure for this.

> A better schema could be that if kernel tls support is compiled/enabled in 
> user space SSL stack, it tries to use it for all record layer ciphers.
> If kernel tls does not support a given cipher, then setsockopt fails and SSL 
> stack can fallback to non-ktls mode for the session and subsequent ones using 
> same cipher type.
> This would require passing the crypto material in a generic form which is
> the same for all cipher types.
> 
> My proposal is that at the setsockopt interface, instead of passing discrete 
> keys/salt/IV etc of certain lengths (which are different for each cipher), we 
> pass the cipher type and the full keyblock (128 bytes).
> Thereafter, kernel tls chops the keyblock into keys/iv/salt which are defined 
> by the given cipher type.
> 
> (The keyblock is derived by SSL stack from master secret and then segmented 
> in to keys/IV/salt).
> 

Does this work for TLS1.3?

> This would keep the interface between ktls and user space software 
> independent of cipher types supported by kernel tls.
> 
> Further, it is redundant to pass same TLS version, cipher type info in both 
> Rx and Tx direction.
>
> I propose that we define an additional setsockopt interface for passing 
> crypto params in both directions.

Additional interfaces double the maintenance effort, and I'm not sure it 
is interesting to support any of the ciphers besides the ones used by 
TLS1.3.

> This setsockopt() would be invoked by SSL stack after handshake is deemed 
> completed to start record protocol offload in both directions.
> 
> struct tls_rec_prot_info {
>         unsigned short version;
>         unsigned short cipher_type;
>         unsigned char keyblock[128];
>         unsigned char tx_seq[8];
>         unsigned char rx_seq[8];
> };
> 
> setsockopt(sock, SOL_TLS, TLS_INFO, &rec_prot_info, sizeof(rec_prot_info));
> 
> Kindly advise.
> 
> Regards
> 
> Vakul
>
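
To make the "kernel tls chops the keyblock" step concrete, here is a minimal
sketch for one cipher, assuming the TLS 1.2 key_block order for AEAD suites
(client_write_key, server_write_key, client_write_IV, server_write_IV, with no
MAC keys) and AES-128-GCM sizes (16-byte key, 4-byte salt). The cipher id,
struct names and chop_keyblock() are hypothetical and are not kernel symbols.
The layout is a TLS 1.2 notion; TLS 1.3 derives its traffic keys differently,
so it would need separate handling.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KEYBLOCK_LEN 128

/* Hypothetical cipher id for this sketch; not a kernel constant. */
enum { SKETCH_CIPHER_AES_GCM_128 = 1 };

struct cipher_layout { size_t key_len; size_t salt_len; };

struct chopped {
    uint8_t client_key[32], server_key[32];
    uint8_t client_salt[12], server_salt[12];
    size_t key_len, salt_len;
};

/* Per-cipher sizes; an unknown cipher is rejected so the caller can fall back. */
static int lookup_layout(int cipher_type, struct cipher_layout *out)
{
    switch (cipher_type) {
    case SKETCH_CIPHER_AES_GCM_128:
        out->key_len = 16;   /* AES-128 key */
        out->salt_len = 4;   /* implicit part of the GCM nonce */
        return 0;
    default:
        return -1;
    }
}

/* Split the keyblock: client key, server key, client salt, server salt. */
static int chop_keyblock(int cipher_type, const uint8_t kb[KEYBLOCK_LEN],
                         struct chopped *out)
{
    struct cipher_layout l;
    size_t off = 0;

    if (lookup_layout(cipher_type, &l) < 0)
        return -1;
    if (l.key_len > sizeof(out->client_key) ||
        l.salt_len > sizeof(out->client_salt) ||
        2 * (l.key_len + l.salt_len) > KEYBLOCK_LEN)
        return -1;

    memcpy(out->client_key, kb + off, l.key_len);   off += l.key_len;
    memcpy(out->server_key, kb + off, l.key_len);   off += l.key_len;
    memcpy(out->client_salt, kb + off, l.salt_len); off += l.salt_len;
    memcpy(out->server_salt, kb + off, l.salt_len);
    out->key_len = l.key_len;
    out->salt_len = l.salt_len;
    return 0;
}
```

Which half is used for Tx and which for Rx depends on whether the socket is
the TLS client or server, so a real interface would also have to convey the
role. The -1 return for an unknown cipher corresponds to the setsockopt
failure after which the SSL stack falls back to non-ktls mode.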

[RFC][Proposal] BPF Control MAP

2019-03-08 Thread Saeed Mahameed

In this proposal I am going to address the lack of a unified user API
for accessing and manipulating BPF system attributes. While this
proposal is generic and will work on any BPF subsystem (eBPF attach
points), I will mostly focus on XDP use cases.

So lately I started working on three different XDP open issues, namely
XDP statistics, XDP redirect and XDP meta-data. While the details of
these issues are not really relevant for the sake of this proposal, all
of them share one common problem: the lack of a unified user interface to
manipulate and access their attributes.

Examples:
1. Query XDP statistics.
2. XDP resource management, Setup XDP-redirect TX resources.
3. Setup and query XDP-metadata - (BTF data structure).

Jesper Brouer explains some of these issues in detail at:
https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org

Yes, I considered netlink, devlink, ethtool, sysctl, etc., but each
one of them has its own drawback: they are networking specific and
will not serve the general BPF purpose.

What we want is, all of the BPF related knobs to be present in BPF user
tools: bcc, bpftool and libbpf. Ideally we don't want these tools to
integrate with all different subsystem's UAPIs, especially the wide
variety of the networking UAPIs, and imagine what other subsystems are
going to be using ..

So what seems to be the right path here is a unified BPF
control/configuration user interface, which will hook the caller with
the targeted subsystem.

To be aligned with all existing BPF tools I am going to propose the use
of the BPF syscall (no, not a new BPF syscall command, I am not planning to
reinvent the wheel - "again" -).
What I am going to suggest is to use an already existing API which runs
on top of the BPF syscall, the BPF MAPs API, with just a small tweak. Enter:



BPF control MAP:

A special type of MAP, "BPF_MAP_TYPE_CONTROL". This map will not behave
like other maps in the sense of having a user defined data structure
behind it; we are going to use it just to hook the user with the
targeted underlying subsystem and delegate user commands to it through
map operations (create/update_elem/lookup_elem/etc ...)



Requirements and implementation details:

1) Hook the user with the targeted subsystem:
- On create map, user selects the BPF_MAP_TYPE_CONTROL map type and
sets map_attr.ctrl_type to be the subsystem he wants to access and
manipulate (KPROBE/CGROUP/SOCKET_FILTER/XDP/etc..).

2) Set and Get operations of a specific BPF subsystem or an object in
that subsystem (for example a netdev in XDP).
- user will use the file descriptor retrieved on map creation to access
(Set/Get) the BPF subsystem attributes via map update_elem and
lookup_elem operations, the key will be the object id (example:
ifindex, or just the type of configuration to access) keys and values
are subsystem dependent.

3) Iterate through the different attributes/objects of the subsystem.
Use case: XDP BPF subsystem, get ALL netdevs XDP attributes/statistics.
This can be easily achieved with bpf_map_get_next_key.
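
The three requirements above can be modelled in plain user-space C. This
is only a sketch of the delegation idea, not kernel code: `ctrl_ops`,
`ctrl_map_create`, `CTRL_XDP_STATS` and the fake per-netdev table are all
hypothetical names invented for illustration. Only the
create/lookup/update/get_next_key semantics are taken from the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subsystem selector, standing in for map_attr.ctrl_type. */
enum ctrl_type { CTRL_XDP_STATS, CTRL_TYPE_MAX };

/* Per-subsystem callbacks that the control map delegates to. */
struct ctrl_ops {
	int (*lookup)(uint32_t key, uint64_t *value);
	int (*update)(uint32_t key, uint64_t value);
	/* prev == NULL asks for the first key. */
	int (*get_next_key)(const uint32_t *prev, uint32_t *next);
};

/* Fake XDP subsystem: three netdevs keyed by ifindex (made-up data). */
static uint32_t xdp_ifindex[] = { 2, 5, 7 };
static uint64_t xdp_rx_packets[] = { 100, 200, 300 };
#define XDP_NDEVS (sizeof(xdp_ifindex) / sizeof(xdp_ifindex[0]))

static int xdp_find(uint32_t ifindex)
{
	for (size_t i = 0; i < XDP_NDEVS; i++)
		if (xdp_ifindex[i] == ifindex)
			return (int)i;
	return -1;
}

static int xdp_lookup(uint32_t key, uint64_t *value)
{
	int i = xdp_find(key);

	if (i < 0)
		return -1;
	*value = xdp_rx_packets[i];
	return 0;
}

static int xdp_update(uint32_t key, uint64_t value)
{
	int i = xdp_find(key);

	if (i < 0)
		return -1;
	xdp_rx_packets[i] = value;
	return 0;
}

static int xdp_get_next_key(const uint32_t *prev, uint32_t *next)
{
	/* An unknown prev key restarts the walk from the first object. */
	int i = prev ? xdp_find(*prev) + 1 : 0;

	if ((size_t)i >= XDP_NDEVS)
		return -1;
	*next = xdp_ifindex[i];
	return 0;
}

static const struct ctrl_ops ctrl_table[CTRL_TYPE_MAX] = {
	[CTRL_XDP_STATS] = { xdp_lookup, xdp_update, xdp_get_next_key },
};

/* Requirement 1: "map creation" only binds the handle to a subsystem. */
static const struct ctrl_ops *ctrl_map_create(enum ctrl_type type)
{
	return (unsigned int)type < CTRL_TYPE_MAX ? &ctrl_table[type] : NULL;
}

/* Requirement 3: walk every object and sum one attribute. */
static uint64_t ctrl_sum_all(const struct ctrl_ops *map)
{
	uint32_t key = 0, next;
	uint64_t value, total = 0;
	const uint32_t *prev = NULL;

	assert(map != NULL);
	while (map->get_next_key(prev, &next) == 0) {
		if (map->lookup(next, &value) == 0)
			total += value;
		key = next;
		prev = &key;
	}
	return total;
}
```

The point of the ops table is that the generic map code never needs to
know the key or value layout; that knowledge stays in the subsystem.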



Advantages & Motivation:
Why BPF MAP and not just a plain new BPF syscall command or any other
existing UAPI:

0) All BPF users love maps and got used to them, and simply, everything
is a map, system objects can be keys and their attributes can be
values.

1) **BTF** integration, any map (key, value) pair can be described in
BTF in kernel level and can be attached to the map the user creates,
this will be a huge advantage for user forward compatibility, and for
development convenience to not copy kernel uapi headers on each
attribute set updates, and simplify ABI compatibility.
New values or attributes can be dumped/parsed in user space with zero
effort, no need to constantly update user space tools.

2) BPF maps already laid the groundwork for our requirements as the
infrastructure and has the semantics that we are looking for (set/get).

3) Already integrated in user-space tools and libraries such as
bcc/libbpf and friends, what is missing is just this small tweak (in
the kernel) to hook one special map type with the underlying BPF
subsystems.

Thoughts ?


[Some EXTRAs]
Example use cases (XDP only for now):

1) Query XDP stats of all XDP netdevs:

xdp_ctrl = bpf_create_map(BPF_MAP_TYPE_CONTROL,
                          map_attr.ctrl_type = XDP_STATS);

while (bpf_map_get_next_key(xdp_ctrl, &ifindex, &next_ifindex) == 0) {
        bpf_map_lookup_elem(xdp_ctrl, &next_ifindex, &stats);
        // we don't even need to know stats format in this case
        btf_pretty_print(xdp_ctrl->btf, &stats);
        ifindex = next_ifindex;
}

2) Setup XDP tx redirect resources on egress netdev (netdev with no XDP
program).

xdp_ctrl = bpf_create_map(BPF_MAP_TYPE_CONTROL,
                          map_attr.ctrl_type = XDP_ATTR);

xdp_attr.command = SETUP_REDIRECT;
xdp_attr.rings.num = 12;
xdp_attr.rings.size = 128;
bpf_map_update_elem(xdp_ctrl, &ifindex, &xdp_attr);

3) Turn On/Off

kernel tls interface with user space modification proposal

2019-03-05 Thread Vakul Garg

Hi

The present interface of kernel tls with user space has a few shortcomings.

The biggest one is that when we need to add a ciphersuite in kernel tls, we
need to define new structures for passing the cryptographic parameters
required by the record layer.
The user space ssl stack also has to be modified, because it tries to use
kernel tls only for a given set of ciphers implemented in it.

A better scheme could be that if kernel tls support is compiled/enabled in
the user space SSL stack, it tries to use it for all record layer ciphers.
If kernel tls does not support a given cipher, then setsockopt fails and the
SSL stack can fall back to non-ktls mode for the session and subsequent ones
using the same cipher type.

This would require passing the crypto material in a generic form which is
the same for all cipher types.

My proposal is that at the setsockopt interface, instead of passing discrete
keys/salt/IV etc of certain lengths (which are different for each cipher), we
pass the cipher type and the full keyblock (128 bytes).
Thereafter, kernel tls chops the keyblock into the keys/iv/salt which are
defined by the given cipher type.

(The keyblock is derived by the SSL stack from the master secret and then
segmented into keys/IV/salt.)

This would keep the interface between ktls and user space software independent 
of cipher types supported by kernel tls.

Further, it is redundant to pass same TLS version, cipher type info in both Rx 
and Tx direction.

I propose that we define an additional setsockopt interface for passing crypto 
params in both directions.
This setsockopt() would be invoked by SSL stack after handshake is deemed 
completed to start record protocol offload in both directions.

struct tls_rec_prot_info {
  unsigned short version;
  unsigned short cipher_type;
  unsigned char keyblock[128];
  unsigned char tx_seq[8];
  unsigned char rx_seq[8];
};

setsockopt(sock, SOL_TLS, TLS_INFO, &rec_prot_info, sizeof(rec_prot_info));
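
To make the "kernel tls chops the keyblock" step concrete, here is a
minimal sketch for one cipher, TLS 1.2 AES-128-GCM. It assumes the RFC
5246 section 6.3 key block order (client write key, server write key,
client write IV, server write IV, with no MAC keys for an AEAD cipher),
16-byte keys and 4-byte implicit IVs (the "salt"). The names
`split_keyblock_gcm128`, `struct tls_dir_keys` and the `is_client` flag
are hypothetical and not part of any existing kernel interface.

```c
#include <assert.h>
#include <string.h>

#define KEYBLOCK_SIZE    128  /* size proposed in the mail above */
#define GCM128_KEY_SIZE  16
#define GCM128_SALT_SIZE 4

/* Key material for one direction (hypothetical helper type). */
struct tls_dir_keys {
	unsigned char key[GCM128_KEY_SIZE];
	unsigned char salt[GCM128_SALT_SIZE];
};

/*
 * Slice a TLS 1.2 key block into per-direction AES-128-GCM material.
 * A client transmits with the client write key and receives with the
 * server write key; a server does the opposite.
 */
static void split_keyblock_gcm128(const unsigned char keyblock[KEYBLOCK_SIZE],
				  int is_client,
				  struct tls_dir_keys *tx,
				  struct tls_dir_keys *rx)
{
	const unsigned char *client_key = keyblock;
	const unsigned char *server_key = client_key + GCM128_KEY_SIZE;
	const unsigned char *client_salt = server_key + GCM128_KEY_SIZE;
	const unsigned char *server_salt = client_salt + GCM128_SALT_SIZE;

	assert(tx != NULL && rx != NULL);

	memcpy(tx->key, is_client ? client_key : server_key, GCM128_KEY_SIZE);
	memcpy(rx->key, is_client ? server_key : client_key, GCM128_KEY_SIZE);
	memcpy(tx->salt, is_client ? client_salt : server_salt,
	       GCM128_SALT_SIZE);
	memcpy(rx->salt, is_client ? server_salt : client_salt,
	       GCM128_SALT_SIZE);
}
```

Only 40 of the 128 bytes are consumed here; other ciphers would use a
different slicing table, which is the part that would live in the kernel
under this proposal.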

Kindly advise.

Regards

Vakul

Re: Business Proposal

2018-11-01 Thread Edward Yuan



Dear Friend, 

  My name is Mr. Edward Yuan, a consultant/broker. I know you might be a bit 
apprehensive because you do not know me. Nevertheless, I have a proposal on 
behalf of a client, a lucrative business that might be of mutual benefit to you.

If interested in this proposition please kindly and urgently contact me for 
more details. 

Best Regards.
Mr. Edward Yuan.

Re: Security enhancement proposal for kernel TLS

2018-08-03 Thread Dave Watson

On 08/02/18 05:23 PM, Vakul Garg wrote:
> > I agree that Boris' patch does what you say it does - it sets keys
> > immediately after CCS instead of after FINISHED message.  I disagree
> > that the kernel tls implementation currently requires that specific
> > ordering, nor do I think that it should require that ordering.
>
> The current kernel implementation assumes record sequence number to start
> from '0'.
> If keys have to be set after FINISHED message, then record sequence number
> need to be communicated from user space TLS stack to kernel. IIRC, sequence
> number is not part of the interface through which key is transferred.

The setsockopt call struct takes the key, iv, salt, and seqno:

struct tls12_crypto_info_aes_gcm_128 {
	struct tls_crypto_info info;
	unsigned char iv[TLS_CIPHER_AES_GCM_128_IV_SIZE];
	unsigned char key[TLS_CIPHER_AES_GCM_128_KEY_SIZE];
	unsigned char salt[TLS_CIPHER_AES_GCM_128_SALT_SIZE];
	unsigned char rec_seq[TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE];
};
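
As a small illustration of how a non-zero starting sequence number could
be handed over through rec_seq: the sketch below assumes the field is 8
bytes and is interpreted in network (big-endian) byte order, like the
64-bit TLS record sequence number itself. `encode_rec_seq` and
`REC_SEQ_SIZE` are hypothetical names, not part of the kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REC_SEQ_SIZE 8	/* assumed size of the rec_seq field */

/* Store a 64-bit record sequence number as 8 big-endian bytes. */
static void encode_rec_seq(uint64_t seq, unsigned char out[REC_SEQ_SIZE])
{
	assert(out != NULL);
	for (int i = REC_SEQ_SIZE - 1; i >= 0; i--) {
		out[i] = (unsigned char)(seq & 0xff);
		seq >>= 8;
	}
}
```

A stack that sets the keys only after processing FINISHED would, under
this assumption, encode the count of records already handled under the
new keys (for example 1) instead of 0.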

RE: Security enhancement proposal for kernel TLS

2018-08-02 Thread Vakul Garg




> -----Original Message-----
> From: Dave Watson [mailto:davejwat...@fb.com]
> Sent: Thursday, August 2, 2018 2:17 AM
> To: Vakul Garg 
> Cc: netdev@vger.kernel.org; Peter Doliwa ; Boris
> Pismenny 
> Subject: Re: Security enhancement proposal for kernel TLS
> 
> On 07/31/18 10:45 AM, Vakul Garg wrote:
> > > > IIUC, with the upstream implementation of tls record layer in
> > > > kernel, the decryption of tls FINISHED message happens in kernel.
> > > > Therefore the keys are already being sent to kernel tls socket
> > > > before handshake is completed.
> > >
> > > This is incorrect.
> >
> > Let us first reach a common ground on this.
> >
> > The kernel TLS implementation can decrypt only after setting the keys
> > on the socket.
> > The TLS message 'finished' (which is encrypted) is received after
> > receiving 'CCS' message. After the user space TLS library receives CCS
> > message, it sets the keys on kernel TLS socket. Therefore, the next
> > message in the socket receive queue which is TLS finished gets decrypted
> > in kernel only.
> >
> > Please refer to following Boris's patch on openssl. The commit log says:
> > "We choose to set this option at the earliest - just after CCS is
> > complete".
>
> I agree that Boris' patch does what you say it does - it sets keys
> immediately after CCS instead of after FINISHED message.  I disagree that
> the kernel tls implementation currently requires that specific ordering,
> nor do I think that it should require that ordering.

The current kernel implementation assumes record sequence number to start
from '0'.
If keys have to be set after FINISHED message, then record sequence number
need to be communicated from user space TLS stack to kernel. IIRC, sequence
number is not part of the interface through which key is transferred.

Re: Security enhancement proposal for kernel TLS

2018-08-01 Thread Dave Watson

On 07/31/18 10:45 AM, Vakul Garg wrote:
> > > IIUC, with the upstream implementation of tls record layer in kernel,
> > > the decryption of tls FINISHED message happens in kernel. Therefore
> > > the keys are already being sent to kernel tls socket before handshake
> > > is completed.
> >
> > This is incorrect.
>
> Let us first reach a common ground on this.
>
> The kernel TLS implementation can decrypt only after setting the keys on
> the socket.
> The TLS message 'finished' (which is encrypted) is received after receiving
> 'CCS' message. After the user space TLS library receives CCS message, it
> sets the keys on kernel TLS socket. Therefore, the next message in the
> socket receive queue which is TLS finished gets decrypted in kernel only.
>
> Please refer to following Boris's patch on openssl. The commit log says:
> "We choose to set this option at the earliest - just after CCS is complete".

I agree that Boris' patch does what you say it does - it sets keys
immediately after CCS instead of after FINISHED message.  I disagree
that the kernel tls implementation currently requires that specific
ordering, nor do I think that it should require that ordering.

RE: Security enhancement proposal for kernel TLS

2018-07-31 Thread Vakul Garg




> -----Original Message-----
> From: Dave Watson [mailto:davejwat...@fb.com]
> Sent: Tuesday, July 31, 2018 2:46 AM
> To: Vakul Garg 
> Cc: netdev@vger.kernel.org; Peter Doliwa ; Boris
> Pismenny 
> Subject: Re: Security enhancement proposal for kernel TLS
> 
> On 07/30/18 06:31 AM, Vakul Garg wrote:
> > > It's not entirely clear how your TLS handshake daemon works -   Why is
> > > it necessary to set the keys in the kernel tls socket before the
> > > handshake is completed?
> >
> > IIUC, with the upstream implementation of tls record layer in kernel,
> > the decryption of tls FINISHED message happens in kernel. Therefore
> > the keys are already being sent to kernel tls socket before handshake
> > is completed.
> 
> This is incorrect.  

Let us first reach a common ground on this.

The kernel TLS implementation can decrypt only after setting the keys on
the socket.
The TLS message 'finished' (which is encrypted) is received after receiving
'CCS' message. After the user space TLS library receives CCS message, it
sets the keys on kernel TLS socket. Therefore, the next message in the
socket receive queue which is TLS finished gets decrypted in kernel only.

Please refer to following Boris's patch on openssl. The commit log says:
"We choose to set this option at the earliest - just after CCS is complete".

--
commit a01dd062a32c687630b2a860b4bb053008f09ff5
Author: Boris Pismenny 
Date:   Sun Mar 11 16:18:27 2018 +0200

ssl: Linux TLS Rx Offload

This patch adds support for the Linux TLS Rx socket option.
It completes the previous patch for TLS Tx offload.
If the socket option is successful, then the receive data-path of the TCP
socket is implemented by the kernel.
We choose to set this option at the earliest - just after CCS is complete.
--

The  fact that keys are handed over to kernel TLS socket can also be verified
by putting a log in tls_sw_recvmsg().

I would stop here for you to confirm my observation first. 
Regards. Vakul


> Currently the kernel TLS implementation decrypts
> everything after you set the keys on the socket.  I'm suggesting that you
> don't set the keys on the socket until after the FINISHED message.
> 
> > > Or, why do you need to hand off the fd to the client program before
> > > the handshake is completed?
> >
> > The fd is always owned by the client program..
> >
> > In my proposal, the applications poll their own tcp socket using
> > read/recvmsg etc.
> > If they get handshake record, they forward it to the entity running
> > handshake agent.
> > The handshake agent could be a linux daemon or could run on a separate
> > security processor like 'Secure element' or say arm trustzone etc. The
> > applications forward any handshake message it gets backs from
> > handshake agent to the connected tcp socket. Therefore, the
> > applications act as a forwarder of the handshake messages between the
> > peer tls endpoint and handshake agent.
> > The received data messages are absorbed by the applications themselves
> > (bypassing ssl stack completely). Similarly, the applications tx data
> > directly by writing on their socket.
> >
> > > Waiting until after handshake solves both of these issues.
> >
> > The security sensitive check which is 'Wait for handshake to finish
> > completely before accepting data' should not be the onus of the
> > application. We have enough examples in past where application
> > programmers made mistakes in setting up tls correctly. The idea is to
> > isolate tls session setting up from the applications.
> 
> It's not clear to me what you gain by putting this 'handshake finished'
> notification in the kernel instead of in the client's tls library - you're
> already forwarding the handshake start notification to the daemon, why
> can't the daemon notify them back in userspace that the handshake is
> finished?
> 
> If you did want to put the notification in the kernel, how would you handle
> poll on the socket, since probably both the handshake daemon and client
> might be polling the socket, but one for control messages and one for data?
> 
> The original kernel TLS RFC did split these to two separate sockets, but we
> decided it was too complicated, and that's not how userspace TLS clients
> function today.
> 
> Do you have an implementation of this?  There are a bunch of tricky corner
> cases here, it might make more sense to have something concrete to discuss.
> 
> > Further, as per tls RFC it is ok to piggy

Re: Security enhancement proposal for kernel TLS

2018-07-30 Thread Dave Watson

On 07/30/18 06:31 AM, Vakul Garg wrote:
> > It's not entirely clear how your TLS handshake daemon works -   Why is
> > it necessary to set the keys in the kernel tls socket before the
> > handshake is completed?
> 
> IIUC, with the upstream implementation of tls record layer in kernel, the
> decryption of tls FINISHED message happens in kernel. Therefore the keys are
> already being sent to kernel tls socket before handshake is completed.

This is incorrect.  Currently the kernel TLS implementation decrypts
everything after you set the keys on the socket.  I'm suggesting that
you don't set the keys on the socket until after the FINISHED message.

> > Or, why do you need to hand off the fd to the client program
> > before the handshake is completed?
>   
> The fd is always owned by the client program..
> 
> In my proposal, the applications poll their own tcp socket using
> read/recvmsg etc.
> If they get handshake record, they forward it to the entity running
> handshake agent.
> The handshake agent could be a linux daemon or could run on a separate
> security processor like 'Secure element' or say arm trustzone etc. The
> applications forward any handshake message it gets backs from handshake
> agent to the connected tcp socket. Therefore, the applications act as a
> forwarder of the handshake messages between the peer tls endpoint and
> handshake agent.
> The received data messages are absorbed by the applications themselves
> (bypassing ssl stack completely). Similarly, the applications tx data
> directly by writing on their socket.
> 
> > Waiting until after handshake solves both of these issues.
>  
> The security sensitive check which is 'Wait for handshake to finish
> completely before accepting data' should not be the onus of the
> application. We have enough examples in past where application
> programmers made mistakes in setting up tls correctly. The idea is to
> isolate tls session setting up from the applications.

It's not clear to me what you gain by putting this 'handshake
finished' notification in the kernel instead of in the client's tls
library - you're already forwarding the handshake start notification
to the daemon, why can't the daemon notify them back in userspace that
the handshake is finished?   

If you did want to put the notification in the kernel, how would you
handle poll on the socket, since probably both the handshake daemon
and client might be polling the socket, but one for control messages
and one for data? 

The original kernel TLS RFC did split these to two separate sockets,
but we decided it was too complicated, and that's not how userspace
TLS clients function today.

Do you have an implementation of this?  There are a bunch of tricky
corner cases here, it might make more sense to have something concrete
to discuss.

> Further, as per tls RFC it is ok to piggyback the data records after the
> finished handshake message. This is called early data. But then it is the
> responsibility of applications to first complete finished message
> processing before accepting the data records.
> 
> The proposal is to disallow application world seeing data records
> before handshake finishes.

You're talking about the TLS 1.3 0-RTT feature, which is indeed an
interesting case.  For in-process TLS libraries, it's fairly easy to
punt, and don't set the kernel TLS keys until after the 0-RTT data +
handshake message.  For an OOB handshake daemon it might indeed make
more sense to leave the data in kernelspace ... somehow.

> > >   - The handshake state should fallback to 'unverified' in case a
> > >     control record is seen again by kernel TLS (e.g. in case of
> > >     renegotiation, post handshake client auth etc).
> >
> > Currently kernel tls sockets return an error unless you explicitly
> > handle the control record for exactly this reason.
>
> IIRC, any kind handshake message post handshake-completion is a problem
> for kernel tls.
> This includes renegotiation, post handshake client-auth etc.
> 
> Please correct me if I am wrong.

You are correct, but currently kernel TLS sockets return an error
unless you explicitly handle the control message.  This should be
enough already to implement your proposal.
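
For readers unfamiliar with "explicitly handling the control message":
as far as I know, the receive path reports the record type through a
control message of level SOL_TLS and type TLS_GET_RECORD_TYPE attached to
recvmsg(). The sketch below only shows the cmsg parsing step. The
fallback #defines (282 and 2) are assumptions for systems whose headers
lack them, and `tls_record_type` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_TLS
#define SOL_TLS 282		/* assumed value, normally from the headers */
#endif
#ifndef TLS_GET_RECORD_TYPE
#define TLS_GET_RECORD_TYPE 2	/* assumed value, normally <linux/tls.h> */
#endif

/*
 * Return the TLS record type carried in the ancillary data of a message
 * filled in by recvmsg() (e.g. 22 = handshake, 21 = alert), or -1 if no
 * TLS record-type control message is present.
 */
static int tls_record_type(struct msghdr *msg)
{
	struct cmsghdr *cmsg;

	assert(msg != NULL);
	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
		if (cmsg->cmsg_level == SOL_TLS &&
		    cmsg->cmsg_type == TLS_GET_RECORD_TYPE)
			return *(unsigned char *)CMSG_DATA(cmsg);
	}
	return -1;
}
```

A forwarding application could call this after each recvmsg() and hand
anything that is not application data to the handshake agent.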

RE: Security enhancement proposal for kernel TLS

2018-07-29 Thread Vakul Garg

Sorry for a delayed response.
Kindly see inline.

> -----Original Message-----
> From: Dave Watson [mailto:davejwat...@fb.com]
> Sent: Wednesday, July 25, 2018 9:30 PM
> To: Vakul Garg 
> Cc: netdev@vger.kernel.org; Peter Doliwa ; Boris
> Pismenny 
> Subject: Re: Security enhancement proposal for kernel TLS
> 
> You would probably get more responses if you cc the relevant people.
> Comments inline
> 
> On 07/22/18 12:49 PM, Vakul Garg wrote:
> > The kernel based TLS record layer allows the user space world to use a
> > decoupled TLS implementation.
> > The applications need not be linked with TLS stack.
> > The TLS handshake can be done by a TLS daemon on the behalf of
> > applications.
> >
> > Presently, as soon as the handshake process derives keys, it pushes the
> > negotiated keys to kernel TLS.
> > Thereafter the applications can directly read and write data on their
> > TCP socket (without having to use SSL apis).
> >
> > With the current kernel TLS implementation, there is a security problem.
> > Since the kernel TLS socket does not have information about the state
> > of handshake, it allows applications to be able to receive data from
> > the peer TLS endpoint even when the handshake verification has not been
> > completed by the SSL daemon.
> > It is a security problem if applications can receive data if
> > verification of the handshake transcript is not completed (done with
> > processing tls FINISHED message).
> >
> > My proposal:
> > - Kernel TLS should maintain state of handshake (verified or
> >   unverified).
> >   In un-verified state, data records should not be allowed pass through
> >   to the applications.
> >
> > - Add a new control interface using which that the user space SSL
> >   stack can tell the TLS socket that handshake has been verified and
> >   DATA records can flow.
> >   In 'unverified' state, only control records should be allowed to pass
> >   and reception DATA record should be pause the receive side record
> >   decryption.
> 
> It's not entirely clear how your TLS handshake daemon works -   Why is
> it necessary to set the keys in the kernel tls socket before the handshake is
> completed? 

IIUC, with the upstream implementation of tls record layer in kernel, the
decryption of tls FINISHED message happens in kernel. Therefore the keys are
already being sent to kernel tls socket before handshake is completed.

> Or, why do you need to hand off the fd to the client program
> before the handshake is completed?

The fd is always owned by the client program..
The client program opens up the socket, TCP bind/connect it and then
hands it over to SSL stack as a transport handle for exchanging handshake
messages. This is how it works today whether we use kernel TLS or not.
I do not propose to change it.

In my proposal, the applications poll their own tcp socket using
read/recvmsg etc.
If they get handshake record, they forward it to the entity running
handshake agent.
The handshake agent could be a linux daemon or could run on a separate
security processor like 'Secure element' or say arm trustzone etc. The
applications forward any handshake message it gets backs from handshake
agent to the connected tcp socket. Therefore, the applications act as a
forwarder of the handshake messages between the peer tls endpoint and
handshake agent.
The received data messages are absorbed by the applications themselves
(bypassing ssl stack completely). Similarly, the applications tx data
directly by writing on their socket.

> Waiting until after handshake solves both of these issues.

The security sensitive check which is 'Wait for handshake to finish
completely before accepting data' should not be the onus of the
application. We have enough examples in past where application
programmers made mistakes in setting up tls correctly. The idea is to
isolate tls session setting up from the applications.

> 
> I'm not aware of any tls libraries that send data before the finished message,
> is there any reason you need to support this?

Sending data records before sending finished message is a protocol error.
A good tls library never does that. But an attacker can exploit it if
applications can receive the data records before handshake is finished.
With current kernel TLS, it is possible to do so.

Further, as per tls RFC it is ok to piggyback the data records after the
finished handshake message. This is called early data. But then it is the
responsibility of applications to first complete finished message
processing before accepting the data records.

The proposal is to disallow application world seeing data records before
handshake finishes.

> 
> >
> > - The handshake state should fallback to 'unverified'

Re: Security enhancement proposal for kernel TLS

2018-07-25 Thread Dave Watson

You would probably get more responses if you cc the relevant people.
Comments inline

On 07/22/18 12:49 PM, Vakul Garg wrote:
> The kernel based TLS record layer allows the user space world to use a
> decoupled TLS implementation.
> The applications need not be linked with TLS stack.
> The TLS handshake can be done by a TLS daemon on the behalf of applications.
>
> Presently, as soon as the handshake process derives keys, it pushes the
> negotiated keys to kernel TLS.
> Thereafter the applications can directly read and write data on their TCP
> socket (without having to use SSL apis).
>
> With the current kernel TLS implementation, there is a security problem.
> Since the kernel TLS socket does not have information about the state of
> handshake, it allows applications to be able to receive data from the
> peer TLS endpoint even when the handshake verification has not been
> completed by the SSL daemon.
> It is a security problem if applications can receive data if verification
> of the handshake transcript is not completed (done with processing tls
> FINISHED message).
>
> My proposal:
>   - Kernel TLS should maintain state of handshake (verified or
>     unverified).
>     In un-verified state, data records should not be allowed pass through
>     to the applications.
>
>   - Add a new control interface using which that the user space SSL stack
>     can tell the TLS socket that handshake has been verified and DATA
>     records can flow.
>     In 'unverified' state, only control records should be allowed to pass
>     and reception DATA record should be pause the receive side record
>     decryption.

It's not entirely clear how your TLS handshake daemon works -   Why is
it necessary to set the keys in the kernel tls socket before the
handshake is completed?  Or, why do you need to hand off the fd to the
client program before the handshake is completed?  

Waiting until after handshake solves both of these issues.

I'm not aware of any tls libraries that send data before the finished
message, is there any reason you need to support this?

> 
>   - The handshake state should fallback to 'unverified' in case a control 
> record is seen again by kernel TLS (e.g. in case of renegotiation, post 
> handshake client auth etc).

Currently kernel tls sockets return an error unless you explicitly
handle the control record for exactly this reason.

If you want an external daemon to handle control messages after
handshake, there definitely might be some synchronization that would
make sense to push in the kernel.  However, with TLS 1.3 removing
renegotiation (and currently reneg is not implemented in kernel tls
anyway), there's much less reason to do so.

Security enhancement proposal for kernel TLS

2018-07-22 Thread Vakul Garg

Hi

The kernel based TLS record layer allows the user space world to use a 
decoupled TLS implementation.
The applications need not be linked with TLS stack. 
The TLS handshake can be done by a TLS daemon on the behalf of applications.

Presently, as soon as the handshake process derives keys, it pushes the 
negotiated keys to kernel TLS . 
Thereafter the applications can directly read and write data on their TCP 
socket (without having to use SSL apis).

With the current kernel TLS implementation, there is a security problem.
Since the kernel TLS socket does not have information about the state of
handshake, it allows applications to be able to receive data from the
peer TLS endpoint even when the handshake verification has not been
completed by the SSL daemon.
It is a security problem if applications can receive data if verification
of the handshake transcript is not completed (done with processing tls
FINISHED message).

My proposal:
- Kernel TLS should maintain the state of the handshake (verified or
  unverified).
  In the unverified state, data records should not be allowed to pass
  through to the applications.

- Add a new control interface through which the user space SSL stack
  can tell the TLS socket that the handshake has been verified and DATA
  records can flow.
  In the 'unverified' state, only control records should be allowed to
  pass, and reception of a DATA record should pause the receive side
  record decryption.

- The handshake state should fall back to 'unverified' in case a control
  record is seen again by kernel TLS (e.g. in case of renegotiation, post
  handshake client auth etc).
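
The three points above amount to a small receive-side state machine. The
following is a user-space model of that idea only, not kernel code: the
type and function names (`hs_gate`, `hs_gate_rx`, `hs_gate_set_verified`)
are hypothetical, and only the TLS record content type numbers are
standard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Standard TLS record content types. */
enum { REC_CCS = 20, REC_ALERT = 21, REC_HANDSHAKE = 22, REC_DATA = 23 };

/* What the receive path does with an incoming record. */
enum rx_action { RX_DELIVER_CONTROL, RX_DELIVER_DATA, RX_HOLD_DATA };

/* Per-socket handshake state; sockets start out unverified. */
struct hs_gate {
	bool verified;
};

/* Point 2: the SSL stack reports that FINISHED has been verified. */
static void hs_gate_set_verified(struct hs_gate *g)
{
	assert(g != NULL);
	g->verified = true;
}

/* Points 1 and 3: decide what to do with one received record. */
static enum rx_action hs_gate_rx(struct hs_gate *g, int record_type)
{
	assert(g != NULL);
	if (record_type != REC_DATA) {
		/* Any control record drops the socket back to unverified. */
		g->verified = false;
		return RX_DELIVER_CONTROL;
	}
	/* Data flows only while verified; otherwise decryption pauses. */
	return g->verified ? RX_DELIVER_DATA : RX_HOLD_DATA;
}
```

The model takes the third point literally and treats every non-data
record, alerts included, as a reason to fall back to unverified; a real
design might want to narrow that to handshake records.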

Kindly comment.

Regards

Vakul

Proposal

2018-07-12 Thread Miss Victoria Mehmet

Hello



I have a business proposal of mutual benefits i would like to discuss with
you i asked before and i still await your positive response thanks

Proposal

2018-07-12 Thread Miss Victoria Mehmet

Hello

I have a business proposal of mutual benefits i would like to discuss with
you.
