Re: [RFC net-next] net: stmmac: should not modify RX descriptor when STMMAC resume

2021-04-20 Thread Jon Hunter



On 20/04/2021 02:49, Joakim Zhang wrote:

...

>> I have tested this patch, but unfortunately the board still fails to resume
>> correctly. So it appears to suffer with the same issue we saw on the previous
>> implementation.
> 
> Could I double check with you? Have you reverted commit 9c63faaa931e
> ("net: stmmac: re-init rx buffers when mac resume back") and then
> applied the above patch for the test?
> 
> If so, did you still see the same issue as with commit 9c63faaa931e?
> Let's recall the problem: the system suspended, but hung when the
> STMMAC resumed, right?


I tested your patch on top of next-20210419 which has Thierry's revert
of 9c63faaa931e. So yes this is reverted. Unfortunately, with this
change resuming from suspend still does not work.

Jon 

-- 
nvpublic


[PATCH] ptp: Don't print an error if ptp_kvm is not supported

2021-04-20 Thread Jon Hunter
Commit 300bb1fe7671 ("ptp: arm/arm64: Enable ptp_kvm for arm/arm64")
enabled ptp_kvm support for ARM platforms. For any ARM platform that
does not support this, the following error message is displayed ...

 ERR KERN fail to initialize ptp_kvm

For platforms that do not support ptp_kvm this error is misleading, so
fix it by only printing the message if the error returned by
kvm_arch_ptp_init() is not -EOPNOTSUPP. Note that -EOPNOTSUPP is only
returned by ARM platforms today when ptp_kvm is not supported.

Fixes: 300bb1fe7671 ("ptp: arm/arm64: Enable ptp_kvm for arm/arm64")
Signed-off-by: Jon Hunter 
---
 drivers/ptp/ptp_kvm_common.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/ptp/ptp_kvm_common.c b/drivers/ptp/ptp_kvm_common.c
index 721ddcede5e1..fcae32f56f25 100644
--- a/drivers/ptp/ptp_kvm_common.c
+++ b/drivers/ptp/ptp_kvm_common.c
@@ -138,7 +138,8 @@ static int __init ptp_kvm_init(void)
 
ret = kvm_arch_ptp_init();
if (ret) {
-   pr_err("fail to initialize ptp_kvm");
+   if (ret != -EOPNOTSUPP)
+   pr_err("fail to initialize ptp_kvm");
return ret;
}
 
-- 
2.25.1



Re: [RFC net-next] net: stmmac: should not modify RX descriptor when STMMAC resume

2021-04-19 Thread Jon Hunter
Hi Joakim,

On 19/04/2021 12:59, Joakim Zhang wrote:
> When the system resumes, the STMMAC clears the RX descriptors:
> stmmac_resume()
>   ->stmmac_clear_descriptors()
>   ->stmmac_clear_rx_descriptors()
>   ->stmmac_init_rx_desc()
>   ->dwmac4_set_rx_owner()
>   //p->des3 |= cpu_to_le32(RDES3_OWN | RDES3_BUFFER1_VALID_ADDR);
> This only asserts the OWN and BUF1V bits in the des3 field; it does not
> clear the des0/1/2 fields.
> 
> Consider this case: when the system suspends, it is possible that there
> are packets which have not yet been received, so the RX descriptors have
> been written back by the DMA, e.g.
> 008 [0xc4310080]: 0x0 0x40 0x0 0x34010040
> 
> When the system resumes, after the above process, it becomes a broken
> descriptor:
> 008 [0xc4310080]: 0x0 0x40 0x0 0xb5010040
> 
> The issue is that this only changes the owner of the descriptor but does
> nothing about the des0/1/2 fields. The STMMAC descriptor is a bit
> special: the application prepares RX descriptors for the DMA, and after
> the DMA receives packets it writes the descriptors back, so the same
> field of a descriptor has different meanings to the application and to
> the DMA. This should be a software bug; it may not be easy to reproduce,
> but there is a certain probability that it will occur.
> 
> Commit 9c63faaa931e ("net: stmmac: re-init rx buffers when mac resume
> back") tried to re-init des0/des1 (the buffer address fields) to fix
> this issue, but it is not a proper solution and caused a regression on
> Jetson TX2 boards.
> 
> It is unreasonable to modify RX descriptors outside of the
> stmmac_rx_refill() function, which clears all of the
> des0/des1/des2/des3 fields together.
> 
> This patch removes the RX descriptor modification on STMMAC resume.
> 
> Signed-off-by: Joakim Zhang 
> ---
>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
> b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index 9f396648d76f..b784304a22e8 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -7186,6 +7186,8 @@ static void stmmac_reset_queues_param(struct 
> stmmac_priv *priv)
>   tx_q->mss = 0;
>  
>   netdev_tx_reset_queue(netdev_get_tx_queue(priv->dev, queue));
> +
> + stmmac_clear_tx_descriptors(priv, queue);
>   }
>  }
>  
> @@ -7250,7 +7252,6 @@ int stmmac_resume(struct device *dev)
>   stmmac_reset_queues_param(priv);
>  
>   stmmac_free_tx_skbufs(priv);
> - stmmac_clear_descriptors(priv);
>  
>   stmmac_hw_setup(ndev, false);
>   stmmac_init_coalesce(priv);
> 


I have tested this patch, but unfortunately the board still fails to
resume correctly. So it appears to suffer with the same issue we saw on
the previous implementation.

Jon

-- 
nvpublic


Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-04-13 Thread Jon Hunter


On 01/04/2021 17:28, Jon Hunter wrote:
> 
> On 31/03/2021 12:41, Joakim Zhang wrote:
> 
> ...
> 
>>> In answer to your question, resuming from suspend does work on this board
>>> without your change. We have been testing suspend/resume now on this board
>>> since Linux v5.8 and so we have the ability to bisect such regressions. So 
>>> it is
>>> clear to me that this is the change that caused this, but I am not sure why.
>>
>> Yes, I know this issue is a regression caused by my patch. I just want
>> to analyze the potential reasons, since the code change is only related
>> to the page recycle and reallocation. So I wonder whether this page
>> operation needs the IOMMU to be working when the IOMMU is enabled.
>> Could you help check whether the IOMMU driver resumes before the
>> STMMAC? Our common desire is to find the root cause, right?
> 
> 
> Yes of course that is the desire here indeed. I had assumed that the
> suspend/resume order was good because we have never seen any problems,
> but nonetheless it is always good to check. Using ftrace I enabled
> tracing of the appropriate suspend/resume functions and this is what
> I see ...
> 
> # tracer: function
> #
> # entries-in-buffer/entries-written: 4/4   #P:6
> #
> #_-=> irqs-off
> #   / _=> need-resched
> #  | / _---=> hardirq/softirq
> #  || / _--=> preempt-depth
> #  ||| / delay
> #   TASK-PID CPU#     TIMESTAMP  FUNCTION
> #  | | |     | |
>  rtcwake-748 [000] ...1   536.700777: stmmac_pltfr_suspend 
> <-platform_pm_suspend
>  rtcwake-748 [000] ...1   536.735532: arm_smmu_pm_suspend 
> <-platform_pm_suspend
>  rtcwake-748 [000] ...1   536.757290: arm_smmu_pm_resume 
> <-platform_pm_resume
>  rtcwake-748 [003] ...1   536.856771: stmmac_pltfr_resume 
> <-platform_pm_resume
> 
> 
> So I don't see any ordering issues that could be causing this. 


Another thing I have found is that for our platform, if the driver for
the ethernet PHY (in this case the Broadcom PHY) is enabled, then it
fails to resume, but if I disable the PHY driver in the kernel
configuration, then resume works. I have found that if I move the
reinit of the RX buffers to before the startup of the PHY, then it can
resume OK with the PHY enabled.

Does the following work for you? Does your platform use a specific
ethernet PHY driver?

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 208cae344ffa..071d15d86dbe 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -5416,19 +5416,20 @@ int stmmac_resume(struct device *dev)
return ret;
}
+   rtnl_lock();
+   mutex_lock(&priv->lock);
+   stmmac_reinit_rx_buffers(priv);
+   mutex_unlock(&priv->lock);
+
if (!device_may_wakeup(priv->device) || !priv->plat->pmt) {
-   rtnl_lock();
phylink_start(priv->phylink);
/* We may have called phylink_speed_down before */
phylink_speed_up(priv->phylink);
-   rtnl_unlock();
}
-   rtnl_lock();
mutex_lock(&priv->lock);
stmmac_reset_queues_param(priv);
-   stmmac_reinit_rx_buffers(priv);
stmmac_free_tx_skbufs(priv);
stmmac_clear_descriptors(priv);


It is still not clear to us why the existing call to
stmmac_clear_descriptors() is not sufficient to fix your problem.

How often does the issue you see occur?

Thanks
Jon

-- 
nvpublic


Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-04-01 Thread Jon Hunter


On 31/03/2021 12:41, Joakim Zhang wrote:

...

>> In answer to your question, resuming from suspend does work on this board
>> without your change. We have been testing suspend/resume now on this board
>> since Linux v5.8 and so we have the ability to bisect such regressions. So 
>> it is
>> clear to me that this is the change that caused this, but I am not sure why.
> 
> Yes, I know this issue is a regression caused by my patch. I just want
> to analyze the potential reasons, since the code change is only related
> to the page recycle and reallocation. So I wonder whether this page
> operation needs the IOMMU to be working when the IOMMU is enabled.
> Could you help check whether the IOMMU driver resumes before the
> STMMAC? Our common desire is to find the root cause, right?


Yes of course that is the desire here indeed. I had assumed that the
suspend/resume order was good because we have never seen any problems,
but nonetheless it is always good to check. Using ftrace I enabled
tracing of the appropriate suspend/resume functions and this is what
I see ...

# tracer: function
#
# entries-in-buffer/entries-written: 4/4   #P:6
#
#_-=> irqs-off
#   / _=> need-resched
#  | / _---=> hardirq/softirq
#  || / _--=> preempt-depth
#  ||| / delay
#   TASK-PID CPU#     TIMESTAMP  FUNCTION
#  | | |     | |
 rtcwake-748 [000] ...1   536.700777: stmmac_pltfr_suspend 
<-platform_pm_suspend
 rtcwake-748 [000] ...1   536.735532: arm_smmu_pm_suspend 
<-platform_pm_suspend
 rtcwake-748 [000] ...1   536.757290: arm_smmu_pm_resume 
<-platform_pm_resume
 rtcwake-748 [003] ...1   536.856771: stmmac_pltfr_resume 
<-platform_pm_resume


So I don't see any ordering issues that could be causing this. 

Jon

-- 
nvpublic


Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-03-31 Thread Jon Hunter


On 31/03/2021 12:10, Joakim Zhang wrote:

...

>>>>>>>> You mean one of your boards? Do other boards with the STMMAC work
>>>>>>>> fine?
>>>>>>>
>>>>>>> We have two devices with the STMMAC and one works OK and the other
>>>>>>> fails. They are different generations of device and so there could
>>>>>>> be some architectural differences causing this to only be seen on
>>>>>>> one device.
>>>>>> It's really strange, but I also don't know what architectural
>>>>>> differences could affect this. Sorry.
>>>
>>>
>>> I realised that the board which fails after this change has the IOMMU
>>> enabled. The other board does not at the moment (although work is in
>>> progress to enable it). If I add 'iommu.passthrough=1' to the cmdline
>>> for the failing board, then it works again. So in my case, the problem
>>> is linked to the IOMMU being enabled.
>>>
>>> Does your platform enable the IOMMU?
>>
>> Hi Jon,
>>
>> There is no IOMMU hardware available on our boards. But why would the
>> IOMMU affect this during suspend/resume, with no problem in normal
>> operation?
> 
> One more thought: I saw that drivers/iommu/tegra-gart.c (not sure if it
> is this one) supports suspend/resume; is it possible that the IOMMU
> resumes after the stmmac?


This board is the tegra186-p2771- (Jetson TX2) and uses the
arm,mmu-500 and not the above driver.

In answer to your question, resuming from suspend does work on this
board without your change. We have been testing suspend/resume now on
this board since Linux v5.8 and so we have the ability to bisect such
regressions. So it is clear to me that this is the change that caused
this, but I am not sure why.

Thanks
Jon

-- 
nvpublic


Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-03-31 Thread Jon Hunter


On 31/03/2021 08:43, Joakim Zhang wrote:

...

>>>>>>> You mean one of your boards? Do other boards with the STMMAC work
>>>>>>> fine?
>>>>>>
>>>>>> We have two devices with the STMMAC and one works OK and the other
>>>>>> fails. They are different generations of device and so there could
>>>>>> be some architectural differences causing this to only be seen on
>>>>>> one device.
>>>>> It's really strange, but I also don't know what architectural
>>>>> differences could affect this. Sorry.
>>
>>
>> I realised that the board which fails after this change has the IOMMU
>> enabled. The other board does not at the moment (although work is in
>> progress to enable it). If I add 'iommu.passthrough=1' to the cmdline
>> for the failing board, then it works again. So in my case, the problem
>> is linked to the IOMMU being enabled.
>>
>> Does your platform enable the IOMMU?
> 
> Hi Jon,
> 
> There is no IOMMU hardware available on our boards. But why would the
> IOMMU affect this during suspend/resume, with no problem in normal
> operation?


I am not sure either and I don't see anything obvious.

Giuseppe, Alexandre, Jose, do you see anything that is wrong with
Joakim's change 9c63faaa931e? This is completely breaking resume from
suspend on one of our boards and I would like to get your input.

Thanks
Jon

-- 
nvpublic


Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-03-30 Thread Jon Hunter



On 25/03/2021 08:12, Joakim Zhang wrote:

...

>>>>> You mean one of your boards? Do other boards with the STMMAC work
>>>>> fine?
>>>>
>>>> We have two devices with the STMMAC and one works OK and the other
>>>> fails. They are different generations of device and so there could be
>>>> some architectural differences causing this to only be seen on one
>>>> device.
>>> It's really strange, but I also don't know what architectural
>>> differences could affect this. Sorry.


I realised that the board which fails after this change has the IOMMU
enabled. The other board does not at the moment (although work is in
progress to enable it). If I add 'iommu.passthrough=1' to the cmdline
for the failing board, then it works again. So in my case, the problem
is linked to the IOMMU being enabled.

Does your platform enable the IOMMU?

Thanks
Jon

-- 
nvpublic


Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-03-25 Thread Jon Hunter


On 25/03/2021 07:53, Joakim Zhang wrote:
> 
>> -Original Message-
>> From: Jon Hunter 
>> Sent: 2021年3月24日 20:39
>> To: Joakim Zhang 
>> Cc: netdev@vger.kernel.org; Linux Kernel Mailing List
>> ; linux-tegra ;
>> Jakub Kicinski 
>> Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
>> resume back
>>
>>
>>
>> On 24/03/2021 12:20, Joakim Zhang wrote:
>>
>> ...
>>
>>> Sorry for this breakage on your side.
>>>
>>> You mean one of your boards? Do other boards with the STMMAC work
>>> fine?
>>
>> We have two devices with the STMMAC and one works OK and the other
>> fails. They are different generations of device and so there could be
>> some architectural differences causing this to only be seen on one
>> device.
> It's really strange, but I also don't know what architectural
> differences could affect this. Sorry.


Maybe caching somewhere? In other words, could there be any cache
flushing that we are missing here?

>>> We do daily tests with NFS mounting the rootfs, and no issue was
>>> found. I added this patch to the resume path with error checking, so
>>> it should not break suspend. I even did an overnight stress test and
>>> found no issue.
>>>
>>> Could you please do more testing to see where the issue happens?
>>
>> The issue occurs 100% of the time on the failing board and always on the 
>> first
>> resume from suspend. Is there any more debug I can enable to track down
>> what the problem is?
>>
> 
> As the commit message described, the patch aims to re-init the rx
> buffer addresses; since the addresses are not fixed, I can only recycle
> and then re-allocate all of them. The page pool is allocated once, when
> the net device is opened.
> 
> Could you please debug whether it fails in some function, such as
> page_pool_dev_alloc_pages()?


Yes that was the first thing I tried, but no obvious failures from
allocating the pools.

Are you certain that the problem you are seeing, which is fixed by this
change, is generic to all devices? The commit message states that
'descriptor write back by DMA could exhibit unusual behavior'; is this a
known issue in the STMMAC controller? If so, does it impact all
versions, and what is the actual problem?

Jon

-- 
nvpublic


Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-03-24 Thread Jon Hunter



On 24/03/2021 12:20, Joakim Zhang wrote:

...

> Sorry for this breakage on your side.
> 
> You mean one of your boards? Do other boards with the STMMAC work fine?

We have two devices with the STMMAC and one works OK and the other
fails. They are different generations of device and so there could be
some architectural differences causing this to only be seen on one
device.

> We do daily tests with NFS mounting the rootfs, and no issue was found.
> I added this patch to the resume path with error checking, so it should
> not break suspend. I even did an overnight stress test and found no
> issue.
> 
> Could you please do more testing to see where the issue happens?

The issue occurs 100% of the time on the failing board and always on the
first resume from suspend. Is there any more debug I can enable to track
down what the problem is?

Jon

-- 
nvpublic


Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

2021-03-24 Thread Jon Hunter
Hi Joakim,

Starting with v5.12-rc3 I noticed that one of our boards, the Tegra186
Jetson TX2, was no longer resuming from suspend. Bisect points to commit
9c63faaa931e ("net: stmmac: re-init rx buffers when mac resume back"),
and reverting this on top of mainline fixes the problem.

Interestingly, the board appears to partially resume from suspend and I
see ethernet appear to resume ...

 dwc-eth-dwmac 249.ethernet eth0: configuring for phy/rgmii link
 mode
 dwmac4: Master AXI performs any burst length
 dwc-eth-dwmac 249.ethernet eth0: No Safety Features support found
 dwc-eth-dwmac 249.ethernet eth0: Link is Up - 1Gbps/Full - flow
 control rx/tx

I don't see any crash, but then it hangs at some point. Please note that
this board is using NFS for mounting the rootfs.

Let me know if there is any more info I can provide or tests I can run.

Thanks
Jon




Re: phy_attach_direct()'s use of device_bind_driver()

2021-02-11 Thread Jon Hunter


On 10/02/2021 22:56, Andrew Lunn wrote:
> On Wed, Feb 10, 2021 at 02:13:48PM -0800, Saravana Kannan wrote:
>> Hi,
>>
>> This email was triggered by this other email[1].
> 
> And it appears the Tegra194 Jetson Xavier uses the Marvell 88E1512
> PHY. So ensure the Marvell driver is available, and it should get
> probed in the usual way; the fallback driver will not be needed.


Yes that is correct. Enabling the Marvell PHY does fix this indeed and
so I can enable that as part of our testsuite. We were seeing the same
warning on Tegra186 Jetson TX2 and enabling the BRCM PHY resolves that
as well. I will ensure that these are enabled going forward.

Cheers
Jon

-- 
nvpublic


Re: [PATCH v6 12/16] net: tip: fix a couple kernel-doc markups

2021-01-14 Thread Jon Maloy




On 1/14/21 3:04 AM, Mauro Carvalho Chehab wrote:

A couple of functions have a different name in their kernel-doc markup
than in their prototype:

../net/tipc/link.c:2551: warning: expecting prototype for 
link_reset_stats(). Prototype was for tipc_link_reset_stats() instead
../net/tipc/node.c:1678: warning: expecting prototype for is the 
general link level function for message sending(). Prototype was for 
tipc_node_xmit() instead

Signed-off-by: Mauro Carvalho Chehab 
---
  net/tipc/link.c | 2 +-
  net/tipc/node.c | 2 +-
  2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/tipc/link.c b/net/tipc/link.c
index a6a694b78927..115109259430 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -2544,7 +2544,7 @@ void tipc_link_set_queue_limits(struct tipc_link *l, u32 
min_win, u32 max_win)
  }
  
  /**

- * link_reset_stats - reset link statistics
+ * tipc_link_reset_stats - reset link statistics
   * @l: pointer to link
   */
  void tipc_link_reset_stats(struct tipc_link *l)
diff --git a/net/tipc/node.c b/net/tipc/node.c
index 83d9eb830592..008670d1f43e 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1665,7 +1665,7 @@ static void tipc_lxc_xmit(struct net *peer_net, struct 
sk_buff_head *list)
  }
  
  /**

- * tipc_node_xmit() is the general link level function for message sending
+ * tipc_node_xmit() - general link level function for message sending
   * @net: the applicable net namespace
   * @list: chain of buffers containing message
   * @dnode: address of destination node

Acked-by: Jon Maloy 



Re: [PATCH RESEND net-next 1/2] dpaa2-eth: send a scatter-gather FD instead of realloc-ing

2020-12-12 Thread Jon Nettleton
> > > > > [  714.465196] Hardware name: SolidRun LX2160A Honeycomb (DT)
> > > > > [  714.465202] pstate: 6005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
> > > > > [  714.465207] pc : __arm_lpae_map+0x2d4/0x30c
> > > > > [  714.465211] lr : __arm_lpae_map+0x114/0x30c
> > > > > [  714.465215] sp : 80001006b340
> > > > > [  714.465219] x29: 80001006b340 x28: 002086538003
> > > > > [  714.465227] x27: 0a20 x26: 1000
> > > > > [  714.465236] x25: 0f44 x24: 0020adf8d000
> > > > > [  714.465245] x23: 0001 x22: faeca000
> > > > > [  714.465253] x21: 0003 x20: 19b60d64d200
> > > > > [  714.465261] x19: 00ca x18: 
> > > > > [  714.465270] x17:  x16: cccb7cf3ca20
> > > > > [  714.465278] x15:  x14: 
> > > > > [  714.465286] x13: 0003 x12: 0010
> > > > > [  714.465294] x11:  x10: 0002
> > > > > [  714.465302] x9 : cccb7d5b6e78 x8 : 01ff
> > > > > [  714.465311] x7 : 19b606538650 x6 : 19b606538000
> > > > > [  714.465319] x5 : 0009 x4 : 0f44
> > > > > [  714.465327] x3 : 1000 x2 : 0020adf8d000
> > > > > [  714.465335] x1 : 0002 x0 : 0003
> > > > > [  714.465343] Call trace:
> > > > > [  714.465348]  __arm_lpae_map+0x2d4/0x30c
> > > > > [  714.465353]  __arm_lpae_map+0x114/0x30c
> > > > > [  714.465357]  __arm_lpae_map+0x114/0x30c
> > > > > [  714.465362]  __arm_lpae_map+0x114/0x30c
> > > > > [  714.465366]  arm_lpae_map+0xf4/0x180
> > > > > [  714.465373]  arm_smmu_map+0x4c/0xc0
> > > > > [  714.465379]  __iommu_map+0x100/0x2bc
> > > > > [  714.465385]  iommu_map_atomic+0x20/0x30
> > > > > [  714.465391]  __iommu_dma_map+0xb0/0x110
> > > > > [  714.465397]  iommu_dma_map_page+0xb8/0x120
> > > > > [  714.465404]  dma_map_page_attrs+0x1a8/0x210
> > > > > [  714.465413]  __dpaa2_eth_tx+0x384/0xbd0 [fsl_dpaa2_eth]
> > > > > [  714.465421]  dpaa2_eth_tx+0x84/0x134 [fsl_dpaa2_eth]
> > > > > [  714.465427]  dev_hard_start_xmit+0x10c/0x2b0
> > > > > [  714.465433]  sch_direct_xmit+0x1a0/0x550
> > > > > [  714.465438]  __qdisc_run+0x140/0x670
> > > > > [  714.465443]  __dev_queue_xmit+0x6c4/0xa74
> > > > > [  714.465449]  dev_queue_xmit+0x20/0x2c
> > > > > [  714.465463]  br_dev_queue_push_xmit+0xc4/0x1a0 [bridge]
> > > > > [  714.465476]  br_forward_finish+0xdc/0xf0 [bridge]
> > > > > [  714.465489]  __br_forward+0x160/0x1c0 [bridge]
> > > > > [  714.465502]  br_forward+0x13c/0x160 [bridge]
> > > > > [  714.465514]  br_dev_xmit+0x228/0x3b0 [bridge]
> > > > > [  714.465520]  dev_hard_start_xmit+0x10c/0x2b0
> > > > > [  714.465526]  __dev_queue_xmit+0x8f0/0xa74
> > > > > [  714.465531]  dev_queue_xmit+0x20/0x2c
> > > > > [  714.465538]  arp_xmit+0xc0/0xd0
> > > > > [  714.465544]  arp_send_dst+0x78/0xa0
> > > > > [  714.465550]  arp_solicit+0xf4/0x260
> > > > > [  714.465554]  neigh_probe+0x64/0xb0
> > > > > [  714.465560]  neigh_timer_handler+0x2f4/0x400
> > > > > [  714.465566]  call_timer_fn+0x3c/0x184
> > > > > [  714.465572]  __run_timers.part.0+0x2bc/0x370
> > > > > [  714.465578]  run_timer_softirq+0x48/0x80
> > > > > [  714.465583]  __do_softirq+0x120/0x36c
> > > > > [  714.465589]  irq_exit+0xac/0x100
> > > > > [  714.465596]  __handle_domain_irq+0x8c/0xf0
> > > > > [  714.465600]  gic_handle_irq+0xcc/0x14c
> > > > > [  714.465605]  el1_irq+0xc4/0x180
> > > > > [  714.465610]  arch_cpu_idle+0x18/0x30
> > > > > [  714.465617]  default_idle_call+0x4c/0x180
> > > > > [  714.465623]  do_idle+0x238/0x2b0
> > > > > [  714.465629]  cpu_startup_entry+0x30/0xa0
> > > > > [  714.465636]  secondary_start_kernel+0x134/0x180
> > > > > [  714.465640] ---[ end trace a84a7f61b559005f ]---
> > > > >
> > > > >
> > > > > Given it is the iommu code that is provoking the warning I should
> > > > > probably mention that the board I have requires
> > > > > arm-smmu.disable_bypass=0 on the kernel command line in order to boot.
> > > > > Also if it matters I am running the latest firmware from Solidrun
> > > > > which is based on LSDK-20.04.
> > > > >
> > > >
> > > > Hmmm, from what I remember I think I tested this with the smmu bypassed
> > > > so that is why I didn't catch it.
> > > >
> > > > > Is there any reason for this code not to be working for LX2160A?
> > > >
> > > > I wouldn't expect this to be LX2160A specific but rather a bug in the
> > > > implementation.. sorry.
> > > >
> > > > Let me reproduce it and see if I can get to the bottom of it and I will
> > > > get back with some more info.
> > > >
> > >
> > > Hi Daniel,
> > >
> > > It seems that the dma-unmapping on the SGT buffer was incorrectly done
> > > with a zero size since on the Tx path I initialized the improper field.
> > >
> > > Could you test the following diff and let me know if you can generate
> > > the WARNINGs anymore?
> >
> > I fired this up and, with your change, I've not been able to trigger
> > the warning with the tests that I used the drive my bisect.
> >
>
> Great, thanks for testing this.
>
> I will take care of sending the fix to -net.
>
> Ioana

Ioana,

Please CC me when you send the patch to -net, I will put my Tested-by: on it.

Thanks
Jon

>
> > Thanks for the quick response.
> >
> >
> > Daniel.
> >
> >
> > >
> > > --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
> > > +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
> > > @@ -878,7 +878,7 @@ static int dpaa2_eth_build_sg_fd_single_buf(struct 
> > > dpaa2_eth_priv *priv,
> > > swa = (struct dpaa2_eth_swa *)sgt_buf;
> > > swa->type = DPAA2_ETH_SWA_SINGLE;
> > > swa->single.skb = skb;
> > > -   swa->sg.sgt_size = sgt_buf_size;
> > > +   swa->single.sgt_size = sgt_buf_size;
> > >
> > > /* Separately map the SGT buffer */
> > > sgt_addr = dma_map_single(dev, sgt_buf, sgt_buf_size, 
> > > DMA_BIDIRECTIONAL);
> > >
> > >
> > > Ioana


Re: [PATCH RESEND net-next 1/2] dpaa2-eth: send a scatter-gather FD instead of realloc-ing

2020-12-10 Thread Jon Nettleton
> > [  714.465343] Call trace:
> > [  714.465348]  __arm_lpae_map+0x2d4/0x30c
> > [  714.465353]  __arm_lpae_map+0x114/0x30c
> > [  714.465357]  __arm_lpae_map+0x114/0x30c
> > [  714.465362]  __arm_lpae_map+0x114/0x30c
> > [  714.465366]  arm_lpae_map+0xf4/0x180
> > [  714.465373]  arm_smmu_map+0x4c/0xc0
> > [  714.465379]  __iommu_map+0x100/0x2bc
> > [  714.465385]  iommu_map_atomic+0x20/0x30
> > [  714.465391]  __iommu_dma_map+0xb0/0x110
> > [  714.465397]  iommu_dma_map_page+0xb8/0x120
> > [  714.465404]  dma_map_page_attrs+0x1a8/0x210
> > [  714.465413]  __dpaa2_eth_tx+0x384/0xbd0 [fsl_dpaa2_eth]
> > [  714.465421]  dpaa2_eth_tx+0x84/0x134 [fsl_dpaa2_eth]
> > [  714.465427]  dev_hard_start_xmit+0x10c/0x2b0
> > [  714.465433]  sch_direct_xmit+0x1a0/0x550
> > [  714.465438]  __qdisc_run+0x140/0x670
> > [  714.465443]  __dev_queue_xmit+0x6c4/0xa74
> > [  714.465449]  dev_queue_xmit+0x20/0x2c
> > [  714.465463]  br_dev_queue_push_xmit+0xc4/0x1a0 [bridge]
> > [  714.465476]  br_forward_finish+0xdc/0xf0 [bridge]
> > [  714.465489]  __br_forward+0x160/0x1c0 [bridge]
> > [  714.465502]  br_forward+0x13c/0x160 [bridge]
> > [  714.465514]  br_dev_xmit+0x228/0x3b0 [bridge]
> > [  714.465520]  dev_hard_start_xmit+0x10c/0x2b0
> > [  714.465526]  __dev_queue_xmit+0x8f0/0xa74
> > [  714.465531]  dev_queue_xmit+0x20/0x2c
> > [  714.465538]  arp_xmit+0xc0/0xd0
> > [  714.465544]  arp_send_dst+0x78/0xa0
> > [  714.465550]  arp_solicit+0xf4/0x260
> > [  714.465554]  neigh_probe+0x64/0xb0
> > [  714.465560]  neigh_timer_handler+0x2f4/0x400
> > [  714.465566]  call_timer_fn+0x3c/0x184
> > [  714.465572]  __run_timers.part.0+0x2bc/0x370
> > [  714.465578]  run_timer_softirq+0x48/0x80
> > [  714.465583]  __do_softirq+0x120/0x36c
> > [  714.465589]  irq_exit+0xac/0x100
> > [  714.465596]  __handle_domain_irq+0x8c/0xf0
> > [  714.465600]  gic_handle_irq+0xcc/0x14c
> > [  714.465605]  el1_irq+0xc4/0x180
> > [  714.465610]  arch_cpu_idle+0x18/0x30
> > [  714.465617]  default_idle_call+0x4c/0x180
> > [  714.465623]  do_idle+0x238/0x2b0
> > [  714.465629]  cpu_startup_entry+0x30/0xa0
> > [  714.465636]  secondary_start_kernel+0x134/0x180
> > [  714.465640] ---[ end trace a84a7f61b559005f ]---
> >
> >
> > Given it is the iommu code that is provoking the warning I should
> > probably mention that the board I have requires
> > arm-smmu.disable_bypass=0 on the kernel command line in order to boot.
> > Also if it matters I am running the latest firmware from Solidrun
> > which is based on LSDK-20.04.
> >
>
> Hmmm, from what I remember I think I tested this with the smmu bypassed
> so that is why I didn't catch it.
>
> > Is there any reason for this code not to be working for LX2160A?
>
> I wouldn't expect this to be LX2160A specific but rather a bug in the
> implementation.. sorry.
>
> Let me reproduce it and see if I can get to the bottom of it and I will
> get back with some more info.
>
> Ioana

Ioana,

I reported this issue to Calvin last week.  I can verify that reverting that
change also fixes the issue for me.

-Jon

>
> >
> > Daniel.
> >
> >
> > PS A few months have gone by so I decided not to trim the patch out
> >of this reply so you don't have to go digging!
> >
> >
> >
> > >  .../freescale/dpaa2/dpaa2-eth-debugfs.c   |   7 +-
> > >  .../net/ethernet/freescale/dpaa2/dpaa2-eth.c  | 177 +++---
> > >  .../net/ethernet/freescale/dpaa2/dpaa2-eth.h  |   9 +-
> > >  .../ethernet/freescale/dpaa2/dpaa2-ethtool.c  |   1 -
> > >  4 files changed, 160 insertions(+), 34 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth-debugfs.c 
> > > b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth-debugfs.c
> > > index 2880ca02d7e7..5cb357c74dec 100644
> > > --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth-debugfs.c
> > > +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth-debugfs.c
> > > @@ -19,14 +19,14 @@ static int dpaa2_dbg_cpu_show(struct seq_file *file, 
> > > void *offset)
> > > int i;
> > >
> > > seq_printf(file, "Per-CPU stats for %s\n", priv->net_dev->name);
> > > -   seq_printf(file, "%s%16s%16s%16s%16s%16s%16s%16s%16s%16s\n",
> > > +   seq_printf(file, "%s%16s%16s%16s%16s%16s%16s%16s%16s\n",
> > >"CPU", "Rx", "Rx Err", "Rx SG", "Tx", "Tx Err", "Tx conf",
> >

Re: [PATCH net-next 2/4] net: mvpp2: add mvpp2_phylink_to_port() helper

2020-12-10 Thread Jon Nettleton
On Thu, Dec 10, 2020 at 9:27 PM Andrew Lunn  wrote:
>
> > +1. As soon as the MDIO+ACPI lands, I plan to do the rework.
>
> Don't hold you breath. It has gone very quiet about ACPI in net
> devices.

NXP resources were re-allocated for their next internal BSP release.
I have been working with Calvin over the past week and a half and the new
patchset will be submitted early next week most likely.

-Jon

>
> Andrew


Re: [PATCH net-next 4/6] net: ipa: add support to code for IPA v4.5

2020-12-01 Thread Jon Hunter


On 25/11/2020 20:45, Alex Elder wrote:
> Update the IPA code to make use of the updated IPA v4.5 register
> definitions.  Generally what this patch does is, if IPA v4.5
> hardware is in use:
>   - Ensure new registers or fields in IPA v4.5 are updated where
> required
>   - Ensure registers or fields not supported in IPA v4.5 are not
> examined when read, or are set to 0 when written
> It does this while preserving the existing functionality for IPA
> versions lower than v4.5.
> 
> The values to program for QSB_MAX_READS and QSB_MAX_WRITES and the
> source and destination resource counts are updated to be correct for
> all versions through v4.5 as well.
> 
> Note that IPA_RESOURCE_GROUP_SRC_MAX and IPA_RESOURCE_GROUP_DST_MAX
> already reflect that 5 is an acceptable number of resources (which
> IPA v4.5 implements).
> 
> Signed-off-by: Alex Elder 


This change is generating the following build error on ARM64 ...

In file included from drivers/net/ipa/ipa_main.c:9:0:
In function ‘u32_encode_bits’,
inlined from ‘ipa_hardware_config_qsb.isra.7’ at 
drivers/net/ipa/ipa_main.c:286:6,
inlined from ‘ipa_hardware_config’ at drivers/net/ipa/ipa_main.c:363:2,
inlined from ‘ipa_config.isra.12’ at drivers/net/ipa/ipa_main.c:555:2,
inlined from ‘ipa_probe’ at drivers/net/ipa/ipa_main.c:835:6:
./include/linux/bitfield.h:131:3: error: call to ‘__field_overflow’ declared 
with attribute error: value doesn't fit into mask
   __field_overflow(); \
   ^~
./include/linux/bitfield.h:151:2: note: in expansion of macro ‘MAKE_OP’
  MAKE_OP(u##size,u##size,,)
  ^~~
./include/linux/bitfield.h:154:1: note: in expansion of macro ‘__MAKE_OP’
 __MAKE_OP(32)
 ^

Cheers
Jon

-- 
nvpublic


Re: [net] tipc: fix NULL pointer dereference in tipc_named_rcv

2020-10-09 Thread Jon Maloy




On 10/9/20 12:12 AM, Hoang Huu Le wrote:

Hi Jon,  Jakub,

I tried with your comment. But looks like we got into circular locking and 
deadlock could happen like this:
     CPU0                          CPU1
     ----                          ----
 lock(&n->lock#2);
                               lock(&tn->nametbl_lock);
                               lock(&n->lock#2);
 lock(&tn->nametbl_lock);

    *** DEADLOCK ***

Regards,
Hoang
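The splat above is the classic ABBA inversion: one path takes the node lock and then wants the name table lock, while the other path takes them in the opposite order. A minimal userspace sketch of the standard cure — one fixed global acquisition order for both paths — using hypothetical names and pthread mutexes standing in for the kernel spinlocks:

```c
#include <assert.h>
#include <pthread.h>

/* Stand-ins for tn->nametbl_lock and n->lock in the report above. */
static pthread_mutex_t nametbl_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t node_lock    = PTHREAD_MUTEX_INITIALIZER;

static int table_entry;

/*
 * Hypothetical helper: every path that needs both locks takes them in
 * one fixed order (nametbl_lock first, then node_lock), so the ABBA
 * cycle shown in the deadlock report cannot form.
 */
int update_name_table_locked(int value)
{
	pthread_mutex_lock(&nametbl_lock);
	pthread_mutex_lock(&node_lock);
	table_entry = value;		/* critical section */
	pthread_mutex_unlock(&node_lock);
	pthread_mutex_unlock(&nametbl_lock);
	return table_entry;
}
```

Compile with `-lpthread`; the point is only the ordering discipline, which is why the thread below ends up preferring the already-held nametbl_lock over adding a second lock.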

Ok. So although your solution is not optimal, we know it is safe.
Again:
Acked-by: Jon Maloy 

-Original Message-
From: Jon Maloy 
Sent: Friday, October 9, 2020 1:01 AM
To: Jakub Kicinski ; Hoang Huu Le 
Cc: ma...@donjonn.com; ying@windriver.com; 
tipc-discuss...@lists.sourceforge.net; netdev@vger.kernel.org
Subject: Re: [net] tipc: fix NULL pointer dereference in tipc_named_rcv



On 10/8/20 1:25 PM, Jakub Kicinski wrote:

On Thu,  8 Oct 2020 14:31:56 +0700 Hoang Huu Le wrote:

diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
index 2f9c148f17e2..fe4edce459ad 100644
--- a/net/tipc/name_distr.c
+++ b/net/tipc/name_distr.c
@@ -327,8 +327,13 @@ static struct sk_buff *tipc_named_dequeue(struct 
sk_buff_head *namedq,
struct tipc_msg *hdr;
u16 seqno;

+   spin_lock_bh(&namedq->lock);
skb_queue_walk_safe(namedq, skb, tmp) {
-   skb_linearize(skb);
+   if (unlikely(skb_linearize(skb))) {
+   __skb_unlink(skb, namedq);
+   kfree_skb(skb);
+   continue;
+   }
hdr = buf_msg(skb);
seqno = msg_named_seqno(hdr);
if (msg_is_last_bulk(hdr)) {
@@ -338,12 +343,14 @@ static struct sk_buff *tipc_named_dequeue(struct 
sk_buff_head *namedq,

if (msg_is_bulk(hdr) || msg_is_legacy(hdr)) {
__skb_unlink(skb, namedq);
+   spin_unlock_bh(&namedq->lock);
return skb;
}

if (*open && (*rcv_nxt == seqno)) {
(*rcv_nxt)++;
__skb_unlink(skb, namedq);
+   spin_unlock_bh(&namedq->lock);
return skb;
}

@@ -353,6 +360,7 @@ static struct sk_buff *tipc_named_dequeue(struct 
sk_buff_head *namedq,
continue;
}
}
+   spin_unlock_bh(&namedq->lock);
return NULL;
   }

diff --git a/net/tipc/node.c b/net/tipc/node.c
index cf4b239fc569..d269ebe382e1 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1496,7 +1496,7 @@ static void node_lost_contact(struct tipc_node *n,

/* Clean up broadcast state */
tipc_bcast_remove_peer(n->net, n->bc_entry.link);
-   __skb_queue_purge(&n->bc_entry.namedq);
+   skb_queue_purge(&n->bc_entry.namedq);

Patch looks fine, but I'm not sure why not hold
spin_unlock_bh(&tn->nametbl_lock) here instead?

Seems like node_lost_contact() should be relatively rare,
so adding another lock to tipc_named_dequeue() is not the
right trade off.

Actually, I agree with the previous speaker here. We already have the
nametbl_lock when tipc_named_dequeue() is called, and the same lock is
accessible from node.c, where node_lost_contact() is executed. The patch
and the code become simpler.
I suggest you post a v2 of this one.

///jon
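The reason the patch pairs skb_queue_walk_safe() with __skb_unlink() is that the loop body may free the current element, so the walk must cache the next pointer before the body runs. A userspace sketch of the same pattern on a hypothetical singly linked list, with malloc/free standing in for skbs:

```c
#include <assert.h>
#include <stdlib.h>

struct item { int bad; struct item *next; };

/*
 * Drop every element the predicate rejects, as the patch drops skbs
 * whose skb_linearize() fails: 'next' is saved before any free(),
 * so the traversal survives unlinking the current element.
 */
int purge_bad(struct item **head)
{
	struct item **pp = head, *cur, *next;
	int dropped = 0;

	for (cur = *head; cur; cur = next) {
		next = cur->next;           /* cached before a possible free() */
		if (cur->bad) {
			*pp = next;         /* unlink current element */
			free(cur);
			dropped++;
		} else {
			pp = &cur->next;    /* advance the back-pointer */
		}
	}
	return dropped;
}

struct item *push(struct item *head, int bad)
{
	struct item *it = malloc(sizeof(*it));
	it->bad = bad;
	it->next = head;
	return it;
}
```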


/* Abort any ongoing link failover */
for (i = 0; i < MAX_BEARERS; i++) {




Re: [net] tipc: fix NULL pointer dereference in tipc_named_rcv

2020-10-08 Thread Jon Maloy




On 10/8/20 1:25 PM, Jakub Kicinski wrote:

On Thu,  8 Oct 2020 14:31:56 +0700 Hoang Huu Le wrote:

diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
index 2f9c148f17e2..fe4edce459ad 100644
--- a/net/tipc/name_distr.c
+++ b/net/tipc/name_distr.c
@@ -327,8 +327,13 @@ static struct sk_buff *tipc_named_dequeue(struct 
sk_buff_head *namedq,
struct tipc_msg *hdr;
u16 seqno;
  
+	spin_lock_bh(&namedq->lock);

skb_queue_walk_safe(namedq, skb, tmp) {
-   skb_linearize(skb);
+   if (unlikely(skb_linearize(skb))) {
+   __skb_unlink(skb, namedq);
+   kfree_skb(skb);
+   continue;
+   }
hdr = buf_msg(skb);
seqno = msg_named_seqno(hdr);
if (msg_is_last_bulk(hdr)) {
@@ -338,12 +343,14 @@ static struct sk_buff *tipc_named_dequeue(struct 
sk_buff_head *namedq,
  
  		if (msg_is_bulk(hdr) || msg_is_legacy(hdr)) {

__skb_unlink(skb, namedq);
+   spin_unlock_bh(&namedq->lock);
return skb;
}
  
  		if (*open && (*rcv_nxt == seqno)) {

(*rcv_nxt)++;
__skb_unlink(skb, namedq);
+   spin_unlock_bh(&namedq->lock);
return skb;
}
  
@@ -353,6 +360,7 @@ static struct sk_buff *tipc_named_dequeue(struct sk_buff_head *namedq,

continue;
}
}
+   spin_unlock_bh(&namedq->lock);
return NULL;
  }
  
diff --git a/net/tipc/node.c b/net/tipc/node.c

index cf4b239fc569..d269ebe382e1 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1496,7 +1496,7 @@ static void node_lost_contact(struct tipc_node *n,
  
  	/* Clean up broadcast state */

tipc_bcast_remove_peer(n->net, n->bc_entry.link);
-   __skb_queue_purge(&n->bc_entry.namedq);
+   skb_queue_purge(&n->bc_entry.namedq);

Patch looks fine, but I'm not sure why not hold
spin_unlock_bh(&tn->nametbl_lock) here instead?

Seems like node_lost_contact() should be relatively rare,
so adding another lock to tipc_named_dequeue() is not the
right trade off.
Actually, I agree with the previous speaker here. We already have the 
nametbl_lock when tipc_named_dequeue() is called, and the same lock is 
accessible from node.c, where node_lost_contact() is executed. The patch 
and the code become simpler.

I suggest you post a v2 of this one.

///jon


/* Abort any ongoing link failover */
for (i = 0; i < MAX_BEARERS; i++) {




Re: [net-next PATCH v7 1/6] Documentation: ACPI: DSD: Document MDIO PHY

2020-07-29 Thread Jon Nettleton
regarding sorting this out.  This
is where I see it from an Armada and Layerscape perspective.  This
isn't a silver bullet fix but the small things I think that need to be
done to move this forward.

From Microsoft's documentation.

Device dependencies

Typically, there are hardware dependencies between devices on a
particular platform. Windows requires that all such dependencies be
described so that it can ensure that all devices function correctly as
things change dynamically in the system (device power is removed,
drivers are stopped and started, and so on). In ACPI, dependencies
between devices are described in the following ways:

1) Namespace hierarchy. Any device that is a child device (listed as a
device within the namespace of another device) is dependent on the
parent device. For example, a USB HSIC device is dependent on the port
(parent) and controller (grandparent) it is connected to. Similarly, a
GPU device listed within the namespace of a system memory-management
unit (MMU) device is dependent on the MMU device.

2) Resource connections. Devices connected to GPIO or SPB controllers
are dependent on those controllers. This type of dependency is
described by the inclusion of Connection Resources in the device's
_CRS.

3) OpRegion dependencies. For ASL control methods that use OpRegions
to perform I/O, dependencies are not implicitly known by the operating
system because they are only determined during control method
evaluation. This issue is particularly applicable to GeneralPurposeIO
and GenericSerialBus OpRegions in which Plug and Play drivers provide
access to the region. To mitigate this issue, ACPI defines the
OpRegion Dependency (_DEP) object. _DEP should be used in any device
namespace in which an OpRegion (HW resource) is referenced by a
control method, and neither 1 nor 2 above already applies for the
referenced OpRegion's connection resource. For more information, see
section 6.5.8, "_DEP (Operation Region Dependencies)", of the ACPI 5.0
specification.

We can forget about 3 because even though _DEP would solve many of our
problems, and Intel has kind of used it for some of their
architectures, according to the ACPI spec it should not be used this
way.

1) can be achievable on some platforms like the LX2160a.  We have the
mcbin firmware which is the bus (the ACPI spec does allow you to
define a platform defined bus), which has MACs as the children, which
then can have phy's or SFP modules as their children.  This works okay
for enumeration and parenting but how do they talk?

This is where 2) comes into play.  The big problem is that MDIO isn't
designated as a SPB
(https://docs.microsoft.com/en-us/windows-hardware/drivers/bringup/simple-peripheral-bus--spb-)
We have GPIO, I2C, SPI, UART, MIPI and a couple of others.  While not
a silver bullet I think getting MDIO added to the spec would be the
next step forward to being able to implement this in Linux with
phylink / phylib in a sane manner.  Currently SFP definitions are
using the SPB to designate the various GPIO and I2C interfaces that
are needed to probe devices and handle interrupts.

The other alternatives are that the ACPI maintainers agree on the _DSD
method (which would be quickest, and should be easy to migrate to SPB if
MDIO were adopted), or that nothing is done at all (which I know seems to
be a popular opinion).

-Jon


Re: [net-next PATCH v7 1/6] Documentation: ACPI: DSD: Document MDIO PHY

2020-07-27 Thread Jon Nettleton
On Fri, Jul 24, 2020 at 9:14 PM Andrew Lunn  wrote:
>
> > Hence my previous comment that we should consider this an escape
> > hatch rather than the last word in how to describe networking on
> > ACPI/SBSA platforms.
>
> One problem i have is that this patch set suggests ACPI can be used to
> describe complex network hardware. It is opening the door for others
> to follow and add more ACPI support in networking. How long before it
> is not considered an escape hatch, but the front door?
>
> For an example, see
>
> https://patchwork.ozlabs.org/project/netdev/patch/1595417547-18957-3-git-send-email-vikas.si...@puresoftware.com/
>
> It is hard to see what the big picture is here. The [0/2] patch is not
> particularly good. But it makes it clear that people are wanting to
> add fixed-link PHYs into ACPI. These are pseudo devices, used to make
> the MAC think it is connected to a PHY when it is not. The MAC still
> gets informed of link speed, etc via the standard PHYLIB API. They are
> mostly used for when the Ethernet MAC is directly connected to an
> Ethernet Switch, at a MAC to MAC level.
>
> Now i could be wrong, but are Ethernet switches something you expect
> to see on ACPI/SBSA platforms? Or is this a legitimate use of the
> escape hatch?

I think with the rise in adoption of Smart-NICs in datacenters there
will definitely be a lot more crossover between ACPI/SBSA and network
appliance oriented hardware.

-Jon


Re: [PATCH net-next 18/20] net: tipc: kerneldoc fixes

2020-07-13 Thread Jon Maloy




On 7/12/20 7:15 PM, Andrew Lunn wrote:

Simple fixes which require no deep knowledge of the code.

Cc: Jon Maloy 
Cc: Ying Xue 
Signed-off-by: Andrew Lunn 
---
  net/tipc/bearer.c| 2 +-
  net/tipc/discover.c  | 5 ++---
  net/tipc/link.c  | 6 +++---
  net/tipc/msg.c   | 2 +-
  net/tipc/node.c  | 4 ++--
  net/tipc/socket.c| 8 +++-
  net/tipc/udp_media.c | 2 +-
  7 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/net/tipc/bearer.c b/net/tipc/bearer.c
index e366ec9a7e4d..808b147df7d5 100644
--- a/net/tipc/bearer.c
+++ b/net/tipc/bearer.c
@@ -595,7 +595,7 @@ void tipc_bearer_bc_xmit(struct net *net, u32 bearer_id,
  
  /**

   * tipc_l2_rcv_msg - handle incoming TIPC message from an interface
- * @buf: the received packet
+ * @skb: the received message
   * @dev: the net device that the packet was received on
   * @pt: the packet_type structure which was used to register this handler
   * @orig_dev: the original receive net device in case the device is a bond
diff --git a/net/tipc/discover.c b/net/tipc/discover.c
index bfe43da127c0..d4ecacddb40c 100644
--- a/net/tipc/discover.c
+++ b/net/tipc/discover.c
@@ -74,7 +74,7 @@ struct tipc_discoverer {
  /**
   * tipc_disc_init_msg - initialize a link setup message
   * @net: the applicable net namespace
- * @type: message type (request or response)
+ * @mtyp: message type (request or response)
   * @b: ptr to bearer issuing message
   */
  static void tipc_disc_init_msg(struct net *net, struct sk_buff *skb,
@@ -339,7 +339,7 @@ static void tipc_disc_timeout(struct timer_list *t)
   * @net: the applicable net namespace
   * @b: ptr to bearer issuing requests
   * @dest: destination address for request messages
- * @dest_domain: network domain to which links can be established
+ * @skb: pointer to created frame
   *
   * Returns 0 if successful, otherwise -errno.
   */
@@ -393,7 +393,6 @@ void tipc_disc_delete(struct tipc_discoverer *d)
   * tipc_disc_reset - reset object to send periodic link setup requests
   * @net: the applicable net namespace
   * @b: ptr to bearer issuing requests
- * @dest_domain: network domain to which links can be established
   */
  void tipc_disc_reset(struct net *net, struct tipc_bearer *b)
  {
diff --git a/net/tipc/link.c b/net/tipc/link.c
index 1c579357ccdf..6aca0ebb391a 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -445,7 +445,7 @@ u32 tipc_link_state(struct tipc_link *l)
  
  /**

   * tipc_link_create - create a new link
- * @n: pointer to associated node
+ * @net: pointer to associated network namespace
   * @if_name: associated interface name
   * @bearer_id: id (index) of associated bearer
   * @tolerance: link tolerance to be used by link
@@ -530,7 +530,7 @@ bool tipc_link_create(struct net *net, char *if_name, int 
bearer_id,
  
  /**

   * tipc_link_bc_create - create new link to be used for broadcast
- * @n: pointer to associated node
+ * @net: pointer to associated network namespace
   * @mtu: mtu to be used initially if no peers
   * @window: send window to be used
   * @inputq: queue to put messages ready for delivery
@@ -974,7 +974,7 @@ void tipc_link_reset(struct tipc_link *l)
  
  /**

   * tipc_link_xmit(): enqueue buffer list according to queue situation
- * @link: link to use
+ * @l: link to use
   * @list: chain of buffers containing message
   * @xmitq: returned list of packets to be sent by caller
   *
diff --git a/net/tipc/msg.c b/net/tipc/msg.c
index 01b64869a173..848fae674532 100644
--- a/net/tipc/msg.c
+++ b/net/tipc/msg.c
@@ -202,7 +202,7 @@ int tipc_buf_append(struct sk_buff **headbuf, struct 
sk_buff **buf)
  
  /**

   * tipc_msg_append(): Append data to tail of an existing buffer queue
- * @hdr: header to be used
+ * @_hdr: header to be used
   * @m: the data to be appended
   * @mss: max allowable size of buffer
   * @dlen: size of data to be appended
diff --git a/net/tipc/node.c b/net/tipc/node.c
index 030a51c4d1fa..4edcee3088da 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1515,7 +1515,7 @@ static void node_lost_contact(struct tipc_node *n,
   * tipc_node_get_linkname - get the name of a link
   *
   * @bearer_id: id of the bearer
- * @node: peer node address
+ * @addr: peer node address
   * @linkname: link name output buffer
   *
   * Returns 0 on success
@@ -2022,7 +2022,7 @@ static bool tipc_node_check_state(struct tipc_node *n, 
struct sk_buff *skb,
   * tipc_rcv - process TIPC packets/messages arriving from off-node
   * @net: the applicable net namespace
   * @skb: TIPC packet
- * @bearer: pointer to bearer message arrived on
+ * @b: pointer to bearer message arrived on
   *
   * Invoked with no locks held. Bearer pointer must point to a valid bearer
   * structure (i.e. cannot be NULL), but bearer can be inactive.
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index a94f38333698..fc388cef6471 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -711,7 +711,6 @@ static int tipc_bind(struct socket *sock, struct sockaddr 
*uaddr
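All of the hunks above fix the same class of error: the `@name` lines in a kerneldoc comment must match the function's actual parameter names, or the kernel-doc tooling warns about them. A hypothetical minimal example of a correctly documented function in that style (the function itself is invented for illustration):

```c
#include <assert.h>

/**
 * tipc_example_clamp - clamp a window value to a legal range
 * @win: requested window size
 * @max: largest window the bearer supports
 *
 * Each @name above matches a parameter name exactly; writing e.g.
 * "@window:" instead of "@win:" would trigger the kind of kerneldoc
 * warning this patch series fixes.
 *
 * Returns the clamped window.
 */
unsigned int tipc_example_clamp(unsigned int win, unsigned int max)
{
	return win > max ? max : win;
}
```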

Re: [iproute2-next] tipc: fixed a compile warning in tipc/link.c

2020-07-09 Thread Jon Maloy




On 7/9/20 12:25 AM, Hoang Huu Le wrote:

Fixes: 5027f233e35b ("tipc: add link broadcast get")
Signed-off-by: Hoang Huu Le 
---
  tipc/link.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tipc/link.c b/tipc/link.c
index ba77a20152ea..192736eaa154 100644
--- a/tipc/link.c
+++ b/tipc/link.c
@@ -217,7 +217,7 @@ static int cmd_link_get_bcast_cb(const struct nlmsghdr 
*nlh, void *data)
print_string(PRINT_ANY, "method", "%s", "AUTOSELECT");
close_json_object();
open_json_object(NULL);
-   print_uint(PRINT_ANY, "ratio", " ratio:%u%\n",
+   print_uint(PRINT_ANY, "ratio", " ratio:%u\n",
   mnl_attr_get_u32(props[prop_ratio]));
break;
default:

Acked-by: Jon Maloy 



Re: [net-next] tipc: update a binding service via broadcast

2020-06-07 Thread Jon Maloy




On 6/7/20 12:24 AM, Hoang Huu Le wrote:

Currently, updating binding table (add service binding to
name table/withdraw a service binding) is being sent over replicast.
However, if we are scaling up clusters to > 100 nodes/containers this
method is less effective, because it loops through the nodes in a cluster
one by one.

It is worthwhile to use broadcast to update a binding service. This way, the
binding table can be updated on all peer nodes in one shot.

Broadcast is used when all peer nodes, as indicated by a new capability
flag TIPC_NAMED_BCAST, support reception of this message type.

Four problems need to be considered when introducing this feature.
1) When establishing a link to a new peer node we still update this by a
unicast 'bulk' update. This may lead to race conditions, where a later
broadcast publication/withdrawal bypass the 'bulk', resulting in
disordered publications, or even that a withdrawal may arrive before the
corresponding publication. We solve this by adding an 'is_last_bulk' bit
in the last bulk messages so that it can be distinguished from all other
messages. Only when this message has arrived do we open up for reception
of broadcast publications/withdrawals.

Add a line feed between these paragraphs before you send the patch.
Otherwise, still acked by me.

///jon


2) When a first legacy node is added to the cluster all distribution
will switch over to use the legacy 'replicast' method, while the
opposite happens when the last legacy node leaves the cluster. This
entails another risk of message disordering that has to be handled. We
solve this by adding a sequence number to the broadcast/replicast
messages, so that disordering can be discovered and corrected. Note
however that we don't need to consider potential message loss or
duplication at this protocol level.
3) Bulk messages don't contain any sequence numbers, and will always
arrive in order. Hence we must exempt those from the sequence number
control and deliver them unconditionally. We solve this by adding a new
'is_bulk' bit in those messages so that they can be recognized.
4) Legacy messages, which don't contain any new bits or sequence
numbers, but neither can arrive out of order, also need to be exempt
from the initial synchronization and sequence number check, and
delivered unconditionally. Therefore, we add another 'is_not_legacy' bit
to all new messages so that those can be distinguished from legacy
messages and the latter delivered directly.

Signed-off-by: Hoang Huu Le 
Acked-by: Jon Maloy 
---
  net/tipc/bcast.c  |   6 +--
  net/tipc/bcast.h  |   4 +-
  net/tipc/link.c   |   2 +-
  net/tipc/msg.h|  40 
  net/tipc/name_distr.c | 109 +++---
  net/tipc/name_distr.h |   9 ++--
  net/tipc/name_table.c |   9 +++-
  net/tipc/name_table.h |   2 +
  net/tipc/node.c   |  29 ---
  net/tipc/node.h   |   8 ++--
  10 files changed, 170 insertions(+), 48 deletions(-)

diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
index 383f87bc1061..940d176e0e87 100644
--- a/net/tipc/bcast.c
+++ b/net/tipc/bcast.c
@@ -250,8 +250,8 @@ static void tipc_bcast_select_xmit_method(struct net *net, 
int dests,
   * Consumes the buffer chain.
   * Returns 0 if success, otherwise errno: -EHOSTUNREACH,-EMSGSIZE
   */
-static int tipc_bcast_xmit(struct net *net, struct sk_buff_head *pkts,
-  u16 *cong_link_cnt)
+int tipc_bcast_xmit(struct net *net, struct sk_buff_head *pkts,
+   u16 *cong_link_cnt)
  {
struct tipc_link *l = tipc_bc_sndlink(net);
struct sk_buff_head xmitq;
@@ -752,7 +752,7 @@ void tipc_nlist_purge(struct tipc_nlist *nl)
nl->local = false;
  }
  
-u32 tipc_bcast_get_broadcast_mode(struct net *net)

+u32 tipc_bcast_get_mode(struct net *net)
  {
struct tipc_bc_base *bb = tipc_bc_base(net);
  
diff --git a/net/tipc/bcast.h b/net/tipc/bcast.h

index 4240c95188b1..2d9352dc7b0e 100644
--- a/net/tipc/bcast.h
+++ b/net/tipc/bcast.h
@@ -90,6 +90,8 @@ void tipc_bcast_toggle_rcast(struct net *net, bool supp);
  int tipc_mcast_xmit(struct net *net, struct sk_buff_head *pkts,
struct tipc_mc_method *method, struct tipc_nlist *dests,
u16 *cong_link_cnt);
+int tipc_bcast_xmit(struct net *net, struct sk_buff_head *pkts,
+   u16 *cong_link_cnt);
  int tipc_bcast_rcv(struct net *net, struct tipc_link *l, struct sk_buff *skb);
  void tipc_bcast_ack_rcv(struct net *net, struct tipc_link *l,
struct tipc_msg *hdr);
@@ -101,7 +103,7 @@ int tipc_nl_add_bc_link(struct net *net, struct tipc_nl_msg 
*msg,
  int tipc_nl_bc_link_set(struct net *net, struct nlattr *attrs[]);
  int tipc_bclink_reset_stats(struct net *net, struct tipc_link *l);
  
-u32 tipc_bcast_get_broadcast_mode(struct net *net);

+u32 tipc_bcast_get_mode(struct net *net);
  u32 tipc_b
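The four rules in the commit message above reduce to a small amount of receive-side state: bulk and legacy messages are delivered unconditionally, while broadcast/replicast messages are checked against an expected sequence number once the last bulk message has opened the gate. A sketch of that acceptance logic, using hypothetical types and field names rather than the actual TIPC structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct named_msg {
	bool is_bulk;       /* part of the unicast bulk update (rule 3)      */
	bool is_last_bulk;  /* final bulk message: opens the gate (rule 1)   */
	bool is_legacy;     /* from a peer without TIPC_NAMED_BCAST (rule 4) */
	uint16_t seqno;     /* only meaningful for non-bulk, non-legacy msgs */
};

struct named_rcv_state {
	bool open;          /* set once the last bulk message has arrived */
	uint16_t rcv_nxt;   /* next expected broadcast seqno (rule 2)     */
};

/* Return true if the message may be delivered now, per rules 1-4. */
bool named_msg_deliverable(struct named_rcv_state *st,
			   const struct named_msg *msg)
{
	if (msg->is_last_bulk)
		st->open = true;              /* open up for broadcasts */
	if (msg->is_bulk || msg->is_legacy)
		return true;                  /* always in order: deliver */
	if (st->open && msg->seqno == st->rcv_nxt) {
		st->rcv_nxt++;                /* in-sequence broadcast */
		return true;
	}
	return false;                         /* defer: gate closed or disordered */
}
```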

Re: [PATCH net] tipc: block BH before using dst_cache

2020-05-22 Thread Jon Maloy




On 5/22/20 4:10 PM, Eric Dumazet wrote:


On 5/22/20 12:47 PM, Jon Maloy wrote:


On 5/22/20 11:57 AM, Eric Dumazet wrote:

On 5/22/20 8:01 AM, Jon Maloy wrote:

On 5/22/20 2:18 AM, Xin Long wrote:

On Fri, May 22, 2020 at 1:55 PM Eric Dumazet  wrote:

Resend to the list in non HTML form


On Thu, May 21, 2020 at 10:53 PM Eric Dumazet  wrote:

On Thu, May 21, 2020 at 10:50 PM Xin Long  wrote:

On Fri, May 22, 2020 at 2:30 AM Eric Dumazet  wrote:

dst_cache_get() documents it must be used with BH disabled.

Interesting, I thought under rcu_read_lock() is enough, which calls
preempt_disable().

rcu_read_lock() does not disable BH, never.

And rcu_read_lock() does not necessarily disable preemption.

Then I need to think again if it's really worth using dst_cache here.

Also add tipc-discussion and Jon to CC list.

The suggested solution will affect all bearers, not only UDP, so it is not a 
good idea.
Is there anything preventing us from disabling preemption inside the scope of 
the rcu lock?

///jon


BH is disabled any way few nano seconds later, disabling it a bit earlier wont 
make any difference.

The point is that if we only disable inside tipc_udp_xmit() (the function 
pointer call) the change will only affect the UDP bearer, where dst_cache is 
used.
The corresponding calls for the Ethernet and Infiniband bearers don't use 
dst_cache, and don't need this disabling. So it does make a difference.


I honestly do not understand your concern, this makes no sense to me.

I have disabled BH _right_ before the dst_cache_get(cache) call, so has no 
effect if the dst_cache is not used, this should be obvious.
Forget my comment. I thought we were discussing Tetsuo Handa's 
original patch, and missed that you had posted your own.

I have no problems with this one.

///jon



If some other paths do not use dst)cache, how can my patch have any effect on 
them ?

What alternative are you suggesting ?





Re: [PATCH net] tipc: block BH before using dst_cache

2020-05-22 Thread Jon Maloy




On 5/22/20 11:57 AM, Eric Dumazet wrote:


On 5/22/20 8:01 AM, Jon Maloy wrote:


On 5/22/20 2:18 AM, Xin Long wrote:

On Fri, May 22, 2020 at 1:55 PM Eric Dumazet  wrote:

Resend to the list in non HTML form


On Thu, May 21, 2020 at 10:53 PM Eric Dumazet  wrote:


On Thu, May 21, 2020 at 10:50 PM Xin Long  wrote:

On Fri, May 22, 2020 at 2:30 AM Eric Dumazet  wrote:

dst_cache_get() documents it must be used with BH disabled.

Interesting, I thought under rcu_read_lock() is enough, which calls
preempt_disable().

rcu_read_lock() does not disable BH, never.

And rcu_read_lock() does not necessarily disable preemption.

Then I need to think again if it's really worth using dst_cache here.

Also add tipc-discussion and Jon to CC list.

The suggested solution will affect all bearers, not only UDP, so it is not a 
good idea.
Is there anything preventing us from disabling preemption inside the scope of 
the rcu lock?

///jon


BH is disabled any way few nano seconds later, disabling it a bit earlier wont 
make any difference.
The point is that if we only disable inside tipc_udp_xmit() (the 
function pointer call) the change will only affect the UDP bearer, where 
dst_cache is used.
The corresponding calls for the Ethernet and Infiniband bearers don't 
use dst_cache, and don't need this disabling. So it does make a 
difference.

///jon



Also, if you intend to make dst_cache BH reentrant, you will have to make that 
for net-next, not net tree.

Please carefully read include/net/dst_cache.h

It is very clear about BH requirements.






Re: [PATCH net] tipc: block BH before using dst_cache

2020-05-22 Thread Jon Maloy




On 5/22/20 2:18 AM, Xin Long wrote:

On Fri, May 22, 2020 at 1:55 PM Eric Dumazet  wrote:

Resend to the list in non HTML form


On Thu, May 21, 2020 at 10:53 PM Eric Dumazet  wrote:



On Thu, May 21, 2020 at 10:50 PM Xin Long  wrote:

On Fri, May 22, 2020 at 2:30 AM Eric Dumazet  wrote:

dst_cache_get() documents it must be used with BH disabled.

Interesting, I thought under rcu_read_lock() is enough, which calls
preempt_disable().


rcu_read_lock() does not disable BH, never.

And rcu_read_lock() does not necessarily disable preemption.

Then I need to think again if it's really worth using dst_cache here.

Also add tipc-discussion and Jon to CC list.
The suggested solution will affect all bearers, not only UDP, so it is 
not a good idea.
Is there anything preventing us from disabling preemption inside the 
scope of the rcu lock?


///jon



Thanks.




have you checked other places where dst_cache_get() are used?



Yes, other paths are fine.




sysbot reported :

BUG: using smp_processor_id() in preemptible [] code: /21697
caller is dst_cache_get+0x3a/0xb0 net/core/dst_cache.c:68
CPU: 0 PID: 21697 Comm:  Not tainted 5.7.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x188/0x20d lib/dump_stack.c:118
  check_preemption_disabled lib/smp_processor_id.c:47 [inline]
  debug_smp_processor_id.cold+0x88/0x9b lib/smp_processor_id.c:57
  dst_cache_get+0x3a/0xb0 net/core/dst_cache.c:68
  tipc_udp_xmit.isra.0+0xb9/0xad0 net/tipc/udp_media.c:164
  tipc_udp_send_msg+0x3e6/0x490 net/tipc/udp_media.c:244
  tipc_bearer_xmit_skb+0x1de/0x3f0 net/tipc/bearer.c:526
  tipc_enable_bearer+0xb2f/0xd60 net/tipc/bearer.c:331
  __tipc_nl_bearer_enable+0x2bf/0x390 net/tipc/bearer.c:995
  tipc_nl_bearer_enable+0x1e/0x30 net/tipc/bearer.c:1003
  genl_family_rcv_msg_doit net/netlink/genetlink.c:673 [inline]
  genl_family_rcv_msg net/netlink/genetlink.c:718 [inline]
  genl_rcv_msg+0x627/0xdf0 net/netlink/genetlink.c:735
  netlink_rcv_skb+0x15a/0x410 net/netlink/af_netlink.c:2469
  genl_rcv+0x24/0x40 net/netlink/genetlink.c:746
  netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline]
  netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329
  netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918
  sock_sendmsg_nosec net/socket.c:652 [inline]
  sock_sendmsg+0xcf/0x120 net/socket.c:672
  ____sys_sendmsg+0x6bf/0x7e0 net/socket.c:2362
  ___sys_sendmsg+0x100/0x170 net/socket.c:2416
  __sys_sendmsg+0xec/0x1b0 net/socket.c:2449
  do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
  entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x45ca29

Fixes: e9c1a793210f ("tipc: add dst_cache support for udp media")
Cc: Xin Long 
Cc: Jon Maloy 
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
---
  net/tipc/udp_media.c | 6 +-
  1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
index 
d6620ad535461a4d04ed5ba90569ce8b7df9f994..28a283f26a8dff24d613e6ed57e5e69d894dae66
 100644
--- a/net/tipc/udp_media.c
+++ b/net/tipc/udp_media.c
@@ -161,9 +161,11 @@ static int tipc_udp_xmit(struct net *net, struct sk_buff 
*skb,
  struct udp_bearer *ub, struct udp_media_addr *src,
  struct udp_media_addr *dst, struct dst_cache *cache)
  {
-   struct dst_entry *ndst = dst_cache_get(cache);
+   struct dst_entry *ndst;
 int ttl, err = 0;

+   local_bh_disable();
+   ndst = dst_cache_get(cache);
 if (dst->proto == htons(ETH_P_IP)) {
 struct rtable *rt = (struct rtable *)ndst;

@@ -210,9 +212,11 @@ static int tipc_udp_xmit(struct net *net, struct sk_buff 
*skb,
src->port, dst->port, false);
  #endif
 }
+   local_bh_enable();
 return err;

  tx_error:
+   local_bh_enable();
 kfree_skb(skb);
 return err;
  }
--
2.27.0.rc0.183.gde8f92d652-goog
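The bug syzbot caught is a contract violation: dst_cache_get() touches a per-CPU slot, so the caller must be pinned to a CPU (BH disabled) first, which is exactly what the local_bh_disable()/local_bh_enable() pair in the patch restores. A tiny userspace model of such a precondition, with hypothetical names — a counter stands in for the BH-disabled state, and the getter checks its precondition the way debug_smp_processor_id() does:

```c
#include <assert.h>
#include <stdbool.h>

/* Models local_bh_disable()/local_bh_enable(): the per-CPU slot is
 * only stable while this depth is nonzero. */
static int bh_disable_depth;

void model_bh_disable(void) { bh_disable_depth++; }
void model_bh_enable(void)  { bh_disable_depth--; }

static int cache_slot = 42;

/* Models dst_cache_get(): must run with "BH" disabled; reports a
 * violation instead of warning, so it is easy to test. */
int model_cache_get(bool *violated)
{
	*violated = (bh_disable_depth == 0);  /* what syzbot flagged */
	return cache_slot;
}

/* The fixed xmit path: disable only around the cache access. */
int model_xmit(void)
{
	bool violated;
	int v;

	model_bh_disable();
	v = model_cache_get(&violated);
	model_bh_enable();
	return violated ? -1 : v;
}
```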





RE: [patch net-next] net: tipc: prepare attrs in __tipc_nl_compat_dumpit()

2019-10-08 Thread Jon Maloy
Acked. Thanks Jiri.

///jon


> -Original Message-
> From: Jiri Pirko 
> Sent: 8-Oct-19 07:02
> To: netdev@vger.kernel.org
> Cc: da...@davemloft.net; Jon Maloy ;
> ying@windriver.com; johannes.b...@intel.com; mkube...@suse.cz;
> ml...@mellanox.com
> Subject: [patch net-next] net: tipc: prepare attrs in
> __tipc_nl_compat_dumpit()
> 
> From: Jiri Pirko 
> 
> __tipc_nl_compat_dumpit() calls tipc_nl_publ_dump() which expects the
> attrs to be available by genl_dumpit_info(cb)->attrs. Add info struct and attr
> parsing in compat dumpit function.
> 
> Reported-by: syzbot+8d37c50ffb0f52941...@syzkaller.appspotmail.com
> Fixes: 057af7071344 ("net: tipc: have genetlink code to parse the attrs during
> dumpit")
> 
> Signed-off-by: Jiri Pirko 
> ---
>  net/tipc/netlink_compat.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/net/tipc/netlink_compat.c b/net/tipc/netlink_compat.c index
> 4950b754dacd..17a529739f8d 100644
> --- a/net/tipc/netlink_compat.c
> +++ b/net/tipc/netlink_compat.c
> @@ -181,6 +181,7 @@ static int __tipc_nl_compat_dumpit(struct
> tipc_nl_compat_cmd_dump *cmd,
>  struct tipc_nl_compat_msg *msg,
>  struct sk_buff *arg)
>  {
> + struct genl_dumpit_info info;
>   int len = 0;
>   int err;
>   struct sk_buff *buf;
> @@ -191,6 +192,7 @@ static int __tipc_nl_compat_dumpit(struct
> tipc_nl_compat_cmd_dump *cmd,
>   memset(&cb, 0, sizeof(cb));
>   cb.nlh = (struct nlmsghdr *)arg->data;
>   cb.skb = arg;
> + cb.data = &info;
> 
>   buf = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
>   if (!buf)
> @@ -209,6 +211,13 @@ static int __tipc_nl_compat_dumpit(struct
> tipc_nl_compat_cmd_dump *cmd,
>   goto err_out;
>   }
> 
> + info.attrs = attrbuf;
> + err = nlmsg_parse_deprecated(cb.nlh, GENL_HDRLEN, attrbuf,
> +  tipc_genl_family.maxattr,
> +  tipc_genl_family.policy, NULL);
> + if (err)
> + goto err_out;
> +
>   do {
>   int rem;
> 
> --
> 2.21.0
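The shape of the fix above — the compat caller parses the attributes and plants them in a context struct before invoking a callback that expects genl_dumpit_info(cb)->attrs — is a general pattern: a dispatcher that bypasses the normal entry path must satisfy the callback's preconditions itself. A small sketch with hypothetical names (plain integers stand in for netlink attributes):

```c
#include <assert.h>

struct dump_info { const int *attrs; };   /* stands in for genl_dumpit_info */
struct dump_cb   { struct dump_info *data; };

/* Callback written against the normal path: assumes attrs were parsed. */
int dumpit(struct dump_cb *cb)
{
	const int *attrs = cb->data->attrs;
	return attrs ? attrs[0] : -22;        /* -EINVAL if precondition broken */
}

/* Compat caller: must do the parsing the normal path would have done
 * before handing control to the callback. */
int compat_dumpit(const int *raw, int *parsed_buf)
{
	struct dump_info info;
	struct dump_cb cb = { &info };

	parsed_buf[0] = raw[0];               /* "nlmsg_parse" stand-in */
	info.attrs = parsed_buf;              /* the assignment the fix adds */
	return dumpit(&cb);
}
```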



Official request

2019-08-24 Thread Jon




[net-next v2 1/1] tipc: clean up skb list lock handling on send path

2019-08-15 Thread Jon Maloy
The policy for handling the skb list locks on the send and receive paths
is simple.

- On the send path we never need to grab the lock on the 'xmitq' list
  when the destination is an external node.

- On the receive path we always need to grab the lock on the 'inputq'
  list, irrespective of source node.

However, when transmitting node local messages those will eventually
end up on the receive path of a local socket, meaning that the argument
'xmitq' in tipc_node_xmit() will become the 'inputq' argument in the
function tipc_sk_rcv(). This has been handled by always initializing
the spinlock of the 'xmitq' list at message creation, just in case it
may end up on the receive path later, and despite knowing that the lock
in most cases never will be used.

This approach is inaccurate and confusing, and has also concealed the
fact that the stated 'no lock grabbing' policy for the send path is
violated in some cases.

We now clean this up by never initializing the lock at message creation,
instead doing this at the moment we find that the message actually will
enter the receive path. At the same time we fix the four locations
where we incorrectly access the spinlock on the send/error path.

This patch also reverts commit d12cffe9329f ("tipc: ensure head->lock
is initialised") which has now become redundant.

CC: Eric Dumazet 
Reported-by: Chris Packham 
Acked-by: Ying Xue 
Signed-off-by: Jon Maloy 

---
v2: removed more unnecessary lock initializations after feedback
from Xin Long.
---
 net/tipc/bcast.c  | 10 +-
 net/tipc/group.c  |  4 ++--
 net/tipc/link.c   | 14 +++---
 net/tipc/name_distr.c |  2 +-
 net/tipc/node.c   |  7 ---
 net/tipc/socket.c | 14 +++---
 6 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
index 34f3e56..6ef1abd 100644
--- a/net/tipc/bcast.c
+++ b/net/tipc/bcast.c
@@ -185,7 +185,7 @@ static void tipc_bcbase_xmit(struct net *net, struct 
sk_buff_head *xmitq)
}
 
/* We have to transmit across all bearers */
-   skb_queue_head_init(&_xmitq);
+   __skb_queue_head_init(&_xmitq);
for (bearer_id = 0; bearer_id < MAX_BEARERS; bearer_id++) {
if (!bb->dests[bearer_id])
continue;
@@ -256,7 +256,7 @@ static int tipc_bcast_xmit(struct net *net, struct 
sk_buff_head *pkts,
struct sk_buff_head xmitq;
int rc = 0;
 
-   skb_queue_head_init(&xmitq);
+   __skb_queue_head_init(&xmitq);
tipc_bcast_lock(net);
if (tipc_link_bc_peers(l))
rc = tipc_link_xmit(l, pkts, &xmitq);
@@ -286,7 +286,7 @@ static int tipc_rcast_xmit(struct net *net, struct 
sk_buff_head *pkts,
u32 dnode, selector;
 
selector = msg_link_selector(buf_msg(skb_peek(pkts)));
-   skb_queue_head_init(&_pkts);
+   __skb_queue_head_init(&_pkts);
 
list_for_each_entry_safe(dst, tmp, &dests->list, list) {
dnode = dst->node;
@@ -344,7 +344,7 @@ static int tipc_mcast_send_sync(struct net *net, struct 
sk_buff *skb,
msg_set_size(_hdr, MCAST_H_SIZE);
msg_set_is_rcast(_hdr, !msg_is_rcast(hdr));
 
-   skb_queue_head_init(&tmpq);
+   __skb_queue_head_init(&tmpq);
__skb_queue_tail(&tmpq, _skb);
if (method->rcast)
tipc_bcast_xmit(net, &tmpq, cong_link_cnt);
@@ -378,7 +378,7 @@ int tipc_mcast_xmit(struct net *net, struct sk_buff_head 
*pkts,
int rc = 0;
 
skb_queue_head_init(&inputq);
-   skb_queue_head_init(&localq);
+   __skb_queue_head_init(&localq);
 
/* Clone packets before they are consumed by next call */
if (dests->local && !tipc_msg_reassemble(pkts, &localq)) {
diff --git a/net/tipc/group.c b/net/tipc/group.c
index 5f98d38..89257e2 100644
--- a/net/tipc/group.c
+++ b/net/tipc/group.c
@@ -199,7 +199,7 @@ void tipc_group_join(struct net *net, struct tipc_group 
*grp, int *sk_rcvbuf)
struct tipc_member *m, *tmp;
struct sk_buff_head xmitq;
 
-   skb_queue_head_init(&xmitq);
+   __skb_queue_head_init(&xmitq);
rbtree_postorder_for_each_entry_safe(m, tmp, tree, tree_node) {
tipc_group_proto_xmit(grp, m, GRP_JOIN_MSG, &xmitq);
tipc_group_update_member(m, 0);
@@ -435,7 +435,7 @@ bool tipc_group_cong(struct tipc_group *grp, u32 dnode, u32 
dport,
return true;
if (state == MBR_PENDING && adv == ADV_IDLE)
return true;
-   skb_queue_head_init(&xmitq);
+   __skb_queue_head_init(&xmitq);
tipc_group_proto_xmit(grp, m, GRP_ADV_MSG, &xmitq);
tipc_node_distr_xmit(grp->net, &xmitq);
return true;
diff --git a/net/tipc/link.c b/net/tipc/link.c
index dd3155b..289e848 100644
--
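
The distinction the patch relies on — skb_queue_head_init() initializes both the list head and its spinlock, while __skb_queue_head_init() sets up only the list head — can be modeled in a small userspace sketch. This is a hypothetical simplification (a pthread mutex standing in for spinlock_t, a toy struct instead of struct sk_buff_head), not the kernel implementation:

```c
#include <assert.h>
#include <pthread.h>

/* Toy model of the two init variants.  Queues that stay on the send path
 * are never touched concurrently, so initializing the lock is dead weight;
 * queues that reach the receive path need the lock set up before use. */
struct queue_head {
	struct queue_head *next, *prev;
	pthread_mutex_t lock;		/* stands in for spinlock_t */
	int lock_initialized;
};

/* Like __skb_queue_head_init(): list head only, no lock setup. */
static void __queue_head_init(struct queue_head *q)
{
	q->next = q->prev = q;		/* empty circular list */
	q->lock_initialized = 0;
}

/* Like skb_queue_head_init(): list head plus lock. */
static void queue_head_init(struct queue_head *q)
{
	__queue_head_init(q);
	pthread_mutex_init(&q->lock, NULL);
	q->lock_initialized = 1;
}
```

In this model, a send-path 'xmitq' would use __queue_head_init(), and the lock would only be initialized at the point the queue is found to be heading for the receive path — which is exactly the policy the commit message describes.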

RE: [net-next 1/1] tipc: clean up skb list lock handling on send path

2019-08-15 Thread Jon Maloy


> -Original Message-
> From: netdev-ow...@vger.kernel.org  On
> Behalf Of Xin Long
> Sent: 15-Aug-19 01:58
> To: Jon Maloy 
> Cc: da...@davemloft.net; netdev@vger.kernel.org; Tung Quang Nguyen
> ; Hoang Huu Le
> ; shu...@redhat.com; ying xue
> ; eduma...@google.com; tipc-
> discuss...@lists.sourceforge.net
> Subject: Re: [net-next 1/1] tipc: clean up skb list lock handling on send path
> 
> 

[...]

> > /* Try again later if socket is busy */
> > --
> > 2.1.4
> >
> >
> Patch looks good, can you also check those tmp tx queues in:
> 
>   tipc_group_cong()
>   tipc_group_join()
>   tipc_link_create_dummy_tnl_msg()
>   tipc_link_tnl_prepare()
> 
> which are using skb_queue_head_init() to init?
> 
> Thanks.

You are right. I missed those. I'll post a v2 of this patch.
///jon


[net-next 1/1] tipc: clean up skb list lock handling on send path

2019-08-14 Thread Jon Maloy
The policy for handling the skb list locks on the send and receive paths
is simple.

- On the send path we never need to grab the lock on the 'xmitq' list
  when the destination is an external node.

- On the receive path we always need to grab the lock on the 'inputq'
  list, irrespective of source node.

However, when transmitting node local messages those will eventually
end up on the receive path of a local socket, meaning that the argument
'xmitq' in tipc_node_xmit() will become the 'inputq' argument in the
function tipc_sk_rcv(). This has been handled by always initializing
the spinlock of the 'xmitq' list at message creation, just in case it
may end up on the receive path later, and despite knowing that the lock
in most cases never will be used.

This approach is inaccurate and confusing, and has also concealed the
fact that the stated 'no lock grabbing' policy for the send path is
violated in some cases.

We now clean this up by never initializing the lock at message creation,
instead doing this at the moment we find that the message actually will
enter the receive path. At the same time we fix the four locations
where we incorrectly access the spinlock on the send/error path.

This patch also reverts commit d12cffe9329f ("tipc: ensure head->lock
is initialised") which has now become redundant.

CC: Eric Dumazet 
Reported-by: Chris Packham 
Acked-by: Ying Xue 
Signed-off-by: Jon Maloy 
---
 net/tipc/bcast.c  | 10 +-
 net/tipc/link.c   |  4 ++--
 net/tipc/name_distr.c |  2 +-
 net/tipc/node.c   |  7 ---
 net/tipc/socket.c | 14 +++---
 5 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
index 34f3e56..6ef1abd 100644
--- a/net/tipc/bcast.c
+++ b/net/tipc/bcast.c
@@ -185,7 +185,7 @@ static void tipc_bcbase_xmit(struct net *net, struct 
sk_buff_head *xmitq)
}
 
/* We have to transmit across all bearers */
-   skb_queue_head_init(&_xmitq);
+   __skb_queue_head_init(&_xmitq);
for (bearer_id = 0; bearer_id < MAX_BEARERS; bearer_id++) {
if (!bb->dests[bearer_id])
continue;
@@ -256,7 +256,7 @@ static int tipc_bcast_xmit(struct net *net, struct 
sk_buff_head *pkts,
struct sk_buff_head xmitq;
int rc = 0;
 
-   skb_queue_head_init(&xmitq);
+   __skb_queue_head_init(&xmitq);
tipc_bcast_lock(net);
if (tipc_link_bc_peers(l))
rc = tipc_link_xmit(l, pkts, &xmitq);
@@ -286,7 +286,7 @@ static int tipc_rcast_xmit(struct net *net, struct 
sk_buff_head *pkts,
u32 dnode, selector;
 
selector = msg_link_selector(buf_msg(skb_peek(pkts)));
-   skb_queue_head_init(&_pkts);
+   __skb_queue_head_init(&_pkts);
 
list_for_each_entry_safe(dst, tmp, &dests->list, list) {
dnode = dst->node;
@@ -344,7 +344,7 @@ static int tipc_mcast_send_sync(struct net *net, struct 
sk_buff *skb,
msg_set_size(_hdr, MCAST_H_SIZE);
msg_set_is_rcast(_hdr, !msg_is_rcast(hdr));
 
-   skb_queue_head_init(&tmpq);
+   __skb_queue_head_init(&tmpq);
__skb_queue_tail(&tmpq, _skb);
if (method->rcast)
tipc_bcast_xmit(net, &tmpq, cong_link_cnt);
@@ -378,7 +378,7 @@ int tipc_mcast_xmit(struct net *net, struct sk_buff_head 
*pkts,
int rc = 0;
 
skb_queue_head_init(&inputq);
-   skb_queue_head_init(&localq);
+   __skb_queue_head_init(&localq);
 
/* Clone packets before they are consumed by next call */
if (dests->local && !tipc_msg_reassemble(pkts, &localq)) {
diff --git a/net/tipc/link.c b/net/tipc/link.c
index dd3155b..ba057a9 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -959,7 +959,7 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head 
*list,
pr_warn("Too large msg, purging xmit list %d %d %d %d %d!\n",
skb_queue_len(list), msg_user(hdr),
msg_type(hdr), msg_size(hdr), mtu);
-   skb_queue_purge(list);
+   __skb_queue_purge(list);
return -EMSGSIZE;
}
 
@@ -988,7 +988,7 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head 
*list,
if (likely(skb_queue_len(transmq) < maxwin)) {
_skb = skb_clone(skb, GFP_ATOMIC);
if (!_skb) {
-   skb_queue_purge(list);
+   __skb_queue_purge(list);
return -ENOBUFS;
}
__skb_dequeue(list);
diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
index 44abc8e..61219f0 100644
--- a/net/tipc/name_distr.c
+++ b/net/tipc/name_distr.c
@@ -190,7 +190,7 @@

[net 1/1] tipc: fix unitilized skb list crash

2019-07-30 Thread Jon Maloy
Our test suite sometimes provokes the following crash:

Description of problem:
[ 1092.597234] BUG: unable to handle kernel NULL pointer dereference at 
00e8
[ 1092.605072] PGD 0 P4D 0
[ 1092.607620] Oops:  [#1] SMP PTI
[ 1092.68] CPU: 37 PID: 0 Comm: swapper/37 Kdump: loaded Not tainted 
4.18.0-122.el8.x86_64 #1
[ 1092.619724] Hardware name: Dell Inc. PowerEdge R740/08D89F, BIOS 1.3.7 
02/08/2018
[ 1092.627215] RIP: 0010:tipc_mcast_filter_msg+0x93/0x2d0 [tipc]
[ 1092.632955] Code: 0f 84 aa 01 00 00 89 cf 4d 01 ca 4c 8b 26 c1 ef 19 83 e7 
0f 83 ff 0c 4d 0f 45 d1 41 8b 6a 10 0f cd 4c 39 e6 0f 84 81 01 00 00 <4d> 8b 9c 
24 e8 00 00 00 45 8b 13 41 0f ca 44 89 d7 c1 ef 13 83 e7
[ 1092.651703] RSP: 0018:929e5fa83a18 EFLAGS: 00010282
[ 1092.656927] RAX: 929e3fb38100 RBX: 069f29ee RCX: 416c0045
[ 1092.664058] RDX: 929e5fa83a88 RSI: 929e31a28420 RDI: 
[ 1092.671209] RBP: 29b11821 R08:  R09: 929e39b4407a
[ 1092.678343] R10: 929e39b4407a R11: 0007 R12: 
[ 1092.685475] R13: 0001 R14: 929e3fb38100 R15: 929e39b4407a
[ 1092.692614] FS:  () GS:929e5fa8() 
knlGS:
[ 1092.700702] CS:  0010 DS:  ES:  CR0: 80050033
[ 1092.706447] CR2: 00e8 CR3: 00031300a004 CR4: 007606e0
[ 1092.713579] DR0:  DR1:  DR2: 
[ 1092.720712] DR3:  DR6: fffe0ff0 DR7: 0400
[ 1092.727843] PKRU: 5554
[ 1092.730556] Call Trace:
[ 1092.733010]  
[ 1092.735034]  tipc_sk_filter_rcv+0x7ca/0xb80 [tipc]
[ 1092.739828]  ? __kmalloc_node_track_caller+0x1cb/0x290
[ 1092.744974]  ? dev_hard_start_xmit+0xa5/0x210
[ 1092.749332]  tipc_sk_rcv+0x389/0x640 [tipc]
[ 1092.753519]  tipc_sk_mcast_rcv+0x23c/0x3a0 [tipc]
[ 1092.758224]  tipc_rcv+0x57a/0xf20 [tipc]
[ 1092.762154]  ? ktime_get_real_ts64+0x40/0xe0
[ 1092.766432]  ? tpacket_rcv+0x50/0x9f0
[ 1092.770098]  tipc_l2_rcv_msg+0x4a/0x70 [tipc]
[ 1092.774452]  __netif_receive_skb_core+0xb62/0xbd0
[ 1092.779164]  ? enqueue_entity+0xf6/0x630
[ 1092.783084]  ? kmem_cache_alloc+0x158/0x1c0
[ 1092.787272]  ? __build_skb+0x25/0xd0
[ 1092.790849]  netif_receive_skb_internal+0x42/0xf0
[ 1092.795557]  napi_gro_receive+0xba/0xe0
[ 1092.799417]  mlx5e_handle_rx_cqe+0x83/0xd0 [mlx5_core]
[ 1092.804564]  mlx5e_poll_rx_cq+0xd5/0x920 [mlx5_core]
[ 1092.809536]  mlx5e_napi_poll+0xb2/0xce0 [mlx5_core]
[ 1092.814415]  ? __wake_up_common_lock+0x89/0xc0
[ 1092.818861]  net_rx_action+0x149/0x3b0
[ 1092.822616]  __do_softirq+0xe3/0x30a
[ 1092.826193]  irq_exit+0x100/0x110
[ 1092.829512]  do_IRQ+0x85/0xd0
[ 1092.832483]  common_interrupt+0xf/0xf
[ 1092.836147]  
[ 1092.838255] RIP: 0010:cpuidle_enter_state+0xb7/0x2a0
[ 1092.843221] Code: e8 3e 79 a5 ff 80 7c 24 03 00 74 17 9c 58 0f 1f 44 00 00 
f6 c4 02 0f 85 d7 01 00 00 31 ff e8 a0 6b ab ff fb 66 0f 1f 44 00 00 <48> b8 ff 
ff ff ff f3 01 00 00 4c 29 f3 ba ff ff ff 7f 48 39 c3 7f
[ 1092.861967] RSP: 0018:aa5ec6533e98 EFLAGS: 0246 ORIG_RAX: 
ffdd
[ 1092.869530] RAX: 929e5faa3100 RBX: 00fe63dd2092 RCX: 001f
[ 1092.876665] RDX: 00fe63dd2092 RSI: 3a518aaa RDI: 
[ 1092.883795] RBP: 0003 R08: 0004 R09: 00022940
[ 1092.890929] R10: 040cb0666b56 R11: 929e5faa20a8 R12: 929e5faade78
[ 1092.898060] R13: b59258f8 R14: 00fe60f3228d R15: 
[ 1092.905196]  ? cpuidle_enter_state+0x92/0x2a0
[ 1092.909555]  do_idle+0x236/0x280
[ 1092.912785]  cpu_startup_entry+0x6f/0x80
[ 1092.916715]  start_secondary+0x1a7/0x200
[ 1092.920642]  secondary_startup_64+0xb7/0xc0
[...]

The reason is that the skb list tipc_socket::mc_method.deferredq is only
initialized for connectionless sockets, while nothing stops arriving
multicast messages from being filtered by connection-oriented sockets,
with subsequent access to the said list.

We fix this by initializing the list unconditionally at socket creation.
This eliminates the crash, while the message still is dropped further
down in tipc_sk_filter_rcv() as it should be.

Reported-by: Li Shuang 
Signed-off-by: Jon Maloy 
---
 net/tipc/socket.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index dd8537f..83ae41d 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -485,9 +485,8 @@ static int tipc_sk_create(struct net *net, struct socket 
*sock,
tsk_set_unreturnable(tsk, true);
if (sock->type == SOCK_DGRAM)
tsk_set_unreliable(tsk, true);
-   __skb_queue_head_init(&tsk->mc_method.deferredq);
}
-
+   __skb_queue_head_init(&tsk->mc_method.deferredq);
trace_tipc_sk_create(sk, NULL, TIPC_DUMP_NONE, " ");
return 0;
 }
-- 
2.1.4



[net-next 1/1] tipc: reduce risk of wakeup queue starvation

2019-07-30 Thread Jon Maloy
In commit 365ad353c256 ("tipc: reduce risk of user starvation during
link congestion") we allowed senders to add exactly one list of extra
buffers to the link backlog queues during link congestion (aka
"oversubscription"). However, the criteria for when to stop adding
wakeup messages to the input queue when the overload abates is
inaccurate, and may cause starvation problems during very high load.

Currently, we stop adding wakeup messages after 10 total failed attempts
where we find that there is no space left in the backlog queue for a
certain importance level. The counter for this is accumulated across all
levels, which may lead the algorithm to leave the loop prematurely,
although there may still be plenty of space available at some levels.
The result is sometimes that messages near the wakeup queue tail are not
added to the input queue as they should be.

We now introduce a more exact algorithm, where we keep adding wakeup
messages to a level as long as the backlog queue has free slots for
the corresponding level, and stop at the moment there are no more such
slots or when there are no more wakeup messages to dequeue.

Fixes: 365ad35 ("tipc: reduce risk of user starvation during link congestion")
Reported-by: Tung Nguyen 
Acked-by: Ying Xue 
Signed-off-by: Jon Maloy 
---
 net/tipc/link.c | 29 +
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/net/tipc/link.c b/net/tipc/link.c
index 2c27477..dd3155b 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -854,18 +854,31 @@ static int link_schedule_user(struct tipc_link *l, struct 
tipc_msg *hdr)
  */
 static void link_prepare_wakeup(struct tipc_link *l)
 {
+   struct sk_buff_head *wakeupq = &l->wakeupq;
+   struct sk_buff_head *inputq = l->inputq;
struct sk_buff *skb, *tmp;
-   int imp, i = 0;
+   struct sk_buff_head tmpq;
+   int avail[5] = {0,};
+   int imp = 0;
+
+   __skb_queue_head_init(&tmpq);
 
-   skb_queue_walk_safe(&l->wakeupq, skb, tmp) {
+   for (; imp <= TIPC_SYSTEM_IMPORTANCE; imp++)
+   avail[imp] = l->backlog[imp].limit - l->backlog[imp].len;
+
+   skb_queue_walk_safe(wakeupq, skb, tmp) {
imp = TIPC_SKB_CB(skb)->chain_imp;
-   if (l->backlog[imp].len < l->backlog[imp].limit) {
-   skb_unlink(skb, &l->wakeupq);
-   skb_queue_tail(l->inputq, skb);
-   } else if (i++ > 10) {
-   break;
-   }
+   if (avail[imp] <= 0)
+   continue;
+   avail[imp]--;
+   __skb_unlink(skb, wakeupq);
+   __skb_queue_tail(&tmpq, skb);
}
+
+   spin_lock_bh(&inputq->lock);
+   skb_queue_splice_tail(&tmpq, inputq);
+   spin_unlock_bh(&inputq->lock);
+
 }
 
 void tipc_link_reset(struct tipc_link *l)
-- 
2.1.4
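
The per-level dequeue policy described in the commit message above — compute the free backlog slots per importance level first, then move at most that many wakeup entries per level, skipping saturated levels without penalizing the others — can be sketched in plain C. This is a toy userspace model (arrays of importance values instead of struct sk_buff_head), not the kernel code:

```c
#include <assert.h>

#define LEVELS 5	/* TIPC_LOW .. TIPC_SYSTEM_IMPORTANCE in the real code */

/* Move wakeup entries (given as an array of importance levels) into the
 * input queue while their level still has free backlog slots.  Entries at
 * a saturated level are skipped rather than counted against a shared
 * failure counter, which is the fix over the old 10-attempt cutoff. */
static int prepare_wakeup(const int *wakeup, int n,
			  const int *limit, const int *len,
			  int *moved /* out: n flags */)
{
	int avail[LEVELS];
	int imp, i, count = 0;

	for (imp = 0; imp < LEVELS; imp++)
		avail[imp] = limit[imp] - len[imp];

	for (i = 0; i < n; i++) {
		imp = wakeup[i];
		moved[i] = 0;
		if (avail[imp] <= 0)
			continue;	/* old code: bump shared counter, maybe bail */
		avail[imp]--;
		moved[i] = 1;
		count++;
	}
	return count;
}
```

With limits {2, 1, ...} and five pending entries at levels {0, 0, 0, 1, 1}, this moves exactly two level-0 entries and one level-1 entry — later entries at a full level no longer block entries at levels that still have room.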



[net 1/1] tipc: initialize 'validated' field of received packets

2019-07-17 Thread Jon Maloy
The tipc_msg_validate() function leaves a boolean flag 'validated' in
the validated buffer's control block, to avoid performing this action
more than once. However, at reception of new packets, the position of
this field may already have been set by lower layer protocols, so
that the packet is erroneously perceived as already validated by TIPC.

We fix this by initializing the said field to 'false' before performing
the initial validation.

Signed-off-by: Jon Maloy 
---
 net/tipc/node.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/tipc/node.c b/net/tipc/node.c
index 324a1f9..3a5be1d 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1807,6 +1807,7 @@ void tipc_rcv(struct net *net, struct sk_buff *skb, 
struct tipc_bearer *b)
__skb_queue_head_init(&xmitq);
 
/* Ensure message is well-formed before touching the header */
+   TIPC_SKB_CB(skb)->validated = false;
if (unlikely(!tipc_msg_validate(&skb)))
goto discard;
hdr = buf_msg(skb);
-- 
2.1.4
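
The bug pattern above — trusting a cached flag in per-buffer scratch space that a lower layer may already have scribbled on — reduces to a few lines of C. This is a hypothetical illustration (a toy struct; the kernel's TIPC_SKB_CB() overlays skb->cb), not the actual TIPC code:

```c
#include <assert.h>
#include <string.h>

/* Model of a buffer control block shared by several protocol layers,
 * each overlaying its own struct on the same scratch bytes. */
struct buf {
	unsigned char cb[48];	/* like skb->cb: uninitialized scratch */
};

struct tipc_cb {
	int validated;		/* cached "header already checked" flag */
};

#define TIPC_CB(b) ((struct tipc_cb *)(b)->cb)

static int header_ok;		/* stand-in for the real header checks */

static int msg_validate(struct buf *b)
{
	if (TIPC_CB(b)->validated)
		return 1;	/* skip re-checking: only safe if WE set it */
	if (!header_ok)
		return 0;
	TIPC_CB(b)->validated = 1;
	return 1;
}

/* The fix: clear the flag before the first validation of a fresh buffer,
 * so garbage left by a lower layer cannot masquerade as "validated". */
static int rcv(struct buf *b)
{
	TIPC_CB(b)->validated = 0;
	return msg_validate(b);
}
```

If a lower layer left nonzero bytes where 'validated' lives, calling msg_validate() directly would wave a malformed packet through; clearing the flag first, as the one-line patch does, forces the real check to run.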



[net-next 1/1] tipc: embed jiffies in macro TIPC_BC_RETR_LIM

2019-06-28 Thread Jon Maloy
The macro TIPC_BC_RETR_LIM is always used in combination with 'jiffies',
so we can just as well perform the addition in the macro itself. This
way, we get a few shorter code lines and one less line break.

Signed-off-by: Jon Maloy 
---
 net/tipc/link.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/net/tipc/link.c b/net/tipc/link.c
index f8bf63b..66d3a07 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -207,7 +207,7 @@ enum {
BC_NACK_SND_SUPPRESS,
 };
 
-#define TIPC_BC_RETR_LIM msecs_to_jiffies(10)   /* [ms] */
+#define TIPC_BC_RETR_LIM  (jiffies + msecs_to_jiffies(10))
 #define TIPC_UC_RETR_TIME (jiffies + msecs_to_jiffies(1))
 
 /*
@@ -976,8 +976,7 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head 
*list,
__skb_queue_tail(transmq, skb);
/* next retransmit attempt */
if (link_is_bc_sndlink(l))
-   TIPC_SKB_CB(skb)->nxt_retr =
-   jiffies + TIPC_BC_RETR_LIM;
+   TIPC_SKB_CB(skb)->nxt_retr = TIPC_BC_RETR_LIM;
__skb_queue_tail(xmitq, _skb);
TIPC_SKB_CB(skb)->ackers = l->ackers;
l->rcv_unacked = 0;
@@ -1027,7 +1026,7 @@ static void tipc_link_advance_backlog(struct tipc_link *l,
__skb_queue_tail(&l->transmq, skb);
/* next retransmit attempt */
if (link_is_bc_sndlink(l))
-   TIPC_SKB_CB(skb)->nxt_retr = jiffies + TIPC_BC_RETR_LIM;
+   TIPC_SKB_CB(skb)->nxt_retr = TIPC_BC_RETR_LIM;
 
__skb_queue_tail(xmitq, _skb);
TIPC_SKB_CB(skb)->ackers = l->ackers;
@@ -1123,7 +1122,7 @@ static int tipc_link_bc_retrans(struct tipc_link *l, 
struct tipc_link *r,
if (link_is_bc_sndlink(l)) {
if (time_before(jiffies, TIPC_SKB_CB(skb)->nxt_retr))
continue;
-   TIPC_SKB_CB(skb)->nxt_retr = jiffies + TIPC_BC_RETR_LIM;
+   TIPC_SKB_CB(skb)->nxt_retr = TIPC_BC_RETR_LIM;
}
_skb = __pskb_copy(skb, LL_MAX_HEADER + MIN_H_SIZE, GFP_ATOMIC);
if (!_skb)
-- 
2.1.4



[net-next 1/1] tipc: rename function msg_get_wrapped() to msg_inner_hdr()

2019-06-25 Thread Jon Maloy
We rename the inline function msg_get_wrapped() to the more
comprehensible msg_inner_hdr().

Signed-off-by: Jon Maloy 
---
 net/tipc/bcast.c | 4 ++--
 net/tipc/link.c  | 2 +-
 net/tipc/msg.h   | 4 ++--
 net/tipc/node.c  | 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
index 6c997d4..1336f3c 100644
--- a/net/tipc/bcast.c
+++ b/net/tipc/bcast.c
@@ -323,7 +323,7 @@ static int tipc_mcast_send_sync(struct net *net, struct 
sk_buff *skb,
 
hdr = buf_msg(skb);
if (msg_user(hdr) == MSG_FRAGMENTER)
-   hdr = msg_get_wrapped(hdr);
+   hdr = msg_inner_hdr(hdr);
if (msg_type(hdr) != TIPC_MCAST_MSG)
return 0;
 
@@ -392,7 +392,7 @@ int tipc_mcast_xmit(struct net *net, struct sk_buff_head 
*pkts,
skb = skb_peek(pkts);
hdr = buf_msg(skb);
if (msg_user(hdr) == MSG_FRAGMENTER)
-   hdr = msg_get_wrapped(hdr);
+   hdr = msg_inner_hdr(hdr);
msg_set_is_rcast(hdr, method->rcast);
 
/* Switch method ? */
diff --git a/net/tipc/link.c b/net/tipc/link.c
index aa79bf8..f8bf63b 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -732,7 +732,7 @@ static void link_profile_stats(struct tipc_link *l)
if (msg_user(msg) == MSG_FRAGMENTER) {
if (msg_type(msg) != FIRST_FRAGMENT)
return;
-   length = msg_size(msg_get_wrapped(msg));
+   length = msg_size(msg_inner_hdr(msg));
}
l->stats.msg_lengths_total += length;
l->stats.msg_length_counts++;
diff --git a/net/tipc/msg.h b/net/tipc/msg.h
index 8de02ad..da509f0 100644
--- a/net/tipc/msg.h
+++ b/net/tipc/msg.h
@@ -308,7 +308,7 @@ static inline unchar *msg_data(struct tipc_msg *m)
return ((unchar *)m) + msg_hdr_sz(m);
 }
 
-static inline struct tipc_msg *msg_get_wrapped(struct tipc_msg *m)
+static inline struct tipc_msg *msg_inner_hdr(struct tipc_msg *m)
 {
return (struct tipc_msg *)msg_data(m);
 }
@@ -486,7 +486,7 @@ static inline void msg_set_prevnode(struct tipc_msg *m, u32 
a)
 static inline u32 msg_origport(struct tipc_msg *m)
 {
if (msg_user(m) == MSG_FRAGMENTER)
-   m = msg_get_wrapped(m);
+   m = msg_inner_hdr(m);
return msg_word(m, 4);
 }
 
diff --git a/net/tipc/node.c b/net/tipc/node.c
index 550581d..324a1f9 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1649,7 +1649,7 @@ static bool tipc_node_check_state(struct tipc_node *n, 
struct sk_buff *skb,
int usr = msg_user(hdr);
int mtyp = msg_type(hdr);
u16 oseqno = msg_seqno(hdr);
-   u16 iseqno = msg_seqno(msg_get_wrapped(hdr));
+   u16 iseqno = msg_seqno(msg_inner_hdr(hdr));
u16 exp_pkts = msg_msgcnt(hdr);
u16 rcv_nxt, syncpt, dlv_nxt, inputq_len;
int state = n->state;
-- 
2.1.4



[net-next 1/1] tipc: eliminate unnecessary skb expansion during retransmission

2019-06-25 Thread Jon Maloy
We increase the allocated headroom for the buffer copies to be
retransmitted. This eliminates the need for the lower stack levels
(UDP/IP/L2) to expand the headroom in order to add their own headers.

Signed-off-by: Jon Maloy 
---
 net/tipc/link.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tipc/link.c b/net/tipc/link.c
index af50b53..aa79bf8 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -1125,7 +1125,7 @@ static int tipc_link_bc_retrans(struct tipc_link *l, 
struct tipc_link *r,
continue;
TIPC_SKB_CB(skb)->nxt_retr = jiffies + TIPC_BC_RETR_LIM;
}
-   _skb = __pskb_copy(skb, MIN_H_SIZE, GFP_ATOMIC);
+   _skb = __pskb_copy(skb, LL_MAX_HEADER + MIN_H_SIZE, GFP_ATOMIC);
if (!_skb)
return 0;
hdr = buf_msg(_skb);
-- 
2.1.4



[net-next 1/1] tipc: simplify stale link failure criteria

2019-06-25 Thread Jon Maloy
In commit a4dc70d46cf1 ("tipc: extend link reset criteria for stale
packet retransmission") we made link retransmission failure events
dependent on the link tolerance, and not only on the number of failed
retransmission attempts, as we did earlier. This works well. However,
keeping the original, additional criterion of 99 failed retransmissions
is now redundant, and may in some cases lead to failure detection
times in the order of minutes instead of the expected 1.5 sec link
tolerance value.

We now remove this criterion altogether.

Acked-by: Ying Xue 
Signed-off-by: Jon Maloy 
---
 net/tipc/link.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/net/tipc/link.c b/net/tipc/link.c
index bcfb0a4..af50b53 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -107,7 +107,6 @@ struct tipc_stats {
  * @backlogq: queue for messages waiting to be sent
  * @snt_nxt: next sequence number to use for outbound messages
  * @prev_from: sequence number of most previous retransmission request
- * @stale_cnt: counter for number of identical retransmit attempts
  * @stale_limit: time when repeated identical retransmits must force link reset
  * @ackers: # of peers that needs to ack each packet before it can be released
  * @acked: # last packet acked by a certain peer. Used for broadcast.
@@ -167,7 +166,6 @@ struct tipc_link {
u16 snd_nxt;
u16 prev_from;
u16 window;
-   u16 stale_cnt;
unsigned long stale_limit;
 
/* Reception */
@@ -910,7 +908,6 @@ void tipc_link_reset(struct tipc_link *l)
l->acked = 0;
l->silent_intv_cnt = 0;
l->rst_cnt = 0;
-   l->stale_cnt = 0;
l->bc_peer_is_up = false;
memset(&l->mon_state, 0, sizeof(l->mon_state));
tipc_link_reset_stats(l);
@@ -1068,8 +1065,7 @@ static bool link_retransmit_failure(struct tipc_link *l, 
struct tipc_link *r,
if (r->prev_from != from) {
r->prev_from = from;
r->stale_limit = jiffies + msecs_to_jiffies(r->tolerance);
-   r->stale_cnt = 0;
-   } else if (++r->stale_cnt > 99 && time_after(jiffies, r->stale_limit)) {
+   } else if (time_after(jiffies, r->stale_limit)) {
pr_warn("Retransmission failure on link <%s>\n", l->name);
link_print(l, "State of link ");
pr_info("Failed msg: usr %u, typ %u, len %u, err %u\n",
@@ -1515,7 +1511,6 @@ int tipc_link_rcv(struct tipc_link *l, struct sk_buff 
*skb,
 
/* Forward queues and wake up waiting users */
if (likely(tipc_link_release_pkts(l, msg_ack(hdr {
-   l->stale_cnt = 0;
tipc_link_advance_backlog(l, xmitq);
if (unlikely(!skb_queue_empty(&l->wakeupq)))
link_prepare_wakeup(l);
@@ -2584,7 +2579,7 @@ int tipc_link_dump(struct tipc_link *l, u16 dqueues, char 
*buf)
i += scnprintf(buf + i, sz - i, " %u", l->silent_intv_cnt);
i += scnprintf(buf + i, sz - i, " %u", l->rst_cnt);
i += scnprintf(buf + i, sz - i, " %u", l->prev_from);
-   i += scnprintf(buf + i, sz - i, " %u", l->stale_cnt);
+   i += scnprintf(buf + i, sz - i, " %u", 0);
i += scnprintf(buf + i, sz - i, " %u", l->acked);
 
list = &l->transmq;
-- 
2.1.4
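
The simplified criterion — re-arm a deadline when a retransmit request targets a new packet, and only declare link failure once the tolerance has elapsed for repeated requests on the same packet — can be modeled in a few lines of C. This is a toy userspace sketch (plain integers for jiffies, no packet-count threshold), not the kernel function:

```c
#include <assert.h>

typedef unsigned long jiffies_t;

struct link {
	unsigned prev_from;	/* packet the last request targeted */
	jiffies_t stale_limit;	/* deadline for declaring failure */
	int primed;
};

/* Returns nonzero when the link should be reset: the same packet has been
 * requested for retransmission for longer than the link tolerance. */
static int retransmit_failure(struct link *l, unsigned from,
			      jiffies_t now, jiffies_t tolerance)
{
	if (!l->primed || l->prev_from != from) {
		l->prev_from = from;
		l->stale_limit = now + tolerance;	/* re-arm deadline */
		l->primed = 1;
		return 0;
	}
	return now > l->stale_limit;	/* like time_after(jiffies, limit) */
}
```

With a 1500 ms tolerance, repeated requests for the same packet trip the failure only after 1.5 s — never minutes — and a request for a different packet simply re-arms the deadline, matching the behavior the patch aims for.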



RE: [net-next v2] tipc: add loopback device tracking

2019-06-24 Thread Jon Maloy



> -Original Message-
> From: netdev-ow...@vger.kernel.org  On
> Behalf Of David Miller
> Sent: 24-Jun-19 10:29
> To: John Rutherford 
> Cc: netdev@vger.kernel.org
> Subject: Re: [net-next v2] tipc: add loopback device tracking
> 
> From: john.rutherf...@dektech.com.au
> Date: Mon, 24 Jun 2019 16:44:35 +1000
> 
> > Since node internal messages are passed directly to socket it is not
> > possible to observe this message exchange via tcpdump or wireshark.
> >
> > We now remedy this by making it possible to clone such messages and
> > send the clones to the loopback interface.  The clones are dropped at
> > reception and have no functional role except making the traffic visible.
> >
> > The feature is turned on/off by enabling/disabling the loopback "bearer"
> > "eth:lo".
> >
> > Acked-by: Jon Maloy 
> > Signed-off-by: John Rutherford 
> 
> What a waste, just clone the packet, attach loopback to it, and go:
> 
>   if (dev_nit_active(loopback_dev))
>   dev_queue_xmit_nit(skb, loopback_dev);

I was never quite happy with this patch, so thank you for the feedback!

///jon



RE: [PATCH net] tipc: add dst_cache support for udp media

2019-06-20 Thread Jon Maloy
Acked-by: Jon Maloy 

> -Original Message-
> From: netdev-ow...@vger.kernel.org  On
> Behalf Of Xin Long
> Sent: 20-Jun-19 07:04
> To: network dev 
> Cc: da...@davemloft.net; Jon Maloy ; Ying Xue
> ; tipc-discuss...@lists.sourceforge.net; Paolo
> Abeni 
> Subject: [PATCH net] tipc: add dst_cache support for udp media
> 
> As other udp/ip tunnels do, tipc udp media should also have a lockless
> dst_cache supported on its tx path.
> 
> Here we add dst_cache into udp_replicast to support dst cache for both
> rmcast and rcast, and rmcast uses ub->rcast and each rcast uses its own node
> in ub->rcast.list.
> 
> Signed-off-by: Xin Long 
> ---
>  net/tipc/udp_media.c | 72 ++---
> ---
>  1 file changed, 47 insertions(+), 25 deletions(-)
> 
> diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c index
> 1405ccc..b8962df 100644
> --- a/net/tipc/udp_media.c
> +++ b/net/tipc/udp_media.c
> @@ -76,6 +76,7 @@ struct udp_media_addr {
>  /* struct udp_replicast - container for UDP remote addresses */  struct
> udp_replicast {
>   struct udp_media_addr addr;
> + struct dst_cache dst_cache;
>   struct rcu_head rcu;
>   struct list_head list;
>  };
> @@ -158,22 +159,27 @@ static int tipc_udp_addr2msg(char *msg, struct
> tipc_media_addr *a)
>  /* tipc_send_msg - enqueue a send request */  static int tipc_udp_xmit(struct
> net *net, struct sk_buff *skb,
>struct udp_bearer *ub, struct udp_media_addr *src,
> -  struct udp_media_addr *dst)
> +  struct udp_media_addr *dst, struct dst_cache *cache)
>  {
> + struct dst_entry *ndst = dst_cache_get(cache);
>   int ttl, err = 0;
> - struct rtable *rt;
> 
>   if (dst->proto == htons(ETH_P_IP)) {
> - struct flowi4 fl = {
> - .daddr = dst->ipv4.s_addr,
> - .saddr = src->ipv4.s_addr,
> - .flowi4_mark = skb->mark,
> - .flowi4_proto = IPPROTO_UDP
> - };
> - rt = ip_route_output_key(net, &fl);
> - if (IS_ERR(rt)) {
> - err = PTR_ERR(rt);
> - goto tx_error;
> + struct rtable *rt = (struct rtable *)ndst;
> +
> + if (!rt) {
> + struct flowi4 fl = {
> + .daddr = dst->ipv4.s_addr,
> + .saddr = src->ipv4.s_addr,
> + .flowi4_mark = skb->mark,
> + .flowi4_proto = IPPROTO_UDP
> + };
> + rt = ip_route_output_key(net, &fl);
> + if (IS_ERR(rt)) {
> + err = PTR_ERR(rt);
> + goto tx_error;
> + }
> + dst_cache_set_ip4(cache, &rt->dst, fl.saddr);
>   }
> 
>   ttl = ip4_dst_hoplimit(&rt->dst);
> @@ -182,17 +188,19 @@ static int tipc_udp_xmit(struct net *net, struct
> sk_buff *skb,
>   dst->port, false, true);
>  #if IS_ENABLED(CONFIG_IPV6)
>   } else {
> - struct dst_entry *ndst;
> - struct flowi6 fl6 = {
> - .flowi6_oif = ub->ifindex,
> - .daddr = dst->ipv6,
> - .saddr = src->ipv6,
> - .flowi6_proto = IPPROTO_UDP
> - };
> - err = ipv6_stub->ipv6_dst_lookup(net, ub->ubsock->sk, &ndst,
> -  &fl6);
> - if (err)
> - goto tx_error;
> + if (!ndst) {
> + struct flowi6 fl6 = {
> + .flowi6_oif = ub->ifindex,
> + .daddr = dst->ipv6,
> + .saddr = src->ipv6,
> + .flowi6_proto = IPPROTO_UDP
> + };
> + err = ipv6_stub->ipv6_dst_lookup(net, ub->ubsock->sk,
> +  &ndst, &fl6);
> + if (err)
> + goto tx_error;
> + dst_cache_set_ip6(cache, ndst, &fl6.saddr);
> + }
>   ttl = ip6_dst_hoplimit(ndst);
>   err = udp_tunnel6_xmit_skb(ndst, ub->ubsock->sk, skb, NULL,
>  &src->ipv6, &dst->ipv6, 0, ttl, 0, 
> @@ -230,7
> +238,8 @@ sta

RE: [PATCH net] tipc: change to use register_pernet_device

2019-06-20 Thread Jon Maloy
Acked-by: Jon Maloy 

> -Original Message-
> From: netdev-ow...@vger.kernel.org  On
> Behalf Of Xin Long
> Sent: 20-Jun-19 06:39
> To: network dev 
> Cc: da...@davemloft.net; Jon Maloy ; Ying Xue
> ; tipc-discuss...@lists.sourceforge.net
> Subject: [PATCH net] tipc: change to use register_pernet_device
> 
> This patch is to fix a dst defcnt leak, which can be reproduced by 
> doing:
> 
>   # ip net a c; ip net a s; modprobe tipc
>   # ip net e s ip l a n eth1 type veth peer n eth1 netns c
>   # ip net e c ip l s lo up; ip net e c ip l s eth1 up
>   # ip net e s ip l s lo up; ip net e s ip l s eth1 up
>   # ip net e c ip a a 1.1.1.2/8 dev eth1
>   # ip net e s ip a a 1.1.1.1/8 dev eth1
>   # ip net e c tipc b e m udp n u1 localip 1.1.1.2
>   # ip net e s tipc b e m udp n u1 localip 1.1.1.1
>   # ip net d c; ip net d s; rmmod tipc
> 
> and it will get stuck and keep logging the error:
> 
>   unregister_netdevice: waiting for lo to become free. Usage count = 1
> 
> The cause is that a dst is held by the udp sock's sk_rx_dst, set on the udp
> rx path when udp_early_demux == 1, and this dst (which eventually holds the
> lo dev) can't be released, because the bearer's removal in tipc's pernet
> .exit happens after the lo dev's removal in default_device's pernet .exit.
> 
>  "There are two distinct types of pernet_operations recognized: subsys and
>   device.  At creation all subsys init functions are called before device
>   init functions, and at destruction all device exit functions are called
>   before subsys exit function."
> 
> So by calling register_pernet_device instead to register tipc_net_ops, the
> pernet .exit() will be invoked earlier than loopback dev's removal when a
> netns is being destroyed, as fou/gue does.
> 
> Note that vxlan and geneve udp tunnels don't have this issue, as the udp sock
> is released in their device ndo_stop().
> 
> This fix is also necessary for the tipc dst_cache, which will hold dsts on
> the tx path and which I will introduce in my next patch.
> 
> Reported-by: Li Shuang 
> Signed-off-by: Xin Long 
> ---
>  net/tipc/core.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/net/tipc/core.c b/net/tipc/core.c index ed536c0..c837072 100644
> --- a/net/tipc/core.c
> +++ b/net/tipc/core.c
> @@ -134,7 +134,7 @@ static int __init tipc_init(void)
>   if (err)
>   goto out_sysctl;
> 
> - err = register_pernet_subsys(&tipc_net_ops);
> + err = register_pernet_device(&tipc_net_ops);
>   if (err)
>   goto out_pernet;
> 
> @@ -142,7 +142,7 @@ static int __init tipc_init(void)
>   if (err)
>   goto out_socket;
> 
> - err = register_pernet_subsys(&tipc_topsrv_net_ops);
> + err = register_pernet_device(&tipc_topsrv_net_ops);
>   if (err)
>   goto out_pernet_topsrv;
> 
> @@ -153,11 +153,11 @@ static int __init tipc_init(void)
>   pr_info("Started in single node mode\n");
>   return 0;
>  out_bearer:
> - unregister_pernet_subsys(&tipc_topsrv_net_ops);
> + unregister_pernet_device(&tipc_topsrv_net_ops);
>  out_pernet_topsrv:
>   tipc_socket_stop();
>  out_socket:
> - unregister_pernet_subsys(&tipc_net_ops);
> + unregister_pernet_device(&tipc_net_ops);
>  out_pernet:
>   tipc_unregister_sysctl();
>  out_sysctl:
> @@ -172,9 +172,9 @@ static int __init tipc_init(void)  static void __exit
> tipc_exit(void)  {
>   tipc_bearer_cleanup();
> - unregister_pernet_subsys(&tipc_topsrv_net_ops);
> + unregister_pernet_device(&tipc_topsrv_net_ops);
>   tipc_socket_stop();
> - unregister_pernet_subsys(&tipc_net_ops);
> + unregister_pernet_device(&tipc_net_ops);
>   tipc_netlink_stop();
>   tipc_netlink_compat_stop();
>   tipc_unregister_sysctl();
> --
> 2.1.0



RE: WARNING: locking bug in icmp_send

2019-03-25 Thread Jon Maloy
Yet another duplicate of syzbot+a25307ad099309f1c...@syzkaller.appspotmail.com

A fix has been posted.

///jon


> -Original Message-
> From: netdev-ow...@vger.kernel.org 
> On Behalf Of syzbot
> Sent: 23-Mar-19 19:03
> To: da...@davemloft.net; Jon Maloy ;
> kuz...@ms2.inr.ac.ru; linux-ker...@vger.kernel.org;
> netdev@vger.kernel.org; syzkaller-b...@googlegroups.com; tipc-
> discuss...@lists.sourceforge.net; ying@windriver.com; yoshfuji@linux-
> ipv6.org
> Subject: Re: WARNING: locking bug in icmp_send
> 
> syzbot has bisected this bug to:
> 
> commit 52dfae5c85a4c1078e9f1d5e8947d4a25f73dd81
> Author: Jon Maloy 
> Date:   Thu Mar 22 19:42:52 2018 +
> 
>  tipc: obtain node identity from interface by default
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=11b6dc5d20
> start commit:   b5372fe5 exec: load_script: Do not exec truncated interpre..
> git tree:   upstream
> final crash:https://syzkaller.appspot.com/x/report.txt?x=13b6dc5d20
> console output: https://syzkaller.appspot.com/x/log.txt?x=15b6dc5d20
> kernel config:  https://syzkaller.appspot.com/x/.config?x=7132344728e7ec3f
> dashboard link:
> https://syzkaller.appspot.com/bug?extid=ba21d65f55b562432c58
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=14c90fa740
> 
> Reported-by: syzbot+ba21d65f55b562432...@syzkaller.appspotmail.com
> Fixes: 52dfae5c85a4 ("tipc: obtain node identity from interface by default")
> 
> For information about bisection process see:
> https://goo.gl/tpsmEJ#bisection


[net 1/1] tipc: fix RDM/DGRAM connect() regression

2019-03-04 Thread Jon Maloy
From: Erik Hugne 

Fix regression bug introduced in
commit 365ad353c256 ("tipc: reduce risk of user starvation during link
congestion")

Only signal -EDESTADDRREQ for RDM/DGRAM if we don't have a cached
sockaddr.

Signed-off-by: Erik Hugne 
Signed-off-by: Jon Maloy 
---
 net/tipc/socket.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 70343ac..139694f 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1333,7 +1333,7 @@ static int __tipc_sendmsg(struct socket *sock, struct 
msghdr *m, size_t dlen)
 
if (unlikely(!dest)) {
dest = &tsk->peer;
-   if (!syn || dest->family != AF_TIPC)
+   if (!syn && dest->family != AF_TIPC)
return -EDESTADDRREQ;
}
 
-- 
2.1.4



RE: [Patch net] tipc: check group dests after tipc_wait_for_cond()

2018-12-17 Thread Jon Maloy
Acked-by: Jon Maloy 

Thank you, Cong.

///jon

> -Original Message-
> From: Cong Wang 
> Sent: 17-Dec-18 02:25
> To: netdev@vger.kernel.org
> Cc: Cong Wang ; Ying Xue
> ; Jon Maloy 
> Subject: [Patch net] tipc: check group dests after tipc_wait_for_cond()
> 
> Similar to commit 143ece654f9f ("tipc: check tsk->group in
> tipc_wait_for_cond()") we have to reload grp->dests too after we re-take
> the sock lock.
> This means we need to move the dsts check after tipc_wait_for_cond() too.
> 
> Fixes: 75da2163dbb6 ("tipc: introduce communication groups")
> Reported-and-tested-by:
> syzbot+99f20222fc5018d2b...@syzkaller.appspotmail.com
> Cc: Ying Xue 
> Cc: Jon Maloy 
> Signed-off-by: Cong Wang 
> ---
>  net/tipc/socket.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/net/tipc/socket.c b/net/tipc/socket.c index
> 656940692a44..8f34db2a9785 100644
> --- a/net/tipc/socket.c
> +++ b/net/tipc/socket.c
> @@ -1009,7 +1009,7 @@ static int tipc_send_group_bcast(struct socket
> *sock, struct msghdr *m,
>   struct sock *sk = sock->sk;
>   struct net *net = sock_net(sk);
>   struct tipc_sock *tsk = tipc_sk(sk);
> - struct tipc_nlist *dsts = tipc_group_dests(tsk->group);
> + struct tipc_nlist *dsts;
>   struct tipc_mc_method *method = &tsk->mc_method;
>   bool ack = method->mandatory && method->rcast;
>   int blks = tsk_blocks(MCAST_H_SIZE + dlen); @@ -1018,9 +1018,6
> @@ static int tipc_send_group_bcast(struct socket *sock, struct msghdr *m,
>   struct sk_buff_head pkts;
>   int rc = -EHOSTUNREACH;
> 
> - if (!dsts->local && !dsts->remote)
> - return -EHOSTUNREACH;
> -
>   /* Block or return if any destination link or member is congested */
>   rc = tipc_wait_for_cond(sock, &timeout,
>   !tsk->cong_link_cnt && tsk->group && @@ -
> 1028,6 +1025,10 @@ static int tipc_send_group_bcast(struct socket *sock,
> struct msghdr *m,
>   if (unlikely(rc))
>   return rc;
> 
> + dsts = tipc_group_dests(tsk->group);
> + if (!dsts->local && !dsts->remote)
> + return -EHOSTUNREACH;
> +
>   /* Complete message header */
>   if (dest) {
>   msg_set_type(hdr, TIPC_GRP_MCAST_MSG);
> --
> 2.19.2



RE: KMSAN: uninit-value in __inet6_bind

2018-12-14 Thread Jon Maloy


> -Original Message-
> From: netdev-ow...@vger.kernel.org 
> On Behalf Of Eric Dumazet
> Sent: 14-Dec-18 10:15
> To: Jon Maloy ; Cong Wang
> ; Dmitry Vyukov 
> Cc: syzbot+c56449ed3652e6720...@syzkaller.appspotmail.com; Ying Xue
> ; tipc-discuss...@lists.sourceforge.net; David
> Miller ; Alexey Kuznetsov ;
> LKML ; Linux Kernel Network Developers
> ; syzkaller-b...@googlegroups.com; Hideaki
> YOSHIFUJI 
> Subject: Re: KMSAN: uninit-value in __inet6_bind
> 
> 
> 
> On 12/14/2018 07:04 AM, Jon Maloy wrote:
> >
> >
> >> -Original Message-
> >> From: Cong Wang 
> >> Sent: 12-Dec-18 01:17
> >> To: Dmitry Vyukov 
> >> Cc: syzbot+c56449ed3652e6720...@syzkaller.appspotmail.com; Jon
> Maloy
> >> ; Ying Xue ; tipc-
> >> discuss...@lists.sourceforge.net; David Miller ;
> >> Alexey Kuznetsov ; LKML  >> ker...@vger.kernel.org>; Linux Kernel Network Developers
> >> ; syzkaller-b...@googlegroups.com; Hideaki
> >> YOSHIFUJI 
> >> Subject: Re: KMSAN: uninit-value in __inet6_bind
> >>
> >> On Tue, Dec 11, 2018 at 1:04 AM Dmitry Vyukov 
> >> wrote:
> >>>
> >>> On Tue, Dec 11, 2018 at 1:41 AM syzbot
> >>>  wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> syzbot found the following crash on:
> >>>>
> >>>> HEAD commit:3f06bda61398 kmsan: remove excessive KMSAN
> >> wrappers from a..
> >>>> git tree:   https://github.com/google/kmsan.git/master
> >>>> console output:
> >>>> https://syzkaller.appspot.com/x/log.txt?x=13ca6b0540
> >>>> kernel config:
> >>>> https://syzkaller.appspot.com/x/.config?x=9b071100dcf8e641
> >>>> dashboard link:
> >> https://syzkaller.appspot.com/bug?extid=c56449ed3652e6720f30
> >>>> compiler:   clang version 8.0.0 (trunk 348261)
> >>>>
> >>>> Unfortunately, I don't have any reproducer for this crash yet.
> >>>>
> >>>> IMPORTANT: if you fix the bug, please add the following tag to the
> >> commit:
> >>>> Reported-by:
> syzbot+c56449ed3652e6720...@syzkaller.appspotmail.com
> >>>
> >>> This looks like a bug in TIPC, +TIPC maintainers.
> >>>
> >>
> >> It looks more like udp_sock_create6() doesn't initialize
> >> udp6_addr.sin6_scope_id.
> >
> > Unfortunately udp_sock_create6() has no way of knowing this value,
> because struct udp_port_cfg is missing a field sin6_scope_id.
> > So this has to be fixed first by adding this field to the struct, and then
> setting it correctly in all current users.
> >
> 
> Do we reasons to believe values other than 0 are needed ?
> 
For TIPC it is ok with 0.

///jon


RE: KMSAN: uninit-value in __inet6_bind

2018-12-14 Thread Jon Maloy


> -Original Message-
> From: Cong Wang 
> Sent: 12-Dec-18 01:17
> To: Dmitry Vyukov 
> Cc: syzbot+c56449ed3652e6720...@syzkaller.appspotmail.com; Jon Maloy
> ; Ying Xue ; tipc-
> discuss...@lists.sourceforge.net; David Miller ;
> Alexey Kuznetsov ; LKML  ker...@vger.kernel.org>; Linux Kernel Network Developers
> ; syzkaller-b...@googlegroups.com; Hideaki
> YOSHIFUJI 
> Subject: Re: KMSAN: uninit-value in __inet6_bind
> 
> On Tue, Dec 11, 2018 at 1:04 AM Dmitry Vyukov 
> wrote:
> >
> > On Tue, Dec 11, 2018 at 1:41 AM syzbot
> >  wrote:
> > >
> > > Hello,
> > >
> > > syzbot found the following crash on:
> > >
> > > HEAD commit:3f06bda61398 kmsan: remove excessive KMSAN
> wrappers from a..
> > > git tree:   https://github.com/google/kmsan.git/master
> > > console output:
> > > https://syzkaller.appspot.com/x/log.txt?x=13ca6b0540
> > > kernel config:
> > > https://syzkaller.appspot.com/x/.config?x=9b071100dcf8e641
> > > dashboard link:
> https://syzkaller.appspot.com/bug?extid=c56449ed3652e6720f30
> > > compiler:   clang version 8.0.0 (trunk 348261)
> > >
> > > Unfortunately, I don't have any reproducer for this crash yet.
> > >
> > > IMPORTANT: if you fix the bug, please add the following tag to the
> commit:
> > > Reported-by: syzbot+c56449ed3652e6720...@syzkaller.appspotmail.com
> >
> > This looks like a bug in TIPC, +TIPC maintainers.
> >
> 
> It looks more like udp_sock_create6() doesn't initialize
> udp6_addr.sin6_scope_id.

Unfortunately udp_sock_create6() has no way of knowing this value, because 
struct udp_port_cfg is missing a field sin6_scope_id.
So this has to be fixed first by adding this field to the struct, and then 
setting it correctly in all current users.

///jon




RE: KASAN: use-after-free Read in tipc_group_cong

2018-12-13 Thread Jon Maloy


> -Original Message-
> From: Dmitry Vyukov 
> Sent: 13-Dec-18 04:47
> To: Jon Maloy 
> Cc: syzbot+9845fed98688e01f4...@syzkaller.appspotmail.com; David Miller
> ; LKML ; netdev
> ; syzkaller-bugs  b...@googlegroups.com>; tipc-discuss...@lists.sourceforge.net; Ying Xue
> 
> Subject: Re: KASAN: use-after-free Read in tipc_group_cong
> 
> On Thu, Dec 13, 2018 at 1:16 AM Jon Maloy 
> wrote:
> > > -Original Message-
> > > From: syzbot
> 
> > > Sent: 12-Dec-18 06:11
> > > To: da...@davemloft.net; Jon Maloy ; linux-
> > > ker...@vger.kernel.org; netdev@vger.kernel.org; syzkaller-
> > > b...@googlegroups.com; tipc-discuss...@lists.sourceforge.net;
> > > ying@windriver.com
> > > Subject: KASAN: use-after-free Read in tipc_group_cong
> >
> > This seems to be an effect of the same bug as reported in
> > https://syzkaller.appspot.com/bug?extid=10a9db47c3a0e13eb31c
> 
> Let's do
> 
> #syz dup: KASAN: use-after-free Read in tipc_group_bc_cong
> 
> then.
> 
> 
> > Cong posted a fix for that one. Did you see the crash after applying his
> patch?
> 
> Which patch do you mean? Unfortunately kernel development process is so
> that it's not possible to figure out what fixes what.

This one:
[Patch net] tipc: check tsk->group in tipc_wait_for_cond()

///jon

> 
> I would just wait for new syzbot results.
> 
> 
> 
> > > Hello,
> > >
> > > syzbot found the following crash on:
> > >
> > > HEAD commit:f5d582777bcb Merge branch 'for-linus' of
> git://git.kernel...
> > > git tree:   upstream
> > > console output:
> > > https://syzkaller.appspot.com/x/log.txt?x=1705d52540
> > > kernel config:
> > > https://syzkaller.appspot.com/x/.config?x=c8970c89a0efbb23
> > > dashboard link:
> > > https://syzkaller.appspot.com/bug?extid=9845fed98688e01f431e
> > > compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> > > syz repro:
> https://syzkaller.appspot.com/x/repro.syz?x=101b6ba340
> > >
> > > IMPORTANT: if you fix the bug, please add the following tag to the
> commit:
> > > Reported-by: syzbot+9845fed98688e01f4...@syzkaller.appspotmail.com
> > >
> > > 8021q: adding VLAN 0 to HW filter on device team0
> > > 8021q: adding VLAN 0 to HW filter on device team0
> > > audit: type=1400 audit(1544592509.246:38): avc:  denied  { associate
> > > } for
> > > pid=6204 comm="syz-executor5" name="syz5"
> > > scontext=unconfined_u:object_r:unlabeled_t:s0
> > > tcontext=system_u:object_r:unlabeled_t:s0 tclass=filesystem
> > > permissive=1
> > >
> ==
> > > 
> > > BUG: KASAN: use-after-free in tipc_group_find_dest
> > > net/tipc/group.c:255 [inline]
> > > BUG: KASAN: use-after-free in tipc_group_cong+0x566/0x5d0
> > > net/tipc/group.c:416
> > > Read of size 8 at addr 8881c59f5000 by task syz-executor4/10565
> > >
> > > CPU: 1 PID: 10565 Comm: syz-executor4 Not tainted 4.20.0-rc6+ #151
> > > Hardware name: Google Google Compute Engine/Google Compute
> Engine,
> > > BIOS Google 01/01/2011 Call Trace:
> > >   __dump_stack lib/dump_stack.c:77 [inline]
> > >   dump_stack+0x244/0x39d lib/dump_stack.c:113
> > >   print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
> > >   kasan_report_error mm/kasan/report.c:354 [inline]
> > >   kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
> > >   __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
> > >   tipc_group_find_dest net/tipc/group.c:255 [inline]
> > >   tipc_group_cong+0x566/0x5d0 net/tipc/group.c:416
> > >   tipc_send_group_anycast+0x9bb/0xc80 net/tipc/socket.c:972
> > >   __tipc_sendmsg+0x12b1/0x1d40 net/tipc/socket.c:1309
> > >   tipc_sendmsg+0x50/0x70 net/tipc/socket.c:1272
> > >   sock_sendmsg_nosec net/socket.c:621 [inline]
> > >   sock_sendmsg+0xd5/0x120 net/socket.c:631
> > >   ___sys_sendmsg+0x7fd/0x930 net/socket.c:2116
> > >   __sys_sendmsg+0x11d/0x280 net/socket.c:2154
> > >   __do_sys_sendmsg net/socket.c:2163 [inline]
> > >   __se_sys_sendmsg net/socket.c:2161 [inline]
> > >   __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2161
> > >   do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
> > >   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > > RIP: 0033:0x457679
> > > Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00

RE: general protection fault in __ip_append_data

2018-12-12 Thread Jon Maloy


> -Original Message-
> From: Dmitry Vyukov 
> Sent: 12-Dec-18 06:03
> To: syzbot+aab62b9c7b12e7c6a...@syzkaller.appspotmail.com; Jon Maloy
> ; Ying Xue ; David
> Miller ; tipc-discuss...@lists.sourceforge.net
> Cc: Alexey Kuznetsov ; LKML  ker...@vger.kernel.org>; netdev ; syzkaller-bugs
> ; Hideaki YOSHIFUJI  ipv6.org>
> Subject: Re: general protection fault in __ip_append_data
> 
> On Wed, Dec 12, 2018 at 11:57 AM syzbot
>  wrote:
> >
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit:f5d582777bcb Merge branch 'for-linus' of git://git.kernel...
> > git tree:   upstream
> > console output:
> > https://syzkaller.appspot.com/x/log.txt?x=16e03afb40
> > kernel config:
> > https://syzkaller.appspot.com/x/.config?x=c8970c89a0efbb23
> > dashboard link:
> https://syzkaller.appspot.com/bug?extid=aab62b9c7b12e7c6ab0b
> > compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> > syz repro:
> https://syzkaller.appspot.com/x/repro.syz?x=13bb9c8b40
> > C reproducer:
> https://syzkaller.appspot.com/x/repro.c?x=1261667d40
> 
> From the reproducer it looks like a dup of TIPC bug:
> 
> #syz dup: KASAN: use-after-free Read in kfree_skb (2)

Agree with that, although it is not totally obvious. 
Let's see what the further testing gives after Cong's patch is applied.

///jon

> 
> 
> 
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+aab62b9c7b12e7c6a...@syzkaller.appspotmail.com
> >
> > Enabling of bearer  rejected, already enabled Enabling of
> > bearer  rejected, already enabled Enabling of bearer
> >  rejected, already enabled
> > kasan: CONFIG_KASAN_INLINE enabled
> > kasan: GPF could be caused by NULL-ptr deref or user memory access
> > general protection fault:  [#1] PREEMPT SMP KASAN
> > CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 4.20.0-rc6+ #371 Hardware
> > name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google
> > 01/01/2011
> > RIP: 0010:__ip_append_data.isra.48+0x31a/0x29b0
> > net/ipv4/ip_output.c:896
> > Code: c7 85 c8 fd ff ff 00 00 00 00 0f 85 12 10 00 00 e8 7b c1 e0 fa
> > 48 8b
> > 95 48 fe ff ff 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00
> > 0f
> > 85 e5 22 00 00 48 8b 85 48 fe ff ff 48 8b 18 48 b8
> > RSP: 0018:8881d9b569c0 EFLAGS: 00010246
> > RAX: dc00 RBX:  RCX: 869ec275
> > RDX:  RSI: 869ec2f5 RDI: 0001
> > RBP: 8881d9b56c28 R08: 8881d9b4a440 R09: 86b113b0
> > R10: 8881d9b56da0 R11:  R12: 8881d2c18a88
> > R13: 86258ba0 R14: 8bc37110 R15: 8881d2c18cd8
> > FS:  () GS:8881daf0()
> > knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 20001ac0 CR3: 0001cb6ea000 CR4: 001406e0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe0ff0 DR7: 0400 Call
> > Trace:
> >   ip_append_data.part.49+0xef/0x170 net/ipv4/ip_output.c:1197
> >   ip_append_data+0x6d/0x90 net/ipv4/ip_output.c:1186
> >   icmp_push_reply+0x18e/0x540 net/ipv4/icmp.c:375
> >   icmp_send+0x1544/0x1bd0 net/ipv4/icmp.c:736
> >   __udp4_lib_rcv+0x2484/0x32e0 net/ipv4/udp.c:2233
> >   udp_rcv+0x21/0x30 net/ipv4/udp.c:2392
> >   ip_local_deliver_finish+0x2e9/0xda0 net/ipv4/ip_input.c:215
> >   NF_HOOK include/linux/netfilter.h:289 [inline]
> >   ip_local_deliver+0x1e9/0x750 net/ipv4/ip_input.c:256
> >   dst_input include/net/dst.h:450 [inline]
> >   ip_rcv_finish+0x1f9/0x300 net/ipv4/ip_input.c:415
> >   NF_HOOK include/linux/netfilter.h:289 [inline]
> >   ip_rcv+0xed/0x600 net/ipv4/ip_input.c:524
> >   __netif_receive_skb_one_core+0x14d/0x200 net/core/dev.c:4946
> >   __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:5056
> >   process_backlog+0x24e/0x7a0 net/core/dev.c:5864
> >   napi_poll net/core/dev.c:6287 [inline]
> >   net_rx_action+0x7fa/0x19b0 net/core/dev.c:6353
> >   __do_softirq+0x308/0xb7e kernel/softirq.c:292
> >   run_ksoftirqd+0x5e/0x100 kernel/softirq.c:654
> >   smpboot_thread_fn+0x68b/0xa00 kernel/smpboot.c:164
> >   kthread+0x35a/0x440 kernel/kthread.c:246
> >   ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352 Modules linked
> > in:
> > ---[ end trace 762165cda5fdc138 ]---
> > Enabling of bearer  rejected, already enabled
> > RIP: 0010:__ip_append_data.isra.48+0x

RE: KASAN: use-after-free Read in tipc_group_cong

2018-12-12 Thread Jon Maloy


> -Original Message-
> From: syzbot 
> Sent: 12-Dec-18 06:11
> To: da...@davemloft.net; Jon Maloy ; linux-
> ker...@vger.kernel.org; netdev@vger.kernel.org; syzkaller-
> b...@googlegroups.com; tipc-discuss...@lists.sourceforge.net;
> ying@windriver.com
> Subject: KASAN: use-after-free Read in tipc_group_cong

This seems to be an effect of the same bug as reported in
https://syzkaller.appspot.com/bug?extid=10a9db47c3a0e13eb31c

Cong posted a fix for that one. Did you see the crash after applying his patch?

///jon

> 
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:f5d582777bcb Merge branch 'for-linus' of git://git.kernel...
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=1705d52540
> kernel config:  https://syzkaller.appspot.com/x/.config?x=c8970c89a0efbb23
> dashboard link:
> https://syzkaller.appspot.com/bug?extid=9845fed98688e01f431e
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=101b6ba340
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+9845fed98688e01f4...@syzkaller.appspotmail.com
> 
> 8021q: adding VLAN 0 to HW filter on device team0
> 8021q: adding VLAN 0 to HW filter on device team0
> audit: type=1400 audit(1544592509.246:38): avc:  denied  { associate } for
> pid=6204 comm="syz-executor5" name="syz5"
> scontext=unconfined_u:object_r:unlabeled_t:s0
> tcontext=system_u:object_r:unlabeled_t:s0 tclass=filesystem permissive=1
> ==
> 
> BUG: KASAN: use-after-free in tipc_group_find_dest net/tipc/group.c:255
> [inline]
> BUG: KASAN: use-after-free in tipc_group_cong+0x566/0x5d0
> net/tipc/group.c:416
> Read of size 8 at addr 8881c59f5000 by task syz-executor4/10565
> 
> CPU: 1 PID: 10565 Comm: syz-executor4 Not tainted 4.20.0-rc6+ #151
> Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS Google 01/01/2011 Call Trace:
>   __dump_stack lib/dump_stack.c:77 [inline]
>   dump_stack+0x244/0x39d lib/dump_stack.c:113
>   print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
>   kasan_report_error mm/kasan/report.c:354 [inline]
>   kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
>   __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
>   tipc_group_find_dest net/tipc/group.c:255 [inline]
>   tipc_group_cong+0x566/0x5d0 net/tipc/group.c:416
>   tipc_send_group_anycast+0x9bb/0xc80 net/tipc/socket.c:972
>   __tipc_sendmsg+0x12b1/0x1d40 net/tipc/socket.c:1309
>   tipc_sendmsg+0x50/0x70 net/tipc/socket.c:1272
>   sock_sendmsg_nosec net/socket.c:621 [inline]
>   sock_sendmsg+0xd5/0x120 net/socket.c:631
>   ___sys_sendmsg+0x7fd/0x930 net/socket.c:2116
>   __sys_sendmsg+0x11d/0x280 net/socket.c:2154
>   __do_sys_sendmsg net/socket.c:2163 [inline]
>   __se_sys_sendmsg net/socket.c:2161 [inline]
>   __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2161
>   do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x457679
> Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
> 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 
> 0f 83
> cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:7f813d748c78 EFLAGS: 0246 ORIG_RAX:
> 002e
> RAX: ffda RBX: 0003 RCX: 00457679
> RDX:  RSI: 26c0 RDI: 0005
> RBP: 0072bfa0 R08:  R09: 
> R10:  R11: 0246 R12: 7f813d7496d4
> R13: 004c44dd R14: 004d74c8 R15: 
> 
> Allocated by task 10551:
>   save_stack+0x43/0xd0 mm/kasan/kasan.c:448
>   set_track mm/kasan/kasan.c:460 [inline]
>   kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
>   kmem_cache_alloc_trace+0x152/0x750 mm/slab.c:3620
>   kmalloc include/linux/slab.h:546 [inline]
>   kzalloc include/linux/slab.h:741 [inline]
>   tipc_group_create+0x152/0xa70 net/tipc/group.c:171
>   tipc_sk_join net/tipc/socket.c:2829 [inline]
>   tipc_setsockopt+0x2d1/0xd70 net/tipc/socket.c:2944
>   __sys_setsockopt+0x1ba/0x3c0 net/socket.c:1902
>   __do_sys_setsockopt net/socket.c:1913 [inline]
>   __se_sys_setsockopt net/socket.c:1910 [inline]
>   __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1910
>   do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Freed by task 10567:
>   save_stack+0x43/0xd0 mm/kasan/kasan.c:448
>   set_track mm/kasan/kasan.c:460 [inline]
>   __kasan_slab_fre

RE: [Patch net] tipc: compare remote and local protocols in tipc_udp_enable()

2018-12-12 Thread Jon Maloy



> -Original Message-
> From: Cong Wang 
> Sent: 10-Dec-18 18:24
> To: netdev@vger.kernel.org
> Cc: Cong Wang ; Ying Xue
> ; Jon Maloy 
> Subject: [Patch net] tipc: compare remote and local protocols in
> tipc_udp_enable()
> 
> When TIPC_NLA_UDP_REMOTE is an IPv6 mcast address but
> TIPC_NLA_UDP_LOCAL is an IPv4 address, a NULL-ptr deref is triggered as
> the UDP tunnel sock is initialized to IPv4 or IPv6 sock merely based on the
> protocol in local address.
> 
> We should just error out when the remote address and local address have
> different protocols.

Acked-by: Jon Maloy 

Thank you for your help, Cong.

> 
> Reported-by: syzbot+eb4da3a20fad2e525...@syzkaller.appspotmail.com
> Cc: Ying Xue 
> Cc: Jon Maloy 
> Signed-off-by: Cong Wang 
> ---
>  net/tipc/udp_media.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c index
> 1b1ba1310ea7..4d85d71f16e2 100644
> --- a/net/tipc/udp_media.c
> +++ b/net/tipc/udp_media.c
> @@ -679,6 +679,11 @@ static int tipc_udp_enable(struct net *net, struct
> tipc_bearer *b,
>   if (err)
>   goto err;
> 
> + if (remote.proto != local.proto) {
> + err = -EINVAL;
> + goto err;
> + }
> +
>   /* Checking remote ip address */
>   rmcast = tipc_udp_is_mcast_addr(&remote);
> 
> --
> 2.19.2



RE: [Patch net] tipc: fix a double kfree_skb()

2018-12-10 Thread Jon Maloy
Acked. 
Thank you for both your quick fixes, Cong.

///jon


> -Original Message-
> From: Cong Wang 
> Sent: 10-Dec-18 15:46
> To: netdev@vger.kernel.org
> Cc: Cong Wang ; Ying Xue
> ; Jon Maloy 
> Subject: [Patch net] tipc: fix a double kfree_skb()
> 
> tipc_udp_xmit() drops the packet on error; there is no need to drop it again.
> 
> Fixes: ef20cd4dd163 ("tipc: introduce UDP replicast")
> Reported-and-tested-by:
> syzbot+eae585ba2cc2752d3...@syzkaller.appspotmail.com
> Cc: Ying Xue 
> Cc: Jon Maloy 
> Signed-off-by: Cong Wang 
> ---
>  net/tipc/udp_media.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c index
> 10dc59ce9c82..1b1ba1310ea7 100644
> --- a/net/tipc/udp_media.c
> +++ b/net/tipc/udp_media.c
> @@ -245,10 +245,8 @@ static int tipc_udp_send_msg(struct net *net, struct
> sk_buff *skb,
>   }
> 
>   err = tipc_udp_xmit(net, _skb, ub, src, &rcast->addr);
> - if (err) {
> - kfree_skb(_skb);
> + if (err)
>   goto out;
> - }
>   }
>   err = 0;
>  out:
> --
> 2.19.2



[net 1/1] tipc: fix lockdep warning during node delete

2018-11-26 Thread Jon Maloy
We see the following lockdep warning:

[ 2284.078521] ==
[ 2284.078604] WARNING: possible circular locking dependency detected
[ 2284.078604] 4.19.0+ #42 Tainted: GE
[ 2284.078604] --
[ 2284.078604] rmmod/254 is trying to acquire lock:
[ 2284.078604] acd94e28 ((&n->timer)#2){+.-.}, at: 
del_timer_sync+0x5/0xa0
[ 2284.078604]
[ 2284.078604] but task is already holding lock:
[ 2284.078604] f997afc0 (&(&tn->node_list_lock)->rlock){+.-.}, at: 
tipc_node_stop+0xac/0x190 [tipc]
[ 2284.078604]
[ 2284.078604] which lock already depends on the new lock.
[ 2284.078604]
[ 2284.078604]
[ 2284.078604] the existing dependency chain (in reverse order) is:
[ 2284.078604]
[ 2284.078604] -> #1 (&(&tn->node_list_lock)->rlock){+.-.}:
[ 2284.078604]tipc_node_timeout+0x20a/0x330 [tipc]
[ 2284.078604]call_timer_fn+0xa1/0x280
[ 2284.078604]run_timer_softirq+0x1f2/0x4d0
[ 2284.078604]__do_softirq+0xfc/0x413
[ 2284.078604]irq_exit+0xb5/0xc0
[ 2284.078604]smp_apic_timer_interrupt+0xac/0x210
[ 2284.078604]apic_timer_interrupt+0xf/0x20
[ 2284.078604]default_idle+0x1c/0x140
[ 2284.078604]do_idle+0x1bc/0x280
[ 2284.078604]cpu_startup_entry+0x19/0x20
[ 2284.078604]start_secondary+0x187/0x1c0
[ 2284.078604]secondary_startup_64+0xa4/0xb0
[ 2284.078604]
[ 2284.078604] -> #0 ((&n->timer)#2){+.-.}:
[ 2284.078604]del_timer_sync+0x34/0xa0
[ 2284.078604]tipc_node_delete+0x1a/0x40 [tipc]
[ 2284.078604]tipc_node_stop+0xcb/0x190 [tipc]
[ 2284.078604]tipc_net_stop+0x154/0x170 [tipc]
[ 2284.078604]tipc_exit_net+0x16/0x30 [tipc]
[ 2284.078604]ops_exit_list.isra.8+0x36/0x70
[ 2284.078604]unregister_pernet_operations+0x87/0xd0
[ 2284.078604]unregister_pernet_subsys+0x1d/0x30
[ 2284.078604]tipc_exit+0x11/0x6f2 [tipc]
[ 2284.078604]__x64_sys_delete_module+0x1df/0x240
[ 2284.078604]do_syscall_64+0x66/0x460
[ 2284.078604]entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 2284.078604]
[ 2284.078604] other info that might help us debug this:
[ 2284.078604]
[ 2284.078604]  Possible unsafe locking scenario:
[ 2284.078604]
[ 2284.078604]CPU0CPU1
[ 2284.078604]
[ 2284.078604]   lock(&(&tn->node_list_lock)->rlock);
[ 2284.078604]lock((&n->timer)#2);
[ 2284.078604]
lock(&(&tn->node_list_lock)->rlock);
[ 2284.078604]   lock((&n->timer)#2);
[ 2284.078604]
[ 2284.078604]  *** DEADLOCK ***
[ 2284.078604]
[ 2284.078604] 3 locks held by rmmod/254:
[ 2284.078604]  #0: 3368be9b (pernet_ops_rwsem){+.+.}, at: 
unregister_pernet_subsys+0x15/0x30
[ 2284.078604]  #1: 46ed9c86 (rtnl_mutex){+.+.}, at: 
tipc_net_stop+0x144/0x170 [tipc]
[ 2284.078604]  #2: f997afc0 (&(&tn->node_list_lock)->rlock){+.-.}, at: 
tipc_node_stop+0xac/0x19
[...]

The reason is that the node timer handler sometimes needs to delete a
node which has been disconnected for too long. To do this, it grabs
the lock 'node_list_lock', which may at the same time be held by the
generic node cleanup function, tipc_node_stop(), during module removal.
Since the latter is calling del_timer_sync() inside the same lock, we
have a potential deadlock.

We fix this by letting the timer cleanup function use spin_trylock()
instead of spin_lock(); when it fails to grab the lock it simply
returns, so that the timer handler can terminate its execution.
This is safe to do, since tipc_node_stop() is about to delete both
the timer and the node instance anyway.

Fixes: 6a939f365bdb ("tipc: Auto removal of peer down node instance")
Acked-by: Ying Xue 
Signed-off-by: Jon Maloy 
---
 net/tipc/node.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/tipc/node.c b/net/tipc/node.c
index 2afc4f8..4880197 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -584,12 +584,15 @@ static void  tipc_node_clear_links(struct tipc_node *node)
 /* tipc_node_cleanup - delete nodes that does not
  * have active links for NODE_CLEANUP_AFTER time
  */
-static int tipc_node_cleanup(struct tipc_node *peer)
+static bool tipc_node_cleanup(struct tipc_node *peer)
 {
struct tipc_net *tn = tipc_net(peer->net);
bool deleted = false;
 
-   spin_lock_bh(&tn->node_list_lock);
+   /* If lock held by tipc_node_stop() the node will be deleted anyway */
+   if (!spin_trylock_bh(&tn->node_list_lock))
+   return false;
+
tipc_node_write_lock(peer);
 
if (!node_is_up(peer) && time_after(jiffies, peer->delete_at)) {
-- 
1.8.3.1



RE: [PATCH net] tipc: eliminate possible recursive locking detected by LOCKDEP

2018-10-11 Thread Jon Maloy
Acked-by: Jon Maloy 

///jon


> -Original Message-
> From: Ying Xue 
> Sent: October 11, 2018 7:58 AM
> To: Jon Maloy ; dvyu...@google.com
> Cc: da...@davemloft.net; parthasarathy.bhuvara...@ericsson.com;
> netdev@vger.kernel.org; linux-ker...@vger.kernel.org; tipc-
> discuss...@lists.sourceforge.net
> Subject: [PATCH net] tipc: eliminate possible recursive locking detected by
> LOCKDEP
> 
> When booting kernel with LOCKDEP option, below warning info was found:
> 
> WARNING: possible recursive locking detected 4.19.0-rc7+ #14 Not tainted
> 
> swapper/0/1 is trying to acquire lock:
> dcfc0fc8 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh
> include/linux/spinlock.h:334 [inline]
> dcfc0fc8 (&(&list->lock)->rlock#4){+...}, at:
> tipc_link_reset+0x125/0xdf0 net/tipc/link.c:850
> 
> but task is already holding lock:
> cbb9b036 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh
> include/linux/spinlock.h:334 [inline]
> cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
> tipc_link_reset+0xfa/0xdf0 net/tipc/link.c:849
> 
> other info that might help us debug this:
>  Possible unsafe locking scenario:
> 
>CPU0
>
>   lock(&(&list->lock)->rlock#4);
>   lock(&(&list->lock)->rlock#4);
> 
>  *** DEADLOCK ***
> 
>  May be due to missing lock nesting notation
> 
> 2 locks held by swapper/0/1:
>  #0: f7539d34 (pernet_ops_rwsem){+.+.}, at:
> register_pernet_subsys+0x19/0x40 net/core/net_namespace.c:1051
>  #1: cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
> spin_lock_bh include/linux/spinlock.h:334 [inline]
>  #1: cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
> tipc_link_reset+0xfa/0xdf0 net/tipc/link.c:849
> 
> stack backtrace:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.0-rc7+ #14 Hardware name:
> QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1af/0x295 lib/dump_stack.c:113  print_deadlock_bug
> kernel/locking/lockdep.c:1759 [inline]  check_deadlock
> kernel/locking/lockdep.c:1803 [inline]  validate_chain
> kernel/locking/lockdep.c:2399 [inline]
>  __lock_acquire+0xf1e/0x3c60 kernel/locking/lockdep.c:3411
>  lock_acquire+0x1db/0x520 kernel/locking/lockdep.c:3900
> __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
>  _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168  spin_lock_bh
> include/linux/spinlock.h:334 [inline]
>  tipc_link_reset+0x125/0xdf0 net/tipc/link.c:850
>  tipc_link_bc_create+0xb5/0x1f0 net/tipc/link.c:526
>  tipc_bcast_init+0x59b/0xab0 net/tipc/bcast.c:521
>  tipc_init_net+0x472/0x610 net/tipc/core.c:82
>  ops_init+0xf7/0x520 net/core/net_namespace.c:129
> __register_pernet_operations net/core/net_namespace.c:940 [inline]
>  register_pernet_operations+0x453/0xac0 net/core/net_namespace.c:1011
>  register_pernet_subsys+0x28/0x40 net/core/net_namespace.c:1052
>  tipc_init+0x83/0x104 net/tipc/core.c:140  do_one_initcall+0x109/0x70a
> init/main.c:885  do_initcall_level init/main.c:953 [inline]  do_initcalls
> init/main.c:961 [inline]  do_basic_setup init/main.c:979 [inline]
> kernel_init_freeable+0x4bd/0x57f init/main.c:1144
>  kernel_init+0x13/0x180 init/main.c:1063
>  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413
> 
> The reason why the noise above was complained by LOCKDEP is because we
> nested to hold l->wakeupq.lock and l->inputq->lock in tipc_link_reset
> function. In fact it's unnecessary to move skb buffer from l->wakeupq queue
> to l->inputq queue while holding the two locks at the same time.
> Instead, we can move skb buffers in l->wakeupq queue to a temporary list
> first and then move the buffers of the temporary list to l->inputq queue,
> which is also safe for us.
> 
> Fixes: 3f32d0be6c16 ("tipc: lock wakeup & inputq at tipc_link_reset()")
> Reported-by: Dmitry Vyukov 
> Signed-off-by: Ying Xue 
> ---
>  net/tipc/link.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/net/tipc/link.c b/net/tipc/link.c index fb886b5..1d21ae4 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -843,14 +843,21 @@ static void link_prepare_wakeup(struct tipc_link *l)
> 
>  void tipc_link_reset(struct tipc_link *l)  {
> + struct sk_buff_head list;
> +
> + __skb_queue_head_init(&list);
> +
>   l->in_session = false;
>   l->session++;
>   l->mtu = l->advertised_mtu;
> +
>   spin_lock_bh(&l->wakeupq.lock);
> + skb_queue_splice_init(&l->wakeupq, &list);
> + spin_unlock_bh(&l->wakeupq.lock);
> +
>   spin_lock_bh(&l->inputq->lock);
> - skb_queue_splice_init(&l->wakeupq, l->inputq);
> + skb_queue_splice_init(&list, l->inputq);
>   spin_unlock_bh(&l->inputq->lock);
> - spin_unlock_bh(&l->wakeupq.lock);
> 
>   __skb_queue_purge(&l->transmq);
>   __skb_queue_purge(&l->deferdq);
> --
> 2.7.4



RE: net/tipc: recursive locking in tipc_link_reset

2018-10-11 Thread Jon Maloy
Hi Dmitry,
Yes, we are aware of this, the kernel test robot warned us about this a few 
days ago.
I am looking into it.

///jon

> -Original Message-
> From: Dmitry Vyukov 
> Sent: October 11, 2018 3:55 AM
> To: parthasarathy.bhuvara...@ericsson.com; Jon Maloy
> ; David Miller ; Ying Xue
> ; netdev ; tipc-
> discuss...@lists.sourceforge.net; LKML 
> Subject: net/tipc: recursive locking in tipc_link_reset
> 
> Hi,
> 
> I am getting the following error while booting the latest kernel on
> bb2d8f2f61047cbde08b78ec03e4ebdb01ee5434 (Oct 10). Config is attached.
> 
> Since this happens during boot, this makes LOCKDEP completely unusable,
> does not allow to discover any other locking issues and masks all new bugs
> being introduced into kernel.
> Please fix asap.
> Thanks
> 
> 
> WARNING: possible recursive locking detected 4.19.0-rc7+ #14 Not tainted
> 
> swapper/0/1 is trying to acquire lock:
> dcfc0fc8 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh
> include/linux/spinlock.h:334 [inline]
> dcfc0fc8 (&(&list->lock)->rlock#4){+...}, at:
> tipc_link_reset+0x125/0xdf0 net/tipc/link.c:850
> 
> but task is already holding lock:
> cbb9b036 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh
> include/linux/spinlock.h:334 [inline]
> cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
> tipc_link_reset+0xfa/0xdf0 net/tipc/link.c:849
> 
> other info that might help us debug this:
>  Possible unsafe locking scenario:
> 
>CPU0
>
>   lock(&(&list->lock)->rlock#4);
>   lock(&(&list->lock)->rlock#4);
> 
>  *** DEADLOCK ***
> 
>  May be due to missing lock nesting notation
> 
> 2 locks held by swapper/0/1:
>  #0: f7539d34 (pernet_ops_rwsem){+.+.}, at:
> register_pernet_subsys+0x19/0x40 net/core/net_namespace.c:1051
>  #1: cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
> spin_lock_bh include/linux/spinlock.h:334 [inline]
>  #1: cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
> tipc_link_reset+0xfa/0xdf0 net/tipc/link.c:849
> 
> stack backtrace:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.0-rc7+ #14 Hardware name:
> QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1af/0x295 lib/dump_stack.c:113  print_deadlock_bug
> kernel/locking/lockdep.c:1759 [inline]  check_deadlock
> kernel/locking/lockdep.c:1803 [inline]  validate_chain
> kernel/locking/lockdep.c:2399 [inline]
>  __lock_acquire+0xf1e/0x3c60 kernel/locking/lockdep.c:3411
>  lock_acquire+0x1db/0x520 kernel/locking/lockdep.c:3900
> __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
>  _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168  spin_lock_bh
> include/linux/spinlock.h:334 [inline]
>  tipc_link_reset+0x125/0xdf0 net/tipc/link.c:850
>  tipc_link_bc_create+0xb5/0x1f0 net/tipc/link.c:526
>  tipc_bcast_init+0x59b/0xab0 net/tipc/bcast.c:521
>  tipc_init_net+0x472/0x610 net/tipc/core.c:82
>  ops_init+0xf7/0x520 net/core/net_namespace.c:129
> __register_pernet_operations net/core/net_namespace.c:940 [inline]
>  register_pernet_operations+0x453/0xac0 net/core/net_namespace.c:1011
>  register_pernet_subsys+0x28/0x40 net/core/net_namespace.c:1052
>  tipc_init+0x83/0x104 net/tipc/core.c:140  do_one_initcall+0x109/0x70a
> init/main.c:885  do_initcall_level init/main.c:953 [inline]  do_initcalls
> init/main.c:961 [inline]  do_basic_setup init/main.c:979 [inline]
> kernel_init_freeable+0x4bd/0x57f init/main.c:1144
>  kernel_init+0x13/0x180 init/main.c:1063
>  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413


Re: [PATCH net-next] virtio_net: force_napi_tx module param.

2018-07-24 Thread Jon Olson
On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin  wrote:
>
> On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
> > On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin  wrote:
> > >
> > > On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
> > > > >From the above linked patch, I understand that there are yet
> > > > other special cases in production, such as a hard cap on #tx queues to
> > > > 32 regardless of number of vcpus.
> > >
> > > I don't think upstream kernels have this limit - we can
> > > now use vmalloc for higher number of queues.
> >
> > Yes. that patch* mentioned it as a google compute engine imposed
> > limit. It is exactly such cloud provider imposed rules that I'm
> > concerned about working around in upstream drivers.
> >
> > * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
>
> Yea. Why does GCE do it btw?

There are a few reasons for the limit, some historical, some current.

Historically we did this because of a kernel limit on the number of
TAP queues (in Montreal I thought this limit was 32). To my chagrin,
the limit upstream at the time we did it was actually eight. We had
increased the limit from eight to 32 internally, and it appears
upstream it has subsequently been increased to 256. We no longer
use TAP for networking, so that constraint no longer applies for us,
but when looking at removing/raising the limit we discovered no
workloads that clearly benefited from lifting it, and it also placed
more pressure on our virtual networking stack particularly on the Tx
side. We left it as-is.

In terms of current reasons there are really two. One is memory usage.
As you know, virtio-net uses rx/tx pairs, so there's an expectation
that the guest will have an Rx queue for every Tx queue. We run our
individual virtqueues fairly deep (4096 entries) to give guests a wide
time window for re-posting Rx buffers and avoiding starvation on
packet delivery. Filling an Rx vring with max-sized mergeable buffers
(4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
be up to 512MB of memory posted for network buffers. Scaling this to
the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
all of the Rx rings full would (in the large average Rx packet size
case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
of RAM available, but I don't believe we've observed a situation where
they would have benefited from having 2.5 gigs of buffers posted for
incoming network traffic :)

The second reason is interrupt related -- as I mentioned above, we
have found no workloads that clearly benefit from so many queues, but
we have found workloads that degrade. In particular workloads that do
a lot of small packet processing but which aren't extremely latency
sensitive can achieve higher PPS by taking fewer interrupt across
fewer VCPUs due to better batching (this also incurs higher latency,
but at the limit the "busy" cores end up suppressing most interrupts
and spending most of their cycles farming out work). Memcache is a
good example here, particularly if the latency targets for request
completion are in the ~milliseconds range (rather than the
microseconds we typically strive for with TCP_RR-style workloads).

All of that said, we haven't been forthcoming with data (and
unfortunately I don't have it handy in a useful form, otherwise I'd
simply post it here), so I understand the hesitation to simply run
with napi_tx across the board. As Willem said, this patch seemed like
the least disruptive way to allow us to continue down the road of
"universal" NAPI Tx and to hopefully get data across enough workloads
(with VMs small, large, and absurdly large :) to present a compelling
argument in one direction or another. As far as I know there aren't
currently any NAPI related ethtool commands (based on a quick perusal
of ethtool.h) -- it seems like it would be fairly involved/heavyweight
to plumb one solely for this unless NAPI Tx is something many users
will want to tune (and for which other drivers would support tuning).

--
Jon Olson


[PATCH net-next] ifb: fix packets checksum

2018-05-24 Thread Jon Maxwell
Fixup the checksum for CHECKSUM_COMPLETE when pulling skbs on RX path. 
Otherwise we get splats when tc mirred is used to redirect packets to ifb.

Before fix:

nic: hw csum failure

Signed-off-by: Jon Maxwell 
---
 drivers/net/ifb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ifb.c b/drivers/net/ifb.c
index 5f2897ec0edc..d345c61d476c 100644
--- a/drivers/net/ifb.c
+++ b/drivers/net/ifb.c
@@ -102,7 +102,7 @@ static void ifb_ri_tasklet(unsigned long _txp)
if (!skb->tc_from_ingress) {
dev_queue_xmit(skb);
} else {
-   skb_pull(skb, skb->mac_len);
+   skb_pull_rcsum(skb, skb->mac_len);
netif_receive_skb(skb);
}
}
-- 
2.13.6



RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-23 Thread Jon Rosen (jrosen)


> -Original Message-
> From: Willem de Bruijn [mailto:willemdebruijn.ker...@gmail.com]
> Sent: Wednesday, May 23, 2018 9:37 AM
> To: Jon Rosen (jrosen) 
> Cc: David S. Miller ; Willem de Bruijn 
> ; Eric Dumazet
> ; Kees Cook ; David Windsor 
> ; Rosen,
> Rami ; Reshetova, Elena ; 
> Mike Maloney
> ; Benjamin Poirier ; Thomas Gleixner 
> ; Greg
> Kroah-Hartman ; open list:NETWORKING [GENERAL] 
> ;
> open list 
> Subject: Re: [PATCH v2] packet: track ring entry use using a shadow ring to 
> prevent RX ring overrun
> 
> On Wed, May 23, 2018 at 7:54 AM, Jon Rosen (jrosen)  wrote:
> >> > For the ring, there is no requirement to allocate exactly the amount
> >> > specified by the user request. Safer than relying on shared memory
> >> > and simpler than the extra allocation in this patch would be to allocate
> >> > extra shadow memory at the end of the ring (and not mmap that).
> >> >
> >> > That still leaves an extra cold cacheline vs using tp_padding.
> >>
> >> Given my lack of experience and knowledge in writing kernel code
> >> it was easier for me to allocate the shadow ring as a separate
> >> structure.  Of course it's not about me and my skills so if it's
> >> more appropriate to allocate at the tail of the existing ring
> >> then certainly I can look at doing that.
> >
> > The memory for the ring is not one contiguous block, it's an array of
> > blocks of pages (or 'order' sized blocks of pages). I don't think
> > increasing the size of each of the blocks to provided storage would be
> > such a good idea as it will risk spilling over into the next order and
> > wasting lots of memory. I suspect it's also more complex than a single
> > shadow ring to do both the allocation and the access.
> >
> > It could be tacked onto the end of the pg_vec[] used to store the
> > pointers to the blocks. The challenge with that is that a pg_vec[] is
> > created for each of RX and TX rings so either it would have to
> > allocate unnecessary storage for TX or the caller will have to say if
> > extra space should be allocated or not.  E.g.:
> >
> > static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order, int 
> > scratch, void **scratch_p)
> >
> > I'm not sure avoiding the extra allocation and moving it to the
> > pg_vec[] for the RX ring is going to get the simplification you were
> > hoping for.  Is there another way of storing the shadow ring which
> > I should consider?
> 
> I did indeed mean attaching extra pages to pg_vec[]. It should be
> simpler than a separate structure, but I may be wrong.

I don't think it would be too bad, it may actually turn out to be
convenient to implement.

> 
> Either way, I still would prefer to avoid the shadow buffer completely.
> It incurs complexity and cycle cost on all users because of only the
> rare (non-existent?) consumer that overwrites the padding bytes.

I prefer that as well.  I'm just not sure there is a bulletproof
solution without the shadow state.  I also wish it were only a
theoretical issue but unfortunately it is actually something our
customers have seen.
> 
> Perhaps we can use padding yet avoid deadlock by writing a
> timed value. The simplest would be jiffies >> N. Then only a
> process that writes this exact value would be subject to drops and
> then still only for a limited period.
> 
> Instead of depending on wall clock time, like jiffies, another option
> would be to keep a percpu array of values. Each cpu has a zero
> entry if it is not writing, nonzero if it is. If a writer encounters a
> number in padding that is > num_cpus, then the state is garbage
> from userspace. If <= num_cpus, it is adhered to only until that cpu
> clears its entry, which is guaranteed to happen eventually.
> 
> Just a quick thought. This might not fly at all upon closer scrutiny.

I'm not sure I understand the suggestion, but I'll think on it
some more.

Some other options maybe worth considering (in no specific order):
- test the application to see if it will consume entries if tp_status
  is set to anything other than TP_STATUS_USER, only use shadow if
  it doesn't strictly honor the TP_STATUS_USER bit.

- skip shadow if we see new TP_STATUS_USER_TO_KERNEL is used

- use tp_len == -1 to indicate inuse





RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-23 Thread Jon Rosen (jrosen)
> > For the ring, there is no requirement to allocate exactly the amount
> > specified by the user request. Safer than relying on shared memory
> > and simpler than the extra allocation in this patch would be to allocate
> > extra shadow memory at the end of the ring (and not mmap that).
> >
> > That still leaves an extra cold cacheline vs using tp_padding.
> 
> Given my lack of experience and knowledge in writing kernel code
> it was easier for me to allocate the shadow ring as a separate
> structure.  Of course it's not about me and my skills so if it's
> more appropriate to allocate at the tail of the existing ring
> then certainly I can look at doing that.

The memory for the ring is not one contiguous block, it's an array of
blocks of pages (or 'order' sized blocks of pages). I don't think
increasing the size of each of the blocks to provided storage would be
such a good idea as it will risk spilling over into the next order and
wasting lots of memory. I suspect it's also more complex than a single
shadow ring to do both the allocation and the access.

It could be tacked onto the end of the pg_vec[] used to store the
pointers to the blocks. The challenge with that is that a pg_vec[] is
created for each of RX and TX rings so either it would have to
allocate unnecessary storage for TX or the caller will have to say if
extra space should be allocated or not.  E.g.:

static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order, int 
scratch, void **scratch_p)

I'm not sure avoiding the extra allocation and moving it to the
pg_vec[] for the RX ring is going to get the simplification you were
hoping for.  Is there another way of storing the shadow ring which
I should consider?


RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-23 Thread Jon Rosen (jrosen)
> >>> I think the bigger issues as you've pointed out are the cost of
> >>> the additional spin lock and should the additional state be
> >>> stored in-band (fewer cache lines) or out-of band (less risk of
> >>> breaking due to unpredictable application behavior).
> >>
> >> We don't need the spinlock if clearing the shadow byte after
> >> setting the status to user.
> >>
> >> Worst case, user will set it back to kernel while the shadow
> >> byte is not cleared yet and the next producer will drop a packet.
> >> But next producers will make progress, so there is no deadlock
> >> or corruption.
> >
> > I thought so too for a while but after spending more time than I
> > care to admit I realized the following sequence was occurring:
> >
> >Core A   Core B
> >--   --
> >- Enter spin_lock
> >-   Get tp_status of head (X)
> >tp_status == 0
> >-   Check inuse
> >inuse == 0
> >-   Allocate entry X
> >advance head (X+1)
> >set inuse=1
> >- Exit spin_lock
> >
> >  <N-1 more entries are allocated,
> >   where N = size of ring>
> >
> > - Enter spin_lock
> > -   get tp_status of head (X+N)
> > tp_status == 0 (but slot
> > in use for X on core A)
> >
> >- write tp_status of <--- trouble!
> >  X = TP_STATUS_USER <--- trouble!
> >- write inuse=0  <--- trouble!
> >
> > -   Check inuse
> > inuse == 0
> > -   Allocate entry X+N
> > advance head (X+N+1)
> > set inuse=1
> > - Exit spin_lock
> >
> >
> > At this point Core A just passed slot X to userspace with a
> > packet and Core B has just been assigned slot X+N (same slot as
> > X) for its new packet. Both cores A and B end up filling in that
> > slot.  Tracking this down was one of the reasons it took me a
> > while to produce these updated diffs.
> 
> Is this not just an ordering issue? Since inuse is set after tp_status,
> it has to be tested first (and barriers are needed to avoid reordering).

I changed the code as you suggest to do the inuse check first and
removed the extra added spin_lock/unlock and it seems to be working.
I was able to run through the night without an issue (normally I would
hit the ring corruption in 1 to 2 hours).

Thanks for pointing that out, I should have caught that myself.  Next
I'll look at your suggestion for where to put the shadow ring.


RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-22 Thread Jon Rosen (jrosen)
On Monday, May 21, 2018 2:17 PM, Jon Rosen (jrosen)  wrote:
> On Monday, May 21, 2018 1:07 PM, Willem de Bruijn
>  wrote:
>> On Mon, May 21, 2018 at 8:57 AM, Jon Rosen (jrosen)  wrote:

...snip...

>>
>> A setsockopt for userspace to signal a stricter interpretation of
>> tp_status to elide the shadow hack could then be considered.
>> It's not pretty. Either way, no full new version is required.
>>
>>> As much as I would like to find a solution that doesn't require
>>> the spin lock I have yet to do so. Maybe the answer is that
>>> existing applications will need to suffer the performance impact
>>> but a new version or option for TPACKET_V1/V2 could be added to
>>> indicate strict adherence of the TP_STATUS_USER bit and then the
>>> original diffs could be used.

It looks like adding new socket options is pretty rare so I
wonder if a better option might be to define a new TP_STATUS_XXX
bit which would signal from a userspace application to the kernel
that it strictly interprets the TP_STATUS_USER bit to determine
ownership.

Todays applications set tp_status = TP_STATUS_KERNEL(0) for the
kernel to pick up the entry.  We could define a new value to pass
ownership as well as one to indicate to other kernel threads that
an entry is inuse:

#define TP_STATUS_USER_TO_KERNEL (1 << 8)
#define TP_STATUS_INUSE (1 << 9)

If the kernel sees tp_status == TP_STATUS_KERNEL then it should
use the shadow method for tracking ownership. If it sees tp_status
== TP_STATUS_USER_TO_KERNEL then it can use the TP_STATUS_INUSE
method.

>>>
>>> There is another option I was considering but have yet to try
>>> which would avoid needing a shadow ring by using counter(s) to
>>> track maximum sequence number queued to userspace vs. the next
>>> sequence number to be allocated in the ring.  If the difference
>>> is greater than the size of the ring then the ring can be
>>> considered full and the allocation would fail. Of course this may
>>> create an additional hotspot between cores, not sure if that
>>> would be significant or not.
>>
>> Please do have a look, but I don't think that this will work in this
>> case in practice. It requires tracking the producer tail. Updating
>> the slowest writer requires probing each subsequent slot's status
>> byte to find the new tail, which is a lot of (by then cold) cacheline
>> reads.
>
> I've thought about it a little more and am not convinced it's
> workable but I'll spend a little more time on it before giving
> up.

I've given up on this method.  Just don't see how to make it work.



RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-21 Thread Jon Rosen (jrosen)
On Monday, May 21, 2018 1:07 PM, Willem de Bruijn
 wrote:
>On Mon, May 21, 2018 at 8:57 AM, Jon Rosen (jrosen)  wrote:
>> On Sunday, May 20, 2018 7:22 PM, Willem de Bruijn
>>  wrote:
>>> On Sun, May 20, 2018 at 6:51 PM, Willem de Bruijn
>>>  wrote:
>>>> On Sat, May 19, 2018 at 8:07 AM, Jon Rosen  wrote:
>>>>> Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
>>>>> causes the ring to get corrupted by allowing multiple kernel threads
>>>>> to claim ownership of the same ring entry. Track ownership in a shadow
>>>>> ring structure to prevent other kernel threads from reusing the same
>>>>> entry before it's fully filled in, passed to user space, and then
>>>>> eventually passed back to the kernel for use with a new packet.
>>>>>
>>>>> Signed-off-by: Jon Rosen 
>>>>> ---
>>>>>
>>>>> There is a bug in net/packet/af_packet.c:tpacket_rcv in how it manages
>>>>> the PACKET_RX_RING for versions TPACKET_V1 and TPACKET_V2.  This bug makes
>>>>> it possible for multiple kernel threads to claim ownership of the same
>>>>> ring entry, corrupting the ring and the corresponding packet(s).
>>>>>
>>>>> These diffs are the second proposed solution, previous proposal was 
>>>>> described
>>>>> in https://www.mail-archive.com/netdev@vger.kernel.org/msg227468.html
>>>>> subject [RFC PATCH] packet: mark ring entry as in-use inside spin_lock
>>>>> to prevent RX ring overrun
>>>>>
>>>>> Those diffs would have changed the binary interface and have broken 
>>>>> certain
>>>>> applications. Consensus was that such a change would be inappropriate.
>>>>>
>>>>> These new diffs use a shadow ring in kernel space for tracking 
>>>>> intermediate
>>>>> state of an entry and prevent more than one kernel thread from 
>>>>> simultaneously
>>>>> allocating a ring entry. This avoids any impact to the binary interface
>>>>> between kernel and userspace but comes at the additional cost of 
>>>>> requiring a
>>>>> second spin_lock when passing ownership of a ring entry to userspace.
>>>>>
>>>>> Jon Rosen (1):
>>>>>   packet: track ring entry use using a shadow ring to prevent RX ring
>>>>> overrun
>>>>>
>>>>>  net/packet/af_packet.c | 64 
>>>>> ++
>>>>>  net/packet/internal.h  | 14 +++
>>>>>  2 files changed, 78 insertions(+)
>>>>>
>>>>
>>>>> @@ -2383,7 +2412,11 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
>>>>> net_device *dev,
>>>>>  #endif
>>>>>
>>>>> if (po->tp_version <= TPACKET_V2) {
>>>>> +   spin_lock(&sk->sk_receive_queue.lock);
>>>>> __packet_set_status(po, h.raw, status);
>>>>> +   packet_rx_shadow_release(rx_shadow_ring_entry);
>>>>> +   spin_unlock(&sk->sk_receive_queue.lock);
>>>>> +
>>>>> sk->sk_data_ready(sk);
>>>>
>>>> Thanks for continuing to look at this. I spent some time on it last time
>>>> around but got stuck, too.
>>>>
>>>> This version takes an extra spinlock in the hot path. That will be very
>>>> expensive. Once we need to accept that, we could opt for a simpler
>>>> implementation akin to the one discussed in the previous thread:
>>>>
>>>> stash a value in tp_padding or similar while tp_status remains
>>>> TP_STATUS_KERNEL to signal ownership to concurrent kernel
>>>> threads. The issue previously was that that field could not atomically
>>>> be cleared together with __packet_set_status. This is no longer
>>>> an issue when holding the queue lock.
>>>>
>>>> With a field like tp_padding, unlike tp_len, it is arguably also safe to
>>>> clear it after flipping status (userspace should treat it as undefined).
>>>>
>>>> With v1 tpacket_hdr, no explicit padding field is defined but due to
>>>> TPACKET_HDRLEN alignment it exists on both 32 and 64 bit
>>>> platforms.
>>>>
>>>> The danger with using padding is that a process may write to it
>>>> and cause deadlock, of course. There is no logical reason for doing
>>>> so.

RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-21 Thread Jon Rosen (jrosen)
On Sunday, May 20, 2018 7:22 PM, Willem de Bruijn
 wrote:
> On Sun, May 20, 2018 at 6:51 PM, Willem de Bruijn
>  wrote:
>> On Sat, May 19, 2018 at 8:07 AM, Jon Rosen  wrote:
>>> Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
>>> causes the ring to get corrupted by allowing multiple kernel threads
>>> to claim ownership of the same ring entry. Track ownership in a shadow
>>> ring structure to prevent other kernel threads from reusing the same
>>> entry before it's fully filled in, passed to user space, and then
>>> eventually passed back to the kernel for use with a new packet.
>>>
>>> Signed-off-by: Jon Rosen 
>>> ---
>>>
>>> There is a bug in net/packet/af_packet.c:tpacket_rcv in how it manages
>>> the PACKET_RX_RING for versions TPACKET_V1 and TPACKET_V2.  This bug makes
>>> it possible for multiple kernel threads to claim ownership of the same
>>> ring entry, corrupting the ring and the corresponding packet(s).
>>>
>>> These diffs are the second proposed solution, previous proposal was 
>>> described
>>> in https://www.mail-archive.com/netdev@vger.kernel.org/msg227468.html
>>> subject [RFC PATCH] packet: mark ring entry as in-use inside spin_lock
>>> to prevent RX ring overrun
>>>
>>> Those diffs would have changed the binary interface and have broken certain
>>> applications. Consensus was that such a change would be inappropriate.
>>>
>>> These new diffs use a shadow ring in kernel space for tracking intermediate
>>> state of an entry and prevent more than one kernel thread from 
>>> simultaneously
>>> allocating a ring entry. This avoids any impact to the binary interface
>>> between kernel and userspace but comes at the additional cost of requiring a
>>> second spin_lock when passing ownership of a ring entry to userspace.
>>>
>>> Jon Rosen (1):
>>>   packet: track ring entry use using a shadow ring to prevent RX ring
>>> overrun
>>>
>>>  net/packet/af_packet.c | 64 
>>> ++
>>>  net/packet/internal.h  | 14 +++
>>>  2 files changed, 78 insertions(+)
>>>
>>
>>> @@ -2383,7 +2412,11 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
>>> net_device *dev,
>>>  #endif
>>>
>>> if (po->tp_version <= TPACKET_V2) {
>>> +   spin_lock(&sk->sk_receive_queue.lock);
>>> __packet_set_status(po, h.raw, status);
>>> +   packet_rx_shadow_release(rx_shadow_ring_entry);
>>> +   spin_unlock(&sk->sk_receive_queue.lock);
>>> +
>>> sk->sk_data_ready(sk);
>>
>> Thanks for continuing to look at this. I spent some time on it last time
>> around but got stuck, too.
>>
>> This version takes an extra spinlock in the hot path. That will be very
>> expensive. Once we need to accept that, we could opt for a simpler
>> implementation akin to the one discussed in the previous thread:
>>
>> stash a value in tp_padding or similar while tp_status remains
>> TP_STATUS_KERNEL to signal ownership to concurrent kernel
>> threads. The issue previously was that that field could not atomically
>> be cleared together with __packet_set_status. This is no longer
>> an issue when holding the queue lock.
>>
>> With a field like tp_padding, unlike tp_len, it is arguably also safe to
>> clear it after flipping status (userspace should treat it as undefined).
>>
>> With v1 tpacket_hdr, no explicit padding field is defined but due to
>> TPACKET_HDRLEN alignment it exists on both 32 and 64 bit
>> platforms.
>>
>> The danger with using padding is that a process may write to it
>> and cause deadlock, of course. There is no logical reason for doing
>> so.
>
> For the ring, there is no requirement to allocate exactly the amount
> specified by the user request. Safer than relying on shared memory
> and simpler than the extra allocation in this patch would be to allocate
> extra shadow memory at the end of the ring (and not mmap that).
>
> That still leaves an extra cold cacheline vs using tp_padding.

Given my lack of experience and knowledge in writing kernel code
it was easier for me to allocate the shadow ring as a separate
structure.  Of course it's not about me and my skills so if it's
more appropriate to allocate at the tail of the existing ring
then certainly I can look at doing that.

I think the bigger issues as you've 

RE: [PATCH net-next] tipc: eliminate complaint of KMSAN uninit-value in tipc_conn_rcv_sub

2018-05-21 Thread Jon Maloy


> -Original Message-
> From: netdev-ow...@vger.kernel.org 
> On Behalf Of David Miller
> Sent: Saturday, May 19, 2018 23:00
> To: ying@windriver.com
> Cc: netdev@vger.kernel.org; Jon Maloy ;
> syzkaller-b...@googlegroups.com; tipc-discuss...@lists.sourceforge.net
> Subject: Re: [PATCH net-next] tipc: eliminate complaint of KMSAN uninit-
> value in tipc_conn_rcv_sub
> 
> From: Ying Xue 
> Date: Fri, 18 May 2018 19:50:55 +0800
> 
> > As variable s of struct tipc_subscr type is not initialized in
> > tipc_conn_rcv_from_sock() before it is used in tipc_conn_rcv_sub(),
> > KMSAN reported the following uninit-value type complaint:
> 
> I agree with others that the short read is the bug.
> 
> You need to decide what should happen if not a full tipc_subscr object is
> obtained from the sock_recvmsg() call.
> 
> Proceeding to pass it on to tipc_conn_rcv_sub() cannot possibly be correct.
> 
> You're not getting what you are expecting from the peer, the memset() you
> are adding doesn't change that.
> 
> And once you get this badly sized read, what does that do to the stream of
> subsequent recvmsg calls here?

This socket/connection is of type SOCK_SEQPACKET, so if anything like this 
happens, it is an error, and the connection should be aborted.
///jon



RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-05-19 Thread Jon Rosen (jrosen)
Forward link to a new proposed patch at:
https://www.mail-archive.com/netdev@vger.kernel.org/msg236629.html



[PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-19 Thread Jon Rosen
Fix a PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 that
causes the ring to get corrupted by allowing multiple kernel threads
to claim ownership of the same ring entry. Track ownership in a shadow
ring structure to prevent other kernel threads from reusing the same
entry before it's fully filled in, passed to user space, and then
eventually passed back to the kernel for use with a new packet.

Signed-off-by: Jon Rosen 
---

There is a bug in net/packet/af_packet.c:tpacket_rcv in how it manages
the PACKET_RX_RING for versions TPACKET_V1 and TPACKET_V2.  This bug makes
it possible for multiple kernel threads to claim ownership of the same
ring entry, corrupting the ring and the corresponding packet(s).

These diffs are the second proposed solution; the previous proposal was described
in https://www.mail-archive.com/netdev@vger.kernel.org/msg227468.html
subject [RFC PATCH] packet: mark ring entry as in-use inside spin_lock
to prevent RX ring overrun

Those diffs would have changed the binary interface and have broken certain
applications. Consensus was that such a change would be inappropriate.

These new diffs use a shadow ring in kernel space for tracking intermediate
state of an entry and prevent more than one kernel thread from simultaneously
allocating a ring entry. This avoids any impact to the binary interface
between kernel and userspace but comes at the additional cost of requiring a
second spin_lock when passing ownership of a ring entry to userspace.

Jon Rosen (1):
  packet: track ring entry use using a shadow ring to prevent RX ring
overrun

 net/packet/af_packet.c | 64 ++
 net/packet/internal.h  | 14 +++
 2 files changed, 78 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index e0f3f4a..4d08c8e 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2165,6 +2165,26 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
return 0;
 }
 
+static inline void *packet_rx_shadow_aquire_head(struct packet_sock *po)
+{
+   struct packet_ring_shadow_entry *entry;
+
+   entry = &po->rx_shadow.ring[po->rx_ring.head];
+   if (unlikely(entry->inuse))
+   return NULL;
+
+   entry->inuse = 1;
+   return (void *)entry;
+}
+
+static inline void packet_rx_shadow_release(void *_entry)
+{
+   struct packet_ring_shadow_entry *entry;
+
+   entry = (struct packet_ring_shadow_entry *)_entry;
+   entry->inuse = 0;
+}
+
 static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
   struct packet_type *pt, struct net_device *orig_dev)
 {
@@ -2182,6 +2202,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
__u32 ts_status;
bool is_drop_n_account = false;
bool do_vnet = false;
+   void *rx_shadow_ring_entry = NULL;
 
	/* struct tpacket{2,3}_hdr is aligned to a multiple of TPACKET_ALIGNMENT.
 * We may add members to them until current aligned size without forcing
@@ -2277,7 +2298,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
if (!h.raw)
goto drop_n_account;
if (po->tp_version <= TPACKET_V2) {
+   /* Attempt to allocate shadow ring entry.
+* If already inuse then the ring is full.
+*/
+   rx_shadow_ring_entry = packet_rx_shadow_aquire_head(po);
+   if (unlikely(!rx_shadow_ring_entry))
+   goto ring_is_full;
+
packet_increment_rx_head(po, &po->rx_ring);
+
/*
 * LOSING will be reported till you read the stats,
 * because it's COR - Clear On Read.
@@ -2383,7 +2412,11 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 #endif
 
if (po->tp_version <= TPACKET_V2) {
+   spin_lock(&sk->sk_receive_queue.lock);
__packet_set_status(po, h.raw, status);
+   packet_rx_shadow_release(rx_shadow_ring_entry);
+   spin_unlock(&sk->sk_receive_queue.lock);
+
sk->sk_data_ready(sk);
} else {
prb_clear_blk_fill_status(&po->rx_ring);
@@ -4197,6 +4230,25 @@ static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
goto out;
 }
 
+static struct packet_ring_shadow_entry *
+   packet_rx_shadow_alloc(unsigned int tp_frame_nr)
+{
+   struct packet_ring_shadow_entry *rx_shadow_ring;
+   int ring_size;
+   int i;
+
+   ring_size = tp_frame_nr * sizeof(*rx_shadow_ring);
+   rx_shadow_ring = kmalloc(ring_size, GFP_KERNEL);
+
+   if (!rx_shadow_ring)
+   return NULL;
+
+   for (i = 0; i < tp_frame_nr; i++)
+   rx_shadow_ring[i].inuse = 0;
+
+   return rx_shadow_ring;
+}
+
 static int packet_set_ring(struct sock *sk, union tpacket_re

[iproute2-next v3 1/1] tipc: fixed node and name table listings

2018-05-17 Thread Jon Maloy
We make it easier for users to correlate between 128-bit node
identities and 32-bit node hash number by extending the 'node list'
command to also show the hash number.

We also improve the 'nametable show' command to show the node identity
instead of the node hash number. Since the former potentially is much
longer than the latter, we make room for it by eliminating the (to the
user) irrelevant publication key. We also reorder some of the columns so
that the node id comes last, since this looks nicer and is more logical.

---
v2: Fixed compiler warning as per comment from David Ahern
v3: Fixed leaking socket as per comment from David Ahern

Signed-off-by: Jon Maloy 
---
 tipc/misc.c  | 20 
 tipc/misc.h  |  1 +
 tipc/nametable.c | 18 ++
 tipc/node.c  | 19 ---
 tipc/peer.c  |  4 
 5 files changed, 43 insertions(+), 19 deletions(-)

diff --git a/tipc/misc.c b/tipc/misc.c
index 16849f1..e4b1cd0 100644
--- a/tipc/misc.c
+++ b/tipc/misc.c
@@ -13,6 +13,10 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
 #include "misc.h"
 
 #define IN_RANGE(val, low, high) ((val) <= (high) && (val) >= (low))
@@ -109,3 +113,19 @@ void nodeid2str(uint8_t *id, char *str)
for (i = 31; str[i] == '0'; i--)
str[i] = 0;
 }
+
+void hash2nodestr(uint32_t hash, char *str)
+{
+   struct tipc_sioc_nodeid_req nr = {};
+   int sd;
+
+   sd = socket(AF_TIPC, SOCK_RDM, 0);
+   if (sd < 0) {
+   fprintf(stderr, "opening TIPC socket: %s\n", strerror(errno));
+   return;
+   }
+   nr.peer = hash;
+   if (!ioctl(sd, SIOCGETNODEID, &nr))
+   nodeid2str((uint8_t *)nr.node_id, str);
+   close(sd);
+}
diff --git a/tipc/misc.h b/tipc/misc.h
index 6e8afdd..ff2f31f 100644
--- a/tipc/misc.h
+++ b/tipc/misc.h
@@ -17,5 +17,6 @@
 uint32_t str2addr(char *str);
 int str2nodeid(char *str, uint8_t *id);
 void nodeid2str(uint8_t *id, char *str);
+void hash2nodestr(uint32_t hash, char *str);
 
 #endif
diff --git a/tipc/nametable.c b/tipc/nametable.c
index 2578940..ae73dfa 100644
--- a/tipc/nametable.c
+++ b/tipc/nametable.c
@@ -20,6 +20,7 @@
 #include "cmdl.h"
 #include "msg.h"
 #include "nametable.h"
+#include "misc.h"
 
 #define PORTID_STR_LEN 45 /* Four u32 and five delimiter chars */
 
@@ -31,6 +32,7 @@ static int nametable_show_cb(const struct nlmsghdr *nlh, void *data)
struct nlattr *attrs[TIPC_NLA_NAME_TABLE_MAX + 1] = {};
struct nlattr *publ[TIPC_NLA_PUBL_MAX + 1] = {};
const char *scope[] = { "", "zone", "cluster", "node" };
+   char str[33] = {0,};
 
mnl_attr_parse(nlh, sizeof(*genl), parse_attrs, info);
if (!info[TIPC_NLA_NAME_TABLE])
@@ -45,20 +47,20 @@ static int nametable_show_cb(const struct nlmsghdr *nlh, void *data)
return MNL_CB_ERROR;
 
if (!*iteration)
-   printf("%-10s %-10s %-10s %-10s %-10s %-10s\n",
-  "Type", "Lower", "Upper", "Node", "Port",
-  "Publication Scope");
+   printf("%-10s %-10s %-10s %-8s %-10s %-33s\n",
+  "Type", "Lower", "Upper", "Scope", "Port",
+  "Node");
(*iteration)++;
 
-   printf("%-10u %-10u %-10u %-10x %-10u %-12u",
+   hash2nodestr(mnl_attr_get_u32(publ[TIPC_NLA_PUBL_NODE]), str);
+
+   printf("%-10u %-10u %-10u %-8s %-10u %s\n",
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_TYPE]),
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_LOWER]),
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_UPPER]),
-  mnl_attr_get_u32(publ[TIPC_NLA_PUBL_NODE]),
+  scope[mnl_attr_get_u32(publ[TIPC_NLA_PUBL_SCOPE])],
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_REF]),
-  mnl_attr_get_u32(publ[TIPC_NLA_PUBL_KEY]));
-
-   printf("%s\n", scope[mnl_attr_get_u32(publ[TIPC_NLA_PUBL_SCOPE])]);
+  str);
 
return MNL_CB_OK;
 }
diff --git a/tipc/node.c b/tipc/node.c
index b73b644..0fa1064 100644
--- a/tipc/node.c
+++ b/tipc/node.c
@@ -26,10 +26,11 @@
 
 static int node_list_cb(const struct nlmsghdr *nlh, void *data)
 {
-   uint32_t addr;
struct genlmsghdr *genl = mnl_nlmsg_get_payload(nlh);
struct nlattr *info[TIPC_NLA_MAX + 1] = {};
struct nlattr *attrs[TIPC_NLA_NODE_MAX + 1] = {};
+   char str[33] = {};
+   uint32_t addr;
 
mnl_attr_parse(nlh, sizeof(*genl), parse_attrs, info);
if (!info[TIPC_NLA_NODE])
@@ -40,13 +41,12 @@ static int node_list_cb(const struct nlmsghdr *nlh, void *data)
return MNL_

[iproute2-next v2 1/1] tipc: fixed node and name table listings

2018-05-15 Thread Jon Maloy
We make it easier for users to correlate between 128-bit node
identities and 32-bit node hash number by extending the 'node list'
command to also show the hash number.

We also improve the 'nametable show' command to show the node identity
instead of the node hash number. Since the former potentially is much
longer than the latter, we make room for it by eliminating the (to the
user) irrelevant publication key. We also reorder some of the columns so
that the node id comes last, since this looks nicer and is more logical.

---
v2: Fixed compiler warning as per comment from David Ahern

Signed-off-by: Jon Maloy 
---
 tipc/misc.c  | 18 ++
 tipc/misc.h  |  1 +
 tipc/nametable.c | 18 ++
 tipc/node.c  | 19 ---
 tipc/peer.c  |  4 
 5 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/tipc/misc.c b/tipc/misc.c
index 16849f1..e8b726f 100644
--- a/tipc/misc.c
+++ b/tipc/misc.c
@@ -13,6 +13,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 #include "misc.h"
 
 #define IN_RANGE(val, low, high) ((val) <= (high) && (val) >= (low))
@@ -109,3 +112,18 @@ void nodeid2str(uint8_t *id, char *str)
for (i = 31; str[i] == '0'; i--)
str[i] = 0;
 }
+
+void hash2nodestr(uint32_t hash, char *str)
+{
+   struct tipc_sioc_nodeid_req nr = {};
+   int sd;
+
+   sd = socket(AF_TIPC, SOCK_RDM, 0);
+   if (sd < 0) {
+   fprintf(stderr, "opening TIPC socket: %s\n", strerror(errno));
+   return;
+   }
+   nr.peer = hash;
+   if (!ioctl(sd, SIOCGETNODEID, &nr))
+   nodeid2str((uint8_t *)nr.node_id, str);
+}
diff --git a/tipc/misc.h b/tipc/misc.h
index 6e8afdd..ff2f31f 100644
--- a/tipc/misc.h
+++ b/tipc/misc.h
@@ -17,5 +17,6 @@
 uint32_t str2addr(char *str);
 int str2nodeid(char *str, uint8_t *id);
 void nodeid2str(uint8_t *id, char *str);
+void hash2nodestr(uint32_t hash, char *str);
 
 #endif
diff --git a/tipc/nametable.c b/tipc/nametable.c
index 2578940..ae73dfa 100644
--- a/tipc/nametable.c
+++ b/tipc/nametable.c
@@ -20,6 +20,7 @@
 #include "cmdl.h"
 #include "msg.h"
 #include "nametable.h"
+#include "misc.h"
 
 #define PORTID_STR_LEN 45 /* Four u32 and five delimiter chars */
 
@@ -31,6 +32,7 @@ static int nametable_show_cb(const struct nlmsghdr *nlh, void *data)
struct nlattr *attrs[TIPC_NLA_NAME_TABLE_MAX + 1] = {};
struct nlattr *publ[TIPC_NLA_PUBL_MAX + 1] = {};
const char *scope[] = { "", "zone", "cluster", "node" };
+   char str[33] = {0,};
 
mnl_attr_parse(nlh, sizeof(*genl), parse_attrs, info);
if (!info[TIPC_NLA_NAME_TABLE])
@@ -45,20 +47,20 @@ static int nametable_show_cb(const struct nlmsghdr *nlh, void *data)
return MNL_CB_ERROR;
 
if (!*iteration)
-   printf("%-10s %-10s %-10s %-10s %-10s %-10s\n",
-  "Type", "Lower", "Upper", "Node", "Port",
-  "Publication Scope");
+   printf("%-10s %-10s %-10s %-8s %-10s %-33s\n",
+  "Type", "Lower", "Upper", "Scope", "Port",
+  "Node");
(*iteration)++;
 
-   printf("%-10u %-10u %-10u %-10x %-10u %-12u",
+   hash2nodestr(mnl_attr_get_u32(publ[TIPC_NLA_PUBL_NODE]), str);
+
+   printf("%-10u %-10u %-10u %-8s %-10u %s\n",
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_TYPE]),
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_LOWER]),
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_UPPER]),
-  mnl_attr_get_u32(publ[TIPC_NLA_PUBL_NODE]),
+  scope[mnl_attr_get_u32(publ[TIPC_NLA_PUBL_SCOPE])],
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_REF]),
-  mnl_attr_get_u32(publ[TIPC_NLA_PUBL_KEY]));
-
-   printf("%s\n", scope[mnl_attr_get_u32(publ[TIPC_NLA_PUBL_SCOPE])]);
+  str);
 
return MNL_CB_OK;
 }
diff --git a/tipc/node.c b/tipc/node.c
index b73b644..0fa1064 100644
--- a/tipc/node.c
+++ b/tipc/node.c
@@ -26,10 +26,11 @@
 
 static int node_list_cb(const struct nlmsghdr *nlh, void *data)
 {
-   uint32_t addr;
struct genlmsghdr *genl = mnl_nlmsg_get_payload(nlh);
struct nlattr *info[TIPC_NLA_MAX + 1] = {};
struct nlattr *attrs[TIPC_NLA_NODE_MAX + 1] = {};
+   char str[33] = {};
+   uint32_t addr;
 
mnl_attr_parse(nlh, sizeof(*genl), parse_attrs, info);
if (!info[TIPC_NLA_NODE])
@@ -40,13 +41,12 @@ static int node_list_cb(const struct nlmsghdr *nlh, void *data)
return MNL_CB_ERROR;
 
addr = mnl_attr_get_u32(attrs[TIPC_NLA_NODE_ADDR]);
-   printf("%

[PATCH net-next v2] tcp: Add mark for TIMEWAIT sockets

2018-05-09 Thread Jon Maxwell
This version incorporates some suggestions from Eric Dumazet:

- Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP 
races. 
- Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark"
statement. 
- Factorize code as sk_fullsock() check is not necessary.

Aidan McGurn from Openwave Mobility systems reported the following bug:

"Marked routing is broken on customer deployment. Its effects are large 
increase in Uplink retransmissions caused by the client never receiving 
the final ACK to their FINACK - this ACK misses the mark and routes out 
of the incorrect route."

Currently marks are added to sk_buffs for replies when the "fwmark_reflect" 
sysctl is enabled. But not for TW sockets that had sk->sk_mark set via 
setsockopt(SO_MARK..).  

Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the
original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark
location.
Then propagate this so that the skb gets sent with the correct mark. Do the same
for resets. Give the "fwmark_reflect" sysctl precedence over sk->sk_mark so that
netfilter rules are still honored.

Signed-off-by: Jon Maxwell 
---
 include/net/inet_timewait_sock.h |  1 +
 net/ipv4/ip_output.c |  2 +-
 net/ipv4/tcp_ipv4.c  | 16 ++--
 net/ipv4/tcp_minisocks.c |  1 +
 net/ipv6/tcp_ipv6.c  |  6 +-
 5 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index c7be1ca8e562..659d8ed5a3bc 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -62,6 +62,7 @@ struct inet_timewait_sock {
 #define tw_dr  __tw_common.skc_tw_dr
 
int tw_timeout;
+   __u32   tw_mark;
volatile unsigned char  tw_substate;
unsigned char   tw_rcv_wscale;
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 95adb171f852..b5e21eb198d8 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1561,7 +1561,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
oif = skb->skb_iif;
 
flowi4_init_output(&fl4, oif,
-  IP4_REPLY_MARK(net, skb->mark),
+  IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark,
   RT_TOS(arg->tos),
   RT_SCOPE_UNIVERSE, ip_hdr(skb)->protocol,
   ip_reply_arg_flowi_flags(arg),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b50838..caf23de88f8a 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -621,6 +621,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
struct sock *sk1 = NULL;
 #endif
struct net *net;
+   struct sock *ctl_sk;
 
/* Never send a reset in response to a reset. */
if (th->rst)
@@ -723,11 +724,16 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
arg.tos = ip_hdr(skb)->tos;
arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
-   ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
+   ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
+   if (sk)
+   ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
+  inet_twsk(sk)->tw_mark : sk->sk_mark;
+   ip_send_unicast_reply(ctl_sk,
  skb, &TCP_SKB_CB(skb)->header.h4.opt,
  ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
  &arg, arg.iov[0].iov_len);
 
+   ctl_sk->sk_mark = 0;
__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
__TCP_INC_STATS(net, TCP_MIB_OUTRSTS);
local_bh_enable();
@@ -759,6 +765,7 @@ static void tcp_v4_send_ack(const struct sock *sk,
} rep;
struct net *net = sock_net(sk);
struct ip_reply_arg arg;
+   struct sock *ctl_sk;
 
memset(&rep.th, 0, sizeof(struct tcphdr));
memset(&arg, 0, sizeof(arg));
@@ -809,11 +816,16 @@ static void tcp_v4_send_ack(const struct sock *sk,
arg.tos = tos;
arg.uid = sock_net_uid(net, sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
-   ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
+   ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
+   if (sk)
+   ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
+  inet_twsk(sk)->tw_mark : sk->sk_mark;
+   ip_send_unicast_reply(ctl_sk,
  skb, &TCP_SKB_CB(skb)->header.h4.opt,
  ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
  &arg, arg.iov[0].iov_len);

[PATCH net-next v1] tcp: Add mark for TIMEWAIT sockets

2018-05-09 Thread Jon Maxwell
This version incorporates some suggestions from Eric Dumazet:

- Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP 
races. 
- Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark"
statement. 

Aidan McGurn from Openwave Mobility systems reported the following bug:

"Marked routing is broken on customer deployment. Its effects are large 
increase in Uplink retransmissions caused by the client never receiving 
the final ACK to their FINACK - this ACK misses the mark and routes out 
of the incorrect route."

Currently marks are added to sk_buffs for replies when the "fwmark_reflect" 
sysctl is enabled. But not for TW sockets that had sk->sk_mark set via 
setsockopt(SO_MARK..).  

Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the
original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark
location.
Then propagate this so that the skb gets sent with the correct mark. Do the same
for resets. Give the "fwmark_reflect" sysctl precedence over sk->sk_mark so that
netfilter rules are still honored.

Signed-off-by: Jon Maxwell 
---
 include/net/inet_timewait_sock.h |  1 +
 net/ipv4/ip_output.c |  2 +-
 net/ipv4/tcp_ipv4.c  | 18 --
 net/ipv4/tcp_minisocks.c |  1 +
 net/ipv6/tcp_ipv6.c  |  7 ++-
 5 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index c7be1ca8e562..659d8ed5a3bc 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -62,6 +62,7 @@ struct inet_timewait_sock {
 #define tw_dr  __tw_common.skc_tw_dr
 
int tw_timeout;
+   __u32   tw_mark;
volatile unsigned char  tw_substate;
unsigned char   tw_rcv_wscale;
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 95adb171f852..b5e21eb198d8 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1561,7 +1561,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
oif = skb->skb_iif;
 
flowi4_init_output(&fl4, oif,
-  IP4_REPLY_MARK(net, skb->mark),
+  IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark,
   RT_TOS(arg->tos),
   RT_SCOPE_UNIVERSE, ip_hdr(skb)->protocol,
   ip_reply_arg_flowi_flags(arg),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b50838..fbee36579c83 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -621,6 +621,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
struct sock *sk1 = NULL;
 #endif
struct net *net;
+   struct sock *ctl_sk;
 
/* Never send a reset in response to a reset. */
if (th->rst)
@@ -723,11 +724,17 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
arg.tos = ip_hdr(skb)->tos;
arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
-   ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
+   ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
+   if (sk && sk->sk_state == TCP_TIME_WAIT)
+   ctl_sk->sk_mark = inet_twsk(sk)->tw_mark;
+   else if (sk && sk_fullsock(sk))
+   ctl_sk->sk_mark = sk->sk_mark;
+   ip_send_unicast_reply(ctl_sk,
  skb, &TCP_SKB_CB(skb)->header.h4.opt,
  ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
  &arg, arg.iov[0].iov_len);
 
+   ctl_sk->sk_mark = 0;
__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
__TCP_INC_STATS(net, TCP_MIB_OUTRSTS);
local_bh_enable();
@@ -759,6 +766,7 @@ static void tcp_v4_send_ack(const struct sock *sk,
} rep;
struct net *net = sock_net(sk);
struct ip_reply_arg arg;
+   struct sock *ctl_sk;
 
memset(&rep.th, 0, sizeof(struct tcphdr));
memset(&arg, 0, sizeof(arg));
@@ -809,11 +817,17 @@ static void tcp_v4_send_ack(const struct sock *sk,
arg.tos = tos;
arg.uid = sock_net_uid(net, sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
-   ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
+   ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
+   if (sk && sk->sk_state == TCP_TIME_WAIT)
+   ctl_sk->sk_mark = inet_twsk(sk)->tw_mark;
+   else if (sk && sk_fullsock(sk))
+   ctl_sk->sk_mark = sk->sk_mark;
+   ip_send_unicast_reply(ctl_sk,
  skb, &TCP_SKB_CB(skb)->header.h4.opt,
  ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
 

[PATCH net-next] tcp: Add mark for TIMEWAIT sockets

2018-05-09 Thread Jon Maxwell
Aidan McGurn from Openwave Mobility systems reported the following bug:

"Marked routing is broken on customer deployment. Its effects are large 
increase in Uplink retransmissions caused by the client never receiving 
the final ACK to their FINACK - this ACK misses the mark and routes out 
of the incorrect route."

Currently marks are added to sk_buffs for replies when the "fwmark_reflect" 
sysctl is enabled. But not for TIME_WAIT sockets where the original socket had 
sk->sk_mark set via setsockopt(SO_MARK..).  

Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the 
original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark 
location. 
Then copy this into ctl_sk->sk_mark so that the skb gets sent with the correct 
mark. Do the same for resets. Give the "fwmark_reflect" sysctl precedence over 
sk->sk_mark so that netfilter rules are still honored.

Signed-off-by: Jon Maxwell 
---
 include/net/inet_timewait_sock.h |  1 +
 net/ipv4/ip_output.c |  3 ++-
 net/ipv4/tcp_ipv4.c  | 18 --
 net/ipv4/tcp_minisocks.c |  1 +
 net/ipv6/tcp_ipv6.c  |  8 +++-
 5 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index c7be1ca8e562..659d8ed5a3bc 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -62,6 +62,7 @@ struct inet_timewait_sock {
 #define tw_dr  __tw_common.skc_tw_dr
 
int tw_timeout;
+   __u32   tw_mark;
volatile unsigned char  tw_substate;
unsigned char   tw_rcv_wscale;
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 95adb171f852..cca4412dc4cb 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1539,6 +1539,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
struct sk_buff *nskb;
int err;
int oif;
+   __u32 mark = IP4_REPLY_MARK(net, skb->mark);
 
if (__ip_options_echo(net, &replyopts.opt.opt, skb, sopt))
return;
@@ -1561,7 +1562,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
oif = skb->skb_iif;
 
flowi4_init_output(&fl4, oif,
-  IP4_REPLY_MARK(net, skb->mark),
+  mark ? (mark) : sk->sk_mark,
   RT_TOS(arg->tos),
   RT_SCOPE_UNIVERSE, ip_hdr(skb)->protocol,
   ip_reply_arg_flowi_flags(arg),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b50838..fbee36579c83 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -621,6 +621,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
struct sock *sk1 = NULL;
 #endif
struct net *net;
+   struct sock *ctl_sk;
 
/* Never send a reset in response to a reset. */
if (th->rst)
@@ -723,11 +724,17 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
arg.tos = ip_hdr(skb)->tos;
arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
-   ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
+   ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
+   if (sk && sk->sk_state == TCP_TIME_WAIT)
+   ctl_sk->sk_mark = inet_twsk(sk)->tw_mark;
+   else if (sk && sk_fullsock(sk))
+   ctl_sk->sk_mark = sk->sk_mark;
+   ip_send_unicast_reply(ctl_sk,
  skb, &TCP_SKB_CB(skb)->header.h4.opt,
  ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
  &arg, arg.iov[0].iov_len);
 
+   ctl_sk->sk_mark = 0;
__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
__TCP_INC_STATS(net, TCP_MIB_OUTRSTS);
local_bh_enable();
@@ -759,6 +766,7 @@ static void tcp_v4_send_ack(const struct sock *sk,
} rep;
struct net *net = sock_net(sk);
struct ip_reply_arg arg;
+   struct sock *ctl_sk;
 
memset(&rep.th, 0, sizeof(struct tcphdr));
memset(&arg, 0, sizeof(arg));
@@ -809,11 +817,17 @@ static void tcp_v4_send_ack(const struct sock *sk,
arg.tos = tos;
arg.uid = sock_net_uid(net, sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
-   ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
+   ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
+   if (sk && sk->sk_state == TCP_TIME_WAIT)
+   ctl_sk->sk_mark = inet_twsk(sk)->tw_mark;
+   else if (sk && sk_fullsock(sk))
+   ctl_sk->sk_mark = sk->sk_mark;
+   ip_send_unicast_reply(ctl_sk,
  skb,

RE: [PATCH net] tipc: fix one byte leak in tipc_sk_set_orig_addr()

2018-05-09 Thread Jon Maloy
Acked-by: Jon Maloy 

Thank you Eric.

> -Original Message-
> From: Eric Dumazet [mailto:eduma...@google.com]
> Sent: Wednesday, May 09, 2018 09:50
> To: David S . Miller 
> Cc: netdev ; Eric Dumazet
> ; Eric Dumazet ; Jon
> Maloy ; Ying Xue 
> Subject: [PATCH net] tipc: fix one byte leak in tipc_sk_set_orig_addr()
> 
> sysbot/KMSAN reported an uninit-value in recvmsg() that I tracked down to
> tipc_sk_set_orig_addr(), missing
> srcaddr->member.scope initialization.
> 
> This patches moves srcaddr->sock.scope init to follow fields order and ease
> future verifications.
> 
> BUG: KMSAN: uninit-value in copy_to_user include/linux/uaccess.h:184
> [inline]
> BUG: KMSAN: uninit-value in move_addr_to_user+0x32e/0x530
> net/socket.c:226
> CPU: 0 PID: 4549 Comm: syz-executor287 Not tainted 4.17.0-rc3+ #88
> Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS Google 01/01/2011 Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x185/0x1d0 lib/dump_stack.c:113
>  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
>  kmsan_internal_check_memory+0x135/0x1e0 mm/kmsan/kmsan.c:1157
>  kmsan_copy_to_user+0x69/0x160 mm/kmsan/kmsan.c:1199  copy_to_user
> include/linux/uaccess.h:184 [inline]
>  move_addr_to_user+0x32e/0x530 net/socket.c:226
>  ___sys_recvmsg+0x4e2/0x810 net/socket.c:2285  __sys_recvmsg
> net/socket.c:2328 [inline]  __do_sys_recvmsg net/socket.c:2338 [inline]
> __se_sys_recvmsg net/socket.c:2335 [inline]
>  __x64_sys_recvmsg+0x325/0x460 net/socket.c:2335
>  do_syscall_64+0x154/0x220 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x4455e9
> RSP: 002b:7fe3bd36ddb8 EFLAGS: 0246 ORIG_RAX:
> 002f
> RAX: ffda RBX: 006dac24 RCX: 004455e9
> RDX: 2002 RSI: 2400 RDI: 0003
> RBP: 006dac20 R08:  R09: 
> R10:  R11: 0246 R12: 
> R13: 7fff98ce4b6f R14: 7fe3bd36e9c0 R15: 0003
> 
> Local variable description: addr@___sys_recvmsg Variable was created
> at:
>  ___sys_recvmsg+0xd5/0x810 net/socket.c:2246  __sys_recvmsg
> net/socket.c:2328 [inline]  __do_sys_recvmsg net/socket.c:2338 [inline]
> __se_sys_recvmsg net/socket.c:2335 [inline]
>  __x64_sys_recvmsg+0x325/0x460 net/socket.c:2335
> 
> Byte 19 of 32 is uninitialized
> 
> Fixes: 31c82a2d9d51 ("tipc: add second source address to
> recvmsg()/recvfrom()")
> Signed-off-by: Eric Dumazet 
> Reported-by: syzbot 
> Cc: Jon Maloy 
> Cc: Ying Xue 
> ---
>  net/tipc/socket.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/tipc/socket.c b/net/tipc/socket.c
> index 252a52ae0893261fc6f146ad8c59f375fdce..6be21575503aa532014e7aa1415b2bf294757308 100644
> --- a/net/tipc/socket.c
> +++ b/net/tipc/socket.c
> @@ -1516,10 +1516,10 @@ static void tipc_sk_set_orig_addr(struct msghdr *m, struct sk_buff *skb)
> 
>   srcaddr->sock.family = AF_TIPC;
>   srcaddr->sock.addrtype = TIPC_ADDR_ID;
> + srcaddr->sock.scope = 0;
>   srcaddr->sock.addr.id.ref = msg_origport(hdr);
>   srcaddr->sock.addr.id.node = msg_orignode(hdr);
>   srcaddr->sock.addr.name.domain = 0;
> - srcaddr->sock.scope = 0;
>   m->msg_namelen = sizeof(struct sockaddr_tipc);
> 
>   if (!msg_in_group(hdr))
> @@ -1528,6 +1528,7 @@ static void tipc_sk_set_orig_addr(struct msghdr *m, struct sk_buff *skb)
>   /* Group message users may also want to know sending member's id */
>   srcaddr->member.family = AF_TIPC;
>   srcaddr->member.addrtype = TIPC_ADDR_NAME;
> + srcaddr->member.scope = 0;
>   srcaddr->member.addr.name.name.type = msg_nametype(hdr);
>   srcaddr->member.addr.name.name.instance = TIPC_SKB_CB(skb)->orig_member;
>   srcaddr->member.addr.name.domain = 0;
> --
> 2.17.0.441.gb46fe60e1d-goog



[net-next 1/1] tipc: clean up removal of binding table items

2018-05-08 Thread Jon Maloy
In commit be47e41d77fb ("tipc: fix use-after-free in tipc_nametbl_stop")
we fixed a problem caused by premature release of service range items.

That fix is correct, and solved the problem. However, it doesn't address
the root of the problem, which is that we don't lookup the tipc_service
 -> service_range -> publication items in the correct hierarchical
order.

In this commit we try to make this right, and as a side effect obtain
some code simplification.

Acked-by: Ying Xue 
Signed-off-by: Jon Maloy 
---
 net/tipc/name_table.c | 103 ++
 1 file changed, 53 insertions(+), 50 deletions(-)

diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index dd1c4fa..bebe88c 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -136,12 +136,12 @@ static struct tipc_service *tipc_service_create(u32 type, struct hlist_head *hd)
 }
 
 /**
- * tipc_service_find_range - find service range matching a service instance
+ * tipc_service_first_range - find first service range in tree matching instance
  *
  * Very time-critical, so binary search through range rb tree
  */
-static struct service_range *tipc_service_find_range(struct tipc_service *sc,
-u32 instance)
+static struct service_range *tipc_service_first_range(struct tipc_service *sc,
+ u32 instance)
 {
struct rb_node *n = sc->ranges.rb_node;
struct service_range *sr;
@@ -158,6 +158,30 @@ static struct service_range *tipc_service_find_range(struct tipc_service *sc,
return NULL;
 }
 
+/*  tipc_service_find_range - find service range matching publication parameters
+ */
+static struct service_range *tipc_service_find_range(struct tipc_service *sc,
+u32 lower, u32 upper)
+{
+   struct rb_node *n = sc->ranges.rb_node;
+   struct service_range *sr;
+
+   sr = tipc_service_first_range(sc, lower);
+   if (!sr)
+   return NULL;
+
+   /* Look for exact match */
+   for (n = &sr->tree_node; n; n = rb_next(n)) {
+   sr = container_of(n, struct service_range, tree_node);
+   if (sr->upper == upper)
+   break;
+   }
+   if (!n || sr->lower != lower || sr->upper != upper)
+   return NULL;
+
+   return sr;
+}
+
 static struct service_range *tipc_service_create_range(struct tipc_service *sc,
   u32 lower, u32 upper)
 {
@@ -238,54 +262,19 @@ static struct publication *tipc_service_insert_publ(struct net *net,
 /**
  * tipc_service_remove_publ - remove a publication from a service
  */
-static struct publication *tipc_service_remove_publ(struct net *net,
-   struct tipc_service *sc,
-   u32 lower, u32 upper,
-   u32 node, u32 key,
-   struct service_range **rng)
+static struct publication *tipc_service_remove_publ(struct service_range *sr,
+   u32 node, u32 key)
 {
-   struct tipc_subscription *sub, *tmp;
-   struct service_range *sr;
struct publication *p;
-   bool found = false;
-   bool last = false;
-   struct rb_node *n;
-
-   sr = tipc_service_find_range(sc, lower);
-   if (!sr)
-   return NULL;
 
-   /* Find exact matching service range */
-   for (n = &sr->tree_node; n; n = rb_next(n)) {
-   sr = container_of(n, struct service_range, tree_node);
-   if (sr->upper == upper)
-   break;
-   }
-   if (!n || sr->lower != lower || sr->upper != upper)
-   return NULL;
-
-   /* Find publication, if it exists */
list_for_each_entry(p, &sr->all_publ, all_publ) {
if (p->key != key || (node && node != p->node))
continue;
-   found = true;
-   break;
+   list_del(&p->all_publ);
+   list_del(&p->local_publ);
+   return p;
}
-   if (!found)
-   return NULL;
-
-   list_del(&p->all_publ);
-   list_del(&p->local_publ);
-   if (list_empty(&sr->all_publ))
-   last = true;
-
-   /* Notify any waiting subscriptions */
-   list_for_each_entry_safe(sub, tmp, &sc->subscriptions, service_list) {
-   tipc_sub_report_overlap(sub, p->lower, p->upper, TIPC_WITHDRAWN,
-   p->port, p->node, p->scope, last);
-   }
-   *rng = sr;
-   return p;
+   return NULL;
 }
 
 /**
@@ -376,17 +365

[iproute2-next 1/1] tipc: fixed node and name table listings

2018-05-07 Thread Jon Maloy
We make it easier for users to correlate a 128-bit node
identity with its 32-bit node hash by extending the 'node list'
command to also show the hash value.

We also improve the 'nametable show' command to show the node identity
instead of the node hash value. Since the former potentially is much
longer than the latter, we make room for it by eliminating the (to the
user) irrelevant publication key. We also reorder some of the columns
so that the node id comes last, since this looks nicer and more logical.

Signed-off-by: Jon Maloy 
---
 tipc/misc.c  | 18 ++
 tipc/misc.h  |  1 +
 tipc/nametable.c | 18 ++
 tipc/node.c  | 19 ---
 tipc/peer.c  |  4 
 5 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/tipc/misc.c b/tipc/misc.c
index 16849f1..13dbaad 100644
--- a/tipc/misc.c
+++ b/tipc/misc.c
@@ -13,6 +13,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 #include "misc.h"
 
 #define IN_RANGE(val, low, high) ((val) <= (high) && (val) >= (low))
@@ -109,3 +112,18 @@ void nodeid2str(uint8_t *id, char *str)
for (i = 31; str[i] == '0'; i--)
str[i] = 0;
 }
+
+void hash2nodestr(uint32_t hash, char *str)
+{
+   struct tipc_sioc_nodeid_req nr = {};
+   int sd;
+
+   sd = socket(AF_TIPC, SOCK_RDM, 0);
+   if (sd < 0) {
+   fprintf(stderr, "opening TIPC socket: %s\n", strerror(errno));
+   return;
+   }
+   nr.peer = hash;
+   if (!ioctl(sd, SIOCGETNODEID, &nr))
+   nodeid2str(nr.node_id, str);
+}
diff --git a/tipc/misc.h b/tipc/misc.h
index 6e8afdd..ff2f31f 100644
--- a/tipc/misc.h
+++ b/tipc/misc.h
@@ -17,5 +17,6 @@
 uint32_t str2addr(char *str);
 int str2nodeid(char *str, uint8_t *id);
 void nodeid2str(uint8_t *id, char *str);
+void hash2nodestr(uint32_t hash, char *str);
 
 #endif
diff --git a/tipc/nametable.c b/tipc/nametable.c
index 2578940..ae73dfa 100644
--- a/tipc/nametable.c
+++ b/tipc/nametable.c
@@ -20,6 +20,7 @@
 #include "cmdl.h"
 #include "msg.h"
 #include "nametable.h"
+#include "misc.h"
 
 #define PORTID_STR_LEN 45 /* Four u32 and five delimiter chars */
 
@@ -31,6 +32,7 @@ static int nametable_show_cb(const struct nlmsghdr *nlh, void *data)
struct nlattr *attrs[TIPC_NLA_NAME_TABLE_MAX + 1] = {};
struct nlattr *publ[TIPC_NLA_PUBL_MAX + 1] = {};
const char *scope[] = { "", "zone", "cluster", "node" };
+   char str[33] = {0,};
 
mnl_attr_parse(nlh, sizeof(*genl), parse_attrs, info);
if (!info[TIPC_NLA_NAME_TABLE])
@@ -45,20 +47,20 @@ static int nametable_show_cb(const struct nlmsghdr *nlh, void *data)
return MNL_CB_ERROR;
 
if (!*iteration)
-   printf("%-10s %-10s %-10s %-10s %-10s %-10s\n",
-  "Type", "Lower", "Upper", "Node", "Port",
-  "Publication Scope");
+   printf("%-10s %-10s %-10s %-8s %-10s %-33s\n",
+  "Type", "Lower", "Upper", "Scope", "Port",
+  "Node");
(*iteration)++;
 
-   printf("%-10u %-10u %-10u %-10x %-10u %-12u",
+   hash2nodestr(mnl_attr_get_u32(publ[TIPC_NLA_PUBL_NODE]), str);
+
+   printf("%-10u %-10u %-10u %-8s %-10u %s\n",
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_TYPE]),
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_LOWER]),
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_UPPER]),
-  mnl_attr_get_u32(publ[TIPC_NLA_PUBL_NODE]),
+  scope[mnl_attr_get_u32(publ[TIPC_NLA_PUBL_SCOPE])],
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_REF]),
-  mnl_attr_get_u32(publ[TIPC_NLA_PUBL_KEY]));
-
-   printf("%s\n", scope[mnl_attr_get_u32(publ[TIPC_NLA_PUBL_SCOPE])]);
+  str);
 
return MNL_CB_OK;
 }
diff --git a/tipc/node.c b/tipc/node.c
index b73b644..0fa1064 100644
--- a/tipc/node.c
+++ b/tipc/node.c
@@ -26,10 +26,11 @@
 
 static int node_list_cb(const struct nlmsghdr *nlh, void *data)
 {
-   uint32_t addr;
struct genlmsghdr *genl = mnl_nlmsg_get_payload(nlh);
struct nlattr *info[TIPC_NLA_MAX + 1] = {};
struct nlattr *attrs[TIPC_NLA_NODE_MAX + 1] = {};
+   char str[33] = {};
+   uint32_t addr;
 
mnl_attr_parse(nlh, sizeof(*genl), parse_attrs, info);
if (!info[TIPC_NLA_NODE])
@@ -40,13 +41,12 @@ static int node_list_cb(const struct nlmsghdr *nlh, void *data)
return MNL_CB_ERROR;
 
addr = mnl_attr_get_u32(attrs[TIPC_NLA_NODE_ADDR]);
-   printf("%x: ", addr);
-
+   hash2nodestr(addr, str);
+   printf("

RE: [PATCH net-next] flow_dissector: do not rely on implicit casts

2018-05-07 Thread Jon Maloy
Acked-by: Jon Maloy 


> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of Paolo Abeni
> Sent: Monday, May 07, 2018 06:06
> To: netdev@vger.kernel.org
> Cc: David S. Miller 
> Subject: [PATCH net-next] flow_dissector: do not rely on implicit casts
> 
> This change fixes a couple of type mismatch reported by the sparse tool,
> explicitly using the requested type for the offending arguments.
> 
> Signed-off-by: Paolo Abeni 
> ---
>  include/net/tipc.h| 4 ++--
>  net/core/flow_dissector.c | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/include/net/tipc.h b/include/net/tipc.h
> index 07670ec022a7..f0e7e6bc1bef 100644
> --- a/include/net/tipc.h
> +++ b/include/net/tipc.h
> @@ -44,11 +44,11 @@ struct tipc_basic_hdr {
>   __be32 w[4];
>  };
> 
> -static inline u32 tipc_hdr_rps_key(struct tipc_basic_hdr *hdr)
> +static inline __be32 tipc_hdr_rps_key(struct tipc_basic_hdr *hdr)
>  {
>   u32 w0 = ntohl(hdr->w[0]);
>   bool keepalive_msg = (w0 & KEEPALIVE_MSG_MASK) == KEEPALIVE_MSG_MASK;
> - int key;
> + __be32 key;
> 
>   /* Return source node identity as key */
>   if (likely(!keepalive_msg))
> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> index 030d4ca177fb..4fc1e84d77ec 100644
> --- a/net/core/flow_dissector.c
> +++ b/net/core/flow_dissector.c
> @@ -1316,7 +1316,7 @@ u32 skb_get_poff(const struct sk_buff *skb)
>  {
>   struct flow_keys_basic keys;
> 
> - if (!skb_flow_dissect_flow_keys_basic(skb, &keys, 0, 0, 0, 0, 0))
> + if (!skb_flow_dissect_flow_keys_basic(skb, &keys, NULL, 0, 0, 0, 0))
>   return 0;
> 
>   return __skb_get_poff(skb, skb->data, &keys, skb_headlen(skb));
> --
> 2.14.3



RE: KMSAN: uninit-value in strcmp

2018-05-03 Thread Jon Maloy


> -Original Message-
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Thursday, May 03, 2018 15:22
> To: syzbot+df0257c92ffd4fcc5...@syzkaller.appspotmail.com
> Cc: Jon Maloy ; linux-ker...@vger.kernel.org;
> netdev@vger.kernel.org; syzkaller-b...@googlegroups.com; tipc-
> discuss...@lists.sourceforge.net; ying@windriver.com
> Subject: Re: KMSAN: uninit-value in strcmp
> 
> From: syzbot 
> Date: Thu, 03 May 2018 11:44:02 -0700
> 
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:17 [inline]
> >  dump_stack+0x185/0x1d0 lib/dump_stack.c:53
> >  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
> >  __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683
> >  strcmp+0xf7/0x160 lib/string.c:329
> >  tipc_nl_node_get_link+0x220/0x6f0 net/tipc/node.c:1881
> > genl_family_rcv_msg net/netlink/genetlink.c:599 [inline]
> 
> Hmmm, TIPC_NL_LINK_GET uses tipc_nl_policy, which has a proper nesting
> entry for TIPC_NLA_LINK.  I wonder how the code goes about validating
> TIPC_NLA_LINK_NAME in such a case?  Does it?

I assume that a strncmp() instead of a strcmp() would avert this particular 
crash, but it doesn't sound like that is what you are after here?
To be honest, I will need to study this code a little myself to understand if 
there is more that has to be done.

///jon

> 
> This may be the problem.


RE: [PATCH] tipc: fix a potential missing-check bug

2018-05-01 Thread Jon Maloy


> -Original Message-
> From: Wenwen Wang [mailto:wang6...@umn.edu]
> Sent: Tuesday, May 01, 2018 00:26
> To: Wenwen Wang 
> Cc: Kangjie Lu ; Jon Maloy ; Ying
> Xue ; David S. Miller ;
> open list:TIPC NETWORK LAYER ; open list:TIPC
> NETWORK LAYER ; open list  ker...@vger.kernel.org>
> Subject: [PATCH] tipc: fix a potential missing-check bug
> 
> In tipc_link_xmit(), the member field "len" of l->backlog[imp] must be less
> than the member field "limit" of l->backlog[imp] when imp is equal to
> TIPC_SYSTEM_IMPORTANCE. Otherwise, an error code, i.e., -ENOBUFS, is
> returned. This is enforced by the security check. However, at the end of
> tipc_link_xmit(), the length of "list" is added to l->backlog[imp].len without
> any further check. This can potentially cause unexpected values for
> l->backlog[imp].len. If imp is equal to TIPC_SYSTEM_IMPORTANCE and the
> original value of l->backlog[imp].len is less than l->backlog[imp].limit, 
> after
> this addition, l->backlog[imp] could be larger than
> l->backlog[imp].limit. 

It can, but only once. That is the intention of allowing oversubscription.
This is expected and permitted.
At the next sending attempt, if the send queue has not been reduced in the
meantime, the link will be reset, as intended.

> That means the security check can potentially be
> bypassed,  especially when an adversary can control the length of "list".

The length of 'list' is entirely controlled by TIPC itself, either by the
socket layer (where the length is always 1 for this type of message) or by
name_dist. In the latter case the length is also 1, except at first link
setup, when there is guaranteed to be no congestion anyway.

I appreciate your interest, but this patch is not needed.

BR
///jon

> 
> This patch performs such a check after the modification to
> l->backlog[imp].len (if imp is TIPC_SYSTEM_IMPORTANCE) to avoid such
> security issues. An error code will be returned if an unexpected value of
> l->backlog[imp].len is generated.
> 
> Signed-off-by: Wenwen Wang 
> ---
>  net/tipc/link.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/net/tipc/link.c b/net/tipc/link.c
> index 695acb7..62972fa 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -948,6 +948,11 @@ int tipc_link_xmit(struct tipc_link *l, struct
> sk_buff_head *list,
>   continue;
>   }
>   l->backlog[imp].len += skb_queue_len(list);
> + if (imp == TIPC_SYSTEM_IMPORTANCE &&
> + l->backlog[imp].len >= l->backlog[imp].limit) {
> + pr_warn("%s<%s>, link overflow", link_rst_msg, l->name);
> + return -ENOBUFS;
> + }
>   skb_queue_splice_tail_init(list, backlogq);
>   }
>   l->snd_nxt = seqno;
> --
> 2.7.4



[net-next 1/1] tipc: introduce ioctl for fetching node identity

2018-04-25 Thread Jon Maloy
After the introduction of a 128-bit node identity it may be difficult
for a user to correlate this identity with the generated node
hash address.

We now try to make this easier by introducing a new ioctl() call for
fetching a node identity by using the hash value as key. This will
be particularly useful when we extend some of the commands in the
'tipc' tool, but we also expect regular user applications to need
this feature.

Acked-by: Ying Xue 
Signed-off-by: Jon Maloy 
---
 include/uapi/linux/tipc.h | 12 
 net/tipc/node.c   | 21 +
 net/tipc/node.h   |  1 +
 net/tipc/socket.c | 13 +++--
 4 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/tipc.h b/include/uapi/linux/tipc.h
index bf6d286..6b2fd4d 100644
--- a/include/uapi/linux/tipc.h
+++ b/include/uapi/linux/tipc.h
@@ -209,16 +209,16 @@ struct tipc_group_req {
  * The string formatting for each name element is:
  * media: media
  * interface: media:interface name
- * link: Z.C.N:interface-Z.C.N:interface
- *
+ * link: node:interface-node:interface
  */
-
+#define TIPC_NODEID_LEN 16
 #define TIPC_MAX_MEDIA_NAME16
 #define TIPC_MAX_IF_NAME   16
 #define TIPC_MAX_BEARER_NAME   32
 #define TIPC_MAX_LINK_NAME 68
 
-#define SIOCGETLINKNAMESIOCPROTOPRIVATE
+#define SIOCGETLINKNAMESIOCPROTOPRIVATE
+#define SIOCGETNODEID  (SIOCPROTOPRIVATE + 1)
 
 struct tipc_sioc_ln_req {
__u32 peer;
@@ -226,6 +226,10 @@ struct tipc_sioc_ln_req {
char linkname[TIPC_MAX_LINK_NAME];
 };
 
+struct tipc_sioc_nodeid_req {
+   __u32 peer;
+   char node_id[TIPC_NODEID_LEN];
+};
 
 /* The macros and functions below are deprecated:
  */
diff --git a/net/tipc/node.c b/net/tipc/node.c
index e9c52e14..81e6dd0 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -195,6 +195,27 @@ int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel)
return mtu;
 }
 
+bool tipc_node_get_id(struct net *net, u32 addr, u8 *id)
+{
+   u8 *own_id = tipc_own_id(net);
+   struct tipc_node *n;
+
+   if (!own_id)
+   return true;
+
+   if (addr == tipc_own_addr(net)) {
+   memcpy(id, own_id, TIPC_NODEID_LEN);
+   return true;
+   }
+   n = tipc_node_find(net, addr);
+   if (!n)
+   return false;
+
+   memcpy(id, &n->peer_id, TIPC_NODEID_LEN);
+   tipc_node_put(n);
+   return true;
+}
+
 u16 tipc_node_get_capabilities(struct net *net, u32 addr)
 {
struct tipc_node *n;
diff --git a/net/tipc/node.h b/net/tipc/node.h
index bb271a3..846c8f2 100644
--- a/net/tipc/node.h
+++ b/net/tipc/node.h
@@ -60,6 +60,7 @@ enum {
 #define INVALID_BEARER_ID -1
 
 void tipc_node_stop(struct net *net);
+bool tipc_node_get_id(struct net *net, u32 addr, u8 *id);
 u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr);
 void tipc_node_check_dest(struct net *net, u32 onode, u8 *peer_id128,
  struct tipc_bearer *bearer,
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 252a52ae..c499200 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -2973,7 +2973,8 @@ static int tipc_getsockopt(struct socket *sock, int lvl, int opt,
 
 static int tipc_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
 {
-   struct sock *sk = sock->sk;
+   struct net *net = sock_net(sock->sk);
+   struct tipc_sioc_nodeid_req nr = {0};
struct tipc_sioc_ln_req lnr;
void __user *argp = (void __user *)arg;
 
@@ -2981,7 +2982,7 @@ static int tipc_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
case SIOCGETLINKNAME:
if (copy_from_user(&lnr, argp, sizeof(lnr)))
return -EFAULT;
-   if (!tipc_node_get_linkname(sock_net(sk),
+   if (!tipc_node_get_linkname(net,
lnr.bearer_id & 0x, lnr.peer,
lnr.linkname, TIPC_MAX_LINK_NAME)) {
if (copy_to_user(argp, &lnr, sizeof(lnr)))
@@ -2989,6 +2990,14 @@ static int tipc_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
return 0;
}
return -EADDRNOTAVAIL;
+   case SIOCGETNODEID:
+   if (copy_from_user(&nr, argp, sizeof(nr)))
+   return -EFAULT;
+   if (!tipc_node_get_id(net, nr.peer, nr.node_id))
+   return -EADDRNOTAVAIL;
+   if (copy_to_user(argp, &nr, sizeof(nr)))
+   return -EFAULT;
+   return 0;
default:
return -ENOIOCTLCMD;
}
-- 
2.1.4



[net 1/1] tipc: fix bug in function tipc_nl_node_dump_monitor

2018-04-25 Thread Jon Maloy
Commit 36a50a989ee8 ("tipc: fix infinite loop when dumping link monitor
summary") intended to fix a problem with user tool looping when max
number of bearers are enabled.

Unfortunately, the wrong version of the commit was posted, so the
problem was not solved at all.

This commit adds the missing part.

Fixes: 36a50a989ee8 ("tipc: fix infinite loop when dumping link monitor summary")
Signed-off-by: Jon Maloy 
---
 net/tipc/node.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tipc/node.c b/net/tipc/node.c
index 6f98b56..baaf93f 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -2244,7 +2244,7 @@ int tipc_nl_node_dump_monitor(struct sk_buff *skb, struct netlink_callback *cb)
 
rtnl_lock();
for (bearer_id = prev_bearer; bearer_id < MAX_BEARERS; bearer_id++) {
-   err = __tipc_nl_add_monitor(net, &msg, prev_bearer);
+   err = __tipc_nl_add_monitor(net, &msg, bearer_id);
if (err)
break;
}
-- 
2.1.4



[net 1/1] tipc: fix infinite loop when dumping link monitor summary

2018-04-17 Thread Jon Maloy
From: Tung Nguyen 

When configuring the number of used bearers to MAX_BEARER and issuing
command "tipc link monitor summary", the command enters infinite loop
in user space.

This issue happens because function tipc_nl_node_dump_monitor() returns
the wrong 'prev_bearer' value when all potential monitors have been
scanned.

The correct behavior is to always try to scan all monitors until either
the netlink message is full, in which case we return the bearer identity
of the affected monitor, or we continue through the whole bearer array
until we can return MAX_BEARERS. This solution also caters for the case
where there may be gaps in the bearer array.

Signed-off-by: Tung Nguyen 
Signed-off-by: Jon Maloy 
---
 net/tipc/monitor.c |  2 +-
 net/tipc/node.c| 11 ---
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/net/tipc/monitor.c b/net/tipc/monitor.c
index 32dc33a..5453e56 100644
--- a/net/tipc/monitor.c
+++ b/net/tipc/monitor.c
@@ -777,7 +777,7 @@ int __tipc_nl_add_monitor(struct net *net, struct tipc_nl_msg *msg,
 
ret = tipc_bearer_get_name(net, bearer_name, bearer_id);
if (ret || !mon)
-   return -EINVAL;
+   return 0;
 
hdr = genlmsg_put(msg->skb, msg->portid, msg->seq, &tipc_genl_family,
  NLM_F_MULTI, TIPC_NL_MON_GET);
diff --git a/net/tipc/node.c b/net/tipc/node.c
index c77dd2f..6f98b56 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -2232,8 +2232,8 @@ int tipc_nl_node_dump_monitor(struct sk_buff *skb, struct netlink_callback *cb)
struct net *net = sock_net(skb->sk);
u32 prev_bearer = cb->args[0];
struct tipc_nl_msg msg;
+   int bearer_id;
int err;
-   int i;
 
if (prev_bearer == MAX_BEARERS)
return 0;
@@ -2243,16 +2243,13 @@ int tipc_nl_node_dump_monitor(struct sk_buff *skb, struct netlink_callback *cb)
msg.seq = cb->nlh->nlmsg_seq;
 
rtnl_lock();
-   for (i = prev_bearer; i < MAX_BEARERS; i++) {
-   prev_bearer = i;
+   for (bearer_id = prev_bearer; bearer_id < MAX_BEARERS; bearer_id++) {
err = __tipc_nl_add_monitor(net, &msg, prev_bearer);
if (err)
-   goto out;
+   break;
}
-
-out:
rtnl_unlock();
-   cb->args[0] = prev_bearer;
+   cb->args[0] = bearer_id;
 
return skb->len;
 }
-- 
2.1.4



[net 1/1] tipc: fix use-after-free in tipc_nametbl_stop

2018-04-17 Thread Jon Maloy
When we delete a service item in tipc_nametbl_stop() we loop over
all service ranges in the service's RB tree, and for each service
range we loop over its pertaining publications while calling
tipc_service_remove_publ() for each of them.

However, tipc_service_remove_publ() has the side effect that it also
removes the comprising service range item when there are no publications
left. This leads to a "use-after-free" access when the inner loop
continues to the next iteration, since the range item holding the list
we are looping over no longer exists.

We fix this by moving the delete of the service range item outside
the said function. Instead, we now let the two functions calling it
test if the list is empty and perform the removal when that is the
case.

Reported-by: syzbot+d64b64afc55660106...@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy 
---
 net/tipc/name_table.c | 29 +
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index 4068eaa..dd1c4fa 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -241,7 +241,8 @@ static struct publication *tipc_service_insert_publ(struct net *net,
 static struct publication *tipc_service_remove_publ(struct net *net,
struct tipc_service *sc,
u32 lower, u32 upper,
-   u32 node, u32 key)
+   u32 node, u32 key,
+   struct service_range **rng)
 {
struct tipc_subscription *sub, *tmp;
struct service_range *sr;
@@ -275,19 +276,15 @@ static struct publication *tipc_service_remove_publ(struct net *net,
 
list_del(&p->all_publ);
list_del(&p->local_publ);
-
-   /* Remove service range item if this was its last publication */
-   if (list_empty(&sr->all_publ)) {
+   if (list_empty(&sr->all_publ))
last = true;
-   rb_erase(&sr->tree_node, &sc->ranges);
-   kfree(sr);
-   }
 
/* Notify any waiting subscriptions */
list_for_each_entry_safe(sub, tmp, &sc->subscriptions, service_list) {
tipc_sub_report_overlap(sub, p->lower, p->upper, TIPC_WITHDRAWN,
p->port, p->node, p->scope, last);
}
+   *rng = sr;
return p;
 }
 
@@ -379,13 +376,20 @@ struct publication *tipc_nametbl_remove_publ(struct net *net, u32 type,
 u32 node, u32 key)
 {
struct tipc_service *sc = tipc_service_find(net, type);
+   struct service_range *sr = NULL;
struct publication *p = NULL;
 
if (!sc)
return NULL;
 
spin_lock_bh(&sc->lock);
-   p = tipc_service_remove_publ(net, sc, lower, upper, node, key);
+   p = tipc_service_remove_publ(net, sc, lower, upper, node, key, &sr);
+
+   /* Remove service range item if this was its last publication */
+   if (sr && list_empty(&sr->all_publ)) {
+   rb_erase(&sr->tree_node, &sc->ranges);
+   kfree(sr);
+   }
 
/* Delete service item if this no more publications and subscriptions */
if (RB_EMPTY_ROOT(&sc->ranges) && list_empty(&sc->subscriptions)) {
@@ -747,16 +751,17 @@ int tipc_nametbl_init(struct net *net)
 static void tipc_service_delete(struct net *net, struct tipc_service *sc)
 {
struct service_range *sr, *tmpr;
-   struct publication *p, *tmpb;
+   struct publication *p, *tmp;
 
spin_lock_bh(&sc->lock);
rbtree_postorder_for_each_entry_safe(sr, tmpr, &sc->ranges, tree_node) {
-   list_for_each_entry_safe(p, tmpb,
-&sr->all_publ, all_publ) {
+   list_for_each_entry_safe(p, tmp, &sr->all_publ, all_publ) {
tipc_service_remove_publ(net, sc, p->lower, p->upper,
-p->node, p->key);
+p->node, p->key, &sr);
kfree_rcu(p, rcu);
}
+   rb_erase(&sr->tree_node, &sc->ranges);
+   kfree(sr);
}
hlist_del_init_rcu(&sc->service_list);
spin_unlock_bh(&sc->lock);
-- 
2.1.4



RE: [PATCH net 0/2] tipc: Better check user provided attributes

2018-04-16 Thread Jon Maloy
Acked-by: Jon Maloy 

Thank you, Eric.


> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of Eric Dumazet
> Sent: Monday, April 16, 2018 11:30
> To: David S . Miller 
> Cc: netdev ; Eric Dumazet
> ; Eric Dumazet 
> Subject: [PATCH net 0/2] tipc: Better check user provided attributes
> 
> syzbot reported a crash in __tipc_nl_net_set()
> 
> While fixing it, I also had to fix an old bug involving TIPC_NLA_NET_ADDR
> 
> Eric Dumazet (2):
>   tipc: add policy for TIPC_NLA_NET_ADDR
>   tipc: fix possible crash in __tipc_nl_net_set()
> 
>  net/tipc/net.c | 2 ++
>  net/tipc/netlink.c | 5 -
>  2 files changed, 6 insertions(+), 1 deletion(-)
> 
> --
> 2.17.0.484.g0c8726318c-goog



[net 1/1] tipc: fix missing initializer in tipc_sendmsg()

2018-04-11 Thread Jon Maloy
The stack variable 'dnode' in __tipc_sendmsg() may theoretically
end up in tipc_node_get_mtu() as an uninitialized variable.

We fix this by initializing the variable at declaration. We also add
a default else clause to the two conditional ones already there, so
that we never end up in the named function if the given address
type is illegal.

Reported-by: syzbot+b0975ce9355b347c1...@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy 
---
 net/tipc/socket.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 1fd1c8b..252a52ae 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1278,7 +1278,7 @@ static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dlen)
struct tipc_msg *hdr = &tsk->phdr;
struct tipc_name_seq *seq;
struct sk_buff_head pkts;
-   u32 dnode, dport;
+   u32 dport, dnode = 0;
u32 type, inst;
int mtu, rc;
 
@@ -1348,6 +1348,8 @@ static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dlen)
msg_set_destnode(hdr, dnode);
msg_set_destport(hdr, dest->addr.id.ref);
msg_set_hdr_sz(hdr, BASIC_H_SIZE);
+   } else {
+   return -EINVAL;
}
 
/* Block or return if destination link is congested */
-- 
2.1.4



[net 1/1] tipc: fix unbalanced reference counter

2018-04-11 Thread Jon Maloy
When a topology subscription is created, we may encounter (or KASAN
may provoke) a failure to create a corresponding service instance in
the binding table. Instead of letting tipc_nametbl_subscribe()
report the failure back to the caller, the function just makes a warning
printout and returns, without incrementing the subscription reference
counter as expected by the caller.

This makes the caller believe that the subscription was successful, so
it will at a later moment try to unsubscribe the item. This involves
a sub_put() call. Since the reference counter never was incremented
in the first place, we get a premature delete of the subscription item,
followed by a "use-after-free" warning.

We fix this by adding a return value to tipc_nametbl_subscribe() and
make the caller aware of the failure to subscribe.

This bug seems to always have been around, but this fix only applies
back to the commit shown below. Given the low risk of this happening
we believe this to be sufficient.

Fixes: 218527fe27ad ("tipc: replace name table service range array with rb tree")
Reported-by: syzbot+aa245f26d42b8305d...@syzkaller.appspotmail.com

Signed-off-by: Jon Maloy 
---
 net/tipc/name_table.c | 5 -
 net/tipc/name_table.h | 2 +-
 net/tipc/subscr.c | 5 -
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index b1fe209..4068eaa 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -665,13 +665,14 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
 /**
  * tipc_nametbl_subscribe - add a subscription object to the name table
  */
-void tipc_nametbl_subscribe(struct tipc_subscription *sub)
+bool tipc_nametbl_subscribe(struct tipc_subscription *sub)
 {
struct name_table *nt = tipc_name_table(sub->net);
struct tipc_net *tn = tipc_net(sub->net);
struct tipc_subscr *s = &sub->evt.s;
u32 type = tipc_sub_read(s, seq.type);
struct tipc_service *sc;
+   bool res = true;
 
spin_lock_bh(&tn->nametbl_lock);
sc = tipc_service_find(sub->net, type);
@@ -685,8 +686,10 @@ void tipc_nametbl_subscribe(struct tipc_subscription *sub)
pr_warn("Failed to subscribe for {%u,%u,%u}\n", type,
tipc_sub_read(s, seq.lower),
tipc_sub_read(s, seq.upper));
+   res = false;
}
spin_unlock_bh(&tn->nametbl_lock);
+   return res;
 }
 
 /**
diff --git a/net/tipc/name_table.h b/net/tipc/name_table.h
index 4b14fc2..0febba4 100644
--- a/net/tipc/name_table.h
+++ b/net/tipc/name_table.h
@@ -126,7 +126,7 @@ struct publication *tipc_nametbl_insert_publ(struct net 
*net, u32 type,
 struct publication *tipc_nametbl_remove_publ(struct net *net, u32 type,
 u32 lower, u32 upper,
 u32 node, u32 key);
-void tipc_nametbl_subscribe(struct tipc_subscription *s);
+bool tipc_nametbl_subscribe(struct tipc_subscription *s);
 void tipc_nametbl_unsubscribe(struct tipc_subscription *s);
 int tipc_nametbl_init(struct net *net);
 void tipc_nametbl_stop(struct net *net);
diff --git a/net/tipc/subscr.c b/net/tipc/subscr.c
index b7d80bc..f340e53 100644
--- a/net/tipc/subscr.c
+++ b/net/tipc/subscr.c
@@ -153,7 +153,10 @@ struct tipc_subscription *tipc_sub_subscribe(struct net *net,
memcpy(&sub->evt.s, s, sizeof(*s));
spin_lock_init(&sub->lock);
kref_init(&sub->kref);
-   tipc_nametbl_subscribe(sub);
+   if (!tipc_nametbl_subscribe(sub)) {
+   kfree(sub);
+   return NULL;
+   }
timer_setup(&sub->timer, tipc_sub_timeout, 0);
timeout = tipc_sub_read(&sub->evt.s, timeout);
if (timeout != TIPC_WAIT_FOREVER)
-- 
2.1.4



RE: [PATCH v3] net: tipc: Replace GFP_ATOMIC with GFP_KERNEL in tipc_mon_create

2018-04-11 Thread Jon Maloy


> -Original Message-
> From: Ying Xue [mailto:ying@windriver.com]
> Sent: Wednesday, April 11, 2018 06:27
> To: Jia-Ju Bai ; Jon Maloy
> ; da...@davemloft.net
> Cc: netdev@vger.kernel.org; tipc-discuss...@lists.sourceforge.net; linux-
> ker...@vger.kernel.org
> Subject: Re: [PATCH v3] net: tipc: Replace GFP_ATOMIC with GFP_KERNEL in
> tipc_mon_create
> 
> On 04/11/2018 06:24 PM, Jia-Ju Bai wrote:
> > tipc_mon_create() is never called in atomic context.
> >
> > The call chain ending up at tipc_mon_create() is:
> > [1] tipc_mon_create() <- tipc_enable_bearer() <- tipc_nl_bearer_enable()
> > tipc_nl_bearer_enable() calls rtnl_lock(), which indicates this
> > function is not called in atomic context.
> >
> > Despite never getting called from atomic context,
> > tipc_mon_create() calls kzalloc() with GFP_ATOMIC, which does not
> > sleep for allocation.
> > GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL, which
> > can sleep and improve the possibility of successful allocation.
> >
> > This is found by a static analysis tool named DCNS written by myself.
> > And I also manually check it.
> >
> > Signed-off-by: Jia-Ju Bai 
> 
> Acked-by: Ying Xue 
Acked-by: Jon Maloy 
> 
> > ---
> > v2:
> > * Modify the description of GFP_ATOMIC in v1.
> >   Thank Eric for good advice.
> > v3:
> > * Modify wrong text in description in v2.
> >   Thank Ying for good advice.
> > ---
> >  net/tipc/monitor.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/net/tipc/monitor.c b/net/tipc/monitor.c index
> > 9e109bb..9714d80 100644
> > --- a/net/tipc/monitor.c
> > +++ b/net/tipc/monitor.c
> > @@ -604,9 +604,9 @@ int tipc_mon_create(struct net *net, int bearer_id)
> > if (tn->monitors[bearer_id])
> > return 0;
> >
> > -   mon = kzalloc(sizeof(*mon), GFP_ATOMIC);
> > -   self = kzalloc(sizeof(*self), GFP_ATOMIC);
> > -   dom = kzalloc(sizeof(*dom), GFP_ATOMIC);
> > +   mon = kzalloc(sizeof(*mon), GFP_KERNEL);
> > +   self = kzalloc(sizeof(*self), GFP_KERNEL);
> > +   dom = kzalloc(sizeof(*dom), GFP_KERNEL);
> > if (!mon || !self || !dom) {
> > kfree(mon);
> > kfree(self);
> >


RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-04-04 Thread Jon Rosen (jrosen)
> >> >One issue with the above proposed change to use TP_STATUS_IN_PROGRESS
> >> >is that the documentation of the tp_status field is somewhat
> >> >inconsistent.  In some places it's described as TP_STATUS_KERNEL(0)
> >> >meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0)
> >> >meaning the entry is owned by user space.  In other places ownership
> >> >by user space is defined by the TP_STATUS_USER(1) bit being set.
> >>
> >> But indeed this example in packet_mmap.txt is problematic
> >>
> >> if (status == TP_STATUS_KERNEL)
> >> retval = poll(&pfd, 1, timeout);
> >>
> >> It does not really matter whether the docs are possibly inconsistent and
> >> which one is authoritative. Examples like the above make it likely that
> >> some user code expects such code to work.
> >
> > Yes, that's exactly my concern.  Yet another troubling example seems to be
> > libpcap, which is also looking specifically for status to be anything other 
> > than
> > TP_STATUS_KERNEL(0) to indicate a frame is available in user space.
> 
> Good catch. If pcap-linux.c relies on this then the status field
> cannot be changed. Other fields can be modified freely while tp_status
> remains 0, perhaps that's an option.

Possibly. Someone else suggested something similar but in at least the
one example we thought through it still seemed like it didn't address the 
problem.

For example, let's say we used tp_len == -1 to indicate to other kernel threads
that the entry was already in progress.  This would require that user space 
never
set tp_len = -1 before returning the entry back to the kernel.  If it did then 
no
kernel thread would ever claim ownership and the ring would hang.

Now, it seems pretty unlikely that user space would do such a thing so maybe we
could look past that, but then we run into the issue that there is still a 
window
of opportunity for other kernel threads to come in and wrap the ring.

The reason is that we can't set tp_len to the correct length after setting
tp_status, because user space could grab the entry and see tp_len == -1, so we
have to set tp_len before we set tp_status. This means that there is still a
window where other
kernel threads could come in and see tp_len as something other than -1 and
a tp_status of TP_STATUS_KERNEL and think it's ok to allocate the entry.
This puts us back to where we are today (arguably with a smaller window,
but a window none the less).

Alternatively we could reacquire the spin_lock to then set tp_len followed by
tp_status.  This would give the necessary indivisibility in the kernel while 
preserving proper order as made visible to user space, but it comes at the cost
of another spin_lock.

Thanks for the suggestion.  If you can think of a way around this I'm all ears.
I'll think on this some more but so far I'm stuck on how to get past having to
broaden the scope of the spin_lock, reacquire the spin_lock, or use some sort
of atomic construct along with a parallel shadow ring structure (still thinking
through that one as well).



RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-04-04 Thread Jon Rosen (jrosen)
On Wednesday, April 04, 2018 9:49 AM, Willem de Bruijn  
wrote:
> 
> On Tue, Apr 3, 2018 at 11:55 PM, Jon Rosen  wrote:
> > Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
> > causes the ring to get corrupted by allowing multiple kernel threads
> > to claim ownership of the same ring entry. Mark the ring entry as
> > already being used within the spin_lock to prevent other kernel
> > threads from reusing the same entry before it's fully filled in,
> > passed to user space, and then eventually passed back to the kernel
> > for use with a new packet.
> >
> > Note that the proposed change may modify the semantics of the
> > interface between kernel space and user space in a way which may cause
> > some applications to no longer work properly.
> 
> As long as TP_STATUS_USER (1) is not set, userspace should ignore
> the slot..
> 
> >One issue with the above proposed change to use TP_STATUS_IN_PROGRESS
> >is that the documentation of the tp_status field is somewhat
> >inconsistent.  In some places it's described as TP_STATUS_KERNEL(0)
> >meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0)
> >meaning the entry is owned by user space.  In other places ownership
> >by user space is defined by the TP_STATUS_USER(1) bit being set.
> 
> But indeed this example in packet_mmap.txt is problematic
> 
> if (status == TP_STATUS_KERNEL)
> retval = poll(&pfd, 1, timeout);
> 
> It does not really matter whether the docs are possibly inconsistent and
> which one is authoritative. Examples like the above make it likely that
> some user code expects such code to work.

Yes, that's exactly my concern.  Yet another troubling example seems to be
libpcap, which is also looking specifically for status to be anything other than
TP_STATUS_KERNEL(0) to indicate a frame is available in user space.

Either way things are broken. They are broken as they stand now because the
ring can get overrun and the kernel and user space tracking of the ring can
get out of sync.  And they are broken with the below change because some user
space applications will be looking for anything other than TP_STATUS_KERNEL,
so again the ring will get out of sync.

The difference is that the problem as it stands today is, on average (across
all environments and all user space apps), less likely to occur, while with the
change below it is much more likely to occur.

Maybe the right answer here is to implement a fix that is compatible with
existing applications, accept any potential performance impacts, and then add
yet another version (TPACKET_V4?) which more strictly requires the
TP_STATUS_USER bit for passing ownership.

> 
> > +++ b/net/packet/af_packet.c
> > @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
> > net_device *dev,
> > if (po->stats.stats1.tp_drops)
> > status |= TP_STATUS_LOSING;
> > }
> > +
> > +/*
> > + * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other
> > + * kernel threads from re-using this same entry.
> > + */
> > +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING
> 
> No need to reinterpret existing flags. tp_status is a u32 with
> sufficient undefined bits.

Agreed.

> 
> > +   if (po->tp_version <= TPACKET_V2)
> > +__packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS);
> > +
> > po->stats.stats1.tp_packets++;
> > if (copy_skb) {
> > status |= TP_STATUS_COPY;
> > --
> > 2.10.3.dirty
> >

Thanks for the feedback!
Jon.


RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-04-04 Thread Jon Rosen (jrosen)


> > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > index e0f3f4a..264d7b2 100644
> > --- a/net/packet/af_packet.c
> > +++ b/net/packet/af_packet.c
> > @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
> > net_device *dev,
> > if (po->stats.stats1.tp_drops)
> > status |= TP_STATUS_LOSING;
> > }
> > +
> > +/*
> > + * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other
> > + * kernel threads from re-using this same entry.
> > + */
> > +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING
> > +   if (po->tp_version <= TPACKET_V2)
> > +__packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS);
> > +
> > po->stats.stats1.tp_packets++;
> > if (copy_skb) {
> > status |= TP_STATUS_COPY;
> 
> This patch looks correct. Please resend it with proper signed-off-by
> and with a kernel code indenting style (tabs).  Is this bug present
> since the beginning of af_packet and multiqueue devices or did it get
> introduced in some previous kernel?

Sorry about the tabs, I'll fix that and try to figure out what I did wrong with
the signed-off-by.

I've looked back as far as I could find online (2.6.11) and it would appear that
this bug has always been there.

Thanks, jon.



[RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-04-03 Thread Jon Rosen
Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
causes the ring to get corrupted by allowing multiple kernel threads
to claim ownership of the same ring entry. Mark the ring entry as
already being used within the spin_lock to prevent other kernel
threads from reusing the same entry before it's fully filled in,
passed to user space, and then eventually passed back to the kernel
for use with a new packet.

Note that the proposed change may modify the semantics of the
interface between kernel space and user space in a way which may cause
some applications to no longer work properly. More discussion on this
change can be found in the additional comments section titled
"3. Discussion on packet_mmap ownership semantics:".

Signed-off-by: Jon Rosen 
---

Additional Comments Section
---

1. Description of the diffs:


   TPACKET_V1 and TPACKET_V2 format rings:
   ---
   Mark each entry as TP_STATUS_IN_PROGRESS after allocating to
   prevent other kernel threads from re-using the same entry.

   This is necessary because there may be a delay from the time the
   spin_lock is released to the time that the packet is completed and
   the corresponding ring entry is marked as owned by user space.  If
   during this time other kernel threads enqueue more packets to the
   ring than the size of the ring, then it will cause multiple kernel
   threads to operate on the same entry at the same time, corrupting
   packets and the ring state.

   By marking the entry as allocated (IN_PROGRESS) we prevent other
   kernel threads from incorrectly re-using an entry that is still in
   the progress of being filled in before it is passed to user space.

   This forces each entry through the following states:

   +-> 1. (tp_status == TP_STATUS_KERNEL)
   |  Free: For use by any kernel thread to store a new packet
   |
   |   2. !(tp_status == TP_STATUS_KERNEL) && !(tp_status & TP_STATUS_USER)
   |  Allocated: In use by a *specific* kernel thread
   |
   |   3. (tp_status & TP_STATUS_USER)
   |  Available: Packet available for user space to process
   |
   +-- Loop back to #1 when user space writes entry as TP_STATUS_KERNEL


   No impact on TPACKET_V3 format rings:
   -
   Packet entry ownership is already protected from other kernel
   threads potentially re-using the same entry. This is done inside
   packet_current_rx_frame() where storage is allocated for the
   current packet. Since this is done within the spin_lock no
   additional status updates for this entry are required.


   Defining TP_STATUS_IN_PROGRESS:
   ---
   Rather than defining a new-bit we re-use an existing bit for this
   intermediate state.  Status will eventually be overwritten with the
   actual true status when passed to user space.  Any bit used to pass
   information to user space other than the one that passes ownership
   is suitable (can't use TP_STATUS_USER).  Alternatively a new bit
   could be defined.


2. More detailed discussion:

   Ring entries basically have 2 states, owned by the kernel or owned by
   user space. For single producer/single consumer this works fine. For
   multiple producers there is a window between the call to spin_unlock
   [F] and the call to __packet_set_status [J] where if there are enough
   packets added to the ring by other kernel threads then the ring can
   wrap and multiple threads will end up using the same ring entries.

   This occurs because the ring entry allocated at [C] did not modify the
   state of the entry so it continues to appear as owned by the kernel
   and available for use for new packets even though it has already been
   allocated.

   A simple fix is to temporarily mark the ring entries within the spin
   lock such that user space will still think it's owned by the kernel
   and other kernel threads will not see it as available to be used for
   new packets. If a kernel thread gets delayed between [F] and [J] for
   an extended period of time and the ring wraps back to the same point
   then subsequent kernel threads' attempts to allocate will fail and be
   treated as the ring being full.

   The change below at [D] uses a newly defined TP_STATUS_IN_PROGRESS bit
   to prevent other kernel threads from re-using the same entry. Note that
   any existing bit other than TP_STATUS_USER could have been used.

   af_packet.c:tpacket_rcv()
  ... code removed for brevity ...

  // Acquire spin lock
A:spin_lock(&sk->sk_receive_queue.lock);

// Preemption is disabled

// Get current ring entry
B:  h.raw = packet_current_rx_frame(
po, skb, TP_STATUS_KERNEL, (macoff+snaplen));

// Get out if ring is full
// Code not show but it will also release 

RE: general protection fault in tipc_nametbl_unsubscribe

2018-04-03 Thread Jon Maloy
#syz dup: general protection fault in __list_del_entry_valid (3)

> -Original Message-
> From: syzbot
> [mailto:syzbot+4859fe19555ea87c4...@syzkaller.appspotmail.com]
> Sent: Monday, April 02, 2018 02:01
> To: da...@davemloft.net; Jon Maloy ; linux-
> ker...@vger.kernel.org; netdev@vger.kernel.org; syzkaller-
> b...@googlegroups.com; tipc-discuss...@lists.sourceforge.net;
> ying@windriver.com
> Subject: general protection fault in tipc_nametbl_unsubscribe
> 
> Hello,
> 
> syzbot hit the following crash on upstream commit
> 10b84daddbec72c6b440216a69de9a9605127f7a (Sat Mar 31 17:59:00 2018
> +) Merge branch 'perf-urgent-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> syzbot dashboard link:
> https://syzkaller.appspot.com/bug?extid=4859fe19555ea87c42f3
> 
> So far this crash happened 3 times on upstream.
> C reproducer:
> https://syzkaller.appspot.com/x/repro.c?id=4775372465897472
> syzkaller reproducer:
> https://syzkaller.appspot.com/x/repro.syz?id=4868734988582912
> Raw console output:
> https://syzkaller.appspot.com/x/log.txt?id=507380209544
> Kernel config:
> https://syzkaller.appspot.com/x/.config?id=-2760467897697295172
> compiler: gcc (GCC) 7.1.1 20170620
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+4859fe19555ea87c4...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for details.
> If you forward the report, please keep this part and the footer.
> 
> R13:  R14:  R15:  Name
> sequence creation failed, no memory Failed to create subscription for
> {24576,0,4294967295}
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault:  [#1] SMP KASAN Dumping ftrace buffer:
> (ftrace buffer empty)
> Modules linked in:
> CPU: 1 PID: 4447 Comm: syzkaller851181 Not tainted 4.16.0-rc7+ #374
> Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS Google 01/01/2011
> RIP: 0010:__list_del_entry_valid+0x7e/0x150 lib/list_debug.c:51
> RSP: 0018:8801ae1aef48 EFLAGS: 00010246
> RAX: dc00 RBX:  RCX: 
> RDX:  RSI: 8801cf54c760 RDI: 8801cf54c768
> RBP: 8801ae1aef60 R08: 110035c35cff R09: 89956150
> R10: 8801ae1aee28 R11: 168a R12: 87745ea0
> R13: 8801ae1af100 R14: 8801cf54c760 R15: 8801cf4c8cc0
> FS:  () GS:8801db10()
> knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 55dce15c3090 CR3: 0846a002 CR4: 001606e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400 Call
> Trace:
>   __list_del_entry include/linux/list.h:117 [inline]
>   list_del_init include/linux/list.h:159 [inline]
>   tipc_nametbl_unsubscribe+0x318/0x990 net/tipc/name_table.c:848
>   tipc_subscrb_subscrp_delete+0x1e9/0x460 net/tipc/subscr.c:212
>   tipc_subscrb_delete net/tipc/subscr.c:242 [inline]
>   tipc_subscrb_release_cb+0x17/0x30 net/tipc/subscr.c:321
>   tipc_topsrv_kern_unsubscr+0x2c3/0x430 net/tipc/server.c:535
>   tipc_group_delete+0x2c0/0x3d0 net/tipc/group.c:231
>   tipc_sk_leave+0x10b/0x200 net/tipc/socket.c:2795
>   tipc_release+0x154/0xff0 net/tipc/socket.c:577
>   sock_release+0x8d/0x1e0 net/socket.c:595
>   sock_close+0x16/0x20 net/socket.c:1149
>   __fput+0x327/0x7e0 fs/file_table.c:209
>   fput+0x15/0x20 fs/file_table.c:243
>   task_work_run+0x199/0x270 kernel/task_work.c:113
>   exit_task_work include/linux/task_work.h:22 [inline]
>   do_exit+0x9bb/0x1ad0 kernel/exit.c:865
>   do_group_exit+0x149/0x400 kernel/exit.c:968
>   SYSC_exit_group kernel/exit.c:979 [inline]
>   SyS_exit_group+0x1d/0x20 kernel/exit.c:977
>   do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>   entry_SYSCALL_64_after_hwframe+0x42/0xb7
> RIP: 0033:0x43f228
> RSP: 002b:7ffde31217e8 EFLAGS: 0246 ORIG_RAX:
> 00e7
> RAX: ffda RBX:  RCX: 0043f228
> RDX:  RSI: 003c RDI: 
> RBP: 004bf308 R08: 00e7 R09: ffd0
> R10: 204ee000 R11: 0246 R12: 0001
> R13: 006d1180 R14:  R15: 
> Code: 00 00 00 00 ad de 49 39 c4 74 66 48 b8 00 02 00 00 00 00 ad de 48 89 da 
> 48
> 39 c3 74 65 48 c1 ea 03 48 b8 00 00 00 00 00 fc ff df <80> 3c 02 00
> 75 7b 48 8b 13 48 39 f2 75 57 49 8d 7c 24 08 48 b8
> RIP: __list

[net-next 1/1] tipc: Fix missing list initializations in struct tipc_subscription

2018-04-03 Thread Jon Maloy
When an item of struct tipc_subscription is created, we fail to
initialize the two lists aggregated into the struct. This has so far
never been a problem, since the items are just added to a root
object by list_add(), which does not require the addee list to be
pre-initialized. However, syzbot is provoking situations where this
addition fails, whereupon the attempted removal if the item from
the list causes a crash.

This problem seems to always have been around, despite that the code
for creating this object was rewritten in commit 242e82cc95f6 ("tipc:
collapse subscription creation functions"), which is still in net-next.

We fix this for that commit by initializing the two lists properly.

Fixes: 242e82cc95f6 ("tipc: collapse subscription creation functions")
Reported-by: syzbot+0bb443b74ce09197e...@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy 
---
 net/tipc/subscr.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/tipc/subscr.c b/net/tipc/subscr.c
index 6925a98..b7d80bc 100644
--- a/net/tipc/subscr.c
+++ b/net/tipc/subscr.c
@@ -145,6 +145,8 @@ struct tipc_subscription *tipc_sub_subscribe(struct net 
*net,
pr_warn("Subscription rejected, no memory\n");
return NULL;
}
+   INIT_LIST_HEAD(&sub->service_list);
+   INIT_LIST_HEAD(&sub->sub_list);
sub->net = net;
sub->conid = conid;
sub->inactive = false;
-- 
2.1.4



RE: [iproute2-next 0/2] tipc: changes to addressing structure

2018-03-29 Thread Jon Maloy

> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of David Ahern
> Sent: Thursday, March 29, 2018 13:59
> To: Jon Maloy ; da...@davemloft.net;
> netdev@vger.kernel.org
> Cc: Mohan Krishna Ghanta Krishnamurthy
[..]
bit node addresses as an integer in hex format,
> >>i.e., we remove the assumption about an internal structure.
> >>
> >
> > Applied to iproute2-next. Thanks,
> >
> 
> BTW, please consider adding json support to tipc. It will make tipc command
> more robust to changes in output format.

Yes, we will do that.

///jon



[net-next v2 2/5] tipc: refactor name table translate function

2018-03-29 Thread Jon Maloy
The tipc_nametbl_translate() function is ugly and hard to
follow. This can be improved somewhat by introducing a stack variable
for holding the publication list to be used and re-ordering the if-
clauses for selection of algorithm.

Signed-off-by: Jon Maloy 
---
 net/tipc/name_table.c | 61 +--
 1 file changed, 25 insertions(+), 36 deletions(-)

diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index e06c7a8..4bdc580 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -399,29 +399,32 @@ struct publication *tipc_nametbl_remove_publ(struct net 
*net, u32 type,
 /**
  * tipc_nametbl_translate - perform service instance to socket translation
  *
- * On entry, 'destnode' is the search domain used during translation.
+ * On entry, 'dnode' is the search domain used during translation.
  *
  * On exit:
- * - if name translation is deferred to another node/cluster/zone,
- *   leaves 'destnode' unchanged (will be non-zero) and returns 0
- * - if name translation is attempted and succeeds, sets 'destnode'
- *   to publication node and returns port reference (will be non-zero)
- * - if name translation is attempted and fails, sets 'destnode' to 0
- *   and returns 0
+ * - if translation is deferred to another node, leave 'dnode' unchanged and
+ *   return 0
+ * - if translation is attempted and succeeds, set 'dnode' to the publishing
+ *   node and return the published (non-zero) port number
+ * - if translation is attempted and fails, set 'dnode' to 0 and return 0
+ *
+ * Note that for legacy users (node configured with Z.C.N address format) the
+ * 'closest-first' lookup algorithm must be maintained, i.e., if dnode is 0
+ * we must look in the local binding list first
  */
-u32 tipc_nametbl_translate(struct net *net, u32 type, u32 instance,
-  u32 *destnode)
+u32 tipc_nametbl_translate(struct net *net, u32 type, u32 instance, u32 *dnode)
 {
struct tipc_net *tn = tipc_net(net);
bool legacy = tn->legacy_addr_format;
u32 self = tipc_own_addr(net);
struct service_range *sr;
struct tipc_service *sc;
+   struct list_head *list;
struct publication *p;
u32 port = 0;
u32 node = 0;
 
-   if (!tipc_in_scope(legacy, *destnode, self))
+   if (!tipc_in_scope(legacy, *dnode, self))
return 0;
 
rcu_read_lock();
@@ -434,43 +437,29 @@ u32 tipc_nametbl_translate(struct net *net, u32 type, u32 
instance,
if (unlikely(!sr))
goto no_match;
 
-   /* Closest-First Algorithm */
-   if (legacy && !*destnode) {
-   if (!list_empty(&sr->local_publ)) {
-   p = list_first_entry(&sr->local_publ,
-struct publication,
-local_publ);
-   list_move_tail(&p->local_publ,
-  &sr->local_publ);
-   } else {
-   p = list_first_entry(&sr->all_publ,
-struct publication,
-all_publ);
-   list_move_tail(&p->all_publ,
-  &sr->all_publ);
-   }
-   }
-
-   /* Round-Robin Algorithm */
-   else if (*destnode == self) {
-   if (list_empty(&sr->local_publ))
+   /* Select lookup algorithm: local, closest-first or round-robin */
+   if (*dnode == self) {
+   list = &sr->local_publ;
+   if (list_empty(list))
goto no_match;
-   p = list_first_entry(&sr->local_publ, struct publication,
-local_publ);
+   p = list_first_entry(list, struct publication, local_publ);
+   list_move_tail(&p->local_publ, &sr->local_publ);
+   } else if (legacy && !*dnode && !list_empty(&sr->local_publ)) {
+   list = &sr->local_publ;
+   p = list_first_entry(list, struct publication, local_publ);
list_move_tail(&p->local_publ, &sr->local_publ);
} else {
-   p = list_first_entry(&sr->all_publ, struct publication,
-all_publ);
+   list = &sr->all_publ;
+   p = list_first_entry(list, struct publication, all_publ);
list_move_tail(&p->all_publ, &sr->all_publ);
}
-
port = p->port;
node = p->node;
 no_match:
spin_unlock_bh(&sc->lock);
 not_found:
rcu_read_unlock();
-   *destnode = node;
+   *dnode = node;
return port;
 }
 
-- 
2.1.4



[net-next v2 5/5] tipc: avoid possible string overflow

2018-03-29 Thread Jon Maloy
gcc points out that the combined length of the fixed-length inputs to
l->name is larger than the destination buffer size:

net/tipc/link.c: In function 'tipc_link_create':
net/tipc/link.c:465:26: error: '%s' directive writing up to 32 bytes
into a region of size between 26 and 58 [-Werror=format-overflow=]
sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);

net/tipc/link.c:465:2: note: 'sprintf' output 11 or more bytes
(assuming 75) into a destination of size 60
sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);

A detailed analysis reveals that the theoretical maximum length of
a link name is:
max self_str + 1 + max if_name + 1 + max peer_str + 1 + max if_name =
16 + 1 + 15 + 1 + 16 + 1 + 15 = 65
Since we also need space for a trailing zero we now set MAX_LINK_NAME
to 68.

Just to be on the safe side we also replace the sprintf() call with
snprintf().

Fixes: 25b0b9c4e835 ("tipc: handle collisions of 32-bit node address
hash values")
Reported-by: Arnd Bergmann 

Signed-off-by: Jon Maloy 
---
 include/uapi/linux/tipc.h | 2 +-
 net/tipc/link.c   | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/tipc.h b/include/uapi/linux/tipc.h
index 156224a..bf6d286 100644
--- a/include/uapi/linux/tipc.h
+++ b/include/uapi/linux/tipc.h
@@ -216,7 +216,7 @@ struct tipc_group_req {
 #define TIPC_MAX_MEDIA_NAME16
 #define TIPC_MAX_IF_NAME   16
 #define TIPC_MAX_BEARER_NAME   32
-#define TIPC_MAX_LINK_NAME 60
+#define TIPC_MAX_LINK_NAME 68
 
 #define SIOCGETLINKNAMESIOCPROTOPRIVATE
 
diff --git a/net/tipc/link.c b/net/tipc/link.c
index 8f2a949..695acb7 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -462,7 +462,8 @@ bool tipc_link_create(struct net *net, char *if_name, int 
bearer_id,
sprintf(peer_str, "%x", peer);
}
/* Peer i/f name will be completed by reset/activate message */
-   sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);
+   snprintf(l->name, sizeof(l->name), "%s:%s-%s:unknown",
+self_str, if_name, peer_str);
 
strcpy(l->if_name, if_name);
l->addr = peer;
-- 
2.1.4



[net-next v2 1/5] tipc: replace name table service range array with rb tree

2018-03-29 Thread Jon Maloy
The current design of the binding table has an unnecessary memory
consuming and complex data structure. It aggregates the service range
items into an array, which is expanded by a factor two every time it
becomes too small to hold a new item. Furthermore, the arrays never
shrink when the number of ranges diminishes.

We now replace this array with an RB tree that is holding the range
items as tree nodes, each range directly holding a list of bindings.

This, along with a few name changes, improves both readability and
volume of the code, as well as reducing memory consumption and hopefully
improving cache hit rate.

Signed-off-by: Jon Maloy 
---
 net/tipc/core.h   |1 +
 net/tipc/link.c   |2 +-
 net/tipc/name_table.c | 1032 ++---
 net/tipc/name_table.h |2 +-
 net/tipc/node.c   |4 +-
 net/tipc/subscr.h |4 +-
 6 files changed, 477 insertions(+), 568 deletions(-)

diff --git a/net/tipc/core.h b/net/tipc/core.h
index d0f64ca..8020a6c 100644
--- a/net/tipc/core.h
+++ b/net/tipc/core.h
@@ -58,6 +58,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct tipc_node;
 struct tipc_bearer;
diff --git a/net/tipc/link.c b/net/tipc/link.c
index 1289b4b..8f2a949 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -1810,7 +1810,7 @@ int tipc_link_bc_nack_rcv(struct tipc_link *l, struct 
sk_buff *skb,
 
 void tipc_link_set_queue_limits(struct tipc_link *l, u32 win)
 {
-   int max_bulk = TIPC_MAX_PUBLICATIONS / (l->mtu / ITEM_SIZE);
+   int max_bulk = TIPC_MAX_PUBL / (l->mtu / ITEM_SIZE);
 
l->window = win;
l->backlog[TIPC_LOW_IMPORTANCE].limit  = max_t(u16, 50, win);
diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index 4359605..e06c7a8 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -44,52 +44,40 @@
 #include "addr.h"
 #include "node.h"
 #include "group.h"
-#include 
-
-#define TIPC_NAMETBL_SIZE 1024 /* must be a power of 2 */
 
 /**
- * struct name_info - name sequence publication info
- * @node_list: list of publications on own node of this 
- * @all_publ: list of all publications of this 
+ * struct service_range - container for all bindings of a service range
+ * @lower: service range lower bound
+ * @upper: service range upper bound
+ * @tree_node: member of service range RB tree
+ * @local_publ: list of identical publications made from this node
+ *   Used by closest_first lookup and multicast lookup algorithm
+ * @all_publ: all publications identical to this one, whatever node and scope
+ *   Used by round-robin lookup algorithm
  */
-struct name_info {
-   struct list_head local_publ;
-   struct list_head all_publ;
-};
-
-/**
- * struct sub_seq - container for all published instances of a name sequence
- * @lower: name sequence lower bound
- * @upper: name sequence upper bound
- * @info: pointer to name sequence publication info
- */
-struct sub_seq {
+struct service_range {
u32 lower;
u32 upper;
-   struct name_info *info;
+   struct rb_node tree_node;
+   struct list_head local_publ;
+   struct list_head all_publ;
 };
 
 /**
- * struct name_seq - container for all published instances of a name type
- * @type: 32 bit 'type' value for name sequence
- * @sseq: pointer to dynamically-sized array of sub-sequences of this 'type';
- *sub-sequences are sorted in ascending order
- * @alloc: number of sub-sequences currently in array
- * @first_free: array index of first unused sub-sequence entry
- * @ns_list: links to adjacent name sequences in hash chain
- * @subscriptions: list of subscriptions for this 'type'
- * @lock: spinlock controlling access to publication lists of all sub-sequences
+ * struct tipc_service - container for all published instances of a service 
type
+ * @type: 32 bit 'type' value for service
+ * @ranges: rb tree containing all service ranges for this service
+ * @service_list: links to adjacent name ranges in hash chain
+ * @subscriptions: list of subscriptions for this service type
+ * @lock: spinlock controlling access to pertaining service ranges/publications
  * @rcu: RCU callback head used for deferred freeing
  */
-struct name_seq {
+struct tipc_service {
u32 type;
-   struct sub_seq *sseqs;
-   u32 alloc;
-   u32 first_free;
-   struct hlist_node ns_list;
+   struct rb_root ranges;
+   struct hlist_node service_list;
struct list_head subscriptions;
-   spinlock_t lock;
+   spinlock_t lock; /* Covers service range list */
struct rcu_head rcu;
 };
 
@@ -99,17 +87,16 @@ static int hash(int x)
 }
 
 /**
- * publ_create - create a publication structure
+ * tipc_publ_create - create a publication structure
  */
-static struct publication *publ_create(u32 type, u32 lower, u32 upper,
-  u32 scope, u32 node, u32 port,
- 

[net-next v2 3/5] tipc: permit overlapping service ranges in name table

2018-03-29 Thread Jon Maloy
With the new RB tree structure for service ranges it becomes possible to
solve an old problem; - we can now allow overlapping service ranges in
the table.

When inserting a new service range to the tree, we use 'lower' as primary
key, and when necessary 'upper' as secondary key.

Since there may now be multiple service ranges matching an indicated
'lower' value, we must also add the 'upper' value to the functions
used for removing publications, so that the correct, corresponding
range item can be found.

These changes guarantee that a well-formed publication/withdrawal item
from a peer node never will be rejected, and make it possible to
eliminate the problematic backlog functionality we currently have for
handling such cases.

Signed-off-by: Jon Maloy 
---
 net/tipc/name_distr.c | 90 +--
 net/tipc/name_distr.h |  1 -
 net/tipc/name_table.c | 64 +---
 net/tipc/name_table.h |  8 ++---
 net/tipc/net.c|  2 +-
 net/tipc/node.c   |  2 +-
 net/tipc/socket.c |  4 +--
 7 files changed, 60 insertions(+), 111 deletions(-)

diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
index 8240a85..51b4b96 100644
--- a/net/tipc/name_distr.c
+++ b/net/tipc/name_distr.c
@@ -204,12 +204,12 @@ void tipc_named_node_up(struct net *net, u32 dnode)
  */
 static void tipc_publ_purge(struct net *net, struct publication *publ, u32 
addr)
 {
-   struct tipc_net *tn = net_generic(net, tipc_net_id);
+   struct tipc_net *tn = tipc_net(net);
struct publication *p;
 
spin_lock_bh(&tn->nametbl_lock);
-   p = tipc_nametbl_remove_publ(net, publ->type, publ->lower,
-publ->node, publ->port, publ->key);
+   p = tipc_nametbl_remove_publ(net, publ->type, publ->lower, publ->upper,
+publ->node, publ->key);
if (p)
tipc_node_unsubscribe(net, &p->binding_node, addr);
spin_unlock_bh(&tn->nametbl_lock);
@@ -261,28 +261,31 @@ void tipc_publ_notify(struct net *net, struct list_head 
*nsub_list, u32 addr)
 static bool tipc_update_nametbl(struct net *net, struct distr_item *i,
u32 node, u32 dtype)
 {
-   struct publication *publ = NULL;
+   struct publication *p = NULL;
+   u32 lower = ntohl(i->lower);
+   u32 upper = ntohl(i->upper);
+   u32 type = ntohl(i->type);
+   u32 port = ntohl(i->port);
+   u32 key = ntohl(i->key);
 
if (dtype == PUBLICATION) {
-   publ = tipc_nametbl_insert_publ(net, ntohl(i->type),
-   ntohl(i->lower),
-   ntohl(i->upper),
-   TIPC_CLUSTER_SCOPE, node,
-   ntohl(i->port), ntohl(i->key));
-   if (publ) {
-   tipc_node_subscribe(net, &publ->binding_node, node);
+   p = tipc_nametbl_insert_publ(net, type, lower, upper,
+TIPC_CLUSTER_SCOPE, node,
+port, key);
+   if (p) {
+   tipc_node_subscribe(net, &p->binding_node, node);
return true;
}
} else if (dtype == WITHDRAWAL) {
-   publ = tipc_nametbl_remove_publ(net, ntohl(i->type),
-   ntohl(i->lower),
-   node, ntohl(i->port),
-   ntohl(i->key));
-   if (publ) {
-   tipc_node_unsubscribe(net, &publ->binding_node, node);
-   kfree_rcu(publ, rcu);
+   p = tipc_nametbl_remove_publ(net, type, lower,
+upper, node, key);
+   if (p) {
+   tipc_node_unsubscribe(net, &p->binding_node, node);
+   kfree_rcu(p, rcu);
return true;
}
+   pr_warn_ratelimited("Failed to remove binding %u,%u from %x\n",
+   type, lower, node);
} else {
pr_warn("Unrecognized name table message received\n");
}
@@ -290,53 +293,6 @@ static bool tipc_update_nametbl(struct net *net, struct 
distr_item *i,
 }
 
 /**
- * tipc_named_add_backlog - add a failed name table update to the backlog
- *
- */
-static void tipc_named_add_backlog(struct net *net, struct distr_item *i,
-  u32 type, u32 node)
-{
-   struct distr_queue_item *e;
-   struct tipc_net *tn = net_generic(net, tipc_net_id);
-   unsigned lo

[net-next v2 4/5] tipc: tipc: rename address types in user api

2018-03-29 Thread Jon Maloy
The three address type structs in the user API have names that in
reality reflect the specific, non-Linux environment where they were
originally created.

We now give them more intuitive names, in accordance with how TIPC is
described in the current documentation.

struct tipc_portid   -> struct tipc_socket_addr
struct tipc_name -> struct tipc_service_addr
struct tipc_name_seq -> struct tipc_service_range

To avoid confusion, we also update some comments and macro names to
match the new terminology.

For compatibility, we add macros that map all old names to the new ones.

Signed-off-by: Jon Maloy 
---
 include/uapi/linux/tipc.h | 57 +++
 1 file changed, 33 insertions(+), 24 deletions(-)

diff --git a/include/uapi/linux/tipc.h b/include/uapi/linux/tipc.h
index 4ac9f1f..156224a 100644
--- a/include/uapi/linux/tipc.h
+++ b/include/uapi/linux/tipc.h
@@ -45,33 +45,33 @@
  * TIPC addressing primitives
  */
 
-struct tipc_portid {
+struct tipc_socket_addr {
__u32 ref;
__u32 node;
 };
 
-struct tipc_name {
+struct tipc_service_addr {
__u32 type;
__u32 instance;
 };
 
-struct tipc_name_seq {
+struct tipc_service_range {
__u32 type;
__u32 lower;
__u32 upper;
 };
 
 /*
- * Application-accessible port name types
+ * Application-accessible service types
  */
 
-#define TIPC_CFG_SRV   0   /* configuration service name type */
-#define TIPC_TOP_SRV   1   /* topology service name type */
-#define TIPC_LINK_STATE2   /* link state name type */
-#define TIPC_RESERVED_TYPES64  /* lowest user-publishable name type */
+#define TIPC_NODE_STATE0   /* node state service type */
+#define TIPC_TOP_SRV   1   /* topology server service type */
+#define TIPC_LINK_STATE2   /* link state service type */
+#define TIPC_RESERVED_TYPES64  /* lowest user-allowed service type */
 
 /*
- * Publication scopes when binding port names and port name sequences
+ * Publication scopes when binding service / service range
  */
 enum tipc_scope {
TIPC_CLUSTER_SCOPE = 2, /* 0 can also be used */
@@ -108,28 +108,28 @@ enum tipc_scope {
  * TIPC topology subscription service definitions
  */
 
-#define TIPC_SUB_PORTS 0x01/* filter for port availability */
-#define TIPC_SUB_SERVICE   0x02/* filter for service availability */
-#define TIPC_SUB_CANCEL0x04/* cancel a subscription */
+#define TIPC_SUB_PORTS  0x01/* filter: evt at each match */
+#define TIPC_SUB_SERVICE0x02/* filter: evt at first up/last down */
+#define TIPC_SUB_CANCEL 0x04/* filter: cancel a subscription */
 
 #define TIPC_WAIT_FOREVER  (~0)/* timeout for permanent subscription */
 
 struct tipc_subscr {
-   struct tipc_name_seq seq;   /* name sequence of interest */
+   struct tipc_service_range seq;  /* range of interest */
__u32 timeout;  /* subscription duration (in ms) */
__u32 filter;   /* bitmask of filter options */
char usr_handle[8]; /* available for subscriber use */
 };
 
 #define TIPC_PUBLISHED 1   /* publication event */
-#define TIPC_WITHDRAWN 2   /* withdraw event */
+#define TIPC_WITHDRAWN 2   /* withdrawal event */
 #define TIPC_SUBSCR_TIMEOUT3   /* subscription timeout event */
 
 struct tipc_event {
__u32 event;/* event type */
-   __u32 found_lower;  /* matching name seq instances */
-   __u32 found_upper;  /*"  "" "  */
-   struct tipc_portid port;/* associated port */
+   __u32 found_lower;  /* matching range */
+   __u32 found_upper;  /*"  "*/
+   struct tipc_socket_addr port;   /* associated socket */
struct tipc_subscr s;   /* associated subscription */
 };
 
@@ -149,20 +149,20 @@ struct tipc_event {
 #define SOL_TIPC   271
 #endif
 
-#define TIPC_ADDR_NAMESEQ  1
-#define TIPC_ADDR_MCAST1
-#define TIPC_ADDR_NAME 2
-#define TIPC_ADDR_ID   3
+#define TIPC_ADDR_MCAST 1
+#define TIPC_SERVICE_RANGE  1
+#define TIPC_SERVICE_ADDR   2
+#define TIPC_SOCKET_ADDR3
 
 struct sockaddr_tipc {
unsigned short family;
unsigned char  addrtype;
signed   char  scope;
union {
-   struct tipc_portid id;
-   struct tipc_name_seq nameseq;
+   struct tipc_socket_addr id;
+   struct tipc_service_range nameseq;
struct {
-   struct tipc_name name;
+   struct tipc_service_addr name;
__u32 domain;
} name;
} addr;
@@ -230,8 +230,13 @@ s

[net-next v2 0/5] tipc: slim down name table

2018-03-29 Thread Jon Maloy
We clean up and improve the name binding table:

 - Replace the memory consuming 'sub_sequence/service range' array with
   an RB tree.
 - Introduce support for overlapping service sequences/ranges

 v2: #1: Fixed a missing initialization reported by David Miller
 #4: Obsoleted and replaced a few more macros to get a consistent
 terminology in the API.
 #5: Added new commit to fix a potential string overflow bug (it
 is still only in net-next) reported by Arnd Bergmann

Jon Maloy (5):
  tipc: replace name table service range array with rb tree
  tipc: refactor name table translate function
  tipc: permit overlapping service ranges in name table
  tipc: tipc: rename address types in user api
  tipc: avoid possible string overflow

 include/uapi/linux/tipc.h |   59 +--
 net/tipc/core.h   |1 +
 net/tipc/link.c   |5 +-
 net/tipc/name_distr.c |   90 +---
 net/tipc/name_distr.h |1 -
 net/tipc/name_table.c | 1075 -
 net/tipc/name_table.h |   10 +-
 net/tipc/net.c|2 +-
 net/tipc/node.c   |4 +-
 net/tipc/socket.c |4 +-
 net/tipc/subscr.h |4 +-
 11 files changed, 556 insertions(+), 699 deletions(-)

-- 
2.1.4



[iproute2-next 1/2] tipc: introduce command for handling a new 128-bit node identity

2018-03-28 Thread Jon Maloy
We add the possibility to set and get a 128 bit node identifier, as
an alternative to the legacy 32-bit node address we are using now.

We also add an option to set and get 'clusterid' in the node. This
is the same as what we have so far called 'netid' and performs the
same operations. For compatibility the old 'netid' commands are
retained; we just remove them from the help texts.

Acked-by: GhantaKrishnamurthy MohanKrishna 

Signed-off-by: Jon Maloy 
---
 include/uapi/linux/tipc_netlink.h |  2 +
 tipc/misc.c   | 78 ++-
 tipc/misc.h   |  2 +
 tipc/node.c   | 98 +--
 4 files changed, 174 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/tipc_netlink.h b/include/uapi/linux/tipc_netlink.h
index 469aa67..6bf8ec6 100644
--- a/include/uapi/linux/tipc_netlink.h
+++ b/include/uapi/linux/tipc_netlink.h
@@ -162,6 +162,8 @@ enum {
TIPC_NLA_NET_UNSPEC,
TIPC_NLA_NET_ID,/* u32 */
TIPC_NLA_NET_ADDR,  /* u32 */
+   TIPC_NLA_NET_NODEID,/* u64 */
+   TIPC_NLA_NET_NODEID_W1, /* u64 */
 
__TIPC_NLA_NET_MAX,
TIPC_NLA_NET_MAX = __TIPC_NLA_NET_MAX - 1
diff --git a/tipc/misc.c b/tipc/misc.c
index 8091222..16849f1 100644
--- a/tipc/misc.c
+++ b/tipc/misc.c
@@ -12,7 +12,7 @@
 #include 
 #include 
 #include 
-
+#include 
 #include "misc.h"
 
 #define IN_RANGE(val, low, high) ((val) <= (high) && (val) >= (low))
@@ -33,3 +33,79 @@ uint32_t str2addr(char *str)
fprintf(stderr, "invalid network address \"%s\"\n", str);
return 0;
 }
+
+static int is_hex(char *arr, int last)
+{
+   int i;
+
+   while (!arr[last])
+   last--;
+
+   for (i = 0; i <= last; i++) {
+   if (!IN_RANGE(arr[i], '0', '9') &&
+   !IN_RANGE(arr[i], 'a', 'f') &&
+   !IN_RANGE(arr[i], 'A', 'F'))
+   return 0;
+   }
+   return 1;
+}
+
+static int is_name(char *arr, int last)
+{
+   int i;
+   char c;
+
+   while (!arr[last])
+   last--;
+
+   if (last > 15)
+   return 0;
+
+   for (i = 0; i <= last; i++) {
+   c = arr[i];
+   if (!IN_RANGE(c, '0', '9') && !IN_RANGE(c, 'a', 'z') &&
+   !IN_RANGE(c, 'A', 'Z') && c != '-' && c != '_' &&
+   c != '.' && c != ':' && c != '@')
+   return 0;
+   }
+   return 1;
+}
+
+int str2nodeid(char *str, uint8_t *id)
+{
+   int len = strlen(str);
+   int i;
+
+   if (len > 32)
+   return -1;
+
+   if (is_name(str, len - 1)) {
+   memcpy(id, str, len);
+   return 0;
+   }
+   if (!is_hex(str, len - 1))
+   return -1;
+
+   str[len] = '0';
+   for (i = 0; i < 16; i++) {
+   if (sscanf(&str[2 * i], "%2hhx", &id[i]) != 1)
+   break;
+   }
+   return 0;
+}
+
+void nodeid2str(uint8_t *id, char *str)
+{
+   int i;
+
+   if (is_name((char *)id, 15)) {
+   memcpy(str, id, 16);
+   return;
+   }
+
+   for (i = 0; i < 16; i++)
+   sprintf(&str[2 * i], "%02x", id[i]);
+
+   for (i = 31; str[i] == '0'; i--)
+   str[i] = 0;
+}
diff --git a/tipc/misc.h b/tipc/misc.h
index 585df74..6e8afdd 100644
--- a/tipc/misc.h
+++ b/tipc/misc.h
@@ -15,5 +15,7 @@
 #include 
 
 uint32_t str2addr(char *str);
+int str2nodeid(char *str, uint8_t *id);
+void nodeid2str(uint8_t *id, char *str);
 
 #endif
diff --git a/tipc/node.c b/tipc/node.c
index fe085ae..3ebbe0b 100644
--- a/tipc/node.c
+++ b/tipc/node.c
@@ -131,6 +131,90 @@ static int cmd_node_get_addr(struct nlmsghdr *nlh, const struct cmd *cmd,
return 0;
 }
 
+static int cmd_node_set_nodeid(struct nlmsghdr *nlh, const struct cmd *cmd,
+  struct cmdl *cmdl, void *data)
+{
+   char buf[MNL_SOCKET_BUFFER_SIZE];
+   uint8_t id[16] = {0,};
+   uint64_t *w0 = (uint64_t *) &id[0];
+   uint64_t *w1 = (uint64_t *) &id[8];
+   struct nlattr *nest;
+   char *str;
+
+   if (cmdl->argc != cmdl->optind + 1) {
+   fprintf(stderr, "Usage: %s node set nodeid NODE_ID\n",
+   cmdl->argv[0]);
+   return -EINVAL;
+   }
+
+   str = shift_cmdl(cmdl);
+   if (str2nodeid(str, id)) {
+   fprintf(stderr, "Invalid node identity\n");
+ 

[iproute2-next 0/2] tipc: changes to addressing structure

2018-03-28 Thread Jon Maloy
1: We introduce ability to set/get 128-bit node identities
2: We rename 'net id' to 'cluster id' in the command API, 
   of course in a compatible way.
3: We print out all 32-bit node addresses as an integer in hex format,
   i.e., we remove the assumption about an internal structure.

Jon Maloy (2):
  tipc: introduce command for handling a new 128-bit node identity
  tipc: change node address printout formats

 include/uapi/linux/tipc_netlink.h |   2 +
 tipc/link.c   |   3 +-
 tipc/misc.c   |  78 ++-
 tipc/misc.h   |   2 +
 tipc/nametable.c  |  16 ++
 tipc/node.c   | 109 +-
 tipc/socket.c |   3 +-
 7 files changed, 183 insertions(+), 30 deletions(-)

-- 
2.1.4


