from:"Kengo NAKAHARA"

Re: sadb_x_policy_reserved -> sadb_x_policy_flags

2023-01-03 Thread Kengo NAKAHARA


Hi,

Thank you for your pointing out.  I will add the definition similar
to FreeBSD.

Anyway, thank you for your fix on libreswan side.

On 2022/12/29 14:07, Andrew Cagney wrote:

I just noticed this rename when building libreswan against 10.

Can I suggest adding:
   #define sadb_x_policy_flags sadb_x_policy_flags
to signal this change to consumers?  Or perhaps there is some other
way to detect if this change is present?

As an aside, FreeBSD hid their rename by defining:
   #define sadb_x_policy_reserved sadb_x_policy_scope
from my pov that is less interesting.

Andrew

PS: the code is dumping structures



Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Division,
Technology Unit

Kengo NAKAHARA

Re: Can version bump up to 9.99.100?

2022-09-19 Thread Kengo NAKAHARA

Hi,

On 2022/09/17 19:34, Robert Elz wrote:

 Date:Fri, 16 Sep 2022 23:46:59 +
 From:David Holland 
 Message-ID:  

   | While it's possible that some of
   | these may exist, it's unlikely that there are many of them or that
   | they appear anywhere especially important.

That's all encouraging, and yet more reason to bump the version
soon (as required) so we have the 9.99.1xx series around long
enough for any issues to be found and fixed, before we're back
to 10.0 and 10.99.1 with a whole different set of issues to fix.

I commit the patch and bump up to 9.99.100 now.
Thank you for your advise!

Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Division,
Technology Unit

Kengo NAKAHARA

Re: Can version bump up to 9.99.100?

2022-09-16 Thread Kengo NAKAHARA


Hi,

Thank you for your comment.

On 2022/09/16 15:46, Paul Goyette wrote:

On Fri, 16 Sep 2022, Robert Elz wrote:


   Date:    Fri, 16 Sep 2022 11:10:30 +0900
   From:    Kengo NAKAHARA 
   Message-ID:  <90c3c46e-6668-9644-70c3-0eab2cf1c...@iij.ad.jp>


 | Hmm, I will test kernel module building before commit.

Sorry, I wasn't clear - I build everything (modules included) - I just
never actually load any modules, so I haven't tested them (my kernels have
the MODULAR option disabled).   I cannot imagine an issue, as internally
everything just uses __NetBSD_Version__ as a 32 bit (ordered) blob - the
breakdown into 9.99.100 type strings is just for us humans (and pkgsrc).


Not entirely true.

The human-oriented version is used as part of the path to modules
directory.  Need to make sure that the modules set is properly
populated, and that module loads find them in the directory.


I test module loading with just bumped up src.  After build.sh modules &&
build.sh installmodules, that works fine.

# uname -r
9.99.100

# ls -dl /stand/amd64/9.99.100/modules/tprof*
drwxr-xr-x  2 root  wheel  512 Sep 16 17:45 /stand/amd64/9.99.100/modules/tprof
drwxr-xr-x  2 root  wheel  512 Sep 16 17:46 
/stand/amd64/9.99.100/modules/tprof_x86
# modstat | grep tprof
# modload tprof_x86
# echo $?
0
# modstat | grep tprof
tprof  driver   filesys  a14957 -
tprof_x86  driver   filesys  -02010 tprof



Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Division,
Technology Unit

Kengo NAKAHARA

Re: Can version bump up to 9.99.100?

2022-09-15 Thread Kengo NAKAHARA

Hi,

Thank you for your detailed comments!

On 2022/09/15 20:09, Robert Elz wrote:

 Date:Thu, 15 Sep 2022 17:08:52 +0900
 From:Kengo NAKAHARA 
 Message-ID:  <279eae4e-79f4-39c0-5279-83d5738b6...@iij.ad.jp>

   | Can version bump up to 9.99.100?  Is there anything wrong?

It can.   There are no issues with the base system (incl xsrc)
I have tested this in the past, it all just works - and that we
were going to need it sometime before the -10 branch has been
obvious for a while.

Two things I did not test were kernel modules (since I never use
them) which I highly doubt will give any problem, and should be
a trivial fix in the unlikely event there is an issue;

Hmm, I will test kernel module building before commit.

And pkgsrc, because when I tested I needed to revert to the then
current version number (.97 at the time I think), and that reversion
would do things to some pkgsrc version numbers that it should not
be required to deal with.

I don't have enough of a handle on the latter to guess, but if
something in pkgsrc breaks, this will provide the motivation to fix
it, which otherwise might never happen.  In case any change is
required to the parts of it in base, getting that done before the
-10 branch would be good.

So just do it.

I see.  I will commit it early next week.

Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Division,
Technology Unit

Kengo NAKAHARA

Can version bump up to 9.99.100?

2022-09-15 Thread Kengo NAKAHARA


Hi,

I want to commit some fixes which need to bump up.
E.g. fix to be left inconsistent routes

https://github.com/knakahara/netbsd-src/commit/0ce8e1114d42e51a1ea88f42b1d942563c8312d6

Can version bump up to 9.99.100?  Is there anything wrong?


Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Division,
Technology Unit

Kengo NAKAHARA

Re: debugging a kernel that doesn't start

2022-09-13 Thread Kengo NAKAHARA


Hi,

On 2022/09/13 4:17, Edgar Fuß wrote:

I'm trying to run NetBSD on a Dell PowerEdge R6515, and the kernel is being
loaded (PXE or USB) but then the machine hangs hard.

What's the way to debug a kernel that hangs so early that you can't printf
or drop into ddb? I guess that's a phenomenon quite common for a new port
or changes to locore.s (or whatever that's called today), but it's completely
new to me.

I have virtually no clue about PeCee hardware. At the point the kernel is
started, are BIOS routines still available?


If you use -current, I think it may hit the following issue.
http://gnats.netbsd.org/54962

Could you try the patch in PR?


Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Division,
Technology Unit

Kengo NAKAHARA

Re: if_wm.c workqueue

2022-05-27 Thread Kengo NAKAHARA


Hi,

On 2022/05/27 10:11, Emmanuel Dreyfus wrote:

On Thu, May 26, 2022 at 05:46:31PM +0900, Kengo NAKAHARA wrote:

Could you use head of netbsd-9 branch?


Tried that, workqueue now works fine, and this improves a lot
the behavior against heavy flows.


I am grad to hear that.



Is there a way to tune the priority of the workqueue against
user processes?


I don't know to change kthread priority at run time.  Currently,
the workqueue's priority (PRI_SOFTNET) is higher than user processes.



The wm(4) man page needs a word about this, I will try to update
it. What about the other sysctls?

hw.wm0.q0.txq_free = 4088
hw.wm0.q0.txd_head = 1513
hw.wm0.q0.txd_tail = 1514
hw.wm0.q0.txq_next = 1514
hw.wm0.q0.txq_sfree = 55
hw.wm0.q0.txq_snext = 58
hw.wm0.q0.txq_sdirty = 48
hw.wm0.q0.txq_flags = 0
hw.wm0.q0.txq_stopping = 0
hw.wm0.q0.txq_sending = 1
hw.wm0.q0.rxq_ptr = 82


I think they are not needed to be written in man, as they are debug
information strongly dependent on implementation.  About other device
drivers, not all sysctls are written in man.


Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Division,
Technology Unit

Kengo NAKAHARA

Re: if_wm.c workqueue

2022-05-26 Thread Kengo NAKAHARA


Hi,

Thank you for your information.

On 2022/05/26 15:50, Emmanuel Dreyfus wrote:

On Thu, May 26, 2022 at 11:40:15AM +0900, Kengo NAKAHARA wrote:

Hmm, could you send me the result of "dmesg | grep wm0" ?


This is NetBSD-9.2/amd64. It broke with two other acpi devices
starting to complain about workqueues. Sorry for being vague,
I did not copy the message, and I cannot break it during
daylight.

  wm0 at pci0 dev 25 function 0: I217 V Ethernet Connection (rev. 0x05)
  wm0: interrupting at msi2 vec 0
  wm0: PCI-Express bus
  wm0: 2048 words FLASH, version 0.13.4
  wm0: Ethernet address 3c:ec:ef:23:05:32
  wm0: 0x6a4480
  ihphy0 at wm0 phy 2: i217 10/100/1000 media interface, rev. 4


Ah, INTx, MSI of NetBSD-9.2 doesn't support workqueue proessing yet,
because it doesn't include the following commit.

https://github.com/NetBSD/src/commit/37210869b14978ed9935efa67e7d5b302336982d

Could you use head of netbsd-9 branch?



  wm1 at pci2 dev 0 function 0: I210-T1 Ethernet Server Adapter (rev. 0x03)
  wm1: for TX and RX interrupting at msix4 vec 0 affinity to 0
  wm1: for TX and RX interrupting at msix4 vec 1 affinity to 1
  wm1: for LINK interrupting at msix4 vec 2
  wm1: PCI-Express bus
  wm1: 2048 words FLASH(HW), version 3.16, Image Unique ID 04d9
  wm1: ROM image version 3.16 is older than 3.25
  wm1: Ethernet address 3c:ec:ef:23:05:33
  wm1: Copper
  wm1: 0x4c414500
  makphy0 at wm1 phy 1: I210 10/100/1000 media interface, rev. 0

wm1 uses MSI-X, so this interface can use workqueue processing in
NetBSD-9.2.


Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Division,
Technology Unit

Kengo NAKAHARA

Re: if_wm.c workqueue

2022-05-25 Thread Kengo NAKAHARA


Hi,

On 2022/05/26 11:20, Emmanuel Dreyfus wrote:

On Thu, May 26, 2022 at 10:28:03AM +0900, Kengo NAKAHARA wrote:

When hw.wm0.txrx_workqueue is set to 1, receive and transmit processing
of wm0 uses workqueue(9) instead of softint(9), that is, the processing
can be preempted.  As a result, userland process can run even if wm0
receives packets at very high rate.  However, that may causes degradation
of throughput and latency.


Right, the behavior seems desirable, however the thing seems to have
rotten. The interface will not work once enabled.


Hmm, could you send me the result of "dmesg | grep wm0" ?


Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Division,
Technology Unit

Kengo NAKAHARA

Re: if_wm.c workqueue

2022-05-25 Thread Kengo NAKAHARA


Hi,

On 2022/05/25 23:13, Emmanuel Dreyfus wrote:

Hello

I see if_wm.c can handle packets using a workqueue. The thing is
controlled by sysctl hw.wm0.txrx_workqueue but it is not documented
in the ma page.

What is the effect? I have a machine that has trouble handling high
throughput (interrupts rise to 80%+), I suspect it can help, am I
correct?


When hw.wm0.txrx_workqueue is set to 1, receive and transmit processing
of wm0 uses workqueue(9) instead of softint(9), that is, the processing
can be preempted.  As a result, userland process can run even if wm0
receives packets at very high rate.  However, that may causes degradation
of throughput and latency.


Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Division,
Technology Unit

Kengo NAKAHARA

Re: Can WM_EVENT_COUNTERS be enabled default?

2021-01-27 Thread Kengo NAKAHARA


Hi,

On 2021/01/27 17:56, Martin Husemann wrote:

On Wed, Jan 27, 2021 at 03:44:29PM +0900, Kengo NAKAHARA wrote:

Sorry, I think LP64 system which does not have atomic_64 operations.


Unrelated to this discussion: are there any such systems (that
NetBSD supports)? I am not aware of any, so would be curious to learn
about them.


Ah, I confused with early amd64 CPU which does not have CMPXCHG16B.
Sorry, please ignore.


Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA

Re: Can WM_EVENT_COUNTERS be enabled default?

2021-01-26 Thread Kengo NAKAHARA


Hi,

On 2021/01/27 14:04, Jason Thorpe wrote:



On Jan 26, 2021, at 8:54 PM, Kengo NAKAHARA  wrote:

Ah, the stats are just incremented (not atomic ops) because they are
incremented by one CPU or protected by locks.  I will enable it as you
say.


My thinking is:

1. Probably not much interest on non-LP64 systems.
-and-
2. On non-LP64 systems, the upper/lower half can become inconsistent (pretty 
unlikely, but still :-)


Sorry, I think LP64 system which does not have atomic_64 operations.

Hmm, it should use atomic_{load,store}_relaxed(), I will fix it later.


Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA

Re: Can WM_EVENT_COUNTERS be enabled default?

2021-01-26 Thread Kengo NAKAHARA


Hi,

On 2021/01/27 13:36, Jason Thorpe wrote:



On Jan 26, 2021, at 7:51 PM, Kengo NAKAHARA  wrote:

Hi,

Thank you for your comments.

On 2021/01/27 12:14, Jason Thorpe wrote:

On Jan 26, 2021, at 6:21 PM, Kengo NAKAHARA  wrote:

Someone may worry about any other effects, I will add new kernel option
WM_DISABLE_EVENT_COUNTERS, please use it.

No strong objection from me, but can we enable it by default only on systems 
with 64-bit atomics?


I see.  Hmm, I will enable WM_EVENT_COUNTERS for amd64 and aarch64
instead of adding WM_DISABLE_EVENT_COUNTERS.  And then, I will also
enable for other 64 bit architectures.


I think you can safely enable it by default for all _LP64.  Are the stats 
incremented with atomics?


Ah, the stats are just incremented (not atomic ops) because they are
incremented by one CPU or protected by locks.  I will enable it as you
say.


Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA

Re: Can WM_EVENT_COUNTERS be enabled default?

2021-01-26 Thread Kengo NAKAHARA


Hi,

Thank you for your comments.

On 2021/01/27 12:14, Jason Thorpe wrote:



On Jan 26, 2021, at 6:21 PM, Kengo NAKAHARA  wrote:

Someone may worry about any other effects, I will add new kernel option
WM_DISABLE_EVENT_COUNTERS, please use it.


No strong objection from me, but can we enable it by default only on systems 
with 64-bit atomics?


I see.  Hmm, I will enable WM_EVENT_COUNTERS for amd64 and aarch64
instead of adding WM_DISABLE_EVENT_COUNTERS.  And then, I will also
enable for other 64 bit architectures.


Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA

Can WM_EVENT_COUNTERS be enabled default?

2021-01-26 Thread Kengo NAKAHARA

]   0.00-10.04  sec   964 MBytes   806 Mbits/sec  receiver



+ GENERIC with WM_EVENT_COUNTERS

- ipgen

rfc2544 tolerable error rate: 0.1000%
rfc2544 trial duration: 30 sec
rfc2544 pps resolution: 0.0010%

framesize|0M  100M 200M 300M 400M 500M 600M 700M 800M 900M 1Gbps
-+++++++++++
  64 |##   39.57Mbps,   
77279/1488095pps
 128 | 67.56Mbps,   
65979/ 844594pps
 512 |302.83Mbps,   
73934/ 234962pps
1024 |##  586.94Mbps,   
71648/ 119731pps
1280 |##  744.40Mbps,   
72695/  96153pps
1408 |##  832.76Mbps,   
73931/  87535pps
1518 |#   889.38Mbps,   
73236/  81274pps

framesize|0   
|100k|200k|300k|400k|500k|600k|700k|800k|900k|1.0m|1.1m|1.2m|1.3m|1.4m|1.5m pps
-++++++++++++++++
  64 |###   
   77279/1488095pps,   5.19%
 128 |###   
   65979/ 844594pps,   7.81%
 512 |###   
   73934/ 234962pps,  31.47%
1024 |###   
   71648/ 119731pps,  59.84%
1280 |###   
   72695/  96153pps,  75.60%
1408 |###   
   73931/  87535pps,  84.46%
1518 |###   
   73236/  81274pps,  90.11%


- iperf3 client side

# iperf3 -c 192.168.0.254 -b 1000M
Connecting to host 192.168.0.254, port 5201
[  4] local 192.168.0.20 port 65534 connected to 192.168.0.254 port 5201
[ ID] Interval   Transfer Bandwidth   Retr  Cwnd
[  4]   0.00-1.00   sec  98.2 MBytes   823 Mbits/sec0   2.17 MBytes
[  4]   1.00-2.00   sec   112 MBytes   937 Mbits/sec0   3.94 MBytes
[  4]   2.00-3.00   sec   112 MBytes   938 Mbits/sec0   4.99 MBytes
[  4]   3.00-4.00   sec   107 MBytes   894 Mbits/sec0451 KBytes
[  4]   4.00-5.00   sec   111 MBytes   928 Mbits/sec0723 KBytes
[  4]   5.00-6.00   sec   110 MBytes   925 Mbits/sec0918 KBytes
[  4]   6.00-7.00   sec   109 MBytes   911 Mbits/sec0   1.03 MBytes
[  4]   7.00-8.00   sec   112 MBytes   936 Mbits/sec0   1.05 MBytes
[  4]   8.00-9.00   sec   111 MBytes   932 Mbits/sec0   1.07 MBytes
[  4]   9.00-10.00  sec   112 MBytes   936 Mbits/sec0   1.08 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval   Transfer Bandwidth   Retr
[  4]   0.00-10.00  sec  1.07 GBytes   916 Mbits/sec0 sender
[  4]   0.00-10.00  sec  1.07 GBytes   916 Mbits/sec  receiver

iperf Done.


- iperf3 server side

# iperf3 -s
---
Server listening on 5201
---
Accepted connection from 192.168.0.254, port 48470
[  5] local 192.168.0.20 port 5201 connected to 192.168.0.254 port 48472
[ ID] Interval   Transfer Bandwidth
[  5]   0.00-1.00   sec  65.9 MBytes   553 Mbits/sec
[  5]   1.00-2.00   sec  91.7 MBytes   769 Mbits/sec
[  5]   2.00-3.00   sec  97.6 MBytes   819 Mbits/sec
[  5]   3.00-4.00   sec   102 MBytes   854 Mbits/sec
[  5]   4.00-5.00   sec   102 MBytes   856 Mbits/sec
[  5]   5.00-6.00   sec  90.1 MBytes   755 Mbits/sec
[  5]   6.00-7.00   sec  86.3 MBytes   725 Mbits/sec
[  5]   7.00-8.00   sec  92.0 MBytes   771 Mbits/sec
[  5]   8.00-9.00   sec   105 MBytes   881 Mbits/sec
[  5]   9.00-10.00  sec   105 MBytes   878 Mbits/sec
[  5]  10.00-10.04  sec  4.00 MBytes   881 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval   Transfer Bandwidth
[  5]   0.00-10.04  sec  0.00 Bytes  0.00 bits/sec  sender
[  5]   0.00-10.04  sec   941 MBytes   787 Mbits/sec  receiver



Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA

Re: [PATCH] Make workqueue(9) use threadpool(9)

2020-09-14 Thread Kengo NAKAHARA


Hi,

On 2020/09/14 1:09, Taylor R Campbell wrote:

The attached patch adapts workqueue(9) to use a threadpool job instead
of a worker kthread in each queue.  There are a couple of advantages
to this change:

1. No kthreads for unused workqueues.

A lot of drivers and subsystems create workqueues to handle tasks
in thread context -- often per-CPU workqueues.  Even if these
subsystems aren't in use, they currently require kthreads to be
allocated, which may use up a substantial amount of kva.

With this change, kthreads are created only according to demand by
threadpool(9) -- but the workqueues are still guaranteed the same
concurrency as before, as long as kthreads can be created.

2. Workqueues can be created earlier (and are ready for CPU hotplug).

Right now, workqueue_create with WQ_PERCPU cannot be used until
after all CPUs have been detected.  I accidentally started doing
this in wg(4) because I've been running with this patch (and the
issue doesn't affect the rumpy atf tests).

For now I'll apply a workaround to wg(4), but it would be nice if
modules could create workqueues before configure (and if, should
anyone make CPU hotplug happen, workqueues were not a barrier to
that).

This change uses percpu_create and threadpool_job_init, rather than
explicit allocation of ncpu-sized arrays and kthread_create.  Using
percpu_create means percpu(9) takes care of running initialization
when CPUs are attached, and using threadpool_job_init instead of
kthread_create means we don't have to worry about failure when
initializing each CPU's queue in the percpu_create constructor.

The downside, of course, is that workqueue_create no longer guarantees
preallocation of all the threads needed to run the workqueues --
instead, if there is a shortage of threads dynamically assigned to do
work under load, the threadpool(9) logic will block until they can be
created.

Thoughts?


I tested the attached patch with workqueue mode wm(4)(sysctl -w
hw.wm[01].txrx_workqueue=1) and NET_MPSAFE=on kernel.  It works fine for
multiprocessor IP forwarding on my system.


Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA

Re: racy acccess in kern_runq.c

2019-12-06 Thread Kengo NAKAHARA


Hi,

On 2019/12/06 18:00, Andrew Doran wrote:

Hi,

On Fri, Dec 06, 2019 at 03:52:23PM +0900, Kengo NAKAHARA wrote:


There are some racy accesses in kern_runq.c detected by KCSAN.  Those
racy access messages is so frequency that they cover other messages,
so I want to fix them.  They can be fixed by the following patch.


Please don't commit this.  These accesses are racy by design.  There is no
safety issue and we do not want to disturb the activity of other CPUs in
this code path by locking them.  We also don't want to use atomics either.
This code path is extremely hot using atomics would impose a severe
performance penalty on the system under certain workloads.


I see, I don't commit it.  Indeed, I am worry about performance degradation.



Also if you change this to use strong ordering, you're quite likely to
change the selection behaviour of LWPs.  This is something delicate that
exhibits reasonable scheduling behaviour in large part through randomness
i.e by accident, so serialising it it's likely to have strong effects on how
LWPs are distributed.

Lastly I have a number of experimental changes to this code which I'll be
committing in the near future allowing people to change the LWP selection
policy to see how if we can improve performance under different workloads.
They will also likely show up in KCSAN as racy/dangerous accesses.

My suggestion is to find a way to teach KCSAN that we know something is
racy, we like it being racy, and that it's not a problem, so that it no
longer shows up in the KCSAN output.


I see.  I will add __nocsan attribute to the functions in my local sources
to find other races.


Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA

Re: racy acccess in kern_runq.c

2019-12-06 Thread Kengo NAKAHARA


Hi,

Thank you for your comment.

On 2019/12/06 16:42, Maxime Villard wrote:

Le 06/12/2019 à 07:52, Kengo NAKAHARA a écrit :

Hi,

There are some racy accesses in kern_runq.c detected by KCSAN.  Those
racy access messages is so frequency that they cover other messages,
so I want to fix them.  They can be fixed by the following patch.


diff --git a/sys/kern/kern_runq.c b/sys/kern/kern_runq.c
index 04ff1732016..bb0815689cd 100644
--- a/sys/kern/kern_runq.c
+++ b/sys/kern/kern_runq.c
@@ -627,7 +627,7 @@ sched_balance(void *nocallout)
  /* Make lockless countings */
  for (CPU_INFO_FOREACH(cii, ci)) {
  spc = >ci_schedstate;
-
+    spc_lock(ci);
  /*
   * Average count of the threads
   *
@@ -643,10 +643,11 @@ sched_balance(void *nocallout)
  hci = ci;
  highest = spc->spc_avgcount;
  }
+    spc_unlock(ci);
  }
  
  /* Update the worker */

-    worker_ci = hci;
+    atomic_swap_ptr(_ci, hci);
  
  if (nocallout == NULL)

  callout_schedule(_ch, balance_period);
@@ -734,12 +735,15 @@ sched_idle(void)
  spc_unlock(ci);
  
  no_migration:

+    spc_lock(ci);
  if ((spc->spc_flags & SPCF_OFFLINE) != 0 || spc->spc_count != 0) {
+    spc_unlock(ci);
  return;
  }
  
  /* Reset the counter, and call the balancer */

  spc->spc_avgcount = 0;
+    spc_unlock(ci);
  sched_balance(ci);
  tci = worker_ci;
  tspc = >ci_schedstate;



It just so happens we were talking about this yesterday.


Oh, I missed the thread, sorry.


worker_ci indeed needs to be a real atomic, but it means we also have to
fetch it atomically here:

744 tci = worker_ci;

For the other variables, my understanding was that we don't care about
races, but only care about making sure the accesses are not split, so it
seems to me atomic_relaxed would suffice.


I see.  I update the patch.

diff --git a/sys/kern/kern_runq.c b/sys/kern/kern_runq.c
index 04ff1732016..dbd0f73585a 100644
--- a/sys/kern/kern_runq.c
+++ b/sys/kern/kern_runq.c
@@ -627,6 +627,7 @@ sched_balance(void *nocallout)
/* Make lockless countings */
for (CPU_INFO_FOREACH(cii, ci)) {
spc = >ci_schedstate;
+   u_int avg, mcount;
 
 		/*

 * Average count of the threads
@@ -634,19 +635,20 @@ sched_balance(void *nocallout)
 * The average is computed as a fixpoint number with
 * 8 fractional bits.
 */
-   spc->spc_avgcount = (
-   weight * spc->spc_avgcount + (100 - weight) * 256 * 
spc->spc_mcount
-   ) / 100;
+   avg = atomic_load_relaxed(>spc_avgcount);
+   mcount = atomic_load_relaxed(>spc_mcount);
+   avg = (weight * avg + (100 - weight) * 256 * mcount) / 100;
+   atomic_store_relaxed(>spc_avgcount, avg);
 
 		/* Look for CPU with the highest average */

-   if (spc->spc_avgcount > highest) {
+   if (avg > highest) {
hci = ci;
-   highest = spc->spc_avgcount;
+   highest = avg;
}
}
 
 	/* Update the worker */

-   worker_ci = hci;
+   atomic_swap_ptr(_ci, hci);
 
 	if (nocallout == NULL)

callout_schedule(_ch, balance_period);
@@ -734,14 +736,15 @@ sched_idle(void)
spc_unlock(ci);
 
 no_migration:

-   if ((spc->spc_flags & SPCF_OFFLINE) != 0 || spc->spc_count != 0) {
+   if ((spc->spc_flags & SPCF_OFFLINE) != 0
+   || atomic_load_relaxed(>spc_count) != 0) {
return;
}
 
 	/* Reset the counter, and call the balancer */

-   spc->spc_avgcount = 0;
+   atomic_store_relaxed(>spc_avgcount, 0);
sched_balance(ci);
-   tci = worker_ci;
+   atomic_swap_ptr(, worker_ci);
tspc = >ci_schedstate;
if (ci == tci || spc->spc_psid != tspc->spc_psid)
return;


Is this appropriate?


Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA

racy acccess in kern_runq.c

2019-12-05 Thread Kengo NAKAHARA


Hi,

There are some racy accesses in kern_runq.c detected by KCSAN.  Those
racy access messages is so frequency that they cover other messages,
so I want to fix them.  They can be fixed by the following patch.


diff --git a/sys/kern/kern_runq.c b/sys/kern/kern_runq.c
index 04ff1732016..bb0815689cd 100644
--- a/sys/kern/kern_runq.c
+++ b/sys/kern/kern_runq.c
@@ -627,7 +627,7 @@ sched_balance(void *nocallout)
/* Make lockless countings */
for (CPU_INFO_FOREACH(cii, ci)) {
spc = >ci_schedstate;
-
+   spc_lock(ci);
/*
 * Average count of the threads
 *
@@ -643,10 +643,11 @@ sched_balance(void *nocallout)
hci = ci;
highest = spc->spc_avgcount;
}
+   spc_unlock(ci);
}
 
 	/* Update the worker */

-   worker_ci = hci;
+   atomic_swap_ptr(_ci, hci);
 
 	if (nocallout == NULL)

callout_schedule(_ch, balance_period);
@@ -734,12 +735,15 @@ sched_idle(void)
spc_unlock(ci);
 
 no_migration:

+   spc_lock(ci);
if ((spc->spc_flags & SPCF_OFFLINE) != 0 || spc->spc_count != 0) {
+   spc_unlock(ci);
return;
}
 
 	/* Reset the counter, and call the balancer */

spc->spc_avgcount = 0;
+   spc_unlock(ci);
sched_balance(ci);
tci = worker_ci;
tspc = >ci_schedstate;


Can I commit this patch?


Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA

What should the name of KPI which call softint_schedule with kpreempt_disable/enable be?

2019-10-16 Thread Kengo NAKAHARA


Hi,

I have fixed some softint_schedule() caller which can be called
while preempt enabled, that is, if_vmx.c:r1.51, ix_txrx.c:r1.56,
if_gre.c:r1.176, if_l2tp.c:r1.40, and if_tap.c:1.114.

There are several similar codes not only in network stack but also
USB stack(*1) and others(*2).
*1 https://nxr.netbsd.org/xref/src/sys/dev/usb/ucom.c#1257
   https://nxr.netbsd.org/xref/src/sys/dev/usb/usb.c#1120
*2 https://nxr.netbsd.org/xref/src/sys/kern/kern_rndq.c#225

I think the codes should be a KPI. What should the KPI name be?
I think of
void softint_schedule_lwp(void *arg)

Could you comment this KPI name or other idea?


Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA

Re: workqueue_destroy() can cause hanging up in the some cases

2019-09-06 Thread Kengo NAKAHARA


Hi,

On 2019/09/06 12:45, Taylor R Campbell wrote:

Date: Fri, 6 Sep 2019 12:17:32 +0900 From: Kengo NAKAHARA


I found workqueue_destroy() for WQ_PERCPU workqueue can cause
hanging up while preempt disabled. The caller of
workqueue_destroy() requires for q_worker kthread to call
kthread_exit(). In the implementation, the caller do cv_wait()(*1)
until q_worker sets NULL to q->q_worker(*2). - (*1)
https://nxr.netbsd.org/xref/src/sys/kern/subr_workqueue.c#227 -
(*2) https://nxr.netbsd.org/xref/src/sys/kern/subr_workqueue.c#208

However, q_worker thread cannot run on the CPU which the caller of 
workqueue_destroy() is running, when preempt disabled. That causes 
hanging up.


I think it may be enough to just add notice to workqueue_destroy()
man, but it should be fixed if it can.

Do you have any comments or fix ideas?


Interesting.

1. Why are you trying to workqueue_destroy with preemption disabled?

Normally creation and destruction routines should be done in thread 
context without anything blocked or any resources held, because they

may generally sleep waiting for resources.


I found it when I am debugging attach processing of ixg(4) and vmx(4),
in particular, I try to call workqueue_destroy() in attach function
to check error case.
# The drivers use workqueue(9) instead of softint(9) to reduce the
# packet processing load.
The device attach processing is done before kpreempt_enable().

Yes, in this situation, we can avoid the hanging up to use
config_finalize_register(). But, I worry about there may be other
problems.


2. How do you conclude that disabling preemption makes the
difference? Do you have a minimal working example that hangs if you
disable preemption, but works if you leave it enabled?

I would expect cv_wait to sleep and let the other thread run even if
preemption is disabled.  (We should maybe also have a mechanism for
disabling preemption _and sleep_ if we don't already -- meaning 
sleeping would cause a panic, in a context where you want to make 
sure no thread switches can happen.)


Ah... I'm sorry, I am incorrect and you are correct.
As I wrote the above, I found it in autoconfig processing. The autoconfig
processing is different from just preempt disabled situation. When boot
processing has completed, workqueue_destroy() after kpreempt_disable()
works fine.

Hmm, the hanging up seems to occur while boot processing only. So, I
should just call workqueue_destroy() and workqueue_create() in
config_finalize_register()'ed function. Sorry to report wrongly.


Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA 
aut

workqueue_destroy() can cause hanging up in the some cases

2019-09-05 Thread Kengo NAKAHARA


Hi,

I found workqueue_destroy() for WQ_PERCPU workqueue can cause hanging up
while preempt disabled. The caller of workqueue_destroy() requires
for q_worker kthread to call kthread_exit(). In the implementation,
the caller do cv_wait()(*1) until q_worker sets NULL to q->q_worker(*2).
- (*1) https://nxr.netbsd.org/xref/src/sys/kern/subr_workqueue.c#227
- (*2) https://nxr.netbsd.org/xref/src/sys/kern/subr_workqueue.c#208

However, q_worker thread cannot run on the CPU which the caller of
workqueue_destroy() is running, when preempt disabled. That causes
hanging up.

I think it may be enough to just add notice to workqueue_destroy() man,
but it should be fixed if it can.

Do you have any comments or fix ideas?


Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA

Re: Concurrent trie-hash map

2018-10-17 Thread Kengo NAKAHARA

Hi,

On 2018/10/17 5:00, Mindaugas Rasiukevicius wrote:
> Hi,
> 
> I recently implemented thmap [1] -- a concurrent trie-hash map, combining
> the elements of hashing and radix trie.  It is supports lock-free lookups
> and concurrent inserts/deletes.  It is designed to be optimal as a general
> purpose *concurrent* associative array [2][3].
> 
> I am going to import it under src/sys/net/npf, so it could be used by NPF.
> However, there are various other areas in the kernel where it could improve
> the performance significantly, e.g. in the network stack, 5-tuple mapping
> to socket (in_pcb.c).  While I currently do not have plans to work in those
> areas, if anybody would like to use it elsewhere, I can import thmap in the
> src/sys/kern directory.  Let me know.

If it is imported to src/sys/kern, I want use it to re-implement
sys/netinet/ip_flow.c::ipflowtable. However, I don't have enough time to
do it right now...


> [1] https://github.com/rmind/thmap
> [2] http://www.netbsd.org/~rmind/thmap_lookup_80_64bit_keys_intel_4980hq.svg
> [3] http://www.netbsd.org/~rmind/thmap_insert_80_64bit_keys_intel_4980hq.svg

Wow, excellent scaling! :)


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA

Re: RFC: ipsec(4) pseudo interface

2018-01-10 Thread Kengo NAKAHARA

Hi,

On 2017/12/26 17:31, Kengo NAKAHARA wrote:
> Hi,
> 
> On 2017/12/25 23:14, Christos Zoulas wrote:
>> On Dec 25,  4:54pm, k-nakah...@iij.ad.jp (Kengo NAKAHARA) wrote:
>> -- Subject: Re: RFC: ipsec(4) pseudo interface
>>
>> | Here is the updated patch series and unified patch.
>> | - https://www.netbsd.org/~knakahara/if_ipsec/if_ipsec4.tgz
>> | - https://www.netbsd.org/~knakahara/if_ipsec/if_ipsec4-unified.patch
>>
>> Thanks, LGTM!
> 
> Thank you very much for your many helpful pointing out!
> 
> Does anyone else have any comments?

It seems net/ipsec ATFs are stable now, so I commit it.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: ipsec(4) pseudo interface

2017-12-26 Thread Kengo NAKAHARA

Hi,

On 2017/12/25 23:14, Christos Zoulas wrote:
> On Dec 25,  4:54pm, k-nakah...@iij.ad.jp (Kengo NAKAHARA) wrote:
> -- Subject: Re: RFC: ipsec(4) pseudo interface
> 
> | Here is the updated patch series and unified patch.
> | - https://www.netbsd.org/~knakahara/if_ipsec/if_ipsec4.tgz
> | - https://www.netbsd.org/~knakahara/if_ipsec/if_ipsec4-unified.patch
> 
> Thanks, LGTM!

Thank you very much for your many helpful pointing out!

Does anyone else have any comments?


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: ipsec(4) pseudo interface

2017-12-24 Thread Kengo NAKAHARA

Hi,

On 2017/12/23 9:05, Christos Zoulas wrote:
> In article <e5721d0b-1b44-612e-a23c-ca5e2af52...@iij.ad.jp>,
> Kengo NAKAHARA  <k-nakah...@iij.ad.jp> wrote:
>> Hi,
>>
>> Thank you for your reviewing.
> 
> 
> Thanks for fixing; more nit-picking:
> 1. there is a variable called err instead of error why (all the rest
>are called error)?

Oh, good catch. I fix it(ipsecif4_fragout()).

> 2. I prefer fewer lines of code, fewer variables for all the copies
>of those similar functions, like:
> 
> +static int
> +if_ipsec_encap_detach(struct ipsec_variant *var)
> +{
> +
> + KASSERT(var != NULL);
> + KASSERT(if_ipsec_variant_is_configured(var));
> +
> + switch (var->iv_psrc->sa_family) {
> +#ifdef INET
> + case AF_INET:
> + return ipsecif4_detach(var);
> +#endif /* INET */
> +#ifdef INET6
> + case AF_INET6:
> + return ipsecif6_detach(var);
> + break;
> +#endif /* INET6 */
> + default:
> + return EINVAL;
> + }
> +}

Exactly. I fix the following duplications.
+ if_ipsec_encap_attach() and if_ipsec_encap_detach()
+ if_ipsec_gather_addr_port() and if_ipsec_set_addr_port()
  - remove if_ipsec_gather_addr_port()
+ sa_len checking in if_ipsec_ioctl()
  - functionize it as if_ipsec_check_salen()

Here is the updated patch series and unified patch.
- https://www.netbsd.org/~knakahara/if_ipsec/if_ipsec4.tgz
- https://www.netbsd.org/~knakahara/if_ipsec/if_ipsec4-unified.patch


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: ipsec(4) pseudo interface

2017-12-21 Thread Kengo NAKAHARA

Hi,

I'm sorry, I send mail while editing by mistake.

On 2017/12/20 22:40, Thor Lancelot Simon wrote:
> On Mon, Dec 18, 2017 at 06:49:44PM +0900, Kengo NAKAHARA wrote:
>> Hi,
>>
>> We implement ipsec(4) pseudo interface for route-based VPNs. This pseudo
>> interface manages its security policy(SP) by itself, in particular, we do
>> # ifconfig ipsec0 tunnel 10.0.0.1 10.0.0.2
>> the SPs "10.0.0.1 -> 10.0.0.2"(out) and "10.0.0.2 -> 10.0.0.1"(in) are
>> generated automatically and atomically. And then, when we do
>> # ifconfig ipsec0 deletetunnel
>> the SPs are destroyed automatically and atomically, too.
> 
> Do you have IKE daemon changes to use this?

No, I don't. Because ipsec(4) interface send the same PF_KEY message
as adding transport mode security policy manually. That is the behavior
to use existing IKE daemon.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: ipsec(4) pseudo interface

2017-12-21 Thread Kengo NAKAHARA

Hi,

Thank you for your reviewing.

On 2017/12/20 21:08, Christos Zoulas wrote:
> In article <75925357-8e16-0f0f-b7a0-78155c865...@iij.ad.jp>,
> Kengo NAKAHARA  <k-nakah...@iij.ad.jp> wrote:
>> Hi,
>>
>> On 2017/12/19 2:54, Christos Zoulas wrote:
>>> In article <02c36311-2fcd-08cf-cc71-b472e7c01...@iij.ad.jp>,
>>> Kengo NAKAHARA  <k-nakah...@iij.ad.jp> wrote:
>>>> Hi,
>>>>
>>>> We implement ipsec(4) pseudo interface for route-based VPNs. This pseudo
>>>> interface manages its security policy(SP) by itself, in particular, we do
>>>># ifconfig ipsec0 tunnel 10.0.0.1 10.0.0.2
>>>> the SPs "10.0.0.1 -> 10.0.0.2"(out) and "10.0.0.2 -> 10.0.0.1"(in) are
>>>> generated automatically and atomically. And then, when we do
>>>># ifconfig ipsec0 deletetunnel
>>>> the SPs are destroyed automatically and atomically, too.
>>>>
>>>> Here is the patches and an unified patch.
>>>>http://netbsd.org/~knakahara/if_ipsec/if_ipsec.tgz
>>>>http://netbsd.org/~knakahara/if_ipsec/if_ipsec-unified.patch
>>>>
>>>> By the way, I have one question. In the above patch(s), I temporarily add
>>>> manual for ipsecX pseudo interface as if_ipsec.4, because there is already
>>>> ipsec.4 for general ipsec protocol. How should I add the man of ipsec(4)
>>>> pseudo interface?
>>>>(a) Add if_ipsec.4
>>>>(b) move current ipsec.4(for ipsec protocol) to ipsec.9, and then
>>>>add ipsec.4(for ipsec pseudo interface)
>>>>(c) any other
>>>>
>>>> Could you comment the patch or the question?
>>>
>>> I've wanted this feature for a long time! Looks ok to me, but the
>>> sockaddr_copy()/port setting code, should be abstracted to a single
>>> function since it is repeated in ioctl().
>>
>> Thank you for your reviewing. I fix it in the following patch.
>>- patch series
>>  - https://netbsd.org/~knakahara/if_ipsec/if_ipsec2.tgz
>>- unified patch
>>  - https://netbsd.org/~knakahara/if_ipsec/if_ipsec2-unified.patch
> 
> Thanks:
> 
> + error = var->iv_output(var, family, m);
> + if (!error) {
> 
> - It is simpler to do (early returns):
>   if (error)
>   return error;
>   ...
>   return 0;

I apply it.

> - What's the point of 'goto bad' in ioctl if there is no cleanup to be done?

Oh, indeed. I refactor cleanup processing.

> - There is one more place you could use the addr_port function?
>   [or perhaps abstract it in in_port_t *get_port(struct sockaddr *)?

Sorry, I missed it.

> - We should use in_port_t instead of unsigned short for ports.

Ah, the original code is quite old, I missed updating it.

> - I am not clear how the code interracts with ipsec endpoints created
>   from /etc/ipsec.conf and ones created via ifconfig? I.e. if I create
>   endpoints in /etc/ipsec.conf, does the cloner interface get automatically
>   constructed?

In a nutshell, it is first-come basis. If the SPs required by ipsec(4)
interface are already added by /etc/ipsec.conf or manually setkey(8),
'ifconfig ipsec0 tunnel "src" "dst"' fails, and vice versa.
Sorry, there was no information about that and SPs required by ipsec(4)
interface. I add the information to man.
Furthermore, I find a bug in handling that error case. Thank you for
your pointing out!

Here is the update patch series and unified patch.
https://www.netbsd.org/~knakahara/if_ipsec/if_ipsec3.tgz
https://www.netbsd.org/~knakahara/if_ipsec/if_ipsec3-unified.patch


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: ipsec(4) pseudo interface

2017-12-19 Thread Kengo NAKAHARA

Hi,

On 2017/12/19 11:07, Christos Zoulas wrote:
> In article <20171218184400.ga27...@britannica.bec.de>,
> Joerg Sonnenberger  <jo...@bec.de> wrote:
>> On Mon, Dec 18, 2017 at 06:49:44PM +0900, Kengo NAKAHARA wrote:
>>> (a) Add if_ipsec.4
>>> (b) move current ipsec.4(for ipsec protocol) to ipsec.9, and then
>>> add ipsec.4(for ipsec pseudo interface)
>>> (c) any other
>>
>> I'd call it either ifipsec(4) or ipsecif(4).

Thank you for comments. I use ipsecif(4) as it is easy to sort in
alphabetical with ipsec(4). :)

> Traditionally the names of devices drivers and protocols go to section 4.
> So we have a conflict here, use one of the names Joerg suggested and cross
> reference it on the top of the ipsec.4 page:
> 
> .Pp
> This manual pages describes the IPSEC.
> For the network device driver please see
> .Xr ifipsec 4 .

Thank you for your suggestion. I add the reference to ipsec(4).

As a result, here is the updated patch.
- patch series
  - https://netbsd.org/~knakahara/if_ipsec/if_ipsec2.tgz
- in paticular, manual is added in 
0004-add-ipsec-4-I-F-man-as-ipsecif.4.patch
- unified patch
  - https://netbsd.org/~knakahara/if_ipsec/if_ipsec2-unified.patch
# include christos@n.o's pointing out


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: ipsec(4) pseudo interface

2017-12-19 Thread Kengo NAKAHARA

Hi,

On 2017/12/19 2:54, Christos Zoulas wrote:
> In article <02c36311-2fcd-08cf-cc71-b472e7c01...@iij.ad.jp>,
> Kengo NAKAHARA  <k-nakah...@iij.ad.jp> wrote:
>> Hi,
>>
>> We implement ipsec(4) pseudo interface for route-based VPNs. This pseudo
>> interface manages its security policy(SP) by itself, in particular, we do
>># ifconfig ipsec0 tunnel 10.0.0.1 10.0.0.2
>> the SPs "10.0.0.1 -> 10.0.0.2"(out) and "10.0.0.2 -> 10.0.0.1"(in) are
>> generated automatically and atomically. And then, when we do
>># ifconfig ipsec0 deletetunnel
>> the SPs are destroyed automatically and atomically, too.
>>
>> Here is the patches and an unified patch.
>>http://netbsd.org/~knakahara/if_ipsec/if_ipsec.tgz
>>http://netbsd.org/~knakahara/if_ipsec/if_ipsec-unified.patch
>>
>> By the way, I have one question. In the above patch(s), I temporarily add
>> manual for ipsecX pseudo interface as if_ipsec.4, because there is already
>> ipsec.4 for general ipsec protocol. How should I add the man of ipsec(4)
>> pseudo interface?
>>(a) Add if_ipsec.4
>>(b) move current ipsec.4(for ipsec protocol) to ipsec.9, and then
>>add ipsec.4(for ipsec pseudo interface)
>>(c) any other
>>
>> Could you comment the patch or the question?
> 
> I've wanted this feature for a long time! Looks ok to me, but the
> sockaddr_copy()/port setting code, should be abstracted to a single
> function since it is repeated in ioctl().

Thank you for your reviewing. I fix it in the following patch.
- patch series
  - https://netbsd.org/~knakahara/if_ipsec/if_ipsec2.tgz
- unified patch
  - https://netbsd.org/~knakahara/if_ipsec/if_ipsec2-unified.patch
# include changing man entry, I will describe about it in a later mail


Thanks,

-- 
//////
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

RFC: ipsec(4) pseudo interface

2017-12-18 Thread Kengo NAKAHARA

Hi,

We implement ipsec(4) pseudo interface for route-based VPNs. This pseudo
interface manages its security policy(SP) by itself, in particular, we do
# ifconfig ipsec0 tunnel 10.0.0.1 10.0.0.2
the SPs "10.0.0.1 -> 10.0.0.2"(out) and "10.0.0.2 -> 10.0.0.1"(in) are
generated automatically and atomically. And then, when we do
# ifconfig ipsec0 deletetunnel
the SPs are destroyed automatically and atomically, too.

Here is the patches and an unified patch.
http://netbsd.org/~knakahara/if_ipsec/if_ipsec.tgz
http://netbsd.org/~knakahara/if_ipsec/if_ipsec-unified.patch

By the way, I have one question. In the above patch(s), I temporarily add
manual for ipsecX pseudo interface as if_ipsec.4, because there is already
ipsec.4 for general ipsec protocol. How should I add the man of ipsec(4)
pseudo interface?
(a) Add if_ipsec.4
(b) move current ipsec.4(for ipsec protocol) to ipsec.9, and then
add ipsec.4(for ipsec pseudo interface)
(c) any other

Could you comment the patch or the question?


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: PERCPU_LIST to fix PR kern/52515

2017-12-10 Thread Kengo NAKAHARA

Hi,

On 2017/12/11 11:09, Taylor R Campbell wrote:
>> Date: Mon, 11 Dec 2017 10:53:11 +0900
>> From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
>>
>> Thank you for your reviewing. I see. Here is the updated patch.
>> [...]
>> If there is no problem, can I commit and pullup -8 the patch?
> 
> LGTM, please commit!

Thank you for your quickly reviewing. I commit it as subr_psref.c:r1.8
and psref.h:r1.3. I will send pullup request -8 branch.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: PERCPU_LIST to fix PR kern/52515

2017-12-10 Thread Kengo NAKAHARA

Hi,

On 2017/12/09 0:18, Taylor R Campbell wrote:
>> Date: Fri, 8 Dec 2017 18:51:28 +0900
>> From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
>>
>> However, it seems your patch has large diff... From the point of code
>> stability, smaller diff SLIST version would be better for netbsd-8 branch
>> to fix the bug. Because your patch causes some new ATF failures such as
>> ldp_regen and route_change_ifp (reported by ozaki-r@n.o). We can probably
>> fix them at once but guaranteeing its stability would take more time.
> 
> Not surprising I made a bug somewhere in there!  The SLIST version is
> fine by me.  However, there's one small nit:
> 
>> diff --git a/sys/sys/psref.h b/sys/sys/psref.h
>> index 88db6dbb603..9096a3798d6 100644
>> --- a/sys/sys/psref.h
>> +++ b/sys/sys/psref.h
>> @@ -69,7 +69,7 @@ struct psref_target {
>>   *  written only on the local CPU.
>>   */
>>  struct psref {
>> -LIST_ENTRY(psref)   psref_entry;
>> +SLIST_ENTRY(psref)  psref_entry;
>>  const struct psref_target   *psref_target;
> 
> In order to avoid changing the kernel ABI in netbsd-8, please add a
> void *psref_unused0 after psref_entry.  That way we don't have to
> think about the consequences of an ABI change in the branch, and we
> still have the opportunity to switch back to a doubly-linked list if
> we want.

Thank you for your reviewing. I see. Here is the updated patch.

diff --git a/sys/kern/subr_psref.c b/sys/kern/subr_psref.c
index c3f76ab0e74..9eac19def3f 100644
--- a/sys/kern/subr_psref.c
+++ b/sys/kern/subr_psref.c
@@ -78,7 +78,7 @@ __KERNEL_RCSID(0, "$NetBSD: subr_psref.c,v 1.7 2017/06/01 
02:45:13 chs Exp $");
 #include 
 #include 
 
-LIST_HEAD(psref_head, psref);
+SLIST_HEAD(psref_head, psref);
 
 static bool_psref_held(const struct psref_target *, struct psref_class *,
bool);
@@ -135,7 +135,7 @@ psref_cpu_drained_p(void *p, void *cookie, struct cpu_info 
*ci __unused)
const struct psref_cpu *pcpu = p;
bool *retp = cookie;
 
-   if (!LIST_EMPTY(>pcpu_head))
+   if (!SLIST_EMPTY(>pcpu_head))
*retp = false;
 }
 
@@ -194,7 +194,7 @@ psref_check_duplication(struct psref_cpu *pcpu, struct 
psref *psref,
bool found = false;
struct psref *_psref;
 
-   LIST_FOREACH(_psref, >pcpu_head, psref_entry) {
+   SLIST_FOREACH(_psref, >pcpu_head, psref_entry) {
if (_psref == psref &&
_psref->psref_target == target) {
found = true;
@@ -250,7 +250,7 @@ psref_acquire(struct psref *psref, const struct 
psref_target *target,
 #endif
 
/* Record our reference.  */
-   LIST_INSERT_HEAD(>pcpu_head, psref, psref_entry);
+   SLIST_INSERT_HEAD(>pcpu_head, psref, psref_entry);
psref->psref_target = target;
psref->psref_lwp = curlwp;
psref->psref_cpu = curcpu();
@@ -273,6 +273,7 @@ void
 psref_release(struct psref *psref, const struct psref_target *target,
 struct psref_class *class)
 {
+   struct psref_cpu *pcpu;
int s;
 
KASSERTMSG((kpreempt_disabled() || cpu_softintr_p() ||
@@ -302,7 +303,9 @@ psref_release(struct psref *psref, const struct 
psref_target *target,
 * (as does blocking interrupts).
 */
s = splraiseipl(class->prc_iplcookie);
-   LIST_REMOVE(psref, psref_entry);
+   pcpu = percpu_getref(class->prc_percpu);
+   SLIST_REMOVE(>pcpu_head, psref, psref, psref_entry);
+   percpu_putref(class->prc_percpu);
splx(s);
 
/* If someone is waiting for users to drain, notify 'em.  */
@@ -353,7 +356,7 @@ psref_copy(struct psref *pto, const struct psref *pfrom,
pcpu = percpu_getref(class->prc_percpu);
 
/* Record the new reference.  */
-   LIST_INSERT_HEAD(>pcpu_head, pto, psref_entry);
+   SLIST_INSERT_HEAD(>pcpu_head, pto, psref_entry);
pto->psref_target = pfrom->psref_target;
pto->psref_lwp = curlwp;
pto->psref_cpu = curcpu();
@@ -474,7 +477,7 @@ _psref_held(const struct psref_target *target, struct 
psref_class *class,
pcpu = percpu_getref(class->prc_percpu);
 
/* Search through all the references on this CPU.  */
-   LIST_FOREACH(psref, >pcpu_head, psref_entry) {
+   SLIST_FOREACH(psref, >pcpu_head, psref_entry) {
/* Sanity-check the reference's CPU.  */
KASSERTMSG((psref->psref_cpu == curcpu()),
"passive reference transferred from CPU %u to CPU %u",
diff --git a/sys/sys/psref.h b/sys/sys/psref.h
index 88db6dbb603..d1ab606e117 100644
--- a/sys/sys/psref.h
+++ b/sys/sys/psref.h
@@ -69,7 +69,9 @@ struct psref_target {
  * written only on t

Re: RFC: PERCPU_LIST to fix PR kern/52515

2017-12-08 Thread Kengo NAKAHARA

  SLIST_ENTRY(psref)  psref_entry;
const struct psref_target   *psref_target;
struct lwp  *psref_lwp;
struct cpu_info *psref_cpu;


Of course, we also think this bug must be fixed before netbsd-8 rc.


Thanks,

-- 
//////
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: vlan(4) use pkthdr instead of mtag

2017-09-26 Thread Kengo NAKAHARA

Hi,

On 2017/09/26 14:57, Shoichi YAMAGUCHI wrote:
> I updated the patch again.
> https://gist.githubusercontent.com/s-ymgch228/6597cfc4b6f79c6c62fcdf25003acb55/raw/bf3b295aeeb6ae63965d533a828d8f5e360c884e/vlan_mtag.patch
>  - Introduce new functions related to vlan to cxgb_sge.c's t3_encap
> and remove __predict_false
>  - add a KASSERT to vlan_set_tag
> 
> Thanks for your advices.
> 
> Thanks,
> -- s-yamaguchi@IIJ

I'm his co-worker :)
I commit his patch now.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: PERCPU_LIST to fix PR kern/52515

2017-09-19 Thread Kengo NAKAHARA

Hi,

Thank you for your quick response.

On 2017/09/19 14:36, Taylor R Campbell wrote:
>> Date: Tue, 19 Sep 2017 13:38:15 +0900
>> From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
>>
>> To fix PR kern/52515(*), in particular psref(9), I implement
>> PERCPU_LIST which is dedicated for percpu_alloc'ed percpu struct.
>> Here is the patch which consists of PERCULI_LIST implementation and
>> applying to subr_psref.c.
>> http://www.NetBSD.org/~knakahara/percpu-list/percpu-list.patch
>>
>> (*) http://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=52515
> 
> Derp.  That's an embarrassing bug!  Especially since I knew that about
> percpu(9) and didn't connect the dots...
> 
> First, what you propose is a reasonable strategy for working around
> the problem, but I'm reluctant to commit to a new style of list
> abstraction for just this purpose.  Can you confine it to subr_psref.c
> for now, and retain only the operations that are absolutely necessary?
> E.g., no need for insert_before/after -- just insert_head, remove,
> foreach, empty.

Hmm, you are right. I thought percpu list can be use by other new
components in the future, however only psref(9) must have the
initialization dependency problem. Therefore, other components should
use pointer instead of itself in percpu(9) data like percpu_urandom_cprng
in sys/dev/rndpseudo.c.

> I've attached a small patch that might serve as a quicker stop-gap
> (untested).  It's gross, but it doesn't change the ABI, add any header
> files, 

I tested your patch(with tiny typo fix s/perpcu_getref/percpu_getref/),
it works fine in my workload, thanks!

> Second, we should maybe have a more robust way to handle per-CPU data
> that require nontrivial initialization or finalization so that it's
> not necessary to do this in the first place.  Maybe something like
> 
> struct percpu *percpu_create(size_t nbytes,
> void (*init)(void *, struct cpu_info *),
> void (*fini)(void *, struct cpu_info *));
> 
> with the callbacks invoked by percpu_cpu_enlarge.  Needs details
> filled out like can it fail to initialize or finalize and thereby
> block CPU online/offline?

I agree. I think it must be hard work...


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

RFC: PERCPU_LIST to fix PR kern/52515

2017-09-18 Thread Kengo NAKAHARA

Hi,

To fix PR kern/52515(*), in particular psref(9), I implement PERCPU_LIST which
is dedicated for percpu_alloc'ed percpu struct. Here is the patch which consists
of PERCULI_LIST implementation and applying to subr_psref.c.
http://www.NetBSD.org/~knakahara/percpu-list/percpu-list.patch

(*) http://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=52515

The details are following.

As written in PR/52515, the root of the problem is back-pointer to
percpu_alloc'ed struct. At first, I tried generic solution to let
percpu_alloc'ed struct use a pointer instead of itself. Here is a part of the
PoC patch.
== bad patch ==
@@ -103,9 +103,18 @@ struct psref_class {
  * Not exposed by the API.
  */
 struct psref_cpu {
-   struct psref_head   pcpu_head;
+   struct psref_head   *pcpu_head;
 };
 
+static void
+psref_class_pcpu_init_p(void *p, void *cookie, struct cpu_info *ci)
+{
+   struct psref_cpu *pcpu = p;
+
+   pcpu->pcpu_head = kmem_alloc(sizeof(*(pcpu->pcpu_head)), KM_SLEEP);
+   LIST_INIT(pcpu->pcpu_head);
+}
+
 /*
  * psref_class_create(name, ipl)
  *
@@ -121,6 +130,7 @@ psref_class_create(const char *name, int ipl)
 
class = kmem_alloc(sizeof(*class), KM_SLEEP);
class->prc_percpu = percpu_alloc(sizeof(struct psref_cpu));
+   percpu_foreach(class->prc_percpu, _class_pcpu_init_p, NULL);
mutex_init(>prc_lock, MUTEX_DEFAULT, ipl);
cv_init(>prc_cv, name);
class->prc_iplcookie = makeiplcookie(ipl);
== bad patch ==

Howerver, it causes initialization dependency problems to apply this
modification to psref(9). That result from that percpu_foreach() must be called
after ncpu is fixed, that is,
(A) psref_class_create() for "ifnet_psref_class" must be called after ncpu
is fixed
- as psref_class_create() use percpu_foreach() in this modification
(B) "ifnet_psref_class" must be initialized before any (real and pseudo)
network interfaces are attached

To satisfy both (A) and (B), it is required the MI hook called immediately
after ncpu is fixed, yes, that hook is not implemented yet. Furthermore,
the modification is too large even if that hook is already implemented.
So, I gave up to use this generic solution.

After that, I decide to use list specific solution to use PERCPU_LIST
implemented for percpu_alloc'ed struct. The key design is an omission of the
back-pointer to list head in the first list element, that is, when it wants
to access list head, it always call percpu_getref() and then use the returned
pointer.
# There are some PR kern/52515 problems which cannot be fix by percpu list,
# such as ipforward_rt_percpu. They can be fixed by using a pointer instead
# of itself because they don't have complicated initialization dependency.


Could you comment the above patch or idea?


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: Crash in crypto_init

2017-07-20 Thread Kengo NAKAHARA

Hi,

On 2017/07/20 23:59, J. Lewis Muir wrote:
> On 07/20, Roy Marples wrote:
>> On 20/07/2017 08:45, Kengo NAKAHARA wrote:
>>> +   /*
>>> +* Some encryption devicees(such as mvcesa) is attached before
>>> +* ipi_sysinit(). That causes assertion in ipi_register() as
>>> +* crypto_ret_si softint uses SOFTINT_RCPU.
>>> +*/
>>
>> devices, not devicees
> 
> Also:
> 
> s/devices(such/devices (such/
> s/is attached/are attached/
> s/causes assertion/causes an assertion/

Oh, thank you for your pointing out. I commit.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: Crash in crypto_init

2017-07-20 Thread Kengo NAKAHARA

Hi,

On 2017/07/20 18:11, Roy Marples wrote:
> On 20/07/2017 08:45, Kengo NAKAHARA wrote:
>> +/*
>> + * Some encryption devicees(such as mvcesa) is attached before
>> + * ipi_sysinit(). That causes assertion in ipi_register() as
>> + * crypto_ret_si softint uses SOFTINT_RCPU.
>> + */
> 
> devices, not devicees

Oh, thank you for your pointing out! I fix it.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: Crash in crypto_init

2017-07-20 Thread Kengo NAKAHARA

Hi,

On 2017/07/20 18:07, Martin Husemann wrote:
> On Thu, Jul 20, 2017 at 04:45:08PM +0900, Kengo NAKAHARA wrote:
>> Hi,
>>
>> On 2017/07/19 20:55, Martin Husemann wrote:
>>> On Wed, Jul 19, 2017 at 08:40:27PM +0900, Kengo NAKAHARA wrote:
>>>> I think the reason is mvcesa_attach() is called before ipi_sysinit().
>>>> Could you try below tentative patch?
>>>
>>> Thanks for the quick workaround, it boots now!
>>
>> I re-implement fix patch. Could you revert previous workaround and try
>> the following patch?
> 
> Yes, that works (and looks good to me)

Thank you for your testing and reviewing. I commit the patch include
fix typo pointed out by roy@n.o as crypto.c:r 1.93.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: Crash in crypto_init

2017-07-20 Thread Kengo NAKAHARA

Hi,

On 2017/07/19 20:55, Martin Husemann wrote:
> On Wed, Jul 19, 2017 at 08:40:27PM +0900, Kengo NAKAHARA wrote:
>> I think the reason is mvcesa_attach() is called before ipi_sysinit().
>> Could you try below tentative patch?
> 
> Thanks for the quick workaround, it boots now!

I re-implement fix patch. Could you revert previous workaround and try
the following patch?

diff --git a/sys/opencrypto/crypto.c b/sys/opencrypto/crypto.c
index 18103269408..2c896c5f512 100644
--- a/sys/opencrypto/crypto.c
+++ b/sys/opencrypto/crypto.c
@@ -68,6 +68,7 @@ __KERNEL_RCSID(0, "$NetBSD: crypto.c,v 1.92 2017/07/18 
06:01:36 knakahara Exp $"
 #include 
 #include 
 #include 
+#include 
 
 #if defined(_KERNEL_OPT)
 #include "opt_ocf.h"
@@ -382,6 +383,8 @@ static void crypto_driver_lock(struct cryptocap *);
 static void crypto_driver_unlock(struct cryptocap *);
 static void crypto_driver_clear(struct cryptocap *);
 
+static int crypto_init_finalize(device_t);
+
 static struct cryptostats cryptostats;
 #ifdef CRYPTO_TIMING
 static int crypto_timing = 0;
@@ -417,10 +420,13 @@ crypto_init0(void)
return crypto_destroy(false);
}
 
-   crypto_ret_si = 
softint_establish(SOFTINT_NET|SOFTINT_MPSAFE|SOFTINT_RCPU,
-   _softint, NULL);
-   if (crypto_ret_si == NULL) {
-   printf("crypto_init: cannot establish ret queue handler\n");
+   /*
+* Some encryption devicees(such as mvcesa) is attached before
+* ipi_sysinit(). That causes assertion in ipi_register() as
+* crypto_ret_si softint uses SOFTINT_RCPU.
+*/
+   if (config_finalize_register(NULL, crypto_init_finalize) != 0) {
+   printf("crypto_init: cannot register crypto_init_finalize\n");
return crypto_destroy(false);
}
 
@@ -429,6 +435,17 @@ crypto_init0(void)
return 0;
 }
 
+static int
+crypto_init_finalize(device_t self __unused)
+{
+
+   crypto_ret_si = 
softint_establish(SOFTINT_NET|SOFTINT_MPSAFE|SOFTINT_RCPU,
+   _softint, NULL);
+   KASSERT(crypto_ret_si != NULL);
+
+   return 0;
+}
+
 int
 crypto_init(void)
 {


If no issue occurs, I will commit this patch.

Does anyone have a comment?


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: Crash in crypto_init

2017-07-19 Thread Kengo NAKAHARA

Hi,

Sorry, it's my bug.

On 2017/07/19 20:11, Martin Husemann wrote:
> Trying to boot  -current on a uniprocessor arm soc results in:
> 
> gtidmac0: XOR Engine 4 channels, intr 5, 6, 7, 8
> gttwsi0 at mvsoc0 unit 0 offset 0x11000-0x110ff irq 29: Marvell TWSI 
> controller
> iic0 at gttwsi0: I2C bus
> mvcesa0 at mvsoc0 unit 0 offset 0x3d000-0x3dfff irq 22: Marvell Cryptographic 
> Engines and Security Accelerator
> panic: kernel diagnostic assertion "i != IPI_SYNCH_ID" failed: file 
> "../../../../kern/subr_ipi.c", line 138 
> Stopped in pid 0.1 (system) at  netbsd:cpu_Debugger+0x4:bx  r14
> db> bt
> 0xc0587b44: netbsd:vpanic+0x10
> 0xc0587b5c: netbsd:kern_assert+0x40
> 0xc0587b94: netbsd:ipi_register+0xac
> 0xc0587bd4: netbsd:softint_establish+0x108
> 0xc0587c4c: netbsd:crypto_init0+0x110
> 0xc0587c6c: netbsd:_run_once+0x80
> 0xc0587c9c: netbsd:crypto_get_driverid+0x16c
> 0xc0587cd4: netbsd:mvcesa_attach+0x7c

I think the reason is mvcesa_attach() is called before ipi_sysinit().
Could you try below tentative patch?

diff --git a/sys/kern/init_main.c b/sys/kern/init_main.c
index 63e115ae4b3..30a1e2f54ee 100644
--- a/sys/kern/init_main.c
+++ b/sys/kern/init_main.c
@@ -524,6 +524,9 @@ main(void)
/* Enable deferred processing of RNG samples */
rnd_init_softint();
 
+   extern void crypto_init_later(void);
+   crypto_init_later();
+
 #ifdef RND_PRINTF
/* Enable periodic injection of console output into entropy pool */
kprintf_init_callout();
diff --git a/sys/opencrypto/crypto.c b/sys/opencrypto/crypto.c
index 18103269408..401e5442089 100644
--- a/sys/opencrypto/crypto.c
+++ b/sys/opencrypto/crypto.c
@@ -417,18 +417,30 @@ crypto_init0(void)
return crypto_destroy(false);
}
 
+#if 0
crypto_ret_si = 
softint_establish(SOFTINT_NET|SOFTINT_MPSAFE|SOFTINT_RCPU,
_softint, NULL);
if (crypto_ret_si == NULL) {
printf("crypto_init: cannot establish ret queue handler\n");
return crypto_destroy(false);
}
+#endif
 
sysctl_opencrypto_setup(_opencrypto_clog);
 
return 0;
 }
 
+void crypto_init_later(void);
+void
+crypto_init_later(void)
+{
+
+   crypto_ret_si = 
softint_establish(SOFTINT_NET|SOFTINT_MPSAFE|SOFTINT_RCPU,
+   _softint, NULL);
+   KASSERT (crypto_ret_si != NULL);
+}
+
 int
 crypto_init(void)
 {


Yes, it is ugly... Please give me some more time to implement formal patch.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: make cryptoret() of opencrypto softint

2017-07-18 Thread Kengo NAKAHARA

Hi,

On 2017/07/06 18:37, Kengo NAKAHARA wrote:
> Currently, cryptoret() which dequeues from crp_ret_q and calls crp's
> callback function is done in kthread context.
> https://nxr.netbsd.org/xref/src/sys/opencrypto/crypto.c#416
> 
> In contrast, crypto_done() which enqueues to crp_ret_q is done in
> interrupt context or softint context when opencrypto(9) is used by IPsec.
> 
> This difference of context between crypto_done() and cryptoret() makes
> the length of crp_ret_q very long when IPsec puts high load. As a result,
> it causes very long latency. In worst case, the latency is a few seconds.
> In May 24, I committed crypto_ret_q limitation. This limitation can avoid
> such a long latency, however it drops many packets. Hmm, it is annoying
> trade-off...
> 
> I think the context mismatch between crypto_done()(enqueue side) and
> cryptoret()(dequeue side) causes this problem. So, I think cryptoret()
> should be done in softint context. As the callback functions both IPsec
> and /dev/crypto do not sleep, it seems there is no problem to make
> cryptoret() softint. Here is the patch
> http://www.netbsd.org/~knakahara/cryptoret-softint/cryptoret-softint.patch
> 
> Could you comment my concept or this patch?
> By the way, does anyone knows why cryptoret() must be done in kthread
> context?

As there is no objection for about two weeks, I commit above patch with
a little improvement. Although I tested with ATF net/ and crypto/, if
you encounter a problem, please apply below reverse patch.

http://www.netbsd.org/~knakahara/cryptoret-softint/cryptoret-softint-reverse.patch


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

RFC: make cryptoret() of opencrypto softint

2017-07-06 Thread Kengo NAKAHARA

Hi,

Currently, cryptoret() which dequeues from crp_ret_q and calls crp's
callback function is done in kthread context.
https://nxr.netbsd.org/xref/src/sys/opencrypto/crypto.c#416

In contrast, crypto_done() which enqueues to crp_ret_q is done in
interrupt context or softint context when opencrypto(9) is used by IPsec.

This difference of context between crypto_done() and cryptoret() makes
the length of crp_ret_q very long when IPsec puts high load. As a result,
it causes very long latency. In worst case, the latency is a few seconds.
In May 24, I committed crypto_ret_q limitation. This limitation can avoid
such a long latency, however it drops many packets. Hmm, it is annoying
trade-off...

I think the context mismatch between crypto_done()(enqueue side) and
cryptoret()(dequeue side) causes this problem. So, I think cryptoret()
should be done in softint context. As the callback functions both IPsec
and /dev/crypto do not sleep, it seems there is no problem to make
cryptoret() softint. Here is the patch
http://www.netbsd.org/~knakahara/cryptoret-softint/cryptoret-softint.patch

Could you comment my concept or this patch?
By the way, does anyone knows why cryptoret() must be done in kthread
context?


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: CVS commit: src/sys/dev/scsipi

2017-06-20 Thread Kengo NAKAHARA

Hi,

On 2017/06/19 23:16, Michael van Elst wrote:
> k-nakah...@iij.ad.jp (Kengo NAKAHARA) writes:
>> Today, I found my environment panic while rebooting. As bisecting, it
>> seems the cause is below commit.
> 
> Does the following patch help ?
> 
> Index: scsipi_base.c
> ===
> RCS file: /cvsroot/src/sys/dev/scsipi/scsipi_base.c,v
> retrieving revision 1.176
> diff -p -u -r1.176 scsipi_base.c
> --- scsipi_base.c 17 Jun 2017 22:35:50 -  1.176
> +++ scsipi_base.c 19 Jun 2017 14:14:58 -
> @@ -2403,7 +2403,7 @@ scsipi_target_detach(struct scsipi_chann
>   device_t tdev;
>   int ctarget, mintarget, maxtarget;
>   int clun, minlun, maxlun;
> - int error;
> + int error = 0;
>  
>   if (target == -1) {
>   mintarget = 0;
> @@ -2453,7 +2453,7 @@ scsipi_target_detach(struct scsipi_chann
>  out:
>   KERNEL_UNLOCK_ONE(curlwp);
>  
> - return 0;
> + return error;
>  }
>  
>  /*

This patch fix my problem. Thank you!


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: CVS commit: src/sys/dev/scsipi

2017-06-18 Thread Kengo NAKAHARA

Hi,

Today, I found my environment panic while rebooting. As bisecting, it
seems the cause is below commit.

On 2017/06/18 7:35, Michael van Elst wrote:
> Module Name:  src
> Committed By: mlelstv
> Date: Sat Jun 17 22:35:50 UTC 2017
> 
> Modified Files:
>   src/sys/dev/scsipi: atapiconf.c cd.c scsi_base.c scsiconf.c
>   scsipi_base.c sd.c ss.c st.c
> 
> Log Message:
> The atapibus detach path did hold the channel mutex while calling into 
> autoconf,
> which would trigger a panic when unplugging a USB ATAPI CDROM.
> 
> Align detach code for scsibus and atapibus to fix this.
> 
> Also avoid races when detaching devices by replacing callout_stop with
> callout_halt.
> 
> 
> To generate a diff of this commit:
> cvs rdiff -u -r1.90 -r1.91 src/sys/dev/scsipi/atapiconf.c
> cvs rdiff -u -r1.340 -r1.341 src/sys/dev/scsipi/cd.c
> cvs rdiff -u -r1.91 -r1.92 src/sys/dev/scsipi/scsi_base.c
> cvs rdiff -u -r1.279 -r1.280 src/sys/dev/scsipi/scsiconf.c
> cvs rdiff -u -r1.175 -r1.176 src/sys/dev/scsipi/scsipi_base.c
> cvs rdiff -u -r1.324 -r1.325 src/sys/dev/scsipi/sd.c
> cvs rdiff -u -r1.88 -r1.89 src/sys/dev/scsipi/ss.c
> cvs rdiff -u -r1.230 -r1.231 src/sys/dev/scsipi/st.c
> 
> Please note that diffs are not public domain; they are subject to the
> copyright notices on the relevant files.

The panic message is here.

// snip
Jun 19 22:48:28 syncing disks... done
config_detach: detached device scsibus0 has children sd0
Skipping crash dump on recursive panic
panic: config_detach
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip 0x802249f5 cs 0x8 rflags 0x246 cr2 
0x70c8884aa38f ilevel 0 rsp 0xe4011099cc80
curlwp 0xe4027de12140 pid 632.1 lowest kstack 0xe401109992c0
Stopped in pid 632.1 (reboot) atnetbsd:breakpoint+0x5:  leave
db{2}> bt
breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x140
snprintf() at netbsd:snprintf
config_detach() at netbsd:config_detach+0x218
config_detach_all() at netbsd:config_detach_all+0x97
cpu_reboot() at netbsd:cpu_reboot+0x173
sys_reboot() at netbsd:sys_reboot+0x75
syscall() at netbsd:syscall+0x1d8
--- syscall (number 208) ---
70c88843e5ba:

The addr2line of config_detach()+0x218 is here.
https://nxr.netbsd.org/xref/src/sys/kern/subr_autoconf.c#1728

My environment is amd64 which use sd0(USB Flash) as root filesystem.

Could someone have any solution?


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: localcount_hadref() or localcount_trydarin()

2017-06-12 Thread Kengo NAKAHARA

Hi,

On 2017/06/12 21:51, Taylor R Campbell wrote:
>> Date: Mon, 12 Jun 2017 10:53:52 +0900
>> From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
>>
>> I want to avoid detaching the encryption device while it is used by IPsec.
>> That is, once someone creates Security Assocatation(SA) to call
>> crypto_newsession(), the encryption device related the SA must not be
>> detached until the SA is flushed(done crypto_freesession()) and the SA
>> is not used(done crypto_dispatch() and cryptointr()).
> 
> Why don't you just use a global reference count first?  Is the latency
> and scalability of crypto_newsession and crypto_freesession critical?
> 
> I am not familiar with opencrypto -- maybe the answer is yes.  But in
> crypto_newsession, crypto_freesession, and cryptointr, you're still
> acquiring a global lock.  That defeats the purpose of using
> localcount, which is to make the latency and scalability of
> acquire/release no more than a CPU-local integer increment.

Exactly. At first, I tried to remove global lock in a single step using
localcount(9) and other tools, however I could not completely fix bugs.
So, I changed my policy. I restructure locks and define lock-orders at
first, and then I will remove global locks.
# This lock restructure also has the purpose to fix packet reordering.
# Yes, there are many tasks about opencrypto...

That is also the reason why I suspend implementation the patches shown
in my previous mail. Sorry to lack of talk again.

I will read your next mail, but I think it take time for me to read it
and consider the implementation. I reply this answer as soon as possible.

Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: localcount_hadref() or localcount_trydarin()

2017-06-11 Thread Kengo NAKAHARA

Hi,

On 2017/06/10 1:20, Taylor R Campbell wrote:
>> Date: Fri, 9 Jun 2017 16:50:18 +0900
>> From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
>>
>> I tried using localcount(9) for opencyrpto(9) to avoid detach encryption
>> drivers. While I implement that, I want to something to use KASSERT which
>> ensure someone have reference or not.
>> In addition, localcount_drain() waits forever if someone has had reference.
>> So, I want to function which returns in not so long time regardless of
>> someone has reference or not. In other words, I want some function like
>> mutex_tryenter() for mutex_enter().
> 
> This sounds questionable to me.  If there is a reference remaining
> that does not drain in a bounded time, that sounds like a bug that
> should be addressed some other way.  Can you explain how you want to
> use this?  Can you share preliminary diffs to opencrypto code with
> localcount that motivate this?

At first, I mention conclusion shortly,
- I want to avoid detaching encryption device when it is used by IPsec
- I implement naive design of that with localcount(9)
  - The patches show in the last
- After that, the detaching process(drvctl(8)) waits infinity

The details are below.

I want to avoid detaching the encryption device while it is used by IPsec.
That is, once someone creates Security Assocatation(SA) to call
crypto_newsession(), the encryption device related the SA must not be
detached until the SA is flushed(done crypto_freesession()) and the SA
is not used(done crypto_dispatch() and cryptointr()).

Thus, I can eliminate "migration job" in opencrypto. The migration job
runs when encryption device is detached while encryption job is
processing, that is, below two codes.
   [1] https://nxr.netbsd.org/xref/src/sys/opencrypto/crypto.c#1620
   [2] https://nxr.netbsd.org/xref/src/sys/opencrypto/crypto.c#1300
The [2] codes(crypto_invoke()) is fast path, so I want to avoid calling
crypto_newsession() which may sleep. Furthermore, the migration job can
disturb crp_q order.

I implement prototype which design is below.
- acquire localcount(9) in crypto_newsession()
- release localcount(9) in crypto_freesession()
- acquire and release localcount(9) in crypto_dispatch() and cryptointr()
- drain localcount(9) in crypto_unregister()
Hmm, that force drvctl(8) which detach the encryption device wait infinity
until the SA is flushed. And the drvctl(8) cannot be killed. It is not good.
I think localcount_trydrain() is required to avoid the drvctl(8) infinity
waiting simply.

Here is the working patch series. I'm sorry to say the patches is old,
so they are applied to crypto.c:r1.78 and cryptodev.h:r1.34. Furthermore,
there is still some bugs
- 
http://www.netbsd.org/~knakahara/opencrypto-localcount/0001-introduce-localcount.patch
- 
http://www.netbsd.org/~knakahara/opencrypto-localcount/0002-remove-migrate-process.patch
- 
http://www.netbsd.org/~knakahara/opencrypto-localcount/0003-avoid-localcount-acquire-after-drain.patch

Could you comment this design or patches? 
Hmm, I begin to feel this usage is not fit to localcount(9)


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: localcount_hadref() or localcount_trydarin()

2017-06-09 Thread Kengo NAKAHARA

Hi,

On 2017/06/09 17:18, Paul Goyette wrote:
> On Fri, 9 Jun 2017, Kengo NAKAHARA wrote:
>> I tried using localcount(9) for opencyrpto(9) to avoid detach encryption
>> drivers. While I implement that, I want to something to use KASSERT which
>> ensure someone have reference or not.
>> In addition, localcount_drain() waits forever if someone has had reference.
>> So, I want to function which returns in not so long time regardless of
>> someone has reference or not. In other words, I want some function like
>> mutex_tryenter() for mutex_enter().
// snip
> I think you are trying to abuse the design of localcount(9)   :)
> 
> The problem here is, whether or not you return immediately, the
> localcount has been marked for draining, and nothing else will ever
> be able to acquire it.  You've already set the totalp pointer to a
> non-NULL value, so any further attempts to call localcount_drain()
> or localcount_hadref() will fail in the KASSERT.
> 
> So, let's assume that you call localcount_hadref(); if it returns
> true you immediately call localcount_fini() (since it has already
> been drained).  But what happens if it returns false?  You can
> never go back and check the "total" again, so you will never know
> when it is safe to call localcount_fini().
> 
> 
>> In the other design, rename the function name to "localcount_trydrain()"
>> and invert return value true/false.
> 
> This has the same problem.  You WILL always drain the localcount, 
> whether or not you return quickly.  But if the localcount is not
> already drained, you will never be able to check again.
> 
> Furthermore, the totalp pointer points to a value _on_the_stack_
> and the pointer is stored in the localcount itself.  So, if you
> allow the routine to return "early", there might still be other
> places that hold references to this localcount, and they will
> eventually call localcount_release().  When that happens, they
> will try to decrement the total value, and this may corrupt the
> stack of some other routine!
> 
> 
> As I wrote in the man-page, draining the localcount is a one-shot
> operation - once you start the process of draining (by setting the
> totalp and issuing the xcall) you cannot stop the process, nor can
> you restart the process (to get an updated count).  Additionally,
> once you start the process of draining, you _must_ wait for it to
> complete;

Thank you for your detailed comments!

I read man again carefully..., yes, to the last. Hmm... I see.

I rethink opencrypto(9) design to use not only localcount(9)
but also any other tools to avoid waiting.
Anyway, thank you for your expositions.

> you cannot escape.
> 
> :)

Oh, my resistance is futile


Thanks,

-- 
//////
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

RFC: localcount_hadref() or localcount_trydarin()

2017-06-09 Thread Kengo NAKAHARA

Hi,

I tried using localcount(9) for opencyrpto(9) to avoid detach encryption
drivers. While I implement that, I want to something to use KASSERT which
ensure someone have reference or not.
In addition, localcount_drain() waits forever if someone has had reference.
So, I want to function which returns in not so long time regardless of
someone has reference or not. In other words, I want some function like
mutex_tryenter() for mutex_enter().

My naive implementation of that is here.

bool
localcount_hadref(struct localcount *lc, kmutex_t *interlock)
{
int64_t total = 0;

KASSERT(mutex_owned(interlock));
KASSERT(lc->lc_totalp == NULL);

 /* Mark it draining.  */
lc->lc_totalp = 

mutex_exit(interlock);
xc_wait(xc_broadcast(0, _xc, lc, interlock));
mutex_enter(interlock);

/* Paranoia: Cause any further use of lc->lc_totalp to crash.  */
lc->lc_totalp = (void *)(uintptr_t)1;

KASSERT(total >= NULL);
if (total == 0)
return false;
else
return true;
}


In the other design, rename the function name to "localcount_trydrain()"
and invert return value true/false.

Could you comment this function? Any comments are welcome.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: vlan(4) mpsafe

2017-06-06 Thread Kengo NAKAHARA

Hi,

On 2017/06/05 17:43, Kengo NAKAHARA wrote:
> On 2017/05/31 19:19, Shoichi Yamaguchi wrote:
>> Thank you for comments.
>>
>> I modify and rebase the patch.
>> Here is the updated patch:
>>  
>> https://gist.githubusercontent.com/s-ymgch228/02bb360cfd4e95eb1f64a66c28b30ab2/raw/1b9c8408da88a205ee1b09482125dcf49d5a762e/vlan_mpsafe.patch
> // snip
> 
> LGTM.
> If there is no objection, I want to merge this patch within a few days.
> 
> Could you comment this patch?

I commit it as if_vlan.c:r1.98 and if_vlanvar.h:r1.10. I will send
pull up request to netbsd-8 branch a few days for a week later.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: swcrypto is initialized twice

2017-06-04 Thread Kengo NAKAHARA

Hi,

On 2017/06/03 12:17, Michael van Elst wrote:
> k-nakah...@iij.ad.jp (Kengo NAKAHARA) writes:
> 
>> Currently(after cryptosoft.c:r1.44), software encryption driver
>> (swcrypto0) is initialized twice, that is, swcr_init() is called
>> below two call paths.
>>(1) swcrypto_attach()
>><= called from module initialization
>>(2) swcryptoattach()
>><= called from autoconf(9) initialization
> 
>> Hmm, compare with pseudo interfaces like gif(4), It seems the
>> swcryptoattach() should do nothing.
> 
> 
> In this case, yes.
> 
> Modules have their own module info data structures compiled by the
> linker, for a builtin module their modcmd function is called in
> module_init_class().
> 
> pseudo devices have no bus attachment, their attach routines are
> compiled by config(1) and called in config_finalize() after all
> builtin modules have been initialized.
> 
> A run-time loaded module gets its modcmd function called much
> later in module_load().
// snip

Thank you for detailed explanation!
I will be careful about that, when I modify a similar code.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: swcrypto is initialized twice

2017-06-01 Thread Kengo NAKAHARA

Hi,

On 2017/06/01 17:37, Paul Goyette wrote:
> On Thu, 1 Jun 2017, Kengo NAKAHARA wrote:
>> Currently(after cryptosoft.c:r1.44), software encryption driver
>> (swcrypto0) is initialized twice, that is, swcr_init() is called
>> below two call paths.
>>(1) swcrypto_attach()
>><= called from module initialization
>>(2) swcryptoattach()
>><= called from autoconf(9) initialization
>>
>> This is wrong as it leaks struct cryptocap in crypto_drivers[].
>>
>> Hmm, compare with pseudo interfaces like gif(4), It seems the
>> swcryptoattach() should do nothing. So, I think bellow patch
>> is required.
>> 
>> --- a/sys/opencrypto/cryptosoft.c
>> +++ b/sys/opencrypto/cryptosoft.c
>> @@ -1325,8 +1325,10 @@ swcr_init(void)
>> void
>> swcryptoattach(int num)
>> {
>> -
>> -   swcr_init();
>> +   /*
>> +* Nothing to do here, initialization is handled by the
>> +* module initialization code in swcrypto_attach() below).
>> +*/
>> }
>>
>> void   swcrypto_attach(device_t, device_t, void *);
>> 
>>
>> Could you comment this patch?
>> If there is no objection, I will commit this patch within a few days.
> 
> Yes, I believe your analysis and patch are both correct.

Thank you for your reviewing! I commit the patch as cryptosoft.c:r1.51.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

swcrypto is initialized twice

2017-06-01 Thread Kengo NAKAHARA

Hi,

Currently(after cryptosoft.c:r1.44), software encryption driver
(swcrypto0) is initialized twice, that is, swcr_init() is called
below two call paths.
(1) swcrypto_attach()
<= called from module initialization
(2) swcryptoattach()
<= called from autoconf(9) initialization

This is wrong as it leaks struct cryptocap in crypto_drivers[].

Hmm, compare with pseudo interfaces like gif(4), It seems the
swcryptoattach() should do nothing. So, I think bellow patch
is required.

--- a/sys/opencrypto/cryptosoft.c
+++ b/sys/opencrypto/cryptosoft.c
@@ -1325,8 +1325,10 @@ swcr_init(void)
 void
 swcryptoattach(int num)
 {
-
-   swcr_init();
+   /*
+* Nothing to do here, initialization is handled by the
+* module initialization code in swcrypto_attach() below).
+*/
 }
 
 void   swcrypto_attach(device_t, device_t, void *);


Could you comment this patch?
If there is no objection, I will commit this patch within a few days.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: Introducing localcount(9)

2017-05-12 Thread Kengo NAKAHARA

Hi,

On 2017/05/12 7:04, Paul Goyette wrote:
> On Thu, 11 May 2017, Taylor R Campbell wrote:
> 
>>>>(1) Why splsoftserial() is required instead of kpreempt_disable()?
>>>>localcount_drain() uses xc_broadcast(0, ...), that is, it uses
>>>>low priority xcall. Low priority xcall would be done by kthread
>>>>context, so I think kpreempt_disable() would be sufficient to
>>>>prevent localcount_drain() xcall running.
>>>
>>> I think you are correct.  Taylor, do you agree?
>>
>> Yes, I think this is fine.  I probably chose splsoftserial because I
>> was thinking of pserialize(9).
> 
> Committed.

Thank you for your commit. I believe this modification make it
more efficient. :)


>>>>(2) Why both "localcount_adjust(lc, -1)" and "*lc->lc_totalp -= 1" are
>>>>necessary in "lc->lc_totalp != NULL" case?
>>>>I am afraid of below situation
>>>>[1] CPU#A: mutex_enter(interlock)
>>>>[2] CPU#A: call localcount_drain()
>>>>[3] CPU#A: mutex_exit(interlock) before xc_broadcast
>>>>[4] CPU#B: call localcount_release()
>>>>[5] CPU#B: mutex_enter(interlock)
>>>>[6] CPU#B: localcount_adjust(lc, -1)
>>>>[7] CPU#B: *lc->lc_totalp -= 1
>>>>[8] CPU#B: done localcount_release()
>>>>[9] CPU#A: call xc_broadcast()
>>>>[10]CPU#B: begin localcount_xc() and done
>>>>[11]CPU#A: resume localcount_drain()
>>>>[12]CPU#A: mutex_enter(interlock)
>>>>Can "lc->lc_totalp" be -1 at this point?
>>>
>>> localcount_adjust() updates the local CPU's reference counter, while
>>> *lc->lc_totalp is the global sum that we are accumulating.  We want to
>>> keep them synchronized.
>>
>> What knakahara is worried about -- rightly, I think -- is the
>> difference between the following two scenarios:
>>
>> 1. CPU A: localcount_drain  (total=0, a=0, b=1)
>>   CPU B: localcount_xc (total=1, a=0, b=1)
>>   CPU B: localcount_release(total=0, a=0, b=0)
>>
>> 2. CPU A: localcount_drain  (total=0, a=0, b=1)
>>   CPU B: localcount_release(total=-1, a=0, b=0)
>>   CPU B: localcount_xc (total=-1, a=0, b=0)
>>
>> If localcount_xc has already run on the local CPU, then we do want to
>> keep them in sync in localcount_release, which is why I wrote the
>> logic I did.  But if it has not yet run, we don't want to double-count
>> the release in the global and the local counts, which is what
>> knakahara observed.
> 
> Ah, yes, got it!

I am sorry for lack of words...
Thank you for more clearly explanation, riastradh@n.o!


>> The problem is that in localcount_release, we don't know whether the
>> total includes the local count yet, and thus whether when subtracting
>> one from the local count we also need to subtract one from the total
>> count.
>>
>> One possible way to remedy this -- haven't thought this through yet,
>> obviously needs more attention -- is to change localcount_xc to
>> *transfer* rather than add the local references,
>>
>>mutex_enter(interlock);
>>localp = percpu_getref(lc->lc_percpu);
>>*lc->lc_totalp += *localp;
>>*localp -= *localp;   /* i.e., *localp = 0 */
>>percpu_putref(lc->lc_percpu);
>>mutex_exit(interlock);
>>
>> and to change localcount_release to decrement only the total count
>> when it is available:
>>
>>mutex_enter(interlock);
>>if (--*lc->lc_totalp == 0)
>>cv_broadcast(cv);
>>mutex_exit(interlock);
>>
>> Then the scenarios become:
>>
>> 1'. CPU A: localcount_drain (total=0, a=0, b=1)
>>CPU B: localcount_xc(total=1, a=0, b=0)
>>CPU B: localcount_release   (total=0, a=0, b=0)
>>
>> 2'. CPU A: localcount_drain (total=0, a=0, b=1)
>>CPU B: localcount_release   (total=-1, a=0, b=1)
>>CPU B: localcount_xc(total=0, a=0, b=0)
> 
> Makes sense.  I've made appropriate changes, and will rebuild and 
> re-test.

LGTM, too. Thank you for your fixing.


Thanks,
 
-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: Introducing localcount(9)

2017-05-11 Thread Kengo NAKAHARA

Hi,

On 2017/05/06 14:12, Paul Goyette wrote:
> If you want to look at the implementation of localcount(9), you can see
> the code in the CVS repository [5].  Files sys/kern/subr_localcount.c
> and sys/sys/localcount.h contain the localcount functionality;  files
> sys/kern/subr_autoconf.c and sys/kern/subr_devsw.c have been updated to
> use the localcount functionality;  and many device drivers throughout
> the system have been updated to provide improved reference counting.
> (Don't forget to look for any newly created files in the Attic!)
> 
> Unless there are substantial objections, I'd like to commit localcount
> in a couple of weeks.  Constructive comments and code review welcomed!

Excellent! I expect the code is committed.
I think opencrypto(9) would use localcount(9) to prevent detaching
encrypt drivers while they are still used.

I have two questions about localcount_release() in localcount.c.

(1) Why splsoftserial() is required instead of kpreempt_disable()?
localcount_drain() uses xc_broadcast(0, ...), that is, it uses
low priority xcall. Low priority xcall would be done by kthread
context, so I think kpreempt_disable() would be sufficient to
prevent localcount_drain() xcall running.

(2) Why both "localcount_adjust(lc, -1)" and "*lc->lc_totalp -= 1" are
necessary in "lc->lc_totalp != NULL" case?
I am afraid of below situation
[1] CPU#A: mutex_enter(interlock)
[2] CPU#A: call localcount_drain()
[3] CPU#A: mutex_exit(interlock) before xc_broadcast
[4] CPU#B: call localcount_release()
[5] CPU#B: mutex_enter(interlock)
[6] CPU#B: localcount_adjust(lc, -1)
[7] CPU#B: *lc->lc_totalp -= 1
[8] CPU#B: done localcount_release()
[9] CPU#A: call xc_broadcast()
[10]CPU#B: begin localcount_xc() and done
[11]CPU#A: resume localcount_drain()
[12]CPU#A: mutex_enter(interlock)
Can "lc->lc_totalp" be -1 at this point?

Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: if_wm panics on boot

2017-03-03 Thread Kengo NAKAHARA

Hi,

On 2017/03/04 4:31, Tomohiro Kusumi wrote:
> Hi
> 
> Now I'm at
> $NetBSD: if_wm.c,v 1.495 2017/03/03 07:57:49
> but still hits the same assertion located at L6622.

Ahh, sorry, the newest revision is not r1.495 but r1.496...
r1.495 is not fixed yet.

The ident of r1.496 is
===
+/* $NetBSD: if_wm.c,v 1.496 2017/03/03 16:48:55 knakahara Exp $*/
===
and commit log is 
===
fix r1.492 bug, sorry
===


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: if_wm panics on boot

2017-03-03 Thread Kengo NAKAHARA

Hi,

On 2017/03/04 0:29, Tomohiro Kusumi wrote:
> Yes, the source says 1.492 which is very recent.
> I guess I'll just revert to the one that was (safely)working for now.
> 
> 1.492 2017/03/03 03:33:44 knakahara

Sorry, it is my mistake as you point out. I fix it r1.495 just now.

Could you try the newest wm?


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: L2TPv3 interface

2017-02-16 Thread Kengo NAKAHARA

Hi,

On 2017/02/07 14:01, Kengo NAKAHARA wrote:
> On 2017/01/20 21:26, Kengo NAKAHARA wrote:
>> At first, here is updated patches.
>>
>> http://netbsd.org/~knakahara/if-l2tp-2/01-accept-ifname-include-digit.patch
>>http://netbsd.org/~knakahara/if-l2tp-2/02-if-l2tp.patch
> 
> I rebase and add some fixes. Here is updated patch set and unified patch.
> - patch set
>   http://netbsd.org/~knakahara/if-l2tp-3/if-l2tp-3.tgz
> - unified patch
>   http://netbsd.org/~knakahara/if-l2tp-3/if-l2tp-3.patch
> 
> Could you comment this patch set? If there is no objection, I will commit
> this patch set after a few days or weeks.

I committed it.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: L2TPv3 interface

2017-02-06 Thread Kengo NAKAHARA

Hi,

On 2017/01/20 21:26, Kengo NAKAHARA wrote:
> At first, here is updated patches.
>http://netbsd.org/~knakahara/if-l2tp-2/01-accept-ifname-include-digit.patch
>http://netbsd.org/~knakahara/if-l2tp-2/02-if-l2tp.patch

I rebase and add some fixes. Here is updated patch set and unified patch.
- patch set
  http://netbsd.org/~knakahara/if-l2tp-3/if-l2tp-3.tgz
- unified patch
  http://netbsd.org/~knakahara/if-l2tp-3/if-l2tp-3.patch

Could you comment this patch set? If there is no objection, I will commit
this patch set after a few days or weeks.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: L2TPv3 interface

2017-01-22 Thread Kengo NAKAHARA

Hi,


On 2017/01/20 21:26, Kengo NAKAHARA wrote:
> On 2017/01/20 0:38, Taylor R Campbell wrote:
>>Date: Thu, 19 Jan 2017 17:58:17 +0900
>>    From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
>>+/*
>>+ * l2tp_variant update API.
>>+ *
>>+ * Assumption:
>>+ * reader side dereferences sc->l2tp_var in reader critical section only,
>>+ * that is, all of reader sides do not reader the sc->l2tp_var after
>>+ * pserialize_perform().
>>+ */
>>+static void
>>+l2tp_variant_update(struct l2tp_softc *sc, struct l2tp_variant *nvar)
>>+{
>>+   struct ifnet *ifp = >l2tp_ec.ec_if;
>>+   struct l2tp_variant *ovar = sc->l2tp_var;
>>+
>>+   KASSERT(mutex_owned(>l2tp_lock));
>>+
>>+   membar_producer();
>>+   atomic_swap_ptr(>l2tp_var, nvar);
>>+   pserialize_perform(l2tp_psz);
>>+   psref_target_destroy(>lv_psref, lv_psref_class);
>>
>> No need for atomic_swap_ptr.  Just
>>
>> sc->l2tp_var = nvar;
>>
>> is enough.  Nobody else can write to it because we hold the lock.
> 
> Between writer and writer, it is correct. However, between writer and
> reader, I think atomic_swap_ptr is required to prevent reader's load
> before writer's store done. Is this correct?

Sorry, I was wrong. Reader has nothing to do with atomic_ops. So,
As you said, atomic_swap_ptr is not required here. I will fix it.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: Are modunload and ifconfig create racy?

2017-01-22 Thread Kengo NAKAHARA

Hi,

On 2017/01/21 6:47, Paul Goyette wrote:
> (Replying _without_ looking at the code!  I can look more deeply if you 
> could re-send a pointer to the code...)

In the example of gre(4) interface, if CPU#0 do "modunload if_gre" and
CPU#1 do "ifconfig gre0 create", they seems to have below race as far as
if_gre.c.
+ CPU#1
  - call gre_clone_create()
- before done gre_count increment, that is before line 369
  https://nxr.netbsd.org/xref/src/sys/net/if_gre.c#327
+ CPU#0
  - call gredetach()
- done gre_count check, that is, reached line 1464
  https://nxr.netbsd.org/xref/src/sys/net/if_gre.c#1464
+ CPU#1
  - continue gre_clone_create()
- there is no checking code if gredetach() is being called or not
  in gre_clone_create(), so continue
At this point, CPU#0 continue to modunload processing while CPU#1 continue
to ifconfig create. What will happen? I think fault would arise and panic
in CPU#1 at some point.


> Generally, the module's unload code should detach the devsw entry for 
> the driver first, to prevent any new accesses.
> 
> Then, the config_fini_component() call will fail if existing devices 
> cannot be detached.  If this does fail, the module unload code should 
> re-attach the devsw entry and return failure.

The pseudo interface modules' load/unload code is generated by IF_MODULE
macro
https://nxr.netbsd.org/xref/src/sys/net/if_module.h#54

The unload code seems to call each inteface's detach code(such as gredetach()),
and then, the code calls config_cfdriver_detach()(not config_fini_component()).
Hmm, isn't this general way?


> The only chance I see for any race condition is if the open/create logic 
> can clone a new entry which is not seen by config_cfdata_detach() when 
> it scans for device instances which need to be detached.
> 
> If there is a race condition, it should be solved in each interface's 
> driver module, or within the autoconfig framework.  Each driver module 
> is responsible for managing its own integration with autoconfig (ie, 
> calling config_{init,fini}_component and devsw_{att},det}ach routines).

I think the race condition should solved in IF_MODULE macro. However,
I don't have an appropriate solution. It may be required to modify
if_clone_create()...


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: L2TPv3 interface

2017-01-20 Thread Kengo NAKAHARA

Hi riastradh@n.o,

Thank you for your detailed review!


At first, here is updated patches.
   http://netbsd.org/~knakahara/if-l2tp-2/01-accept-ifname-include-digit.patch
   http://netbsd.org/~knakahara/if-l2tp-2/02-if-l2tp.patch

And then, each response is below.

On 2017/01/20 0:38, Taylor R Campbell wrote:
>Date: Thu, 19 Jan 2017 17:58:17 +0900
>From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
> A few little comments:
> 
>diff --git a/sys/net/if.c b/sys/net/if.c
>index 2386af3..ba63266 100644
>--- a/sys/net/if.c
>+++ b/sys/net/if.c
>@@ -1599,7 +1613,7 @@ if_clone_lookup(const char *name, int *unitp)
>strcpy(ifname, "if_");
>/* separate interface name from unit */
>for (dp = ifname + 3, cp = name; cp - name < IFNAMSIZ &&
>-   *cp && (*cp < '0' || *cp > '9');)
>+   *cp && !if_is_unit(cp);)
>*dp++ = *cp++;
> 
> This changes the generic syntax interface names, perhaps to allow the
> `2' in `l2tp', although since this loop skips over the first three
> octets that doesn't seem to be necessary.  Either way, I don't have a
> problem with this, but it should be done in a separate change.

I see. But sorry, I want to postpone the fix to reduce unnecessary
skip... As a first step, I separate this changes and will commit first.


>diff --git a/sys/net/if_l2tp.c b/sys/net/if_l2tp.c
>new file mode 100644
>index 000..dda8bbd
>--- /dev/null
>+++ b/sys/net/if_l2tp.c
>@@ -0,0 +1,1541 @@
>[...]
>+/*
>+ * l2tp global variable definitions
>+ */
>+LIST_HEAD(l2tp_sclist, l2tp_softc);
>+static struct l2tp_sclist l2tp_softc_list;
>+kmutex_t l2tp_list_lock;
>+
>+#if !defined(L2TP_ID_HASH_SIZE)
>+#define L2TP_ID_HASH_SIZE 64
>+#endif
>+static u_long l2tp_id_hash_mask;
>+
>+kmutex_t l2tp_hash_lock;
>+static struct pslist_head *l2tp_hashed_list = NULL;
> 
> Consider putting related global state into cacheline-aligned structs?

Oh, I forgot it. I should put them into cacheline-aligned structs.
In addition, I remove l2tp_id_hash_mask variable and use L2TP_ID_HASH_SIZE
to avoid holding lock in fast-path (l2tp_lookup_session_ref() =>
id_hash_func()).


> static struct {
> kmutex_t  lock;
> struct l2tp_sclistlist;
> } l2tp_softc __cacheline_aligned;
> 
> static struct {
> kmutex_t  lock;
> struct pslist_head*list;
>   unsigned long   mask;
> } l2tp_hash __cacheline_aligned;
> 
>+pserialize_t l2tp_psz;
>+struct psref_class *lv_psref_class __read_mostly;
> 
> __read_mostly for l2tp_psz?

Yes, I add it.

>+static int
>+l2tpdetach(void)
>+{
>+   int error;
>+
>+   if (!LIST_EMPTY(_softc_list))
>+   return EBUSY;
> 
> Need lock here?  Need to first set flag preventing new creation?
> 
> mutex_enter(_softc.lock);
> KASSERT(!l2tp_softc.dying);
> l2tp_softc.detaching = true;
> if (!LIST_EMPTY(_softc.list)) {
>   l2tp_softc.detaching = false;
>   mutex_exit(_softc.lock);
>   return EBUSY;
> }
> mutex_exit(_softc.lock);
> 
> Anyone trying to add to l2tp_softc.list must also check
> l2tp_softc.detaching before proceeding.

You are right. Hmm..., it's seems there are same problems in
other module'd interfaces such as pppoe(4), gre(4), and so on.
I think module framework could fix this problem, so I will ask
pgoyette@n.o and christos@n.o if they have any idea later.


>+static int
>+l2tp_clone_destroy(struct ifnet *ifp)
>+{
>+   struct l2tp_softc *sc = (void *) ifp;
> 
> Use container_of here:
> 
>   struct l2tp_softc *sc = container_of(ifp, struct l2tp_softc,
>   l2tp_ec.ec_if);
> 
> No functional difference, but the compiler type-checks it.

I use container_of here and similar codes.

>+void
>+l2tp_input(struct mbuf *m, struct ifnet *ifp)
>+{
>+
>+   KASSERT(ifp != NULL);
>+
>+   if (0 == (mtod(m, u_long) & 0x03)) {
>+   /* copy and align head of payload */
>+   struct mbuf *m_head;
>+   int copy_length;
>+
>+#define L2TP_COPY_LENGTH   60
>+#define L2TP_LINK_HDR_ROOM (MHLEN - L2TP_COPY_LENGTH - 4/*round4(2)*/)
>+
>+   if (m->m_pkthdr.len < L2TP_COPY_LENGTH) {
>+   copy_length = m->m_pkthdr.len;
>+   } else {
>+   copy_length = L2TP_COPY_LENGTH;
>+   }
>+
>+

RFC: L2TPv3 interface

2017-01-19 Thread Kengo NAKAHARA

Hi,

My co-workers implemented L2TPv3(RFC3931) interface for old version NetBSD.
And then, I port the inteface to NetBSD-current and MP-ify. Here is the patch.
http://netbsd.org/~knakahara/if-l2tp/if-l2tp.patch

The L2TPv3 interface implementation is named l2tp(4), which can send/receive
L2TPv3 data packet and control packet. Above patch include ifconfig(8)
extension, so we can create L2TPv3 tunnel between hostA and hostB as below.
== host A ==
# ifconfig wm0 inet 192.168.0.1/24
# ifconfig l2tp0 create
# ifconfig l2tp0 tunnel 192.168.0.1 192.168.0.2
# ifconfig l2tp0 session 1234 4321
# ifconfig bridge0 create
# brconfig bridge0 add wm1
# brconfig bridge0 add l2tp0
# ifconfig l2tp0 up
# ifconfig wm1 up
# ifconfig bridge0 up
== host A ==

== host B ==
# ifconfig l2tp0 create
# ifconfig l2tp0 tunnel 192.168.0.2 192.168.0.1
# ifconfig l2tp0 session 4321 1234
# ifconfig bridge0 create
# brconfig bridge0 add wm1
# brconfig bridge0 add l2tp0
# ifconfig l2tp0 up
# ifconfig wm1 up
# ifconfig bridge0 up
== host B ==

As l2tp(4) interface can send/receive L2TPv3 control packet, we can implement
userland daemon which manage sessions, that is, authentication, keepalive, and
so on. That's future work.

Could you comment this patch?


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: MP-safe pppoe(4)

2016-12-12 Thread Kengo NAKAHARA

Hi,

On 2016/12/01 18:40, Kengo NAKAHARA wrote:
> My co-worker Shoichi YAMAGUCHI and I implement pppoe(4) MP-safe.
> # Nearly all parts is written by him, thanks :)
> Here is the patch.
> http://www.netbsd.org/~knakahara/pppoe-mpify/pppoe-mpify.patch
> 
> At first, we try to use pserialize/psref, however pppoe would update
> softc members in softint context of control packet processing. As their
> update is higher priority than data packet, the control packet
> processing cannot be modified to deferred processing such as
> workqueue(9).
> Furthermore, some packet processing path, that is, sppp_input() and
> sppp_outout() write their softc members for keepalive.
> So, we give up to use pserialize/psref...
> 
> Therefore, we use rwlock and mutex, the lock design of above patch
> is below
> + for if_pppoe.c
>   - "pppoe_softc_list_lock" rwlock
> => global lock for pppoe softc list
>   - "softc->sc_session_lock" rwlock
> => per softc lock for some members which is frequently read by
>data packet processing path
>   - "softc->sc_lock" mutex
> => per softc lock for other members which is read/written by
>control packet processing and ioctl
> + for if_spppsubr.c
>   - "spppq_lock" mutex
> => global lock for sppp keep alive list
>   - "sppp->pp_lock" mutex
> => per struct sppp mutex for whole sppp include write-mostly
>members
> 
> 
> Could you comment this patch?

I rebase and commit it.


Thanks,

-- 
//////
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

MP-safe pppoe(4)

2016-12-01 Thread Kengo NAKAHARA

Hi,

My co-worker Shoichi YAMAGUCHI and I implement pppoe(4) MP-safe.
# Nearly all parts is written by him, thanks :)
Here is the patch.
http://www.netbsd.org/~knakahara/pppoe-mpify/pppoe-mpify.patch

At first, we try to use pserialize/psref, however pppoe would update
softc members in softint context of control packet processing. As their
update is higher priority than data packet, the control packet
processing cannot be modified to deferred processing such as
workqueue(9).
Furthermore, some packet processing path, that is, sppp_input() and
sppp_outout() write their softc members for keepalive.
So, we give up to use pserialize/psref...

Therefore, we use rwlock and mutex, the lock design of above patch
is below
+ for if_pppoe.c
  - "pppoe_softc_list_lock" rwlock
=> global lock for pppoe softc list
  - "softc->sc_session_lock" rwlock
=> per softc lock for some members which is frequently read by
   data packet processing path
  - "softc->sc_lock" mutex
=> per softc lock for other members which is read/written by
   control packet processing and ioctl
+ for if_spppsubr.c
  - "spppq_lock" mutex
=> global lock for sppp keep alive list
  - "sppp->pp_lock" mutex
=> per struct sppp mutex for whole sppp include write-mostly
   members


Could you comment this patch?


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: gif(4) MP-ify by psref [was Re: passive references]

2016-06-28 Thread Kengo NAKAHARA

Hi,

On 2016/05/20 22:02, Taylor R Campbell wrote:
>Date: Fri, 20 May 2016 16:28:15 +0900
>From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
> 
>I back to gif(4) MP-ify, since I could eliminate bottleneck of
>MP-scalable bridge(4) to support wm(4) TX multiqueue.
> 
>I rebase my gif(4) MP-ify patches and reflect latest psref(9) and
>pslist(9) implementation. Here is the patch series.
>http://www.netbsd.org/~knakahara/gif-psref/gif-psref.tgz
>And here is the unified patch.
>http://www.netbsd.org/~knakahara/gif-psref/unified-gif-psref.patch

I rebase and improve the patches. Here is the updated patch series.
http://www.netbsd.org/~knakahara/gif-mpify/gif-mpify.tgz
And here is the unified patch.
http://www.netbsd.org/~knakahara/gif-mpify/unified-gif-mpify.patch

> Nice!  I don't have time to review it right now, but I'll try to find
> some time in the next few days.  Have you done any throughput tests
> yet, or tried observing the effect of `ifconfig gif1 create' on the
> flow through gif0?  I'm particularly curious to see how bad it is to
> just encap drop packets while `ifconfig gif1 create' is running.

I measure the throughput, however the effect is very little, that is,
nearly equal to accidental error. Additionally I create 1024 gif(4)s,
and then measure gif0 throughput while doing "ifconfig gif1025 create". 
The "ifconfig gif1025 create" effect is less than 1 percent(about 0.2%)
degradation.
In contrast, "ifconfig gif1 create" become slow while it flow nearly
limit performance packets on gif0. It takes about 2 or 3 seconds.
Furthermore, "ifconfig gif1025 create" become very slow under the same
condition. It takes over 10 seconds, or over 20 seconds in worst case.
# of course, if gif0 has no flow, ifconfigs finish immediately

If there is no objection, I will commit them after a few days or a few
weeks.

Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: eliminate gif(4) Tx softint(9)

2016-06-23 Thread Kengo NAKAHARA

Hi,

On 2016/06/18 3:56, Taylor R Campbell wrote:
>Date: Fri, 17 Jun 2016 17:57:50 +0900
>From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
> 
>I have reconsidered and researched gif(4) implementation, as a result
>I think gif(4) Tx softint(9) should be eliminated. That is, the
>following patch is required.
>
> http://www.netbsd.org/~knakahara/gif-direct-output/gif-direct-output.patch
>I tested this patch by ATF and some gif(4) workload.
> 
> Sounds reasonable to me.  I like deleting needless code!

There is no objection for a week. I commit it with some modification.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: separate L3 output KERNEL_LOCK

2016-06-20 Thread Kengo NAKAHARA

Hi,

On 2016/06/16 16:01, Kengo NAKAHARA wrote:
> On 2016/06/16 15:49, Kengo NAKAHARA wrote:
>> Thank you for your comments.
>> From the above, I update the patch series,
>> 
>> http://www.netbsd.org/~knakahara/separate-l3-lock-2/separate-l3-lock-2.tgz
>> that is, add 0010 - 0013 patches.
> 
> Oops, I forget to fix ifnet comment. I add 0014 patch.
> 
>> and here is the updated unified patch.
>> 
>> http://www.netbsd.org/~knakahara/separate-l3-lock-2/unified-separate-l3-lock-2.patch
>>
>> Could you comment this patch?
>>
>> Does anyone else have any comments? Any comments are welcome.

I commit above patch.
# and some fix patches, sorry.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

RFC: eliminate gif(4) Tx softint(9)

2016-06-17 Thread Kengo NAKAHARA

Hi,

I have reconsidered and researched gif(4) implementation, as a result
I think gif(4) Tx softint(9) should be eliminated. That is, the
following patch is required.
http://www.netbsd.org/~knakahara/gif-direct-output/gif-direct-output.patch
I tested this patch by ATF and some gif(4) workload.

gif(4) Tx softint(9) is introduced by if_gif.c:r1.33 which is contributed
by KAME project. I ask my co-worker who was a member of the project, he
says "there must be no reason that gif(4) Tx must use softint. They
just wanted to use softint as it had been just implemented." Hmm, it
seems gif(4) Tx softint(9) can be eliminated.

Another point of view, each gif(4) establishes one Tx softint(9) handler,
that is, if 100 gif(4)s are created and set tunnel address, 100 softint(9)
handlers are established. That would cause harmful effect to softint(9)
processing. So, I think gif(4) Tx softint(9) should be eliminated.

Thoughts?
Or, does anyone know why gif(4) Tx processing uses softint(9)?


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: separate L3 output KERNEL_LOCK

2016-06-16 Thread Kengo NAKAHARA

Hi,

Thank you for your comments.

On 2016/06/16 12:46, Taylor R Campbell wrote:
>Date: Thu, 16 Jun 2016 12:26:10 +0900
>From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
> 
>I rewrite my code. Here is patch series,
>
> http://www.netbsd.org/~knakahara/separate-l3-lock-2/separate-l3-lock-2.tgz
>and here is unified patch.
>
> http://www.netbsd.org/~knakahara/separate-l3-lock-2/unified-separate-l3-lock-2.patch
> 
>Could you comment this patch?
> 
> Looks pretty good!  Some comments from a quick skim:
> 
> First, did you catch all the direct uses of ifp->if_output?
> 
> I see a few by grepping that don't appear in your patch:
> 
> . dist/pf/net/pf.c
> . external/bsd/ipf/netinet/ip_fil_netbsd.c
> . net/if_ecosubr.c
> . netatalk/aarp.c
> . netatalk/ddp_output.c
> 
> Or is there a reason you don't need to convert those?

I need to convert them.
# Hmm, I overlooked them while I wrote down my memo.

> Same goes for ifp->if_start.  Here are all the uses I don't see
> converted -- but many of these are in drivers which are perhaps
> deliberately calling their own start routines and hence maybe don't
> need to go through if_start_lock:
> 
> . altq/altq_cbq.c
> . altq/altq_subr.c
> . arch/acorn26/ioc/if_eca.c
> . dev/ic/arn5008.c
> . dev/ic/arn9003.c
> . dev/ic/bwi.c
> . dev/pci/if_ipw.c
> . dev/pci/if_iwi.c
> . dev/pci/if_iwm.c
> . dev/pci/if_iwn.c
> . dev/pci/if_wm.c
> . dev/usb/if_athn_usb.c
> . dev/usb/uhso.c
> . kern/kern_pmf.c
> . net/if_ecosubr.c
> . net/if_spppsubr.c

Hmm, I think ifp->if_start in arch/* and dev/* are themselves' start
routine, that is, they are not "from L2 to device driver processing".
So, I think it causes confusion to convert such ifp->if_start.

In contrast, altq/* and net/* should be converted. Uh... I overlooked
them.

>--- a/sys/dev/pci/if_wm.c
>+++ b/sys/dev/pci/if_wm.c
>@@ -2407,6 +2407,7 @@ alloc_retry:
>strlcpy(ifp->if_xname, xname, IFNAMSIZ);
>ifp->if_softc = sc;
>ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST;
>+   ifp->if_flags = IFEF_START_MPSAFE;
> 
> 
> Should be ifp->if_extflags = IFEF_START_MPSAFE, no?

Yes. I update a little while ago.

> Also, if there's any chance that ether_ifattach or similar might be
> called first (which it isn't in this case but perhaps might be in
> other drivers in the future), then maybe this should be `|=' instead
> of `=', since ifp->if_extflags is shared between the two layers.
> 
>--- a/sys/net/if.h
>+++ b/sys/net/if.h
>@@ -242,7 +242,8 @@ typedef struct ifnet {
>u_short if_index;   /* numeric abbreviation for this 
> if */
>short   if_timer;   /* time 'til if_slowtimo called */
>short   if_flags;   /* up/down, broadcast, etc. */
>-   short   if__pad1;   /* be nice to m68k ports */
>+   short   if_extflags;/* device driver MP-safe, etc. */
>+   /* be nice to m68k ports */
> 
> You can remove the `be nice to m68k ports' comment, I think.  I
> presume it was a comment explaining why there is padding there.

I will remove it.

>--- a/sys/net/if_ethersubr.c
>+++ b/sys/net/if_ethersubr.c
>@@ -910,6 +919,7 @@ ether_ifattach(struct ifnet *ifp, const uint8_t *lla)
> {
>struct ethercom *ec = (struct ethercom *)ifp;
> 
>+   ifp->if_extflags = IFEF_OUTPUT_MPSAFE;
> 
> This should probably use `|=', not `=' -- otherwise I expect it to
> overwrite any IFEF_START_MPSAFE flag supplied by the driver!

Oops, I change to '|='.

> Just to make sure these assignments don't clobber each other, can you
> add KASSERT(ifp->if_extflags & IFEF_OUTPUT_MPSAFE) to ether_output and
> KASSERT(ifp->if_extflags & IFEF_START_MPSAFE) to wm_start?  (We
> probably can't kassert that the kernel lock is dropped, unfortunately,
> but this is a close enough proxy for now.)

Yes, I will add KASSERT appropriately.


>From the above, I update the patch series,
http://www.netbsd.org/~knakahara/separate-l3-lock-2/separate-l3-lock-2.tgz
that is, add 0010 - 0013 patches.
and here is the updated unified patch.

http://www.netbsd.org/~knakahara/separate-l3-lock-2/unified-separate-l3-lock-2.patch

Could you comment this patch?

Does anyone else have any comments? Any comments are welcome.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: separate L3 output KERNEL_LOCK

2016-06-15 Thread Kengo NAKAHARA

Hi,

On 2016/06/16 12:26, Kengo NAKAHARA wrote:
> On 2016/06/16 9:47, Kengo NAKAHARA wrote:
>> On 2016/06/14 23:14, Taylor R Campbell wrote:
>>>Date: Tue, 14 Jun 2016 13:33:08 +0900
>>>From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
> snip
>>> That said, why not not use two flags, say IFEF_OUTPUT_MPSAFE and
>>> IFEF_START_MPSAFE?  I never much liked the WRAP_FOO device -- is there
>>> a particular reason it's better for if_output?
>>>
>>> It seems to me that it is easier to audit changes for the flag than to
>>> audit changes for the wrapper: for the flag, it is only necessary to
>>> make sure all callers of ifp->if_start instead use if_start_lock; for
>>> WRAP_FOO, it's not as easy to make sure you adjusted everything.  But
>>> maybe I'm missing something about the motivation for WRAP_FOO here.
>>
>> Fair enough. I will rewrite my code to use that two flags.
> 
> I rewrite my code. Here is patch series,
> http://www.netbsd.org/~knakahara/separate-l3-lock-2/separate-l3-lock-2.tgz
> and here is unified patch.
> 
> http://www.netbsd.org/~knakahara/separate-l3-lock-2/unified-separate-l3-lock-2.patch

I update a little above patches (especially 0009-*.patch).
Thank you, Michael. :)
 

Thanks,

-- 
//////
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: separate L3 output KERNEL_LOCK

2016-06-15 Thread Kengo NAKAHARA

Hi,

On 2016/06/16 9:47, Kengo NAKAHARA wrote:
> On 2016/06/14 23:14, Taylor R Campbell wrote:
>>Date: Tue, 14 Jun 2016 13:33:08 +0900
>>    From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
snip
>> That said, why not not use two flags, say IFEF_OUTPUT_MPSAFE and
>> IFEF_START_MPSAFE?  I never much liked the WRAP_FOO device -- is there
>> a particular reason it's better for if_output?
>>
>> It seems to me that it is easier to audit changes for the flag than to
>> audit changes for the wrapper: for the flag, it is only necessary to
>> make sure all callers of ifp->if_start instead use if_start_lock; for
>> WRAP_FOO, it's not as easy to make sure you adjusted everything.  But
>> maybe I'm missing something about the motivation for WRAP_FOO here.
> 
> Fair enough. I will rewrite my code to use that two flags.

I rewrite my code. Here is patch series,
http://www.netbsd.org/~knakahara/separate-l3-lock-2/separate-l3-lock-2.tgz
and here is unified patch.

http://www.netbsd.org/~knakahara/separate-l3-lock-2/unified-separate-l3-lock-2.patch

Could you comment this patch?


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: separate L3 output KERNEL_LOCK

2016-06-15 Thread Kengo NAKAHARA

Hi,

On 2016/06/14 23:14, Taylor R Campbell wrote:
>Date: Tue, 14 Jun 2016 13:33:08 +0900
>From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
> 
>The design to implement this concept is consisted of below two parts.
>1. at L3 output processing call L2 output routine
>   - remove KERNEL_LOCK from caller(L3)
>   - wrap callee(L2) ifp->if_output with KERNEL_LOCK if necessary
>2. at L2 output processing call device driver output routine
>   - check device driver MP-safe flags (*)
>   - if device driver has MP-safe flag, call without KERNEL_LOCK and
> vice versa
> 
>  (*) Unfortunately, struct ifnet->if_flags is already used all bit.
>  So, I change if__pad1 to if_extflags and use it as flags.
> 
> I was going to ask `Why two different mechanisms for if_output and
> if_start -- why not use the MPSAFE flag for both?', until the answer
> occurred to me:  An ifnet has a split identity between the
> device-specific parts, whose code lives in sys/dev, and the
> protocol-specific parts, whose code lives in sys/net; and the two
> identities may be independently made MP-safe.

Oh, I forget to explain the reasons. The reason is the flag checking
codes seems racy. Therefore, I want to use KERNEL_LOCK wrapper not only
if_output but also if_start. However, there are many device drivers
much more than L2 component, so I give up to apply KERNEL_LOCK wrapper
to if_start.
# When I was implementing wm(4) TX multiqueue support, I had a hard time
# to debug the similar flag checking codes...

Practically, MPSAFE flag must be written at initialization, and then be
read only at other run time. So, the flag checking codes would not cause
race.

> That said, why not not use two flags, say IFEF_OUTPUT_MPSAFE and
> IFEF_START_MPSAFE?  I never much liked the WRAP_FOO device -- is there
> a particular reason it's better for if_output?
> 
> It seems to me that it is easier to audit changes for the flag than to
> audit changes for the wrapper: for the flag, it is only necessary to
> make sure all callers of ifp->if_start instead use if_start_lock; for
> WRAP_FOO, it's not as easy to make sure you adjusted everything.  But
> maybe I'm missing something about the motivation for WRAP_FOO here.

Fair enough. I will rewrite my code to use that two flags.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

separate L3 output KERNEL_LOCK

2016-06-13 Thread Kengo NAKAHARA

Hi,

Currently, L3 output processing call L2 (or L3 tunneling) output routine
holding KERNEL_LOCK. This KERNEL_LOCK is controled by NET_MPSAFE kernel
option. e.g.
https://nxr.netbsd.org/xref/src/sys/netinet/ip_output.c#215

That is, if NET_MPSAFE is enabled at current implementation, L2 and device
driver output processing run without KERNEL_LOCK whether the L2 component
and the device driver is MP-safe or not.

To enable NET_MPSAFE safely in the future, it is required to separete the
KERNEL_LOCK to each L2 component and device drivers, and then enable
KERNEL_LOCK for each component which is not MP-safe yet.

The design to implement this concept is consisted of below two parts.
1. at L3 output processing call L2 output routine
   - remove KERNEL_LOCK from caller(L3)
   - wrap callee(L2) ifp->if_output with KERNEL_LOCK if necessary
2. at L2 output processing call device driver output routine
   - check device driver MP-safe flags (*)
   - if device driver has MP-safe flag, call without KERNEL_LOCK and
 vice versa

  (*) Unfortunately, struct ifnet->if_flags is already used all bit.
  So, I change if__pad1 to if_extflags and use it as flags.

Here is the patch series,
http://www.netbsd.org/~knakahara/separate-l3-lock/separate-l3-lock.tgz
and unified patch.

http://www.netbsd.org/~knakahara/separate-l3-lock/unified-separate-l3-lock.patch

Could you comment this design or implementation?


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: gif(4) MP-ify by psref [was Re: passive references]

2016-05-23 Thread Kengo NAKAHARA

Hi,

I found some bugs and missing patch, so I update the patch series and
unified patch, sorry.

On 2016/05/20 22:02, Taylor R Campbell wrote:
>Date: Fri, 20 May 2016 16:28:15 +0900
>From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
> 
>I back to gif(4) MP-ify, since I could eliminate bottleneck of
>MP-scalable bridge(4) to support wm(4) TX multiqueue.
> 
>I rebase my gif(4) MP-ify patches and reflect latest psref(9) and
>pslist(9) implementation. Here is the patch series.
>http://www.netbsd.org/~knakahara/gif-psref/gif-psref.tgz
>And here is the unified patch.
>http://www.netbsd.org/~knakahara/gif-psref/unified-gif-psref.patch
> 
> Nice!  I don't have time to review it right now, but I'll try to find
> some time in the next few days.  Have you done any throughput tests
> yet, or tried observing the effect of `ifconfig gif1 create' on the
> flow through gif0?  I'm particularly curious to see how bad it is to
> just encap drop packets while `ifconfig gif1 create' is running.

I will re-measure that throughput for the fixed kernel.
# in the measurement result of old kernel, it seems "ifconfig gif1 create"
# does not so effect. So, I will also measure throughput when I create
# more gif(4)s (2, 4, 8, and so on).


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

gif(4) MP-ify by psref [was Re: passive references]

2016-05-20 Thread Kengo NAKAHARA

Hi,

I back to gif(4) MP-ify, since I could eliminate bottleneck of
MP-scalable bridge(4) to support wm(4) TX multiqueue.

I rebase my gif(4) MP-ify patches and reflect latest psref(9) and
pslist(9) implementation. Here is the patch series.
http://www.netbsd.org/~knakahara/gif-psref/gif-psref.tgz
And here is the unified patch.
http://www.netbsd.org/~knakahara/gif-psref/unified-gif-psref.patch

Basic design is almost the same as the below patch.

On 2016/03/03 13:31, Kengo NAKAHARA wrote:
> In fist, I reflect your reviews and restructure patch series.
> Here is git format-patch patch series.
> 
> http://www.netbsd.org/~knakahara/fix-gif-softint-using-psref2/fix-gif-softint-using-psref2-2.tgz
> And here is the unified patch, maybe unnecessary.
>
> http://www.netbsd.org/~knakahara/fix-gif-softint-using-psref2/unified-fix-gif-softint-using-psref2-2.patch

Could you comment the gif-psref implementation? If there no objection,
I will commit the patch series after a few days or a few weeks.

Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: new ifnet MP-scalable sending interface

2016-04-27 Thread Kengo NAKAHARA

Hi, riastradh@n.o,

On 2016/04/27 23:49, Taylor R Campbell wrote:
>Date: Wed, 27 Apr 2016 16:27:56 +0900
>From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
> 
>On 2016/04/27 0:27, Taylor R Campbell wrote:
>> - Why not call the struct ifnet member `if_enqueue'?
>> 
>> It may be confusing to add a new verb `transmit' to a lexicon that
>> already has a lot of local jargon.
> 
>Fair enough. However "if_enqueue" would not be appropriate as joerg@n.o
>point out in the other mail.
> 
> joerg's rationale is good enough for me.  if_transmit is fine.
> 
>> - Have you considered making all callers use the ifq_enqueue function,
>> instead of calling the if_transmit member directly?
>> 
>> Generally, I would like to see a clearer separation between
>> (a) the responsibilities of a driver (`fill in ifp with functions
>>   that implement your functionality'), and
>> (b) the responsibilities of a caller in the network stack (`call
>>   ifq_enqueue to send a packet off to the driver').
>> But maybe in this case it doesn't matter so much, if there's no
>> additional logic in ifq_enqueue.
> 
>I have considered it a little, so that I use if_transmit member directly
>because of the clean of caller and ifq_enqueue() implementation. I feel
>like avoiding extra if-else statement in fast path every time. However,
>as you point out, it would not be a clear responsibility separation...
>Hmm, I will implement if_transmit to virtual interfaces such as gif(4).
>Do you think virtual interfaces should also use ifq_enqueue? If so,
>I will modify to use ifq_enqueue.
> 
> No, on reflection, I think using the member directly is probably fine.

Thank you for comments. I commit the patch.


Thanks,

-- 
//////
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: new ifnet MP-scalable sending interface

2016-04-27 Thread Kengo NAKAHARA

Hi joerg@n.o,

On 2016/04/27 17:53, Joerg Sonnenberger wrote:
> On Wed, Apr 27, 2016 at 04:27:56PM +0900, Kengo NAKAHARA wrote:
>> I have considered it a little, so that I use if_transmit member directly
>> because of the clean of caller and ifq_enqueue() implementation. I feel
>> like avoiding extra if-else statement in fast path every time. However,
>> as you point out, it would not be a clear responsibility separation...
>> Hmm, I will implement if_transmit to virtual interfaces such as gif(4).
>> Do you think virtual interfaces should also use ifq_enqueue? If so,
>> I will modify to use ifq_enqueue.
> 
> I think if_transmit is perfect fine.

Thank you for comments. I commit the patch.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: new ifnet MP-scalable sending interface

2016-04-27 Thread Kengo NAKAHARA

Hi,

On 2016/04/27 0:27, Taylor R Campbell wrote:
>Date: Tue, 26 Apr 2016 13:22:40 +0900
>From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
> 
>Does anyone have any comments or objections? If there is no objection,
>I will commit this patch after a few days or a few weeks.
>http://www.netbsd.org/~knakahara/if-transmit/if-transmit.patch
># rebased, no functional changes
> 
> Two little suggestions:
> 
> - Why not call the struct ifnet member `if_enqueue'?
> 
> It may be confusing to add a new verb `transmit' to a lexicon that
> already has a lot of local jargon.

Fair enough. However "if_enqueue" would not be appropriate as joerg@n.o
point out in the other mail.
Hmm..., I think of the following candidates, but I cannot determine.
# As you know, I am not good at English...
- if_express
- if_post
- if_pass
- if_downward
- if_direct
- if_transmit (again)
- if_enqueue (again)

Which one is better? Anything else better?


> - Have you considered making all callers use the ifq_enqueue function,
> instead of calling the if_transmit member directly?
> 
> Generally, I would like to see a clearer separation between
> (a) the responsibilities of a driver (`fill in ifp with functions
>   that implement your functionality'), and
> (b) the responsibilities of a caller in the network stack (`call
>   ifq_enqueue to send a packet off to the driver').
> But maybe in this case it doesn't matter so much, if there's no
> additional logic in ifq_enqueue.

I have considered it a little, so that I use if_transmit member directly
because of the clean of caller and ifq_enqueue() implementation. I feel
like avoiding extra if-else statement in fast path every time. However,
as you point out, it would not be a clear responsibility separation...
Hmm, I will implement if_transmit to virtual interfaces such as gif(4).
Do you think virtual interfaces should also use ifq_enqueue? If so,
I will modify to use ifq_enqueue.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: new ifnet MP-scalable sending interface

2016-04-25 Thread Kengo NAKAHARA

Hi,

On 2016/04/25 23:17, Joerg Sonnenberger wrote:
> On Mon, Apr 25, 2016 at 01:48:40PM +0900, Kengo NAKAHARA wrote:
>>> Most noticably, it can be used to avoid
>>> ever getting TXEOF interrupts while the device is actively being used.
>>
>> Hmm, do you mean so-called "poll-mode"?
> 
> No, poll mode would be mostly for the RX side. This is more about
> pushing packets to the hardware as soon as possible, not waiting for the
> next interrupt to queue more stuff.

I see. Thank you for your more comments. if_transmit() would allow
drivers to do such way. That is up to drivers.

Does anyone have any comments or objections? If there is no objection,
I will commit this patch after a few days or a few weeks.
http://www.netbsd.org/~knakahara/if-transmit/if-transmit.patch
# rebased, no functional changes

Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: RFC: new ifnet MP-scalable sending interface

2016-04-24 Thread Kengo NAKAHARA

Hi,

On 2016/04/22 16:47, Joerg Sonnenberger wrote:
> On Fri, Apr 22, 2016 at 11:09:24AM +0900, Kengo NAKAHARA wrote:
>> So, I want to introduce new MP-scalable ifnet interface which doesn't
>> use if_snd queue and directly pass mbuf to lower layers. The interface
>> is called if_transmit() and the prototype is the following:
>> int (*if_transmit)(struct ifnet *, struct mbuf *);
> 
> Let me try to expand that a bit and you comment on whether it actually
> reflects your intention.
> 
> The old interface queueing implementation is based on assumption
> that restarting transmits after the in-device queue is cheaper than
> continued requeueing from the network stack. The new interface provides
> a hook for processing each outgoing packet individually. It can provide
> the old interface via a generic glue implementation, but it also allows
> a device more granular control.

Yes, you are right. I implement glue function(if_transmit()@if.c) in
my patch pointed previous mail, which call IFQ_ENQUEUE and ifp->if_start().
With the new interface, a device can control a packet to put H/W queues
with appropriate lock granularity.

> Most noticably, it can be used to avoid
> ever getting TXEOF interrupts while the device is actively being used.

Hmm, do you mean so-called "poll-mode"?


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

RFC: new ifnet MP-scalable sending interface

2016-04-21 Thread Kengo NAKAHARA

Hi,

Currently, network interfaces does IFQ_ENQUEUE() and then call
ifp->if_start() in the packet sending process. "IFQ_ENQUEUE() +
ifp->if_start()" processing is already MP-safe, however it is not
MP-scalable because of IFQ_LOCK in IFQ_ENQUEUE() and IFQ_DEQUEUE().

So, I want to introduce new MP-scalable ifnet interface which doesn't
use if_snd queue and directly pass mbuf to lower layers. The interface
is called if_transmit() and the prototype is the following:
int (*if_transmit)(struct ifnet *, struct mbuf *);

At first, this interface must be used by network device driver to support
TX multiqueue so that it doesn't bottleneck bridge(4) MP-scalable.
I think this interface might also be used by virtual network interfaces
such as gif(4).

Here is the patch.
http://www.netbsd.org/~knakahara/if-transmit/if-transmit.patch

Does anyone have any comments? Any comments are welcome.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: IFQ_ENQUEUE argument refactor (was Re: RFC: ALTQ caller refactor)

2016-04-20 Thread Kengo NAKAHARA

Hi,

On 2016/04/12 12:18, Kengo NAKAHARA wrote:
> On 2016/04/11 17:11, Joerg Sonnenberger wrote:
>> On Mon, Apr 11, 2016 at 01:06:00PM +0900, Kengo NAKAHARA wrote:
>>> I implement moving struct altq_pktattr from m_tag to mbuf. Here is
>>> the patch.
>>> 
>>> http://www.netbsd.org/~knakahara/mbuf-altq-pktattr/mbuf-altq-pktattr.patch
>>
>> The only issue here is that I would prefer to not add the ALTQ
>> include, but just move the definition into sys/mbuf.h. Out of curiosity,
> 
> I see. I remove ALTQ include. Since I add individual fields in struct
> pkthdrAs as you indicated following, I don't move strct altq_pktattr.

I commit it after mbuf initialize function commit.

Thank you for many reviews and advises!


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: IFQ_ENQUEUE argument refactor (was Re: RFC: ALTQ caller refactor)

2016-04-11 Thread Kengo NAKAHARA

Hi,

Thank you for comments.

On 2016/04/11 17:11, Joerg Sonnenberger wrote:
> On Mon, Apr 11, 2016 at 01:06:00PM +0900, Kengo NAKAHARA wrote:
>> I implement moving struct altq_pktattr from m_tag to mbuf. Here is
>> the patch.
>> 
>> http://www.netbsd.org/~knakahara/mbuf-altq-pktattr/mbuf-altq-pktattr.patch
> 
> The only issue here is that I would prefer to not add the ALTQ
> include, but just move the definition into sys/mbuf.h. Out of curiosity,

I see. I remove ALTQ include. Since I add individual fields in struct
pkthdrAs as you indicated following, I don't move strct altq_pktattr.

> is the new size the same if you add the fields individually and the int
> first? I.e. does using the struct add initial padding? It might be
> useful at some point later to drop altq_pktattr as separate type
> altogether and just use individual fields. As soon as some data after
> pattr has to be added, we get padding on LP64 otherwise.

Fair enough.
The new size is the same when I add the fields individually an the int
first. I change the fields order in several patterns, but the new size
is the same unless adding __packed attribute.
# with __packed, the size changes from 56 bytes to 52 bytes in LP64,
# in contrast, the size is the same in ILP32
However, I get padding when I add new uint32_t fields after pattr in
LP64. So, I modify to use individual fields for future works.
I ensured this changes does not cause performance issues in my measurement.

Here is the updated patch and patch series.
- unified patch

http://www.netbsd.org/~knakahara/mbuf-altq-pktattr/mbuf-altq-pktattr-2.patch
- patch series

http://www.netbsd.org/~knakahara/mbuf-altq-pktattr/0001-use-altq_pktattr-as-mbuf.patch

http://www.netbsd.org/~knakahara/mbuf-altq-pktattr/0002-fix-caller-side.patch

http://www.netbsd.org/~knakahara/mbuf-altq-pktattr/0003-more-refactor-to-fix-performance-issue.patch

http://www.netbsd.org/~knakahara/mbuf-altq-pktattr/0004-improve-comment.patch

http://www.netbsd.org/~knakahara/mbuf-altq-pktattr/0005-reviewed-by-joerg-n.o.patch

Could you comment this patch?

Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: IFQ_ENQUEUE argument refactor (was Re: RFC: ALTQ caller refactor)

2016-04-10 Thread Kengo NAKAHARA

Hi,

On 2016/04/11 13:06, Kengo NAKAHARA wrote:
> On 2016/04/07 8:50, Kengo NAKAHARA wrote:
>> Hi,
>>
>> On 2016/04/05 16:38, Joerg Sonnenberger wrote:
>>> On Tue, Apr 05, 2016 at 01:11:01PM +0900, Kengo NAKAHARA wrote:
>>>> (Q2) How do I decide the data is too large or not?
>>>> e.g. ALTQ case, the data is struct altq_pktattr whose members are void *,
>>>> int, and void *.
>>>> http://nxr.netbsd.org/xref/src/sys/altq/if_altq.h#86
>>>> Is this three member struct too large?
>>>> another PACKET_TAG_PF m_tag case, the data is struct pf_mtag
>>>> http://nxr.netbsd.org/xref/src/sys/dist/pf/net/pf_mtag.h#46
>>>> How about this pattern?
>>>
>>> Look at struct pkthdr. It's currently 2 pointers and 4 ints in the first
>>> mbuf of every packet. We want to avoid pushing too much into it as the
>>> rest of the mbuf is used for data, so "reasonable" small packets, we
>>> want to completely stay in the first mbuf. The total size of a mbuf is
>>> currently either 256 or 512 Bytes, mbuf header is < 60 Bytes, so
>>> assuming that reasonable small means < 80 Bytes, we have around 120
>>> Bytes or so to spend on the packet header.
>>
>> Thank you very much for your great helpful comments.
>>
>> How about Q1?
> 
> I implement moving struct altq_pktattr from m_tag to mbuf. Here is
> the patch.
> http://www.netbsd.org/~knakahara/mbuf-altq-pktattr/mbuf-altq-pktattr.patch

I also upload separated patch series.

http://www.netbsd.org/~knakahara/mbuf-altq-pktattr/0001-use-altq_pktattr-as-mbuf.patch

http://www.netbsd.org/~knakahara/mbuf-altq-pktattr/0002-fix-caller-side.patch

http://www.netbsd.org/~knakahara/mbuf-altq-pktattr/0003-more-refactor-to-fix-performance-issue.patch

http://www.netbsd.org/~knakahara/mbuf-altq-pktattr/0004-improve-comment.patch

> As the comment in above patch, new mbuf size is 36 bytes for ILP32 and
> 56 bytes for LP64. They are less than 80 bytes which joerg@n.o pointed
> out.
> # I will add ethernet RSS Hash Value (32 bits) and RSS type (maybe 4
> # bits) to use multiqueue more effectively. I will also add gif(4)
> # nested count (8 bits or 16 bits) which use m_tag currently. If all
> # of them are added, it is still less than 80 bytes.
> 
> Could you comment the patch?
> 
> Hmm, The patch would conform to joerg@n.o's and tls@n.o's comments.
> If there is no other objection, I will commit above patch after a few
> days or a few weeks.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: IFQ_ENQUEUE argument refactor (was Re: RFC: ALTQ caller refactor)

2016-04-10 Thread Kengo NAKAHARA

Hi,

On 2016/04/07 8:50, Kengo NAKAHARA wrote:
> Hi,
> 
> On 2016/04/05 16:38, Joerg Sonnenberger wrote:
>> On Tue, Apr 05, 2016 at 01:11:01PM +0900, Kengo NAKAHARA wrote:
>>> (Q2) How do I decide the data is too large or not?
>>> e.g. ALTQ case, the data is struct altq_pktattr whose members are void *,
>>> int, and void *.
>>> http://nxr.netbsd.org/xref/src/sys/altq/if_altq.h#86
>>> Is this three member struct too large?
>>> another PACKET_TAG_PF m_tag case, the data is struct pf_mtag
>>> http://nxr.netbsd.org/xref/src/sys/dist/pf/net/pf_mtag.h#46
>>> How about this pattern?
>>
>> Look at struct pkthdr. It's currently 2 pointers and 4 ints in the first
>> mbuf of every packet. We want to avoid pushing too much into it as the
>> rest of the mbuf is used for data, so "reasonable" small packets, we
>> want to completely stay in the first mbuf. The total size of a mbuf is
>> currently either 256 or 512 Bytes, mbuf header is < 60 Bytes, so
>> assuming that reasonable small means < 80 Bytes, we have around 120
>> Bytes or so to spend on the packet header.
> 
> Thank you very much for your great helpful comments.
> 
> How about Q1?

I implement moving struct altq_pktattr from m_tag to mbuf. Here is
the patch.
http://www.netbsd.org/~knakahara/mbuf-altq-pktattr/mbuf-altq-pktattr.patch

As the comment in above patch, new mbuf size is 36 bytes for ILP32 and
56 bytes for LP64. They are less than 80 bytes which joerg@n.o pointed
out.
# I will add ethernet RSS Hash Value (32 bits) and RSS type (maybe 4
# bits) to use multiqueue more effectively. I will also add gif(4)
# nested count (8 bits or 16 bits) which use m_tag currently. If all
# of them are added, it is still less than 80 bytes.

Could you comment the patch?

Hmm, The patch would conform to joerg@n.o's and tls@n.o's comments.
If there is no other objection, I will commit above patch after a few
days or a few weeks.

Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: pserialize-safe queue(3) alternative

2016-04-08 Thread Kengo NAKAHARA

Hi,

Thank you for quick update!

On 2016/04/08 1:09, Taylor R Campbell wrote:
>Date: Thu, 7 Apr 2016 18:22:08 +0900
>From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
> 
>Sorry, I found one more question. _pslist_writer_next_container() does
>not use "next" variable. I think the following patch is required.
> 
> Thanks!  Here's a new draft with this fix, the insert_after fix, and a
> note on write ordering to justify the placement of membar_producer.

I apply pslist to my gif(4) implementation. Furthermore, ozaki-r@n.o
apply pslist to bridge(4) and check by ATF. In consequence, we think
the following patch might be required. Is the patch appropriate?


diff --git a/sys/sys/pslist.h b/sys/sys/pslist.h
index 2aef4b3..a59713e 100644
--- a/sys/sys/pslist.h
+++ b/sys/sys/pslist.h
@@ -73,6 +73,8 @@ static inline void
 pslist_writer_insert_head(struct pslist_head *head, struct pslist_entry *new)
 {
 
+   if (head->plh_first != NULL)
+   head->plh_first->ple_prevp = >ple_next;
new->ple_prevp = >plh_first;
new->ple_next = head->plh_first;
membar_producer();
@@ -107,6 +109,8 @@ static inline void
 pslist_writer_remove(struct pslist_entry *entry)
 {
 
+   if (entry->ple_next != NULL)
+   entry->ple_next->ple_prevp = entry->ple_prevp;
*entry->ple_prevp = entry->ple_next;
entry->ple_prevp = NULL; /* poison */
 }
@@ -187,7 +191,7 @@ _pslist_reader_next_container(struct pslist_entry *entry, 
ptrdiff_t offset)
 {
struct pslist_entry *next = entry->ple_next;
 
-   if (entry == NULL)
+   if (next == NULL)
return NULL;
membar_datadep_consumer();
 
@@ -224,10 +228,10 @@ _pslist_reader_next_container(struct pslist_entry *entry, 
ptrdiff_t offset)
pslist_writer_remove(&(E)->F)
 #definePSLIST_WRITER_FIRST(H, T, F)
  \
((T *)(_pslist_writer_first_container((H), offsetof(T, F))) + \
-   _PSLIST_VALIDATE_CONTAINER(pslist_first(H), T, F))
+   _PSLIST_VALIDATE_CONTAINER(pslist_writer_first(H), T, F))
 #definePSLIST_WRITER_NEXT(V, T, F) 
  \
((T *)(_pslist_writer_next_container(&(V)->F, offsetof(T, F))) +  \
-   _PSLIST_VALIDATE_CONTAINER(pslist_next(&(V)->F), T, F))
+   _PSLIST_VALIDATE_CONTAINER(pslist_writer_next(&(V)->F), T, F))
 #definePSLIST_WRITER_FOREACH(V, H, T, F)   
  \
for ((V) = PSLIST_WRITER_FIRST((H), T, F);\
(V) != NULL;  \
@@ -235,10 +239,10 @@ _pslist_reader_next_container(struct pslist_entry *entry, 
ptrdiff_t offset)
 
 #definePSLIST_READER_FIRST(H, T, F)
  \
((T *)(_pslist_reader_first_container((H), offsetof(T, F))) + \
-   _PSLIST_VALIDATE_CONTAINER(pslist_first(H), T, F))
+   _PSLIST_VALIDATE_CONTAINER(pslist_reader_first(H), T, F))
 #definePSLIST_READER_NEXT(V, T, F) 
  \
((T *)(_pslist_reader_next_container(&(V)->F, offsetof(T, F))) +  \
-   _PSLIST_VALIDATE_CONTAINER(pslist_next(&(V)->F), T, F))
+   _PSLIST_VALIDATE_CONTAINER(pslist_reader_next(&(V)->F), T, F))
 #definePSLIST_READER_FOREACH(V, H, T, F)   
  \
for ((V) = PSLIST_READER_FIRST((H), T, F);\
(V) != NULL;  \



Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: pserialize-safe queue(3) alternative

2016-04-08 Thread Kengo NAKAHARA

Hi,

On 2016/04/08 1:05, Taylor R Campbell wrote:
>Date: Thu, 7 Apr 2016 17:50:56 +0900
>From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
> 
>I have a question. It seems that pslist_writer_insert_after() does not
>change old entry->ple_next->ple_prevp. I think pslist_writer_insert_after()
>may be only used to add last element. If that is correct, I think
>"pslist_writer_insert_last" might be better for the function name.
> 
> Oops!  Good catch.  Obviously I need to write some automatic tests for
> this.

I agree. It's nice idea. :)


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: pserialize-safe queue(3) alternative

2016-04-07 Thread Kengo NAKAHARA

Hi,

On 2016/04/07 17:50, Kengo NAKAHARA wrote:
> Hi,
> 
> On 2016/04/04 3:23, Taylor R Campbell wrote:
>>Date: Sun, 3 Apr 2016 18:20:06 +
>>From: Taylor R Campbell <campbell+netbsd-tech-k...@mumble.net>
>>
>>The attached pslist.h implements a pserialize-safe alternative to the
>>LIST_* operations in queue(3), with example use in pslist.c:
>>
>> ...actually attached, this time for real!
> 
> I have a question. It seems that pslist_writer_insert_after() does not
> change old entry->ple_next->ple_prevp. I think pslist_writer_insert_after()
> may be only used to add last element. If that is correct, I think
> "pslist_writer_insert_last" might be better for the function name.
> 
> For the rest, it looks good to me. I'm eager that it is committed.

Sorry, I found one more question. _pslist_writer_next_container() does
not use "next" variable. I think the following patch is required.

--- a/sys/sys/pslist.h
+++ b/sys/sys/pslist.h
@@ -130,7 +130,7 @@ _pslist_writer_next_container(struct pslist_entry *entry, 
ptrdiff_t offset)
 {
struct pslist_entry *next = entry->ple_next;
 
-   return (next == NULL ? NULL : (char *)entry - offset);
+   return (next == NULL ? NULL : (char *)next - offset);
 }



Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: pserialize-safe queue(3) alternative

2016-04-07 Thread Kengo NAKAHARA

Hi,

On 2016/04/04 3:23, Taylor R Campbell wrote:
>Date: Sun, 3 Apr 2016 18:20:06 +
>From: Taylor R Campbell <campbell+netbsd-tech-k...@mumble.net>
> 
>The attached pslist.h implements a pserialize-safe alternative to the
>LIST_* operations in queue(3), with example use in pslist.c:
> 
> ...actually attached, this time for real!

I have a question. It seems that pslist_writer_insert_after() does not
change old entry->ple_next->ple_prevp. I think pslist_writer_insert_after()
may be only used to add last element. If that is correct, I think
"pslist_writer_insert_last" might be better for the function name.

For the rest, it looks good to me. I'm eager that it is committed.

Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: IFQ_ENQUEUE argument refactor (was Re: RFC: ALTQ caller refactor)

2016-04-06 Thread Kengo NAKAHARA

Hi,

On 2016/04/05 16:38, Joerg Sonnenberger wrote:
> On Tue, Apr 05, 2016 at 01:11:01PM +0900, Kengo NAKAHARA wrote:
>> (Q2) How do I decide the data is too large or not?
>> e.g. ALTQ case, the data is struct altq_pktattr whose members are void *,
>> int, and void *.
>> http://nxr.netbsd.org/xref/src/sys/altq/if_altq.h#86
>> Is this three member struct too large?
>> another PACKET_TAG_PF m_tag case, the data is struct pf_mtag
>> http://nxr.netbsd.org/xref/src/sys/dist/pf/net/pf_mtag.h#46
>> How about this pattern?
> 
> Look at struct pkthdr. It's currently 2 pointers and 4 ints in the first
> mbuf of every packet. We want to avoid pushing too much into it as the
> rest of the mbuf is used for data, so "reasonable" small packets, we
> want to completely stay in the first mbuf. The total size of a mbuf is
> currently either 256 or 512 Bytes, mbuf header is < 60 Bytes, so
> assuming that reasonable small means < 80 Bytes, we have around 120
> Bytes or so to spend on the packet header.

Thank you very much for your great helpful comments.

How about Q1?


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: IFQ_ENQUEUE argument refactor (was Re: RFC: ALTQ caller refactor)

2016-04-04 Thread Kengo NAKAHARA

Hi,

On 2016/04/04 22:05, Thor Lancelot Simon wrote:
> On Mon, Apr 04, 2016 at 05:25:14PM +0900, Kengo NAKAHARA wrote:
>>
>> You are right. Hmm..., I'm sorry, I change my objective. I think it could
>> be better to use pool cache for m_tag itself.
> 
> The tags are a mess and I've seen them seriously impact performance in
> numerous profiles of the network stack.  In fact, at a former employer,
> it was sort of an expected thing: each time we'd experiment with some
> new feature of the stack (vlan tagging, altq, etc.) we'd see a
> performance degradation which profiling would revel traced back to the
> allocation, use, and deallocation of mbuf tags.

It's very interesting. Could you tell me in detail? (of course, if it
does not violate NDA).

I measured the ALTQ performance of 3 patterns, that is
- NetBSD-current(using stack variable)
- pattern 1(using malloc m_tag)
- pattern 2(using pool cache m_tag)
As you indicated, pattern 1 much less performance than current.
However, pattern 2 is almost the same performance as current.
e.g. when the frame size is 128 byte, the throughput is about
147,800 fps, and the duration time is 20 sec, the number of dropped
packets are the following.

 run number | current | pattern 1 | pattern 2
+-+---+---
  1 |  21 |   571,779 |   721
  2 |   0 |   568,329 |   766
  3 | 741 |   577,228 |22
  4 | 650 |   561,325 |   774
  5 | 730 |   568,038 | 0

For this result, I think my ALTQ pool cache implementation (pattern 2)
does not degrade performance.

I think the changing from malloc m_tag to pool cache m_tag does not
degrade performance, either. As the latency of pool cache m_tag is
less than one of malloc m_tag.
http://mail-index.netbsd.org/tech-net/2016/02/19/msg005633.html

> I would suggest that the only "proper" use of mbuf tags as currently
> implemented is experimentation.  Uses of the tags which have been in
> the tree for a decade or more should be subsumed into the mbuf
> datastructure itself, with some care taken to not make it too large,
> or perhaps a single, "standard" extension structure could be defined
> to capture the less-common cases.

Hmm, sorry, I would not understand well... I have two question.

(Q1) Is harmful m_tag performance increasing?
Even if it is used for experimentation, it is better to reduce known
performance issue, I think.

(Q2) How do I decide the data is too large or not?
e.g. ALTQ case, the data is struct altq_pktattr whose members are void *,
int, and void *.
http://nxr.netbsd.org/xref/src/sys/altq/if_altq.h#86
Is this three member struct too large?
another PACKET_TAG_PF m_tag case, the data is struct pf_mtag
http://nxr.netbsd.org/xref/src/sys/dist/pf/net/pf_mtag.h#46
How about this pattern?

Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: dtrace(1) error in the kernel built by build.sh kernel=GENERIC

2016-04-04 Thread Kengo NAKAHARA

Hi,

On 2016/04/04 2:51, Taylor R Campbell wrote:
>Date: Fri, 1 Apr 2016 20:04:01 +0900
>From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
> 
>When I use the kernel built by "build.sh kernel=GENERIC" (the source is
>updated at March 31), I find dtrace(1) error which prints the following
>message.
>== NG result ==
># dtrace -n 'proc::: {}'
>dtrace: invalid probe specifier proc::: {}: "/usr/lib/dtrace/psinfo.d", 
> line 46: failed to copy type of 'pr_arglen': Type information is in parent 
> and unavailable
>== NG result ==
> 
> I think this is a symptom of the overflow of CTF data in the kernel
> when compiled with debug data enabled -- but if we omit debug data
> then I think the functionality is substantially reduced.  Someone^TM
> needs to find a way to fix the CTF overflow properly.

Thank you for letting me know the reason.

Hmm..., I try to find out how to fix it when I have a time.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: IFQ_ENQUEUE argument refactor (was Re: RFC: ALTQ caller refactor)

2016-04-04 Thread Kengo NAKAHARA

Hi,

Thank you for review and comment.

On 2016/03/29 23:22, Taylor R Campbell wrote:
> Hmm...  It looks like there are some paths out of IFQ_CLASSIFY that do
> not lead to IFQ_ENQUEUE, which would cause a memory leak.  For
> example, in atm_output, IFQ_CLASSIFY is unconditional, but every
> `senderr' branch ends in `goto bad', which skips ifq_enqueue.  When
> the pktattrs were stack-allocated that was OK, but now we need to
> explicitly release them with altq_delete_pattr.
> 
> Except maybe it won't be a memory leak -- maybe it will attempt to
> free(9) a tag that was allocated with pool_cache(9), via m_freem ->
> MFREE -> m_tag_delete_chain -> m_tag_delete -> m_tag_free -> free.
> When all tags were created with m_tag_get that was correct, but now it
> is necessary to clean them up explicitly.

You are right. Hmm..., I'm sorry, I change my objective. I think it could
be better to use pool cache for m_tag itself.

To fix a memory leak you point out, it must free altq_pktattr_tag in
the case that M_PREPEND failed. Considering also other cases, m_tag_delete()
might be modified like the following.

m_tag_delete(struct mbuf *m, struct m_tag *t)
{

#ifdef ALTQ
   if (t->m_tag_id == PACKET_TAG_ALTQ_PATTR) {
   struct altq_pktattr_tag *apt;
   apt = container_of(t, struct altq_pktattr_tag,
   apt_tag);
   altq_m_tag_delete(m, apt);
   return;
   }
#endif
   m_tag_unlink(m, t);
   m_tag_free(t);
}

Hmm, It could be overdoing, however I think it would rather use
pool cache for m_tag itself than add above ALTQ specific code to
m_tag_delete().

So, I implement m_tag pool cache and refactor IFQ_ENQUEUE. Here is
the patch series.
http://www.netbsd.org/~knakahara/mtag-pool-cache-ifq-enqueue/
0001-introduce-pool-cache-for-m_tag.patch
0002-add-dtrace-probe.patch
0003-eliminate-pktattr-argument-from-IFQ_ENQUEUE.patch

Could you comment above patches?

I'm sorry to change later...

Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: missing SDT_PROVIDER_DEFINE(sdt)

2016-04-03 Thread Kengo NAKAHARA

Hi,

Thank you for comments.

On 2016/04/04 11:59, Taylor R Campbell wrote:
>Date: Mon, 04 Apr 2016 07:06:37 +1000
>from: matthew green <m...@eterna.com.au>
> 
>Taylor R Campbell writes:
>> I'm actually a little puzzled by all that.  Why doesn't ipf just
>> define its own provider?  ipf appears to be the only use of the sdt
>> provider.  I'd sooner get rid of DTRACE_PROBE* altogether and just
>> adapt ipf to use its own provider.  But I have no objection if you'd
>> rather just add an SDT_PROVIDER_DEFINE(sdt).
> 
>i assume this was because ipf was ported to solaris and solaris
>provides a (generic?) "sdt" provider, not that it has been ported
>to netbsd dtrace in any way.
> 
> If it's for Solaris compatibility, then maybe it is worthwhile to have
> an `sdt' provider so that we might expect to get more useful D scripts
> from the Solaris world.  (But that's something Someone^TM needs to
> actively work on, which evidently at least I haven't had the
> activation energy for.)

I agree ipf should define own provider. But, I'm sorry to have no
enough time to fix it appropriately. I just want to let some headers
include sys/sdt.h and compile GENERIC kernel...

So, I just add SDT_PROVIER_DEFINE(sdt) to help Someone's future works.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

dtrace(1) error in the kernel built by build.sh kernel=GENERIC

2016-04-01 Thread Kengo NAKAHARA

Hi,

When I use the kernel built by "build.sh kernel=GENERIC" (the source is
updated at March 31), I find dtrace(1) error which prints the following
message.
== NG result ==
# dtrace -n 'proc::: {}'
dtrace: invalid probe specifier proc::: {}: "/usr/lib/dtrace/psinfo.d", line 
46: failed to copy type of 'pr_arglen': Type information is in parent and 
unavailable
== NG result ==

In contrast, when I apply the following patch, the same dtrace(1) command
succeed.
== patch ==
--- a/sys/arch/amd64/conf/GENERIC
+++ b/sys/arch/amd64/conf/GENERIC
@@ -103,7 +103,7 @@ options DDB # in-kernel debugger
 optionsDDB_HISTORY_SIZE=512# enable history editing in DDB
 #options   KGDB# remote debugger
 #options   KGDB_DEVNAME="\"com\"",KGDB_DEVADDR=0x3f8,KGDB_DEVRATE=9600
-makeoptionsDEBUG="-g"  # compile full symbol table for CTF
+#makeoptions   DEBUG="-g"  # compile full symbol table for CTF
 #options   SYSCALL_STATS   # per syscall counts
 #options   SYSCALL_TIMES   # per syscall times
 #options   SYSCALL_TIMES_HASCOUNTER# use 'broken' rdtsc (soekris)
== patch ==

== ok result ==
netbsd-rangeley# dtrace -n 'proc::: {}'
dtrace: description 'proc::: ' matched 12 probes
# do some command such as ls(1)
  7208  :lwp-create 
  7205  :create 
  7207   :lwp-start 
  7203:exec 
  7202:exec-success 
  7204:exit 
  7210  :signal-discard 
== ok result ==

# BTW, when I use the kernel built by "build.sh release", the same
# dtrace(1) command succeed, because it does "nbconfig -U DEBUG".

I am not sure the above patch might be appropriate or not, but I have
no idea to fix this problem by other way.

Does anyone has any comments about this problem?


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

missing SDT_PROVIDER_DEFINE(sdt)

2016-03-31 Thread Kengo NAKAHARA

Hi,

There is SDT_PROVIDER_DECLARE(sdt) in sdt.h.
http://nxr.netbsd.org/xref/src/sys/sys/sdt.h#416
However, there is no SDT_PROVIDER_DEFINE(sdt). I think this is
inconsistent.

Currently, "sdt" provider is not used yet, so this inconsistency doesn't
cause problems. "sdt" provider will be used by ipf if it enable sdt.
That is, when it enabled the following macros.
http://nxr.netbsd.org/xref/src/sys/external/bsd/ipf/netinet/ip_compat.h#2831

I think the following patch is required to prevent future problems such
as ipf enabled sdt.

--- a/sys/kern/kern_sdt.c
+++ b/sys/kern/kern_sdt.c
@@ -71,6 +71,8 @@
 #include 
 #include 
 
+SDT_PROVIDER_DEFINE(sdt);
+
 void sdt_probe_stub(u_int32_t, uintptr_t, uintptr_t,
uintptr_t, uintptr_t, uintptr_t);


Could you comment this patch?


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: IFQ_ENQUEUE argument refactor (was Re: RFC: ALTQ caller refactor)

2016-03-29 Thread Kengo NAKAHARA

Hi,

Thank you for comments.

On 2016/03/25 23:05, Taylor R Campbell wrote:
>Date: Fri, 25 Mar 2016 17:26:14 +0900
>From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
> 
>I rebase and modify a little(use container_of). Here is the patch.
>
> http://www.netbsd.org/~knakahara/ifq-enqueue-refactor/ifq-enqueue-refactor.patch
> 
>Does anyone have Any comments? Any comments are welcome.
> 
> Minor nit in altq_m_tag_get: KNF style is `return foo;', not `return
> (foo);'.

Thanks, I fix it. There are similar issues in altq_subr.c, I will fix
them where I modify around them.

> altq_get_pattr should use container_of too, not pointer arithmetic:
> 
>   mtag = m_tag_find(m, PACKET_TAG_ALTQ_PATTR, NULL);
>   if (mtag != NULL) {
>   struct altq_pktattr_tag *apt =
>   container_of(mtag, struct altq_pktattr_tag, apt_tag);
>   return >apt_pktattr;
>   }
> 
> Either or altq_get_pattr or IFQ_ENQUEUE should KASSERT(pattr != NULL)
> or similar.

Hmm, I think the functions which are set to ifq->altq_enqueue by
altq_attach() can accept NULL and handle it. The functions are following.
- blue_enqueue()@sys/altq/altq_blue.c
- cbq_enqueue()@sys/altq/altq_cbq.c
- fifoq_enqueue()@sys/altq/altq_fifoq.c
- hfsc_enqueue()@sys/altq/altq_hfsc.c
- jobs_enqueue()@sys/altq/altq_jobs.c
- priq_enqueue()@sys/altq/altq_priq.c
- red_enqueue()@sys/altq/altq_red.c
- rio_enqueue()@sys/altq/altq_rio.c
- wfq_ifenqueue()@sys/altq/altq_wfq.c

So, I think KASSERT may be too strict. Could you comment it?
BTW, it is hard to know whether altq_m_tag_get() succeed or not in
above patch, so I add some logs when altq_m_tag_get() failed.


> Why add struct ifnet::if_altq_classify?  Why not just teach
> altq_etherclassify to call altq_set_pattr?  From the comments above
> altq_etherclassify, it sounds like it's meant to be a temporary
> workaround for a deficiency in altq.  Especially the just-in-time
> assignment of ifp->if_altq_classify in ieee80211_deliver_data seems
> wrong to me.

You are quite right. I revert the modification about
ifnet::if_altq_classify and fix altq_etherclassify() to use
altq_m_tag_get().

Here is the updated patch.

http://www.netbsd.org/~knakahara/ifq-enqueue-refactor/ifq-enqueue-refactor-2.patch

Could you comment it?
And, does anyone have any comments? Any comments are welcome.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

Re: IFQ_ENQUEUE argument refactor (was Re: RFC: ALTQ caller refactor)

2016-03-25 Thread Kengo NAKAHARA

Hi,

On 2016/03/07 22:20, Kengo NAKAHARA wrote:
> Hi,
> 
> On 2016/03/07 19:20, Joerg Sonnenberger wrote:
>> On Tue, Feb 23, 2016 at 01:52:40PM +0900, Kengo NAKAHARA wrote:
>>> Hi,
>>>
>>> I want to uniform IFQ_ENQUEUE arguments whether ALTQ is defined or not.
>>> # And I also want to eliminate ALTQ_DECL and ALTQ_COMMA macros...
>>>  So, I make struct altq_pktattr from IFQ_CLASSIFY and IFQ_ENQUEUE argument
>>> to m_tag. Here is the patch.
>>> 
>>> http://netbsd.org/~knakahara/altq-caller-refactor/altq-caller-refactor.patch
>>>
>>> In above patch, I fix performance issue by the following two ways.
>>> - make ALTQ's m_tag pool cache
>>> - inline altq_*_pattr functions
>>> Without these two fixes, the enabled ALTQ throughput degraded about 10%.
>>> However, with these two fixes, the throughput is almost the same
>>> performance.
>>
>> Is there a good reason to keep the complexity of the pattr attribute
>> around? From working on the ALTQ code a few years ago I remember that it
>> is only used by some of the more experimental mechanisms and not used by
>> the few queueing disciplines people normally care about. It might be
>> easier to just get rid of it.
> 
> No, there is no good reason. But, that is out of my scope now.
> Sorry about confusing mail title, I want to remove pattr argument from
> IFQ_ENQUEUE to support TX multiqueue. The lack of TX multiqueue causes
> critical scaling problem because of lock contention in IFQ_ENQUEUE and
> ifp->if_start. To avoid this lock contention, that is, support TX
> multiqueue, I think it is require to support ifp->if_transmit() which
> pass mbuf to ifp directly without IFQ_ENQUEUE. To implement this
> interface with backward compatibility(keep IFQ_ENQUEUE + ifp->if_start
> interface for many network device drivers), it is required to remove
> pattr from IFQ_ENQUEUE.
> 
> BTW, I am not sure that it is the best solution to put struct altq_pktattr
> to mbuf to simplify ALTQ... As you may know, the comment in ip_input()
> says "ALTQ is changed to use a pfil hook".
> http://nxr.netbsd.org/xref/src/sys/netinet/ip_input.c#554 
> FreeBSD and OpenBSD modified ALTQ by such way. However, I am not sure
> that whether the way fits to NetBSD...

I rebase and modify a little(use container_of). Here is the patch.

http://www.netbsd.org/~knakahara/ifq-enqueue-refactor/ifq-enqueue-refactor.patch

Does anyone have Any comments? Any comments are welcome.


Thanks,

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA <k-nakah...@iij.ad.jp>

1 2 >

1 - 100 of 176 matches

Mail list logo