Re: kernel v4.8: iptables logs are truncated with the 4.8 kernel?
On Tue, 11 Oct 2016, Liping Zhang wrote:
> Yes, thanks for clarifying this. There's a bug in kernel, can you try
> this patch:
>
> diff --git a/net/netfilter/xt_NFLOG.c b/net/netfilter/xt_NFLOG.c
> index 018eed7..8c069b4 100644
> --- a/net/netfilter/xt_NFLOG.c
> +++ b/net/netfilter/xt_NFLOG.c
> @@ -32,6 +32,7 @@ nflog_tg(struct sk_buff *skb, const struct xt_action_param *par)
>          li.u.ulog.copy_len = info->len;
>          li.u.ulog.group = info->group;
>          li.u.ulog.qthreshold = info->threshold;
> +        li.u.ulog.flags = 0;
>
>          if (info->flags & XT_NFLOG_F_COPY_LEN)
>                  li.u.ulog.flags |= NF_LOG_F_COPY_LEN;

I have tested the above patch with 4.8.1, with and without nflog-size
defined in an iptables configuration, and it works well. The ulogd-2.0.5
segfaults no longer happen when nflog-size is not present in a target.

I recommend this fix.

Thanks,
Chris
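For readers following along, here is a minimal userspace sketch of why the one-line initialization matters: the code only ever ORs a bit into flags, so the field has to start from a known value. The names log_info, F_COPY_LEN, make_info and effective_len are hypothetical stand-ins, and the length-selection logic is an assumption about how a consumer would honour the flag, not a copy of the kernel's code.

```c
#include <stdint.h>

#define F_COPY_LEN 0x1u  /* hypothetical stand-in for NF_LOG_F_COPY_LEN */

/* Hypothetical, reduced version of the log parameters. */
struct log_info {
	uint32_t copy_len;
	uint16_t flags;
};

/* Fill in the parameters the way the patched target does:
 * clear flags first, then OR in the optional bit.
 */
static struct log_info make_info(uint32_t len, int user_set_size)
{
	struct log_info li;

	li.copy_len = len;
	li.flags = 0;                  /* the line the patch adds */
	if (user_set_size)
		li.flags |= F_COPY_LEN;
	return li;
}

/* Assumed consumer logic: honour copy_len only when the flag is set,
 * otherwise hand over the whole packet.
 */
static uint32_t effective_len(const struct log_info *li, uint32_t pkt_len)
{
	if ((li->flags & F_COPY_LEN) && li->copy_len < pkt_len)
		return li->copy_len;
	return pkt_len;
}
```

With flags cleared, a rule without a size gets the whole packet; if flags were left with stray bits set, copy_len (0 when unspecified) could be honoured by accident, which matches the len=0 seen in the backtrace.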
Re: kernel v4.8: iptables logs are truncated with the 4.8 kernel?
On Mon, 10 Oct 2016, Liping Zhang wrote:
> 2016-10-10 15:02 GMT+08:00 Chris Caputo <ccap...@alt.net>:
> > Program received signal SIGSEGV, Segmentation fault.
> > 0x765fd18a in _interp_iphdr (pi=0x617f50, len=0) at
> > ulogd_raw2packet_BASE.c:720
> >
> > 715 static int _interp_iphdr(struct ulogd_pluginstance *pi, uint32_t len)
> > 716 {
> > 717         struct ulogd_key *ret = pi->output.keys;
> > 718         struct iphdr *iph =
> > 719                 ikey_get_ptr(&pi->input.keys[INKEY_RAW_PCKT]);
> > 720         void *nexthdr = (uint32_t *)iph + iph->ihl;
> >
> > I believe 7643507fe8b5bd8ab7522f6a81058cc1209d2585 changed previous
> > behavior by not always copying IP header data to user space.
> >
> > On my machine IPv4 log packets result in a ulogd segfault while IPv6
> > packets do not. I'm not sure of the cause of the difference.
> >
> > The corresponding userspace commit for the 209d2585 kernel change is:
> >
> > https://git.netfilter.org/iptables/commit/?id=7070b1f3c88a0c3d4e315c00cca61f05b0fbc882
> >
> > This adds --nflog-size to iptables. When --nflog-size is used with my
> > iptables NFLOG lines, the ulogd-2.0.5 segfaults cease.
>
> What numbers did you specify after --nflog-size option?
> --nflog-size 0 or ...? If you want log the whole packet to
> the ulogd, please do not specify this nflog-size option.

Not specifying nflog-size does not appear to log the whole packet...

If "--nflog-size" is unspecified, and the iptables config is left
unchanged when the kernel is upgraded to 4.8, ulogd-2.0.5 crashes.

If "--nflog-size 0" is used, ulogd-2.0.5 crashes.

If "--nflog-size" is used with size 1 or greater, ulogd-2.0.5 is fine.

> > I'm surprised to see a kernel change cause unexpected userspace segfaults,
> > so further investigation into a kernel fix would seem a good idea.
>
> According to the original user's manual, nflog-range option was
> designed to be the number of bytes copied to userspace, but
> unfortunately there's a bug from the beginning and it never works,
> i.e. in kernel, it just ignored this option.
>
> Try to change the current nflog-range option's semantics may
> cause unexpected results(maybe like this ulogd crash) ...
>
> In order to keep compatibility, Vishwanath introduce a new
> nflog-size option and keep nflog-range unchanged. If you just
> upgrade the kernel, and do not change iptables rules, this
> problem will not happen.

I am reporting that the problem does happen simply with an upgrade to
kernel 4.8 and no other changes. When "--nflog-size" is unspecified or set
to 0, the bug in ulogd-2.0.5 gets triggered.

I agree there is a bug in ulogd-2.0.5 that this kernel change exposed, but
I am trying to explain that all ulogd users risk this segfault if they
upgrade to kernel 4.8 and don't either update to a fixed ulogd (possibly
using your patch below) or an unreleased iptables with iptables config
changes to implement nflog-size on each NFLOG target.

> So I think this is ulogd's bug, in _interp_iphdr, it try to
> dereference the iphdr pointer before validation check, meanwhile
> this problem does not exist in ipv6 path. Can you try this patch:
>
> diff --git a/filter/raw2packet/ulogd_raw2packet_BASE.c b/filter/raw2packet/ulogd_raw2packet_BASE.c
> index 8a6180c..fd2665a 100644
> --- a/filter/raw2packet/ulogd_raw2packet_BASE.c
> +++ b/filter/raw2packet/ulogd_raw2packet_BASE.c
> @@ -717,7 +717,7 @@ static int _interp_iphdr(struct ulogd_pluginstance *pi, uint32_t len)
>          struct ulogd_key *ret = pi->output.keys;
>          struct iphdr *iph =
>                  ikey_get_ptr(&pi->input.keys[INKEY_RAW_PCKT]);
> -        void *nexthdr = (uint32_t *)iph + iph->ihl;
> +        void *nexthdr;
>
>          if (len < sizeof(struct iphdr) || len <= (uint32_t)(iph->ihl * 4))
>                  return ULOGD_IRET_OK;
> @@ -734,6 +734,7 @@ static int _interp_iphdr(struct ulogd_pluginstance *pi, uint32_t len)
>          okey_set_u16(&ret[KEY_IP_ID], ntohs(iph->id));
>          okey_set_u16(&ret[KEY_IP_FRAGOFF], ntohs(iph->frag_off));
>
> +        nexthdr = (uint32_t *)iph + iph->ihl;
>          switch (iph->protocol) {
>          case IPPROTO_TCP:
>                  _interp_tcp(pi, nexthdr, len);

I agree this will likely fix ulogd, but this misses the point about the
new kernel defaulting to a zero size return when it used to return the
packet.

Thanks,
Chris
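The ordering problem in _interp_iphdr (reading iph->ihl before checking len) can be shown in isolation. This is only a sketch of the validate-then-dereference pattern on a raw byte buffer, not ulogd code: next_header and IPV4_MIN_HDR are hypothetical names, and the NULL check and minimum-IHL check go slightly beyond what the proposed ulogd patch does.

```c
#include <stddef.h>
#include <stdint.h>

#define IPV4_MIN_HDR 20u  /* size of an IPv4 header without options */

/* Return a pointer to the transport header, or NULL when the captured
 * data is too short. Nothing is read from pkt until len is checked.
 */
static const uint8_t *next_header(const uint8_t *pkt, uint32_t len)
{
	uint32_t ihl_bytes;

	if (pkt == NULL || len < IPV4_MIN_HDR)
		return NULL;

	/* IHL is the low nibble of the first byte, in 32-bit words. */
	ihl_bytes = (uint32_t)(pkt[0] & 0x0f) * 4;
	if (ihl_bytes < IPV4_MIN_HDR || len <= ihl_bytes)
		return NULL;

	return pkt + ihl_bytes;
}
```

A zero-length capture, which is what kernel 4.8 delivered here, returns NULL before any byte of the packet is touched.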
Re: kernel v4.8: iptables logs are truncated with the 4.8 kernel?
On Tue, 4 Oct 2016, Justin Piszcz wrote:
> kernel 4.8 with ulogd-2.0.5- IPs are no longer logged:
>
> Oct 4 17:51:30 atom INPUT_BLOCK IN=eth1 OUT=
> MAC=00:1b:21:9c:3b:fa:3e:94:d5:d2:49:1e:08:00 LEN=0 TOS=00 PREC=0x00
> TTL=0 ID=0 PROTO=0 MARK=0
> Oct 4 17:51:31 atom INPUT_BLOCK IN=eth1 OUT=
> MAC=00:1b:21:9c:3b:fa:3e:94:d5:d2:49:1e:08:00 LEN=0 TOS=00 PREC=0x00
> TTL=0 ID=0 PROTO=0 MARK=0
> Oct 4 17:51:32 atom INPUT_BLOCK IN=eth1 OUT=
> MAC=00:1b:21:9c:3b:fa:3e:94:d5:d2:49:1e:08:00 LEN=0 TOS=00 PREC=0x00
> TTL=0 ID=0 PROTO=0 MARK=0
>
> (reboot back to kernel 4.7, works fine)
>
> kernel 4.7 with ulogd-2.0.5:
> Oct 4 17:56:44 atom INPUT_BLOCK IN=eth1 OUT=
> MAC=00:1b:21:9c:3b:fa:3e:94:d5:d2:49:1e:08:00 SRC=74.125.22.125
> DST=1.2.3.4 LEN=397 TOS=00 PREC=0x00 TTL=48 ID=58093 PROTO=TCP
> SPT=5222 DPT=19804 SEQ=2032644254 ACK=2273184383 WINDOW=55272 ACK PSH
> URGP=0 MARK=0
> Oct 4 17:56:45 atom INPUT_BLOCK IN=eth1 OUT=
> MAC=00:1b:21:9c:3b:fa:3e:94:d5:d2:49:1e:08:00 SRC=74.125.22.125
> DST=1.2.3.4 LEN=397 TOS=00 PREC=0x00 TTL=48 ID=58725 PROTO=TCP
> SPT=5222 DPT=19804 SEQ=2032644254 ACK=2273184383 WINDOW=55272 ACK PSH
> URGP=0 MARK=0
>
> Looks like there were some changes in the 4.8 kernel regarding ulogd,
> has anyone else run into this problem?

For me, kernel 4.8.1 results in segfaults in ulogd-2.0.5 at:

  Program received signal SIGSEGV, Segmentation fault.
  0x765fd18a in _interp_iphdr (pi=0x617f50, len=0) at
  ulogd_raw2packet_BASE.c:720

  715 static int _interp_iphdr(struct ulogd_pluginstance *pi, uint32_t len)
  716 {
  717         struct ulogd_key *ret = pi->output.keys;
  718         struct iphdr *iph =
  719                 ikey_get_ptr(&pi->input.keys[INKEY_RAW_PCKT]);
  720         void *nexthdr = (uint32_t *)iph + iph->ihl;

I believe 7643507fe8b5bd8ab7522f6a81058cc1209d2585 changed previous
behavior by not always copying IP header data to user space.

On my machine IPv4 log packets result in a ulogd segfault while IPv6
packets do not. I'm not sure of the cause of the difference.

The corresponding userspace commit for the 209d2585 kernel change is:

  https://git.netfilter.org/iptables/commit/?id=7070b1f3c88a0c3d4e315c00cca61f05b0fbc882

This adds --nflog-size to iptables. When --nflog-size is used with my
iptables NFLOG lines, the ulogd-2.0.5 segfaults cease.

I'm surprised to see a kernel change cause unexpected userspace segfaults,
so further investigation into a kernel fix would seem a good idea.

Having to add the likes of "--nflog-size 200" (200 simply being what I am
using) to every NFLOG line in firewall configs is a significant burden for
many.

Putting out a new release of iptables may help ease this transition if the
kernel is not patched to fix this. I had to use the git code since 1.6.0
doesn't have it.

Chris
Re: [PATCH 1/3] IPVS: add wlib & wlip schedulers
On Fri, 23 Jan 2015, Julian Anastasov wrote:
>         Hello,
>
> On Tue, 20 Jan 2015, Chris Caputo wrote:
> > My application consists of incoming TCP streams being load balanced to
> > servers which receive the feeds. These are long lived multi-gigabyte
> > streams, and so I believe the estimator's 2-second timer is fine. As an
> > example:
> >
> > # cat /proc/net/ip_vs_stats
> > Total Incoming Outgoing Incoming Outgoing
> > Conns Packets Packets Bytes Bytes
> > 9AB 58B7C170 1237CA2C3250
> >
> > Conns/s Pkts/s Pkts/s Bytes/s Bytes/s
> > 1 387C0 B16C4AE0
>
>         All other schedulers react and see different
> picture after every new connection. The worst example
> is WLC where slow-start mechanism is desired because
> idle server can be overloaded before the load is noticed
> properly. Even WRR accounts every connection in its state.
>
>         Your setup may expect low number of connections per
> second but for other kind of setups sending all connections
> to same server for 2 seconds looks scary. In fact, what
> changes is the position, so we rotate only among the
> least loaded servers that look equally loaded but it is
> one server in the common case. And as our stats are per
> CPU and designed for human reading, it is difficult to
> read them often for other purposes. We need a good idea
> to solve this problem, so that we can have faster feedback
> after every scheduling.

This is exactly why my wlib/wlip code is a hybrid of wlc and rr. Last
location is saved, and search is started after it. Thus when traffic is
zero, round-robin occurs. When flows already exist, bursts of new
connections do choose poorly based on repeated use of last estimation, but
working around that seems overly complex.

> > > May be not so useful idea: use sum of both directions
> > > or control it with svc->flags & IP_VS_SVC_F_SCHED_WLIB_xxx
> > > flags, see how "sh" scheduler supports flags. I.e.
> > > inbps + outbps.
> >
> > I see a user-mode option as increasing complexity. For example,
> > keepalived users would need to have keepalived patched to support the new
> > algorithm, due to flags, rather than just configuring "wlib" or "wlip" and
> > it just working.
>
>         That is also true.
>
> > I think I'd rather see a wlob/wlop version for users that want to
> > load-balance based on outgoing bytes/packets, and a wlb/wlp version for
> > users that want them summed.
>
>         ok
>
> > From: Chris Caputo
> >
> > IPVS: Change inbps and outbps to 64-bits so that estimator handles faster
> > flows. Also increases maximum viewable at user level from ~2.15Gbits/s to
> > ~34.35Gbits/s.
>
>         Yep, we are limited from u32 in user space structs.
> I have to think how to solve this problem.
>
> 1gbit => ~1.5 million pps
> 10gbit => ~15 million pps
> 100gbit => ~150 million pps
>
> > Signed-off-by: Chris Caputo
> > ---
> > diff -uprN linux-3.19-rc5-stock/include/net/ip_vs.h linux-3.19-rc5/include/net/ip_vs.h
> > --- linux-3.19-rc5-stock/include/net/ip_vs.h 2015-01-18 06:02:20.0 +
> > +++ linux-3.19-rc5/include/net/ip_vs.h 2015-01-20 08:01:15.548177969 +
> > @@ -390,8 +390,8 @@ struct ip_vs_estimator {
> >          u32 cps;
> >          u32 inpps;
> >          u32 outpps;
> > -        u32 inbps;
> > -        u32 outbps;
> > +        u64 inbps;
> > +        u64 outbps;
>
>         Not sure, may be everything here should be u64 because
> we have shifted values. I'll need some days to investigate
> this issue...
>
> Regards
>
> --
> Julian Anastasov

Sounds good and thanks!

Chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/3] IPVS: add wlib & wlip schedulers
From: Chris Caputo

IPVS wlib (Weighted Least Incoming Byterate) and wlip (Weighted Least Incoming
Packetrate) scheduler docs for ipvsadm-1.27.

Signed-off-by: Chris Caputo
---
diff -upr ipvsadm-1.27-stock/SCHEDULERS ipvsadm-1.27/SCHEDULERS
--- ipvsadm-1.27-stock/SCHEDULERS 2013-09-06 08:37:27.0 +
+++ ipvsadm-1.27/SCHEDULERS 2015-01-17 22:14:32.812597191 +
@@ -1 +1 @@
-rr|wrr|lc|wlc|lblc|lblcr|dh|sh|sed|nq
+rr|wrr|lc|wlc|lblc|lblcr|dh|sh|sed|nq|wlib|wlip
diff -upr ipvsadm-1.27-stock/ipvsadm.8 ipvsadm-1.27/ipvsadm.8
--- ipvsadm-1.27-stock/ipvsadm.8 2013-09-06 08:37:27.0 +
+++ ipvsadm-1.27/ipvsadm.8 2015-01-17 22:14:32.812597191 +
@@ -261,6 +261,14 @@ fixed service rate (weight) of the ith s
 \fBnq\fR - Never Queue: assigns an incoming job to an idle server if
 there is, instead of waiting for a fast one; if all the servers are
 busy, it adopts the Shortest Expected Delay policy to assign the job.
+.sp
+\fBwlib\fR - Weighted Least Incoming Byterate: directs network
+connections to the real server with the least incoming byterate
+normalized by the server weight.
+.sp
+\fBwlip\fR - Weighted Least Incoming Packetrate: directs network
+connections to the real server with the least incoming packetrate
+normalized by the server weight.
 .TP
 .B -p, --persistent [\fItimeout\fP]
 Specify that a virtual service is persistent. If this option is
[PATCH 1/3] IPVS: add wlib & wlip schedulers
On Tue, 20 Jan 2015, Julian Anastasov wrote:
> On Sat, 17 Jan 2015, Chris Caputo wrote:
> > From: Chris Caputo
> >
> > IPVS wlib (Weighted Least Incoming Byterate) and wlip (Weighted Least
> > Incoming Packetrate) schedulers, updated for 3.19-rc4.

Hi Julian,

Thanks for the review.

>         The IPVS estimator uses 2-second timer to update
> the stats, isn't that a problem for such schedulers?
> Also, you schedule by incoming traffic rate which is
> ok when clients mostly upload. But in the common case
> clients mostly download and IPVS processes download
> traffic only for NAT method.

My application consists of incoming TCP streams being load balanced to
servers which receive the feeds. These are long lived multi-gigabyte
streams, and so I believe the estimator's 2-second timer is fine. As an
example:

# cat /proc/net/ip_vs_stats
Total Incoming Outgoing Incoming Outgoing
Conns Packets Packets Bytes Bytes
9AB 58B7C170 1237CA2C3250

Conns/s Pkts/s Pkts/s Bytes/s Bytes/s
1 387C0 B16C4AE0

>         May be not so useful idea: use sum of both directions
> or control it with svc->flags & IP_VS_SVC_F_SCHED_WLIB_xxx
> flags, see how "sh" scheduler supports flags. I.e.
> inbps + outbps.

I see a user-mode option as increasing complexity. For example,
keepalived users would need to have keepalived patched to support the new
algorithm, due to flags, rather than just configuring "wlib" or "wlip" and
it just working.

I think I'd rather see a wlob/wlop version for users that want to
load-balance based on outgoing bytes/packets, and a wlb/wlp version for
users that want them summed.

>         Another problem: pps and bps are shifted values,
> see how ip_vs_read_estimator() reads them. ip_vs_est.c
> contains comments that this code handles couple of
> gigabits. May be inbps and outbps in struct ip_vs_estimator
> should be changed to u64 to support more gigabits, with
> separate patch.

See patch below to convert bps in ip_vs_estimator to 64-bits.

Other patches, based on your feedback, to follow.

Thanks,
Chris

From: Chris Caputo

IPVS: Change inbps and outbps to 64-bits so that estimator handles faster
flows. Also increases maximum viewable at user level from ~2.15Gbits/s to
~34.35Gbits/s.

Signed-off-by: Chris Caputo
---
diff -uprN linux-3.19-rc5-stock/include/net/ip_vs.h linux-3.19-rc5/include/net/ip_vs.h
--- linux-3.19-rc5-stock/include/net/ip_vs.h 2015-01-18 06:02:20.0 +
+++ linux-3.19-rc5/include/net/ip_vs.h 2015-01-20 08:01:15.548177969 +
@@ -390,8 +390,8 @@ struct ip_vs_estimator {
         u32 cps;
         u32 inpps;
         u32 outpps;
-        u32 inbps;
-        u32 outbps;
+        u64 inbps;
+        u64 outbps;
 };
 
 struct ip_vs_stats {
diff -uprN linux-3.19-rc5-stock/net/netfilter/ipvs/ip_vs_est.c linux-3.19-rc5/net/netfilter/ipvs/ip_vs_est.c
--- linux-3.19-rc5-stock/net/netfilter/ipvs/ip_vs_est.c 2015-01-18 06:02:20.0 +
+++ linux-3.19-rc5/net/netfilter/ipvs/ip_vs_est.c 2015-01-20 08:01:34.369840704 +
@@ -45,10 +45,12 @@
 
   NOTES.
 
-  * The stored value for average bps is scaled by 2^5, so that maximal
-    rate is ~2.15Gbits/s, average pps and cps are scaled by 2^10.
+  * Average bps is scaled by 2^5, while average pps and cps are scaled by 2^10.
 
-  * A lot code is taken from net/sched/estimator.c
+  * All are reported to user level as 32 bit unsigned values. Bps can
+    overflow for fast links : max speed being ~34.35Gbits/s.
+
+  * A lot of code is taken from net/core/gen_estimator.c
  */
 
@@ -98,7 +100,7 @@ static void estimation_timer(unsigned lo
         u32 n_conns;
         u32 n_inpkts, n_outpkts;
         u64 n_inbytes, n_outbytes;
-        u32 rate;
+        u64 rate;
         struct net *net = (struct net *)arg;
         struct netns_ipvs *ipvs;
 
@@ -118,23 +120,24 @@ static void estimation_timer(unsigned lo
                 /* scaled by 2^10, but divided 2 seconds */
                 rate = (n_conns - e->last_conns) << 9;
                 e->last_conns = n_conns;
-                e->cps += ((long)rate - (long)e->cps) >> 2;
+                e->cps += ((s64)rate - (s64)e->cps) >> 2;
 
                 rate = (n_inpkts - e->last_inpkts) << 9;
                 e->last_inpkts = n_inpkts;
-                e->inpps += ((long)rate - (long)e->inpps) >> 2;
+                e->inpps += ((s64)rate - (s64)e->inpps) >> 2;
 
                 rate = (n_outpkts - e->last_outpkts) << 9;
                 e->last_outpkts = n_outpkts;
-                e->outpps += ((long)r
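The estimator update in the hunk above is an exponentially weighted moving average kept in fixed point. As a rough userspace sketch of one tick for a packet rate, assuming the same constants as the patch (scale 2^10, 2-second interval, quarter-step smoothing): est_tick_pps and est_read_pps are hypothetical names, and the read side is simplified to a plain shift, which may differ from how the kernel rounds.

```c
#include <stdint.h>

/* One estimator tick for a packet rate. The 2-second packet delta is
 * shifted left by 9 (scale by 2^10, divide by 2 seconds), then the
 * stored average moves a quarter of the way toward the new sample.
 * The right shift of a negative value relies on arithmetic shift,
 * as the kernel code does.
 */
static uint64_t est_tick_pps(uint64_t avg_scaled, uint32_t pkts_in_2s)
{
	uint64_t rate = (uint64_t)pkts_in_2s << 9;
	int64_t diff = (int64_t)rate - (int64_t)avg_scaled;

	return avg_scaled + (uint64_t)(diff >> 2);
}

/* Simplified read: drop the 2^10 scaling to get packets per second. */
static uint64_t est_read_pps(uint64_t avg_scaled)
{
	return avg_scaled >> 10;
}
```

Starting from zero with a steady 1000 pps (2000 packets per tick), the reported average is 250 pps after one tick and 437 pps after two, which is the slow ramp Julian is concerned about for setups with many short connections.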
[PATCH 2/3] IPVS: add wlib & wlip schedulers
On Tue, 20 Jan 2015, Julian Anastasov wrote:
> > +		    (u64)dr * (u64)lwgt < (u64)lr * (u64)dwgt ||
[...]
> > +		    (dr == lr && dwgt > lwgt)) {
>
> Above check is redundant.

I accepted your feedback and applied it to the below, except for this item.
I believe if dr and lr are zero (no traffic), we still want to choose the
higher weight, thus a separate comparison is needed.

Thanks,
Chris

From: Chris Caputo

IPVS wlib (Weighted Least Incoming Byterate) and wlip (Weighted Least Incoming
Packetrate) schedulers, updated for 3.19-rc5.

Signed-off-by: Chris Caputo
---
diff -uprN linux-3.19-rc5-stock/net/netfilter/ipvs/Kconfig linux-3.19-rc5/net/netfilter/ipvs/Kconfig
--- linux-3.19-rc5-stock/net/netfilter/ipvs/Kconfig	2015-01-18 06:02:20.0 +
+++ linux-3.19-rc5/net/netfilter/ipvs/Kconfig	2015-01-20 08:08:28.883080285 +
@@ -240,6 +240,26 @@ config IP_VS_NQ
 	  If you want to compile it in kernel, say Y. To compile it as a
 	  module, choose M here. If unsure, say N.
 
+config IP_VS_WLIB
+	tristate "weighted least incoming byterate scheduling"
+	---help---
+	  The weighted least incoming byterate scheduling algorithm directs
+	  network connections to the server with the least incoming byterate
+	  normalized by the server weight.
+
+	  If you want to compile it in kernel, say Y. To compile it as a
+	  module, choose M here. If unsure, say N.
+
+config IP_VS_WLIP
+	tristate "weighted least incoming packetrate scheduling"
+	---help---
+	  The weighted least incoming packetrate scheduling algorithm directs
+	  network connections to the server with the least incoming packetrate
+	  normalized by the server weight.
+
+	  If you want to compile it in kernel, say Y. To compile it as a
+	  module, choose M here. If unsure, say N.
+
 comment 'IPVS SH scheduler'
 
 config IP_VS_SH_TAB_BITS
diff -uprN linux-3.19-rc5-stock/net/netfilter/ipvs/Makefile linux-3.19-rc5/net/netfilter/ipvs/Makefile
--- linux-3.19-rc5-stock/net/netfilter/ipvs/Makefile	2015-01-18 06:02:20.0 +
+++ linux-3.19-rc5/net/netfilter/ipvs/Makefile	2015-01-20 08:08:28.883080285 +
@@ -33,6 +33,8 @@ obj-$(CONFIG_IP_VS_DH) += ip_vs_dh.o
 obj-$(CONFIG_IP_VS_SH) += ip_vs_sh.o
 obj-$(CONFIG_IP_VS_SED) += ip_vs_sed.o
 obj-$(CONFIG_IP_VS_NQ) += ip_vs_nq.o
+obj-$(CONFIG_IP_VS_WLIB) += ip_vs_wlib.o
+obj-$(CONFIG_IP_VS_WLIP) += ip_vs_wlip.o
 
 # IPVS application helpers
 obj-$(CONFIG_IP_VS_FTP) += ip_vs_ftp.o
diff -uprN linux-3.19-rc5-stock/net/netfilter/ipvs/ip_vs_wlib.c linux-3.19-rc5/net/netfilter/ipvs/ip_vs_wlib.c
--- linux-3.19-rc5-stock/net/netfilter/ipvs/ip_vs_wlib.c	1970-01-01 00:00:00.0 +
+++ linux-3.19-rc5/net/netfilter/ipvs/ip_vs_wlib.c	2015-01-20 08:09:00.177816054 +
@@ -0,0 +1,166 @@
+/* IPVS:        Weighted Least Incoming Byterate Scheduling module
+ *
+ * Authors:     Chris Caputo based on code by:
+ *
+ *                  Wensong Zhang
+ *                  Peter Kese
+ *                  Julian Anastasov
+ *
+ *              This program is free software; you can redistribute it and/or
+ *              modify it under the terms of the GNU General Public License
+ *              as published by the Free Software Foundation; either version
+ *              2 of the License, or (at your option) any later version.
+ *
+ * Changes:
+ *     Chris Caputo: Based code on ip_vs_wlc.c ip_vs_rr.c.
+ *
+ */
+
+/* The WLIB algorithm uses the results of the estimator's inbps
+ * calculations to determine which real server has the lowest incoming
+ * byterate.
+ *
+ * Real server weight is factored into the calculation.  An example way to
+ * use this is if you have one server that can handle 100 Mbps of input and
+ * another that can handle 1 Gbps you could set the weights to be 100 and 1000
+ * respectively.
+ */
+
+#define KMSG_COMPONENT "IPVS"
+#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+
+#include <net/ip_vs.h>
+
+static int
+ip_vs_wlib_init_svc(struct ip_vs_service *svc)
+{
+	svc->sched_data = &svc->destinations;
+	return 0;
+}
+
+static int
+ip_vs_wlib_del_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest)
+{
+	struct list_head *p;
+
+	spin_lock_bh(&svc->sched_lock);
+	p = (struct list_head *)svc->sched_data;
+	/* dest is already unlinked, so p->prev is not valid but
+	 * p->next is valid, use it to reach previous entry.
+	 */
+	if (p == &dest->n_list)
+		svc->sched_data = p->next->prev;
+	spin_unlock_bh(&svc->sched_lock);
+	return 0;
+}
+
+/* Weighted Least Incoming Byterate scheduling */
+static struct ip_vs_dest *
+ip_vs_wlib_schedule(struct ip_vs_service *svc, const struct sk_buff *skb,
+		    struct ip_vs_iphdr
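The comparison quoted at the top of this message can be tried outside the kernel. The following is a minimal user-space sketch, not the patch's code: `struct dest`, `wlib_prefers()` and `wlib_pick()` are hypothetical names, the rate is assumed to fit in 32 bits so the cross-multiplication cannot overflow 64 bits, and skipping destinations with a non-positive weight is likewise an assumption. It illustrates that the second clause only changes the outcome when both rates are zero, which is the idle case raised in the reply.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a real server: estimated rate and weight. */
struct dest {
	uint32_t rate;   /* incoming byte rate (assumed to fit in 32 bits) */
	int weight;      /* configured server weight */
};

/*
 * Nonzero if candidate (dr, dwgt) should replace the current least
 * (lr, lwgt). dr/dwgt < lr/lwgt is tested by cross-multiplication to
 * avoid division; the second clause is the tie-break for equal rates.
 */
static int wlib_prefers(uint32_t dr, int dwgt, uint32_t lr, int lwgt)
{
	return (uint64_t)dr * (uint64_t)lwgt < (uint64_t)lr * (uint64_t)dwgt ||
	       (dr == lr && dwgt > lwgt);
}

/* Return the preferred destination, or NULL if none is usable. */
static const struct dest *wlib_pick(const struct dest *d, size_t n)
{
	const struct dest *least = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (d[i].weight <= 0)   /* assumption: skip disabled servers */
			continue;
		if (!least || wlib_prefers(d[i].rate, d[i].weight,
					   least->rate, least->weight))
			least = &d[i];
	}
	return least;
}
```

With rates of 100 and 500 and weights of 100 and 1000, the second destination wins because 500 * 100 < 100 * 1000. With both rates at zero, both products are zero and only the tie-break selects the heavier server.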
[PATCH 1/3] IPVS: add wlib & wlip schedulers
On Tue, 20 Jan 2015, Julian Anastasov wrote:
> On Sat, 17 Jan 2015, Chris Caputo wrote:
> > From: Chris Caputo <ccap...@alt.net>
> >
> > IPVS wlib (Weighted Least Incoming Byterate) and wlip (Weighted Least
> > Incoming Packetrate) schedulers, updated for 3.19-rc4.

Hi Julian,

Thanks for the review.

> The IPVS estimator uses 2-second timer to update
> the stats, isn't that a problem for such schedulers?
> Also, you schedule by incoming traffic rate which is
> ok when clients mostly upload. But in the common case
> clients mostly download and IPVS processes download
> traffic only for NAT method.

My application consists of incoming TCP streams being load balanced to
servers which receive the feeds. These are long lived multi-gigabyte
streams, and so I believe the estimator's 2-second timer is fine. As an
example:

# cat /proc/net/ip_vs_stats
   Total Incoming Outgoing Incoming Outgoing
   Conns  Packets  Packets    Bytes    Bytes
     9AB 58B7C170 1237CA2C3250
 Conns/s   Pkts/s   Pkts/s  Bytes/s  Bytes/s
       1    387C0 B16C4AE0

> May be not so useful idea: use sum of both directions
> or control it with svc->flags & IP_VS_SVC_F_SCHED_WLIB_xxx
> flags, see how "sh" scheduler supports flags. I.e.
> inbps + outbps.

I see a user-mode option as increasing complexity. For example,
keepalived users would need to have keepalived patched to support the new
algorithm, due to flags, rather than just configuring "wlib" or "wlip" and
it just working.

I think I'd rather see a wlob/wlop version for users that want to
load-balance based on outgoing bytes/packets, and a wlb/wlp version for
users that want them summed.

> Another problem: pps and bps are shifted values,
> see how ip_vs_read_estimator() reads them. ip_vs_est.c
> contains comments that this code handles couple of
> gigabits. May be inbps and outbps in struct ip_vs_estimator
> should be changed to u64 to support more gigabits, with
> separate patch.

See patch below to convert bps in ip_vs_estimator to 64-bits.

Other patches, based on your feedback, to follow.

Thanks,
Chris

From: Chris Caputo <ccap...@alt.net>

IPVS: Change inbps and outbps to 64-bits so that estimator handles faster
flows. Also increases maximum viewable at user level from ~2.15Gbits/s to
~34.35Gbits/s.

Signed-off-by: Chris Caputo <ccap...@alt.net>
---
diff -uprN linux-3.19-rc5-stock/include/net/ip_vs.h linux-3.19-rc5/include/net/ip_vs.h
--- linux-3.19-rc5-stock/include/net/ip_vs.h	2015-01-18 06:02:20.0 +
+++ linux-3.19-rc5/include/net/ip_vs.h	2015-01-20 08:01:15.548177969 +
@@ -390,8 +390,8 @@ struct ip_vs_estimator {
 	u32			cps;
 	u32			inpps;
 	u32			outpps;
-	u32			inbps;
-	u32			outbps;
+	u64			inbps;
+	u64			outbps;
 };
 
 struct ip_vs_stats {
diff -uprN linux-3.19-rc5-stock/net/netfilter/ipvs/ip_vs_est.c linux-3.19-rc5/net/netfilter/ipvs/ip_vs_est.c
--- linux-3.19-rc5-stock/net/netfilter/ipvs/ip_vs_est.c	2015-01-18 06:02:20.0 +
+++ linux-3.19-rc5/net/netfilter/ipvs/ip_vs_est.c	2015-01-20 08:01:34.369840704 +
@@ -45,10 +45,12 @@
   NOTES.
 
-  * The stored value for average bps is scaled by 2^5, so that maximal
-    rate is ~2.15Gbits/s, average pps and cps are scaled by 2^10.
+  * Average bps is scaled by 2^5, while average pps and cps are scaled by 2^10.
 
-  * A lot code is taken from net/sched/estimator.c
+  * All are reported to user level as 32 bit unsigned values. Bps can
+    overflow for fast links : max speed being ~34.35Gbits/s.
+
+  * A lot of code is taken from net/core/gen_estimator.c
  */
 
@@ -98,7 +100,7 @@ static void estimation_timer(unsigned lo
 	u32 n_conns;
 	u32 n_inpkts, n_outpkts;
 	u64 n_inbytes, n_outbytes;
-	u32 rate;
+	u64 rate;
 	struct net *net = (struct net *)arg;
 	struct netns_ipvs *ipvs;
 
@@ -118,23 +120,24 @@ static void estimation_timer(unsigned lo
 		/* scaled by 2^10, but divided 2 seconds */
 		rate = (n_conns - e->last_conns) << 9;
 		e->last_conns = n_conns;
-		e->cps += ((long)rate - (long)e->cps) >> 2;
+		e->cps += ((s64)rate - (s64)e->cps) >> 2;
 
 		rate = (n_inpkts - e->last_inpkts) << 9;
 		e->last_inpkts = n_inpkts;
-		e->inpps += ((long)rate - (long)e->inpps) >> 2;
+		e->inpps += ((s64)rate - (s64)e->inpps) >> 2;
 
 		rate = (n_outpkts - e->last_outpkts) << 9;
 		e->last_outpkts = n_outpkts;
-		e->outpps += ((long)rate - (long)e->outpps) >> 2;
+		e->outpps += ((s64)rate - (s64)e->outpps) >> 2;
 
+		/* scaled by 2^5, but divided 2 seconds */
 		rate = (n_inbytes - e->last_inbytes) << 4;
 		e
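The arithmetic in the estimation_timer() hunk of this message can be followed in a small user-space sketch. It is an illustration under stated assumptions, not kernel code: `struct est`, `est_update()` and `est_read_bps()` are hypothetical names, the update is assumed to run once per 2-second tick, and the `+ 0xF` rounding on read is an assumption rather than something shown in this patch. Shifting the byte delta left by 4 both scales it by 2^5 and divides it by the 2-second interval; the running average then moves a quarter of the way toward each new sample.

```c
#include <stdint.h>

/* Hypothetical per-direction estimator state. */
struct est {
	uint64_t last_bytes;  /* byte counter seen at the previous tick */
	uint64_t bps;         /* average bytes/s, stored scaled by 2^5 */
};

/* Called once per 2-second tick with the current byte counter. */
static void est_update(struct est *e, uint64_t n_bytes)
{
	/* delta * 2^5 / 2 seconds == delta << 4 */
	uint64_t rate = (n_bytes - e->last_bytes) << 4;

	e->last_bytes = n_bytes;
	/*
	 * avg += (sample - avg) / 4. The difference may be negative, so it
	 * is computed as signed; right-shifting a negative value is
	 * implementation-defined in C and is assumed to be arithmetic here,
	 * as it is with the common compilers.
	 */
	e->bps += (uint64_t)(((int64_t)rate - (int64_t)e->bps) >> 2);
}

/* Undo the 2^5 scaling for reporting (rounding constant is an assumption). */
static uint64_t est_read_bps(const struct est *e)
{
	return (e->bps + 0xF) >> 5;
}
```

Feeding a steady 1000 bytes per tick makes the stored value climb toward 16000 (500 bytes/s times 2^5), and the read helper then reports 500. Once the input stops, the average decays back to zero over the following ticks.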
[PATCH 3/3] IPVS: add wlib & wlip schedulers
From: Chris Caputo <ccap...@alt.net>

IPVS wlib (Weighted Least Incoming Byterate) and wlip (Weighted Least Incoming
Packetrate) scheduler docs for ipvsadm-1.27.

Signed-off-by: Chris Caputo <ccap...@alt.net>
---
diff -upr ipvsadm-1.27-stock/SCHEDULERS ipvsadm-1.27/SCHEDULERS
--- ipvsadm-1.27-stock/SCHEDULERS	2013-09-06 08:37:27.0 +
+++ ipvsadm-1.27/SCHEDULERS	2015-01-17 22:14:32.812597191 +
@@ -1 +1 @@
-rr|wrr|lc|wlc|lblc|lblcr|dh|sh|sed|nq
+rr|wrr|lc|wlc|lblc|lblcr|dh|sh|sed|nq|wlib|wlip
diff -upr ipvsadm-1.27-stock/ipvsadm.8 ipvsadm-1.27/ipvsadm.8
--- ipvsadm-1.27-stock/ipvsadm.8	2013-09-06 08:37:27.0 +
+++ ipvsadm-1.27/ipvsadm.8	2015-01-17 22:14:32.812597191 +
@@ -261,6 +261,14 @@ fixed service rate (weight) of the ith s
 \fBnq\fR - Never Queue: assigns an incoming job to an idle server if
 there is, instead of waiting for a fast one; if all the servers are
 busy, it adopts the Shortest Expected Delay policy to assign the job.
+.sp
+\fBwlib\fR - Weighted Least Incoming Byterate: directs network
+connections to the real server with the least incoming byterate
+normalized by the server weight.
+.sp
+\fBwlip\fR - Weighted Least Incoming Packetrate: directs network
+connections to the real server with the least incoming packetrate
+normalized by the server weight.
 .TP
 .B -p, --persistent [\fItimeout\fP]
 Specify that a virtual service is persistent. If this option is
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] IPVS: add wlib & wlip schedulers
From: Chris Caputo

IPVS wlib (Weighted Least Incoming Byterate) and wlip (Weighted Least Incoming
Packetrate) scheduler docs for ipvsadm-1.27.

Signed-off-by: Chris Caputo
---
diff -upr ipvsadm-1.27-stock/SCHEDULERS ipvsadm-1.27/SCHEDULERS
--- ipvsadm-1.27-stock/SCHEDULERS	2013-09-06 08:37:27.0 +
+++ ipvsadm-1.27/SCHEDULERS	2015-01-17 22:14:32.812597191 +
@@ -1 +1 @@
-rr|wrr|lc|wlc|lblc|lblcr|dh|sh|sed|nq
+rr|wrr|lc|wlc|lblc|lblcr|dh|sh|sed|nq|wlib|wlip
diff -upr ipvsadm-1.27-stock/ipvsadm.8 ipvsadm-1.27/ipvsadm.8
--- ipvsadm-1.27-stock/ipvsadm.8	2013-09-06 08:37:27.0 +
+++ ipvsadm-1.27/ipvsadm.8	2015-01-17 22:14:32.812597191 +
@@ -261,6 +261,14 @@ fixed service rate (weight) of the ith s
 \fBnq\fR - Never Queue: assigns an incoming job to an idle server if
 there is, instead of waiting for a fast one; if all the servers are
 busy, it adopts the Shortest Expected Delay policy to assign the job.
+.sp
+\fBwlib\fR - Weighted Least Incoming Byterate: directs network
+connections to the real server with the least incoming byterate
+normalized by the server weight.
+.sp
+\fBwlip\fR - Weighted Least Incoming Packetrate: directs network
+connections to the real server with the least incoming packetrate
+normalized by the server weight.
 .TP
 .B -p, --persistent [\fItimeout\fP]
 Specify that a virtual service is persistent. If this option is
[PATCH 1/2] IPVS: add wlib & wlip schedulers
Wensong, this is something we discussed 10 years ago and you liked it, but
it didn't actually get into the kernel. I've updated it, tested it, and
would like to work toward inclusion.

Thanks,
Chris

---
From: Chris Caputo

IPVS wlib (Weighted Least Incoming Byterate) and wlip (Weighted Least Incoming
Packetrate) schedulers, updated for 3.19-rc4.

Signed-off-by: Chris Caputo
---
diff -uprN linux-3.19-rc4-stock/net/netfilter/ipvs/Kconfig linux-3.19-rc4/net/netfilter/ipvs/Kconfig
--- linux-3.19-rc4-stock/net/netfilter/ipvs/Kconfig	2015-01-11 20:44:53.0 +
+++ linux-3.19-rc4/net/netfilter/ipvs/Kconfig	2015-01-17 22:47:52.250301042 +
@@ -240,6 +240,26 @@ config IP_VS_NQ
 	  If you want to compile it in kernel, say Y. To compile it as a
 	  module, choose M here. If unsure, say N.
 
+config IP_VS_WLIB
+	tristate "weighted least incoming byterate scheduling"
+	---help---
+	  The weighted least incoming byterate scheduling algorithm directs
+	  network connections to the server with the least incoming byterate
+	  normalized by the server weight.
+
+	  If you want to compile it in kernel, say Y. To compile it as a
+	  module, choose M here. If unsure, say N.
+
+config IP_VS_WLIP
+	tristate "weighted least incoming packetrate scheduling"
+	---help---
+	  The weighted least incoming packetrate scheduling algorithm directs
+	  network connections to the server with the least incoming packetrate
+	  normalized by the server weight.
+
+	  If you want to compile it in kernel, say Y. To compile it as a
+	  module, choose M here. If unsure, say N.
+
 comment 'IPVS SH scheduler'
 
 config IP_VS_SH_TAB_BITS
diff -uprN linux-3.19-rc4-stock/net/netfilter/ipvs/Makefile linux-3.19-rc4/net/netfilter/ipvs/Makefile
--- linux-3.19-rc4-stock/net/netfilter/ipvs/Makefile	2015-01-11 20:44:53.0 +
+++ linux-3.19-rc4/net/netfilter/ipvs/Makefile	2015-01-17 22:47:35.421861075 +
@@ -33,6 +33,8 @@ obj-$(CONFIG_IP_VS_DH) += ip_vs_dh.o
 obj-$(CONFIG_IP_VS_SH) += ip_vs_sh.o
 obj-$(CONFIG_IP_VS_SED) += ip_vs_sed.o
 obj-$(CONFIG_IP_VS_NQ) += ip_vs_nq.o
+obj-$(CONFIG_IP_VS_WLIB) += ip_vs_wlib.o
+obj-$(CONFIG_IP_VS_WLIP) += ip_vs_wlip.o
 
 # IPVS application helpers
 obj-$(CONFIG_IP_VS_FTP) += ip_vs_ftp.o
diff -uprN linux-3.19-rc4-stock/net/netfilter/ipvs/ip_vs_wlib.c linux-3.19-rc4/net/netfilter/ipvs/ip_vs_wlib.c
--- linux-3.19-rc4-stock/net/netfilter/ipvs/ip_vs_wlib.c	1970-01-01 00:00:00.0 +
+++ linux-3.19-rc4/net/netfilter/ipvs/ip_vs_wlib.c	2015-01-17 22:47:35.421861075 +
@@ -0,0 +1,156 @@
+/* IPVS:        Weighted Least Incoming Byterate Scheduling module
+ *
+ * Authors:     Chris Caputo based on code by:
+ *
+ *                  Wensong Zhang
+ *                  Peter Kese
+ *                  Julian Anastasov
+ *
+ *              This program is free software; you can redistribute it and/or
+ *              modify it under the terms of the GNU General Public License
+ *              as published by the Free Software Foundation; either version
+ *              2 of the License, or (at your option) any later version.
+ *
+ * Changes:
+ *     Chris Caputo: Based code on ip_vs_wlc.c ip_vs_rr.c.
+ *
+ */
+
+/* The WLIB algorithm uses the results of the estimator's inbps
+ * calculations to determine which real server has the lowest incoming
+ * byterate.
+ *
+ * Real server weight is factored into the calculation.  An example way to
+ * use this is if you have one server that can handle 100 Mbps of input and
+ * another that can handle 1 Gbps you could set the weights to be 100 and 1000
+ * respectively.
+ */
+
+#define KMSG_COMPONENT "IPVS"
+#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+
+#include <net/ip_vs.h>
+
+static int
+ip_vs_wlib_init_svc(struct ip_vs_service *svc)
+{
+	svc->sched_data = &svc->destinations;
+	return 0;
+}
+
+static int
+ip_vs_wlib_del_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest)
+{
+	struct list_head *p;
+
+	spin_lock_bh(&svc->sched_lock);
+	p = (struct list_head *)svc->sched_data;
+	/* dest is already unlinked, so p->prev is not valid but
+	 * p->next is valid, use it to reach previous entry.
+	 */
+	if (p == &dest->n_list)
+		svc->sched_data = p->next->prev;
+	spin_unlock_bh(&svc->sched_lock);
+	return 0;
+}
+
+/* Weighted Least Incoming Byterate scheduling */
+static struct ip_vs_dest *
+ip_vs_wlib_schedule(struct ip_vs_service *svc, const struct sk_buff *skb,
+		    struct ip_vs_iphdr *iph)
+{
+	struct list_head *p, *q;
+	struct ip_vs_dest *dest, *least = NULL;
+	u32 dr, lr = -1;
+	int dwgt, lwgt = 0;
+
+	IP_VS_DBG(6, "%s(): Scheduling...\n", __func__);
+
+	/* We calculate the load of each dest server as follows:
+
Re: [PATCH 2.6.19-rc6] sched: cleanup output of show_state/show_task
On Mon, 27 Nov 2006, Andrew Morton wrote:
> On Sat, 25 Nov 2006 04:48:15 + (GMT)
> Chris Caputo <[EMAIL PROTECTED]> wrote:
> > This patch cleans up the output of show_state/task() (aka magic-sysrq-t)
> > so that free stack space is printed as appropriate based on
> > CONFIG_DEBUG_STACK_USAGE.
> >
> > Also, without this patch the header is not aligned with the data and is
> > thus confusing. Free stack is labeled as pid, pid is labeled as father,
> > and so on.
> >
> > Signed-off-by: Chris Caputo <[EMAIL PROTECTED]>
> > ---
> >
> > diff -uprN a/kernel/sched.c b/kernel/sched.c
> > --- a/kernel/sched.c	2006-11-25 04:11:12.0 +
> > +++ b/kernel/sched.c	2006-11-25 04:13:07.0 +
> > @@ -4757,7 +4757,6 @@ static const char stat_nam[] = "RSDTtZX"
> >  static void show_task(struct task_struct *p)
> >  {
> >  	struct task_struct *relative;
> > -	unsigned long free = 0;
> >  	unsigned state;
> >
> >  	state = p->state ? __ffs(p->state) + 1 : 0;
> > @@ -4779,10 +4778,10 @@ static void show_task(struct task_struct
> >  		unsigned long *n = end_of_stack(p);
> >  		while (!*n)
> >  			n++;
> > -		free = (unsigned long)n - (unsigned long)end_of_stack(p);
> > +		printk("%5lu ", (unsigned long)n - (unsigned long)end_of_stack(p));
> >  	}
> >  #endif
> > -	printk("%5lu %5d %6d ", free, p->pid, p->parent->pid);
> > +	printk("%5d %6d ", p->pid, p->parent->pid);
>
> This will cause the output format to be dependent upon the setting of
> CONFIG_DEBUG_STACK_USAGE. So any code which attempts to parse the output
> of this function will somehow need to work out whether or not the `free'
> field is present.
>
> Which is why we still print out a zero if CONFIG_DEBUG_STACK_USAGE=n.

Ahh!

Should we make it so the header printed by show_state is aligned properly
with the data? If yes, please consider the below patch.

Chris

---
From: Chris Caputo <[EMAIL PROTECTED]>
[PATCH 2.6.19-rc6] sched: correct output of show_state

At present show_state prints a header that does not match the output of
show_task, as follows:

-
 sibling
 task PC pid father child younger older
init S 0 1 0 2 (NOTLB)
-

This patch corrects the output of show_state so that the header is aligned
with the data, as in:

-
 free sibling
 task PC stack pid father child younger older
init S 0 1 0 2 (NOTLB)
-

Signed-off-by: Chris Caputo <[EMAIL PROTECTED]>
---

--- a/kernel/sched.c	2006-11-27 08:40:56.0 +
+++ b/kernel/sched.c	2006-11-27 23:23:49.0 +
@@ -4810,12 +4810,12 @@ void show_state(void)
 #if (BITS_PER_LONG == 32)
 	printk("\n"
-	       " sibling\n");
-	printk(" task PC pid father child younger older\n");
+	       " free sibling\n");
+	printk(" task PCstack pid father child younger older\n");
 #else
 	printk("\n"
-	       " sibling\n");
-	printk(" task PC pid father child younger older\n");
+	       " free sibling\n");
+	printk(" task PCstack pid father child younger older\n");
 #endif
 	read_lock(&tasklist_lock);
 	do_each_thread(g, p) {
Re: [PATCH 2.6.19-rc6] sunrpc: fix race condition
On Mon, 27 Nov 2006, Chris Caputo wrote:
> From: Chris Caputo <[EMAIL PROTECTED]>
> [PATCH 2.6.19-rc6] sunrpc: fix race condition

Turns out my patch is buggy. Don't use it.

Chris
[PATCH 2.6.19-rc6] sunrpc: fix race condition
From: Chris Caputo <[EMAIL PROTECTED]>
[PATCH 2.6.19-rc6] sunrpc: fix race condition

Patch linux-2.6.10-01-rpc_workqueue.dif introduced a race condition into
net/sunrpc/sched.c in kernels 2.6.11-rc1 through 2.6.19-rc6. The race
scenario is as follows...

Given: RPC_TASK_QUEUED, RPC_TASK_RUNNING and RPC_TASK_ASYNC are set.

__rpc_execute() (no spinlock)	rpc_make_runnable() (queue spinlock held)
-----------------------------	-----------------------------------------
				do_ret = rpc_test_and_set_running(task);
rpc_clear_running(task);
if (RPC_IS_ASYNC(task)) {
	if (RPC_IS_QUEUED(task))
		return 0;
				rpc_clear_queued(task);
				if (do_ret)
					return;

Thus both threads return and the task is abandoned forever.

In my test NFS client usage (~200 Mb/s at ~3,000 RPC calls/s) this race
condition has resulted in processes getting permanently stuck in 'D' state
often in less than 15 minutes of uptime.

The following patch fixes the problem by returning to use of a spinlock in
__rpc_execute().

Signed-off-by: Chris Caputo <[EMAIL PROTECTED]>
---

diff -up a/net/sunrpc/sched.c b/net/sunrpc/sched.c
--- a/net/sunrpc/sched.c	2006-11-27 08:41:07.0 +
+++ b/net/sunrpc/sched.c	2006-11-27 11:14:21.0 +
@@ -587,6 +587,7 @@ EXPORT_SYMBOL(rpc_exit_task);
 static int __rpc_execute(struct rpc_task *task)
 {
 	int		status = 0;
+	struct rpc_wait_queue *queue;
 
 	dprintk("RPC: %4d rpc_execute flgs %x\n",
 				task->tk_pid, task->tk_flags);
@@ -631,22 +632,27 @@ static int __rpc_execute(struct rpc_task
 			lock_kernel();
 			task->tk_action(task);
 			unlock_kernel();
+			/* micro-optimization to avoid spinlock */
+			if (!RPC_IS_QUEUED(task))
+				continue;
 		}
 
 		/*
-		 * Lockless check for whether task is sleeping or not.
+		 * Check whether task is sleeping.
 		 */
-		if (!RPC_IS_QUEUED(task))
-			continue;
-		rpc_clear_running(task);
+		queue = task->u.tk_wait.rpc_waitq;
+		spin_lock_bh(&queue->lock);
 		if (RPC_IS_ASYNC(task)) {
-			/* Careful! we may have raced... */
-			if (RPC_IS_QUEUED(task))
-				return 0;
-			if (rpc_test_and_set_running(task))
+			if (RPC_IS_QUEUED(task)) {
+				rpc_clear_running(task);
+				spin_unlock_bh(&queue->lock);
 				return 0;
+			}
+			spin_unlock_bh(&queue->lock);
 			continue;
 		}
+		rpc_clear_running(task);
+		spin_unlock_bh(&queue->lock);
 
 		/* sync task: sleep here */
 		dprintk("RPC: %4d sync task going to sleep\n", task->tk_pid);
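The interleaving described in this message can be replayed deterministically in user space. The sketch below is only an illustration of that one ordering, not the sunrpc code: the flag constants, `struct replay`, `test_and_set()` and `replay_race()` are hypothetical names, and the two "threads" are simply statements executed in the order given in the race scenario.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the three task state bits named above. */
enum {
	TASK_RUNNING = 1 << 0,
	TASK_QUEUED  = 1 << 1,
	TASK_ASYNC   = 1 << 2
};

struct replay {
	unsigned flags;               /* final task flags */
	bool execute_returned;        /* executor gave up on the task */
	bool make_runnable_returned;  /* waker returned without requeueing */
};

/* Set a bit and report whether it was already set. */
static bool test_and_set(unsigned *flags, unsigned bit)
{
	bool was_set = (*flags & bit) != 0;

	*flags |= bit;
	return was_set;
}

/* Replay the ordering from the scenario, one statement at a time. */
static struct replay replay_race(void)
{
	struct replay r = { TASK_RUNNING | TASK_QUEUED | TASK_ASYNC,
			    false, false };
	bool do_ret;

	/* waker: RUNNING is still set, so do_ret is true */
	do_ret = test_and_set(&r.flags, TASK_RUNNING);

	/* executor: clears RUNNING, still sees QUEUED, returns */
	r.flags &= ~(unsigned)TASK_RUNNING;
	if ((r.flags & TASK_ASYNC) && (r.flags & TASK_QUEUED))
		r.execute_returned = true;

	/* waker: clears QUEUED, then returns because do_ret was true */
	r.flags &= ~(unsigned)TASK_QUEUED;
	if (do_ret)
		r.make_runnable_returned = true;

	return r;
}
```

After the replay both sides have returned and neither the running nor the queued bit is set, which is the abandoned state the message describes.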
[PATCH 2.6.19-rc6] sunrpc: fix race condition
From: Chris Caputo [EMAIL PROTECTED] [PATCH 2.6.19-rc6] sunrpc: fix race condition Patch linux-2.6.10-01-rpc_workqueue.dif introduced a race condition into net/sunrpc/sched.c in kernels 2.6.11-rc1 through 2.6.19-rc6. The race scenario is as follows... Given: RPC_TASK_QUEUED, RPC_TASK_RUNNING and RPC_TASK_ASYNC are set. __rpc_execute() (no spinlock)rpc_make_runnable() (queue spinlock held) -- do_ret = rpc_test_and_set_running(task); rpc_clear_running(task); if (RPC_IS_ASYNC(task)) { if (RPC_IS_QUEUED(task)) return 0; rpc_clear_queued(task); if (do_ret) return; Thus both threads return and the task is abandoned forever. In my test NFS client usage (~200 Mb/s at ~3,000 RPC calls/s) this race condition has resulted in processes getting permanently stuck in 'D' state often in less than 15 minutes of uptime. The following patch fixes the problem by returning to use of a spinlock in __rpc_execute(). Signed-off-by: Chris Caputo [EMAIL PROTECTED] --- diff -up a/net/sunrpc/sched.c b/net/sunrpc/sched.c --- a/net/sunrpc/sched.c2006-11-27 08:41:07.0 + +++ b/net/sunrpc/sched.c2006-11-27 11:14:21.0 + @@ -587,6 +587,7 @@ EXPORT_SYMBOL(rpc_exit_task); static int __rpc_execute(struct rpc_task *task) { int status = 0; + struct rpc_wait_queue *queue; dprintk(RPC: %4d rpc_execute flgs %x\n, task-tk_pid, task-tk_flags); @@ -631,22 +632,27 @@ static int __rpc_execute(struct rpc_task lock_kernel(); task-tk_action(task); unlock_kernel(); + /* micro-optimization to avoid spinlock */ + if (!RPC_IS_QUEUED(task)) + continue; } /* -* Lockless check for whether task is sleeping or not. +* Check whether task is sleeping. */ - if (!RPC_IS_QUEUED(task)) - continue; - rpc_clear_running(task); + queue = task-u.tk_wait.rpc_waitq; + spin_lock_bh(queue-lock); if (RPC_IS_ASYNC(task)) { - /* Careful! we may have raced... 
*/ - if (RPC_IS_QUEUED(task)) - return 0; - if (rpc_test_and_set_running(task)) + if (RPC_IS_QUEUED(task)) { + rpc_clear_running(task); + spin_unlock_bh(queue-lock); return 0; + } + spin_unlock_bh(queue-lock); continue; } + rpc_clear_running(task); + spin_unlock_bh(queue-lock); /* sync task: sleep here */ dprintk(RPC: %4d sync task going to sleep\n, task-tk_pid); - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.19-rc6] sunrpc: fix race condition
On Mon, 27 Nov 2006, Chris Caputo wrote:
> From: Chris Caputo [EMAIL PROTECTED]
> [PATCH 2.6.19-rc6] sunrpc: fix race condition

Turns out my patch is buggy. Don't use it.

Chris
Re: [PATCH 2.6.19-rc6] sched: cleanup output of show_state/show_task
On Mon, 27 Nov 2006, Andrew Morton wrote:
> On Sat, 25 Nov 2006 04:48:15 + (GMT) Chris Caputo [EMAIL PROTECTED] wrote:
> > This patch cleans up the output of show_state/task() (aka magic-sysrq-t)
> > so that free stack space is printed as appropriate based on
> > CONFIG_DEBUG_STACK_USAGE.
> >
> > Also, without this patch the header is not aligned with the data and is
> > thus confusing. Free stack is labeled as pid, pid is labeled as father,
> > and so on.
> >
> > Signed-off-by: Chris Caputo [EMAIL PROTECTED]
> > ---
> >
> > diff -uprN a/kernel/sched.c b/kernel/sched.c
> > --- a/kernel/sched.c	2006-11-25 04:11:12.0 +
> > +++ b/kernel/sched.c	2006-11-25 04:13:07.0 +
> > @@ -4757,7 +4757,6 @@ static const char stat_nam[] = "RSDTtZX"
> >  static void show_task(struct task_struct *p)
> >  {
> >  	struct task_struct *relative;
> > -	unsigned long free = 0;
> >  	unsigned state;
> >
> >  	state = p->state ? __ffs(p->state) + 1 : 0;
> > @@ -4779,10 +4778,10 @@ static void show_task(struct task_struct
> >  		unsigned long *n = end_of_stack(p);
> >  		while (!*n)
> >  			n++;
> > -		free = (unsigned long)n - (unsigned long)end_of_stack(p);
> > +		printk("%5lu ", (unsigned long)n - (unsigned long)end_of_stack(p));
> >  	}
> >  #endif
> > -	printk("%5lu %5d %6d ", free, p->pid, p->parent->pid);
> > +	printk("%5d %6d ", p->pid, p->parent->pid);
>
> This will cause the output format to be dependent upon the setting of
> CONFIG_DEBUG_STACK_USAGE. So any code which attempts to parse the output
> of this function will somehow need to work out whether or not the `free'
> field is present.
>
> Which is why we still print out a zero if CONFIG_DEBUG_STACK_USAGE=n.

Ahh!

Should we make it so the header printed by show_state is aligned properly with the data? If yes, please consider the below patch.

Chris

---

From: Chris Caputo [EMAIL PROTECTED]

[PATCH 2.6.19-rc6] sched: correct output of show_state

At present show_state prints a header that does not match the output of show_task, as follows:

-
 sibling
 task PC pid father child younger older
init S 0 1 0 2 (NOTLB)
-

This patch corrects the output of show_state so that the header is aligned with the data, ala:

-
 free sibling
 task PC stack pid father child younger older
init S 0 1 0 2 (NOTLB)
-

Signed-off-by: Chris Caputo [EMAIL PROTECTED]
---

--- a/kernel/sched.c	2006-11-27 08:40:56.0 +
+++ b/kernel/sched.c	2006-11-27 23:23:49.0 +
@@ -4810,12 +4810,12 @@ void show_state(void)
 #if (BITS_PER_LONG == 32)
 	printk("\n"
-	       " sibling\n");
-	printk(" task PC pid father child younger older\n");
+	       " free sibling\n");
+	printk(" task PC stack pid father child younger older\n");
 #else
 	printk("\n"
-	       " sibling\n");
-	printk(" task PC pid father child younger older\n");
+	       " free sibling\n");
+	printk(" task PC stack pid father child younger older\n");
 #endif
 	read_lock(&tasklist_lock);
 	do_each_thread(g, p) {
Re: 3ware driver (3w-xxxx) in 2.6.10: procfs entry
On Mon, 10 Jan 2005, Ricky Beam wrote:
> On Mon, 10 Jan 2005, Peter Daum wrote:
> >On Mon, 10 Jan 2005, Christoph Hellwig wrote:
> >> The change came from the driver maintainer at 3ware. Get the updated
> >> tools from their website.
> >
> >Which website do you mean? The programs in the download section of
> >"www.3ware.com" are just the ones that don't work anymore.
>
> Yeap. The "idiot" removed the proc interface from the driver before
> publishing the updated tools -- and I said so at the time. At the time
> the interface was removed, the new tools weren't available - period. They
> are still "beta" today (several months later.)
>
> Just put the procfs code back in the driver in your local tree and
> walk away. That's what I did -- but it doesn't look like I commited
> it to any BK tree :-( (and that box is *ahem* powered off)

Or just grab the latest (version 9.1.5.2) 3dm and tw_cli software from the 3ware web site. These may not be listed as being for your version of the card, but they will work with the new driver.

Chris
bdev_lock deadlock in 2.6.10-ac8 / e1000 / rfc2385 patch
I've been seeing bdev_lock based deadlocks since 2.6.9. Here's the latest one with 2.6.10-ac8.

Not sure if the problem is related to the e1000 driver (with NAPI) or the rfc2385 patches or what.

Anyone else seeing this?

Chris

--

2.6.10-ac8 + rfc2385 md5 patch:

Jan/16 10:55 pm SysRq : Show Regs
Jan/16 10:55 pm Pid: 820, comm: sh
Jan/16 10:55 pm EIP: 0060:[<c0309276>] CPU: 0
Jan/16 10:55 pm EIP is at _spin_lock+0x36/0x90
Jan/16 10:55 pm  EFLAGS: 0246    Not tainted  (2.6.10-ac8)
Jan/16 10:55 pm EAX: EBX: ECX: c0355300 EDX: c0414000
Jan/16 10:55 pm ESI: c03a1600 EDI: EBP: c0414fc4 DS: 007b ES: 007b
Jan/16 10:55 pm CR0: 8005003b CR2: b7fd6f68 CR3: 02463000 CR4: 06d0

>>EIP; c0309276 <_spin_lock+36/90>   <=
>>ECX; c0355300 <contig_page_data+0/e00>
>>EDX; c0414000 <softirq_stack+0/4000>
>>ESI; c03a1600 <bdev_lock+0/80>
>>EDI; <__kernel_rt_sigreturn+1bbf/>
>>EBP; c0414fc4 <softirq_stack+fc4/4000>

Jan/16 10:55 pm  [<c02dc0f0>] defense_timer_handler+0x0/0x40
Jan/16 10:55 pm  [<c015bbed>] nr_blockdev_pages+0xd/0x60
Jan/16 10:55 pm
Jan/16 10:55 pm  [<c013a601>] si_meminfo+0x21/0x40
Jan/16 10:55 pm  [<c02dbe97>] update_defense_level+0x17/0x270
Jan/16 10:55 pm  [<c0121469>] __mod_timer+0xf9/0x140
Jan/16 10:55 pm  [<c02336d4>] e1000_clean+0xb4/0xd0
Jan/16 10:55 pm  [<c02dc0f0>] defense_timer_handler+0x0/0x40
Jan/16 10:55 pm  [<c02dc0f8>] defense_timer_handler+0x8/0x40
Jan/16 10:55 pm  [<c0121dda>] run_timer_softirq+0xda/0x1a0
Jan/16 10:55 pm  [<c0278df1>] net_rx_action+0x81/0x110
Jan/16 10:55 pm  [<c011db1a>] __do_softirq+0xba/0xd0
Jan/16 10:55 pm  [<c01868f0>] meminfo_read_proc+0x0/0x240
Jan/16 10:55 pm  [<c0104cba>] do_softirq+0x4a/0x60
Jan/16 10:55 pm  ===
Jan/16 10:55 pm  [<c0133d59>] irq_exit+0x39/0x40
Jan/16 10:55 pm  [<c010309c>] apic_timer_interrupt+0x1c/0x24
Jan/16 10:55 pm  [<c01868f0>] meminfo_read_proc+0x0/0x240
Jan/16 10:55 pm
Jan/16 10:55 pm  [<c013007b>] set_obsolete+0xfb/0x220
Jan/16 10:55 pm  [<c030925a>] _spin_lock+0x1a/0x90
Jan/16 10:55 pm  [<c015bbed>] nr_blockdev_pages+0xd/0x60
Jan/16 10:55 pm  [<c013a601>] si_meminfo+0x21/0x40
Jan/16 10:55 pm  [<c0186931>] meminfo_read_proc+0x41/0x240
Jan/16 10:55 pm  [<c01819c7>] proc_read_inode+0x17/0x40
Jan/16 10:55 pm  [<c016c79c>] d_rehash+0x6c/0x90
Jan/16 10:55 pm
Jan/16 10:55 pm  [<c01848df>] proc_lookup+0x8f/0xd0
Jan/16 10:55 pm  [<c0139fd4>] __alloc_pages+0x1d4/0x370
Jan/16 10:55 pm  [<c0146981>] vma_merge+0xd1/0x1d0
Jan/16 10:55 pm  [<c01868f0>] meminfo_read_proc+0x0/0x240
Jan/16 10:55 pm  [<c01843e3>] proc_file_read+0xc3/0x250
Jan/16 10:55 pm  [<c0153cf8>] vfs_read+0xb8/0x130
Jan/16 10:55 pm  [<c0153fe1>] sys_read+0x51/0x80
Jan/16 10:55 pm  [<c0102649>] sysenter_past_esp+0x52/0x75