Re: kernel v4.8: iptables logs are truncated with the 4.8 kernel?

2016-10-10 Thread Chris Caputo
On Tue, 11 Oct 2016, Liping Zhang wrote:
> Yes, thanks for clarifying this. There's a bug in kernel, can you try
> this patch:
> 
> diff --git a/net/netfilter/xt_NFLOG.c b/net/netfilter/xt_NFLOG.c
> index 018eed7..8c069b4 100644
> --- a/net/netfilter/xt_NFLOG.c
> +++ b/net/netfilter/xt_NFLOG.c
> @@ -32,6 +32,7 @@ nflog_tg(struct sk_buff *skb, const struct
> xt_action_param *par)
> li.u.ulog.copy_len   = info->len;
> li.u.ulog.group  = info->group;
> li.u.ulog.qthreshold = info->threshold;
> +   li.u.ulog.flags  = 0;
> 
> if (info->flags & XT_NFLOG_F_COPY_LEN)
> li.u.ulog.flags |= NF_LOG_F_COPY_LEN;

I have tested the above patch with 4.8.1, with and without nflog-size 
defined in an iptables configuration, and it works well.

The ulogd-2.0.5 segfaults no longer happen when nflog-size is not present 
in a target.
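
To illustrate why the missing initialization matters, here is a minimal user-space sketch. It is not the kernel code: `demo_loginfo`, `demo_rule`, `demo_fill`, `demo_data_len` and the `DEMO_*` flag values are hypothetical stand-ins, and the clamping rule in `demo_data_len` is an assumption about how a consumer might honor the copy-length flag. If `flags` is left holding stale stack contents, the copy-length bit can appear set even though no nflog-size was given, and a copy_len of 0 then truncates the packet to nothing.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the rule flag and the log flag. */
#define DEMO_RULE_F_COPY_LEN 0x1u
#define DEMO_LOG_F_COPY_LEN  0x1u

struct demo_rule {            /* what the iptables rule carries */
    uint32_t len;
    uint16_t group;
    uint16_t threshold;
    uint16_t flags;
};

struct demo_loginfo {         /* what is handed to the logger */
    uint16_t copy_len;
    uint16_t group;
    uint16_t qthreshold;
    uint16_t flags;
};

/* Fill every field, including flags, before OR-ing bits into it. */
static void demo_fill(struct demo_loginfo *li, const struct demo_rule *rule)
{
    assert(li != NULL && rule != NULL);
    li->copy_len   = (uint16_t)rule->len;
    li->group      = rule->group;
    li->qthreshold = rule->threshold;
    li->flags      = 0;       /* the one-line fix: no stale bits survive */

    if (rule->flags & DEMO_RULE_F_COPY_LEN)
        li->flags |= DEMO_LOG_F_COPY_LEN;
}

/* Assumed consumer: clamp to copy_len only when the flag is set. */
static uint32_t demo_data_len(const struct demo_loginfo *li, uint32_t pkt_len)
{
    if ((li->flags & DEMO_LOG_F_COPY_LEN) && li->copy_len < pkt_len)
        return li->copy_len;
    return pkt_len;
}
```

With the assignment to `flags` removed, a caller whose `demo_loginfo` starts out filled with garbage could see `demo_data_len` return 0 for a rule that never asked for a size limit.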

I recommend this fix.

Thanks,
Chris


Re: kernel v4.8: iptables logs are truncated with the 4.8 kernel?

2016-10-10 Thread Chris Caputo
On Mon, 10 Oct 2016, Liping Zhang wrote:
> 2016-10-10 15:02 GMT+08:00 Chris Caputo <ccap...@alt.net>:
> >   Program received signal SIGSEGV, Segmentation fault.
> >   0x765fd18a in _interp_iphdr (pi=0x617f50, len=0) at 
> > ulogd_raw2packet_BASE.c:720
> >
> >   715 static int _interp_iphdr(struct ulogd_pluginstance *pi, uint32_t 
> > len)
> >   716 {
> >   717 struct ulogd_key *ret = pi->output.keys;
> >   718 struct iphdr *iph =
> >   719 ikey_get_ptr(&pi->input.keys[INKEY_RAW_PCKT]);
> >   720 void *nexthdr = (uint32_t *)iph + iph->ihl;
> >
> > I believe 7643507fe8b5bd8ab7522f6a81058cc1209d2585 changed previous
> > behavior by not always copying IP header data to user space.
> >
> > On my machine IPv4 log packets result in a ulogd segfault while IPv6
> > packets do not.  I'm not sure of the cause of the difference.
> >
> > The corresponding userspace commit for the 209d2585 kernel change is:
> >
> >   
> > https://git.netfilter.org/iptables/commit/?id=7070b1f3c88a0c3d4e315c00cca61f05b0fbc882
> >
> > This adds --nflog-size to iptables.  When --nflog-size is used with my
> > iptables NFLOG lines, the ulogd-2.0.5 segfaults cease.
> 
> What numbers did you specify after --nflog-size option?
> --nflog-size 0 or ...? If you want log the whole packet to
> the ulogd, please do not specify this nflog-size option.

Not specifying nflog-size does not appear to log the whole packet...

If "--nflog-size" is unspecified, and the iptables config is left 
unchanged when the kernel is upgraded to 4.8, ulogd-2.0.5 crashes.

If "--nflog-size 0" is used, ulogd-2.0.5 crashes.

If "--nflog-size" is used with size 1 or greater, ulogd-2.0.5 is fine.

> > I'm surprised to see a kernel change cause unexpected userspace segfaults,
> > so further investigation into a kernel fix would seem a good idea.
> 
> According to the original user's manual, nflog-range option was
> designed to be the number of bytes copied to userspace, but
> unfortunately there's a bug from the beginning and it never works,
> i.e. in kernel, it just ignored this option.
> 
> Try to change the current nflog-range option's semantics may
> cause unexpected results(maybe like this ulogd crash) ...
> 
> In order to keep compatibility, Vishwanath introduce a new
> nflog-size option and keep nflog-range unchanged. If you just
> upgrade the kernel, and do not change iptables rules, this
> problem will not happen.

I am reporting that the problem does happen simply with an upgrade to 
kernel 4.8 and no other changes.  When "--nflog-size" is unspecified or 
set to 0, the bug in ulogd-2.0.5 gets triggered.

I agree there is a bug in ulogd-2.0.5 that this kernel change exposed, but 
I am trying to explain that all ulogd users risk this segfault if they 
upgrade to kernel 4.8 and don't either update to a fixed ulogd (possibly 
using your patch below) or an unreleased iptables with iptables config 
changes to implement nflog-size on each NFLOG target.

> So I think this is ulogd's bug, in _interp_iphdr, it try to
> dereference the iphdr pointer before validation check, meanwhile
> this problem does not exist in ipv6 path.  Can you try this patch:
> 
> diff --git a/filter/raw2packet/ulogd_raw2packet_BASE.c
> b/filter/raw2packet/ulogd_raw2packet_BASE.c
> index 8a6180c..fd2665a 100644
> --- a/filter/raw2packet/ulogd_raw2packet_BASE.c
> +++ b/filter/raw2packet/ulogd_raw2packet_BASE.c
> @@ -717,7 +717,7 @@ static int _interp_iphdr(struct ulogd_pluginstance
> *pi, uint32_t len)
> struct ulogd_key *ret = pi->output.keys;
> struct iphdr *iph =
> ikey_get_ptr(&pi->input.keys[INKEY_RAW_PCKT]);
> -   void *nexthdr = (uint32_t *)iph + iph->ihl;
> +   void *nexthdr;
> 
> if (len < sizeof(struct iphdr) || len <= (uint32_t)(iph->ihl * 4))
> return ULOGD_IRET_OK;
> @@ -734,6 +734,7 @@ static int _interp_iphdr(struct ulogd_pluginstance
> *pi, uint32_t len)
> okey_set_u16(&ret[KEY_IP_ID], ntohs(iph->id));
> okey_set_u16(&ret[KEY_IP_FRAGOFF], ntohs(iph->frag_off));
> 
> +   nexthdr = (uint32_t *)iph + iph->ihl;
> switch (iph->protocol) {
> case IPPROTO_TCP:
> _interp_tcp(pi, nexthdr, len);

I agree this will likely fix ulogd, but this misses the point about the 
new kernel defaulting to a zero size return when it used to return the 
packet.
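
For anyone following the ulogd side, the ordering the patch establishes (check the captured length first, compute the next-header pointer only afterwards) can be sketched in isolation. This is a stand-alone illustration, not ulogd code: `demo_next_header` and its raw-byte parsing are hypothetical, and it assumes a zero-length capture may arrive with no readable buffer behind it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_IPV4_MIN_HDR 20u   /* IPv4 header without options */

/* Return a pointer to the first byte after the IPv4 header, or NULL when
 * the captured length cannot hold the header plus at least one more byte.
 * Nothing in the packet is read until the length has been validated. */
static const uint8_t *demo_next_header(const uint8_t *pkt, uint32_t len)
{
    uint32_t hdr_len;

    if (pkt == NULL || len < DEMO_IPV4_MIN_HDR)
        return NULL;

    /* Low nibble of the first byte is the header length in 32-bit words. */
    hdr_len = (uint32_t)(pkt[0] & 0x0f) * 4u;
    if (hdr_len < DEMO_IPV4_MIN_HDR || len <= hdr_len)
        return NULL;

    assert(hdr_len < len);
    return pkt + hdr_len;
}
```

A capture of length 0, as in the backtrace above, is rejected by the first test, so the header length field is never dereferenced.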

Thanks,
Chris


Re: kernel v4.8: iptables logs are truncated with the 4.8 kernel?

2016-10-10 Thread Chris Caputo
On Tue, 4 Oct 2016, Justin Piszcz wrote:
> kernel 4.8 with ulogd-2.0.5- IPs are no longer logged:
> 
> Oct  4 17:51:30 atom INPUT_BLOCK IN=eth1 OUT=
> MAC=00:1b:21:9c:3b:fa:3e:94:d5:d2:49:1e:08:00 LEN=0 TOS=00 PREC=0x00
> TTL=0 ID=0 PROTO=0 MARK=0
> Oct  4 17:51:31 atom INPUT_BLOCK IN=eth1 OUT=
> MAC=00:1b:21:9c:3b:fa:3e:94:d5:d2:49:1e:08:00 LEN=0 TOS=00 PREC=0x00
> TTL=0 ID=0 PROTO=0 MARK=0
> Oct  4 17:51:32 atom INPUT_BLOCK IN=eth1 OUT=
> MAC=00:1b:21:9c:3b:fa:3e:94:d5:d2:49:1e:08:00 LEN=0 TOS=00 PREC=0x00
> TTL=0 ID=0 PROTO=0 MARK=0
> 
> (reboot back to kernel 4.7, works fine)
> 
> kernel 4.7 with ulogd-2.0.5:
> Oct  4 17:56:44 atom INPUT_BLOCK IN=eth1 OUT=
> MAC=00:1b:21:9c:3b:fa:3e:94:d5:d2:49:1e:08:00 SRC=74.125.22.125
> DST=1.2.3.4 LEN=397 TOS=00 PREC=0x00 TTL=48 ID=58093 PROTO=TCP
> SPT=5222 DPT=19804 SEQ=2032644254 ACK=2273184383 WINDOW=55272 ACK PSH
> URGP=0 MARK=0
> Oct  4 17:56:45 atom INPUT_BLOCK IN=eth1 OUT=
> MAC=00:1b:21:9c:3b:fa:3e:94:d5:d2:49:1e:08:00 SRC=74.125.22.125
> DST=1.2.3.4 LEN=397 TOS=00 PREC=0x00 TTL=48 ID=58725 PROTO=TCP
> SPT=5222 DPT=19804 SEQ=2032644254 ACK=2273184383 WINDOW=55272 ACK PSH
> URGP=0 MARK=0
> 
> Looks like there were some changes in the 4.8 kernel regarding ulogd,
> has anyone else run into this problem?

For me, kernel 4.8.1 results in segfaults in ulogd-2.0.5 at:

  Program received signal SIGSEGV, Segmentation fault.
  0x765fd18a in _interp_iphdr (pi=0x617f50, len=0) at 
ulogd_raw2packet_BASE.c:720

  715 static int _interp_iphdr(struct ulogd_pluginstance *pi, uint32_t len)
  716 {
  717 struct ulogd_key *ret = pi->output.keys;
  718 struct iphdr *iph =
  719 ikey_get_ptr(&pi->input.keys[INKEY_RAW_PCKT]);
  720 void *nexthdr = (uint32_t *)iph + iph->ihl;

I believe 7643507fe8b5bd8ab7522f6a81058cc1209d2585 changed previous 
behavior by not always copying IP header data to user space.

On my machine IPv4 log packets result in a ulogd segfault while IPv6 
packets do not.  I'm not sure of the cause of the difference.

The corresponding userspace commit for the 209d2585 kernel change is:

  
https://git.netfilter.org/iptables/commit/?id=7070b1f3c88a0c3d4e315c00cca61f05b0fbc882

This adds --nflog-size to iptables.  When --nflog-size is used with my 
iptables NFLOG lines, the ulogd-2.0.5 segfaults cease.

I'm surprised to see a kernel change cause unexpected userspace segfaults, 
so further investigation into a kernel fix would seem a good idea.  
Having to add the likes of "--nflog-size 200" (200 simply being what I am 
using) to every NFLOG line in firewall configs is a significant burden for 
many.

Putting out a new release of iptables may help ease this transition if the 
kernel is not patched to fix this.  I had to use the git code since 1.6.0 
doesn't have it.

Chris


Re: [PATCH 1/3] IPVS: add wlib & wlip schedulers

2015-01-22 Thread Chris Caputo
On Fri, 23 Jan 2015, Julian Anastasov wrote:
>   Hello,
> 
> On Tue, 20 Jan 2015, Chris Caputo wrote:
> > My application consists of incoming TCP streams being load balanced to 
> > servers which receive the feeds. These are long lived multi-gigabyte 
> > streams, and so I believe the estimator's 2-second timer is fine. As an 
> > example:
> > 
> > # cat /proc/net/ip_vs_stats
> >Total Incoming Outgoing Incoming Outgoing
> >Conns  Packets  Packets    Bytes    Bytes
> >  9AB  58B7C170  1237CA2C3250
> > 
> >  Conns/s   Pkts/s   Pkts/s  Bytes/s  Bytes/s
> >1 387C0  B16C4AE0
> 
>   All other schedulers react and see different
> picture after every new connection. The worst example
> is WLC where slow-start mechanism is desired because
> idle server can be overloaded before the load is noticed
> properly. Even WRR accounts every connection in its state.
> 
>   Your setup may expect low number of connections per
> second but for other kind of setups sending all connections
> to same server for 2 seconds looks scary. In fact, what
> changes is the position, so we rotate only among the
> least loaded servers that look equally loaded but it is
> one server in the common case. And as our stats are per
> CPU and designed for human reading, it is difficult to
> read them often for other purposes. We need a good idea
> to solve this problem, so that we can have faster feedback
> after every scheduling.

This is exactly why my wlib/wlip code is a hybrid of wlc and rr.  The last 
location is saved, and the search starts after it.  Thus when traffic is 
zero, round-robin occurs.  When flows already exist, bursts of new 
connections do choose poorly because the last estimate is reused, but 
working around that seems overly complex.
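
A rough user-space sketch of that hybrid, for illustration only. `demo_server` and `demo_pick` are hypothetical names, a plain array stands in for the kernel's destination list, and the weight tie-break from the actual patch is left out to keep the rotation visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_server {
    uint64_t rate;     /* last estimated incoming rate */
    uint32_t weight;   /* 0 means "do not schedule here" */
};

/* Choose the server with the lowest rate/weight. The scan starts just
 * after *last and a candidate only wins on a strictly lower ratio, so
 * equally loaded servers are handed out in round-robin order.
 * Returns the chosen index (also stored in *last) or -1 if none usable. */
static int demo_pick(const struct demo_server *srv, size_t n, size_t *last)
{
    int best = -1;
    size_t i;

    assert(srv != NULL && last != NULL);
    for (i = 1; i <= n; i++) {
        size_t idx = (*last + i) % n;

        if (srv[idx].weight == 0)
            continue;
        /* rate[idx]/weight[idx] < rate[best]/weight[best], cross-multiplied */
        if (best < 0 ||
            srv[idx].rate * srv[best].weight <
            srv[best].rate * srv[idx].weight)
            best = (int)idx;
    }
    if (best >= 0)
        *last = (size_t)best;
    return best;
}
```

With all rates at zero, successive calls walk through the servers in turn. Once rates differ, every call made before the next estimator update returns the same least-loaded server, which is the burst behaviour described above.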

> > >   May be not so useful idea: use sum of both directions
> > > or control it with svc->flags & IP_VS_SVC_F_SCHED_WLIB_xxx
> > > flags, see how "sh" scheduler supports flags. I.e.
> > > inbps + outbps.
> > 
> > I see a user-mode option as increasing complexity. For example, 
> > keepalived users would need to have keepalived patched to support the new 
> > algorithm, due to flags, rather than just configuring "wlib" or "wlip" and 
> > it just working.
> 
>   That is also true.
> 
> > I think I'd rather see a wlob/wlop version for users that want to 
> > load-balance based on outgoing bytes/packets, and a wlb/wlp version for 
> > users that want them summed.
> 
>   ok
> 
> > From: Chris Caputo  
> > 
> > IPVS: Change inbps and outbps to 64-bits so that estimator handles faster
> > flows. Also increases maximum viewable at user level from ~2.15Gbits/s to
> > ~34.35Gbits/s.
> 
>   Yep, we are limited from u32 in user space structs.
> I have to think how to solve this problem.
> 
> 1gbit => ~1.5 million pps
> 10gbit => ~15 million pps
> 100gbit => ~150 million pps
> 
> > Signed-off-by: Chris Caputo 
> > ---
> > diff -uprN linux-3.19-rc5-stock/include/net/ip_vs.h 
> > linux-3.19-rc5/include/net/ip_vs.h
> > --- linux-3.19-rc5-stock/include/net/ip_vs.h2015-01-18 
> > 06:02:20.0 +
> > +++ linux-3.19-rc5/include/net/ip_vs.h  2015-01-20 08:01:15.548177969 
> > +
> > @@ -390,8 +390,8 @@ struct ip_vs_estimator {
> > u32 cps;
> > u32 inpps;
> > u32 outpps;
> > -   u32 inbps;
> > -   u32 outbps;
> > +   u64 inbps;
> > +   u64 outbps;
> 
>   Not sure, may be everything here should be u64 because
> we have shifted values. I'll need some days to investigate
> this issue...
> 
> Regards
> 
> --
> Julian Anastasov 

Sounds good and thanks!

Chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 3/3] IPVS: add wlib & wlip schedulers

2015-01-20 Thread Chris Caputo
From: Chris Caputo  

IPVS wlib (Weighted Least Incoming Byterate) and wlip (Weighted Least Incoming 
Packetrate) scheduler docs for ipvsadm-1.27.

Signed-off-by: Chris Caputo 
---
diff -upr ipvsadm-1.27-stock/SCHEDULERS ipvsadm-1.27/SCHEDULERS
--- ipvsadm-1.27-stock/SCHEDULERS   2013-09-06 08:37:27.0 +
+++ ipvsadm-1.27/SCHEDULERS 2015-01-17 22:14:32.812597191 +
@@ -1 +1 @@
-rr|wrr|lc|wlc|lblc|lblcr|dh|sh|sed|nq
+rr|wrr|lc|wlc|lblc|lblcr|dh|sh|sed|nq|wlib|wlip
diff -upr ipvsadm-1.27-stock/ipvsadm.8 ipvsadm-1.27/ipvsadm.8
--- ipvsadm-1.27-stock/ipvsadm.82013-09-06 08:37:27.0 +
+++ ipvsadm-1.27/ipvsadm.8  2015-01-17 22:14:32.812597191 +
@@ -261,6 +261,14 @@ fixed service rate (weight) of the ith s
 \fBnq\fR - Never Queue: assigns an incoming job to an idle server if
 there is, instead of waiting for a fast one; if all the servers are
 busy, it adopts the Shortest Expected Delay policy to assign the job.
+.sp
+\fBwlib\fR - Weighted Least Incoming Byterate: directs network
+connections to the real server with the least incoming byterate
+normalized by the server weight.
+.sp
+\fBwlip\fR - Weighted Least Incoming Packetrate: directs network
+connections to the real server with the least incoming packetrate
+normalized by the server weight.
 .TP
 .B -p, --persistent [\fItimeout\fP]
 Specify that a virtual service is persistent. If this option is


[PATCH 1/3] IPVS: add wlib & wlip schedulers

2015-01-20 Thread Chris Caputo
On Tue, 20 Jan 2015, Julian Anastasov wrote:
> On Sat, 17 Jan 2015, Chris Caputo wrote:
> > From: Chris Caputo  
> > 
> > IPVS wlib (Weighted Least Incoming Byterate) and wlip (Weighted Least 
> > Incoming 
> > Packetrate) schedulers, updated for 3.19-rc4.

Hi Julian,

Thanks for the review.

>   The IPVS estimator uses 2-second timer to update
> the stats, isn't that a problem for such schedulers?
> Also, you schedule by incoming traffic rate which is
> ok when clients mostly upload. But in the common case
> clients mostly download and IPVS processes download
> traffic only for NAT method.

My application consists of incoming TCP streams being load balanced to 
servers which receive the feeds. These are long lived multi-gigabyte 
streams, and so I believe the estimator's 2-second timer is fine. As an 
example:

# cat /proc/net/ip_vs_stats
   Total Incoming Outgoing Incoming Outgoing
   Conns  Packets  Packets    Bytes    Bytes
 9AB  58B7C170  1237CA2C3250

 Conns/s   Pkts/s   Pkts/s  Bytes/s  Bytes/s
   1 387C0  B16C4AE0

>   May be not so useful idea: use sum of both directions
> or control it with svc->flags & IP_VS_SVC_F_SCHED_WLIB_xxx
> flags, see how "sh" scheduler supports flags. I.e.
> inbps + outbps.

I see a user-mode option as increasing complexity. For example, 
keepalived users would need to have keepalived patched to support the new 
algorithm, due to flags, rather than just configuring "wlib" or "wlip" and 
it just working.

I think I'd rather see a wlob/wlop version for users that want to 
load-balance based on outgoing bytes/packets, and a wlb/wlp version for 
users that want them summed.

>   Another problem: pps and bps are shifted values,
> see how ip_vs_read_estimator() reads them. ip_vs_est.c
> contains comments that this code handles couple of
> gigabits. May be inbps and outbps in struct ip_vs_estimator
> should be changed to u64 to support more gigabits, with
> separate patch.

See patch below to convert bps in ip_vs_estimator to 64-bits.

Other patches, based on your feedback, to follow.

Thanks,
Chris

From: Chris Caputo  

IPVS: Change inbps and outbps to 64-bits so that estimator handles faster
flows. Also increases maximum viewable at user level from ~2.15Gbits/s to
~34.35Gbits/s.

Signed-off-by: Chris Caputo 
---
diff -uprN linux-3.19-rc5-stock/include/net/ip_vs.h 
linux-3.19-rc5/include/net/ip_vs.h
--- linux-3.19-rc5-stock/include/net/ip_vs.h2015-01-18 06:02:20.0 
+
+++ linux-3.19-rc5/include/net/ip_vs.h  2015-01-20 08:01:15.548177969 +
@@ -390,8 +390,8 @@ struct ip_vs_estimator {
u32 cps;
u32 inpps;
u32 outpps;
-   u32 inbps;
-   u32 outbps;
+   u64 inbps;
+   u64 outbps;
 };
 
 struct ip_vs_stats {
diff -uprN linux-3.19-rc5-stock/net/netfilter/ipvs/ip_vs_est.c 
linux-3.19-rc5/net/netfilter/ipvs/ip_vs_est.c
--- linux-3.19-rc5-stock/net/netfilter/ipvs/ip_vs_est.c 2015-01-18 
06:02:20.0 +
+++ linux-3.19-rc5/net/netfilter/ipvs/ip_vs_est.c   2015-01-20 
08:01:34.369840704 +
@@ -45,10 +45,12 @@
 
   NOTES.
 
-  * The stored value for average bps is scaled by 2^5, so that maximal
-rate is ~2.15Gbits/s, average pps and cps are scaled by 2^10.
+  * Average bps is scaled by 2^5, while average pps and cps are scaled by 2^10.
 
-  * A lot code is taken from net/sched/estimator.c
+  * All are reported to user level as 32 bit unsigned values. Bps can
+overflow for fast links : max speed being ~34.35Gbits/s.
+
+  * A lot of code is taken from net/core/gen_estimator.c
  */
 
 
@@ -98,7 +100,7 @@ static void estimation_timer(unsigned lo
u32 n_conns;
u32 n_inpkts, n_outpkts;
u64 n_inbytes, n_outbytes;
-   u32 rate;
+   u64 rate;
struct net *net = (struct net *)arg;
struct netns_ipvs *ipvs;
 
@@ -118,23 +120,24 @@ static void estimation_timer(unsigned lo
/* scaled by 2^10, but divided 2 seconds */
rate = (n_conns - e->last_conns) << 9;
e->last_conns = n_conns;
-   e->cps += ((long)rate - (long)e->cps) >> 2;
+   e->cps += ((s64)rate - (s64)e->cps) >> 2;
 
rate = (n_inpkts - e->last_inpkts) << 9;
e->last_inpkts = n_inpkts;
-   e->inpps += ((long)rate - (long)e->inpps) >> 2;
+   e->inpps += ((s64)rate - (s64)e->inpps) >> 2;
 
rate = (n_outpkts - e->last_outpkts) << 9;
e->last_outpkts = n_outpkts;
-   e->outpps += ((long)r

[PATCH 2/3] IPVS: add wlib & wlip schedulers

2015-01-20 Thread Chris Caputo
On Tue, 20 Jan 2015, Julian Anastasov wrote:
> > +  (u64)dr * (u64)lwgt < (u64)lr * (u64)dwgt ||
[...]
> > +  (dr == lr && dwgt > lwgt)) {
> 
>   Above check is redundant.

I accepted your feedback and applied it to the below, except for this 
item.  I believe if dr and lr are zero (no traffic), we still want to 
choose the higher weight, thus a separate comparison is needed.
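
To make the disagreement concrete, here is the comparison pulled out into a small stand-alone function. `demo_better` is a hypothetical name; the parameters mirror the `dr`, `dwgt`, `lr` and `lwgt` variables in the quoted lines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the candidate (rate dr, weight dwgt) should replace the current
 * least-loaded choice (rate lr, weight lwgt). */
static bool demo_better(uint32_t dr, uint32_t dwgt, uint32_t lr, uint32_t lwgt)
{
    /* dr/dwgt < lr/lwgt, cross-multiplied in 64 bits so there is no
     * division and no 32-bit overflow. */
    if ((uint64_t)dr * lwgt < (uint64_t)lr * dwgt)
        return true;

    /* For equal non-zero rates the test above already favours the heavier
     * weight. Only when both rates are zero do both products collapse to
     * 0, and this second test is what still picks the heavier server. */
    return dr == lr && dwgt > lwgt;
}
```

For example, with no traffic on either server, `demo_better(0, 200, 0, 100)` is true only because of the second test; the cross-multiplication alone compares 0 with 0.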

Thanks,
Chris

From: Chris Caputo  

IPVS wlib (Weighted Least Incoming Byterate) and wlip (Weighted Least Incoming 
Packetrate) schedulers, updated for 3.19-rc5.

Signed-off-by: Chris Caputo 
---
diff -uprN linux-3.19-rc5-stock/net/netfilter/ipvs/Kconfig 
linux-3.19-rc5/net/netfilter/ipvs/Kconfig
--- linux-3.19-rc5-stock/net/netfilter/ipvs/Kconfig 2015-01-18 
06:02:20.0 +
+++ linux-3.19-rc5/net/netfilter/ipvs/Kconfig   2015-01-20 08:08:28.883080285 
+
@@ -240,6 +240,26 @@ config IP_VS_NQ
  If you want to compile it in kernel, say Y. To compile it as a
  module, choose M here. If unsure, say N.
 
+config IP_VS_WLIB
+   tristate "weighted least incoming byterate scheduling"
+   ---help---
+ The weighted least incoming byterate scheduling algorithm directs
+ network connections to the server with the least incoming byterate
+ normalized by the server weight.
+
+ If you want to compile it in kernel, say Y. To compile it as a
+ module, choose M here. If unsure, say N.
+
+config IP_VS_WLIP
+   tristate "weighted least incoming packetrate scheduling"
+   ---help---
+ The weighted least incoming packetrate scheduling algorithm directs
+ network connections to the server with the least incoming packetrate
+ normalized by the server weight.
+
+ If you want to compile it in kernel, say Y. To compile it as a
+ module, choose M here. If unsure, say N.
+
 comment 'IPVS SH scheduler'
 
 config IP_VS_SH_TAB_BITS
diff -uprN linux-3.19-rc5-stock/net/netfilter/ipvs/Makefile 
linux-3.19-rc5/net/netfilter/ipvs/Makefile
--- linux-3.19-rc5-stock/net/netfilter/ipvs/Makefile2015-01-18 
06:02:20.0 +
+++ linux-3.19-rc5/net/netfilter/ipvs/Makefile  2015-01-20 08:08:28.883080285 
+
@@ -33,6 +33,8 @@ obj-$(CONFIG_IP_VS_DH) += ip_vs_dh.o
 obj-$(CONFIG_IP_VS_SH) += ip_vs_sh.o
 obj-$(CONFIG_IP_VS_SED) += ip_vs_sed.o
 obj-$(CONFIG_IP_VS_NQ) += ip_vs_nq.o
+obj-$(CONFIG_IP_VS_WLIB) += ip_vs_wlib.o
+obj-$(CONFIG_IP_VS_WLIP) += ip_vs_wlip.o
 
 # IPVS application helpers
 obj-$(CONFIG_IP_VS_FTP) += ip_vs_ftp.o
diff -uprN linux-3.19-rc5-stock/net/netfilter/ipvs/ip_vs_wlib.c linux-3.19-rc5/net/netfilter/ipvs/ip_vs_wlib.c
--- linux-3.19-rc5-stock/net/netfilter/ipvs/ip_vs_wlib.c   1970-01-01 00:00:00.000000000 +0000
+++ linux-3.19-rc5/net/netfilter/ipvs/ip_vs_wlib.c   2015-01-20 08:09:00.177816054 +0000
@@ -0,0 +1,166 @@
+/* IPVS: Weighted Least Incoming Byterate Scheduling module
+ *
+ * Authors: Chris Caputo <ccap...@alt.net> based on code by:
+ *
+ *  Wensong Zhang <wens...@linuxvirtualserver.org>
+ *  Peter Kese <peter.k...@ijs.si>
+ *  Julian Anastasov <j...@ssi.bg>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ * Changes:
+ * Chris Caputo: Based code on ip_vs_wlc.c ip_vs_rr.c.
+ *
+ */
+
+/* The WLIB algorithm uses the results of the estimator's inbps
+ * calculations to determine which real server has the lowest incoming
+ * byterate.
+ *
+ * Real server weight is factored into the calculation.  An example way to
+ * use this is if you have one server that can handle 100 Mbps of input and
+ * another that can handle 1 Gbps you could set the weights to be 100 and 1000
+ * respectively.
+ */
+
+#define KMSG_COMPONENT "IPVS"
+#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+
+#include <net/ip_vs.h>
+
+static int
+ip_vs_wlib_init_svc(struct ip_vs_service *svc)
+{
+   svc->sched_data = &svc->destinations;
+   return 0;
+}
+
+static int
+ip_vs_wlib_del_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest)
+{
+   struct list_head *p;
+
+   spin_lock_bh(&svc->sched_lock);
+   p = (struct list_head *)svc->sched_data;
+   /* dest is already unlinked, so p->prev is not valid but
+    * p->next is valid, use it to reach previous entry.
+    */
+   if (p == &dest->n_list)
+           svc->sched_data = p->next->prev;
+   spin_unlock_bh(&svc->sched_lock);
+   return 0;
+}
+
+/* Weighted Least Incoming Byterate scheduling */
+static struct ip_vs_dest *
+ip_vs_wlib_schedule(struct ip_vs_service *svc, const struct sk_buff *skb,
+   struct ip_vs_iphdr *iph)
+{
+   struct 

[PATCH 1/3] IPVS: add wlib & wlip schedulers

2015-01-20 Thread Chris Caputo
On Tue, 20 Jan 2015, Julian Anastasov wrote:
> On Sat, 17 Jan 2015, Chris Caputo wrote:
> > From: Chris Caputo <ccap...@alt.net>
> > 
> > IPVS wlib (Weighted Least Incoming Byterate) and wlip (Weighted Least 
> > Incoming Packetrate) schedulers, updated for 3.19-rc4.

Hi Julian,

Thanks for the review.

>   The IPVS estimator uses 2-second timer to update
> the stats, isn't that a problem for such schedulers?
> Also, you schedule by incoming traffic rate which is
> ok when clients mostly upload. But in the common case
> clients mostly download and IPVS processes download
> traffic only for NAT method.

My application consists of incoming TCP streams being load balanced to 
servers which receive the feeds. These are long lived multi-gigabyte 
streams, and so I believe the estimator's 2-second timer is fine. As an 
example:

# cat /proc/net/ip_vs_stats
   Total Incoming Outgoing Incoming Outgoing
   Conns  Packets  PacketsBytesBytes
 9AB  58B7C170  1237CA2C3250

 Conns/s   Pkts/s   Pkts/s  Bytes/s  Bytes/s
   1 387C0  B16C4AE0

>   May be not so useful idea: use sum of both directions
> or control it with svc->flags & IP_VS_SVC_F_SCHED_WLIB_xxx
> flags, see how sh scheduler supports flags. I.e.
> inbps + outbps.

I see a user-mode option as increasing complexity. For example, 
keepalived users would need to have keepalived patched to support the new 
algorithm, due to flags, rather than just configuring wlib or wlip and 
it just working.

I think I'd rather see a wlob/wlop version for users that want to 
load-balance based on outgoing bytes/packets, and a wlb/wlp version for 
users that want them summed.

>   Another problem: pps and bps are shifted values,
> see how ip_vs_read_estimator() reads them. ip_vs_est.c
> contains comments that this code handles couple of
> gigabits. May be inbps and outbps in struct ip_vs_estimator
> should be changed to u64 to support more gigabits, with
> separate patch.

See patch below to convert bps in ip_vs_estimator to 64-bits.
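
For readers unfamiliar with the "shifted values" mentioned above, the following is a rough user-space sketch of the fixed-point averaging the estimator performs every 2 seconds. The function names `est_tick` and `est_read` are hypothetical, and the rounding in `est_read` is an assumption about how the reader side converts the value back; it is not copied from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* One estimator tick.  'delta' is the counter increase over the 2-second
 * interval and 'shift' is the fixed-point scale (5 for bps, 10 for pps/cps).
 * Shifting by (shift - 1) scales by 2^shift and divides by 2 seconds at once.
 * Note: >> on a negative int64_t is implementation-defined in ISO C; this
 * sketch assumes an arithmetic shift, as common compilers provide.
 */
static uint64_t est_tick(uint64_t avg, uint64_t delta, unsigned shift)
{
    uint64_t rate;

    assert(shift >= 1 && shift < 32);
    rate = delta << (shift - 1);
    /* Move the average a quarter of the way toward the new sample. */
    avg += (uint64_t)(((int64_t)rate - (int64_t)avg) >> 2);
    return avg;
}

/* Convert the scaled average back to a per-second value, rounding to the
 * nearest unit (assumed rounding; see the lead-in). */
static uint64_t est_read(uint64_t avg, unsigned shift)
{
    return (avg + ((UINT64_C(1) << (shift - 1)) - 1)) >> shift;
}
```

Feeding a steady 2000 bytes per tick (1000 bytes/s) drives the scaled average toward 1000 * 2^5, and `est_read` then reports 1000. A 32-bit bytes-per-second value tops out just under 2^32 B/s, about 34.36 Gbit/s, which appears to be where the ~34.35Gbits/s figure in the patch description comes from.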

Other patches, based on your feedback, to follow.

Thanks,
Chris

From: Chris Caputo <ccap...@alt.net>

IPVS: Change inbps and outbps to 64-bits so that estimator handles faster
flows. Also increases maximum viewable at user level from ~2.15Gbits/s to
~34.35Gbits/s.

Signed-off-by: Chris Caputo <ccap...@alt.net>
---
diff -uprN linux-3.19-rc5-stock/include/net/ip_vs.h linux-3.19-rc5/include/net/ip_vs.h
--- linux-3.19-rc5-stock/include/net/ip_vs.h   2015-01-18 06:02:20.000000000 +0000
+++ linux-3.19-rc5/include/net/ip_vs.h   2015-01-20 08:01:15.548177969 +0000
@@ -390,8 +390,8 @@ struct ip_vs_estimator {
u32 cps;
u32 inpps;
u32 outpps;
-   u32 inbps;
-   u32 outbps;
+   u64 inbps;
+   u64 outbps;
 };
 
 struct ip_vs_stats {
diff -uprN linux-3.19-rc5-stock/net/netfilter/ipvs/ip_vs_est.c linux-3.19-rc5/net/netfilter/ipvs/ip_vs_est.c
--- linux-3.19-rc5-stock/net/netfilter/ipvs/ip_vs_est.c   2015-01-18 06:02:20.000000000 +0000
+++ linux-3.19-rc5/net/netfilter/ipvs/ip_vs_est.c   2015-01-20 08:01:34.369840704 +0000
@@ -45,10 +45,12 @@
 
   NOTES.
 
-  * The stored value for average bps is scaled by 2^5, so that maximal
-rate is ~2.15Gbits/s, average pps and cps are scaled by 2^10.
+  * Average bps is scaled by 2^5, while average pps and cps are scaled by 2^10.
 
-  * A lot code is taken from net/sched/estimator.c
+  * All are reported to user level as 32 bit unsigned values. Bps can
+overflow for fast links : max speed being ~34.35Gbits/s.
+
+  * A lot of code is taken from net/core/gen_estimator.c
  */
 
 
@@ -98,7 +100,7 @@ static void estimation_timer(unsigned lo
u32 n_conns;
u32 n_inpkts, n_outpkts;
u64 n_inbytes, n_outbytes;
-   u32 rate;
+   u64 rate;
struct net *net = (struct net *)arg;
struct netns_ipvs *ipvs;
 
@@ -118,23 +120,24 @@ static void estimation_timer(unsigned lo
/* scaled by 2^10, but divided 2 seconds */
rate = (n_conns - e->last_conns) << 9;
e->last_conns = n_conns;
-   e->cps += ((long)rate - (long)e->cps) >> 2;
+   e->cps += ((s64)rate - (s64)e->cps) >> 2;
 
rate = (n_inpkts - e->last_inpkts) << 9;
e->last_inpkts = n_inpkts;
-   e->inpps += ((long)rate - (long)e->inpps) >> 2;
+   e->inpps += ((s64)rate - (s64)e->inpps) >> 2;
 
rate = (n_outpkts - e->last_outpkts) << 9;
e->last_outpkts = n_outpkts;
-   e->outpps += ((long)rate - (long)e->outpps) >> 2;
+   e->outpps += ((s64)rate - (s64)e->outpps) >> 2;
 
+   /* scaled by 2^5, but divided 2 seconds */
rate = (n_inbytes - e->last_inbytes) << 4;
e

[PATCH 3/3] IPVS: add wlib & wlip schedulers

2015-01-20 Thread Chris Caputo
From: Chris Caputo <ccap...@alt.net>

IPVS wlib (Weighted Least Incoming Byterate) and wlip (Weighted Least Incoming 
Packetrate) scheduler docs for ipvsadm-1.27.

Signed-off-by: Chris Caputo <ccap...@alt.net>
---
diff -upr ipvsadm-1.27-stock/SCHEDULERS ipvsadm-1.27/SCHEDULERS
--- ipvsadm-1.27-stock/SCHEDULERS   2013-09-06 08:37:27.000000000 +0000
+++ ipvsadm-1.27/SCHEDULERS   2015-01-17 22:14:32.812597191 +0000
@@ -1 +1 @@
-rr|wrr|lc|wlc|lblc|lblcr|dh|sh|sed|nq
+rr|wrr|lc|wlc|lblc|lblcr|dh|sh|sed|nq|wlib|wlip
diff -upr ipvsadm-1.27-stock/ipvsadm.8 ipvsadm-1.27/ipvsadm.8
--- ipvsadm-1.27-stock/ipvsadm.8   2013-09-06 08:37:27.000000000 +0000
+++ ipvsadm-1.27/ipvsadm.8   2015-01-17 22:14:32.812597191 +0000
@@ -261,6 +261,14 @@ fixed service rate (weight) of the ith s
 \fBnq\fR - Never Queue: assigns an incoming job to an idle server if
 there is, instead of waiting for a fast one; if all the servers are
 busy, it adopts the Shortest Expected Delay policy to assign the job.
+.sp
+\fBwlib\fR - Weighted Least Incoming Byterate: directs network
+connections to the real server with the least incoming byterate
+normalized by the server weight.
+.sp
+\fBwlip\fR - Weighted Least Incoming Packetrate: directs network
+connections to the real server with the least incoming packetrate
+normalized by the server weight.
 .TP
 .B -p, --persistent [\fItimeout\fP]
 Specify that a virtual service is persistent. If this option is
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/2] IPVS: add wlib & wlip schedulers

2015-01-17 Thread Chris Caputo
From: Chris Caputo <ccap...@alt.net>

IPVS wlib (Weighted Least Incoming Byterate) and wlip (Weighted Least Incoming 
Packetrate) scheduler docs for ipvsadm-1.27.

Signed-off-by: Chris Caputo <ccap...@alt.net>
---
diff -upr ipvsadm-1.27-stock/SCHEDULERS ipvsadm-1.27/SCHEDULERS
--- ipvsadm-1.27-stock/SCHEDULERS   2013-09-06 08:37:27.000000000 +0000
+++ ipvsadm-1.27/SCHEDULERS   2015-01-17 22:14:32.812597191 +0000
@@ -1 +1 @@
-rr|wrr|lc|wlc|lblc|lblcr|dh|sh|sed|nq
+rr|wrr|lc|wlc|lblc|lblcr|dh|sh|sed|nq|wlib|wlip
diff -upr ipvsadm-1.27-stock/ipvsadm.8 ipvsadm-1.27/ipvsadm.8
--- ipvsadm-1.27-stock/ipvsadm.8   2013-09-06 08:37:27.000000000 +0000
+++ ipvsadm-1.27/ipvsadm.8   2015-01-17 22:14:32.812597191 +0000
@@ -261,6 +261,14 @@ fixed service rate (weight) of the ith s
 \fBnq\fR - Never Queue: assigns an incoming job to an idle server if
 there is, instead of waiting for a fast one; if all the servers are
 busy, it adopts the Shortest Expected Delay policy to assign the job.
+.sp
+\fBwlib\fR - Weighted Least Incoming Byterate: directs network
+connections to the real server with the least incoming byterate
+normalized by the server weight.
+.sp
+\fBwlip\fR - Weighted Least Incoming Packetrate: directs network
+connections to the real server with the least incoming packetrate
+normalized by the server weight.
 .TP
 .B -p, --persistent [\fItimeout\fP]
 Specify that a virtual service is persistent. If this option is


[PATCH 1/2] IPVS: add wlib & wlip schedulers

2015-01-17 Thread Chris Caputo
Wensong, this is something we discussed 10 years ago and you liked it, but 
it didn't actually get into the kernel.  I've updated it, tested it, and 
would like to work toward inclusion.

Thanks,
Chris

---
From: Chris Caputo <ccap...@alt.net>

IPVS wlib (Weighted Least Incoming Byterate) and wlip (Weighted Least Incoming 
Packetrate) schedulers, updated for 3.19-rc4.

Signed-off-by: Chris Caputo <ccap...@alt.net>
---
diff -uprN linux-3.19-rc4-stock/net/netfilter/ipvs/Kconfig linux-3.19-rc4/net/netfilter/ipvs/Kconfig
--- linux-3.19-rc4-stock/net/netfilter/ipvs/Kconfig   2015-01-11 20:44:53.000000000 +0000
+++ linux-3.19-rc4/net/netfilter/ipvs/Kconfig   2015-01-17 22:47:52.250301042 +0000
@@ -240,6 +240,26 @@ config IP_VS_NQ
  If you want to compile it in kernel, say Y. To compile it as a
  module, choose M here. If unsure, say N.
 
+config IP_VS_WLIB
+   tristate "weighted least incoming byterate scheduling"
+   ---help---
+ The weighted least incoming byterate scheduling algorithm directs
+ network connections to the server with the least incoming byterate
+ normalized by the server weight.
+
+ If you want to compile it in kernel, say Y. To compile it as a
+ module, choose M here. If unsure, say N.
+
+config IP_VS_WLIP
+   tristate "weighted least incoming packetrate scheduling"
+   ---help---
+ The weighted least incoming packetrate scheduling algorithm directs
+ network connections to the server with the least incoming packetrate
+ normalized by the server weight.
+
+ If you want to compile it in kernel, say Y. To compile it as a
+ module, choose M here. If unsure, say N.
+
 comment 'IPVS SH scheduler'
 
 config IP_VS_SH_TAB_BITS
diff -uprN linux-3.19-rc4-stock/net/netfilter/ipvs/Makefile linux-3.19-rc4/net/netfilter/ipvs/Makefile
--- linux-3.19-rc4-stock/net/netfilter/ipvs/Makefile   2015-01-11 20:44:53.000000000 +0000
+++ linux-3.19-rc4/net/netfilter/ipvs/Makefile   2015-01-17 22:47:35.421861075 +0000
@@ -33,6 +33,8 @@ obj-$(CONFIG_IP_VS_DH) += ip_vs_dh.o
 obj-$(CONFIG_IP_VS_SH) += ip_vs_sh.o
 obj-$(CONFIG_IP_VS_SED) += ip_vs_sed.o
 obj-$(CONFIG_IP_VS_NQ) += ip_vs_nq.o
+obj-$(CONFIG_IP_VS_WLIB) += ip_vs_wlib.o
+obj-$(CONFIG_IP_VS_WLIP) += ip_vs_wlip.o
 
 # IPVS application helpers
 obj-$(CONFIG_IP_VS_FTP) += ip_vs_ftp.o
diff -uprN linux-3.19-rc4-stock/net/netfilter/ipvs/ip_vs_wlib.c linux-3.19-rc4/net/netfilter/ipvs/ip_vs_wlib.c
--- linux-3.19-rc4-stock/net/netfilter/ipvs/ip_vs_wlib.c   1970-01-01 00:00:00.000000000 +0000
+++ linux-3.19-rc4/net/netfilter/ipvs/ip_vs_wlib.c   2015-01-17 22:47:35.421861075 +0000
@@ -0,0 +1,156 @@
+/* IPVS: Weighted Least Incoming Byterate Scheduling module
+ *
+ * Authors: Chris Caputo <ccap...@alt.net> based on code by:
+ *
+ *  Wensong Zhang <wens...@linuxvirtualserver.org>
+ *  Peter Kese <peter.k...@ijs.si>
+ *  Julian Anastasov <j...@ssi.bg>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ * Changes:
+ * Chris Caputo: Based code on ip_vs_wlc.c ip_vs_rr.c.
+ *
+ */
+
+/* The WLIB algorithm uses the results of the estimator's inbps
+ * calculations to determine which real server has the lowest incoming
+ * byterate.
+ *
+ * Real server weight is factored into the calculation.  An example way to
+ * use this is if you have one server that can handle 100 Mbps of input and
+ * another that can handle 1 Gbps you could set the weights to be 100 and 1000
+ * respectively.
+ */
+
+#define KMSG_COMPONENT "IPVS"
+#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+
+#include <net/ip_vs.h>
+
+static int
+ip_vs_wlib_init_svc(struct ip_vs_service *svc)
+{
+   svc->sched_data = &svc->destinations;
+   return 0;
+}
+
+static int
+ip_vs_wlib_del_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest)
+{
+   struct list_head *p;
+
+   spin_lock_bh(&svc->sched_lock);
+   p = (struct list_head *)svc->sched_data;
+   /* dest is already unlinked, so p->prev is not valid but
+    * p->next is valid, use it to reach previous entry.
+    */
+   if (p == &dest->n_list)
+           svc->sched_data = p->next->prev;
+   spin_unlock_bh(&svc->sched_lock);
+   return 0;
+}
+
+/* Weighted Least Incoming Byterate scheduling */
+static struct ip_vs_dest *
+ip_vs_wlib_schedule(struct ip_vs_service *svc, const struct sk_buff *skb,
+   struct ip_vs_iphdr *iph)
+{
+   struct list_head *p, *q;
+   struct ip_vs_dest *dest, *least = NULL;
+   u32 dr, lr = -1;
+   int dwgt, lwgt = 0;
+
+   IP_VS_DBG(6, "%s(): Scheduling...\n", __func__);
+
+   /* We calculate the load of each dest server as follows:
+   

Re: [PATCH 2.6.19-rc6] sched: cleanup output of show_state/show_task

2006-11-27 Thread Chris Caputo
On Mon, 27 Nov 2006, Andrew Morton wrote:
> On Sat, 25 Nov 2006 04:48:15 +0000 (GMT)
> Chris Caputo <[EMAIL PROTECTED]> wrote:
> > This patch cleans up the output of show_state/task() (aka magic-sysrq-t) 
> > so that free stack space is printed as appropriate based on 
> > CONFIG_DEBUG_STACK_USAGE.
> > 
> > Also, without this patch the header is not aligned with the data and is 
> > thus confusing.  Free stack is labeled as pid, pid is labeled as father, 
> > and so on.
> > 
> > Signed-off-by: Chris Caputo <[EMAIL PROTECTED]>
> > ---
> > 
> > diff -uprN a/kernel/sched.c b/kernel/sched.c
> > --- a/kernel/sched.c   2006-11-25 04:11:12.000000000 +0000
> > +++ b/kernel/sched.c   2006-11-25 04:13:07.000000000 +0000
> > @@ -4757,7 +4757,6 @@ static const char stat_nam[] = "RSDTtZX"
> >  static void show_task(struct task_struct *p)
> >  {
> > struct task_struct *relative;
> > -   unsigned long free = 0;
> > unsigned state;
> >  
> > state = p->state ? __ffs(p->state) + 1 : 0;
> > @@ -4779,10 +4778,10 @@ static void show_task(struct task_struct
> > unsigned long *n = end_of_stack(p);
> > while (!*n)
> > n++;
> > -   free = (unsigned long)n - (unsigned long)end_of_stack(p);
> > +   printk("%5lu ", (unsigned long)n - (unsigned long)end_of_stack(p));
> > }
> >  #endif
> > -   printk("%5lu %5d %6d ", free, p->pid, p->parent->pid);
> > +   printk("%5d %6d ", p->pid, p->parent->pid);
> 
> This will cause the output format to be dependent upon the setting of
> CONFIG_DEBUG_STACK_USAGE.  So any code which attempts to parse the output
> of this function will somehow need to work out whether or not the `free'
> field is present.
> 
> Which is why we still print out a zero if CONFIG_DEBUG_STACK_USAGE=n.

Ahh!

Should we make it so the header printed by show_state is aligned properly 
with the data?  If yes, please consider the below patch.

Chris

---
From: Chris Caputo <[EMAIL PROTECTED]>
[PATCH 2.6.19-rc6] sched: correct output of show_state

At present show_state prints a header that does not match the output of 
show_task, as follows:

-
   sibling
  task PC  pid father child younger older
init  S  0 1  0 2   (NOTLB)
-

This patch corrects the output of show_state so that the header is 
aligned with the data, ala:

-
 freesibling
  task PCstack   pid father child younger older
init  S  0 1  0 2   (NOTLB)
-

Signed-off-by: Chris Caputo <[EMAIL PROTECTED]>
---

--- a/kernel/sched.c   2006-11-27 08:40:56.000000000 +0000
+++ b/kernel/sched.c   2006-11-27 23:23:49.000000000 +0000
@@ -4810,12 +4810,12 @@ void show_state(void)
 
 #if (BITS_PER_LONG == 32)
printk("\n"
-  "   sibling\n");
-   printk("  task PC  pid father child younger older\n");
+  " free
sibling\n");
+   printk("  task PCstack   pid father child younger 
older\n");
 #else
printk("\n"
-  "   
sibling\n");
-   printk("  task PC  pid father child younger 
older\n");
+  " free
sibling\n");
+   printk("  task PCstack   pid father child 
younger older\n");
 #endif
read_lock(&tasklist_lock);
do_each_thread(g, p) {


Re: [PATCH 2.6.19-rc6] sunrpc: fix race condition

2006-11-27 Thread Chris Caputo
On Mon, 27 Nov 2006, Chris Caputo wrote:
> From: Chris Caputo <[EMAIL PROTECTED]>
> [PATCH 2.6.19-rc6] sunrpc: fix race condition

Turns out my patch is buggy.  Don't use it.

Chris


[PATCH 2.6.19-rc6] sunrpc: fix race condition

2006-11-27 Thread Chris Caputo
From: Chris Caputo <[EMAIL PROTECTED]>
[PATCH 2.6.19-rc6] sunrpc: fix race condition

Patch linux-2.6.10-01-rpc_workqueue.dif introduced a race condition into 
net/sunrpc/sched.c in kernels 2.6.11-rc1 through 2.6.19-rc6.  The race 
scenario is as follows...

Given: RPC_TASK_QUEUED, RPC_TASK_RUNNING and RPC_TASK_ASYNC are set.

__rpc_execute() (no spinlock)        rpc_make_runnable() (queue spinlock held)
-----------------------------        -----------------------------------------
                                     do_ret = rpc_test_and_set_running(task);
rpc_clear_running(task);
if (RPC_IS_ASYNC(task)) {
        if (RPC_IS_QUEUED(task))
                return 0;
                                     rpc_clear_queued(task);
                                     if (do_ret)
                                             return;

Thus both threads return and the task is abandoned forever.

In my test NFS client usage (~200 Mb/s at ~3,000 RPC calls/s) this race 
condition has resulted in processes getting permanently stuck in 'D' state 
often in less than 15 minutes of uptime.

The following patch fixes the problem by returning to use of a spinlock in 
__rpc_execute().

Signed-off-by: Chris Caputo <[EMAIL PROTECTED]>
---

diff -up a/net/sunrpc/sched.c b/net/sunrpc/sched.c
--- a/net/sunrpc/sched.c   2006-11-27 08:41:07.000000000 +0000
+++ b/net/sunrpc/sched.c   2006-11-27 11:14:21.000000000 +0000
@@ -587,6 +587,7 @@ EXPORT_SYMBOL(rpc_exit_task);
 static int __rpc_execute(struct rpc_task *task)
 {
int status = 0;
+   struct rpc_wait_queue *queue;
 
dprintk("RPC: %4d rpc_execute flgs %x\n",
task->tk_pid, task->tk_flags);
@@ -631,22 +632,27 @@ static int __rpc_execute(struct rpc_task
lock_kernel();
task->tk_action(task);
unlock_kernel();
+   /* micro-optimization to avoid spinlock */
+   if (!RPC_IS_QUEUED(task))
+   continue;
}
 
/*
-* Lockless check for whether task is sleeping or not.
+* Check whether task is sleeping.
 */
-   if (!RPC_IS_QUEUED(task))
-   continue;
-   rpc_clear_running(task);
+   queue = task->u.tk_wait.rpc_waitq;
+   spin_lock_bh(&queue->lock);
if (RPC_IS_ASYNC(task)) {
-   /* Careful! we may have raced... */
-   if (RPC_IS_QUEUED(task))
-   return 0;
-   if (rpc_test_and_set_running(task))
+   if (RPC_IS_QUEUED(task)) {
+   rpc_clear_running(task);
+   spin_unlock_bh(&queue->lock);
return 0;
+   }
+   spin_unlock_bh(&queue->lock);
continue;
}
+   rpc_clear_running(task);
+   spin_unlock_bh(&queue->lock);
 
/* sync task: sleep here */
dprintk("RPC: %4d sync task going to sleep\n", task->tk_pid);


Re: [PATCH 2.6.19-rc6] sunrpc: fix race condition

2006-11-27 Thread Chris Caputo
On Mon, 27 Nov 2006, Chris Caputo wrote:
> From: Chris Caputo <[EMAIL PROTECTED]>
> [PATCH 2.6.19-rc6] sunrpc: fix race condition

Turns out my patch is buggy.  Don't use it.

Chris


Re: [PATCH 2.6.19-rc6] sched: cleanup output of show_state/show_task

2006-11-27 Thread Chris Caputo
On Mon, 27 Nov 2006, Andrew Morton wrote:
> On Sat, 25 Nov 2006 04:48:15 +0000 (GMT)
> Chris Caputo <[EMAIL PROTECTED]> wrote:
> > This patch cleans up the output of show_state/task() (aka magic-sysrq-t) 
> > so that free stack space is printed as appropriate based on 
> > CONFIG_DEBUG_STACK_USAGE.
> > 
> > Also, without this patch the header is not aligned with the data and is 
> > thus confusing.  Free stack is labeled as pid, pid is labeled as father, 
> > and so on.
> > 
> > Signed-off-by: Chris Caputo <[EMAIL PROTECTED]>
> > ---
> > 
> > diff -uprN a/kernel/sched.c b/kernel/sched.c
> > --- a/kernel/sched.c 2006-11-25 04:11:12.0 +
> > +++ b/kernel/sched.c 2006-11-25 04:13:07.0 +
> > @@ -4757,7 +4757,6 @@ static const char stat_nam[] = "RSDTtZX"
> >  static void show_task(struct task_struct *p)
> >  {
> > struct task_struct *relative;
> > -   unsigned long free = 0;
> > unsigned state;
> >  
> > state = p->state ? __ffs(p->state) + 1 : 0;
> > @@ -4779,10 +4778,10 @@ static void show_task(struct task_struct
> > unsigned long *n = end_of_stack(p);
> > while (!*n)
> > n++;
> > -   free = (unsigned long)n - (unsigned long)end_of_stack(p);
> > +   printk("%5lu ", (unsigned long)n - (unsigned 
> > long)end_of_stack(p));
> > }
> >  #endif
> > -   printk("%5lu %5d %6d ", free, p->pid, p->parent->pid);
> > +   printk("%5d %6d ", p->pid, p->parent->pid);
> 
> This will cause the output format to be dependent upon the setting of
> CONFIG_DEBUG_STACK_USAGE.  So any code which attempts to parse the output
> of this function will somehow need to work out whether or not the `free'
> field is present.
> 
> Which is why we still print out a zero if CONFIG_DEBUG_STACK_USAGE=n.

Ahh!

Should we make it so the header printed by show_state is aligned properly 
with the data?  If yes, please consider the below patch.

Chris

---
From: Chris Caputo [EMAIL PROTECTED]
[PATCH 2.6.19-rc6] sched: correct output of show_state

At present show_state prints a header that does not match the output of 
show_task, as follows:

-
   sibling
  task PC  pid father child younger older
init  S  0 1  0 2   (NOTLB)
-

This patch corrects the output of show_state so that the header is 
aligned with the data, ala:

-
 freesibling
  task PCstack   pid father child younger older
init  S  0 1  0 2   (NOTLB)
-

Signed-off-by: Chris Caputo [EMAIL PROTECTED]
---

--- a/kernel/sched.c2006-11-27 08:40:56.0 +
+++ b/kernel/sched.c2006-11-27 23:23:49.0 +
@@ -4810,12 +4810,12 @@ void show_state(void)
 
 #if (BITS_PER_LONG == 32)
printk(\n
- sibling\n);
-   printk(  task PC  pid father child younger older\n);
+   free
sibling\n);
+   printk(  task PCstack   pid father child younger 
older\n);
 #else
printk(\n
- 
sibling\n);
-   printk(  task PC  pid father child younger 
older\n);
+   free
sibling\n);
+   printk(  task PCstack   pid father child 
younger older\n);
 #endif
read_lock(&tasklist_lock);
do_each_thread(g, p) {


Re: 3ware driver (3w-xxxx) in 2.6.10: procfs entry

2005-01-16 Thread Chris Caputo
On Mon, 10 Jan 2005, Ricky Beam wrote:
> On Mon, 10 Jan 2005, Peter Daum wrote:
> >On Mon, 10 Jan 2005, Christoph Hellwig wrote:
> >> The change came from the driver maintainer at 3ware.  Get the updated
> >> tools from their website.
> >
> >Which website do you mean? The programs in the download section of
> >"www.3ware.com" are just the ones that don't work anymore.
> 
> Yeap.  The "idiot" removed the proc interface from the driver before
> publishing the updated tools  -- and I said so at the time.  At the time
> the interface was removed, the new tools weren't available - period.  They
> are still "beta" today (several months later.)
> 
> Just put the procfs code back in the driver in your local tree and
> walk away.  That's what I did -- but it doesn't look like I commited
> it to any BK tree :-( (and that box is *ahem* powered off)

Or just grab the latest (version 9.1.5.2) 3dm and tw_cli software from the
3ware web site.  These may not be listed as being for your version of the
card, but they will work with the new driver.

Chris



bdev_lock deadlock in 2.6.10-ac8 / e1000 / rfc2385 patch

2005-01-16 Thread Chris Caputo
I've been seeing bdev_lock based deadlocks since 2.6.9.  Here's the latest
one with 2.6.10-ac8.  Not sure if the problem is related to the e1000
driver (with NAPI) or the rfc2385 patches or what.  Anyone else seeing
this?

Chris

--
2.6.10-ac8 + rfc2385 md5 patch:

SysRq : Show Regs
Pid: 820, comm:   sh
EIP: 0060:[<c0309276>] CPU: 0
EIP is at _spin_lock+0x36/0x90
 EFLAGS: 0246Not tainted  (2.6.10-ac8)
EAX:  EBX:  ECX: c0355300 EDX: c0414000
ESI: c03a1600 EDI:  EBP: c0414fc4 DS: 007b ES: 007b
CR0: 8005003b CR2: b7fd6f68 CR3: 02463000 CR4: 06d0

>>EIP; c0309276 <_spin_lock+36/90>   <=

>>ECX; c0355300 <contig_page_data+0/e00>
>>EDX; c0414000 <softirq_stack+0/4000>
>>ESI; c03a1600 <bdev_lock+0/80>
>>EDI;  <__kernel_rt_sigreturn+1bbf/>
>>EBP; c0414fc4 <softirq_stack+fc4/4000>

 [<c02dc0f0>] defense_timer_handler+0x0/0x40
 [<c015bbed>] nr_blockdev_pages+0xd/0x60

 [<c013a601>] si_meminfo+0x21/0x40
 [<c02dbe97>] update_defense_level+0x17/0x270
 [<c0121469>] __mod_timer+0xf9/0x140
 [<c02336d4>] e1000_clean+0xb4/0xd0
 [<c02dc0f0>] defense_timer_handler+0x0/0x40
 [<c02dc0f8>] defense_timer_handler+0x8/0x40
 [<c0121dda>] run_timer_softirq+0xda/0x1a0
 [<c0278df1>] net_rx_action+0x81/0x110
 [<c011db1a>] __do_softirq+0xba/0xd0
 [<c01868f0>] meminfo_read_proc+0x0/0x240
 [<c0104cba>] do_softirq+0x4a/0x60
 ===
 [<c0133d59>] irq_exit+0x39/0x40
 [<c010309c>] apic_timer_interrupt+0x1c/0x24
 [<c01868f0>] meminfo_read_proc+0x0/0x240

 [<c013007b>] set_obsolete+0xfb/0x220
 [<c030925a>] _spin_lock+0x1a/0x90
 [<c015bbed>] nr_blockdev_pages+0xd/0x60
 [<c013a601>] si_meminfo+0x21/0x40
 [<c0186931>] meminfo_read_proc+0x41/0x240
 [<c01819c7>] proc_read_inode+0x17/0x40
 [<c016c79c>] d_rehash+0x6c/0x90

 [<c01848df>] proc_lookup+0x8f/0xd0
 [<c0139fd4>] __alloc_pages+0x1d4/0x370
 [<c0146981>] vma_merge+0xd1/0x1d0
 [<c01868f0>] meminfo_read_proc+0x0/0x240
 [<c01843e3>] proc_file_read+0xc3/0x250
 [<c0153cf8>] vfs_read+0xb8/0x130
 [<c0153fe1>] sys_read+0x51/0x80
 [<c0102649>] sysenter_past_esp+0x52/0x75
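
One possible reading of the trace above is that nr_blockdev_pages() reaches _spin_lock both from process context (the meminfo read) and from the timer softirq that interrupted it on the same CPU. The following is a minimal userspace sketch of that pattern, not a diagnosis: `toy_spinlock`, `toy_read_meminfo()` and the `bh_disabled` switch are hypothetical names used only to model a non-recursive lock being taken again from softirq context.

```c
#include <stdbool.h>

/* Hypothetical non-recursive lock, standing in for a kernel spinlock. */
struct toy_spinlock {
    bool held;
};

static bool toy_trylock(struct toy_spinlock *l)
{
    if (l->held)
        return false;
    l->held = true;
    return true;
}

static void toy_unlock(struct toy_spinlock *l)
{
    l->held = false;
}

/*
 * Models the softirq path (timer handler down to nr_blockdev_pages()).
 * A real spin_lock() would spin forever where this returns false.
 */
static bool toy_softirq_can_progress(struct toy_spinlock *l)
{
    if (!toy_trylock(l))
        return false;
    toy_unlock(l);
    return true;
}

/*
 * Models process context taking the lock while reading meminfo.
 * bh_disabled == false: the softirq fires while the lock is held.
 * bh_disabled == true:  the softirq is deferred until after unlock.
 * Returns true if the softirq was able to take the lock.
 */
bool toy_read_meminfo(bool bh_disabled)
{
    struct toy_spinlock bdev = { false };
    bool progressed = true;

    (void)toy_trylock(&bdev);           /* process context holds the lock */
    if (!bh_disabled)
        progressed = toy_softirq_can_progress(&bdev);
    toy_unlock(&bdev);
    if (bh_disabled)
        progressed = toy_softirq_can_progress(&bdev);

    return progressed;
}
```

In this model `toy_read_meminfo(false)` returns false, the case where the interrupted CPU could never release the lock, while `toy_read_meminfo(true)` returns true. Whether deferring softirqs is the appropriate fix for the real code is not established in this thread.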


