Re: [RFC PATCH 1/1] net/ipv4: Enable flow-based ECMP

2016-06-17 Thread Jean He
Richard Laing  alliedtelesis.co.nz> writes:

> >>> Fair enough, I will look at making it a sysctl option. I guess the
> >>> default can be the current behaviour.

> Cheers
> Richard
> 


Hi Richard,

Could you please share if this new patch for per-flow IPv4 ECMP has landed?

If so, which version should I use? 

And is there a sysctl option that we can just use to enable per-flow hashing?

Many thanks!
Jean



Re: [PATCH 1/1] net/ipv4: Enable flow-based ECMP

2015-08-04 Thread Oleg A Arkhangelsky


04.08.2015, 04:29, Richard Laing richard.la...@alliedtelesis.co.nz:
 Enable flow-based ECMP.

Looks like your approach is only restricted to the case when sockets are
involved. What about forwarded traffic (IP routing case)?

Have you seen similar work done by Peter Nørlund?

http://comments.gmane.org/gmane.linux.kernel.api/12201

--
wbr, Oleg.

Anarchy is about taking complete responsibility for yourself.
  Alan Moore.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] net/ipv4: Enable flow-based ECMP

2015-08-04 Thread Tom Herbert
On Tue, Aug 4, 2015 at 12:37 AM, Oleg A Arkhangelsky syso...@yandex.ru wrote:


 04.08.2015, 04:29, Richard Laing richard.la...@alliedtelesis.co.nz:
 Enable flow-based ECMP.

 Looks like your approach is only restricted to the case when sockets are
 involved. What about forwarded traffic (IP routing case)?

skb_get_hash_flowi4 can be called to get the hash now. This should
work with or without (connected) sockets.

Thanks,
Tom

 Have you seen similar work done by Peter Nørlund?

 http://comments.gmane.org/gmane.linux.kernel.api/12201

 --
 wbr, Oleg.

 Anarchy is about taking complete responsibility for yourself.
   Alan Moore.
 --
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] net/ipv4: Enable flow-based ECMP

2015-08-04 Thread Richard Laing
Hi Oleg,

I hadn't seen that patch, thanks, it looks like a pretty thorough solution.

Best Regards,
Richard

On 08/04/2015 07:37 PM, Oleg A Arkhangelsky wrote:

 04.08.2015, 04:29, Richard Laing richard.la...@alliedtelesis.co.nz:
 Enable flow-based ECMP.
 Looks like your approach is only restricted to the case when sockets are
 involved. What about forwarded traffic (IP routing case)?

 Have you seen similar work done by Peter Nørlund?

 http://comments.gmane.org/gmane.linux.kernel.api/12201

 --
 wbr, Oleg.

 Anarchy is about taking complete responsibility for yourself.
Alan Moore.

-- 

*Richard Laing*
Software Team Leader*
Allied Telesis Labs*| 27 Nazareth Ave | Christchurch 8024 | New Zealand
Phone: +64 3 3393000 | DDI: +64 3 339 9248 | Web: *alliedtelesis.com 
http://alliedtelesis.com/*skype:andrewriddell3?chat


http://productselector.alliedtelesis.eu/ 
http://productselector.alliedtelesis.eu/N�r��yb�X��ǧv�^�)޺{.n�+���z�^�)w*jg����ݢj/���z�ޖ��2�ޙ�)ߡ�a�����G���h��j:+v���w��٥

Re: [PATCH 1/1] net/ipv4: Enable flow-based ECMP

2015-08-04 Thread Richard Laing
Hi Stephen,

Given that fib_nhs is currently an int I would rather keep live_nexthops 
also as an int to match, probably fib_nhs could at least be set as 
unsigned or changed to u16 or even u8 perhaps.

Best Regards,
Richard

On 08/04/2015 05:31 PM, Stephen Hemminger wrote:
 On Tue,  4 Aug 2015 13:28:47 +1200
 Richard Laing richard.la...@alliedtelesis.co.nz wrote:

 diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
 index 5fa643b..7db9f72 100644
 --- a/include/net/ip_fib.h
 +++ b/include/net/ip_fib.h
 @@ -117,6 +117,8 @@ struct fib_info {
   #ifdef CONFIG_IP_ROUTE_MULTIPATH
  int fib_power;
   #endif
 +/* Cache the number of live nexthops for flow based ECMP calculation. */
 +int live_nexthops;
 unsigned or u16 ? rather than risking sign issues.

-- 

*Richard Laing*
Software Team Leader*
Allied Telesis Labs*| 27 Nazareth Ave | Christchurch 8024 | New Zealand
Phone: +64 3 3393000 | DDI: +64 3 339 9248 | Web: *alliedtelesis.com 
http://alliedtelesis.com/*skype:andrewriddell3?chat


http://productselector.alliedtelesis.eu/ 
http://productselector.alliedtelesis.eu/--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/1] net/ipv4: Enable flow-based ECMP

2015-08-03 Thread Richard Laing
Enable flow-based ECMP.

Currently if equal-cost multipath is enabled the kernel chooses between
equal cost paths for each matching packet, essentially packets are
round-robined between the routes. This means that packets from a single
flow can traverse different routes. If one of the routes experiences
congestion this can result in delayed or out of order packets arriving
at the destination.

This patch allows packets to be routed based on their flow - packets
in the same flow will always use the same route. This prevents out of
order packets. There are other issues with round-robin based ECMP routing
related to variable path MTU handling and debugging. The default
behaviour is changed by this patch to enable flow based ECMP routing
rather than the previous round-robin routing. The behaviour can be
changed using a new sysctl option /net/ipv4/route/flow_based_ecmp.

See RFC2991 for more details on the problems associated with packet
based ECMP routing.

This patch relies on the skb hash value to select between routes. The
selection uses a hash-threshold algorithm (see RFC2992).

Signed-off-by: Richard Laing richard.la...@alliedtelesis.co.nz
---

 include/net/flow.h   |8 
 include/net/ip_fib.h |4 
 include/net/route.h  |2 ++
 net/ipv4/fib_semantics.c |   30 ++
 net/ipv4/route.c |   19 +++
 5 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index 8109a15..b0a2524 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -79,6 +79,8 @@ struct flowi4 {
 #define fl4_ipsec_spi  uli.spi
 #define fl4_mh_typeuli.mht.type
 #define fl4_gre_keyuli.gre_key
+
+   __u32   flowi4_hash;
 } __attribute__((__aligned__(BITS_PER_LONG/8)));
 
 static inline void flowi4_init_output(struct flowi4 *fl4, int oif,
@@ -99,6 +101,7 @@ static inline void flowi4_init_output(struct flowi4 *fl4, 
int oif,
fl4-saddr = saddr;
fl4-fl4_dport = dport;
fl4-fl4_sport = sport;
+   fl4-flowi4_hash = 0;
 }
 
 /* Reset some input parameters after previous lookup */
@@ -182,6 +185,11 @@ static inline struct flowi *flowidn_to_flowi(struct 
flowidn *fldn)
return container_of(fldn, struct flowi, u.dn);
 }
 
+static inline void flowi4_set_flow_hash(struct flowi4 *fl, __u32 hash)
+{
+   fl-flowi4_hash = hash;
+}
+
 typedef unsigned long flow_compare_t;
 
 static inline size_t flow_key_size(u16 family)
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 5fa643b..7db9f72 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -117,6 +117,8 @@ struct fib_info {
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
int fib_power;
 #endif
+   /* Cache the number of live nexthops for flow based ECMP calculation. */
+   int live_nexthops;
struct rcu_head rcu;
struct fib_nh   fib_nh[0];
 #define fib_devfib_nh[0].nh_dev
@@ -310,6 +312,8 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long 
event);
 int fib_sync_down_addr(struct net *net, __be32 local);
 int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
 void fib_select_multipath(struct fib_result *res);
+void fib_select_multipath_for_flow(struct fib_result *res,
+  const struct flowi4 *fl4);
 
 /* Exported by fib_trie.c */
 void fib_trie_init(void);
diff --git a/include/net/route.h b/include/net/route.h
index fe22d03..a00e606 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -252,6 +252,8 @@ static inline void ip_route_connect_init(struct flowi4 
*fl4, __be32 dst, __be32
 
flowi4_init_output(fl4, oif, sk-sk_mark, tos, RT_SCOPE_UNIVERSE,
   protocol, flow_flags, dst, src, dport, sport);
+
+   flowi4_set_flow_hash(fl4, sk-sk_txhash);
 }
 
 static inline struct rtable *ip_route_connect(struct flowi4 *fl4,
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 3a06586..0a56ad3 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -981,6 +981,7 @@ link_it:
head = fib_info_devhash[hash];
hlist_add_head(nexthop_nh-nh_hash, head);
} endfor_nexthops(fi)
+   fi-live_nexthops = fi-fib_nhs;
spin_unlock_bh(fib_info_lock);
return fi;
 
@@ -1196,6 +1197,7 @@ int fib_sync_down_dev(struct net_device *dev, unsigned 
long event)
}
ret++;
}
+   fi-live_nexthops = fi-fib_nhs - dead;
}
 
return ret;
@@ -1331,6 +1333,7 @@ int fib_sync_up(struct net_device *dev, unsigned int 
nh_flags)
if (alive  0) {
fi-fib_flags = ~nh_flags;
ret++;
+   fi-live_nexthops = alive;
}
}
 
@@ -1397,4 +1400,31 @@ void fib_select_multipath(struct 

Re: [PATCH 1/1] net/ipv4: Enable flow-based ECMP

2015-08-03 Thread Stephen Hemminger
On Tue,  4 Aug 2015 13:28:47 +1200
Richard Laing richard.la...@alliedtelesis.co.nz wrote:

 diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
 index 5fa643b..7db9f72 100644
 --- a/include/net/ip_fib.h
 +++ b/include/net/ip_fib.h
 @@ -117,6 +117,8 @@ struct fib_info {
  #ifdef CONFIG_IP_ROUTE_MULTIPATH
   int fib_power;
  #endif
 + /* Cache the number of live nexthops for flow based ECMP calculation. */
 + int live_nexthops;

unsigned or u16 ? rather than risking sign issues.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/1] net/ipv4: Enable flow-based ECMP

2015-07-29 Thread Tom Herbert
On Tue, Jul 28, 2015 at 11:11 PM, Michal Kubecek mkube...@suse.cz wrote:
 On Tue, Jul 28, 2015 at 09:20:02PM +, Richard Laing wrote:
  diff --git a/include/net/flow.h b/include/net/flow.h
  index 8109a15..d1d933d 100644
  --- a/include/net/flow.h
  +++ b/include/net/flow.h
  @@ -79,6 +79,10 @@ struct flowi4 {
 #define fl4_ipsec_spiuli.spi
 #define fl4_mh_typeuli.mht.type
 #define fl4_gre_keyuli.gre_key
  +
  +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
  Why bother making this a CONFIG, round robin is a miserable algorithm
  anyway and nearly all the other packet steering mechanisms already use
  a hash.

 Fair enough, I will look at making it a sysctl option. I guess the
 default can be the current behaviour.

 Hm... that's an interesting question. In general, it's better to use
 current behaviour as a default so that people are not surprised on
 upgrade. On the other hand, it used to be per-flow  - or rather per
 destination - earlier (until the routing cache removal, I believe) and
 per-flow distribution is IMHO preferrable in majority of use cases. In
 theory, there was a route attribute equalize to switch to per-packet
 distribution, but it was never actually implemented, AFAIK (it was
 recognized by ip and passed to kernel but ignored there).

 Anyway, config option is definitely inconvenient as most users install
 distribution kernels and do not configure their kernels themselves. Even
 boot parameter would be better - but sysctl sounds much better. Having
 both sysctl and per-route attribute would be perfect, of course.

IPv6 routing already uses a hash without any capability of setting
this to be round robin, probably every network device on the planet
performs ECMP using a hash since it is stateless algorithm, and as I
pointed most of our other packet steering mechanisms use a hash. So
the IPv4 path seems to be the odd man out here. Keeping round robin
seems around awfully conservative to me, and implies that we need to
implement it in IPv6 to maintain feature parity.

Tom

 Michal Kubecek

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/1] net/ipv4: Enable flow-based ECMP

2015-07-29 Thread Richard Laing


On 07/30/2015 04:12 AM, Tom Herbert wrote:
 On Tue, Jul 28, 2015 at 11:11 PM, Michal Kubecek mkube...@suse.cz wrote:
 On Tue, Jul 28, 2015 at 09:20:02PM +, Richard Laing wrote:
 diff --git a/include/net/flow.h b/include/net/flow.h
 index 8109a15..d1d933d 100644
 --- a/include/net/flow.h
 +++ b/include/net/flow.h
 @@ -79,6 +79,10 @@ struct flowi4 {
 #define fl4_ipsec_spiuli.spi
 #define fl4_mh_typeuli.mht.type
 #define fl4_gre_keyuli.gre_key
 +
 +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
 Why bother making this a CONFIG, round robin is a miserable algorithm
 anyway and nearly all the other packet steering mechanisms already use
 a hash.
 Fair enough, I will look at making it a sysctl option. I guess the
 default can be the current behaviour.
 Hm... that's an interesting question. In general, it's better to use
 current behaviour as a default so that people are not surprised on
 upgrade. On the other hand, it used to be per-flow  - or rather per
 destination - earlier (until the routing cache removal, I believe) and
 per-flow distribution is IMHO preferrable in majority of use cases. In
 theory, there was a route attribute equalize to switch to per-packet
 distribution, but it was never actually implemented, AFAIK (it was
 recognized by ip and passed to kernel but ignored there).

 Anyway, config option is definitely inconvenient as most users install
 distribution kernels and do not configure their kernels themselves. Even
 boot parameter would be better - but sysctl sounds much better. Having
 both sysctl and per-route attribute would be perfect, of course.

 IPv6 routing already uses a hash without any capability of setting
 this to be round robin, probably every network device on the planet
 performs ECMP using a hash since it is stateless algorithm, and as I
 pointed most of our other packet steering mechanisms use a hash. So
 the IPv4 path seems to be the odd man out here. Keeping round robin
 seems around awfully conservative to me, and implies that we need to
 implement it in IPv6 to maintain feature parity.

 Tom

Thanks Tom, I would certainly agree that flow based is preferable to the 
current behaviour and would be happy to make it the default. I will update my 
patch with the feedback and send an update, no ETA right now!

Cheers
Richard

--
Richard Laing
Software Team Leader
Allied Telesis Labs| 27 Nazareth Ave | Christchurch 8024 | New Zealand
Phone: +64 3 339 9248
Web: www.alliedtelesis.com


Re: [RFC PATCH 1/1] net/ipv4: Enable flow-based ECMP

2015-07-29 Thread David Miller
From: Michal Kubecek mkube...@suse.cz
Date: Wed, 29 Jul 2015 08:11:48 +0200

 Having both sysctl and per-route attribute would be perfect, of
 course.

+1
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/1] net/ipv4: Enable flow-based ECMP

2015-07-29 Thread Julian Anastasov

Hello,

On Tue, 28 Jul 2015, Richard Laing wrote:

 From: Richard Laing richard.la...@alliedtelesis.co.nz
 
 Enable flow-based ECMP.
 
 Currently if equal-cost multipath is enabled the kernel chooses between
 equal cost paths for each matching packet, essentially packets are
 round-robined between the routes. This means that packets from a single
 flow can traverse different routes. If one of the routes experiences
 congestion this can result in delayed or out of order packets arriving
 at the destination.
 
 This patch allows packets to be routed based on their
 flow - packets in the same flow will always use the same route. This
 prevents out of order packets. There are other issues with round-robin
 based ECMP routing related to variable path MTU handling and debugging.
 See RFC2991 for more details on the problems associated with packet
 based ECMP routing.
 
 This patch relies on the skb hash value to select between routes. The
 selection uses a hash-threshold algorithm (see RFC2992).

What about forwarding?

Also, we can make it lockless and to consider
nexthop weights. The DNS SRV (RFC 2782:Weight) has such WRR algorithm,
I'll try to describe it with such example, may be it can
be properly implemented but this is just to show the idea:

- 2 NHs, alive:
- nexthop 1: weight 10
- nexthop 2: weight 20

- calculate the sum of weight of all nexthops: 10+20=30,
maintain it in fib_info, may be for all alive nexthops

- get a random number in the 0..29 range (in our case L3/L4 hash % 30):

rand_val = hash % sum;

This means the lowest bits of hash must be random.
rand_val in range 0..9 should use NH1, 10..29 NH2.

- walk the list with nexthops by increasing the running sum:

int run;

again:
run = 0;
for_nexthops(fi) {
/* run: -10-30 */
run += nh-nh_weight;
if (run  rand_val)
goto found;
}

/* race on NH DEAD flag change, retry? */
smp_rmb();
goto again;

found:
/* use nhsel */

Some questions remain:

- events can mark nexthops as down/dead/whatever and they may be
ignored. As result, same hash can go to different nexthop for
next route lookups. One option is to walk even dead nexthops
but if we select such one we have to to skip to the next available.
As result, hashes that hit alive NH will always use their NH,
only hashes that hit dead NH will get new NH. Then
change of dead state will not affect the binding of hash
to alive nexthop.

Regards

--
Julian Anastasov j...@ssi.bg
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/1] net/ipv4: Enable flow-based ECMP

2015-07-29 Thread Michal Kubecek
On Tue, Jul 28, 2015 at 09:20:02PM +, Richard Laing wrote:
  diff --git a/include/net/flow.h b/include/net/flow.h
  index 8109a15..d1d933d 100644
  --- a/include/net/flow.h
  +++ b/include/net/flow.h
  @@ -79,6 +79,10 @@ struct flowi4 {
 #define fl4_ipsec_spiuli.spi
 #define fl4_mh_typeuli.mht.type
 #define fl4_gre_keyuli.gre_key
  +
  +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
  Why bother making this a CONFIG, round robin is a miserable algorithm
  anyway and nearly all the other packet steering mechanisms already use
  a hash.
 
 Fair enough, I will look at making it a sysctl option. I guess the 
 default can be the current behaviour.

Hm... that's an interesting question. In general, it's better to use
current behaviour as a default so that people are not surprised on
upgrade. On the other hand, it used to be per-flow  - or rather per
destination - earlier (until the routing cache removal, I believe) and
per-flow distribution is IMHO preferrable in majority of use cases. In
theory, there was a route attribute equalize to switch to per-packet
distribution, but it was never actually implemented, AFAIK (it was
recognized by ip and passed to kernel but ignored there).

Anyway, config option is definitely inconvenient as most users install
distribution kernels and do not configure their kernels themselves. Even
boot parameter would be better - but sysctl sounds much better. Having
both sysctl and per-route attribute would be perfect, of course.

Michal Kubecek

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/1] net/ipv4: Enable flow-based ECMP

2015-07-28 Thread Michal Kubecek
On Tue, Jul 28, 2015 at 02:27:57AM +, Richard Laing wrote:
 From: Richard Laing richard.la...@alliedtelesis.co.nz
 
 Enable flow-based ECMP.
 
 Currently if equal-cost multipath is enabled the kernel chooses between
 equal cost paths for each matching packet, essentially packets are
 round-robined between the routes. This means that packets from a single
 flow can traverse different routes. If one of the routes experiences
 congestion this can result in delayed or out of order packets arriving
 at the destination.
 
 This patch allows packets to be routed based on their
 flow - packets in the same flow will always use the same route. This
 prevents out of order packets. There are other issues with round-robin
 based ECMP routing related to variable path MTU handling and debugging.
 See RFC2991 for more details on the problems associated with packet
 based ECMP routing.
 
 This patch relies on the skb hash value to select between routes. The
 selection uses a hash-threshold algorithm (see RFC2992).
 
 Signed-off-by: Richard Laing richard.la...@alliedtelesis.co.nz

The patch looks corrupted (long lines split, tabs converted to (four?)
spaces etc.

Michal Kubecek

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/1] net/ipv4: Enable flow-based ECMP

2015-07-28 Thread Stephen Hemminger
On Tue, 28 Jul 2015 02:27:57 +
Richard Laing richard.la...@alliedtelesis.co.nz wrote:

 From: Richard Laing richard.la...@alliedtelesis.co.nz
 
 Enable flow-based ECMP.
 
 Currently if equal-cost multipath is enabled the kernel chooses between
 equal cost paths for each matching packet, essentially packets are
 round-robined between the routes. This means that packets from a single
 flow can traverse different routes. If one of the routes experiences
 congestion this can result in delayed or out of order packets arriving
 at the destination.
 
 This patch allows packets to be routed based on their
 flow - packets in the same flow will always use the same route. This
 prevents out of order packets. There are other issues with round-robin
 based ECMP routing related to variable path MTU handling and debugging.
 See RFC2991 for more details on the problems associated with packet
 based ECMP routing.
 
 This patch relies on the skb hash value to select between routes. The
 selection uses a hash-threshold algorithm (see RFC2992).
 
 Signed-off-by: Richard Laing richard.la...@alliedtelesis.co.nz

ECMP needs some work, glad to see someone has started again.
A little bit of history, the original ECMP algorithm selection code was dropped 
years ago
because it was buggy. The API was good, but the implemenation had problems.

Somewhere in my dustbin, I have some patches from internal stuff that allows
selecting the algorithm at runtime.

Your implementation makes it a compile time choice. Although this is good
for testing, it really needs to be a sysctl or route attribute to be
useful in a distro environment.

As others said, your mailer is mangling the lines by wrapping.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/1] net/ipv4: Enable flow-based ECMP

2015-07-28 Thread Tom Herbert
On Mon, Jul 27, 2015 at 7:27 PM, Richard Laing
richard.la...@alliedtelesis.co.nz wrote:
 From: Richard Laing richard.la...@alliedtelesis.co.nz

 Enable flow-based ECMP.

 Currently if equal-cost multipath is enabled the kernel chooses between
 equal cost paths for each matching packet, essentially packets are
 round-robined between the routes. This means that packets from a single
 flow can traverse different routes. If one of the routes experiences
 congestion this can result in delayed or out of order packets arriving
 at the destination.

Richard, someone was complaining to me just last week about the
weakness of the round robin algorithm. Thanks for looking into this!

 This patch allows packets to be routed based on their
 flow - packets in the same flow will always use the same route. This
 prevents out of order packets. There are other issues with round-robin
 based ECMP routing related to variable path MTU handling and debugging.
 See RFC2991 for more details on the problems associated with packet
 based ECMP routing.

 This patch relies on the skb hash value to select between routes. The
 selection uses a hash-threshold algorithm (see RFC2992).

 Signed-off-by: Richard Laing richard.la...@alliedtelesis.co.nz
 ---

   include/net/flow.h   |   14 ++
   include/net/ip_fib.h |9 +
   include/net/route.h  |4 
   net/ipv4/Kconfig |9 +
   net/ipv4/fib_semantics.c |   38 ++
   net/ipv4/route.c |   14 +-
   6 files changed, 87 insertions(+), 1 deletion(-)

 diff --git a/include/net/flow.h b/include/net/flow.h
 index 8109a15..d1d933d 100644
 --- a/include/net/flow.h
 +++ b/include/net/flow.h
 @@ -79,6 +79,10 @@ struct flowi4 {
   #define fl4_ipsec_spiuli.spi
   #define fl4_mh_typeuli.mht.type
   #define fl4_gre_keyuli.gre_key
 +
 +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH

Why bother making this a CONFIG, round robin is a miserable algorithm
anyway and nearly all the other packet steering mechanisms already use
a hash.

 +__u32flowi4_hash;
 +#endif
   } __attribute__((__aligned__(BITS_PER_LONG/8)));

   static inline void flowi4_init_output(struct flowi4 *fl4, int oif,
 @@ -99,6 +103,9 @@ static inline void flowi4_init_output(struct flowi4
 *fl4, int oif,
   fl4-saddr = saddr;
   fl4-fl4_dport = dport;
   fl4-fl4_sport = sport;
 +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
 +fl4-flowi4_hash = 0;
 +#endif
   }

   /* Reset some input parameters after previous lookup */
 @@ -182,6 +189,13 @@ static inline struct flowi *flowidn_to_flowi(struct
 flowidn *fldn)
   return container_of(fldn, struct flowi, u.dn);
   }

 +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
 +static inline void flowi4_set_flow_hash(struct flowi4 *fl, __u32 hash)
 +{
 +fl-flowi4_hash = hash;
 +}
 +#endif
 +
   typedef unsigned long flow_compare_t;

   static inline size_t flow_key_size(u16 family)
 diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
 index 5fa643b..c841698 100644
 --- a/include/net/ip_fib.h
 +++ b/include/net/ip_fib.h
 @@ -117,6 +117,10 @@ struct fib_info {
   #ifdef CONFIG_IP_ROUTE_MULTIPATH
   intfib_power;
   #endif
 +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
 +/* Cache the number of live nexthops for flow based ECMP
 calculation. */
 +intlive_nexthops;
 +#endif
   struct rcu_headrcu;
   struct fib_nhfib_nh[0];
   #define fib_devfib_nh[0].nh_dev
 @@ -311,6 +315,11 @@ int fib_sync_down_addr(struct net *net, __be32 local);
   int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
   void fib_select_multipath(struct fib_result *res);

 +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
 +void fib_select_multipath_for_flow(struct fib_result *res,
 +   const struct flowi4 *fl4);
 +#endif
 +
   /* Exported by fib_trie.c */
   void fib_trie_init(void);
   struct fib_table *fib_trie_table(u32 id, struct fib_table *alias);
 diff --git a/include/net/route.h b/include/net/route.h
 index fe22d03..fdbbe7f 100644
 --- a/include/net/route.h
 +++ b/include/net/route.h
 @@ -252,6 +252,10 @@ static inline void ip_route_connect_init(struct
 flowi4 *fl4, __be32 dst, __be32

   flowi4_init_output(fl4, oif, sk-sk_mark, tos, RT_SCOPE_UNIVERSE,
  protocol, flow_flags, dst, src, dport, sport);
 +
 +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
 +flowi4_set_flow_hash(fl4, sk-sk_hash);

Should be sk_txhash I think.

 +#endif
   }

   static inline struct rtable *ip_route_connect(struct flowi4 *fl4,
 diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
 index 6fb3c90..76714d0 100644
 --- a/net/ipv4/Kconfig
 +++ b/net/ipv4/Kconfig
 @@ -90,6 +90,15 @@ config IP_ROUTE_MULTIPATH
 equal cost and chooses one of them in a non-deterministic fashion
 if a matching packet arrives.

 +config IP_FLOW_BASED_MULTIPATH
 +bool IP: flow based equal cost multipath
 +depends on IP_ROUTE_MULTIPATH

Re: [RFC PATCH 1/1] net/ipv4: Enable flow-based ECMP

2015-07-28 Thread Tom Herbert
On Tue, Jul 28, 2015 at 2:02 PM, Tom Herbert t...@herbertland.com wrote:
 On Mon, Jul 27, 2015 at 7:27 PM, Richard Laing
 richard.la...@alliedtelesis.co.nz wrote:
 From: Richard Laing richard.la...@alliedtelesis.co.nz

 Enable flow-based ECMP.

 Currently if equal-cost multipath is enabled the kernel chooses between
 equal cost paths for each matching packet, essentially packets are
 round-robined between the routes. This means that packets from a single
 flow can traverse different routes. If one of the routes experiences
 congestion this can result in delayed or out of order packets arriving
 at the destination.

 Richard, someone was complaining to me just last week about the
 weakness of the round robin algorithm. Thanks for looking into this!

 This patch allows packets to be routed based on their
 flow - packets in the same flow will always use the same route. This
 prevents out of order packets. There are other issues with round-robin
 based ECMP routing related to variable path MTU handling and debugging.
 See RFC2991 for more details on the problems associated with packet
 based ECMP routing.


btw, it looks like IPv6 is already doing the hash but is calculating
it by hand instead of using sk_txhash or skb-hash. It would be nice
if we could unify this between v4 and v6 at some point to both using
the existing hashes.

Tom

 This patch relies on the skb hash value to select between routes. The
 selection uses a hash-threshold algorithm (see RFC2992).

 Signed-off-by: Richard Laing richard.la...@alliedtelesis.co.nz
 ---

   include/net/flow.h   |   14 ++
   include/net/ip_fib.h |9 +
   include/net/route.h  |4 
   net/ipv4/Kconfig |9 +
   net/ipv4/fib_semantics.c |   38 ++
   net/ipv4/route.c |   14 +-
   6 files changed, 87 insertions(+), 1 deletion(-)

 diff --git a/include/net/flow.h b/include/net/flow.h
 index 8109a15..d1d933d 100644
 --- a/include/net/flow.h
 +++ b/include/net/flow.h
 @@ -79,6 +79,10 @@ struct flowi4 {
   #define fl4_ipsec_spiuli.spi
   #define fl4_mh_typeuli.mht.type
   #define fl4_gre_keyuli.gre_key
 +
 +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH

 Why bother making this a CONFIG, round robin is a miserable algorithm
 anyway and nearly all the other packet steering mechanisms already use
 a hash.

 +__u32flowi4_hash;
 +#endif
   } __attribute__((__aligned__(BITS_PER_LONG/8)));

   static inline void flowi4_init_output(struct flowi4 *fl4, int oif,
 @@ -99,6 +103,9 @@ static inline void flowi4_init_output(struct flowi4
 *fl4, int oif,
   fl4-saddr = saddr;
   fl4-fl4_dport = dport;
   fl4-fl4_sport = sport;
 +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
 +fl4-flowi4_hash = 0;
 +#endif
   }

   /* Reset some input parameters after previous lookup */
 @@ -182,6 +189,13 @@ static inline struct flowi *flowidn_to_flowi(struct
 flowidn *fldn)
   return container_of(fldn, struct flowi, u.dn);
   }

 +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
 +static inline void flowi4_set_flow_hash(struct flowi4 *fl, __u32 hash)
 +{
 +fl-flowi4_hash = hash;
 +}
 +#endif
 +
   typedef unsigned long flow_compare_t;

   static inline size_t flow_key_size(u16 family)
 diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
 index 5fa643b..c841698 100644
 --- a/include/net/ip_fib.h
 +++ b/include/net/ip_fib.h
 @@ -117,6 +117,10 @@ struct fib_info {
   #ifdef CONFIG_IP_ROUTE_MULTIPATH
   intfib_power;
   #endif
 +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
 +/* Cache the number of live nexthops for flow based ECMP
 calculation. */
 +intlive_nexthops;
 +#endif
   struct rcu_headrcu;
   struct fib_nhfib_nh[0];
   #define fib_devfib_nh[0].nh_dev
 @@ -311,6 +315,11 @@ int fib_sync_down_addr(struct net *net, __be32 local);
   int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
   void fib_select_multipath(struct fib_result *res);

 +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
 +void fib_select_multipath_for_flow(struct fib_result *res,
 +   const struct flowi4 *fl4);
 +#endif
 +
   /* Exported by fib_trie.c */
   void fib_trie_init(void);
   struct fib_table *fib_trie_table(u32 id, struct fib_table *alias);
 diff --git a/include/net/route.h b/include/net/route.h
 index fe22d03..fdbbe7f 100644
 --- a/include/net/route.h
 +++ b/include/net/route.h
 @@ -252,6 +252,10 @@ static inline void ip_route_connect_init(struct
 flowi4 *fl4, __be32 dst, __be32

   flowi4_init_output(fl4, oif, sk-sk_mark, tos, RT_SCOPE_UNIVERSE,
  protocol, flow_flags, dst, src, dport, sport);
 +
 +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
 +flowi4_set_flow_hash(fl4, sk-sk_hash);

 Should be sk_txhash I think.

 +#endif
   }

   static inline struct rtable *ip_route_connect(struct flowi4 *fl4,
 diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
 index 6fb3c90..76714d0 100644
 --- 

Re: [RFC PATCH 1/1] net/ipv4: Enable flow-based ECMP

2015-07-28 Thread Richard Laing

On 07/29/2015 07:51 AM, Stephen Hemminger wrote:
 On Tue, 28 Jul 2015 02:27:57 +
 Richard Laing richard.la...@alliedtelesis.co.nz wrote:

 From: Richard Laing richard.la...@alliedtelesis.co.nz

 Enable flow-based ECMP.

 Currently if equal-cost multipath is enabled the kernel chooses between
 equal cost paths for each matching packet, essentially packets are
 round-robined between the routes. This means that packets from a single
 flow can traverse different routes. If one of the routes experiences
 congestion this can result in delayed or out of order packets arriving
 at the destination.

 This patch allows packets to be routed based on their
 flow - packets in the same flow will always use the same route. This
 prevents out of order packets. There are other issues with round-robin
 based ECMP routing related to variable path MTU handling and debugging.
 See RFC2991 for more details on the problems associated with packet
 based ECMP routing.

 This patch relies on the skb hash value to select between routes. The
 selection uses a hash-threshold algorithm (see RFC2992).

 Signed-off-by: Richard Laing richard.la...@alliedtelesis.co.nz
 ECMP needs some work, glad to see someone has started again.
 A little bit of history, the original ECMP algorithm selection code was 
 dropped years ago
 because it was buggy. The API was good, but the implemenation had problems.

 Somewhere in my dustbin, I have some patches from internal stuff that allows
 selecting the algorithm at runtime.

 Your implementation makes it a compile time choice. Although this is good
 for testing, it really needs to be a sysctl or route attribute to be
 useful in a distro environment.

 As others said, your mailer is mangling the lines by wrapping.

Thanks Stephen, not sure what happened with the formatting, I'll try and 
find out. I will look at a sysctl option to enable the flow based ECMP 
option.

Cheers
Richard

-- 
Richard Laing
Software Team Leader
Allied Telesis Labs| 27 Nazareth Ave | Christchurch 8024 | New Zealand
Phone: +64 3 339 9248
Web: www.alliedtelesis.com
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/1] net/ipv4: Enable flow-based ECMP

2015-07-28 Thread Richard Laing


On 07/29/2015 09:02 AM, Tom Herbert wrote:
 On Mon, Jul 27, 2015 at 7:27 PM, Richard Laing
 richard.la...@alliedtelesis.co.nz wrote:
 From: Richard Laing richard.la...@alliedtelesis.co.nz

 Enable flow-based ECMP.

 Currently if equal-cost multipath is enabled the kernel chooses between
 equal cost paths for each matching packet, essentially packets are
 round-robined between the routes. This means that packets from a single
 flow can traverse different routes. If one of the routes experiences
 congestion this can result in delayed or out of order packets arriving
 at the destination.

 Richard, someone was complaining to me just last week about the
 weakness of the round robin algorithm. Thanks for looking into this!

 This patch allows packets to be routed based on their
 flow - packets in the same flow will always use the same route. This
 prevents out of order packets. There are other issues with round-robin
 based ECMP routing related to variable path MTU handling and debugging.
 See RFC2991 for more details on the problems associated with packet
 based ECMP routing.

 This patch relies on the skb hash value to select between routes. The
 selection uses a hash-threshold algorithm (see RFC2992).

 Signed-off-by: Richard Laing richard.la...@alliedtelesis.co.nz
 ---

include/net/flow.h   |   14 ++
include/net/ip_fib.h |9 +
include/net/route.h  |4 
net/ipv4/Kconfig |9 +
net/ipv4/fib_semantics.c |   38 ++
net/ipv4/route.c |   14 +-
6 files changed, 87 insertions(+), 1 deletion(-)

 diff --git a/include/net/flow.h b/include/net/flow.h
 index 8109a15..d1d933d 100644
 --- a/include/net/flow.h
 +++ b/include/net/flow.h
 @@ -79,6 +79,10 @@ struct flowi4 {
#define fl4_ipsec_spiuli.spi
#define fl4_mh_typeuli.mht.type
#define fl4_gre_keyuli.gre_key
 +
 +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
 Why bother making this a CONFIG, round robin is a miserable algorithm
 anyway and nearly all the other packet steering mechanisms already use
 a hash.

Fair enough, I will look at making it a sysctl option. I guess the 
default can be the current behaviour.

 +__u32flowi4_hash;
 +#endif
} __attribute__((__aligned__(BITS_PER_LONG/8)));

static inline void flowi4_init_output(struct flowi4 *fl4, int oif,
 @@ -99,6 +103,9 @@ static inline void flowi4_init_output(struct flowi4
 *fl4, int oif,
fl4-saddr = saddr;
fl4-fl4_dport = dport;
fl4-fl4_sport = sport;
 +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
 +fl4-flowi4_hash = 0;
 +#endif
}

/* Reset some input parameters after previous lookup */
 @@ -182,6 +189,13 @@ static inline struct flowi *flowidn_to_flowi(struct
 flowidn *fldn)
return container_of(fldn, struct flowi, u.dn);
}

 +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
 +static inline void flowi4_set_flow_hash(struct flowi4 *fl, __u32 hash)
 +{
 +fl-flowi4_hash = hash;
 +}
 +#endif
 +
typedef unsigned long flow_compare_t;

static inline size_t flow_key_size(u16 family)
 diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
 index 5fa643b..c841698 100644
 --- a/include/net/ip_fib.h
 +++ b/include/net/ip_fib.h
 @@ -117,6 +117,10 @@ struct fib_info {
#ifdef CONFIG_IP_ROUTE_MULTIPATH
intfib_power;
#endif
 +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
 +/* Cache the number of live nexthops for flow based ECMP
 calculation. */
 +intlive_nexthops;
 +#endif
struct rcu_headrcu;
struct fib_nhfib_nh[0];
#define fib_devfib_nh[0].nh_dev
 @@ -311,6 +315,11 @@ int fib_sync_down_addr(struct net *net, __be32 local);
int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
void fib_select_multipath(struct fib_result *res);

 +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
 +void fib_select_multipath_for_flow(struct fib_result *res,
 +   const struct flowi4 *fl4);
 +#endif
 +
/* Exported by fib_trie.c */
void fib_trie_init(void);
struct fib_table *fib_trie_table(u32 id, struct fib_table *alias);
 diff --git a/include/net/route.h b/include/net/route.h
 index fe22d03..fdbbe7f 100644
 --- a/include/net/route.h
 +++ b/include/net/route.h
 @@ -252,6 +252,10 @@ static inline void ip_route_connect_init(struct
 flowi4 *fl4, __be32 dst, __be32

flowi4_init_output(fl4, oif, sk-sk_mark, tos, RT_SCOPE_UNIVERSE,
   protocol, flow_flags, dst, src, dport, sport);
 +
 +#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
 +flowi4_set_flow_hash(fl4, sk-sk_hash);
 Should be sk_txhash I think.

I will check, it looks likely!

 +#endif
}

static inline struct rtable *ip_route_connect(struct flowi4 *fl4,
 diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
 index 6fb3c90..76714d0 100644
 --- a/net/ipv4/Kconfig
 +++ b/net/ipv4/Kconfig
 @@ -90,6 +90,15 @@ config IP_ROUTE_MULTIPATH
 

Re: [RFC PATCH 1/1] net/ipv4: Enable flow-based ECMP

2015-07-28 Thread Richard Laing


On 07/29/2015 09:15 AM, Tom Herbert wrote:
 On Tue, Jul 28, 2015 at 2:02 PM, Tom Herbert t...@herbertland.com wrote:
 On Mon, Jul 27, 2015 at 7:27 PM, Richard Laing
 richard.la...@alliedtelesis.co.nz wrote:
 From: Richard Laing richard.la...@alliedtelesis.co.nz

 Enable flow-based ECMP.

 Currently if equal-cost multipath is enabled the kernel chooses between
 equal cost paths for each matching packet, essentially packets are
 round-robined between the routes. This means that packets from a single
 flow can traverse different routes. If one of the routes experiences
 congestion this can result in delayed or out of order packets arriving
 at the destination.

 Richard, someone was complaining to me just last week about the
 weakness of the round robin algorithm. Thanks for looking into this!

 This patch allows packets to be routed based on their
 flow - packets in the same flow will always use the same route. This
 prevents out of order packets. There are other issues with round-robin
 based ECMP routing related to variable path MTU handling and debugging.
 See RFC2991 for more details on the problems associated with packet
 based ECMP routing.

 btw, it looks like IPv6 is already doing the hash but is calculating
 it by hand instead of using sk_txhash or skb-hash. It would be nice
 if we could unify this between v4 and v6 at some point to both using
 the existing hashes.

 Tom

Thanks Tom, yes IPv6 does work just fine currently. Assuming I get a 
version of this patch pushed updating the IPv6 version would seem like a 
good option.


-- 
Richard Laing
Software Team Leader
Allied Telesis Labs| 27 Nazareth Ave | Christchurch 8024 | New Zealand
Phone: +64 3 339 9248
Web: www.alliedtelesis.com


[RFC PATCH 1/1] net/ipv4: Enable flow-based ECMP

2015-07-27 Thread Richard Laing
From: Richard Laing richard.la...@alliedtelesis.co.nz

Enable flow-based ECMP.

Currently if equal-cost multipath is enabled the kernel chooses between
equal cost paths for each matching packet, essentially packets are
round-robined between the routes. This means that packets from a single
flow can traverse different routes. If one of the routes experiences
congestion this can result in delayed or out of order packets arriving
at the destination.

This patch allows packets to be routed based on their
flow - packets in the same flow will always use the same route. This
prevents out of order packets. There are other issues with round-robin
based ECMP routing related to variable path MTU handling and debugging.
See RFC2991 for more details on the problems associated with packet
based ECMP routing.

This patch relies on the skb hash value to select between routes. The
selection uses a hash-threshold algorithm (see RFC2992).

Signed-off-by: Richard Laing richard.la...@alliedtelesis.co.nz
---

  include/net/flow.h   |   14 ++
  include/net/ip_fib.h |9 +
  include/net/route.h  |4 
  net/ipv4/Kconfig |9 +
  net/ipv4/fib_semantics.c |   38 ++
  net/ipv4/route.c |   14 +-
  6 files changed, 87 insertions(+), 1 deletion(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index 8109a15..d1d933d 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -79,6 +79,10 @@ struct flowi4 {
  #define fl4_ipsec_spiuli.spi
  #define fl4_mh_typeuli.mht.type
  #define fl4_gre_keyuli.gre_key
+
+#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
+__u32flowi4_hash;
+#endif
  } __attribute__((__aligned__(BITS_PER_LONG/8)));

  static inline void flowi4_init_output(struct flowi4 *fl4, int oif,
@@ -99,6 +103,9 @@ static inline void flowi4_init_output(struct flowi4 
*fl4, int oif,
  fl4-saddr = saddr;
  fl4-fl4_dport = dport;
  fl4-fl4_sport = sport;
+#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
+fl4-flowi4_hash = 0;
+#endif
  }

  /* Reset some input parameters after previous lookup */
@@ -182,6 +189,13 @@ static inline struct flowi *flowidn_to_flowi(struct 
flowidn *fldn)
  return container_of(fldn, struct flowi, u.dn);
  }

+#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
+static inline void flowi4_set_flow_hash(struct flowi4 *fl, __u32 hash)
+{
+fl-flowi4_hash = hash;
+}
+#endif
+
  typedef unsigned long flow_compare_t;

  static inline size_t flow_key_size(u16 family)
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 5fa643b..c841698 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -117,6 +117,10 @@ struct fib_info {
  #ifdef CONFIG_IP_ROUTE_MULTIPATH
  intfib_power;
  #endif
+#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
+/* Cache the number of live nexthops for flow based ECMP 
calculation. */
+intlive_nexthops;
+#endif
  struct rcu_headrcu;
  struct fib_nhfib_nh[0];
  #define fib_devfib_nh[0].nh_dev
@@ -311,6 +315,11 @@ int fib_sync_down_addr(struct net *net, __be32 local);
  int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
  void fib_select_multipath(struct fib_result *res);

+#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
+void fib_select_multipath_for_flow(struct fib_result *res,
+   const struct flowi4 *fl4);
+#endif
+
  /* Exported by fib_trie.c */
  void fib_trie_init(void);
  struct fib_table *fib_trie_table(u32 id, struct fib_table *alias);
diff --git a/include/net/route.h b/include/net/route.h
index fe22d03..fdbbe7f 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -252,6 +252,10 @@ static inline void ip_route_connect_init(struct 
flowi4 *fl4, __be32 dst, __be32

  flowi4_init_output(fl4, oif, sk-sk_mark, tos, RT_SCOPE_UNIVERSE,
 protocol, flow_flags, dst, src, dport, sport);
+
+#ifdef CONFIG_IP_FLOW_BASED_MULTIPATH
+flowi4_set_flow_hash(fl4, sk-sk_hash);
+#endif
  }

  static inline struct rtable *ip_route_connect(struct flowi4 *fl4,
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 6fb3c90..76714d0 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -90,6 +90,15 @@ config IP_ROUTE_MULTIPATH
equal cost and chooses one of them in a non-deterministic fashion
if a matching packet arrives.

+config IP_FLOW_BASED_MULTIPATH
+bool IP: flow based equal cost multipath
+depends on IP_ROUTE_MULTIPATH
+help
+  Normally if equal cost multipath is enabled the router chooses 
between
+  equal cost paths for each matching packet, essentially 
packets are round-
+  robined between the routes. This option allows packets to be 
routed based on
+  their flow - packets in the same flow will always use the 
same route.
+
  config IP_ROUTE_VERBOSE
  bool IP: verbose route monitoring
  depends on IP_ADVANCED_ROUTER
diff --git a/net/ipv4/fib_semantics.c