On Thu, Oct 15, 2009 at 12:27:21PM -0700, David J. Wilder wrote:
> On Wed, 2009-10-14 at 13:09 Jason Gunthorpe wrote:
>
> > So, it tries to match the source addr to the addrs bound to the
> > device, which is wrong - that isn't how the ip stack works.
>
> > You can patch this up a little bit by fixing up addr_resolve_local to
> > set sin6_scope_id.
>
> I found the bug in addr_resolve_local(). (more comments below)

Yes, that is the hacky work around I was mentioning..

> > But really the correct thing to do is to remove addr_resolve_local and
> > place the source address into the struct flowi and use the result of
> > the route lookup to bind to the source device, and set the source
> > address if it is unset.
>
> Sorry I don't get it..
> Are you saying that ip6_route_output() will resolve the address even if
> it is a link-local address bound to my own interface? Therefore
> addr_resolve_local() is not needed.

Yes, and more. In Linux the routing table takes as input the source
(optional), device (optional) and destination address and returns as
output the device to use. To determine the device to bind to you ask
the routing table what device to use for all the route information you
have. For example:

 $ ip route get fe80::c2d from fe80::213:72ff:fe29:e65d oif eth0
 fe80::c2d via fe80::c2d dev eth0  src fe80::213:72ff:fe29:e65d  metric 0
     cache  mtu 1500 advmss 1440 hoplimit 4294967295
 $ ip route get fe80::c2d oif eth0
 fe80::c2d via fe80::c2d dev eth0  src fe80::213:72ff:fe29:e65d  metric 0
     cache  mtu 1500 advmss 1440 hoplimit 4294967295

You can see in both cases the routing table returns a 'src'
entry. 'src' is the address to bind to if no bind address was
specified.

When doing link local addresses the sin6_scope_id should set the
'oif' key in the routing lookup, which will result in the correct src
address and output device being selected by the routing algorithm.

For instance on my machine here, I have two interfaces:

 $ ip route get fe80::c2d oif virbr0
 fe80::c2d via fe80::c2d dev virbr0  src fe80::2c5d:c4ff:feb8:1ce5  metric 0
     cache  mtu 1500 advmss 1440 hoplimit 4294967295

As you can see it is returning the link local address for virbr0 as
the source.

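To make the lookup contract concrete, here is a minimal userspace
sketch in C. It is a toy model, not kernel code: toy_route, toy_flow,
toy_result and toy_route_lookup are hypothetical names invented for
illustration, and the table is hard-coded from the two 'ip route get'
results above. It only shows the shape of the interface (optional src,
optional oif, destination in; device and src out) and does not model
policy rules, prefixes or metrics.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy route entry; NOT the kernel's route structures. */
struct toy_route {
    const char *dst;      /* destination this entry matches (exact match only) */
    const char *dev;      /* output device for this entry */
    const char *pref_src; /* source address suggested for this device */
};

/* Lookup key, loosely modelled on the inputs described above. */
struct toy_flow {
    const char *dst; /* required */
    const char *src; /* NULL means "any" (no bind address given) */
    const char *oif; /* NULL means "any"; for link-local addresses this
                        would be derived from sin6_scope_id */
};

/* Lookup result: the device to bind to and the source address to use. */
struct toy_result {
    const char *dev;
    const char *src;
};

/* Table contents taken from the example output in this mail. */
static const struct toy_route toy_table[] = {
    { "fe80::c2d", "eth0",   "fe80::213:72ff:fe29:e65d" },
    { "fe80::c2d", "virbr0", "fe80::2c5d:c4ff:feb8:1ce5" },
};

/*
 * Return 0 and fill *res on a match, -1 if no route matches.
 * The first entry matching dst (and oif, when given) wins.
 * A caller-supplied src is kept as is; otherwise the entry's
 * preferred source is returned, mirroring the 'src' field above.
 */
static int toy_route_lookup(const struct toy_flow *fl, struct toy_result *res)
{
    size_t i;

    assert(fl != NULL && fl->dst != NULL && res != NULL);

    for (i = 0; i < sizeof(toy_table) / sizeof(toy_table[0]); i++) {
        const struct toy_route *rt = &toy_table[i];

        if (strcmp(rt->dst, fl->dst) != 0)
            continue;
        if (fl->oif != NULL && strcmp(rt->dev, fl->oif) != 0)
            continue;

        res->dev = rt->dev;
        res->src = fl->src != NULL ? fl->src : rt->pref_src;
        return 0;
    }
    return -1;
}
```

With oif set to "virbr0" and no src, the sketch hands back virbr0 and
its link local address, which is the same answer the real routing
table gave in the last example.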
So the algorithm in RDMA CM should look like this:
 - If src is specified then set the bind local address to src
   [if src is link local then it must specify sin6_scope_id, and
    sin6_scope_id becomes the oif input to the route lookup]
 - If dst is link local then its sin6_scope_id is the oif to the route
   lookup (and must match src, as we did last go round)
 - Src (or 'any'), dst and device (or 'any') are passed to the route
   lookup
 - The RDMA CM ID is bound to the device returned by the route lookup
 - If the src address was not specified then the connection source
   IP is set to the 'src' value from the route lookup.

This is why addr_resolve_local/rdma_translate_ip is not needed, that
entire function is done by the routing table code.

You can see why this becomes important when it is combined with policy
routing, for instance consider this example:

 $ ip rule
 32765:  from 10.0.0.4 lookup dnat
 $ ip route show table dnat
 default via 10.0.0.1 dev eth1
 $ ip route get 10.0.0.100
 10.0.0.100 dev eth0  src 10.0.0.2
 $ ip route get 10.0.0.100 from 10.0.0.4
 10.0.0.100 from 10.0.0.4 via 10.0.0.1 dev eth1
     cache  mtu 1500 advmss 1460 hoplimit 64

The two results are radically different and dependent on the source
address. (10.0.0.4 could be attached to eth0, and eth1!)

The actual fix to the code is not hard: remove rdma_translate_ip and
addr_resolve_local, and split addr_resolve_remote into a part to
resolve the route and a part that does the arp/nd. Make the route
resolve part work almost exactly like addr4_resolve_remote (noting
that the v6 version is wrong, since it doesn't respect an unset source
address, another bug). Call rdma_copy_addr based on the
rt->idev->dev (or should it be odev??). Do the ARP.

The pain is in retesting everything :|

Jason
_______________________________________________
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg