Re: [PATCH][XFRM] export SAD info
On Thu, 2007-26-04 at 14:18 -0700, David Miller wrote: I wouldn't mind if it actually helped anything. The SMP cache line transactions are more expensive than the execution of the code blocks they are protecting. rwlock's rarely help, and when they do (the execution path is more expensive than the SMP atomic operations) then you're holding the lock too long :-) Ok ;- So if i was to make any change, it would be for consistency with SPD. If this is sufficiently compelling i will send a patch. I would prefer a dynamic algorithm that reacts to system memory pressure and yet-another-knob that people will get wrong and there is no sane default for. This would certainly be a better approach if doable. I plan to do away with all the GC threshold madness in the routing cache, for example, and just let the MM layer call back into us when there is memory pressure to trigger GC. See set_shrinker() and friends. The MM calls into these handlers in response to memory pressure. There is no reason the networking can't hook into this and do things properly instead of the ad-hoc manner we currently use. Scanning the kernel ... I wasnt aware of this, neat; not many areas in the kernel seem to use it. I find this stuff interesting, so i may get too verbose ;- One approach i tried was to write an oom_handler - but it seemed to get invoked a little too late, i.e when shit has already hit the fan. If only i could get notified just before swap kicks in or just when some preconfigured (by me) memmory threshold is hit This may do it? I will experiment. Actually for it to work well, I will need to know when the memory threshold is crossed as it goes down and when it is going up as more memory gets freed. I can see the shrinker working well with dynamically createable entries (route cache, arp cache, contrack etc); shrinking a SAD, SPD, FIB etc that was created by some user space app without user space consent or at least notification may be unacceptable (imagine Quagga/BGP adding FIB entries and the kernel deciding its gonna run out of mem and starting to delete entries; worse deleting SAD entries may be a security risk etc etc). My problem is more related to these sorts of user controlled tables. One approach that may work to address the above is to send a signal to user space when the low mem threshold is approaching.. User space then uses that info to slow down its abuse of memory. I think that signaling maybe achievable by a genlmsg being sent to a multicast group which a user space app will have to subscribe to. Another approach is to use the shrinker callback to set a lowmem condition to start rejecting any new table additions. A timer to retry would take it back; a callback from the VM to say you can go ahead and alloc more now would be better of course - i couldnt see this anywhere in the VM code, but it is one of those subsystem i dont pay attention to, it may be there. Thoughts? ;- cheers, jamal - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][XFRM] export SAD info
From: jamal [EMAIL PROTECTED] Date: Wed, 25 Apr 2007 11:42:41 -0400 [XFRM] export SAD info On a system with a lot of SAs, counting SAD entries chews useful CPU time since you need to dump the whole SAD to user space; i.e something like ip xfrm state ls | grep -i src | wc -l I have seen taking literally minutes on a 40K SAs when the system is swapping. With this patch, some of the SAD info (that was already being tracked) is exposed to user space. i.e you do: ip xfrm state count And you get the count; you can also pass -s to the command line and get the hash info. Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED] Applied, thanks Jamal. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][XFRM] export SAD info
From: jamal [EMAIL PROTECTED] Date: Wed, 25 Apr 2007 11:54:50 -0400 That patch has xfrm_state_num being mucked with; just ignore that bit. I need to send a patch against net-2.6.22 and i will clean that up - just need some feedback. Would it make sense to have those vars as u32 instead of unsigned int? I'm ambivalent, unsigned int happens to be 32-bit on every platform. So changing it would cause no harm :-) - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][XFRM] export SAD info
Thanks Dave. Here's a missing bit to get in sync with latest net-2.6 I will send you the policy side sometime tonight or tommorow. cheers, jamal [XFRM] missing bits to SAD info This brings the SAD info in sync with net-2.6.22/net-2.6 Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED] --- commit e1dbdd993637d14c17bc30ba55e5c66640d39907 tree fef4015f92e4e50f98e1a1630333cc59ed4ec523 parent bfbf3c0968498f5232c02965cf41695edae1bc4d author Jamal Hadi Salim [EMAIL PROTECTED] Thu, 26 Apr 2007 08:52:29 -0400 committer Jamal Hadi Salim [EMAIL PROTECTED] Thu, 26 Apr 2007 08:52:29 -0400 net/xfrm/xfrm_user.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index cb4cc1b..69110fe 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -1878,6 +1878,7 @@ static const int xfrm_msg_min[XFRM_NR_MSGTYPES] = { [XFRM_MSG_GETAE - XFRM_MSG_BASE] = XMSGSIZE(xfrm_aevent_id), [XFRM_MSG_REPORT - XFRM_MSG_BASE] = XMSGSIZE(xfrm_user_report), [XFRM_MSG_MIGRATE - XFRM_MSG_BASE] = XMSGSIZE(xfrm_userpolicy_id), + [XFRM_MSG_GETSADINFO - XFRM_MSG_BASE] = NLMSG_LENGTH(sizeof(u32)), }; #undef XMSGSIZE @@ -1905,7 +1906,7 @@ static struct xfrm_link { [XFRM_MSG_NEWAE - XFRM_MSG_BASE] = { .doit = xfrm_new_ae }, [XFRM_MSG_GETAE - XFRM_MSG_BASE] = { .doit = xfrm_get_ae }, [XFRM_MSG_MIGRATE - XFRM_MSG_BASE] = { .doit = xfrm_do_migrate}, - [XFRM_MSG_GETSADINFO - XFRM_MSG_BASE] = { .doit = xfrm_get_sadinfo }, + [XFRM_MSG_GETSADINFO - XFRM_MSG_BASE] = { .doit = xfrm_get_sadinfo }, }; static int xfrm_user_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
Re: [PATCH][XFRM] export SAD info
On Thu, 2007-26-04 at 00:18 -0700, David Miller wrote: From: jamal [EMAIL PROTECTED] Would it make sense to have those vars as u32 instead of unsigned int? I'm ambivalent, unsigned int happens to be 32-bit on every platform. So changing it would cause no harm :-) If unsigned int is always u32 i will leave it as is. I would have liked to just do a read_lock_bh when retrieving the table metadata; however, the state table lock is defined as DEFINE_SPINLOCK unlike the policy table which is defined as DEFINE_RWLOCK. Any objection to change the state lock to be RW? BTW, if i can get the SADinfo, then i should be able to set it from user space too;- So that would be my next change unless there is objection. One other angle is start rejecting additions to the table after some point. To test, I wrote a little DOS tool that just kept adding entries until an OOM hit. It is a lot of fun to watch when you hit a point that swap is guzzling 2G or more. The add latency starts going up exponentially. I would like to enable the admin to set the proper param settings for upper bound. Exceeding the upper bounds of the max entries a table should have returns ENOMEM for any new entries. By default current behavior is maintained. Thoughts? cheers, jamal - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][XFRM] export SAD info
From: jamal [EMAIL PROTECTED] Date: Thu, 26 Apr 2007 08:55:54 -0400 Here's a missing bit to get in sync with latest net-2.6 Applied, thanks. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][XFRM] export SAD info
From: jamal [EMAIL PROTECTED] Date: Thu, 26 Apr 2007 09:10:10 -0400 I would have liked to just do a read_lock_bh when retrieving the table metadata; however, the state table lock is defined as DEFINE_SPINLOCK unlike the policy table which is defined as DEFINE_RWLOCK. Any objection to change the state lock to be RW? I wouldn't mind if it actually helped anything. The SMP cache line transactions are more expensive than the execution of the code blocks they are protecting. rwlock's rarely help, and when they do (the execution path is more expensive than the SMP atomic operations) then you're holding the lock too long :-) One other angle is start rejecting additions to the table after some point. To test, I wrote a little DOS tool that just kept adding entries until an OOM hit. It is a lot of fun to watch when you hit a point that swap is guzzling 2G or more. The add latency starts going up exponentially. I would prefer a dynamic algorithm that reacts to system memory pressure and yet-another-knob that people will get wrong and there is no sane default for. I plan to do away with all the GC threshold madness in the routing cache, for example, and just let the MM layer call back into us when there is memory pressure to trigger GC. See set_shrinker() and friends. The MM calls into these handlers in response to memory pressure. There is no reason the networking can't hook into this and do things properly instead of the ad-hoc manner we currently use. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH][XFRM] export SAD info
Dave, Something ive been meaning to do since you made the hash changes. I will be doing one also for policy. Against latest Linus tree because i am having strange challenges syncing net-2.6.22. cheers, jamal [XFRM] export SAD info On a system with a lot of SAs, counting SAD entries chews useful CPU time since you need to dump the whole SAD to user space; i.e something like ip xfrm state ls | grep -i src | wc -l I have seen taking literally minutes on a 40K SAs when the system is swapping. With this patch, some of the SAD info (that was already being tracked) is exposed to user space. i.e you do: ip xfrm state count And you get the count; you can also pass -s to the command line and get the hash info. Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED] --- commit 1fb99604e38f27c1ad4cb74b11f148c34d0d3be6 tree 1bb35db627ac5d3d2f370d0fc993ba6b80392696 parent 146d97b89c83c9460012185bfd584d21a3b5fe19 author Jamal Hadi Salim [EMAIL PROTECTED] Wed, 25 Apr 2007 11:30:21 -0400 committer Jamal Hadi Salim [EMAIL PROTECTED] Wed, 25 Apr 2007 11:30:21 -0400 include/linux/xfrm.h | 25 ++ include/net/xfrm.h|8 +++ net/xfrm/xfrm_state.c | 12 ++- net/xfrm/xfrm_user.c | 56 + 4 files changed, 100 insertions(+), 1 deletions(-) diff --git a/include/linux/xfrm.h b/include/linux/xfrm.h index 15ca89e..9c656a5 100644 --- a/include/linux/xfrm.h +++ b/include/linux/xfrm.h @@ -181,6 +181,10 @@ enum { XFRM_MSG_MIGRATE, #define XFRM_MSG_MIGRATE XFRM_MSG_MIGRATE + XFRM_MSG_NEWSADINFO, +#define XFRM_MSG_NEWSADINFO XFRM_MSG_NEWSADINFO + XFRM_MSG_GETSADINFO, +#define XFRM_MSG_GETSADINFO XFRM_MSG_GETSADINFO __XFRM_MSG_MAX }; #define XFRM_MSG_MAX (__XFRM_MSG_MAX - 1) @@ -234,6 +238,17 @@ enum xfrm_ae_ftype_t { #define XFRM_AE_MAX (__XFRM_AE_MAX - 1) }; +/* SAD Table filter flags */ +enum xfrm_sad_ftype_t { + XFRM_SAD_UNSPEC, + XFRM_SAD_HMASK=1, + XFRM_SAD_HMAX=2, + XFRM_SAD_CNT=4, + __XFRM_SAD_MAX + +#define XFRM_SAD_MAX (__XFRM_SAD_MAX - 1) +}; + struct xfrm_userpolicy_type { __u8type; __u16 reserved1; @@ -265,6 +280,16 @@ enum xfrm_attr_type_t { #define XFRMA_MAX (__XFRMA_MAX - 1) }; +enum xfrm_sadattr_type_t { + XFRMA_SAD_UNSPEC, + XFRMA_SADHMASK, + XFRMA_SADHMAX, + XFRMA_SADCNT, + __XFRMA_SAD_MAX + +#define XFRMA_SAD_MAX (__XFRMA_SAD_MAX - 1) +}; + struct xfrm_usersa_info { struct xfrm_selectorsel; struct xfrm_id id; diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 5a00aa8..4922e9f 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -416,6 +416,13 @@ struct xfrm_audit u32 secid; }; +/* SAD metadata, add more later */ +struct xfrm_sadinfo +{ + u32 sadhcnt; /* current hash bkts */ + u32 sadhmcnt; /* max allowed hash bkts */ + u32 sadcnt; /* current running count */ +}; #ifdef CONFIG_AUDITSYSCALL extern void xfrm_audit_log(uid_t auid, u32 secid, int type, int result, struct xfrm_policy *xp, struct xfrm_state *x); @@ -938,6 +945,7 @@ static inline int xfrm_state_sort(struct xfrm_state **dst, struct xfrm_state **s extern struct xfrm_state *xfrm_find_acq_byseq(u32 seq); extern int xfrm_state_delete(struct xfrm_state *x); extern void xfrm_state_flush(u8 proto, struct xfrm_audit *audit_info); +extern void xfrm_sad_getinfo(struct xfrm_sadinfo *si); extern int xfrm_replay_check(struct xfrm_state *x, __be32 seq); extern void xfrm_replay_advance(struct xfrm_state *x, __be32 seq); extern void xfrm_replay_notify(struct xfrm_state *x, int event); diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index c1581fb..98e5ce3 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -53,7 +53,7 @@ static struct hlist_head *xfrm_state_bysrc __read_mostly; static struct hlist_head *xfrm_state_byspi __read_mostly; static unsigned int xfrm_state_hmask __read_mostly; static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024; -static u32 xfrm_state_num; +static unsigned int xfrm_state_num; static unsigned int xfrm_state_genid; static inline unsigned int xfrm_dst_hash(xfrm_address_t *daddr, @@ -421,6 +421,16 @@ restart: } EXPORT_SYMBOL(xfrm_state_flush); +void xfrm_sad_getinfo(struct xfrm_sadinfo *si) +{ + spin_lock_bh(xfrm_state_lock); + si-sadcnt = xfrm_state_num; + si-sadhcnt = xfrm_state_hmask; + si-sadhmcnt = xfrm_state_hashmax; + spin_unlock_bh(xfrm_state_lock); +} +EXPORT_SYMBOL(xfrm_sad_getinfo); + static int xfrm_init_tempsel(struct xfrm_state *x, struct flowi *fl, struct xfrm_tmpl *tmpl, diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index 816e369..089159a 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -672,6 +672,61 @@ static struct sk_buff *xfrm_state_netlink(struct sk_buff
Re: [PATCH][XFRM] export SAD info
That patch has xfrm_state_num being mucked with; just ignore that bit. I need to send a patch against net-2.6.22 and i will clean that up - just need some feedback. Would it make sense to have those vars as u32 instead of unsigned int? cheers, jamal - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html