Re: [RFC PATCH 0/7] Share events between metrics
> Or, is the concern more about trying to time-slice the results in a
> fairly granular way and expecting accurate results then?

Usually the latter. It's especially important for divisions. You want both
divisor and dividend to be in the same time slice, otherwise the result
usually doesn't make a lot of sense.

-Andi
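To make the division problem concrete, a small worked example (numbers
invented for illustration): suppose a metric computes miss_rate =
CACHE_MISSES / CACHE_REFERENCES, and multiplexing counts CACHE_MISSES
during a busy slice (1,000,000) but CACHE_REFERENCES during a mostly idle
one (100,000). The division then reports a "miss rate" of 10, i.e. 1000%,
which is impossible; with both events counted in the same slice the ratio
stays bounded and meaningful.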
Re: perf measure for stalled cycles per instruction on newer Intel processors
> > Don't use it. It's misleading on an out-of-order CPU because you don't
> > know if it's actually limiting anything.
> >
> > If you want useful bottleneck data use --topdown.
>
> So running again, this time with the below params, I got this output
> where all the right most column is colored red. I wonder what can be
> said on the amount/ratio of stalls for this app - if you can maybe
> recommend some posts of yours to better understand that, I saw some
> comment in the perf-stat man page and some lwn article but wasn't really
> able to figure it out.

TopDown determines what limits the execution the most. The application is
mostly backend bound (55-72%). This can be either memory issues (more
common), or sometimes also execution issues.

Standard perf doesn't support a further breakdown beyond these high level
categories, but there are alternative tools that do (e.g. mine is "toplev"
in https://github.com/andikleen/pmu-tools, or VTune).

Some references on TopDown:
https://github.com/andikleen/pmu-tools/wiki/toplev-manual
http://bit.ly/tma-ispass14

The tools above would also allow you to sample where the stalls are
occurring.

-Andi
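As an illustration, a typical toplev invocation for drilling down further
looks roughly like this (./myapp is a placeholder workload; see the manual
linked above for the options your version actually supports):

  toplev.py -l3 ./myapp

-l3 descends three levels of the TopDown hierarchy, so a Backend_Bound
result gets broken down into memory-bound vs. core-bound and then into
individual bottleneck nodes.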
Re: perf measure for stalled cycles per instruction on newer Intel processors
On Thu, Oct 15, 2020 at 05:53:40PM +0300, Or Gerlitz wrote:
> Hi,
>
> Earlier Intel processors (e.g E5-2650) support the more classical
> two stall events (for backend and frontend [1]) and then perf shows
> the nice measure of stalled cycles per instruction - e.g here where we
> have IPC of 0.91 and CSPI (see [2]) of 0.68:

Don't use it. It's misleading on an out-of-order CPU because you don't
know if it's actually limiting anything.

If you want useful bottleneck data use --topdown.

-Andi
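For example (a hedged sketch; the exact flags accepted depend on the
kernel and CPU generation, and on many setups --topdown requires
system-wide mode):

  perf stat --topdown -a -- sleep 10

This prints the high level retiring / bad speculation / frontend bound /
backend bound breakdown instead of raw stall counts.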
Re: [RFC PATCH v3 12/14] perf metricgroup: order event groups by size
> > I'm not sure if size is that great a heuristic. The dedup algorithm
> > should work in any case even if you don't order by size, right?
>
> Consider two metrics:
> - metric 1 with events {A,B}
> - metric 2 with events {A,B,C,D}
> If the list isn't sorted then as the matching takes the first group
> with all the events, metric 1 will match {A,B} and metric 2 {A,B,C,D}.
> If the order is sorted to {A,B,C,D},{A,B} then metric 1 matches within
> the {A,B,C,D} group as does metric 2. The events in metric 1 aren't
> used and are removed.

Ok. It's better for the longer metric if they stay together.

> > The dedup algorithm is very naive :-)

I guess what matters is that it gives reasonable results on the current
metrics. I assume it does? How much deduping is happening if you run all
metrics?

For toplev, comparing it against a hopefully better schedule generated by
or-tools has been on my long term todo list, but I never got around to
coding that up.

-Andi
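A toy standalone illustration of that matching (an assumption-laden
sketch, not the perf code: metrics reduced to plain string sets, names
invented):

  #include <stdio.h>
  #include <string.h>

  #define MAXEV 8

  struct metric {
  	const char *name;
  	const char *events[MAXEV];
  	int nr;
  };

  /* does "big" already provide every event of "small"? */
  static int is_subset(const struct metric *small, const struct metric *big)
  {
  	for (int i = 0; i < small->nr; i++) {
  		int found = 0;
  		for (int j = 0; j < big->nr; j++)
  			if (!strcmp(small->events[i], big->events[j]))
  				found = 1;
  		if (!found)
  			return 0;
  	}
  	return 1;
  }

  int main(void)
  {
  	/* already sorted largest-first, as the patch arranges */
  	struct metric metrics[] = {
  		{ "metric2", { "A", "B", "C", "D" }, 4 },
  		{ "metric1", { "A", "B" }, 2 },
  	};

  	/* each metric reuses the first earlier group containing its events */
  	for (int i = 0; i < 2; i++) {
  		int shared = -1;
  		for (int j = 0; j < i; j++)
  			if (is_subset(&metrics[i], &metrics[j])) {
  				shared = j;
  				break;
  			}
  		if (shared >= 0)
  			printf("%s: reuses the events of %s\n",
  			       metrics[i].name, metrics[shared].name);
  		else
  			printf("%s: needs its own group\n", metrics[i].name);
  	}
  	return 0;
  }

Built and run, this prints that metric2 keeps its own group and metric1
reuses it, the same outcome as the sorted case described above.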
Re: [RFC PATCH v3 06/14] perf evsel: fix 2 memory leaks
On Thu, May 07, 2020 at 10:36:21PM -0700, Ian Rogers wrote:
> If allocated, per_pkg_mask and metric_events need freeing.

All these patches at the beginning look like straightforward bug fixes
and are really independent of the new features. For them

Reviewed-by: Andi Kleen

>
> Signed-off-by: Ian Rogers
> ---
>  tools/perf/util/evsel.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 28683b0eb738..05bb46baad6a 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1263,6 +1263,8 @@ void evsel__exit(struct evsel *evsel)
>  	zfree(&evsel->group_name);
>  	zfree(&evsel->name);
>  	zfree(&evsel->pmu_name);
> +	zfree(&evsel->per_pkg_mask);
> +	zfree(&evsel->metric_events);
>  	perf_evsel__object.fini(evsel);
>  }
>
> --
> 2.26.2.645.ge9eca65c58-goog
>
Re: [RFC PATCH v3 13/14] perf metricgroup: remove duped metric group events
> static struct evsel *find_evsel_group(struct evlist *perf_evlist,
> 				      struct expr_parse_ctx *pctx,
> +				      bool has_constraint,
> 				      struct evsel **metric_events,
> 				      unsigned long *evlist_used)
> {
> -	struct evsel *ev;
> -	bool leader_found;
> -	const size_t idnum = hashmap__size(&pctx->ids);
> -	size_t i = 0;
> -	int j = 0;
> +	struct evsel *ev, *current_leader = NULL;
> 	double *val_ptr;
> +	int i = 0, matched_events = 0, events_to_match;
> +	const int idnum = (int)hashmap__size(&pctx->ids);

BTW the standard perf data structure here would be an rblist or strlist.

I think it would be much better to do the deduping in a separate pass
rather than trying to add it to find_evsel_group. The current approach
leads to very complicated logic. A separate pass will likely also make it
easier to implement more sophisticated algorithms later.

-Andi
Re: [RFC PATCH v3 12/14] perf metricgroup: order event groups by size
On Thu, May 07, 2020 at 10:36:27PM -0700, Ian Rogers wrote:
> When adding event groups to the group list, insert them in size order.
> This performs an insertion sort on the group list. By placing the
> largest groups at the front of the group list it is possible to see if a
> larger group contains the same events as a later group. This can make
> the later group redundant - it can reuse the events from the large group.
> A later patch will add this sharing.

I'm not sure if size is that great a heuristic. The dedup algorithm should
work in any case even if you don't order by size, right?

I suppose in theory some kind of topological sort would be better.

-Andi

>
> Signed-off-by: Ian Rogers
> ---
>  tools/perf/util/metricgroup.c | 16 +++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
> index 2a6456fa178b..69fbff47089f 100644
> --- a/tools/perf/util/metricgroup.c
> +++ b/tools/perf/util/metricgroup.c
> @@ -520,7 +520,21 @@ static int __metricgroup__add_metric(struct list_head *group_list,
>  		return -EINVAL;
>  	}
>
> -	list_add_tail(&eg->nd, group_list);
> +	if (list_empty(group_list))
> +		list_add(&eg->nd, group_list);
> +	else {
> +		struct list_head *pos;
> +
> +		/* Place the largest groups at the front. */
> +		list_for_each_prev(pos, group_list) {
> +			struct egroup *old = list_entry(pos, struct egroup, nd);
> +
> +			if (hashmap__size(&eg->pctx.ids) <=
> +			    hashmap__size(&old->pctx.ids))
> +				break;
> +		}
> +		list_add(&eg->nd, pos);
> +	}
>
>  	return 0;
>  }
> --
> 2.26.2.645.ge9eca65c58-goog
>
Re: [RFC PATCH v3 01/14] perf parse-events: expand add PMU error/verbose messages
This seems like an independent bug fix and should probably be pushed
independently of the rest. Perhaps add a Fixes: tag.

Reviewed-by: Andi Kleen

On Thu, May 07, 2020 at 10:36:16PM -0700, Ian Rogers wrote:
> On a CPU like skylakex an uncore_iio_0 PMU may alias with
> uncore_iio_free_running_0. The latter PMU doesn't support fc_mask
> as a parameter and so pmu_config_term fails. Typically
> parse_events_add_pmu is called in a loop where if one alias succeeds
> errors are ignored, however, if multiple errors occur
> parse_events__handle_error will currently give a WARN_ONCE.
>
> This change removes the WARN_ONCE in parse_events__handle_error and
> makes it a pr_debug. It adds verbose messages to parse_events_add_pmu
> warning that non-fatal errors may occur, while giving details on the pmu
> and config terms for useful context. pmu_config_term is altered so the
> failing term and pmu are present in the case of the 'unknown term'
> error which makes spotting the free_running case more straightforward.
>
> Before:
> $ perf --debug verbose=3 stat -M llc_misses.pcie_read sleep 1
> Using CPUID GenuineIntel-6-55-4
> metric expr unc_iio_data_req_of_cpu.mem_read.part0 +
> unc_iio_data_req_of_cpu.mem_read.part1 +
> unc_iio_data_req_of_cpu.mem_read.part2 +
> unc_iio_data_req_of_cpu.mem_read.part3 for LLC_MISSES.PCIE_READ
> found event unc_iio_data_req_of_cpu.mem_read.part0
> found event unc_iio_data_req_of_cpu.mem_read.part1
> found event unc_iio_data_req_of_cpu.mem_read.part2
> found event unc_iio_data_req_of_cpu.mem_read.part3
> metric expr unc_iio_data_req_of_cpu.mem_read.part0 +
> unc_iio_data_req_of_cpu.mem_read.part1 +
> unc_iio_data_req_of_cpu.mem_read.part2 +
> unc_iio_data_req_of_cpu.mem_read.part3 for LLC_MISSES.PCIE_READ
> found event unc_iio_data_req_of_cpu.mem_read.part0
> found event unc_iio_data_req_of_cpu.mem_read.part1
> found event unc_iio_data_req_of_cpu.mem_read.part2
> found event unc_iio_data_req_of_cpu.mem_read.part3
> adding
> {unc_iio_data_req_of_cpu.mem_read.part0,unc_iio_data_req_of_cpu.mem_read.part1,unc_iio_data_req_of_cpu.mem_read.part2,unc_iio_data_req_of_cpu.mem_read.part3}:W,{unc_iio_data_req_of_cpu.mem_read.part0,unc_iio_data_req_of_cpu.mem_read.part1,unc_iio_data_req_of_cpu.mem_read.part2,unc_iio_data_req_of_cpu.mem_read.part3}:W
> intel_pt default config: tsc,mtc,mtc_period=3,psb_period=3,pt,branch
> WARNING: multiple event parsing errors
> ...
> Invalid event/parameter 'fc_mask'
> ...
>
> After:
> $ perf --debug verbose=3 stat -M llc_misses.pcie_read sleep 1
> Using CPUID GenuineIntel-6-55-4
> metric expr unc_iio_data_req_of_cpu.mem_read.part0 +
> unc_iio_data_req_of_cpu.mem_read.part1 +
> unc_iio_data_req_of_cpu.mem_read.part2 +
> unc_iio_data_req_of_cpu.mem_read.part3 for LLC_MISSES.PCIE_READ
> found event unc_iio_data_req_of_cpu.mem_read.part0
> found event unc_iio_data_req_of_cpu.mem_read.part1
> found event unc_iio_data_req_of_cpu.mem_read.part2
> found event unc_iio_data_req_of_cpu.mem_read.part3
> metric expr unc_iio_data_req_of_cpu.mem_read.part0 +
> unc_iio_data_req_of_cpu.mem_read.part1 +
> unc_iio_data_req_of_cpu.mem_read.part2 +
> unc_iio_data_req_of_cpu.mem_read.part3 for LLC_MISSES.PCIE_READ
> found event unc_iio_data_req_of_cpu.mem_read.part0
> found event unc_iio_data_req_of_cpu.mem_read.part1
> found event unc_iio_data_req_of_cpu.mem_read.part2
> found event unc_iio_data_req_of_cpu.mem_read.part3
> adding
> {unc_iio_data_req_of_cpu.mem_read.part0,unc_iio_data_req_of_cpu.mem_read.part1,unc_iio_data_req_of_cpu.mem_read.part2,unc_iio_data_req_of_cpu.mem_read.part3}:W,{unc_iio_data_req_of_cpu.mem_read.part0,unc_iio_data_req_of_cpu.mem_read.part1,unc_iio_data_req_of_cpu.mem_read.part2,unc_iio_data_req_of_cpu.mem_read.part3}:W
> intel_pt default config: tsc,mtc,mtc_period=3,psb_period=3,pt,branch
> Attempting to add event pmu 'uncore_iio_free_running_5' with
> 'unc_iio_data_req_of_cpu.mem_read.part0,' that may result in non-fatal errors
> After aliases, add event pmu 'uncore_iio_free_running_5' with
> 'fc_mask,ch_mask,umask,event,' that may result in non-fatal errors
> Attempting to add event pmu 'uncore_iio_free_running_3' with
> 'unc_iio_data_req_of_cpu.mem_read.part0,' that may result in non-fatal errors
> After aliases, add event pmu 'uncore_iio_free_running_3' with
> 'fc_mask,ch_mask,umask,event,' that may result in non-fatal errors
> Attempting to add event pmu 'uncore_iio_free_running_1' with
> 'unc_iio_data_req_of_cpu.mem_read.part0,' that may result in non-fatal errors
> After aliases, add event pmu 'uncore_iio_free_running_1' with
> 'fc_mask,ch_mask,umask,event,
Re: [RFC PATCH v3 00/14] Share events between metrics
On Thu, May 07, 2020 at 10:36:15PM -0700, Ian Rogers wrote:
> Metric groups contain metrics. Metrics create groups of events to
> ideally be scheduled together. Often metrics refer to the same events,
> for example, a cache hit and cache miss rate. Using separate event
> groups means these metrics are multiplexed at different times and the
> counts don't sum to 100%. More multiplexing also decreases the
> accuracy of the measurement.
>
> This change orders metrics from groups or the command line, so that
> the ones with the most events are set up first. Later metrics see if
> groups already provide their events, and reuse them if
> possible. Unnecessary events and groups are eliminated.

Yes, some improvements here are great.

>
> The option --metric-no-group is added so that metrics aren't placed in
> groups. This affects multiplexing and may increase sharing.
>
> The option --metric-no-merge is added and with this option the
> existing grouping behavior is preserved.

Could we also make this a per metric option, like -M foo:nomerge,... or
somesuch? Okay, I suppose this could be a followon. Ultimately, like you
said, we probably want to configure defaults in the event file.

-Andi
Re: [RFC PATCH 0/7] Share events between metrics
> > - without this change events within a metric may get scheduled
> > together, after they may appear as part of a larger group and be
> > multiplexed at different times, lowering accuracy - however, less
> > multiplexing may compensate for this.
>
> I agree the heuristic in this patch set is naive and would welcome to
> improve it from your toplev experience. I think this change is
> progress on TopDownL1 - would you agree?

TopdownL1 in non SMT mode should always fit. Inside a group, deduping
always makes sense. The problem is SMT mode, where it doesn't fit. toplev
tries to group each node and each level together.

> I'm wondering if what is needed are flags to control behavior. For
> example, avoiding the use of groups altogether. For TopDownL1

I see. Yes, the current situation isn't great. For Topdown your patch
clearly is an improvement; I'm not sure it is for everything though.

Probably the advanced heuristics are only useful for a few formulas; most
are very simple. So maybe it's ok. I guess it would need some testing over
the existing formulas.

-Andi
Re: [RFC PATCH 0/7] Share events between metrics
On Thu, May 07, 2020 at 01:14:29AM -0700, Ian Rogers wrote:
> Metric groups contain metrics. Metrics create groups of events to
> ideally be scheduled together. Often metrics refer to the same events,
> for example, a cache hit and cache miss rate. Using separate event
> groups means these metrics are multiplexed at different times and the
> counts don't sum to 100%. More multiplexing also decreases the
> accuracy of the measurement.
>
> This change orders metrics from groups or the command line, so that
> the ones with the most events are set up first. Later metrics see if
> groups already provide their events, and reuse them if
> possible. Unnecessary events and groups are eliminated.

Note this actually may make multiplexing errors worse. For metrics it is
often important that all the input values to the metric run at the same
time. So e.g. if you have two metrics and they each fit into a group, but
not together, then even though you have more multiplexing it will give
more accurate results for each metric.

I think your change can make sense for metrics that don't fit into single
groups anyway. perf currently doesn't quite know this, but some heuristic
could be added. But I wouldn't do it for simple metrics that fit into
groups. The result may well be worse.

My toplev tool has some heuristics for this, also some more sophisticated
ones that track subexpressions. That would likely be far too complicated
for perf.

-Andi
Re: [RESEND] test_verifier #13 fails on arm64: "retval 65507 != -29"
> > Reading csum_partial/csum_fold, seems like after calculation of
> > checksum (so-called unfolded checksum), it is supposed to be passed
> > into csum_fold() to convert it into 16-bit one and invert.

Yes, you always need to fold at the end. The low level code does fold
sometimes, but not always.

-Andi
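For reference, the fold step looks roughly like this, a standalone sketch
modeled on the kernel's generic csum_fold(), using plain C types instead
of __wsum/__sum16:

  #include <stdio.h>

  static unsigned short fold_csum(unsigned int sum)
  {
  	sum = (sum & 0xffff) + (sum >> 16);	/* add the two 16-bit halves */
  	sum = (sum & 0xffff) + (sum >> 16);	/* ripple the possible carry */
  	return (unsigned short)~sum;		/* final one's complement */
  }

  int main(void)
  {
  	/* 0xffff0001 folds to 0x0001, inverted gives 0xfffe */
  	printf("%#x\n", fold_csum(0xffff0001u));
  	return 0;
  }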
[PATCH] ipv4 ping: Fix __init* attributes
From: Andi Kleen

ping_v4_net_ops references init functions, so needs to be __initdata.
ping_proc_exit is then referenced from __initdata, so also needs to be
__init.

Signed-off-by: Andi Kleen
---
 net/ipv4/ping.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 7ccb5f87f70b..070e078612a0 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -1160,7 +1160,7 @@ static void __net_exit ping_v4_proc_exit_net(struct net *net)
 	remove_proc_entry("icmp", net->proc_net);
 }
 
-static struct pernet_operations ping_v4_net_ops = {
+static __initdata struct pernet_operations ping_v4_net_ops = {
 	.init = ping_v4_proc_init_net,
 	.exit = ping_v4_proc_exit_net,
 };
@@ -1170,7 +1170,7 @@ int __init ping_proc_init(void)
 	return register_pernet_subsys(&ping_v4_net_ops);
 }
 
-void ping_proc_exit(void)
+void __init ping_proc_exit(void)
 {
 	unregister_pernet_subsys(&ping_v4_net_ops);
 }
-- 
2.20.1
Re: CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line
> I wonder if there is some way to use the split format for the
> intermediate files, but then for the very final link bring them all in
> and make the end result be a traditional single binary. I'm not
> talking the separate "dwp" package that packs multiple dwo files into
> one, but to actually link them all in at the one.
>
> Sure, it would lose some of the advantage, but I think a large portion
> of the -gsplit-dwarf advantage is about the intermediate state. No?

Not sure it's worth doing complicated workarounds. I assume it will not
be that difficult to fix binutils (after all gdb works), and disabling
the option is a reasonable workaround meanwhile.

> I tried to google for it, but couldn't find anything. But apparently
> elfutils doesn't support dwo files either. It seems mainly the linker
> and gdb itself that supports it.

The original design document is https://gcc.gnu.org/wiki/DebugFission

-Andi
Re: CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line
On Mon, Nov 13, 2017 at 12:56:31PM -0800, Linus Torvalds wrote:
> On Mon, Nov 13, 2017 at 12:10 PM, Andi Kleen wrote:
> >
> > You're right. It works for line information, but strangely not for
> > inlines. I assume it can be fixed.
>
> So I'm not 100% sure it's strictly a addr2line bug.

It seems to be broken for normal programs too:

$ cat tinline.c
int i;

static inline int finline(void)
{
	i++;
}

main()
{
	finline();
}
$ gcc -O2 -gsplit-dwarf tinline.c
$ addr2line -i -e a.out 0x4003b0
/home/ak/tsrc/tinline.c:6
$ gcc -O2 -g tinline.c
$ addr2line -i -e a.out 0x4003b0
/home/ak/tsrc/tinline.c:6
/home/ak/tsrc/tinline.c:12
$

I filed https://sourceware.org/bugzilla/show_bug.cgi?id=22434

-Andi
Re: CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line
> Put another way: the CONFIG_DEBUG_INFO_SPLIT option is useless. Yes,
> it saves time and disk space, but does so at the expense of making all
> the debug information unavailable to basic tools.

You're right. It works for line information, but strangely not for
inlines. I assume it can be fixed.

-Andi
Re: CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line
> > It's the "CONFIG_DEBUG_INFO_SPLIT" thing that makes faddr2line unable
> > to see the inlining information.
> >
> > Using OPTIMIZE_INLINING is fine.
>
> Good to know that!

It works for me. Perhaps your binutils is too old? Support was added at
some point. Can you try upgrading?

% ./linux/scripts/faddr2line obj/vmlinux schedule+10
schedule+10/0x80:
schedule at arch/x86/include/asm/current.h:15
% addr2line --version
GNU addr2line version 2.27-24.fc26

-Andi
Re: Page allocator bottleneck
Tariq Toukan writes:
>
> Congestion in this case is very clear.
> When monitored in perf top:
> 85.58% [kernel] [k] queued_spin_lock_slowpath

Please look at the callers. Spinlock profiles without callers are usually
useless because they just blame the messenger.

Most likely the PCP lists are too small for your extreme allocation rate,
so it goes back too often to the shared pool. You can play with the
vm.percpu_pagelist_fraction setting.

-Andi
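For example (value purely illustrative; the sysctl sets each zone's
per-cpu list high watermark to zone_pages/fraction, so a smaller fraction
means larger per-cpu lists and fewer trips to the shared pool):

  # sysctl -w vm.percpu_pagelist_fraction=8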
Re: [PATCH v7 6/6] siphash: implement HalfSipHash1-3 for hash tables
> 64-bit x86_64:
> [0.509409] test_siphash: SipHash2-4 cycles: 4049181
> [0.510650] test_siphash: SipHash1-3 cycles: 2512884
> [0.512205] test_siphash: HalfSipHash1-3 cycles: 3429920
> [0.512904] test_siphash:JenkinsHash cycles: 978267

I'm not sure what these numbers mean. Surely a single siphash2-4 does not
take 4+ million cycles? If you run them in a loop, please divide by the
number of iterations.

But generally running small code in a loop is often an unrealistic
benchmark strategy, because it hides cache misses, primes predictors,
changes frequencies and changes memory costs, but can also overload
pipelines and oversubscribe resources. [see also page 46+ in
http://halobates.de/applicative-mental-models.pdf]

So the numbers you get there are at least somewhat dubious. It would be
good to have at least some test which is not just a tiny micro benchmark
before drawing conclusions.

-Andi
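A minimal sketch of the per-iteration reporting (hash_under_test() is a
hypothetical stand-in for the real hash, and all the loop caveats above
still apply to this style of measurement):

  #include <stdint.h>
  #include <stdio.h>
  #include <x86intrin.h>

  /* placeholder for the function actually being benchmarked */
  static uint64_t hash_under_test(uint64_t x)
  {
  	return x * 0x9E3779B97F4A7C15ull;
  }

  int main(void)
  {
  	enum { ITER = 1000000 };
  	volatile uint64_t sink = 0;	/* keep the calls from being optimized out */
  	uint64_t t0 = __rdtsc();

  	for (int i = 0; i < ITER; i++)
  		sink ^= hash_under_test(i);

  	printf("%.2f cycles/call (amortized)\n",
  	       (double)(__rdtsc() - t0) / ITER);
  	return 0;
  }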
Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
Eric Dumazet writes:
> On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote:
>> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size
>> grows with the number of fds passed. We had a customer report page
>> allocation failures of order-4 for this allocation. This is a costly
>> order, so it might easily fail, as the VM expects such allocation to
>> have a lower-order fallback.
>>
>> Such trivial fallback is vmalloc(), as the memory doesn't have to be
>> physically contiguous. Also the allocation is temporary for the
>> duration of the syscall, so it's unlikely to stress vmalloc too much.
>
> vmalloc() uses a vmap_area_lock spinlock, and TLB flushes.
>
> So I guess allowing vmalloc() being called from an innocent application
> doing a select() might be dangerous, especially if this select() happens
> thousands of time per second.

Yes, it seems like a bad idea because of all the scaling problems here.
The right solution would be to fix select to use multiple non virtually
contiguous pages.

-Andi
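For context, the fallback pattern the patch proposes is roughly the
following sketch, essentially the kvmalloc()/kvfree() idiom (fragment
only, not the actual patch code):

	/* try physically contiguous memory first, fall back to vmalloc */
	bits = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
	if (!bits)
		bits = vmalloc(size);
	...
	kvfree(bits);	/* frees either kind */

The objection above is that the vmalloc() path takes a global spinlock
and eventually forces TLB flushes, so letting an unprivileged syscall
trigger it at a high rate is a scalability concern.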
Re: [RFC] bridge: MAC learning uevents
"D. Herrendoerfer" writes: > > I may be missing something here - I'm pretty sure there I am, but is > there any conceptual > > reason why this should not be done this way ? What happens if someone floods the network with random mac addresses? Sounds like an easy way to do a DoS attack against your host? -Andi
Re: [RFC V2 PATCH 17/25] net/netpolicy: introduce netpolicy_pick_queue
> +1, I tried to bring this up here [1] in the last spin. I think only very
> few changes would be needed, f.e. on eBPF side to add a queue setting
> helper function which is probably straight forward ~10loc patch; and with
> regards to actually picking it up after clsact egress, we'd need to adapt
> __netdev_pick_tx() slightly when CONFIG_XPS so it doesn't override it.

You're proposing to rewrite the whole net policy manager as EBPF and run
it in a crappy JITer? Is that a serious proposal? It just sounds crazy to
me. Especially since we already have a perfectly good compiler and
programming language to write system code in.

EBPF is ok for temporal instrumentation (if you somehow can accept its
security challenges), but using it to replace core kernel functionality
(which network policy IMHO is) with some bizarre JITed setup and multiple
languages doesn't really make any sense. Especially it doesn't make sense
for anything with shared state, which is the core part of network policy:
it negotiates with multiple users.

After all we're writing Linux here and not some research toy.

Thanks,
-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
Re: [RFC PATCH 00/30] Kernel NET policy
> > So where is your policy for power saving? From past experience I can
> > tell you
>
> There is no policy for power saving yet. I will add it to my todo list.

Yes, it's interesting to consider. The main goal here is to maximize CPU
idle residency? I wonder if that's much different from the CPU policy.

-Andi
Re: [RFC PATCH 00/30] Kernel NET policy
> I wonder if this can be attacked from a different angle. What would be
> missing to add support for this in user space? The first possibility
> that came to my mind is to just multiplex those hints in the kernel.

"Just" is the handwaving part here -- you're proposing a microkernel
approach where part of the multiplexing job the kernel is doing is farmed
out to a message passing user space component.

I suspect this would be far more complicated to get right and perform
well than a straightforward monolithic kernel subsystem -- which is
traditionally how Linux has approached things.

The daemon would always need to work with out of date state compared to
the latest, because it cannot do any locking against the kernel state. So
you end up with a complex distributed system with multiple agents
"fighting" with each other, and the tuning agent never being able to keep
up with the actual work.

Also, of course, it would be fundamentally less efficient than kernel
code doing it, just because of the additional context switches needed.

-Andi
Re: [RFC PATCH 00/30] Kernel NET policy
> It seems strange to me to add such policies to the kernel.
> Admittedly, documentation of some settings is non-existent and one needs
> various different tools to set this (sysctl, procfs, sysfs, ethtool, etc).

The problem is that different applications need different policies. The
only entity which can efficiently negotiate between different
applications' conflicting requests is the kernel. And that is pretty much
the basic job description of a kernel: multiplex hardware efficiently
between different users.

So yes, the user space tuning approach works for simple cases ("only run
workloads that require the same tuning"), but is ultimately not very
interesting nor scalable.

-Andi
Re: [PATCH] net:ppp: replace too strict capability restriction on opening /dev/ppp
Shanker Wang writes:
> This patch removes the check for CAP_NET_ADMIN in the initial namespace
> when opening /dev/ppp. Instead, CAP_NET_ADMIN is checked in the user
> namespace the net namespace was created in, so that /dev/ppp can get
> opened in an unprivileged container.

Seems dangerous. From a quick look at the PPP ioctls there is no limit on
how many PPP devices this can create. So a container having access to
this would be able to fill all kernel memory. Probably needs more
auditing and hardening first. In general there seems to be a lot of
attack surface for root in PPP.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only
Re: [PATCH 00/33] Compile-time stack metadata validation
> > There are two ways to fix the warnings:
> >
> > 1. get rid of the thunks and call the C functions directly; or
>
> No. Not until gcc learns about per-function calling conventions (so that
> it can be marked as not clobbering registers).

It does already, for static functions in 5.x (with -fipa-ra). And with
LTO it can be used even somewhat globally.

Even older versions supported it, but only for x86->SSE on 32bit, which
is useless for the kernel. The new IPA-RA propagates which registers are
clobbered.

That said, it will probably be a long time until we can drop support for
older compilers. So for now the manual method is still needed.

-Andi
Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64
Tom Herbert writes:
> Also, we don't do anything special for alignment, unaligned
> accesses on x86 do not appear to be a performance issue.

This is not true on Atom CPUs. Also on most CPUs there is still a larger
penalty when crossing cache lines.

> Verified correctness by testing arbitrary length buffer filled with
> random data. For each buffer I compared the computed checksum
> using the original algorithm for each possible alignment (0-7 bytes).
>
> Checksum performance:
>
> Isolating old and new implementation for some common cases:

You forgot to state the CPU. The results likely depend heavily on the
micro architecture. The original C code was optimized for K8, FWIW.

Overall your assembler looks similar to the C code, except for the jump
table. A jump table has the disadvantage that it is much harder to branch
predict, with a large penalty if it's mispredicted. I would expect it to
be slower for cases where the length changes frequently. Did you
benchmark that case?

-Andi
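To make the branch prediction point concrete, a hedged sketch of the two
tail-handling styles (illustrative only, not the patch's actual code; a
plain byte sum stands in for the real one's complement arithmetic):

  #include <stddef.h>
  #include <stdio.h>

  /* conditional style: three bit tests that correlate with len's bits
   * and tend to predict well */
  static unsigned tail_cond(const unsigned char *p, size_t len, unsigned sum)
  {
  	if (len & 4) { sum += p[0] + p[1] + p[2] + p[3]; p += 4; }
  	if (len & 2) { sum += p[0] + p[1]; p += 2; }
  	if (len & 1) sum += p[0];
  	return sum;
  }

  /* jump-table style: one indirect branch that mispredicts whenever
   * (len & 7) keeps changing */
  static unsigned tail_jump(const unsigned char *p, size_t len, unsigned sum)
  {
  	switch (len & 7) {
  	case 7: sum += p[6]; /* fall through */
  	case 6: sum += p[5]; /* fall through */
  	case 5: sum += p[4]; /* fall through */
  	case 4: sum += p[3]; /* fall through */
  	case 3: sum += p[2]; /* fall through */
  	case 2: sum += p[1]; /* fall through */
  	case 1: sum += p[0]; /* fall through */
  	case 0: ;
  	}
  	return sum;
  }

  int main(void)
  {
  	unsigned char buf[7] = { 1, 2, 3, 4, 5, 6, 7 };

  	/* both variants sum the same bytes for every tail length */
  	for (size_t len = 0; len <= 7; len++)
  		printf("len %zu: cond %u jump %u\n", len,
  		       tail_cond(buf, len, 0), tail_jump(buf, len, 0));
  	return 0;
  }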
Re: Increasing skb->mark size
> > This would be great. I've recently run into some issues with
> > the overhead of the Android firewall setup.
> >
> > So basically you need 4 extra bytes in sk_buff. How about:
> >
> > - shrinking skb->priority to 2 bytes
>
> That wouldn't work, see SO_PRIORITY and such (4 bytes) ...

But does anybody really use the full 4 bytes for the priority?
SO_PRIORITY could well truncate the value.

> > - skb_iif is either skb->dev->ifindex or 0, so it could be replaced
> >   with a single bit flag for the 0 case.
>
> ... and that one wouldn't work on ingress.

Please explain.

-Andi
Re: Increasing skb->mark size
Lorenzo Colitti writes:
> On Wed, Nov 25, 2015 at 5:32 AM, Matt Bennett wrote:
>> I'm emailing this list for feedback on the feasibility of increasing
>> skb->mark or adding a new field for marking. Perhaps this extension
>> could be done under a new CONFIG option.
>
> 64-bit marks (both skb->mark and sk->sk_mark) would be useful for
> hosts doing complex policy routing as well. Current Android releases
> use 20 of the 32 bits. If the mark were 64 bits, we could put the UID
> in it, and stop using ip rules to implement per-UID routing.

This would be great. I've recently run into some issues with the overhead
of the Android firewall setup.

So basically you need 4 extra bytes in sk_buff. How about:

- shrinking skb->priority to 2 bytes
- skb_iif is either skb->dev->ifindex or 0, so it could be replaced with
  a single bit flag for the 0 case.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only
Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
> My specific CPU (i7-4790K @ 4.00GHz) unfortunately seems to have
> limited "Frontend" support. E.g.
>
> # perf record -g -a -e stalled-cycles-frontend
> Error:
> The stalled-cycles-frontend event is not supported.
>
> And AFAIK icache misses are part of "frontend".

Ignore stalled-cycles-frontend. It is very unreliable and has never
worked right. toplev gives you much more reliable output.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only
Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
> There is a recent patch that may help here, see below, but maybe its
> just a matter of removing that :pp, as it ends with a /pp anyway, no
> need to state that twice :)

Yes, the extra :pp was a regression in toplev. I fixed it now. Thanks.

-Andi
Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
> My only problem left, is I want a perf measurement that pinpoints these
> kind of spots. The difference in L1-icache-load-misses was significant
> (1,278,276 vs 2,719,158). I tried to somehow perf record this with
> different perf events without being able to pinpoint the location (even
> though I know the spot now). Even tried Andi's ocperf.py... maybe he
> will know what event I should try?

Run pmu-tools toplev.py -l3 with --show-sample. It tells you what the
bottleneck is and what to sample for if there is a suitable event, and
even prints the command line.

https://github.com/andikleen/pmu-tools/wiki/toplev-manual#sampling-with-toplev

However, frontend issues are difficult to sample, as they happen very far
away from instruction retirement, where the sampling happens. So you may
have large skid and the sampling points may be far away. Skylake has new
special FRONTEND_* PEBS events for this, but before that it was often
difficult.

BTW, if your main goal is icache: I wrote a gcc patch to help the kernel
by enabling function splitting. Apply the patch in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66890 to gcc 5, make sure
9bebe9e5b0f (now in mainline) is applied, and build with
-freorder-blocks-and-partition. That will split all functions into
statically predicted hot and cold parts and generally relieves icache
pressure. Any testing of this on your workload is welcome.

-Andi
Re: [RFC PATCH 0/8]: uninline & uninline
> Is it possible to catch this automatically, like, by re-defining
> likely/unlikely to the raw form in specific file(s)?

Sure, it would be possible to define an IN_VDSO symbol in all vdso
related files and then do that. It might be useful for other things too;
the vdso has some very specific requirements.

-Andi
Re: [RFC PATCH 0/8]: uninline & uninline
Andrew Morton <[EMAIL PROTECTED]> writes:
>> -41525 2066 f, 3370 +, 44895 -, diff: -41525 IS_ERR
>
> This is a surprise. I expect that the -mm-only
> profile-likely-unlikely-macros.patch is the cause of this and mainline
> doesn't have this problem.

Shouldn't they only have overhead when the respective CONFIG is enabled?

> If true, then this likely/unlikely bloat has probably spread into a lot
> of your other results and it all should be redone against mainline,
> sorry :(
>
> (I'm not aware of anyone having used profile-likely-unlikely-macros.patch
> in quite some time. That's unfortunate because it has turned up some
> fairly flagrant code deoptimisations)

Is there any reason they couldn't just be merged to mainline? I think
it's a useful facility.

-Andi
Re: [RFC PATCH 8/8] Jhash in too big for inlining, move under lib/
Andrew Morton <[EMAIL PROTECTED]> writes:
>
> It should be possible to use a modular jhash.ko. The things which you
> have identified as clients of the jhash library are usually loaded as
> modules.

For very small functions like this, a separate module is quite expensive.

First, everything gets rounded up to at least one 4K page (or worse on
architectures with larger pages). That just wastes some memory.

But then, since modules live in vmalloc space, they also need their own
TLB entry, which are notoriously scarce in the kernel because user space
often wants to monopolize them all. So if you're unlucky and user space
is thrashing the TLB, even a single call to this hash function will take
an own TLB miss and be quite expensive.

It would be better to just always link it in for this case.

-Andi
Re: [PATCH] Make sure sockets implement splice_read
Rémi Denis-Courmont <[EMAIL PROTECTED]> writes:
> On Friday 15 February 2008 18:30:11, Andi Kleen wrote:
>> David Miller <[EMAIL PROTECTED]> writes:
>>> From: Rémi Denis-Courmont <[EMAIL PROTECTED]>
>>> Date: Thu, 14 Feb 2008 18:53:34 +0200
>>>
>>>> Fixes a segmentation fault when trying to splice from a non-TCP socket.
>>>>
>>>> Signed-off-by: Rémi Denis-Courmont <[EMAIL PROTECTED]>
>>>
>>> Applied, thank you.
>>
>> That's also a stable candidate, isn't it?
>
> Should go to 2.6.25-rc2, but the offending code is not in any stable
> release.

You're right. I misinterpreted git describe output for this one. Sorry.

-Andi
Re: [PATCH] Make sure sockets implement splice_read
David Miller <[EMAIL PROTECTED]> writes:
> From: Rémi Denis-Courmont <[EMAIL PROTECTED]>
> Date: Thu, 14 Feb 2008 18:53:34 +0200
>
>> Fixes a segmentation fault when trying to splice from a non-TCP socket.
>>
>> Signed-off-by: Rémi Denis-Courmont <[EMAIL PROTECTED]>
>
> Applied, thank you.

That's also a stable candidate, isn't it?

-Andi
Re: [PATCH] ipvs: Make the synchronization interval controllable
Sven Wegener <[EMAIL PROTECTED]> writes:
> The default synchronization interval of 1000 milliseconds is too high
> for a heavily loaded director. Collecting the connection information
> from one second and then sending it out in a burst will overflow the
> socket buffer and lead to synchronization information being dropped.
> Make the interval controllable by a sysctl variable so that users can
> tune it.

It would be better if the defaults just worked under all circumstances.
So why not just lower the default? Or the code could detect overflowing
socket buffers and lower the value dynamically.

-Andi
openswan broken between 2.6.24rc8 and final
Hi,

Since I updated a system here from 2.6.24rc8 (+ some unrelated patches)
to 2.6.24, openswan in tunnel mode is rather unhappy now. It works
initially for a few minutes, but then when it comes to renewing a key (I
think) the connection always hangs.

This means on 2.6.24-final the kernel can now finally reboot again with
active ipsec (previously there were always leaked reference counts), but
it cannot keep the tunnels up for a longer time.

I first thought it was caused by

commit d4782c323d10d3698b71b6a6b3c7bdad33824658
Author: Patrick McHardy <[EMAIL PROTECTED]>
Date:   Sun Jan 20 17:24:29 2008 -0800

    [AF_KEY]: Fix skb leak on pfkey_send_migrate() error

but when I revert just this patch it still doesn't work (and obviously
the reboot problem is back).

Before I do a full bisect: is there already a fix for this?

Thanks,
-Andi
Re: [PATCH] Add IPv6 support to TCP SYN cookies
On Thu, Feb 07, 2008 at 02:44:46PM -0500, Ross Vandegrift wrote:
> On Wed, Feb 06, 2008 at 09:53:57AM +0100, Andi Kleen wrote:
> > That would be useful yes -- for different bandwidths.
>
> My initial test is end-to-end 1000Mbps, but I've got a few different
> packet rates.
>
> > If the young/old heuristics do not work well enough anymore most
> > likely we should try readding RED to the syn queue again. That used
> > to be pretty effective in the early days. I don't quite remember why
> > Linux didn't end up using it in fact.
>
> I'm running juno-z with 2, 4, & 8 threads of syn flood to port 80.
> wireshark measures 2 threads at 350pps, 4 threads at 750pps, and 8

What's the total bandwidth of the attack?

> threads at 1200pps. Under no SYN flood, the server handles 750 HTTP
> requests per second, measured via httping in flood mode.

Thanks for the tests. Could you please do an additional experiment? Use
netem (sch_netem) or similar to add a jittered, longer latency to the
connection (as would be realistic in a real distributed DoS). Does it
make a difference?

> With a default tcp_max_syn_backlog of 1024, I can trivially prevent
> any inbound client connections with 2 threads of syn flood.
> Enabling tcp_syncookies brings the connection handling back up to 725
> fetches per second.

Yes, the defaults are probably too low. That's something that should be
fixed.

> At these levels the CPU impact of tcp_syncookies is nothing. I can't

CPU impact of syncookies was never a concern. The problems are rather the
missing flow control and the disabling of valuable TCP features.

> BUG: soft lockup detected on CPU#1!
> [] softlockup_tick+0x96/0xa4
> [] update_process_times+0x39/0x5c
> [] smp_apic_timer_interrupt+0x5b/0x6c
> [] apic_timer_interrupt+0x1f/0x24
> [] taskstats_exit_send+0x152/0x371
> [] netlink_kernel_create+0x5/0x11c
> [] reqsk_queue_alloc+0x32/0x81
> [] lock_sock+0x8e/0x96

I think the softirqs are starving user context through the socket lock.
That probably should be fixed too. Something like: softirq should detect
when there is a user context looping too long and give up the lock for
some time.

-Andi
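For example, a netem setup along these lines (interface and values
illustrative):

  tc qdisc add dev eth0 root netem delay 100ms 30ms

adds 100ms of latency with +/-30ms of jitter, which makes the forged SYN
traffic look more like a distributed attack coming from remote hosts.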
Re: [PATCH] Add IPv6 support to TCP SYN cookies
On Wed, Feb 06, 2008 at 09:36:11AM -0800, Glenn Griffin wrote:
> > I didn't think a module could have multiple module_inits. Are you
> > sure that works?
>
> Indeed. That will fail whenever ipv6 is compiled as a module. It's
> been removed. It snuck in from the v4 implementation, where I'm still
> having trouble understanding why it's needed there.

s/needed/used/

ipv4 is never modular, so it works. Arguably it would be cleaner if it
was __initcall().

-Andi
Re: [PATCH] Add IPv6 support to TCP SYN cookies
> I work at a hosting company and we see these kinds of issues in the
> real world fairly frequently. I would guess maybe on a monthly basis.
> The servers where we have seen this are typically running RHEL 4 or 5
> kernels, so I can't really speak to how recent the kernel is in this
> specific term.

RHEL5 should be recent enough.

> If I can find a box that we could temporarily get a kernel.org kernel
> on, I'll see if I can get a real comparison together. We have
> collected a few of the more effective attack tools that have been left
> on compromised systems, so it wouldn't be too difficult to get some
> numbers.

That would be useful, yes -- for different bandwidths.

If the young/old heuristics do not work well enough anymore, most likely
we should try re-adding RED to the syn queue. That used to be pretty
effective in the early days. I don't quite remember why Linux didn't end
up using it, in fact.

-Andi
Re: [PATCH] Add IPv6 support to TCP SYN cookies
> +static __init int init_syncookies(void)
> +{
> +	get_random_bytes(syncookie_secret, sizeof(syncookie_secret));
> +	return 0;
> +}
> +module_init(init_syncookies);

I didn't think a module could have multiple module_inits. Are you sure
that works?

-Andi
Re: [PATCH] Add IPv6 support to TCP SYN cookies
On Tue, Feb 05, 2008 at 11:39:11PM +0300, Evgeniy Polyakov wrote:
> On Tue, Feb 05, 2008 at 09:02:11PM +0100, Andi Kleen ([EMAIL PROTECTED]) wrote:
> > On Tue, Feb 05, 2008 at 10:29:28AM -0800, Glenn Griffin wrote:
> > > > Syncookies are discouraged these days. They disable too many
> > > > valuable TCP features (window scaling, SACK) and even without them
> > > > the kernel is usually strong enough to defend against syn floods
> > > > and systems have much more memory than they used to.
> > > >
> > > > So I don't think it makes much sense to add more code to it, sorry.
>
> How do syncookies prevent windows from growing?

Syncookies do not allow window scaling, so you can't have any windows
larger than 64k.

> Most (if not all) distributions have them enabled and window growing
> works just fine. Actually I do not see any reason why connection
> establishment handshake should prevent any run-time operations at all,
> even if it was setup during handshake.

TCP only uses options negotiated during the handshake, and syncookies
are incapable of negotiating them.

-Andi
Re: [PATCH] Add IPv6 support to TCP SYN cookies
> The problem is that any reasonably recent PC can generate enough
> forged SYN packets to overwhelm reasonable SYN queues on a much more
> powerful server.

Have you actually seen this with a recent kernel in the wild, or are you
just talking theoretically?

Linux uses some heuristics to manage the syn queue that should still
ensure reasonable service under attack, even without cookies. Also,
SYN-RECV sockets are stored in a special data structure optimized to use
minimal resources. It is far from the classical head-drop method that was
so vulnerable to syn flooding.

-Andi
Re: [PATCH] Add IPv6 support to TCP SYN cookies
On Tue, Feb 05, 2008 at 10:29:28AM -0800, Glenn Griffin wrote:
> > Syncookies are discouraged these days. They disable too many
> > valuable TCP features (window scaling, SACK) and even without them
> > the kernel is usually strong enough to defend against syn floods
> > and systems have much more memory than they used to.
> >
> > So I don't think it makes much sense to add more code to it, sorry.
>
> As you say the kernel is usually strong enough to defend against syn
> flood attacks, but what about the situations where it isn't? As
> valuable as the TCP features are, I would give them up if it means I'm
> able to connect to my sshd port when I otherwise would be unable to.
> While increased synq sizes, better dropping algorithms, and minisocks
> are a great way to mitigate the attacks and in most cases are enough,
> there are situations where syncookies are nice.

Have you seen such a case in practice with a modern kernel?

They also cause problems, unfortunately; e.g. there is no real flow
control for connections anymore in the non-DoS case.

> Regardless, I would say as long as ipv4 has syncookie support it will
> accurately be viewed as a deficiency of ipv6 if it lacks support. So
> perhaps the discussion should be whether all the other defenses are
> enough to warrant the removal of syncookie support from ipv4. That
> topic may bring in more opinions.

That is essentially what I and Alan were discussing.

> > Besides you should really move it to the ipv6 module, right now the
> > code would be always compiled in even for ipv4 only kernels.
>
> That is correct. I will gladly move it into its own section within
> net/ipv6/. Do you have any problem using the same CONFIG and sysctl
> variables as the ipv4 implementation?

No.

-Andi
Re: [PATCH] Add IPv6 support to TCP SYN cookies
On Tue, Feb 05, 2008 at 04:03:01PM +0000, Alan Cox wrote:
> > Also your sub PC class appliances rarely run LISTEN servers anyways
> > that are open to the world.
>
> Really. The ones that first come to mind often have exposed ports,
> including PDA devices and phones. (Ditto low end PC boxes - portscan an
> EEPC some day ;))

What kind of LISTEN ports? And does it matter if they're DoS'ed? The only
one I can think of right now would be ident, and frankly nobody will
really care whether that one works or not.

If it's just the management interface etc. (which should really be
firewalled), then likely not.

> Is the other stuff enough - good question, and can be measured easily
> enough on a little dlink router or similar.

My guess would be that it is. If it's not, it would probably be better to
look at improving the standard queue management again, e.g. re-add RED.

-Andi
Re: [PATCH] Add IPv6 support to TCP SYN cookies
On Tue, Feb 05, 2008 at 03:42:13PM +0000, Alan Cox wrote:
> > Syncookies are discouraged these days. They disable too many
> > valuable TCP features (window scaling, SACK) and even without them
> > the kernel is usually strong enough to defend against syn floods
> > and systems have much more memory than they used to.
>
> Somewhat untrue. Network speeds have risen dramatically, the number of

By "strong" I meant that Linux has much better algorithms to handle the
standard syn queue (syncookies were originally added when it had only
dumb head drop), and there are minisocks, which also require
significantly less overhead to manage than full sockets (less memory
etc.). When I wrote syncookies originally, none of that was true.

> appliances running Linux that are not PC class means memory has fallen
> not risen and CPU has been pretty level.
>
> > So I don't think it makes much sense to add more code to it, sorry.
>
> I think it makes a lot of sense

I have my doubts. It would probably be better to recheck everything and
then remove syncookies.

Also, your sub PC class appliances rarely run LISTEN servers that are
open to the world.

-Andi
Re: [PATCH] Add IPv6 support to TCP SYN cookies
On Mon, Feb 04, 2008 at 03:01:01PM -0800, Glenn Griffin wrote:
> Add IPv6 support to TCP SYN cookies. This is written and tested against
> 2.6.24, and applies cleanly to linus' current HEAD (d2fc0b). Unfortunately
> linus' HEAD breaks my sky2 card at the moment, so I'm unable to test
> against that. I see no reason why it would be affected though.
> Comments/suggestions are welcome.

Syncookies are discouraged these days. They disable too many valuable
TCP features (window scaling, SACK), and even without them the kernel is
usually strong enough to defend against syn floods, and systems have
much more memory than they used to.

So I don't think it makes much sense to add more code to it, sorry.

Besides, you should really move it to the ipv6 module; right now the
code would always be compiled in, even for ipv4-only kernels.

-Andi
Re: Slow OOM in netif_RX function
> Nothing that looks like a struct net_device. All the dumped leaked slab
> objects look the same until "45 20 05 d8" (the ascii 'E' on the 3rd
> line).

45 ... is often the start of an IP header (IPv4, 5*4 = 20 byte header
length).

You could dump them to a file (e.g. using a sial script) and then look at
them with tcpdump or similar to get an idea what kinds of packets they
are.

-Andi
Re: What were the reasons of having mandatory IP address at AX.25 interfaces ?
Matti Aarnio <[EMAIL PROTECTED]> writes:
> .. the original reason was apparently that _ifconfig_ blew up
> when it saw protocols that it didn't understand on network
> interfaces. Possibly when there was no IP protocol on an
> interface.

It's not only ifconfig; a lot of programs use SIOCGIFCONF to query ip
addresses.

-Andi
No pmtu probing on retransmits?
Hello,

While looking for something else in tcp_output.c I noticed that MTU
probing seems to be only done in tcp_write_xmit (when packets come
directly from process context), but not via the timer driven retransmit
path (tcp_retransmit_skb). Is that intentional? It looks quite weird. I
would normally assume PMTU blackholes are usually detected on retransmit
timeouts. Or do I miss something?

You seem to have assumed interrupt context at least, because
tcp_mtu_probe() uses GFP_ATOMIC, which is only needed for interrupts.
Currently it is only called in process context, I think.

-Andi
Re: [PATCH] Disable TSO for non standard qdiscs
On Sun, Feb 03, 2008 at 09:57:10AM +1100, Herbert Xu wrote:
> Andi Kleen <[EMAIL PROTECTED]> wrote:
> >> Then change TBF to use skb_gso_segment? Be careful, the fact that
> >
> > That doesn't help because it wants to interleave packets
> > from different streams to get everything fair and smooth. The only
> > good way to handle that is to split it up and the simplest way to do
> > this is to just tell TCP to not do GSO in the first place.
>
> Actually if we're going to do this I'd prefer you to call skb_gso_segment
> instead because that lets us at least bypass netfilter which is one of
> the key benefits of software GSO.

Ok. The only problem I see right now is that skb_segment() seems to
overuse GFP_ATOMIC, which would make it unreliable, but that is something
that can be fixed.

-Andi
Re: [PATCH] Disable TSO for non standard qdiscs
> Turning TSO off at netdev registration time with a warning would be
> cleaner IMO. Or alternatively introducing a kernel-config "I know what

You mean the qdisc should force TSO off on the underlying device?

> TSO is" option which is then used at netdev registration. From a
> usability perspective it would make more sense to just keep ethtool as
> the only way to configure TSO.

Ok, reasonable point.

-Andi
Re: [PATCH] Disable TSO for non standard qdiscs
On Fri, Feb 01, 2008 at 01:58:30PM -0800, Rick Jones wrote: > >>Does this also imply that JumboFrames interacts badly with these qdiscs? > >>Or IPoIB with its 65000ish byte MTU? > > > > > >Correct. Of course it is always relative to the link speed. So if your > >link is 10x faster and your packets 10x bigger you can get similarly > >smooth shaping. > > If the later-in-thread mentioned person shaping for their DSL line > happens to have enabled JumboFrames on their GbE network, will/should > the qdisc negate that? I don't think so, mostly because jumbo frames are not enabled by default. I'm only concerned about usable defaults there -- if you set non default options you should certainly know what you're doing. There are other reasons to not use jumbo frames anyways; e.g. a lot of cards still do not support SG for them but only process them as a single continuous buffer in memory so you often run into memory fragmentation problems. > Or is the qdisc currently assuming that the > remote end of the DSL will have asked for a smaller MSS? First there are lots of different qdiscs that all do different things. Take a look at net/sched/*. Then they usually don't strictly require particular MTUs (or know anything about MSS), but tend to work better with smaller MTUs because that allows more choices in packet scheduling. Generally the larger your packets the less they can be scheduled. -Andi -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Disable TSO for non standard qdiscs
> The TSO defer logic is based on your congestion window and current
> window size. So the actual frame sizes hitting your NIC attached to
> your DSL probably aren't anywhere near 64KB, but probably more in line
> with whatever your window size is for DSL.

DSL windows can be quite large, because a lot of DSL lines have quite a
long latency due to error correction. And with ADSL2 we have up to
16Mbit now.

> I think we're having more of a disagreement of what is considered the
> "normal case" user. If you are on a slow link, such as a DSL/cable
> line, your TCP window/congestion window aren't going to be big enough
> to generate large TSOs, so what is the issue? But disabling TSO, say
> on a

64k TSOs are likely even with DSL. Anyway, even with smaller TSOs the
change still makes sense, because every size increase makes packet
scheduling less smooth.

-Andi
Re: Slow OOM in netif_RX function
On Fri, Feb 01, 2008 at 02:51:40PM +0200, Ivan Dichev wrote:
> Arnaldo Carvalho de Melo wrote:
> > Em Fri, Jan 25, 2008 at 02:21:08PM +0100, Andi Kleen escreveu:
> >
> >> "Ivan H. Dichev" <[EMAIL PROTECTED]> writes:
> >>
> >>> What could happen if I put different Lan card in every slot?
> >>> In ex. to-private -> 3com
> >>> to-inet -> VIA
> >>> to-dmz -> rtl8139
> >>> And then to look which RX function is consuming the memory.
> >>> (boomerang_rx, rtl8139_rx, ... etc)
> >>
> >> The problem is unlikely to be in the driver (these are both
> >> well tested ones) but more likely your complicated iptables setup somehow
> >> triggers a skb leak.
> >>
> >> There are unfortunately no shrink wrapped debug mechanisms in the kernel
> >> for leaks like this (ok you could enable CONFIG_NETFILTER_DEBUG
> >> and see if it prints something interesting, but that's a long shot).
> >>
> >> If you wanted to write a custom debugging patch I would do something like
> >> this:
> >>
> >> - Add two new integer fields to struct sk_buff: a time stamp and a integer
> >>   field
> >> - Fill the time stamp with jiffies in alloc_skb and clear the integer field
> >> - In __kfree_skb clear the time stamp
> >> - For all the ipt target modules in net/ipv4/netfilter/*.c you use change
> >>   their ->target functions to put an unique value into the integer field
> >>   you added.
> >> - Do the same for the pkt_to_tuple functions for all conntrack modules
> >>
> >> Then when you observe the leak take a crash dump using kdump on the router
> >> and then use crash to dump all the slab objects for the sk_head_cache.
> >> Then look for any that have an old time stamp and check what value they
> >> have in the integer field. Then the netfilter function who set that unique
> >> value likely triggered the leak somehow.
> >
> > I wrote some systemtap scripts that do parts of what you suggest, and at
> > least for the timestamp there was no need to add a new field to struct
> > sk_buff, I just reuse skb->timestamp, as it is only used when we use a
> > packet sniffer. Here it is for reference, but it needs some tapsets I
> > wrote, so I'll publish this git repo in git.kernel.org, perhaps it can
> > be useful in this case as a starting point. Find another unused field
> > (hint: I know that at least 4 bytes on 64 bits is present as a hole) and
> > you're done, no need to rebuild the kernel :)
> >
> > http://git.kernel.org/?p=linux/kernel/git/acme/nettaps.git
> >
> > - Arnaldo
>
> Thanks to everyone for the given ideas.
> I am not kernel guru so writing patch is difficult. This is a production
> server and it is quite difficult to debug (only at night)
> I removed some iptables exotics - recent, ulog, string - but no effect.
> Since we can reach OOM most of the memory is going to be filled with the
> leak, and we are thinking to try to dump and analyze it.

You could perhaps use crash to look for leaked packets and then see if
you can see a pattern, as in what types of packets they are.

Still I expect it will be difficult to diagnose this without modifying
the kernel to add some more netfilter tracing.

I suppose it would also be possible to write a suitable systemtap
script to trace this without modifying the kernel, although that will
probably not be easy and will be more complicated than just changing
the C code.

-Andi
Re: [PATCH] Disable TSO for non standard qdiscs
> The problem with ethtool is that it's a non-obvious nerd knob. At
> the least the ethtool documentation should be updated to indicate that
> activating TSO effects tc accuracy.

TSO tends to be activated by default in the driver; very few people who
use it even know that ethtool exists or what TSO is.

-Andi
Re: [PATCH] Disable TSO for non standard qdiscs
On Thu, Jan 31, 2008 at 09:33:44PM +0100, Jarek Poplawski wrote:
> Andi Kleen wrote, On 01/31/2008 08:34 PM:
>
> >> TSO by nature is bursty. But disabling TSO without the option of having
> >> it on or off to me seems to aggressive. If someone is using a qdisc
> >> that TSO is interfering with the effectiveness of the traffic shaping,
> >> then they should turn off TSO via ethtool on the target device. Some
> >
> > The philosophical problem I have with this suggestion is that I expect
> > that the large majority of users will be more happy with disabled TSO
> > if they use non standard qdiscs and defaults that do not fit
> > the majority use case are bad.
>
> If you mean the large majority of the large minority of users, who use
> non standard qdiscs - I agree - this is really the philosophical problem!

[...]

Sorry, if you had a point in this email I missed it.

-Andi
Re: [PATCH] Disable TSO for non standard qdiscs
> Well, it could be just that when using such qdiscs TSO would be
> disabled, but the user could override this by using ethtool after
> loading the qdiscs.

If anything tc, not ethtool.

Do you have a useful scenario where GSO makes sense with TBF et al.?

-Andi
Re: [PATCH] Disable TSO for non standard qdiscs
On Thu, Jan 31, 2008 at 03:42:54PM -0800, Waskiewicz Jr, Peter P wrote:
> > Well, it could be just that when using such qdiscs TSO would be
> > disabled, but the user could override this by using ethtool after
> > loading the qdiscs.
>
> I still disagree with this. The qdisc should not cause anything to happen to
> feature flags on the device. It's the scheduling layer and really shouldn't
> care about what features the device supports or not. If someone has an issue
> with a feature hurting performance or causing odd behavior when using a
> qdisc, then they should disable the feature on the device using the
> appropriate tools provided. If it's the qdisc causing issues, then either
> the qdisc needs to be fixed, or it should be documented what features are
> recommended to be on and off with the qdisc. I don't agree that the
> scheduling layer should affect features on an underlying device.

You seem to only look at this from a high level theoretical standpoint.
But more down to earth: do you have a useful scenario where it makes
sense to do shaping or another qdisc on GSO packets?

My take is that when you decide to do any packet scheduling you really
want to do it on wire packets, not on units that are an internal stack
implementation detail.

-Andi
Re: [PATCH] Disable TSO for non standard qdiscs
On Thu, Jan 31, 2008 at 11:14:34AM -0800, Rick Jones wrote:
> Sounds like the functionality needs to be in the DSL bridge :) (or the
> "router" in the same case) Particularly since it might be getting used
> by more than one host on the GbE switch.

Possible, but it usually is not in the real world. Setups like
WonderShaper, which do this on the host side, are pretty common.

> then the qdisc could/should place a cap on the size of a 'TSO' based on
> the bitrate (and perhaps input as to how much time any one "burst" of
> data should be allowed to consume on the network) and pass that up the
> stack? right now you seem to be proposing what is effectively a cap of
> 1 MSS.

Hmm, that would probably be possible for TBF, but I'm not sure this can
really be done in a useful way for the more complicated qdiscs.
Especially since they would likely need to turn GSO on/off regularly as
dynamic circumstances change, and there is not really a good way to
affect a socket after it was created.

-Andi
Re: [PATCH] Disable TSO for non standard qdiscs
> TSO by nature is bursty. But disabling TSO without the option of having
> it on or off to me seems to aggressive. If someone is using a qdisc
> that TSO is interfering with the effectiveness of the traffic shaping,
> then they should turn off TSO via ethtool on the target device. Some

The philosophical problem I have with this suggestion is that I expect
the large majority of users will be happier with disabled TSO if they
use non standard qdiscs, and defaults that do not fit the majority use
case are bad.

Basically you're suggesting that nearly everyone using tc should learn
about another obscure command.

-Andi
Re: hard hang through qdisc
On Thu, Jan 31, 2008 at 07:46:02PM +0100, Patrick McHardy wrote:
> Patrick McHardy wrote:
>> Andi Kleen wrote:
>>
>>> Can you please try with the above config?
>>
>> I'll give it a try later.
>
> I took all options from that config that seemed possibly
> related (qdiscs, no hrtimers, no nohz, slab, ...), but
> still can't reproduce it.

Ok, I'll do a bisect then later (likely not today anymore).

> Does it also crash if you use more reasonable parameters?

I managed to make it crash with different parameters too, but with good
parameters it did set a qdisc successfully and appeared to work.

-Andi
Re: [PATCH] Disable TSO for non standard qdiscs
> So, at what timescale do people using these qdiscs expect things to
> appear "smooth?" 64KB of data at GbE speeds is something just north of
> half a millisecond unless I've botched my units somewhere.

One typical use case for TBF is talking to a DSL bridge that is
connected over a GBit Ethernet switch. For these DSL connections it
gives much better behaviour to shape the traffic to slightly below the
external link speed, so that you can e.g. prioritize packets properly.
But the actual external link speed is much lower than GbE.

A lot of GbE NICs enable TSO by default.

-Andi
Re: [PATCH] Disable TSO for non standard qdiscs
On Thu, Jan 31, 2008 at 10:26:19AM -0800, Rick Jones wrote:
> Andi Kleen wrote:
> > TSO interacts badly with many queueing disciplines because they rely on
> > reordering packets from different streams and the large TSO packets can
> > make this difficult. This patch disables TSO for sockets that send over
> > devices with non standard queueing disciplines. That's anything but noop
> > or pfifo_fast and pfifo right now.
>
> Does this also imply that JumboFrames interacts badly with these qdiscs?
> Or IPoIB with its 65000ish byte MTU?

Correct. Of course it is always relative to the link speed. So if your
link is 10x faster and your packets 10x bigger you can get similarly
smooth shaping.

-Andi
Re: [PATCH] Disable TSO for non standard qdiscs
On Thu, Jan 31, 2008 at 07:21:20PM +0100, Patrick McHardy wrote:
> Andi Kleen wrote:
> >> Then change TBF to use skb_gso_segment? Be careful, the fact that
> >
> > That doesn't help because it wants to interleave packets
> > from different streams to get everything fair and smooth. The only
> > good way to handle that is to split it up and the simplest way to do
> > this is to just tell TCP to not do GSO in the first place.
>
> Thats not correct, TBF keeps packets strictly ordered unless

My point was that without TSO different submitters will interleave
their streams (because they compete for qdisc submission), and then you
end up with a smooth rate over time for all of them. If you submit only
in large chunks (as TSO does) it will always be more bursty, and that
works against the TBF goal.

For a single submitter you would be correct.

-Andi
Re: [PATCH] Disable TSO for non standard qdiscs
> Then change TBF to use skb_gso_segment? Be careful, the fact that

That doesn't help because it wants to interleave packets from different
streams to get everything fair and smooth. The only good way to handle
that is to split it up and the simplest way to do this is to just tell
TCP to not do GSO in the first place.

-Andi
Re: [PATCH] Disable TSO for non standard qdiscs
On Thu, Jan 31, 2008 at 07:01:00PM +0100, Patrick McHardy wrote:
> Andi Kleen wrote:
> >> Fix the broken qdisc instead.
> >
> > What do you mean? I don't think the qdiscs are broken.
> > I cannot think of any way how e.g. TBF can do anything useful
> > with large TSO packets.
>
> Someone posted a patch some time ago to calculate the amount
> of tokens needed in max_size portions and use that, but IMO
> people should just configure TBF with the proper MTU for TSO.

TBF with 64k atomic units will always be chunky and uneven. I don't
think that's a useful goal.

-Andi
Re: [PATCH] Disable TSO for non standard qdiscs
> Fix the broken qdisc instead.

What do you mean? I don't think the qdiscs are broken. I cannot think
of any way how e.g. TBF can do anything useful with large TSO packets.

-Andi
Re: [PATCH] [1/1] Deprecate tcp_tw_{reuse,recycle}
On Thu, Jan 31, 2008 at 08:41:38AM -0800, Ben Greear wrote:
> I don't know exactly how the tcp_tw_recycle works, but it seems like it
> could be made to only
> take affect when all local ports are used up in TIME_WAIT.

TIME-WAIT does not actually use up local ports; it uses up remote
ports, because it is done on the LISTEN socket, which always has a
fixed local port. And it has no idea how many ports the other end has
left.

-Andi
Re: hard hang through qdisc
> Works for me:
>
> qdisc tbf 8001: root rate 1000bit burst 10b/8 mpu 0b lat 720.0ms
> Sent 0 bytes 0 pkt (dropped 9, overlimits 0 requeues 0)
> rate 0bit 0pps backlog 0b 0p requeues 0
>
> Packets are dropped as expected.

I can still reproduce it on 64bit with http://halobates.de/config-qdisc
(all qdiscs etc. compiled in for testing) with latest git tip
(8af03e782cae1e0a0f530ddd22301cdd12cf9dc0).

The command line above causes an instant hang. Also tried it with newer
iproute2 (the original one was quite old), but it didn't make a
difference.

Perhaps it's related to what qdiscs are enabled? Can you please try
with the above config?

If everything fails I can do a bisect later.

-Andi
Re: hard hang through qdisc
> -
> lilsol:~# tc qdisc add dev eth0 root tbf rate 1000 burst 10 limit 100
> lilsol:~# uname -a
> Linux lilsol 2.6.24 #1 PREEMPT Sun Jan 27 09:22:00 EST 2008 i686

Can you try it again with current git mainline?

> GNU/Linux
> lilsol:~# tc qdisc ls dev eth0
> qdisc tbf 8001: root rate 1000bit burst 10b lat 737.3ms
> lilsol:~#
> ---
>
> What do your patches do?

Nothing really related to qdiscs. I suspect it came from the git
mainline patch I had (but forgot to mention in the first email).

-Andi
[PATCH] Disable TSO for non standard qdiscs
TSO interacts badly with many queueing disciplines because they rely on
reordering packets from different streams and the large TSO packets can
make this difficult. This patch disables TSO for sockets that send over
devices with non standard queueing disciplines. That's anything but noop
or pfifo_fast and pfifo right now. Longer term other queueing
disciplines could be checked if they are also ok with TSO. If yes they
can set the TCQ_F_GSO_OK flag too.

It is still enabled for the standard pfifo_fast because that will never
reorder packets with the same type-of-service. This means 99+% of all
users will still be able to use TSO just fine.

The status is only set up at socket creation, so a shifted route will
not reenable TSO on an existing socket. I don't think that's a problem
though.

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 include/net/sch_generic.h |    1 +
 net/core/sock.c           |    3 +++
 net/sched/sch_generic.c   |    5 +++--
 3 files changed, 7 insertions(+), 2 deletions(-)

Index: linux/include/net/sch_generic.h
===================================================================
--- linux.orig/include/net/sch_generic.h
+++ linux/include/net/sch_generic.h
@@ -31,6 +31,7 @@ struct Qdisc
 #define TCQ_F_BUILTIN		1
 #define TCQ_F_THROTTLED		2
 #define TCQ_F_INGRESS		4
+#define TCQ_F_GSO_OK		8
 	int			padded;
 	struct Qdisc_ops	*ops;
 	u32			handle;

Index: linux/net/sched/sch_generic.c
===================================================================
--- linux.orig/net/sched/sch_generic.c
+++ linux/net/sched/sch_generic.c
@@ -307,7 +307,7 @@ struct Qdisc_ops noop_qdisc_ops __read_m
 struct Qdisc noop_qdisc = {
 	.enqueue	= noop_enqueue,
 	.dequeue	= noop_dequeue,
-	.flags		= TCQ_F_BUILTIN,
+	.flags		= TCQ_F_BUILTIN | TCQ_F_GSO_OK,
 	.ops		= &noop_qdisc_ops,
 	.list		= LIST_HEAD_INIT(noop_qdisc.list),
 };
@@ -325,7 +325,7 @@ static struct Qdisc_ops noqueue_qdisc_op
 static struct Qdisc noqueue_qdisc = {
 	.enqueue	= NULL,
 	.dequeue	= noop_dequeue,
-	.flags		= TCQ_F_BUILTIN,
+	.flags		= TCQ_F_BUILTIN | TCQ_F_GSO_OK,
 	.ops		= &noqueue_qdisc_ops,
 	.list		= LIST_HEAD_INIT(noqueue_qdisc.list),
 };
@@ -538,6 +538,7 @@ void dev_activate(struct net_device *dev
 			printk(KERN_INFO "%s: activation failed\n", dev->name);
 			return;
 		}
+		qdisc->flags |= TCQ_F_GSO_OK;
 		list_add_tail(&qdisc->list, &dev->qdisc_list);
 	} else {
 		qdisc = &noqueue_qdisc;

Index: linux/net/core/sock.c
===================================================================
--- linux.orig/net/core/sock.c
+++ linux/net/core/sock.c
@@ -112,6 +112,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -1062,6 +1063,8 @@ void sk_setup_caps(struct sock *sk, stru
 {
 	__sk_dst_set(sk, dst);
 	sk->sk_route_caps = dst->dev->features;
+	if (!(dst->dev->qdisc->flags & TCQ_F_GSO_OK))
+		sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
 	if (sk->sk_route_caps & NETIF_F_GSO)
 		sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;
 	if (sk_can_gso(sk)) {

Index: linux/net/sched/sch_fifo.c
===================================================================
--- linux.orig/net/sched/sch_fifo.c
+++ linux/net/sched/sch_fifo.c
@@ -62,6 +62,7 @@ static int fifo_init(struct Qdisc *sch,
 		q->limit = ctl->limit;
 	}
 
+	sch->flags |= TCQ_F_GSO_OK;
 	return 0;
 }
Re: hard hang through qdisc II
On Thursday 31 January 2008 13:21:00 Andi Kleen wrote:
>
> I just managed to hang a 2.6.24 (+ some non network patches) kernel
> with the following (non sensical) command

Correction: the kernel was actually a git Linus kernel with David's
recent merge included.

I found it's pretty easy to hang the kernel with various tbf
parameters.

-Andi

> tc qdisc add dev eth0 root tbf rate 1000 burst 10 limit 100
>
> No oops or anything just hangs. While I understand root can
> do bad things just hanging like this seems a little extreme.
>
> -Andi
hard hang through qdisc
I just managed to hang a 2.6.24 (+ some non network patches) kernel
with the following (non sensical) command:

tc qdisc add dev eth0 root tbf rate 1000 burst 10 limit 100

No oops or anything, it just hangs. While I understand root can do bad
things, just hanging like this seems a little extreme.

-Andi
Re: e1000 full-duplex TCP performance well below wire speed
Bruce Allen <[EMAIL PROTECTED]> writes:
>
> Important note: we ARE able to get full duplex wire speed (over 900
> Mb/s simulaneously in both directions) using UDP. The problems occur
> only with TCP connections.

Another issue with full duplex TCP not mentioned yet is that if TSO is
used the output will be somewhat bursty and might cause problems with
the TCP ACK clock of the other direction, because the ACKs would need
to squeeze in between full TSO bursts.

You could try disabling TSO with ethtool.

-Andi
Re: [PATCH] [1/1] Deprecate tcp_tw_{reuse,recycle}
> I believe the problem was that all of my ports were used up with
> TIME_WAIT sockets and so it couldn't create more. My test
> case was similar to this:

Ah, that's simple to solve then: use more IP addresses and bind to them
round-robin in your user program.

Arguably the Linux TCP code should be able to do this by itself when
enough IP addresses are available, but it's not very hard to do in user
space using bind(2).

BTW it's also a very unusual case -- in most cases there are more
remote IP addresses.

> So, is there a better way to max out the connections per second without
> having to use tcp_tw_recycle?

Well, did you profile where the bottlenecks were? Perhaps also just
increase the memory allowed for TCP sockets.

-Andi
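A minimal userspace sketch of the bind-before-connect round-robin idea
(the address list, counter, and connect_rr() name are illustrative
assumptions, not from the thread); each local IP gets its own ephemeral
port space, so TIME_WAIT entries are spread over all of them:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static const char *local_ips[] = { "192.0.2.10", "192.0.2.11" };
static unsigned next_ip;	/* round-robin counter */

int connect_rr(const struct sockaddr_in *dst)
{
	struct sockaddr_in src;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	memset(&src, 0, sizeof(src));
	src.sin_family = AF_INET;
	src.sin_port = 0;	/* kernel picks an ephemeral port */
	inet_pton(AF_INET,
		  local_ips[next_ip++ % (sizeof(local_ips) / sizeof(local_ips[0]))],
		  &src.sin_addr);
	if (bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0 ||
	    connect(fd, (const struct sockaddr *)dst, sizeof(*dst)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}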
Re: [PATCH] [1/1] Deprecate tcp_tw_{reuse,recycle}
On Wednesday 30 January 2008 20:22, Ben Greear wrote:
> We use these features to enable creating very high numbers of short-lived
> TCP connections, primarily used as a test tool for other network
> devices.

Hopefully these other network devices don't do any NAT then, or don't
otherwise violate the IP-matches-PAWS assumption. Most likely they do,
actually, so enabling TW recycle for testing is probably not even safe
for you.

Modern systems have a lot of RAM, so even without tw recycle you should
be able to get a very high number of connections. A timewait socket is
around 128 bytes on 64bit; this means with a GB of memory you can
already support > 8 million TW sockets. On 32bit it's even more. The
optimization was originally written at a time when 64MB systems were
common.

If you don't care about data integrity, have you considered just using
some custom UDP based protocol, or running one of the user space TCP
stacks with all data integrity features disabled? If you do care about
data integrity then you should probably disable tw recycle anyway.

The deprecation period will be some time (several months), so you'll
have enough time to migrate to another method.

> Perhaps just document the adverse affects and/or have it print out a
> warning on the console whenever the feature is enabled?

"This feature is insecure and does not work on the internet or with
NAT"? Somehow this just does not seem right to me.

-Andi
Re: [PATCH] [1/2] Skip empty hash buckets faster in /proc/net/tcp
On Wed, Jan 30, 2008 at 09:03:16AM -0800, Roland Dreier wrote:
> > On a 2GB Core2 system here I see a time cat /proc/net/tcp > /dev/null
> > constently dropping from 0.44s to 0.4-0.8s system time with this change.
>
> Seems like there must be a typo in either the before or after times
> you report here?

Yes, thanks, it was 0.44s down to 0.04s/0.08s here. Somehow the zeroes
got lost.

-Andi
[PATCH] [1/1] Deprecate tcp_tw_{reuse,recycle}
We've recently had a long discussion about the CVE-2005-0356 time stamp
denial-of-service attack. It turned out that Linux is only vulnerable
to this problem when tcp_tw_recycle is enabled (which it is not by
default).

In general these two options are not really usable in today's internet
because they make the (often false) assumption that a single IP address
has a single TCP time stamp / PAWS clock. This assumption breaks
NAT/masquerading and also opens Linux to denial of service attacks (see
the CVE description).

Due to these numerous problems I propose to remove this code for 2.6.26.

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

Index: linux/Documentation/feature-removal-schedule.txt
===================================================================
--- linux.orig/Documentation/feature-removal-schedule.txt
+++ linux/Documentation/feature-removal-schedule.txt
@@ -354,3 +354,15 @@ Why: The support code for the old firmwa
      and slightly hurts runtime performance. Bugfixes for the old firmware
      are not provided by Broadcom anymore.
 Who: Michael Buesch <[EMAIL PROTECTED]>
+
+---------------------------
+
+What:	Support for /proc/sys/net/ipv4/tcp_tw_{reuse,recycle} = 1
+When:	2.6.26
+Why:	Enabling either of those makes Linux TCP incompatible with masquerading and
+	also opens Linux to the CVE-2005-0356 denial of service attack. And these
+	optimizations are explicitly disallowed by some benchmarks. They also have
+	been disabled by default for more than ten years so they're unlikely to be used
+	much. Due to these fatal flaws it doesn't make sense to keep the code.
+Who:	Andi Kleen <[EMAIL PROTECTED]>
Re: [PATCH] [2/2] Remove some unnecessary gotos in established_get_first()
Oliver Neukum <[EMAIL PROTECTED]> writes:
> Am Mittwoch, 30. Januar 2008 09:01:10 schrieb Andi Kleen:
>>
>> gcc does not generate different code for return foo vs bar = foo; goto x;
>> x: return bar; So convert it all to direct returns for better readability.
>
> Now suppose somebody needs to change locking. He'll have to convert
> it back.

Please take a look at the overall /proc/net/tcp logic. Any locking
change will be a major change to the code flow of the whole family of
functions.

-Andi
[PATCH] [1/2] Skip empty hash buckets faster in /proc/net/tcp
On most systems most of the TCP established/time-wait hash buckets are
empty. When walking the hash table for /proc/net/tcp their read locks
would always be acquired just to find out they're empty. This patch
changes the code to first check whether the buckets have any entries
before taking the lock, which is much cheaper than taking a lock.

Since the hash tables are large this makes a measurable difference when
processing /proc/net/tcp, especially on architectures with a slow
read_lock (e.g. PPC).

On a 2GB Core2 system here I see a time cat /proc/net/tcp > /dev/null
constently dropping from 0.44s to 0.4-0.8s system time with this change.
This is with mostly empty hash tables.

On systems with slower atomics (like P4 or POWER4) or larger hash
tables (more RAM) the difference is much higher.

This can be noticeable because there are some daemons around that
regularly scan /proc/net/tcp.

Original idea for this patch from Marcus Meissner, but redone by me.

Cc: [EMAIL PROTECTED]
Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 net/ipv4/tcp_ipv4.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

Index: linux/net/ipv4/tcp_ipv4.c
===================================================================
--- linux.orig/net/ipv4/tcp_ipv4.c
+++ linux/net/ipv4/tcp_ipv4.c
@@ -2039,6 +2039,12 @@ static void *listening_get_idx(struct se
 	return rc;
 }
 
+static inline int empty_bucket(struct tcp_iter_state *st)
+{
+	return hlist_empty(&tcp_hashinfo.ehash[st->bucket].chain) &&
+		hlist_empty(&tcp_hashinfo.ehash[st->bucket].twchain);
+}
+
 static void *established_get_first(struct seq_file *seq)
 {
 	struct tcp_iter_state* st = seq->private;
@@ -2050,6 +2056,10 @@ static void *established_get_first(struc
 		struct inet_timewait_sock *tw;
 		rwlock_t *lock = inet_ehash_lockp(&tcp_hashinfo, st->bucket);
 
+		/* Lockless fast path for the common case of empty buckets */
+		if (empty_bucket(st))
+			continue;
+
 		read_lock_bh(lock);
 		sk_for_each(sk, node, &tcp_hashinfo.ehash[st->bucket].chain) {
 			if (sk->sk_family != st->family) {
@@ -2097,13 +2107,15 @@ get_tw:
 		read_unlock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket));
 		st->state = TCP_SEQ_STATE_ESTABLISHED;
 
-		if (++st->bucket < tcp_hashinfo.ehash_size) {
-			read_lock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket));
-			sk = sk_head(&tcp_hashinfo.ehash[st->bucket].chain);
-		} else {
-			cur = NULL;
-			goto out;
-		}
+		/* Look for next non empty bucket */
+		while (++st->bucket < tcp_hashinfo.ehash_size &&
+				empty_bucket(st))
+			;
+		if (st->bucket >= tcp_hashinfo.ehash_size)
+			return NULL;
+
+		read_lock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket));
+		sk = sk_head(&tcp_hashinfo.ehash[st->bucket].chain);
 	} else
 		sk = sk_next(sk);
[PATCH] [2/2] Remove some unnecessary gotos in established_get_first()
gcc does not generate different code for "return foo" vs "bar = foo;
goto x; x: return bar;". So convert it all to direct returns for better
readability.

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

Index: linux/net/ipv4/tcp_ipv4.c
===================================================================
--- linux.orig/net/ipv4/tcp_ipv4.c
+++ linux/net/ipv4/tcp_ipv4.c
@@ -2065,8 +2065,7 @@ static void *established_get_first(struc
 			if (sk->sk_family != st->family) {
 				continue;
 			}
-			rc = sk;
-			goto out;
+			return sk;
 		}
 		st->state = TCP_SEQ_STATE_TIME_WAIT;
 		inet_twsk_for_each(tw, node,
@@ -2074,13 +2073,11 @@ static void *established_get_first(struc
 			if (tw->tw_family != st->family) {
 				continue;
 			}
-			rc = tw;
-			goto out;
+			return tw;
 		}
 		read_unlock_bh(lock);
 		st->state = TCP_SEQ_STATE_ESTABLISHED;
 	}
-out:
 	return rc;
 }
 
@@ -2100,10 +2097,8 @@ get_tw:
 		while (tw && tw->tw_family != st->family) {
 			tw = tw_next(tw);
 		}
-		if (tw) {
-			cur = tw;
-			goto out;
-		}
+		if (tw)
+			return tw;
 		read_unlock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket));
 		st->state = TCP_SEQ_STATE_ESTABLISHED;
@@ -2121,16 +2116,12 @@ get_tw:
 
 		sk_for_each_from(sk, node) {
 			if (sk->sk_family == st->family)
-				goto found;
+				return sk;
 		}
 		st->state = TCP_SEQ_STATE_TIME_WAIT;
 		tw = tw_head(&tcp_hashinfo.ehash[st->bucket].twchain);
 		goto get_tw;
-found:
-	cur = sk;
-out:
-	return cur;
 }
Re: Slow OOM in netif_RX function
"Ivan H. Dichev" <[EMAIL PROTECTED]> writes: > > What could happen if I put different Lan card in every slot? > In ex. to-private -> 3com > to-inet-> VIA > to-dmz -> rtl8139 > And then to look which RX function is consuming the memory. > (boomerang_rx, rtl8139_rx, ... etc) The problem is unlikely to be in the driver (these are both well tested ones) but more likely your complicated iptables setup somehow triggers a skb leak. There are unfortunately no shrink wrapped debug mechanisms in the kernel for leaks like this (ok you could enable CONFIG_NETFILTER_DEBUG and see if it prints something interesting, but that's a long shot). If you wanted to write a custom debugging patch I would do something like this: - Add two new integer fields to struct sk_buff: a time stamp and a integer field - Fill the time stamp with jiffies in alloc_skb and clear the integer field - In __kfree_skb clear the time stamp - For all the ipt target modules in net/ipv4/netfilter/*.c you use change their ->target functions to put an unique value into the integer field you added. - Do the same for the pkt_to_tuple functions for all conntrack modules Then when you observe the leak take a crash dump using kdump on the router and then use crash to dump all the slab objects for the sk_head_cache. Then look for any that have an old time stamp and check what value they have in the integer field. Then the netfilter function who set that unique value likely triggered the leak somehow. -Andi -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2.6.25 1/1]S2io: Multiqueue network device support implementation
> [Ram] I am assuming that this is with regards to msi-x interrupts. We

Yes. And avoiding bouncing locks for device state between CPUs.

> have done away with handling tx completion in the interrupt handler, and
> are instead handling them in the context of the transmit. The slow path,
> straggling transmit completions will be handled in the timer context.

Ok -- hopefully you don't get bad corner cases from this when the pipe
is not fully filled, causing longer latencies on completion. Old NAPI
sometimes suffered from such problems.

-Andi
Re: [PATCH 2.6.25 1/1]S2io: Multiqueue network device support implementation
Sreenivasa Honnur <[EMAIL PROTECTED]> writes:

> Multiqueue netwrok device support implementation.
> - Added a loadable parameter "multiq" to enable/disable multiqueue support,
> by default it is disabled.
> - skb->queue_mapping is not used for queue/fifo selection. FIFO iselection is
> based on IP-TOS value, 0x0-0xF TOS values are mapped to 8 FIFOs.

The standard way to use that would be using skb->priority. But I'm
surprised you bother with TOS for multi queues at all. TOS isn't a too
important mechanism.

I would have thought the primary use case would be per CPU TX
completion interrupts. With that the queue should be selected based on
the current CPU.

-Andi
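A minimal sketch of what CPU-based selection could look like
(cpu_select_queue is a hypothetical helper, not part of the s2io patch,
and the exact hook and field names vary between kernel versions):

/* Pick the TX queue from the submitting CPU, so that submission and
 * completion (with one MSI-X vector per queue) stay on the same CPU
 * and the ring state does not bounce between caches. */
static u16 cpu_select_queue(struct net_device *dev, struct sk_buff *skb)
{
	return (u16)(smp_processor_id() % dev->real_num_tx_queues);
}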
Re: 2.6.24 regression: reference count leak in PPPoE
> It seems to have stopped when i stopped using ipsec and started using
> vpnc.

Kernel ipsec was active, yes. However I normally don't see it although
it is often active; that was the first time I think (except long ago).

> (but no firm info - this was sporadic - happened every few months
> or so) Are you using ipsec and dynamic IPs by any chance?

and the ISP uses dynamic IPs.

-Andi
2.6.24 regression: reference count leak in PPPoE
My workstation running 2.6.24-rc8 just hung during shutdown with an
endless (or rather I didn't wait more than a few minutes) loop of

unregister_netdev: waiting for ppp-device to become free. Usage count = 1

ppp-device was an active PPPoE device.

No more information currently.

-Andi
Re: [PATCH 1/8] [TCP]: Uninline tcp_set_state
"Ilpo Järvinen" <[EMAIL PROTECTED]> writes: > > Besides, it not always that obvious that gcc is able to determine "the > constant state", considering e.g., the complexity in the cases with > tcp_rcv_synsent_state_process, tcp_close_state, tcp_done. In such cases > uninlining should be done and gcc is probably not able to mix both cases > nicely for a single function? I think it would be cleanest to completely unswitch the function and split into tcp_set_closed() / tcp_set_established() / tcp_other_state() called by the callers directly. That would probably lose the state trace, but I never found that useful for anything. -Andi -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] synchronize_rcu(): high latency on idle system
> I think it should be in netdev_unregister_kobject(). But that would
> only get rid of one of the two calls to synchronize_rcu() in the
> unregister_netdev.

Would already be an improvement.

> The other synchronize_rcu() is for qdisc's and not sure if that one can
> be removed?

The standard way to remove such calls is to set a "deleted" flag in the
object, then check and ignore such objects in the reader, and finally
remove the object with call_rcu. I have not checked if that is really
feasible for qdiscs.

-Andi
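In generic form the pattern looks roughly like this (struct item and
the helpers are invented names; this is a sketch of the idiom, not
qdisc code):

struct item {
	struct list_head	list;
	int			dead;	/* set under the writer lock */
	struct rcu_head		rcu;
};

/* Reader, under rcu_read_lock(): skip objects already marked deleted */
static struct item *item_lookup(struct list_head *head)
{
	struct item *it;

	list_for_each_entry_rcu(it, head, list)
		if (!it->dead)
			return it;
	return NULL;
}

static void item_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct item, rcu));
}

/* Writer, under the writer lock: no synchronize_rcu() needed */
static void item_del(struct item *it)
{
	it->dead = 1;
	list_del_rcu(&it->list);
	call_rcu(&it->rcu, item_free_rcu);
}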
Re: [PATCH 001/001] ipv4: enable use of 240/4 address space
Vince Fuller <[EMAIL PROTECTED]> writes:
> from Vince Fuller <[EMAIL PROTECTED]>
>
> This set of diffs modify the 2.6.20 kernel to enable use of the 240/4
> (aka "class-E") address space as consistent with the Internet Draft
> draft-fuller-240space-00.txt.

Wouldn't it be wise to at least wait for it to become an RFC first?

-Andi
Re: SACK scoreboard
> Postponing freeing of the skb has major drawbacks. Some time ago I

Yes, the trick would be to make sure that it also does not tie up too
much memory; e.g. it would need some throttling at least.

Also the fast path of kmem_cache_free() is actually not that much
different from just putting something on a list, so perhaps it would
not make that much difference.

-Andi
Re: SACK scoreboard
> It adds severe spikes in CPU utilization that at even moderate
> line rates begins to affect RTTs.
>
> Or do you think it's OK to process 500,000 SKBs while locked
> in a software interrupt.

You can always push it into a work queue. Even push it to other cores
if you want. In fact this is already done partly for the
->completion_queue. It wouldn't be a big change to queue it another
level down.

Also, even freeing a lot of objects doesn't have to be that expensive.
I suspect most of the cost is in taking the slab locks, but that could
be batched. Without that the kmem_free fast path isn't particularly
expensive, as long as the headers are still in cache.

Long ago I had sk_buff optimized for the routing case so that freeing
could all be done with a single cache line. That is long gone, but
could probably be restored.

But asking for protocol changes just to work around such an internal
implementation detail is ...

> Perhaps you have another broken awk script to prove this :-)

Your hand waved numbers on inline sizes there were definitely worse
than mine.

-Andi
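A toy sketch of pushing such a burst of frees into a work queue (struct
deferred_free and its helpers are invented for illustration; this
glosses over the real ->completion_queue mechanics and the throttling
mentioned elsewhere in the thread):

struct deferred_free {
	struct sk_buff_head	queue;	/* skbs waiting to be freed */
	struct work_struct	work;
};

static void deferred_free_work(struct work_struct *work)
{
	struct deferred_free *df = container_of(work, struct deferred_free, work);
	struct sk_buff *skb;

	/* Runs in process context, possibly on another CPU, so the
	 * softirq that completed the window is not held up. */
	while ((skb = skb_dequeue(&df->queue)) != NULL)
		kfree_skb(skb);
}

static void deferred_free_init(struct deferred_free *df)
{
	skb_queue_head_init(&df->queue);
	INIT_WORK(&df->work, deferred_free_work);
}

/* Called instead of kfree_skb(skb) in the hot path */
static void defer_free(struct deferred_free *df, struct sk_buff *skb)
{
	skb_queue_tail(&df->queue, skb);
	schedule_work(&df->work);
}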
Re: SACK scoreboard
David Miller <[EMAIL PROTECTED]> writes:
>
> The big problem is that recovery from even a single packet loss in a
> window makes us run kfree_skb() for a all the packets in a full
> window's worth of data when recovery completes.

Why exactly is it a problem to free them all at once? Are you worried
about kernel preemption latencies?

-Andi
Re: Top 10 kernel oopses for the week ending January 5th, 2008
Linus Torvalds <[EMAIL PROTECTED]> writes:
>
> I usually just compile a small program like

Just use scripts/decodecode and cat the Code line into that.

> particularly good way to do it, and the old "ksymoops" program used to do
> a pretty good job of this, but I'm used to that particular idiotic way
> myself, since it's how I've basically always done it)
>
> After that, you still need to try to match up the assembly code with the
> source code and figure out what variables the register contents actually
> are all about. You can often try to do a
>
> 	make the/affected/file.s

IMHO better is "make the/file/xyz.lst", which gives you a listing with
the binary data in there that can be grepped for. But you should
install a very recent binutils, because older objdump -S couldn't deal
with unit-at-a-time compilers.

-Andi
Re: [PATCH 3/4] [XFRM]: Kill some bloat
On Mon, Jan 07, 2008 at 07:37:00PM -0800, David Miller wrote:
> From: Andi Kleen <[EMAIL PROTECTED]>
> Date: Tue, 8 Jan 2008 03:05:29 +0100
>
> > On Mon, Jan 07, 2008 at 05:54:58PM -0800, David Miller wrote:
> > > I explicitly left them out.
> > >
> > > Most of them are abstractions of common 2 or 3 instruction
> > > calculations, and thus should stay inline.
> >
> > Definitely not in tcp.h. It has quite a lot of very long functions, of
> > which very few really need to be inline: (AFAIK the only one where
> > it makes really sense is tcp_set_state due to constant evaluation;
> > although I never quite understood why the callers just didn't
> > call explicit functions to do these actions)
> >
> > % awk ' { line++ } ; /^{/ { start = line } ; /^}/ { n++; r += line-start-2; } ; END { print r/n }' < include/net/tcp.h
> > 9.48889
> >
> > The average function length is 9 lines.
>
> The vast majority of them are one, two, and three liners.

% awk ' { line++ } ; /^{/ { total++; start = line } ; /^}/ { len=line-start-3; if (len > 4) l++; if (len >= 10) k++; } ; END { print total, l, l/total, k, k/total }' < include/net/tcp.h
68 28 0.411765 20 0.294118

41% are over 4 lines, 29% are >= 10 lines.

-Andi
Re: [PATCH 3/4] [XFRM]: Kill some bloat II
> % awk ' { line++ } ; /^{/ { start = line } ; /^}/ { n++; r += line-start-2; } ; END { print r/n }' < include/net/tcp.h
> 9.48889
>
> The average function length is 9 lines.

Actually 8 -- the awk hack had an off-by-one. Still too long.

-Andi
Re: [PATCH 3/4] [XFRM]: Kill some bloat
On Mon, Jan 07, 2008 at 05:54:58PM -0800, David Miller wrote:
> From: Andi Kleen <[EMAIL PROTECTED]>
> Date: Tue, 08 Jan 2008 01:23:11 +0100
>
> > David Miller <[EMAIL PROTECTED]> writes:
> >
> > > Similarly I question just about any inline usage at all in *.c files
> >
> > Don't forget the .h files. Especially a lot of stuff in tcp.h should
> > be probably in some .c file and not be inline.
>
> I explicitly left them out.
>
> Most of them are abstractions of common 2 or 3 instruction
> calculations, and thus should stay inline.

Definitely not in tcp.h. It has quite a lot of very long functions, of
which very few really need to be inline: (AFAIK the only one where
it makes really sense is tcp_set_state due to constant evaluation;
although I never quite understood why the callers just didn't
call explicit functions to do these actions)

% awk ' { line++ } ; /^{/ { start = line } ; /^}/ { n++; r += line-start-2; } ; END { print r/n }' < include/net/tcp.h
9.48889

The average function length is 9 lines.

-Andi