Re: [RFC PATCH 0/7] Share events between metrics

2020-12-15 Thread Andi Kleen
> Or, is the concern more about trying to time-slice the results in a 
> fairly granular way and expecting accurate results then?

Usually the latter. It's especially important for divisions. You want
both divisor and dividend to be in the same time slice, otherwise
the result usually doesn't make a lot of sense.
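To illustrate with made-up numbers: for a metric like IPC = instructions /
cycles, if the instructions count came from a slice where the workload was
busy (say 4,000,000,000) but the cycles count came from a mostly idle slice
(say 500,000,000), the naive division gives an IPC of 8, which is not
plausible; with both counts taken from the same slice the ratio is meaningful.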

-Andi


Re: perf measure for stalled cycles per instruction on newer Intel processors

2020-10-18 Thread Andi Kleen
> > Don't use it. It's misleading on an out-of-order CPU because you don't
> > know if it's actually limiting anything.
> >
> > If you want useful bottleneck data use --topdown.
> 
> So running again, this time with the below params, I got this output
> where all the right most column is colored red. I wonder what can be
> said on the amount/ratio of stalls for this app - if you can maybe recommend
> some posts of yours to better understand that, I saw some comment in the
> perf-stat man page and some lwn article but wasn't really able to figure it 
> out.

TopDown determines what limits execution the most.

The application is mostly backend bound (55-72%). This can be either memory
issues (more common) or sometimes execution issues. Standard perf
doesn't support a further breakdown beyond these high-level categories,
but there are alternative tools that do (e.g. mine is "toplev" in
https://github.com/andikleen/pmu-tools, or VTune).

Some references on TopDown:
https://github.com/andikleen/pmu-tools/wiki/toplev-manual
http://bit.ly/tma-ispass14

The tools above would also allow you to sample where the stalls
are occurring.
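For example (a rough sketch; the exact flags may differ between versions,
see the toplev manual above), something like

  $ toplev.py -l3 --show-sample ./your_app

drills below the Backend_Bound category and suggests events to sample,
where ./your_app is just a placeholder for the workload.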

-Andi



Re: perf measure for stalled cycles per instruction on newer Intel processors

2020-10-15 Thread Andi Kleen
On Thu, Oct 15, 2020 at 05:53:40PM +0300, Or Gerlitz wrote:
> Hi,
> 
> Earlier Intel processors (e.g. E5-2650) support the two more classical
> stall events (for backend and frontend [1]), and then perf shows
> the nice measure of stalled cycles per instruction - e.g here where we
> have IPC of 0.91 and CSPI (see [2]) of 0.68:

Don't use it. It's misleading on an out-of-order CPU because you don't
know if it's actually limiting anything.

If you want useful bottleneck data use --topdown.
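For example, something along these lines (option spelling can vary a bit
between perf versions):

  $ perf stat --topdown -a -- sleep 1

prints the retiring / bad speculation / frontend bound / backend bound
breakdown instead of the raw stall counters.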

-Andi


Re: [RFC PATCH v3 12/14] perf metricgroup: order event groups by size

2020-05-08 Thread Andi Kleen
> > I'm not sure if size is that great a heuristic. The dedup algorithm should
> > work in any case even if you don't order by size, right?
> 
> Consider two metrics:
>  - metric 1 with events {A,B}
>  - metric 2 with events {A,B,C,D}
> If the list isn't sorted then as the matching takes the first group
> with all the events, metric 1 will match {A,B} and metric 2 {A,B,C,D}.
> If the order is sorted to {A,B,C,D},{A,B} then metric 1 matches within
> the {A,B,C,D} group as does metric 2. The events in metric 1 aren't
> used and are removed.

Ok. It's better for the longer metric if they stay together.

> 
> The dedup algorithm is very naive :-)

I guess what matters is that it gives reasonable results on the current
metrics. I assume it does?

How much deduping is happening if you run all metrics?

For toplev on my long term todo list was to compare it against
a hopefully better schedule generated by or-tools, but I never
got around to coding that up.

-Andi


Re: [RFC PATCH v3 06/14] perf evsel: fix 2 memory leaks

2020-05-08 Thread Andi Kleen
On Thu, May 07, 2020 at 10:36:21PM -0700, Ian Rogers wrote:
> If allocated, per_pkg_mask and metric_events need freeing.

All these patches at the beginning look like straightforward
bug fixes and are really independent of the new features.

For them

Reviewed-by: Andi Kleen 

> 
> Signed-off-by: Ian Rogers 
> ---
>  tools/perf/util/evsel.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 28683b0eb738..05bb46baad6a 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1263,6 +1263,8 @@ void evsel__exit(struct evsel *evsel)
>   zfree(&evsel->group_name);
>   zfree(&evsel->name);
>   zfree(&evsel->pmu_name);
> + zfree(&evsel->per_pkg_mask);
> + zfree(&evsel->metric_events);
>   perf_evsel__object.fini(evsel);
>  }
>  
> -- 
> 2.26.2.645.ge9eca65c58-goog
> 


Re: [RFC PATCH v3 13/14] perf metricgroup: remove duped metric group events

2020-05-08 Thread Andi Kleen
>  static struct evsel *find_evsel_group(struct evlist *perf_evlist,
> struct expr_parse_ctx *pctx,
> +   bool has_constraint,
> struct evsel **metric_events,
> unsigned long *evlist_used)
>  {
> - struct evsel *ev;
> - bool leader_found;
> - const size_t idnum = hashmap__size(&pctx->ids);
> - size_t i = 0;
> - int j = 0;
> + struct evsel *ev, *current_leader = NULL;
>   double *val_ptr;
> + int i = 0, matched_events = 0, events_to_match;
> + const int idnum = (int)hashmap__size(&pctx->ids);

BTW, the standard perf data structures here would be an rblist or a strlist.

I think it would really be better to do the deduping in a separate
pass rather than trying to add it to find_evsel_group. The current
approach leads to very complicated logic.

This will likely make it easier to implement more sophisticated
algorithms too.

-Andi



Re: [RFC PATCH v3 12/14] perf metricgroup: order event groups by size

2020-05-08 Thread Andi Kleen
On Thu, May 07, 2020 at 10:36:27PM -0700, Ian Rogers wrote:
> When adding event groups to the group list, insert them in size order.
> This performs an insertion sort on the group list. By placing the
> largest groups at the front of the group list it is possible to see if a
> larger group contains the same events as a later group. This can make
> the later group redundant - it can reuse the events from the large group.
> A later patch will add this sharing.

I'm not sure if size is that great a heuristic. The dedup algorithm should
work in any case even if you don't order by size, right?

I suppose in theory some kind of topological sort would be better.

-Andi
> 
> Signed-off-by: Ian Rogers 
> ---
>  tools/perf/util/metricgroup.c | 16 +++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
> index 2a6456fa178b..69fbff47089f 100644
> --- a/tools/perf/util/metricgroup.c
> +++ b/tools/perf/util/metricgroup.c
> @@ -520,7 +520,21 @@ static int __metricgroup__add_metric(struct list_head *group_list,
>   return -EINVAL;
>   }
>  
> - list_add_tail(&eg->nd, group_list);
> + if (list_empty(group_list))
> + list_add(&eg->nd, group_list);
> + else {
> + struct list_head *pos;
> +
> + /* Place the largest groups at the front. */
> + list_for_each_prev(pos, group_list) {
> + struct egroup *old = list_entry(pos, struct egroup, nd);
> +
> + if (hashmap__size(&eg->pctx.ids) <=
> + hashmap__size(&old->pctx.ids))
> + break;
> + }
> + list_add(&eg->nd, pos);
> + }
>  
>   return 0;
>  }
> -- 
> 2.26.2.645.ge9eca65c58-goog
> 


Re: [RFC PATCH v3 01/14] perf parse-events: expand add PMU error/verbose messages

2020-05-08 Thread Andi Kleen


This seems like an independent bug fix and should probably
be pushed independently of the rest.

Perhaps add a Fixes: tag.

Reviewed-by: Andi Kleen 

On Thu, May 07, 2020 at 10:36:16PM -0700, Ian Rogers wrote:
> On a CPU like skylakex an uncore_iio_0 PMU may alias with
> uncore_iio_free_running_0. The latter PMU doesn't support fc_mask
> as a parameter and so pmu_config_term fails. Typically
> parse_events_add_pmu is called in a loop where if one alias succeeds
> errors are ignored, however, if multiple errors occur
> parse_events__handle_error will currently give a WARN_ONCE.
> 
> This change removes the WARN_ONCE in parse_events__handle_error and
> makes it a pr_debug. It adds verbose messages to parse_events_add_pmu
> warning that non-fatal errors may occur, while giving details on the pmu
> and config terms for useful context. pmu_config_term is altered so the
> failing term and pmu are present in the case of the 'unknown term'
> error which makes spotting the free_running case more straightforward.
> 
> Before:
> $ perf --debug verbose=3 stat -M llc_misses.pcie_read sleep 1
> Using CPUID GenuineIntel-6-55-4
> metric expr unc_iio_data_req_of_cpu.mem_read.part0 + 
> unc_iio_data_req_of_cpu.mem_read.part1 + 
> unc_iio_data_req_of_cpu.mem_read.part2 + 
> unc_iio_data_req_of_cpu.mem_read.part3 for LLC_MISSES.PCIE_READ
> found event unc_iio_data_req_of_cpu.mem_read.part0
> found event unc_iio_data_req_of_cpu.mem_read.part1
> found event unc_iio_data_req_of_cpu.mem_read.part2
> found event unc_iio_data_req_of_cpu.mem_read.part3
> metric expr unc_iio_data_req_of_cpu.mem_read.part0 + 
> unc_iio_data_req_of_cpu.mem_read.part1 + 
> unc_iio_data_req_of_cpu.mem_read.part2 + 
> unc_iio_data_req_of_cpu.mem_read.part3 for LLC_MISSES.PCIE_READ
> found event unc_iio_data_req_of_cpu.mem_read.part0
> found event unc_iio_data_req_of_cpu.mem_read.part1
> found event unc_iio_data_req_of_cpu.mem_read.part2
> found event unc_iio_data_req_of_cpu.mem_read.part3
> adding 
> {unc_iio_data_req_of_cpu.mem_read.part0,unc_iio_data_req_of_cpu.mem_read.part1,unc_iio_data_req_of_cpu.mem_read.part2,unc_iio_data_req_of_cpu.mem_read.part3}:W,{unc_iio_data_req_of_cpu.mem_read.part0,unc_iio_data_req_of_cpu.mem_read.part1,unc_iio_data_req_of_cpu.mem_read.part2,unc_iio_data_req_of_cpu.mem_read.part3}:W
> intel_pt default config: tsc,mtc,mtc_period=3,psb_period=3,pt,branch
> WARNING: multiple event parsing errors
> ...
> Invalid event/parameter 'fc_mask'
> ...
> 
> After:
> $ perf --debug verbose=3 stat -M llc_misses.pcie_read sleep 1
> Using CPUID GenuineIntel-6-55-4
> metric expr unc_iio_data_req_of_cpu.mem_read.part0 + 
> unc_iio_data_req_of_cpu.mem_read.part1 + 
> unc_iio_data_req_of_cpu.mem_read.part2 + 
> unc_iio_data_req_of_cpu.mem_read.part3 for LLC_MISSES.PCIE_READ
> found event unc_iio_data_req_of_cpu.mem_read.part0
> found event unc_iio_data_req_of_cpu.mem_read.part1
> found event unc_iio_data_req_of_cpu.mem_read.part2
> found event unc_iio_data_req_of_cpu.mem_read.part3
> metric expr unc_iio_data_req_of_cpu.mem_read.part0 + 
> unc_iio_data_req_of_cpu.mem_read.part1 + 
> unc_iio_data_req_of_cpu.mem_read.part2 + 
> unc_iio_data_req_of_cpu.mem_read.part3 for LLC_MISSES.PCIE_READ
> found event unc_iio_data_req_of_cpu.mem_read.part0
> found event unc_iio_data_req_of_cpu.mem_read.part1
> found event unc_iio_data_req_of_cpu.mem_read.part2
> found event unc_iio_data_req_of_cpu.mem_read.part3
> adding 
> {unc_iio_data_req_of_cpu.mem_read.part0,unc_iio_data_req_of_cpu.mem_read.part1,unc_iio_data_req_of_cpu.mem_read.part2,unc_iio_data_req_of_cpu.mem_read.part3}:W,{unc_iio_data_req_of_cpu.mem_read.part0,unc_iio_data_req_of_cpu.mem_read.part1,unc_iio_data_req_of_cpu.mem_read.part2,unc_iio_data_req_of_cpu.mem_read.part3}:W
> intel_pt default config: tsc,mtc,mtc_period=3,psb_period=3,pt,branch
> Attempting to add event pmu 'uncore_iio_free_running_5' with 
> 'unc_iio_data_req_of_cpu.mem_read.part0,' that may result in non-fatal errors
> After aliases, add event pmu 'uncore_iio_free_running_5' with 
> 'fc_mask,ch_mask,umask,event,' that may result in non-fatal errors
> Attempting to add event pmu 'uncore_iio_free_running_3' with 
> 'unc_iio_data_req_of_cpu.mem_read.part0,' that may result in non-fatal errors
> After aliases, add event pmu 'uncore_iio_free_running_3' with 
> 'fc_mask,ch_mask,umask,event,' that may result in non-fatal errors
> Attempting to add event pmu 'uncore_iio_free_running_1' with 
> 'unc_iio_data_req_of_cpu.mem_read.part0,' that may result in non-fatal errors
> After aliases, add event pmu 'uncore_iio_free_running_1' with 
> 'fc_mask,ch_mask,umask,event,

Re: [RFC PATCH v3 00/14] Share events between metrics

2020-05-08 Thread Andi Kleen
On Thu, May 07, 2020 at 10:36:15PM -0700, Ian Rogers wrote:
> Metric groups contain metrics. Metrics create groups of events to
> ideally be scheduled together. Often metrics refer to the same events,
> for example, a cache hit and cache miss rate. Using separate event
> groups means these metrics are multiplexed at different times and the
> counts don't sum to 100%. More multiplexing also decreases the
> accuracy of the measurement.
> 
> This change orders metrics from groups or the command line, so that
> the ones with the most events are set up first. Later metrics see if
> groups already provide their events, and reuse them if
> possible. Unnecessary events and groups are eliminated.

Yes some improvements here are great.

> 
> The option --metric-no-group is added so that metrics aren't placed in
> groups. This affects multiplexing and may increase sharing.
> 
> The option --metric-no-merge is added and with this option the
> existing grouping behavior is preserved.

Could we also make this a per metric option, like

-M foo:nomerge,... 

or somesuch? Okay, I suppose this could be a follow-on.

Ultimately, like you said, we probably want to configure
defaults in the event file.

-Andi


Re: [RFC PATCH 0/7] Share events between metrics

2020-05-07 Thread Andi Kleen
> > - without this change events within a metric may get scheduled
> >   together, after they may appear as part of a larger group and be
> >   multiplexed at different times, lowering accuracy - however, less
> >   multiplexing may compensate for this.
> 
> I agree the heuristic in this patch set is naive and would welcome to
> improve it from your toplev experience. I think this change is
> progress on TopDownL1 - would you agree?

TopdownL1 in non-SMT mode should always fit. Inside a group,
deduping always makes sense.

The problem is SMT mode where it doesn't fit. toplev tries
to group each node and each level together.

> 
> I'm wondering if what is needed are flags to control behavior. For
> example, avoiding the use of groups altogether. For TopDownL1 I see.

Yes the current situation isn't great.

For Topdown your patch clearly is an improvement, I'm not sure
it's for everything though.

Probably the advanced heuristics are only useful for a few
formulas; most are very simple. So maybe it's ok. I guess it
would need some testing over the existing formulas.


-Andi


Re: [RFC PATCH 0/7] Share events between metrics

2020-05-07 Thread Andi Kleen
On Thu, May 07, 2020 at 01:14:29AM -0700, Ian Rogers wrote:
> Metric groups contain metrics. Metrics create groups of events to
> ideally be scheduled together. Often metrics refer to the same events,
> for example, a cache hit and cache miss rate. Using separate event
> groups means these metrics are multiplexed at different times and the
> counts don't sum to 100%. More multiplexing also decreases the
> accuracy of the measurement.
> 
> This change orders metrics from groups or the command line, so that
> the ones with the most events are set up first. Later metrics see if
> groups already provide their events, and reuse them if
> possible. Unnecessary events and groups are eliminated.

Note this actually may make multiplexing errors worse.

For metrics it is often important that all the input values to
the metric run at the same time. 

So e.g. if you have two metrics and they each fit into a group,
but not together, even though you have more multiplexing it
will give more accurate results for each metric.

I think your change can make sense for metrics that don't fit
into single groups anyway. perf currently doesn't quite know
this, but some heuristic could be added.

But I wouldn't do it for simple metrics that fit into groups.
The result may well be worse.

My toplev tool has some heuristics for this, including some more
sophisticated ones that track subexpressions. That would likely
be far too complicated for perf.

-Andi


Re: [RESEND] test_verifier #13 fails on arm64: "retval 65507 != -29"

2019-07-10 Thread Andi Kleen
> > Reading csum_partial/csum_fold, seems like after calculation of
> > checksum (so-called unfolded checksum), it is supposed to be passed
> > into csum_fold() to convert it into 16-bit one and invert.

Yes, you always need to fold at the end.

The low level code does fold sometimes, but not always.

-Andi


[PATCH] ipv4 ping: Fix __init* attributes

2019-03-29 Thread Andi Kleen
From: Andi Kleen 

ping_v4_net_ops references init functions, so needs to be __initdata.
ping_proc_exit is then referenced from __initdata, so also needs
to be __init.

Signed-off-by: Andi Kleen 
---
 net/ipv4/ping.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 7ccb5f87f70b..070e078612a0 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -1160,7 +1160,7 @@ static void __net_exit ping_v4_proc_exit_net(struct net *net)
remove_proc_entry("icmp", net->proc_net);
 }
 
-static struct pernet_operations ping_v4_net_ops = {
+static __initdata struct pernet_operations ping_v4_net_ops = {
.init = ping_v4_proc_init_net,
.exit = ping_v4_proc_exit_net,
 };
@@ -1170,7 +1170,7 @@ int __init ping_proc_init(void)
return register_pernet_subsys(&ping_v4_net_ops);
 }
 
-void ping_proc_exit(void)
+void __init ping_proc_exit(void)
 {
unregister_pernet_subsys(&ping_v4_net_ops);
 }
-- 
2.20.1



Re: CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line

2017-11-13 Thread Andi Kleen
> I wonder if there is some way to use the split format for the
> intermediate files, but then for the very final link bring them all in
> and make the end result be a traditional single binary. I'm not
> talking the separate "dwp" package that packs multiple dwo files into
> one, but to actually link them all in at the end.
> 
> Sure, it would lose some of the advantage, but I think a large portion
> of the -gsplit-dwarf advantage is about the intermediate state. No?

Not sure it's worth doing complicated workarounds. I assume
it won't be that difficult to fix binutils (after all
gdb works), and disabling the option is a reasonable workaround.

> 
> I tried to google for it, but couldn't find anything. But apparently
> elfutils doesn't support dwo files either. It seems mainly the linker
> and gdb itself that supports it.

The original design document is

https://gcc.gnu.org/wiki/DebugFission

-Andi


Re: CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line

2017-11-13 Thread Andi Kleen
On Mon, Nov 13, 2017 at 12:56:31PM -0800, Linus Torvalds wrote:
> On Mon, Nov 13, 2017 at 12:10 PM, Andi Kleen  wrote:
> >
> > You're right. It works for line information, but strangely not for
> > inlines. I assume it can be fixed.
> 
> So I'm not 100% sure it's strictly a addr2line bug.

It seems to be broken for normal programs too:

$ cat tinline.c

int i;

static inline int finline(void)
{
i++;
}

main()
{
finline();
}

$ gcc -O2 -gsplit-dwarf tinline.c
$ addr2line -i -e a.out 0x4003b0
/home/ak/tsrc/tinline.c:6
$ gcc -O2  -g tinline.c
$ addr2line -i -e a.out 0x4003b0
/home/ak/tsrc/tinline.c:6
/home/ak/tsrc/tinline.c:12
$

I filed https://sourceware.org/bugzilla/show_bug.cgi?id=22434


-Andi



Re: CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line

2017-11-13 Thread Andi Kleen
> Put another way: the CONFIG_DEBUG_INFO_SPLIT option is useless. Yes,
> it saves time and disk space, but does so at the expense of making all
> the debug information unavailable to basic tools.

You're right. It works for line information, but strangely not for
inlines. I assume it can be fixed.

-Andi


Re: CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line

2017-11-13 Thread Andi Kleen
> > It's the "CONFIG_DEBUG_INFO_SPLIT" thing that makes faddr2line unable
> > to see the inlining information,
> > 
> > Using OPTIMIZE_INLINING is fine.
> 
> Good to know that!

It works for me. Perhaps your binutils is too old? It was
added at some point. Can you try upgrading?

% ./linux/scripts/faddr2line obj/vmlinux schedule+10
schedule+10/0x80:
schedule at arch/x86/include/asm/current.h:15

% addr2line --version
GNU addr2line version 2.27-24.fc26

-Andi


Re: Page allocator bottleneck

2017-09-14 Thread Andi Kleen
Tariq Toukan  writes:
>
> Congestion in this case is very clear.
> When monitored in perf top:
> 85.58% [kernel] [k] queued_spin_lock_slowpath

Please look at the callers. Spinlock profiles without callers
are usually useless because it's just blaming the messenger.

Most likely the PCP lists are too small for your extreme allocation
rate, so it goes back too often to the shared pool.

You can experiment with the vm.percpu_pagelist_fraction setting.
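For example (hypothetical command lines, adjust to your setup), the callers
can usually be seen with

  # perf record -a -g -e cycles -- sleep 10
  # perf report --no-children

and the PCP list size can be experimented with via something like

  # sysctl -w vm.percpu_pagelist_fraction=8

on kernels of this vintage.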

-Andi


Re: [PATCH v7 6/6] siphash: implement HalfSipHash1-3 for hash tables

2016-12-21 Thread Andi Kleen
> 64-bit x86_64:
> [0.509409] test_siphash: SipHash2-4 cycles: 4049181
> [0.510650] test_siphash: SipHash1-3 cycles: 2512884
> [0.512205] test_siphash: HalfSipHash1-3 cycles: 3429920
> [0.512904] test_siphash:JenkinsHash cycles:  978267

I'm not sure what these numbers mean. Surely a single siphash2-4
does not take 4+ million cycles? 

If you run them in a loop please divide by the iterations.
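For example, purely hypothetically: if the loop ran 100,000 iterations,
4,049,181 cycles would be roughly 40 cycles per SipHash2-4 call, which is a
much more plausible per-call number.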

But generally, running small code in a loop is often an unrealistic
benchmark strategy because it hides cache misses, primes
predictors, changes frequencies, and changes memory costs,
while also overloading pipelines and oversubscribing
resources.

[see also page 46+ in http://halobates.de/applicative-mental-models.pdf]

So the numbers you get there are at least somewhat
dubious. It would be good to have at least one test that
is not just a tiny micro-benchmark to compare against before
drawing conclusions.

-Andi


Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)

2016-09-25 Thread Andi Kleen
Eric Dumazet  writes:

> On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote:
>> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
>> with the number of fds passed. We had a customer report page allocation
>> failures of order-4 for this allocation. This is a costly order, so it might
>> easily fail, as the VM expects such allocation to have a lower-order 
>> fallback.
>> 
>> Such trivial fallback is vmalloc(), as the memory doesn't have to be
>> physically contiguous. Also the allocation is temporary for the duration of 
>> the
>> syscall, so it's unlikely to stress vmalloc too much.
>
> vmalloc() uses a vmap_area_lock spinlock, and TLB flushes.
>
> So I guess allowing vmalloc() being called from an innocent application
> doing a select() might be dangerous, especially if this select() happens
> thousands of time per second.

Yes it seems like a bad idea because of all the scaling problems here.

The right solution would be to fix select to use multiple
pages that don't need to be virtually contiguous.

-Andi



Re: [RFC] bridge: MAC learning uevents

2016-09-08 Thread Andi Kleen
"D. Herrendoerfer"  writes:
>
> I may be missing something here - I'm pretty sure I am - but is
> there any conceptual reason why this should not be done this way?

What happens if someone floods the network with random mac addresses?
Sounds like an easy way to do a DoS attack against your host?

-Andi


Re: [RFC V2 PATCH 17/25] net/netpolicy: introduce netpolicy_pick_queue

2016-08-04 Thread Andi Kleen
> +1, I tried to bring this up here [1] in the last spin. I think only very
> few changes would be needed, f.e. on eBPF side to add a queue setting
> helper function which is probably straight forward ~10loc patch; and with
> regards to actually picking it up after clsact egress, we'd need to adapt
> __netdev_pick_tx() slightly when CONFIG_XPS so it doesn't override it.

You're proposing to rewrite the whole net policy manager as EBPF and run
it in a crappy JITer? Is that a serious proposal? It just sounds crazy
to me.

Especially since we already have a perfectly good compiler and
programming language to write system code in.

EBPF is ok for temporal instrumentation (if you somehow can accept
its security challenges), but using it to replace core 
kernel functionality (which network policy IMHO is) with some bizarre
JITed setup and multiple languages doesn't really make any sense.

Especially it doesn't make sense for anything with shared state,
which is the core part of network policy: it negotiates with multiple
users.

After all we're writing Linux here and not some research toy.

Thanks,

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Andi Kleen
> > So where is your policy for power saving?  From past experience I can tell 
> > you
> 
> There is no policy for power saving yet. I will add it to my todo list.

Yes it's interesting to consider. The main goal here is to maximize CPU
idle residency? I wonder if that's that much different from the CPU policy.

-Andi


Re: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Andi Kleen
> I wonder if this can be attacked from a different angle. What would be
> missing to add support for this in user space? The first possibility
> that came to my mind is to just multiplex those hints in the kernel.

"just" is the handwaving part here -- you're proposing a micro kernel
approach where part of the multiplexing job that the kernel is doing
is farmed out to a message passing user space component.

I suspect this would be far more complicated to get right and
perform well than a straightforward monolithic kernel subsystem --
which is traditionally how Linux has approached things.

The daemon would always need to work with out-of-date state
compared to the latest, because it cannot do any locking with the
kernel state.  So you end up with a complex distributed system with multiple
agents "fighting" with each other, and the tuning agent
never being able to keep up with the actual work.

Also of course it would be fundamentally less efficient than
kernel code doing that, just because of the additional context
switches needed.

-Andi


Re: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Andi Kleen
> It seems strange to me to add such policies to the kernel.
> Addmittingly, documentation of some settings is non-existent and one needs
> various different tools to set this (sysctl, procfs, sysfs, ethtool, etc).

The problem is that different applications need different policies.

The only entity which can efficiently negotiate between different
applications' conflicting requests is the kernel. And that is pretty 
much the basic job description of a kernel: multiplex hardware
efficiently between different users.

So yes the user space tuning approach works for simple cases
("only run workloads that require the same tuning"), but is ultimately not
very interesting nor scalable.

-Andi


Re: [PATCH] net:ppp: replace too strict capability restriction on opening /dev/ppp

2016-06-19 Thread Andi Kleen
Shanker Wang  writes:

> This patch removes the check for CAP_NET_ADMIN in the initial namespace
> when opening /dev/ppp. Instead, CAP_NET_ADMIN is checked in the user
> namespace the net namespace was created in, so that /dev/ppp can get opened
> in an unprivileged container.

Seems dangerous. From a quick look at the PPP ioctl there is no limit
on how many PPP devices this can create. So a container having access to
this would be able to fill all kernel memory. Probably needs more
auditing and hardening first.

In general there seems to be a lot of attack surface for root
in PPP.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH 00/33] Compile-time stack metadata validation

2016-02-15 Thread Andi Kleen
> > There are two ways to fix the warnings:
> >
> > 1. get rid of the thunks and call the C functions directly; or
> 
> No. Not until gcc learns about per-function calling conventions (so that it
> can
> be marked as not clobbering registers).

It does already for static functions in 5.x (with -fipa-ra).
And with LTO it can be used even somewhat globally.

Even older versions supported it, but only for x86->SSE on 32bit, which
is useless for the kernel. But the new IPA-RA propagates which registers
are clobbered.

That said it will probably be a long time until we can drop support for
older compilers.  So for now the manual method is still needed.

-Andi



Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-06 Thread Andi Kleen
Tom Herbert  writes:

> Also, we don't do anything special for alignment, unaligned
> accesses on x86 do not appear to be a performance issue.

This is not true on Atom CPUs.

Also on most CPUs there is still a larger penalty when crossing
cache lines.

> Verified correctness by testing arbitrary length buffer filled with
> random data. For each buffer I compared the computed checksum
> using the original algorithm for each possible alignment (0-7 bytes).
>
> Checksum performance:
>
> Isolating old and new implementation for some common cases:

You forgot to state the CPU. The results likely depend heavily
on the microarchitecture.

The original C code was optimized for K8 FWIW.

Overall your assembler looks similar to the C code, except for the jump
table. A jump table has the disadvantage that it is much harder to
branch-predict, with a large penalty if it's mispredicted.

I would expect it to be slower for cases where the length
changes frequently. Did you benchmark that case?

-Andi


Re: Increasing skb->mark size

2015-12-01 Thread Andi Kleen
> >This would be great. I've recently run into some issues with
> >the overhead of the Android firewall setup.
> >
> >So basically you need 4 extra bytes in sk_buff. How about:
> >
> >- shrinking skb->priority to 2 byte
> 
> That wouldn't work, see SO_PRIORITY and such (4 bytes) ...

But does anybody really use the full 4 bytes for the priority?
SO_PRIORITY could well truncate the value.

> 
> >- skb_iff is either skb->dev->iff or 0. so it could be replaced with a
> >single bit flag for the 0 case.
> 
> ... and that one wouldn't work on ingress.

Please explain.

-Andi


Re: Increasing skb->mark size

2015-12-01 Thread Andi Kleen
Lorenzo Colitti  writes:

> On Wed, Nov 25, 2015 at 5:32 AM, Matt Bennett
>  wrote:
>> I'm emailing this list for feedback on the feasibility of increasing
>> skb->mark or adding a new field for marking. Perhaps this extension
>> could be done under a new CONFIG option.
>
> 64-bit marks (both skb->mark and sk->sk_mark) would be useful for
> hosts doing complex policy routing as well. Current Android releases
> use 20 of the 32 bits. If the mark were 64 bits, we could put the UID
> in it, and stop using ip rules to implement per-UID routing.

This would be great. I've recently run into some issues with
the overhead of the Android firewall setup.

So basically you need 4 extra bytes in sk_buff. How about:

- shrinking skb->priority to 2 bytes
- skb_iff is either skb->dev->iff or 0, so it could be replaced with a
single-bit flag for the 0 case.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists

2015-10-07 Thread Andi Kleen
> My specific CPU (i7-4790K @ 4.00GHz) unfortunately seems to have
> limited "Frontend" support. E.g. 
> 
>  # perf record -g -a -e stalled-cycles-frontend
>  Error:
>  The stalled-cycles-frontend event is not supported.
> 
> And AFAIK icache misses are part of "frontend".

Ignore stalled-cycles-frontend. It is very unreliable and has never worked 
right.
toplev gives you much more reliable output.


-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists

2015-10-07 Thread Andi Kleen
> There is a recent patch that may help here, see below, but maybe its
> just a matter of removing that :pp, as it ends with a /pp anyway, no
> need to state that twice :)

Yes the extra :pp was a regression in toplev. I fixed it now. Thanks.

-Andi


Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists

2015-10-05 Thread Andi Kleen
> My only problem left, is I want a perf measurement that pinpoint these
> kind of spots.  The difference in L1-icache-load-misses were significant
> (1,278,276 vs 2,719,158).  I tried to somehow perf record this with
> different perf events without being able to pinpoint the location (even
> though I know the spot now).  Even tried Andi's ocperf.py... maybe he
> will know what event I should try?

Run pmu-tools toplev.py -l3 with --show-sample. It tells you what the
bottleneck is and what to sample for if there is a suitable event and
even prints the command line.

https://github.com/andikleen/pmu-tools/wiki/toplev-manual#sampling-with-toplev

However frontend issues are difficult to sample, as they happen very far
away from instruction retirement where the sampling happens. So you may
have large skid and the sampling points may be far away. Skylake has new
special FRONTEND_* PEBS events for this, but before it was often difficult. 

BTW, if your main goal is icache, I wrote a gcc patch to help the kernel
by enabling function splitting: Apply the patch in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66890 to gcc 5,
make sure 9bebe9e5b0f (now in mainline) is applied and build with
-freorder-blocks-and-partition. That will split all functions into
statically predicted hot and cold parts and generally relieves
icache pressure. Any testing of this on your workload is welcome.
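As a rough sketch of how to try it (assuming the patched compiler is
installed as gcc-5; the name is just a placeholder):

  $ make CC=gcc-5 KCFLAGS="-freorder-blocks-and-partition" vmlinux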

-Andi



Re: [RFC PATCH 0/8]: uninline & uninline

2008-02-23 Thread Andi Kleen
> Is it possible to catch this automatically, like, by re-defining
> likely/unlikely to the raw form in specific file(s)?

Sure, it would be possible to define an IN_VDSO symbol in all vdso
related files and then do that. Might be useful for other things
too. vdso has some very specific requirements.

-Andi


Re: [RFC PATCH 0/8]: uninline & uninline

2008-02-23 Thread Andi Kleen
Andrew Morton <[EMAIL PROTECTED]> writes:


>> -41525  2066 f, 3370 +, 44895 -, diff: -41525  IS_ERR 
>
> This is a surprise.  I expect that the -mm-only
> profile-likely-unlikely-macros.patch is the cause of this and mainline
> doesn't have this problem.

Shouldn't they only have overhead when the respective CONFIG is enabled?

> If true, then this likely/unlikely bloat has probably spread into a lot of
> your other results and it all should be redone against mainline, sorry :(
>
> (I'm not aware of anyone having used profile-likely-unlikely-macros.patch
> in quite some time.  That's unfortunate because it has turned up some
> fairly flagrant code deoptimisations)

Is there any reason they couldn't just be merged to mainline? 

I think it's a useful facility.

-Andi


Re: [RFC PATCH 8/8] Jhash in too big for inlining, move under lib/

2008-02-23 Thread Andi Kleen
Andrew Morton <[EMAIL PROTECTED]> writes:
>
> It should be possible to use a modular jhash.ko.  The things which you
> have identified as clients of the jhash library are usually loaded as modules.

For very small functions like this, separate modules are quite expensive. First,
everything gets rounded up to at least one 4K page (or worse on architectures
with larger pages). That just wastes some memory.

But then, since modules live in vmalloc space, they also need their own
TLB entry, and TLB entries are notoriously scarce in the kernel because user space
often wants to monopolize them all. So if you're unlucky and user space
is thrashing the TLB, just a single call to this hash function will take its own
TLB miss and be quite expensive.

It would be better to just always link it in for this case.

-Andi


Re: [PATCH] Make sure sockets implement splice_read

2008-02-15 Thread Andi Kleen
Rémi Denis-Courmont <[EMAIL PROTECTED]> writes:

> On Friday 15 February 2008 18:30:11, Andi Kleen wrote:
>> David Miller <[EMAIL PROTECTED]> writes:
>> > From: Rémi_Denis-Courmont <[EMAIL PROTECTED]>
>> > Date: Thu, 14 Feb 2008 18:53:34 +0200
>> >
>> >> Fixes a segmentation fault when trying to splice from a non-TCP socket.
>> >>
>> >> Signed-off-by: Rémi Denis-Courmont <[EMAIL PROTECTED]>
>> >
>> > Applied, thank you.
>>
>> That's also a stable candidate, isn't it?
>
> Should go to 2.6.25-rc2, but the offending code is not in any stable release.

You're right. I misinterpreted git describe output for this one. Sorry.

-Andi


Re: [PATCH] Make sure sockets implement splice_read

2008-02-15 Thread Andi Kleen
David Miller <[EMAIL PROTECTED]> writes:

> From: Rémi_Denis-Courmont <[EMAIL PROTECTED]>
> Date: Thu, 14 Feb 2008 18:53:34 +0200
>
>> Fixes a segmentation fault when trying to splice from a non-TCP socket.
>> 
>> Signed-off-by: Rémi Denis-Courmont <[EMAIL PROTECTED]>
>
> Applied, thank you.

That's also a stable candidate, isn't it? 

-Andi


Re: [PATCH] ipvs: Make the synchronization interval controllable

2008-02-08 Thread Andi Kleen
Sven Wegener <[EMAIL PROTECTED]> writes:

> The default synchronization interval of 1000 milliseconds is too high for a
> heavily loaded director. Collecting the connection information from one second
> and then sending it out in a burst will overflow the socket buffer and lead to
> synchronization information being dropped. Make the interval controllable by a
> sysctl variable so that users can tune it.

It would be better if the defaults just worked under all circumstances.
So why not just lower the default?

Or the code could detect overflowing socket buffers and lower the 
value dynamically.

-Andi


openswan broken between 2.6.24rc8 and final

2008-02-08 Thread Andi Kleen

Hi,

Since I updated a system here from 2.6.24rc8 (+ some unrelated patches)
to 2.6.24, openswan in tunnel mode is rather unhappy now. It works
initially for a few minutes, but then when it comes to renew a key (I think)
the connection always hangs. This means that on 2.6.24-final the kernel
can now finally reboot again with active ipsec (previously
there were always leaked reference counts), but it cannot
keep the connections up for a longer time.

I first thought it was caused by

commit d4782c323d10d3698b71b6a6b3c7bdad33824658
Author: Patrick McHardy <[EMAIL PROTECTED]>
Date:   Sun Jan 20 17:24:29 2008 -0800

[AF_KEY]: Fix skb leak on pfkey_send_migrate() error

but when I revert just this patch it still doesn't work
(and obviously the reboot problem is back).

Before I do a full bisect: is there already a fix for this?

Thanks,

-Andi 



Re: [PATCH] Add IPv6 support to TCP SYN cookies

2008-02-08 Thread Andi Kleen
On Thu, Feb 07, 2008 at 02:44:46PM -0500, Ross Vandegrift wrote:
> On Wed, Feb 06, 2008 at 09:53:57AM +0100, Andi Kleen wrote:
> > That would be useful yes -- for different bandwidths.
> 
> My initial test is end-to-end 1000Mbps, but I've got a few different
> packet rates.
> 
> > If the young/old heuristics do not work well enough anymore most
> > likely we should try readding RED to the syn queue again. That used
> > to be pretty effective in the early days. I don't quite remember why
> > Linux didn't end up using it in fact.
> 
> I'm running juno-z with 2, 4, & 8 threads of syn flood to port 80.
> wireshark measures 2 threads at 350pps, 4 threads at 750pps, and 8

What's the total bandwidth of the attack?

> threads at 1200pps.  Under no SYN flood, the server handles 750 HTTP
> requests per second, measured via httping in flood mode.

Thanks for the tests. Could you please do an additional experiment? 
Use netem or similar to add jitter and a longer latency to the connection
(as would be realistic in a real distributed DoS). Does it make a
difference?
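For example, something like this on the sender (interface name and numbers
are only placeholders):

  # tc qdisc add dev eth0 root netem delay 200ms 50ms

adds about 200ms of latency with 50ms of jitter, which is closer to what a
distributed attack from remote hosts would look like.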

> With a default tcp_max_syn_backlog of 1024, I can trivially prevent
> any inbound client connections with 2 threads of syn flood.
> Enabling tcp_syncookies brings the connection handling back up to 725
> fetches per second.

Yes the defaults are probably too low. That's something that should
be fixed.

> 
> At these levels the CPU impact of tcp_syncookies is nothing.  I can't

CPU impact of syncookies was never a concern. The problems are rather
missing flow control and disabling of valuable TCP features.

> BUG: soft lockup detected on CPU#1!
>  [] softlockup_tick+0x96/0xa4
>  [] update_process_times+0x39/0x5c
>  [] smp_apic_timer_interrupt+0x5b/0x6c
>  [] apic_timer_interrupt+0x1f/0x24
>  [] taskstats_exit_send+0x152/0x371
>  [] netlink_kernel_create+0x5/0x11c
>  [] reqsk_queue_alloc+0x32/0x81
>  [] lock_sock+0x8e/0x96

I think the softirqs are starving user context through the socket
lock. That probably should be fixed too: the softirq side could detect
when there is a user-context waiter and it has been looping too long, and
then give up the lock for some time.


-Andi


Re: [PATCH] Add IPv6 support to TCP SYN cookies

2008-02-06 Thread Andi Kleen
On Wed, Feb 06, 2008 at 09:36:11AM -0800, Glenn Griffin wrote:
> > I didn't think a module could have multiple module_inits. Are you
> > sure that works?
> 
> Indeed.  That will fail whenever ipv6 is compiled as a module.  It's
> been removed.  It snuck in from the v4 implementation, where I'm still
> having trouble understanding why it's needed there.

s/needed/used/

ipv4 is never modular so it works.

Arguably it would be cleaner if it was __initcall()

-Andi


Re: [PATCH] Add IPv6 support to TCP SYN cookies

2008-02-06 Thread Andi Kleen
> I work at a hosting company and we see these kinds of issues in the
> real world fairly frequently.  I would guess maybe a monthly basis.
> The servers where we have seen this are typically running RHEL 4 or 5
> kernels, so I can't really speak to how recent the kernel is in this
> specific term.

RHEL5 should be recent enough.

> 
> If I can find a box that we could temporary get a kernel.org kernel
> on, I'll see if I can get a real comparison together.  We have
> collected a few of the more effective attack tools that have been left
> on compromised systems, so it wouldn't be too difficult to get some
> numbers.

That would be useful yes -- for different bandwidths.

If the young/old heuristics do not work well enough anymore, most likely we
should try re-adding RED to the syn queue again. That used to be pretty effective
in the early days. I don't quite remember why Linux didn't end up using it
in fact.

-Andi



Re: [PATCH] Add IPv6 support to TCP SYN cookies

2008-02-05 Thread Andi Kleen
> +static __init int init_syncookies(void)
> +{
> + get_random_bytes(syncookie_secret, sizeof(syncookie_secret));
> + return 0;
> +}
> +module_init(init_syncookies);

I didn't think a module could have multiple module_inits. Are you
sure that works?

-Andi



Re: [PATCH] Add IPv6 support to TCP SYN cookies

2008-02-05 Thread Andi Kleen
On Tue, Feb 05, 2008 at 11:39:11PM +0300, Evgeniy Polyakov wrote:
> On Tue, Feb 05, 2008 at 09:02:11PM +0100, Andi Kleen ([EMAIL PROTECTED]) 
> wrote:
> > On Tue, Feb 05, 2008 at 10:29:28AM -0800, Glenn Griffin wrote:
> > > > Syncookies are discouraged these days. They disable too many
> > > > valuable TCP features (window scaling, SACK) and even without them
> > > > the kernel is usually strong enough to defend against syn floods
> > > > and systems have much more memory than they used to be.
> > > >
> > > > So I don't think it makes much sense to add more code to it, sorry.
> 
> How does syncookies prevent windows from growing?

Syncookies do not allow window scaling, so you can't have any windows >64k.

> Most (if not all) distributions have them enabled and window growing
> works just fine. Actually I do not see any reason why connection
> establishment handshake should prevent any run-time operations at all,
> even if it was setup during handshake.

TCP only uses options negotiated during the handshake, and syncookies
are incapable of doing this.

-Andi


Re: [PATCH] Add IPv6 support to TCP SYN cookies

2008-02-05 Thread Andi Kleen
> The problem is that any reasonably recent PC can generate enough
> forged SYN packets to overwhelm reasonable SYN queues on a much more
> powerful server.

Have you actually seen this with a recent kernel in the wild or are
you just talking theoretically?

Linux uses some heuristics to manage the syn queue that should
still ensure reasonable service even without cookies under attack.
Also SYN-RECV sockets are stored in a special data structure optimized
to use minimal resources.

It is far from the classical head-drop method that was so vulnerable
to syn flooding.

-Andi


Re: [PATCH] Add IPv6 support to TCP SYN cookies

2008-02-05 Thread Andi Kleen
On Tue, Feb 05, 2008 at 10:29:28AM -0800, Glenn Griffin wrote:
> > Syncookies are discouraged these days. They disable too many
> > valuable TCP features (window scaling, SACK) and even without them
> > the kernel is usually strong enough to defend against syn floods
> > and systems have much more memory than they used to be.
> >
> > So I don't think it makes much sense to add more code to it, sorry.
> 
> As you say the kernel is usually strong enough to defend against syn flood
> attacks, but what about the situations where it isn't?  As valuable as the TCP
> features are I would give them up if it means I'm able to connect to my sshd
> port when I otherwise would be unable to.  While increased synq sizes, better
> dropping algorithms, and minisocks are a great way to mitigate the attacks and
> in most cases are enough, there are situations where syncookies are nice.

Have you seen such a case in practice with a modern kernel? 

They also cause problems unfortunately; e.g. there is no real flow control
for connections anymore in the non-DoS case.

> Regardless, I would say as long as ipv4 has syncookie support it will
> accurately be viewed as a deficiency of ipv6 if it lacks support.  So perhaps
> the discussion should be we whether all the other defenses are enough to
> warrant the removal of syncookie support from ipv4.  That topic may bring in
> more opinions.

That is essentially what Alan and I were discussing.
> 
> > Besides you should really move it to the ipv6 module, right now the code
> > would be always compiled in even for ipv4 only kernels.
> 
> That is correct.  I will gladly move it into it's own section within 
> net/ipv6/.
> Do you have any problem using the same CONFIG and sysctl variables as the ipv4
> implementation?

No.

-Andi


Re: [PATCH] Add IPv6 support to TCP SYN cookies

2008-02-05 Thread Andi Kleen
On Tue, Feb 05, 2008 at 04:03:01PM +, Alan Cox wrote:
> > Also your sub PC class appliances rarely run LISTEN servers anyways
> > that are open to the world.
> 
> Really. The ones that first come to mind often have exposed ports
> including PDA devices and phones. (Ditto low end PC boxes - portscan an
> EEPC some day ;))

What kind of LISTEN ports? And does it matter if they're DoS'ed?

The only one I can think of right now would be ident and frankly nobody
will really care if that one works or not.

If it's just the management interface etc. (which should really be firewalled) 
then likely not.


> Is the other stuff enough - good question, and can be measured easily
> enough on a little dlink router or similar.

My guess would be that it is.

If it's not, it would probably be better to look at improving the standard
queue management again; e.g. re-add RED.

-Andi



Re: [PATCH] Add IPv6 support to TCP SYN cookies

2008-02-05 Thread Andi Kleen
On Tue, Feb 05, 2008 at 03:42:13PM +, Alan Cox wrote:
> > Syncookies are discouraged these days. They disable too many 
> > valuable TCP features (window scaling, SACK) and even without them 
> > the kernel is usually strong enough to defend against syn floods
> > and systems have much more memory than they used to be.
> 
> Somewhat untrue. Network speeds have risen dramatically, the number of

With "strong" I meant that Linux has much better algorithms to handle the standard
syn queue (syncookies were originally added when it had only dumb head drop),
and there are minisocks, which require significantly less overhead
to manage than full sockets (less memory etc.).

When I originally wrote syncookies, none of that was true.

> appliances running Linux that are not PC class means memory has fallen
> not risen and CPU has been pretty level.
> 
> > So I don't think it makes much sense to add more code to it, sorry.
> 
> I think it makes a lot of sense 

I have my doubts. It would probably be better to recheck everything
and then remove syncookies.

Also, your sub-PC-class appliances rarely run LISTEN servers
that are open to the world anyway.

-Andi


Re: [PATCH] Add IPv6 support to TCP SYN cookies

2008-02-05 Thread Andi Kleen
On Mon, Feb 04, 2008 at 03:01:01PM -0800, Glenn Griffin wrote:
> Add IPv6 support to TCP SYN cookies.  This is written and tested against
> 2.6.24, and applies cleanly to linus' current HEAD (d2fc0b).  Unfortunately
> linus' HEAD breaks my sky2 card at the moment, so I'm unable to test against
> that.  I see no reason why it would be affected though.  Comments/suggestions
> are welcome.

Syncookies are discouraged these days. They disable too many 
valuable TCP features (window scaling, SACK) and even without them 
the kernel is usually strong enough to defend against syn floods
and systems have much more memory than they used to.

So I don't think it makes much sense to add more code to it, sorry.

Besides, you should really move it to the ipv6 module; right now the code
would always be compiled in, even for ipv4-only kernels.

-Andi


Re: Slow OOM in netif_RX function

2008-02-04 Thread Andi Kleen
> Nothing that looks like a struct net_device. All the dumped leaked slab
> look the same until "45 20 05 d8" (the ascii 'E' on the 3rd line).

45 ... is often the start of an IP header (IPv4, header length 5*4=20 bytes).

You could dump them to a file (e.g. using a sial script) and then
look at them with tcpdump or similar to get an idea what kinds 
of packets they are.

-Andi



Re: What were the reasons of having mandatory IP address at AX.25 interfaces ?

2008-02-04 Thread Andi Kleen
Matti Aarnio <[EMAIL PROTECTED]> writes:

> .. the original reason was apparently that _ifconfig_ blew up
> when it saw protocols that it didn't understand on network
> interfaces.  Possibly when there was no IP protocol on an
> interface.

It's not only ifconfig; a lot of programs use SIOCGIFCONF
to query IP addresses.

-Andi


No pmtu probing on retransmits?

2008-02-03 Thread Andi Kleen

Hallo,

While looking for something else in tcp_output.c I noticed that
MTU probing seems to be done only in tcp_write_xmit (when
packets come directly from process context), but not via the
timer-driven retransmit path (tcp_retransmit_skb). Is that intentional?
It looks quite weird. I would normally assume PMTU blackholes usually get
detected on retransmit timeouts. Or am I missing something?

You seem to have assumed interrupt context at least,
because tcp_mtu_probe() uses GFP_ATOMIC, which is only needed for
interrupts. Currently it is only called from process context, I think.

-Andi



Re: [PATCH] Disable TSO for non standard qdiscs

2008-02-03 Thread Andi Kleen
On Sun, Feb 03, 2008 at 09:57:10AM +1100, Herbert Xu wrote:
> Andi Kleen <[EMAIL PROTECTED]> wrote:
> >> Then change TBF to use skb_gso_segment?  Be careful, the fact that
> > 
> > That doesn't help because it wants to interleave packets
> > from different streams to get everything fair and smooth. The only 
> > good way to handle that is to split it up and the simplest way to do 
> > this is to just tell TCP to not do GSO in the first place.
> 
> Actually if we're going to do this I'd prefer you to call skb_gso_segment
> instead because that lets us at least bypass netfilter which is one of
> the key benefits of software GSO.

Ok. The only problem I see right now is that skb_segment() seems to overuse
GFP_ATOMIC which would make it unreliable, but that is something that can be 
fixed.

-Andi


Re: [PATCH] Disable TSO for non standard qdiscs

2008-02-01 Thread Andi Kleen
> Turning TSO off at netdev registration time with a warning will be a
> cleaner IMO. Or alternatively introducing a kernel-config "I know what

You mean the qdisc should force TSO off on the underlying device? 

> TSO is" option which is then used at netdev registration. From a
> usability perspective it would make more sense to just keep ethtool as
> the only way to configure TSO. 

Ok, reasonable point.

-Andi


Re: [PATCH] Disable TSO for non standard qdiscs

2008-02-01 Thread Andi Kleen
On Fri, Feb 01, 2008 at 01:58:30PM -0800, Rick Jones wrote:
> >>Does this also imply that JumboFrames interacts badly with these qdiscs? 
> >>Or IPoIB with its 65000ish byte MTU?
> >
> >
> >Correct. Of course it is always relative to the link speed. So if your
> >link is 10x faster and your packets 10x bigger you can get similarly
> >smooth shaping.
> 
> If the later-in-thread mentioned person shaping for their DSL line 
> happens to have enabled JumboFrames on their GbE network, will/should 
> the qdisc negate that? 

I don't think so, mostly because jumbo frames are not enabled by
default.  I'm only concerned about usable defaults there -- if you
set non default options you should certainly know what you're doing.

There are other reasons to not use jumbo frames anyways; e.g. a lot 
of cards still do not support SG for them but only process 
them as a single continuous buffer in memory so you often run 
into memory fragmentation problems.

> Or is the qdisc currently assuming that the 
> remote end of the DSL will have asked for a smaller MSS?

First there are lots of different qdiscs that all do different things.
Take a look at net/sched/*. Then they usually don't strictly require particular
MTUs (or know anything about MSS), but tend to work better with smaller
MTUs because that allows more choices in packet scheduling.  Generally
the larger your packets the less they can be scheduled.

-Andi



Re: [PATCH] Disable TSO for non standard qdiscs

2008-02-01 Thread Andi Kleen
> The TSO defer logic is based on your congestion window and current
> window size.  So the actual frame sizes hitting your NIC attached to
> your DSL probably aren't anywhere near 64KB, but probably more in line
> with whatever your window size is for DSL.

DSL windows can be quite large because a lot of DSL lines have quite a
long latency due to error correction. And with ADSL2 we have up to 16Mbit
now.

> I think we're having more of a disagreement of what is considered the
> "normal case" user.  If you are on a slow link, such as a DSL/cable
> line, your TCP window/congestion window aren't going to be big enough to
> generate large TSO's, so what is the issue?  But disabling TSO, say on a

64k TSOs are likely even with DSL. Anyways even with smaller TSOs the
change still makes sense because each increase makes packet scheduling
less smooth.

-Andi


Re: Slow OOM in netif_RX function

2008-02-01 Thread Andi Kleen
On Fri, Feb 01, 2008 at 02:51:40PM +0200, Ivan Dichev wrote:
> Arnaldo Carvalho de Melo wrote:
> > On Fri, Jan 25, 2008 at 02:21:08PM +0100, Andi Kleen wrote:
> >   
> >> "Ivan H. Dichev" <[EMAIL PROTECTED]> writes:
> >> 
> >>> What could happen if I put different Lan card in every slot?
> >>> In ex. to-private -> 3com
> >>>   to-inet-> VIA
> >>>   to-dmz -> rtl8139
> >>> And then to look which RX function is consuming the memory.
> >>> (boomerang_rx, rtl8139_rx, ... etc) 
> >>>   
> >> The problem is unlikely to be in the driver (these are both
> >> well tested ones) but more likely your complicated iptables setup somehow
> >> triggers a skb leak.
> >>
> >> There are unfortunately no shrink wrapped debug mechanisms in the kernel
> >> for leaks like this (ok you could enable CONFIG_NETFILTER_DEBUG 
> >> and see if it prints something interesting, but that's a long shot).
> >>
> >> If you wanted to write a custom debugging patch I would do something like 
> >> this:
> >>
> >> - Add two new integer fields to struct sk_buff: a time stamp and a integer 
> >> field
> >> - Fill the time stamp with jiffies in alloc_skb and clear the integer field
> >> - In __kfree_skb clear the time stamp
> >> - For all the ipt target modules in net/ipv4/netfilter/*.c you use change 
> >> their 
> >> ->target functions to put an unique value into the integer field you added.
> >> - Do the same for the pkt_to_tuple functions for all conntrack modules
> >>
> >> Then when you observe the leak take a crash dump using kdump on the router 
> >> and then use crash to dump all the slab objects for the sk_head_cache.
> >> Then look for any that have an old time stamp and check what value they
> >> have in the integer field. Then the netfilter function who set that unique 
> >> value 
> >> likely triggered the leak somehow.
> >> 
> >
> > I wrote some systemtap scripts that do parts of what you suggest, and at
> > least for the timestamp there was no need to add a new field to struct
> > sk_buff, I just reuse skb->timestamp, as it is only used when we use a
> > packet sniffer. Here it is for reference, but it needs some tapsets I
> > wrote, so I'll publish this git repo in git.kernel.org, perhaps it can
> > be useful in this case as a starting point. Find another unused field
> > (hint: I know that at least 4 bytes on 64 bits is present as a hole) and
> > you're done, no need to rebuild the kernel :)
> >
> > http://git.kernel.org/?p=linux/kernel/git/acme/nettaps.git
> >
> > - Arnaldo
> >   
> Thanks to everyone for the given ideas.
> I am not kernel guru so writing patch is difficult. This is a production
> server and it is quite difficult to debug (only at night)
> I removed some iptables exotics -  recent , ulog, string , but no effect.
> Since we can reach OOM most of the memory is going to be filled with the
> leak, and we are thinking to try to dump and analyze it.

You could perhaps use crash to look for leaked packets and then 
see if you can see a pattern, as in what types of packets they are.

Still I expect that without modifying the kernel to add some more
netfilter tracing it will be difficult to diagnose this.

I suppose it would be possible to write a suitable systemtap script
to also trace this without modifying the kernel, although it will
probably not be easy and more complicated than just changing the C code.

-Andi


Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Andi Kleen
> The problem with ethtool is that it's a non-obvious nerd knob.  At
> the least the ethtool documentation should be updated to indicate that 
> activating TSO effects tc accuracy.

TSO tends to be activated by default in the driver; very few people who use it
even know that ethtool exists or what TSO is.

-Andi


Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Andi Kleen
On Thu, Jan 31, 2008 at 09:33:44PM +0100, Jarek Poplawski wrote:
> Andi Kleen wrote, On 01/31/2008 08:34 PM:
> 
> >> TSO by nature is bursty.  But disabling TSO without the option of having
> >> it on or off to me seems to aggressive.  If someone is using a qdisc
> >> that TSO is interfering with the effectiveness of the traffic shaping,
> >> then they should turn off TSO via ethtool on the target device.  Some
> > 
> > The philosophical problem I have with this suggestion is that I expect
> > that the large majority of users will be more happy with disabled TSO
> > if they use non standard qdiscs and defaults that do not fit 
> > the majority use case are bad.
> 
> 
> If you mean the large majority of the large minority of users, who use
> non standard qdiscs - I agree - this is really the philosophical problem!

[] Sorry if you had a point in this email I missed it.

-Andi


Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Andi Kleen
> Well, it could be just that when using such qdiscs TSO would be
> disabled, but the user could override this by using ethtool after
> loading the qdiscs.

If anything TC, not ethtool. Do you have a useful scenario where
GSO makes sense with TBF et al.?

-Andi


Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Andi Kleen
On Thu, Jan 31, 2008 at 03:42:54PM -0800, Waskiewicz Jr, Peter P wrote:
> > Well, it could be just that when using such qdiscs TSO would be
> > disabled, but the user could override this by using ethtool after
> > loading the qdiscs.
> 
> I still disagree with this.  The qdisc should not cause anything to happen to 
> feature flags on the device. It's the scheduling layer and really shouldn't 
> care about what features the device supports or not.  If someone has an issue 
> with a feature hurting performance or causing odd behavior when using a 
> qdisc, then they should disable the feature on the device using the 
> appropriate tools provided.  If it's the qdisc causing issues, then either 
> the qdisc needs to be fixed, or it should be documented what features are 
> recommended to be on and off with the qdisc.  I don't agree that the 
> scheduling layer should affect features on an underlying device.

You seem to only look at this from a high level theoretical standpoint.

But more down to earth: do you have a useful scenario where it makes
sense to do shaping or another qdisc on GSO packets? My take is that
when you decide to do any packet scheduling you really want to do
it on wire packets, not on units that are just an internal stack
implementation detail.

-Andi


Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Andi Kleen
On Thu, Jan 31, 2008 at 11:14:34AM -0800, Rick Jones wrote:
> Sounds like the functionality needs to be in the DSL bridge :) (or the 
> "router" in the same case) Particularly since it might be getting used 
> by more than one host on the GbE switch.

Possible, but that is usually not where it is done in the real world. Setups like
WonderShaper which do this on the host side are pretty common.

> then the qdisc could/should place a cap on the size of a 'TSO' based on 
> the bitrate (and perhaps input as to how much time any one "burst" of 
> data should be allowed to consume on the network) and pass that up the 
> stack?  right now you seem to be proposing what is effectively a cap of 
> 1 MSS.

Hmm, that would probably be possible for TBF, but I'm not sure this can be
really done in a useful way for the more complicated qdiscs. Especially
since they would likely need to turn on/off GSO regularly when dynamic
circumstances change and there is not really a good way to affect a socket 
after it was created.

-Andi


Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Andi Kleen
> TSO by nature is bursty.  But disabling TSO without the option of having
> it on or off to me seems to aggressive.  If someone is using a qdisc
> that TSO is interfering with the effectiveness of the traffic shaping,
> then they should turn off TSO via ethtool on the target device.  Some

The philosophical problem I have with this suggestion is that I expect
that the large majority of users will be more happy with disabled TSO
if they use non standard qdiscs and defaults that do not fit 
the majority use case are bad.

Basically you're suggesting that nearly everyone using tc should learn about
another obscure command.

-Andi


Re: hard hang through qdisc

2008-01-31 Thread Andi Kleen
On Thu, Jan 31, 2008 at 07:46:02PM +0100, Patrick McHardy wrote:
> Patrick McHardy wrote:
>> Andi Kleen wrote:
>>
>>> Can you please try with the above config?
>> I'll give it a try later.
>
>
> I took all options from that config that seemed possibly
> related (qdiscs, no hrtimers, no nohz, slab, ...), but
> still can't reproduce it.

Ok I'll do bisect then later (not today anymore likely) 

> Does it also crash if you use more reasonable parameters?

I managed to make it crash with different parameters too,
but with good parameters it did set a qdisc successfully
and appeared to work.

-Andi


Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Andi Kleen
> So, at what timescale do people using these qdiscs expect things to 
> appear "smooth?"  64KB of data at GbE speeds is something just north of 
> half a millisecond unless I've botched my units somewhere.

One typical use case for TBF is talking to a DSL bridge that
is connected through a GBit Ethernet switch. For these DSL connections it gives
much better behaviour to shape the traffic to slightly below
your external link speed so that you can e.g. prioritize packets properly.
But the actual external link speed is much lower than GbE.
A lot of GbE NICs enable TSO by default.

-Andi


Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Andi Kleen
On Thu, Jan 31, 2008 at 10:26:19AM -0800, Rick Jones wrote:
> Andi Kleen wrote:
> >TSO interacts badly with many queueing disciplines because they rely on 
> >reordering packets from different streams and the large TSO packets can 
> >make this difficult. This patch disables TSO for sockets that send over 
> >devices with non standard queueing disciplines. That's anything but noop 
> >or pfifo_fast and pfifo right now.
> 
> Does this also imply that JumboFrames interacts badly with these qdiscs? 
>  Or IPoIB with its 65000ish byte MTU?

Correct. Of course it is always relative to the link speed. So if your
link is 10x faster and your packets 10x bigger you can get similarly
smooth shaping.

-Andi


Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Andi Kleen
On Thu, Jan 31, 2008 at 07:21:20PM +0100, Patrick McHardy wrote:
> Andi Kleen wrote:
> >>Then change TBF to use skb_gso_segment?  Be careful, the fact that
> >
> >That doesn't help because it wants to interleave packets
> >from different streams to get everything fair and smooth. The only 
> >good way to handle that is to split it up and the simplest way to do 
> >this is to just tell TCP to not do GSO in the first place.
> 
> 
> Thats not correct, TBF keeps packets strictly ordered unless

My point was that without TSO different submitters will interleave
their streams (because they compete for the qdisc submission)
and then you end up with a smooth rate over time for all of them.

If you submit in large chunks only (as TSO does) it will always 
be more bursty and that works against the TBF goal.

For a single submitter you would be correct.

-Andi


Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Andi Kleen
> Then change TBF to use skb_gso_segment?  Be careful, the fact that

That doesn't help because it wants to interleave packets
from different streams to get everything fair and smooth. The only 
good way to handle that is to split it up and the simplest way to do 
this is to just tell TCP to not do GSO in the first place.

-Andi



Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Andi Kleen
On Thu, Jan 31, 2008 at 07:01:00PM +0100, Patrick McHardy wrote:
> Andi Kleen wrote:
> >>Fix the broken qdisc instead.
> >
> >What do you mean? I don't think the qdiscs are broken.
> >I cannot think of any way how e.g. TBF can do anything useful
> >with large TSO packets.
> 
> 
> Someone posted a patch some time ago to calculate the amount
> of tokens needed in max_size portions and use that, but IMO
> people should just configure TBF with the proper MTU for TSO.

TBF with 64k atomic units will always be chunky and uneven. I don't
think that's a useful goal. 

-Andi



Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Andi Kleen
> Fix the broken qdisc instead.

What do you mean? I don't think the qdiscs are broken.
I cannot think of any way how e.g. TBF can do anything useful
with large TSO packets.

-Andi


Re: [PATCH] [1/1] Deprecate tcp_tw_{reuse,recycle}

2008-01-31 Thread Andi Kleen
On Thu, Jan 31, 2008 at 08:41:38AM -0800, Ben Greear wrote:
> I don't know exactly how the tcp_tw_recycle works, but it seems like it 
> could be made to only
> take affect when all local ports are used up in TIME_WAIT.  

TIME-WAIT does not actually use up local ports; it uses up remote ports
because it is done on the LISTEN socket, which always has a fixed
local port. And it has no idea how many ports the other end has left.

-Andi


Re: hard hang through qdisc

2008-01-31 Thread Andi Kleen

> Works for me:
>
> qdisc tbf 8001: root rate 1000bit burst 10b/8 mpu 0b lat 720.0ms
>   Sent 0 bytes 0 pkt (dropped 9, overlimits 0 requeues 0)
>   rate 0bit 0pps backlog 0b 0p requeues 0
>
> Packets are dropped as expected.

I can still reproduce it on 64bit with http://halobates.de/config-qdisc
(all qdiscs etc. compiled in for testing) 
with latest git tip (8af03e782cae1e0a0f530ddd22301cdd12cf9dc0)

The command line above causes an instant hang. Also tried it with
newer iproute2 (the original one was quite old), but it didn't make
a difference.

Perhaps it's related to what qdiscs are enabled? Can you please
try with the above config?

If everything fails I can do a bisect later.

-Andi


Re: hard hang through qdisc

2008-01-31 Thread Andi Kleen

> -
> lilsol:~# tc qdisc add dev eth0 root tbf rate 1000 burst 10 limit 100
> lilsol:~# uname -a
> Linux lilsol 2.6.24 #1 PREEMPT Sun Jan 27 09:22:00 EST 2008 i686

Can you try it again with current git mainline?

> GNU/Linux
> lilsol:~# tc qdisc ls dev eth0
> qdisc tbf 8001: root rate 1000bit burst 10b lat 737.3ms
> lilsol:~#
> ---
> 
> What do your patches do?

Nothing really related to qdiscs.  I suspect it came from the git mainline patch
I had (but forgot to mention in the first email) 

-Andi



[PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Andi Kleen

TSO interacts badly with many queueing disciplines because they rely on 
reordering packets from different streams and the large TSO packets can 
make this difficult. This patch disables TSO for sockets that send over 
devices with non standard queueing disciplines. That's anything but noop 
or pfifo_fast and pfifo right now.

Longer term other queueing disciplines could be checked if they
are also ok with TSO. If yes they can set the TCQ_F_GSO_OK flag too.

It is still enabled for the standard pfifo_fast because that will never
reorder packets with the same type-of-service. This means 99+% of all users
will still be able to use TSO just fine.

The status is only set up at socket creation, so a changed route
will not reenable TSO on an existing socket. I don't think that's a
problem though.

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 include/net/sch_generic.h |1 +
 net/core/sock.c   |3 +++
 net/sched/sch_generic.c   |5 +++--
 3 files changed, 7 insertions(+), 2 deletions(-)

Index: linux/include/net/sch_generic.h
===
--- linux.orig/include/net/sch_generic.h
+++ linux/include/net/sch_generic.h
@@ -31,6 +31,7 @@ struct Qdisc
 #define TCQ_F_BUILTIN  1
 #define TCQ_F_THROTTLED2
 #define TCQ_F_INGRESS  4
+#define TCQ_F_GSO_OK   8
int padded;
struct Qdisc_ops*ops;
u32 handle;
Index: linux/net/sched/sch_generic.c
===
--- linux.orig/net/sched/sch_generic.c
+++ linux/net/sched/sch_generic.c
@@ -307,7 +307,7 @@ struct Qdisc_ops noop_qdisc_ops __read_m
 struct Qdisc noop_qdisc = {
.enqueue=   noop_enqueue,
.dequeue=   noop_dequeue,
-   .flags  =   TCQ_F_BUILTIN,
+   .flags  =   TCQ_F_BUILTIN | TCQ_F_GSO_OK,
.ops=   &noop_qdisc_ops,
.list   =   LIST_HEAD_INIT(noop_qdisc.list),
 };
@@ -325,7 +325,7 @@ static struct Qdisc_ops noqueue_qdisc_op
 static struct Qdisc noqueue_qdisc = {
.enqueue=   NULL,
.dequeue=   noop_dequeue,
-   .flags  =   TCQ_F_BUILTIN,
+   .flags  =   TCQ_F_BUILTIN | TCQ_F_GSO_OK,
.ops=   &noqueue_qdisc_ops,
.list   =   LIST_HEAD_INIT(noqueue_qdisc.list),
 };
@@ -538,6 +538,7 @@ void dev_activate(struct net_device *dev
printk(KERN_INFO "%s: activation failed\n", 
dev->name);
return;
}
+   qdisc->flags |= TCQ_F_GSO_OK;
list_add_tail(&qdisc->list, &dev->qdisc_list);
} else {
qdisc =  &noqueue_qdisc;
Index: linux/net/core/sock.c
===
--- linux.orig/net/core/sock.c
+++ linux/net/core/sock.c
@@ -112,6 +112,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1062,6 +1063,8 @@ void sk_setup_caps(struct sock *sk, stru
 {
__sk_dst_set(sk, dst);
sk->sk_route_caps = dst->dev->features;
+   if (!(dst->dev->qdisc->flags & TCQ_F_GSO_OK))
+   sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
if (sk->sk_route_caps & NETIF_F_GSO)
sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;
if (sk_can_gso(sk)) {
Index: linux/net/sched/sch_fifo.c
===
--- linux.orig/net/sched/sch_fifo.c
+++ linux/net/sched/sch_fifo.c
@@ -62,6 +62,7 @@ static int fifo_init(struct Qdisc *sch, 
 
q->limit = ctl->limit;
}
+   sch->flags |= TCQ_F_GSO_OK;
 
return 0;
 }


Re: hard hang through qdisc II

2008-01-31 Thread Andi Kleen
On Thursday 31 January 2008 13:21:00 Andi Kleen wrote:
> 
> I just managed to hang a 2.6.24 (+ some non network patches) kernel 
> with the following (nonsensical) command

Correction: the kernel was actually a git linus kernel with David's 
recent merge included.

I found it's pretty easy to hang the kernel with various tbf parameters.

-Andi

> tc qdisc add dev eth0 root tbf rate 1000 burst 10 limit 100
> 
> No oops or anything just hangs. While I understand root can
> do bad things just hanging like this seems a little extreme.
> 
> -Andi
> 




hard hang through qdisc

2008-01-31 Thread Andi Kleen


I just managed to hang a 2.6.24 (+ some non network patches) kernel 
with the following (nonsensical) command

tc qdisc add dev eth0 root tbf rate 1000 burst 10 limit 100

No oops or anything just hangs. While I understand root can
do bad things just hanging like this seems a little extreme.

-Andi


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Andi Kleen
Bruce Allen <[EMAIL PROTECTED]> writes:
>
> Important note: we ARE able to get full duplex wire speed (over 900
> Mb/s simulaneously in both directions) using UDP.  The problems occur
> only with TCP connections.

Another issue with full duplex TCP not mentioned yet is that if TSO is used 
the output  will be somewhat bursty and might cause problems with the 
TCP ACK clock of the other direction because the ACKs would need 
to squeeze in between full TSO bursts.

You could try disabling TSO with ethtool.

-Andi


Re: [PATCH] [1/1] Deprecate tcp_tw_{reuse,recycle}

2008-01-30 Thread Andi Kleen

> I believe the problem was that all of my ports were used up with
> TIME_WAIT sockets and so it couldn't create more.  My test
> case was similar to this:

Ah, that's simple to solve then: use more IP addresses and bind
to them round-robin in your user program.

Arguably the Linux TCP code should be able to do this by itself
when enough IP addresses are available, but it's not very hard
to do in user space using bind(2).
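
A rough user space sketch of that (the addresses and the function name
are only illustrative):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static const char *local_ips[] = { "192.0.2.10", "192.0.2.11", "192.0.2.12" };
static unsigned int next_ip;

/* Connect to dst, binding the local end to one of several local addresses
   in round robin so the TIME_WAIT load spreads over more 4-tuples. */
int connect_rr(const struct sockaddr_in *dst)
{
        struct sockaddr_in src;
        int n = sizeof(local_ips) / sizeof(local_ips[0]);
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0)
                return -1;

        memset(&src, 0, sizeof(src));
        src.sin_family = AF_INET;
        src.sin_port = 0;               /* let the kernel pick the local port */
        inet_pton(AF_INET, local_ips[next_ip++ % n], &src.sin_addr);

        if (bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0 ||
            connect(fd, (const struct sockaddr *)dst, sizeof(*dst)) < 0) {
                close(fd);
                return -1;
        }
        return fd;
}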

BTW it's also a very unusual case -- in most cases there are more
remote IP addresses.

> So, is there a better way to max out the connections per second without 
> having to use tcp_tw_recycle?

Well did you profile where the bottle necks were?

Perhaps also just increase the memory allowed for TCP sockets.

-Andi




Re: [PATCH] [1/1] Deprecate tcp_tw_{reuse,recycle}

2008-01-30 Thread Andi Kleen
On Wednesday 30 January 2008 20:22, Ben Greear wrote:

> We use these features to enable creating very high numbers of short-lived
> TCP connections, primarily used as a test tool for other network
> devices.

Hopefully these other network devices don't do any NAT then
or don't otherwise violate the IP-matches-PAWS assumption.
Most likely they do actually, so enabling TW recycle
for testing is probably not even safe for you.

Modern systems have a lot of RAM so even without tw recycle
you should be able to get a very high number of connections.
A timewait socket is around 128 bytes on 64bit; this means
with a GB of memory you can already support > 8 Million TW sockets.
On 32bit it's even more.

The optimization was originally written at a time when 64MB systems
were common.

If you don't care about data integrity, have you considered just
using some custom UDP based protocol or running one of the user space
TCP stacks and disabling all data integrity features? If you do care about
data integrity then you should probably disable tw recycle anyways.

The deprecation period will be some time (several months) so you'll have 
enough time to migrate to another method

> Perhaps just document the adverse affects and/or have it print out a
> warning on the console whenever the feature is enabled?

"This feature is insecure and does not work on the internet or with NAT" ? 

Somehow this just does not seem right to me. 

-Andi


Re: [PATCH] [1/2] Skip empty hash buckets faster in /proc/net/tcp

2008-01-30 Thread Andi Kleen
On Wed, Jan 30, 2008 at 09:03:16AM -0800, Roland Dreier wrote:
>  > On a 2GB Core2 system here I see a time cat /proc/net/tcp > /dev/null
>  > constently dropping from 0.44s to 0.4-0.8s system time with this change.
> 
> Seems like there must be a typo in either the before or after times
> you report here?

Yes thanks, it was 0.44s down to 0.04s/0.08s here. Somehow the zeroes
got lost.

-Andi


[PATCH] [1/1] Deprecate tcp_tw_{reuse,recycle}

2008-01-30 Thread Andi Kleen

We've recently had a long discussion about the CVE-2005-0356 time stamp
denial-of-service attack. It turned out that Linux is only vulnerable to
this problem when tcp_tw_recycle is enabled (which it is not by default).

In general these two options are not really usable in today's internet
because they make the (often false) assumption that a single IP address
has a single TCP time stamp / PAWS clock. This assumption breaks both
NAT/masquerading and also opens Linux to denial of service attacks (see
the CVE description).

Due to these numerous problems I propose to remove this code for 2.6.26.

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

Index: linux/Documentation/feature-removal-schedule.txt
===
--- linux.orig/Documentation/feature-removal-schedule.txt
+++ linux/Documentation/feature-removal-schedule.txt
@@ -354,3 +354,15 @@ Why:   The support code for the old firmwa
and slightly hurts runtime performance. Bugfixes for the old firmware
are not provided by Broadcom anymore.
 Who:   Michael Buesch <[EMAIL PROTECTED]>
+
+---
+
+What:   Support for /proc/sys/net/ipv4/tcp_tw_{reuse,recycle} = 1
+When:   2.6.26
+Why:    Enabling either of those makes Linux TCP incompatible with masquerading and
+        also opens Linux to the CVE-2005-0356 denial of service attack.  And these
+        optimizations are explicitly disallowed by some benchmarks. They also have
+        been disabled by default for more than ten years so they're unlikely to be
+        used much. Due to these fatal flaws it doesn't make sense to keep the code.
+Who:    Andi Kleen <[EMAIL PROTECTED]>
+


Re: [PATCH] [2/2] Remove some unnecessary gotos in established_get_first()

2008-01-30 Thread Andi Kleen
Oliver Neukum <[EMAIL PROTECTED]> writes:

> On Wednesday, 30 January 2008 09:01:10, Andi Kleen wrote:
>> 
>> gcc does not generate different code for return foo vs bar = foo; goto x;
>> x: return bar; So convert it all to direct returns for better readability.
>
> Now suppose somebody needs to change locking. He'll have to convert
> it back. 

Please take a look at the overall /proc/net/tcp logic. Any locking 
change will be a major change to the code flow of the whole family
of functions.

-Andi


[PATCH] [1/2] Skip empty hash buckets faster in /proc/net/tcp

2008-01-30 Thread Andi Kleen

On most systems most of the TCP established/time-wait hash buckets are empty.
When walking the hash table for /proc/net/tcp their read locks would
always be acquired just to find out they're empty. This patch changes the code
to check first if the buckets have any entries before taking the lock, which
is much cheaper than taking a lock. Since the hash tables are large
this makes a measurable difference on processing /proc/net/tcp, 
especially on architectures with slow read_lock (e.g. PPC) 

On a 2GB Core2 system here I see a time cat /proc/net/tcp > /dev/null
constently dropping from 0.44s to 0.4-0.8s system time with this change.
This is with mostly empty hash tables.

On systems with slower atomics (like P4 or POWER4) or larger hash tables
(more RAM) the difference is much higher.

This can be noticeable because there are some daemons around who regularly
scan /proc/net/tcp.

Original idea for this patch from Marcus Meissner, but redone by me.

Cc: [EMAIL PROTECTED]
Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 net/ipv4/tcp_ipv4.c |   30 ++
 1 file changed, 18 insertions(+), 12 deletions(-)

Index: linux/net/ipv4/tcp_ipv4.c
===
--- linux.orig/net/ipv4/tcp_ipv4.c
+++ linux/net/ipv4/tcp_ipv4.c
@@ -2039,6 +2039,12 @@ static void *listening_get_idx(struct se
return rc;
 }
 
+static inline int empty_bucket(struct tcp_iter_state *st)
+{
+   return hlist_empty(&tcp_hashinfo.ehash[st->bucket].chain) &&
+   hlist_empty(&tcp_hashinfo.ehash[st->bucket].twchain);
+}
+
 static void *established_get_first(struct seq_file *seq)
 {
struct tcp_iter_state* st = seq->private;
@@ -2050,6 +2056,10 @@ static void *established_get_first(struc
struct inet_timewait_sock *tw;
rwlock_t *lock = inet_ehash_lockp(&tcp_hashinfo, st->bucket);
 
+   /* Lockless fast path for the common case of empty buckets */
+   if (empty_bucket(st))
+   continue;
+
read_lock_bh(lock);
sk_for_each(sk, node, &tcp_hashinfo.ehash[st->bucket].chain) {
if (sk->sk_family != st->family) {
@@ -2097,13 +2107,15 @@ get_tw:
read_unlock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket));
st->state = TCP_SEQ_STATE_ESTABLISHED;
 
-   if (++st->bucket < tcp_hashinfo.ehash_size) {
-   read_lock_bh(inet_ehash_lockp(&tcp_hashinfo, 
st->bucket));
-   sk = sk_head(&tcp_hashinfo.ehash[st->bucket].chain);
-   } else {
-   cur = NULL;
-   goto out;
-   }
+   /* Look for next non empty bucket */
+   while (++st->bucket < tcp_hashinfo.ehash_size &&
+   empty_bucket(st))
+   ;
+   if (st->bucket >= tcp_hashinfo.ehash_size)
+   return NULL;
+
+   read_lock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket));
+   sk = sk_head(&tcp_hashinfo.ehash[st->bucket].chain);
} else
sk = sk_next(sk);
 


[PATCH] [2/2] Remove some unnecessary gotos in established_get_first()

2008-01-30 Thread Andi Kleen

gcc does not generate different code for return foo vs bar = foo; goto x;
x: return bar; So convert it all to direct returns for better readability.

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

Index: linux/net/ipv4/tcp_ipv4.c
===
--- linux.orig/net/ipv4/tcp_ipv4.c
+++ linux/net/ipv4/tcp_ipv4.c
@@ -2065,8 +2065,7 @@ static void *established_get_first(struc
if (sk->sk_family != st->family) {
continue;
}
-   rc = sk;
-   goto out;
+   return sk;
}
st->state = TCP_SEQ_STATE_TIME_WAIT;
inet_twsk_for_each(tw, node,
@@ -2074,13 +2073,11 @@ static void *established_get_first(struc
if (tw->tw_family != st->family) {
continue;
}
-   rc = tw;
-   goto out;
+   return tw;
}
read_unlock_bh(lock);
st->state = TCP_SEQ_STATE_ESTABLISHED;
}
-out:
return rc;
 }
 
@@ -2100,10 +2097,8 @@ get_tw:
while (tw && tw->tw_family != st->family) {
tw = tw_next(tw);
}
-   if (tw) {
-   cur = tw;
-   goto out;
-   }
+   if (tw)
+   return tw;
read_unlock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket));
st->state = TCP_SEQ_STATE_ESTABLISHED;
 
@@ -2121,16 +2116,12 @@ get_tw:
 
sk_for_each_from(sk, node) {
if (sk->sk_family == st->family)
-   goto found;
+   return sk;
}
 
st->state = TCP_SEQ_STATE_TIME_WAIT;
tw = tw_head(&tcp_hashinfo.ehash[st->bucket].twchain);
goto get_tw;
-found:
-   cur = sk;
-out:
-   return cur;
 }
 
 static void *established_get_idx(struct seq_file *seq, loff_t pos)


Re: Slow OOM in netif_RX function

2008-01-25 Thread Andi Kleen
"Ivan H. Dichev" <[EMAIL PROTECTED]> writes:
>
> What could happen if I put different Lan card in every slot?
> In ex. to-private -> 3com
>   to-inet-> VIA
>   to-dmz -> rtl8139
> And then to look which RX function is consuming the memory.
> (boomerang_rx, rtl8139_rx, ... etc) 

The problem is unlikely to be in the driver (these are both
well tested ones) but more likely your complicated iptables setup somehow
triggers a skb leak.

There are unfortunately no shrink wrapped debug mechanisms in the kernel
for leaks like this (ok you could enable CONFIG_NETFILTER_DEBUG 
and see if it prints something interesting, but that's a long shot).

If you wanted to write a custom debugging patch I would do something like this:

- Add two new integer fields to struct sk_buff: a time stamp and an integer field
- Fill the time stamp with jiffies in alloc_skb and clear the integer field
- In __kfree_skb clear the time stamp
- For all the ipt target modules in net/ipv4/netfilter/*.c you use, change their
->target functions to put a unique value into the integer field you added.
- Do the same for the pkt_to_tuple functions for all conntrack modules

Then when you observe the leak take a crash dump using kdump on the router
and then use crash to dump all the slab objects for the skbuff_head_cache.
Then look for any that have an old time stamp and check what value they
have in the integer field. The netfilter function that set that unique value
likely triggered the leak somehow.

-Andi


Re: [PATCH 2.6.25 1/1]S2io: Multiqueue network device support implementation

2008-01-23 Thread Andi Kleen
> [Ram] I am assuming that this is with regards to msi-x interrupts. We

Yes. 

And avoiding bouncing locks for device state between CPUs.

> have done away with handling tx completion in the interrupt handler, and
> are instead handling them in the context of the transmit. The slow path,
> straggling transmit completions will be handled in the timer context.

Ok -- hopefully you don't have bad corner cases from this when the pipe
is not fully filled, causing longer latencies on completion.
Old NAPI sometimes suffered from such problems.

-Andi


Re: [PATCH 2.6.25 1/1]S2io: Multiqueue network device support implementation

2008-01-23 Thread Andi Kleen
Sreenivasa Honnur <[EMAIL PROTECTED]> writes:

> Multiqueue network device support implementation.
> - Added a loadable parameter "multiq" to enable/disable multiqueue support,
>   by default it is disabled.
> - skb->queue_mapping is not used for queue/fifo selection. FIFO selection is
>   based on IP-TOS value, 0x0-0xF TOS values are mapped to 8 FIFOs.

Standard way to use that would be using skb->priority

But I'm surprised you bother with TOS for multi queues at all. TOS
isn't a too important mechanism.

I would have thought the primary use case would be per CPU TX completion
interrupts. With that the queue should be selected based on
the the current CPU.

-Andi


Re: 2.6.24 regression: reference count leak in PPPoE

2008-01-20 Thread Andi Kleen

> It seems to have stopped when i stopped using ipsec and started using
> vpnc.

Kernel ipsec was active yes. However I normally don't see it although
it is often active, that was the first time I think

(except long ago) 

> (but no firm info - this was sporadic - happened every few months 
> or so) Are you using ipsec and dynamic IPs by any chance?

and the ISP uses dynamic IPs.

-Andi


2.6.24 regression: reference count leak in PPPoE

2008-01-20 Thread Andi Kleen

My workstation running 2.6.24-rc8 just hung during shutdown with an endless 
(or rather I didn't wait more than a few minutes) loop of 

unregister_netdev: waiting for ppp-device to become free. Usage count = 1

ppp-device was an active PPPoE device.

No more information currently.

-Andi


Re: [PATCH 1/8] [TCP]: Uninline tcp_set_state

2008-01-17 Thread Andi Kleen
"Ilpo Järvinen" <[EMAIL PROTECTED]> writes:
>
> Besides, it not always that obvious that gcc is able to determine "the
> constant state", considering e.g., the complexity in the cases with
> tcp_rcv_synsent_state_process, tcp_close_state, tcp_done. In such cases
> uninlining should be done and gcc is probably not able to mix both cases
> nicely for a single function?

I think it would be cleanest to completely unswitch the function 
and split into tcp_set_closed()  / tcp_set_established() / tcp_other_state() 
called by the callers directly.

That would probably lose the state trace, but I never found 
that useful for anything.

-Andi


Re: [PATCH/RFC] synchronize_rcu(): high latency on idle system

2008-01-13 Thread Andi Kleen

> I think it should be in netdev_unregister_kobject().  But that would
> only get rid of one of the two calls to synchronize_rcu() in the 
> unregister_netdev.

Would be already an improvement.

> The other synchronize_rcu() is for qdisc's and not sure if that one can
> be removed?

The standard way to remove such calls is to set a "deleted" flag in the object,
then check and ignore such objects in the reader, and finally remove the
object with call_rcu.
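
Roughly the pattern would look like this (a minimal sketch with made-up
names, not actual qdisc code):

#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/types.h>

struct obj {
        struct list_head list;
        bool dead;                      /* readers skip entries marked dead */
        struct rcu_head rcu;
};

static void obj_free_rcu(struct rcu_head *head)
{
        kfree(container_of(head, struct obj, rcu));
}

/* writer side, called under the list's lock */
static void obj_unlink(struct obj *o)
{
        o->dead = true;
        list_del_rcu(&o->list);
        call_rcu(&o->rcu, obj_free_rcu);  /* no synchronize_rcu() needed */
}

/* reader side, under rcu_read_lock():
 *      list_for_each_entry_rcu(o, &obj_list, list)
 *              if (!o->dead)
 *                      use(o);
 */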

I have not checked if that is really feasible for qdiscs.

-Andi


Re: [PATCH 001/001] ipv4: enable use of 240/4 address space

2008-01-11 Thread Andi Kleen
Vince Fuller <[EMAIL PROTECTED]> writes:

> from Vince Fuller <[EMAIL PROTECTED]>
>
> This set of diffs modify the 2.6.20 kernel to enable use of the 240/4
> (aka "class-E") address space as consistent with the Internet Draft
> draft-fuller-240space-00.txt.

Wouldn't it be wise to at least wait for it becoming an RFC first? 

-Andi


Re: SACK scoreboard

2008-01-09 Thread Andi Kleen
> Postponing freeing of the skb has major drawbacks. Some time ago I

Yes, the trick would be to make sure that it also does not tie up
too much memory. e.g. it would need some throttling at least.

Also the fast path of kmem_cache_free() is actually not that
much different from just putting something on a list so perhaps
it would not make that much difference.

-Andi


Re: SACK scoreboard

2008-01-08 Thread Andi Kleen
> It adds severe spikes in CPU utilization that are even moderate
> line rates begins to affect RTTs.
> 
> Or do you think it's OK to process 500,000 SKBs while locked
> in a software interrupt.

You can always push it into a work queue.  Even put it to
other cores if you want. 

In fact this is already done partly for the ->completion_queue.
Wouldn't be a big change to queue it another level down.

Also even freeing a lot of objects doesn't have to be
that expensive. I suspect the most cost is in taking
the slab locks, but that could be batched. Without
that the kmem_free fast path isn't particularly
expensive, as long as the headers are still in cache.

Long ago I had sk_buff optimized for the routing case so that freeing
can be all done with a single cache line. That is 
long gone, but could probably be restored.

But asking for protocol changes just to work around such an
internal implementation detail is ...  

> Perhaps you have another broken awk script to prove this :-)

Your hand waved numbers on inline sizes there were definitely worse 
than mine. 

-Andi



Re: SACK scoreboard

2008-01-08 Thread Andi Kleen
David Miller <[EMAIL PROTECTED]> writes:
>
> The big problem is that recovery from even a single packet loss in a
> window makes us run kfree_skb() for a all the packets in a full
> window's worth of data when recovery completes.

Why exactly is it a problem to free them all at once? Are you worried
about kernel preemption latencies?

-Andi


Re: Top 10 kernel oopses for the week ending January 5th, 2008

2008-01-08 Thread Andi Kleen
Linus Torvalds <[EMAIL PROTECTED]> writes:
>
> I usually just compile a small program like

Just use scripts/decodecode and cat the Code line into that.

> particularly good way to do it, and the old "ksymoops" program used to do 
> a pretty good job of this, but I'm used to that particular idiotic way 
> myself, since it's how I've basically always done it)
>
> After that, you still need to try to match up the assembly code with the 
> source code and figure out what variables the register contents actually 
> are all about. You can often try to do a
>
>   make the/affected/file.s


IMHO better is  

make the/file/xyz.lst

which gives you a listing with binary data in there which can be
grepped for.

But you should install a very recent binutils because older objdump -S
couldn't deal with unit-at-a-time compilers.

-Andi


Re: [PATCH 3/4] [XFRM]: Kill some bloat

2008-01-07 Thread Andi Kleen
On Mon, Jan 07, 2008 at 07:37:00PM -0800, David Miller wrote:
> From: Andi Kleen <[EMAIL PROTECTED]>
> Date: Tue, 8 Jan 2008 03:05:29 +0100
> 
> > On Mon, Jan 07, 2008 at 05:54:58PM -0800, David Miller wrote:
> > > I explicitly left them out.
> > > 
> > > Most of them are abstractions of common 2 or 3 instruction
> > > calculations, and thus should stay inline.
> > 
> > Definitely not in tcp.h. It has quite a lot of very long functions, of
> > which very few really need to be inline: (AFAIK the only one where 
> > it really makes sense is tcp_set_state due to constant evaluation; 
> > although I never quite understood why the callers just didn't 
> > call explicit functions to do these actions) 
> > 
> > % awk '  { line++ } ; /^{/ { start = line } ; /^}/ { n++; r += 
> > line-start-2; } ; END { print r/n }' < include/net/tcp.h 
> > 9.48889
> > 
> > The average function length is 9 lines.
> 
> The vast majority of them are one, two, and three liners.

% awk '  { line++ } ; /^{/ { total++; start = line } ; /^}/ { len=line-start-3; 
if (len > 4) l++; if (len >= 10) k++; } ; END { print total, l, l/total, k, 
k/total }' < include/net/tcp.h
68 28 0.411765 20 0.294118

41% are over 4 lines, 29% are >= 10 lines.

-Andi



Re: [PATCH 3/4] [XFRM]: Kill some bloat II

2008-01-07 Thread Andi Kleen
> % awk '  { line++ } ; /^{/ { start = line } ; /^}/ { n++; r += line-start-2; 
> } ; END { print r/n }' < include/net/tcp.h 
> 9.48889
> 
> The average function length is 9 lines.

Actually 8 -- the awk hack had an off-by-one. Still too long.

-Andi


Re: [PATCH 3/4] [XFRM]: Kill some bloat

2008-01-07 Thread Andi Kleen
On Mon, Jan 07, 2008 at 05:54:58PM -0800, David Miller wrote:
> From: Andi Kleen <[EMAIL PROTECTED]>
> Date: Tue, 08 Jan 2008 01:23:11 +0100
> 
> > David Miller <[EMAIL PROTECTED]> writes:
> > 
> > > Similarly I question just about any inline usage at all in *.c files
> > 
> > Don't forget the .h files. Especially a lot of stuff in tcp.h should
> > be probably in some .c file and not be inline.
> 
> I explicitly left them out.
> 
> Most of them are abstractions of common 2 or 3 instruction
> calculations, and thus should stay inline.

Definitely not in tcp.h. It has quite a lot of very long functions, of
which very few really need to be inline: (AFAIK the only one where 
it really makes sense is tcp_set_state due to constant evaluation; 
although I never quite understood why the callers just didn't 
call explicit functions to do these actions) 

% awk '  { line++ } ; /^{/ { start = line } ; /^}/ { n++; r += line-start-2; } 
; END { print r/n }' < include/net/tcp.h 
9.48889

The average function length is 9 lines.

-Andi


