Re: [PATCH 1/2] pcnet32: only allocate init_block dma consistent

2007-03-07 Thread Michael K. Edwards

On 3/7/07, Ralf Baechle <[EMAIL PROTECTED]> wrote:

GFP_* flags have no influence on caching or prefetching.


The zone modifier flags (like GFP_DMA) can in principle affect the
cache/prefetch policy, since they affect what physical address range
the memory is allocated from.  I don't know whether Linux uses this
effect intentionally, but I would not be the least bit surprised if
virtual machines commonly treat the traditional 16MB of low physical
memory differently from the rest of the address space.

Looking at the i386 implementation of dma_alloc_coherent (underlying
pci_alloc_consistent), it's not clear to me how this interferes with
cacheability of the allocated block.  Maybe it doesn't.  The alpha
version of pci_alloc_consistent, on the other hand, does interesting
things to make the memory visible to PCI.  Don, what arches did you
have in mind when you commented on caching issues?

Cheers,
- Michael


Re: [PATCH 1/2] pcnet32: only allocate init_block dma consistent

2007-03-07 Thread Michael K. Edwards

On 3/6/07, Ralf Baechle <[EMAIL PROTECTED]> wrote:

Prize question: why would this patch make a difference under VMware? :-)


Moving the struct pcnet32_private from the GFP_DMA32 init_block to the
GFP_KERNEL netdev allocation may be a win even on systems where
GFP_DMA32 is normally cached, because the private area will get read
ahead into cache when the netdev is touched.  (This could be a bigger
win if the most often accessed members were moved to the beginning of
the pcnet32_private struct.)
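
To make the shape of that concrete, here is a rough sketch of the
allocation split under discussion (illustrative only: the structures and
probe function are invented for the example, not lifted from pcnet32;
only alloc_etherdev(), netdev_priv(), and pci_alloc_consistent() are the
real kernel interfaces):

    /* Illustrative sketch, not the actual pcnet32 code. */
    #include <linux/pci.h>
    #include <linux/netdevice.h>
    #include <linux/etherdevice.h>

    struct example_init_block {        /* shared with the chip via DMA */
        __le16 mode;
        __le16 tlen_rlen;
        u8     phys_addr[6];
        __le32 rx_ring_dma;
        __le32 tx_ring_dma;
    };

    struct example_private {           /* CPU-only state, normally cached */
        struct example_init_block *init_block;
        dma_addr_t init_dma_addr;
        spinlock_t lock;
        /* hot members would go first so they prefetch with the netdev */
    };

    static int example_probe(struct pci_dev *pdev)
    {
        struct net_device *dev;
        struct example_private *lp;

        /* the private area rides along in the GFP_KERNEL netdev allocation */
        dev = alloc_etherdev(sizeof(*lp));
        if (!dev)
            return -ENOMEM;
        lp = netdev_priv(dev);
        spin_lock_init(&lp->lock);

        /* only the block the device actually DMAs is allocated consistent */
        lp->init_block = pci_alloc_consistent(pdev, sizeof(*lp->init_block),
                                              &lp->init_dma_addr);
        if (!lp->init_block) {
            free_netdev(dev);
            return -ENOMEM;
        }
        return 0;
    }

The point is simply that the chip only ever DMAs the init block, so that
is the only piece that needs to come out of the consistent-DMA pool.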

On the other hand, VMware may engage in some sort of sleight of hand
to keep the low 16MB or more of the VM's memory contiguously allocated
and warm in the real MMU (locked hugepage TLB entries?  I'm
speculating).  So having allocated the private area as part of a
DMA-able page may have silently spared you a page fault on access.

On the third hand, the new layout will rarely be a problem if the
whole netdev (including private area) fits in one page, since if you
were going to take a page fault you took it when you looked into the
netdev.  So it's hard to see how it could cause a performance
regression unless VMware loses its timeslice (and the TLB entry for
the page containing the netdev) in the middle of pcnet32_rx, etc.

Lennart is of course right that most VMware VMs are using vmxnet
instead, but they're also using distro kernels.  :-)  I find VMware
useful for certain kinds of kernel experimentation, and don't want to
fool with vmxnet every time I flip a kernel config switch.  Linus
kernels run just fine on VMware Workstation using piix, mptspi, and
pcnet32 (I'm running vanilla 2.6.20.1 right now).  I would think that
changes to those drivers should be regression tested under VMware, and
I'm happy to help.

Cheers,
- Michael


Re: [PATCH 1/2] pcnet32: only allocate init_block dma consistent

2007-03-06 Thread Michael K. Edwards

On 3/6/07, Ralf Baechle <[EMAIL PROTECTED]> wrote:

This small change btw. delivers about ~ 3% extra performance on a very
slow test system.


Has this change been tested / benchmarked under VMware?  pcnet32 is
the (default?) virtual device presented by VMware Workstation, and
that's probably a large fraction of its use in the field these days.
But then Don probably already knows that.  :-)

Cheers,
- Michael


Re: Extensible hashing and RCU

2007-03-04 Thread Michael K. Edwards

On 3/4/07, David Miller <[EMAIL PROTECTED]> wrote:

From: "Michael K. Edwards" <[EMAIL PROTECTED]>
> Before I implement, I design.  Before I design, I analyze.  Before I
> analyze, I prototype.  Before I prototype, I gather requirements.

How the heck do you ever get to writing ANYTHING if you work that way?


Oh, c'mon, the whole design process takes maybe 3 weeks for something
I'm going to spend 3 months implementing.  And half the time the
client mistakes the prototype for the implementation and tries to run
with it -- with the foreseeable consequences.  Smaller projects are a
lot less formal but the principle still holds: every time I've
implemented without actually doing the homework leading up to a
design, I've had cause to regret it.  That goes double for problems
that already have off-the-shelf solutions and only need improvement in
ease of use and robustness in hostile environments.


I certainly would never have written one single line of Linux kernel
code if I had to go through that kind of sequence to actually get to
writing code.


Lots of _code_ gets written as a side effect of each stage of that
sequence.  But none of that code should be mistaken for the
_implementation_.  Not when the implementation is going to ship inside
hundreds of thousands of widgets that people are going to stick in
their ears, or is going to monitor the integrity of an
intercontinental telecoms network.  Most shipping code, including most
code that I have written and large swathes of the Linux kernel, has
never really gone beyond the prototype stage.  There's no shame in
that; the shame would be in refusing to admit it.


And that's definitely not the "Linux way".  You code up ideas as soon
as you come up with one that has a chance of working, and you see what
happens.  Sure, you'll throw a lot away, but at least you will "know"
instead of "think".


I call that process "prototyping".  I used to do a lot of it; but I
have since found that thinking first is actually more efficient.
There is very little point in prototyping the square wheel again and
again and again.  And especially given that Linux already has plenty
of nice, round wheels, there aren't that many places left where you
can impress the world by replacing a square wheel with a hexagonal
one.  Replacing wheels with maglev tracks requires a real design
phase.


You have to try things, "DO" stuff, not just sit around and theorize
and design things and shoot down ideas on every negative minute detail
you can come up with before you type any code in.  That mode of
development doesn't inspire people and get a lot of code written.


I'll cop to the "negative" part of this, at least to some degree, but
not the rest.  "Naive hashes DDoS easily" is not a minute detail.
"Hash tables don't provide the operations characteristic of priority
queues" is not a minute detail.  "The benchmarks offered do not
accurately model system impact" is not a minute detail.  If I were not
sufficiently familiar with some of Evgeniy's other contributions to
the kernel to think him capable of responding to these critiques with
a better design, I would not have bothered.


I definitely do not think others should use this
design/prototype/analyze/blah/blah way of developing as an example,
instead I think folks should use people like Ingo Molnar as an example
of a good Linux developer.  People like Ingo rewrite the scheduler one
night because of a tiny cool idea, and even if only 1 out of 10 hacks
like that turn out to be useful, his work is invaluable and since he's
actually trying to do things and writing lots of code this inspires
other people.


Ingo Molnar is certainly a good, nay an excellent, Linux developer.
His prototypes are brilliant even when they're under-designed by my
way of thinking.  By all means, readers should go thou and do
likewise.  And when someone presses an alternative design, "show me
the code" is a fair response.

At present, I am not offering code, nor design, nor even a prototype.
But I am starting to figure out which bits of the Linux kernel really
are implementations based on a solid design.  Any prototyping I wind
up doing is going to be bolted firmly onto those implementations, not
floating in the surrounding sea of prototype code; and any analysis I
offer based on that prototype will reflect, to the best of my ability,
an understanding of its weaknesses as well as its strengths.  If that
analysis passes community scrutiny, and if I remain interested in the
project, perhaps I will go through with design and implementation as
well.

This, incidentally, seems very similar to the process that Robert
Olsson and Stefan Nilsson have gone through with their trie/hash
project.  Although I haven't tried it out yet and don't have any basis
for an independe

Re: Extensible hashing and RCU

2007-03-04 Thread Michael K. Edwards

On 3/3/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:

Btw, you could try to implement something you have written above to show
its merits, so that they would not be empty words :)


Before I implement, I design.  Before I design, I analyze.  Before I
analyze, I prototype.  Before I prototype, I gather requirements.
Before I gather requirements, I think -- and the only way I know how
to think about technical matters is to write down my intuitions and
compare them against the sea of published research on the topic.  I'm
only partway through thinking about RCU and DDoS, especially as this
is on the fringe of my professional expertise and the appropriate
literature is not currently at my fingertips.

The only times that I make exceptions to the above sequence are (1)
when someone is paying me well to do so (usually to retrofit some kind
of sanity onto a pile of crap someone else wrote) and (2) when I really
feel like it.  At present neither exception applies here, although I
may yet get so het up about threadlets that I go into a coding binge
(which may or may not produce an RCU splay tree as a side effect).  I
wouldn't hold my breath if I were you, though; it's the first of what
promises to be a string of fine weekends, and if I binge on anything
this spring it's likely to be gardening.

Cheers,
- Michael


Re: Extensible hashing and RCU

2007-03-02 Thread Michael K. Edwards

On 3/2/07, Eric Dumazet <[EMAIL PROTECTED]> wrote:

Thank you for this report. (Still avoiding cache misses studies, while they
obviously are the limiting factor)


1)  The entire point of going to a tree-like structure would be to
allow the leaves to age out of cache (or even forcibly evict them)
when the structure bloats (generally under DDoS attack), on the theory
that most of them are bogus and won't be referenced again.  It's not
about the speed of the data structure -- it's about managing its
impact on the rest of the system.

2)  The other entire point of going to a tree-like structure is that
they're drastically simpler to RCU than hashes, and more generally
they don't involve individual atomic operations (RCU reaping passes,
resizing, etc.) that cause big latency hiccups and evict a bunch of
other stuff from cache.

3)  The third entire point of going to a tree-like structure is to
have a richer set of efficient operations, since you can give them a
second "priority"-type index and have "pluck-highest-priority-item",
three-sided search, and bulk delete operations.  These aren't that
much harder to RCU than the basic modify-existing-node operation.

Now can we give these idiotic micro-benchmarks a rest until Robert's
implementation is tuned and ready for stress-testing?

Cheers,
- Michael


Re: Extensible hashing and RCU

2007-02-22 Thread Michael K. Edwards

On 22 Feb 2007 18:49:00 -0500, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

The rehash-every-10-minutes detail is theoretically unnecessary,
but does cover the case where a would-be attacker *does* get a chance
to look at a machine, such as by using routing delays to measure the
effectiveness of a collision attempt.


AOL -- and that's one of the reasons why RCUing hashes is a nightmare
in practice, if you care about smooth peristalsis.  Again, the
resizing headache is just the canary in the coal mine.


If you want to test your distribution for randomness, go have a look
at Knuth volume 2 (seminumerical algorithms) and see the discussion of
the Kolmogorov-Smirnov test.  Some lumpiness is *expected* in a truly
random distribution.


Ah, Kolmogorov-Smirnov, the astronomer's friend.  Somewhere on a DAT
in my garage lies a rather nice implementation of K-S tests on
censored data for the Mirella/Mirage Forth image analysis package.  If
you want something industrial-strength, easy to use, and pretty
besides, I recommend SM (http://www.astro.princeton.edu/~rhl/sm/),
which will cost you $500 U.S. for a department-scale license.  SM so
utterly exceeds the capabilities of gnuplot it isn't even funny.  But
then, while you don't always get what you pay for, you rarely fail to
pay for what you get, sooner or later, in one currency or another.

Cheers,
- Michael


Re: Extensible hashing and RCU

2007-02-22 Thread Michael K. Edwards

On 2/22/07, David Miller <[EMAIL PROTECTED]> wrote:

From: "Michael K. Edwards" <[EMAIL PROTECTED]>
Date: Wed, 21 Feb 2007 13:15:51 -0800

> it had better be a 2-left hash with a compact
> overflow pool for the rare case of second collision.

Just like Evgeniy, you should do your research too :-)


Fair enough.  :-)


The gain for multiple-hash schemes, such as 2-left, exists only if you
can parallelize the N lookups in hardware.  This is an assumption
built into all the papers you will read on this subject, including the
one you reference by Mitzenmacher.


Er, no.  The principal gain for the 2-left hash is that the only time
you ever look at the right hash is when you have a collision in the
left.  The right hash is therefore substantially less occupied and the
odds of second collision are lower.  If you happen to be able to
prefetch the left and right buckets in parallel, that's gravy.

Roughly speaking, for a single hash, P(0) = (1 - 1/k)^n \approx
e^{-n/k}, where P(0) is the probability that a given bucket has 0
entries, k is the number of buckets, and n is the number of elements.
One entry in the bucket replaces a factor (1-1/k) with a factor n/k,
the second entry replaces another (1-1/k) with (n-1)/2k, and so forth.
It's the good old birthday problem.

Take the simple example of an equal number of items and of one-entry
buckets, and compare the simple and 2-left hashes.  When n/k \approx
1, you have about 37% empty buckets, but when n/k \approx 2 you only
have about 13% empty buckets.  On the other hand, when n/k \approx 1
and the buckets only hold one entry, about 37% of the items overflow;
when n/k \approx 2, about 57% do.  Result: a 2-left hash with exactly
as many 1-item buckets as there are items will only see about 20% of
the items overflow (collide in the left hash and then collide again in
the right hash), versus the 37% that would have overflowed in a single
hash of the same size.
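
For anyone who wants to reproduce those percentages, here is a
throwaway userspace calculation of the Poisson approximation (my own
illustration, nothing more):

    /* Poisson approximation for bucket occupancy at load L = n/k:
     *   P(empty bucket)                          ~= e^-L
     *   fraction of items overflowing 1-slot buckets
     *                                            ~= 1 - (1 - e^-L) / L
     */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double loads[] = { 1.0, 2.0 };

        for (int i = 0; i < 2; i++) {
            double L = loads[i];
            double p_empty  = exp(-L);
            double overflow = 1.0 - (1.0 - p_empty) / L;

            printf("n/k = %.0f: %.1f%% empty buckets, %.1f%% of items overflow\n",
                   L, 100.0 * p_empty, 100.0 * overflow);
        }
        return 0;
    }

Compiled with -lm it prints roughly 36.8% / 36.8% at n/k = 1 and
13.5% / 56.8% at n/k = 2, which is where the figures above come from.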

In practical applications you try to make your hash buckets cache-line
sized and fit 2 or 4 items into them so that overflows are farther out
on the tail of the distribution.  The result can be a hash that is
only oversized by a factor of two or so relative to the space occupied
by the filled slots, yet has an amortized cost of less than 1.5 cache
misses and has a fraction of a percent of items that overflow the
right hash.  That fraction of a percent can go into a single overflow
pool (a way oversized hash or an rbtree or something else you've
already coded; it's a rare case) rather than having to chain from each
hash bucket.
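
A bare-bones userspace sketch of that layout -- left table consulted
first, right table only when the left bucket is full, shared overflow
pool for the rare double collision.  The mixer, table size, and slot
count are arbitrary placeholders (not the kernel's jhash, not anyone's
tuned values):

    #include <stdint.h>
    #include <stdlib.h>

    #define NBUCKETS 1024           /* per table, power of two */
    #define SLOTS    4              /* roughly a cache line of keys */

    struct bucket   { uint32_t key[SLOTS]; int used; };
    struct overflow { uint32_t key; struct overflow *next; };

    static struct bucket left[NBUCKETS], right[NBUCKETS];
    static struct overflow *spill;   /* the rare double-collision pool */

    static uint32_t mix(uint32_t x, uint32_t seed)
    {
        /* toy integer mixer standing in for a real (salted) hash */
        x ^= seed; x *= 0x9e3779b1u; x ^= x >> 16;
        return x;
    }

    static int bucket_insert(struct bucket *b, uint32_t key)
    {
        if (b->used == SLOTS)
            return 0;
        b->key[b->used++] = key;
        return 1;
    }

    static void insert(uint32_t key)
    {
        if (bucket_insert(&left[mix(key, 1) & (NBUCKETS - 1)], key))
            return;                  /* common case: left table */
        if (bucket_insert(&right[mix(key, 2) & (NBUCKETS - 1)], key))
            return;                  /* left bucket full: spill right */
        struct overflow *o = malloc(sizeof(*o));
        if (!o)
            return;                  /* sketch: just drop on OOM */
        o->key = key;                /* double collision: overflow pool */
        o->next = spill;
        spill = o;
    }

    static int lookup(uint32_t key)
    {
        const struct bucket *b = &left[mix(key, 1) & (NBUCKETS - 1)];
        for (int i = 0; i < b->used; i++)
            if (b->key[i] == key)
                return 1;
        b = &right[mix(key, 2) & (NBUCKETS - 1)];
        for (int i = 0; i < b->used; i++)
            if (b->key[i] == key)
                return 1;
        for (const struct overflow *o = spill; o; o = o->next)
            if (o->key == key)
                return 1;
        return 0;
    }

    int main(void)
    {
        for (uint32_t k = 1; k <= 2000; k++)
            insert(k);
        return lookup(1234) ? 0 : 1;
    }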


If you read any paper on IP route lookup algorithms, prefix
your reading of that paper with something that reads like:

And by the way we are targeting implementations in
hardware that can parallelize and pipeline memory
accesses to fast SRAM.  Do not consider this algorithm
seriously for running on general purpose processors
in software.


I generally agree with this statement.  However, the 2-left hash is a
very useful data structure for lots of things other than IP route
lookups, especially in memory-constrained systems where you don't want
to oversize the hash table by _too_ much but you still want actual
overflow chaining to be quite rare.  It's much simpler than
next-empty-bucket types of schemes, simple enough that I've coded it
for little DSP cores, and almost as good at minimizing overflows.


Probably a few reads of Network Algorithmics would help your
understanding in this area as well.


Hey, I don't think a hash is the right solution in the first place.
If you're on the open Internet you have to assume a DDoS half-open
connection load a thousandfold higher than the number of real traffic
flows.  If you don't have the memory bandwidth and cache architecture
to bend over and take it (and on a commodity processor, you don't, not
unless you're a hell of a lot cleverer about hinting to the cache
controller than I am), you want a data structure that rotates the real
connections up into a frequently traversed set.

Besides, RCUing a hash of any sort sounds like a nightmare, whereas
something like a splay tree is reputed to be very straightforward to
make "persistent" in the functional-programming sense.  I haven't
coded that one myself but surely it's in the current edition of Knuth.
Network Algorithmics is fine as far as it goes, but if you want to
distribute the networking stack across a NUMA box and still not have
it go pear-shaped under sophisticated DDoS, I think you probably have
to think it out for yourself.

In any case, I am not setting myself up as some kind of networking or
data structure guru, just trying to articulate what's wrong with
Evgeniy's analysis and why RCUing a naive hash is a complete waste of
time.


Tries and similar structures do help in the case of our routing
because we're goin

Re: Extensible hashing and RCU

2007-02-21 Thread Michael K. Edwards

Look, Evgeniy.  Eric and I may be morons but davem is not.  He's
telling you, again and again, that DoS attacks do happen, that to
survive them you need for the distribution of tuples within hash
buckets to vary unpredictably from system to system and boot to boot,
and that XOR hash does not accomplish this.  He has told you that the
Jenkins hash works, for reasons discussed exhaustively in the link I
gave you, which you can follow to statistical analysis that is beyond
your expertise or mine.  He has told you that your performance
analysis is fundamentally flawed.  Do you think you might need to drop
back five and punt?

As for the distribution issues:  who cares?  You fundamentally can't
do any better, for random input drawn from a tuple space much larger
than the number of hash buckets, than a Poisson distribution.  And
that's enough to kill a naive hash table design, because chaining is
going to cost you another cache miss for every link, and you couldn't
afford the first cache miss in the first place.  If you absolutely
insist on a hash table (on the theory that you can't keep the hot
connections warm in cache anyway, which isn't necessarily true if you
use a splay tree), it had better be a 2-left hash with a compact
overflow pool for the rare case of second collision.

I recommend Michael Mitzenmacher's paper to you, or rather to
whoever's reading along with the intention of improving the kernel
without reinventing the square wheel.

Cheers,
- Michael


Re: Extensible hashing and RCU

2007-02-20 Thread Michael K. Edwards

On 2/20/07, Michael K. Edwards <[EMAIL PROTECTED]> wrote:

Correct.  That's called a "weak hash", and Jenkins is known to be a
thoroughly weak hash.  That's why you never, ever use it without a
salt, and you don't let an attacker inspect the hash output either.


Weak in a cryptographic sense, of course.  Excellent avalanche
behavior, though, which is what you care about in a salted hash.
http://en.wikipedia.org/wiki/Hash_table

I know nothing about data structures and algorithms except what I read
on the Internet.  But you'd be amazed what's on the Internet.

Cheers,
- Michael


Re: Extensible hashing and RCU

2007-02-20 Thread Michael K. Edwards

On 2/20/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:

How I like personal insults - it is always fun to read about myself from
people who never knew me :)


On this occasion, I did not set out to insult you.  I set out to
suggest an explanation for why cooler and grayer heads than mine are
not falling over themselves to merge your designs into the mainline
kernel.  Did I succeed?


I just showed a problem in the jenkins hash - it is not how to find a
different input for the same output - it is a _law_ which allows one to
break the hash. You will add some constant, and that law will be turned
into something different (taking into account what was written, it will
end up with the same law).


Correct.  That's called a "weak hash", and Jenkins is known to be a
thoroughly weak hash.  That's why you never, ever use it without a
salt, and you don't let an attacker inspect the hash output either.


Using the jenkins hash is equal to the situation when part of your hash
chains will be 5 times longer than the median square value; with the XOR
one there is no such distribution.


Show us the numbers.  Salt properly this time to reduce the artifacts
that come of applying a weak hash to a poor PRNG, and histogram your
results.  If you don't get a Poisson distribution you probably don't
know how to use gnuplot either.  :-)


Added somthing into permutations will not endup in different
distribution, since it is permutations which are broken, not its result
(which can be xored with something).


I can't parse this.  Care to try again?

Cheers,
- Michael


Re: Extensible hashing and RCU

2007-02-20 Thread Michael K. Edwards

All right, Eric, you and me so clevvah, let's embarrass our own selves
designing this thing in public instead of picking on poor Evgeniy.
Which would you rather RCU, a 2-left hash or a splay tree?  2-left
hash gets you excellent occupation fraction before the first real
collision, so you can be utterly stupid about collisions in the second
hash table (just toss all double collisions in an overflow radix
tree).  Splay tree is a lot bigger bite to chew initially, but it gets
you a hot set that's at least sometimes warm in cache, and it's easier
to think about how to RCU it, and it doubles as a priority queue.  You
pick.

Cheers,
- Michael


Re: Extensible hashing and RCU

2007-02-20 Thread Michael K. Edwards

On 2/20/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:

And here are other ones which produce the same hash value.
Of course searching for a pair for jhash('jhash is broken')
will require more steps, but it is doable.

That means that if the attacker has full control over one host, it can
create a chain of maximum 4 entries in the socket table (if jhash is used).
If it is udp, that means that the attacker controls addresses too without
syn cookies, which in turn means that the list below can be increased to
infinity.


Evgeniy, you're not gettin' it.  Have you ever heard of a Poisson
distribution?  That's what you have to aim for in a (bog-standard)
hash table -- a hash function that scrambles the key space until you
have a Poisson distribution in the hash buckets no matter whan pile of
keys the attacker throws at you.  That's a "perfect" hash function in
the statistician's sense (not a "perfect hash" function in the
compiler designer's sense).  Yes there will be collisions, and they
will start happening much earlier than you will think, if you have
never heard of Poisson distributions before; that's called the
"birthday problem" and it is a chestnut of a mathematical puzzle
generally encountered by bright twelve-year-olds in extra-curricular
math classes.  The only sensible way to deal with them is to use a
better data structure -- like a 2-left hash (Google it) or something
tree-ish.

That is not, however, what you're not getting.  What you're not
getting is the use of a "salt" or "secret" or whatever you want to
call it to turn a weak hash into an impenetrable defense against
chosen-plaintext collision attacks.  You can run your PRNG until you
slag your poor botnet's little CPUs searching for a set of tuples that
collide in the same bucket on your test system-under-attack.  But they
won't collide on mine, and they won't collide on Eric's, and they
won't collide on yours the next time it's rebooted.  Because even a
weak hash (in the cryptographer's sense) is good enough to scramble
the key space radically differently with two different salts.  XOR
doesn't cut it -- the salt will permute the buckets but not toss each
one's contents up into the air to land in a bunch of different
buckets.  But Jenkins does cut it, as discussed in the source code
Eric pointed out to you.

Of course you don't change the salt except when resizing the hash
(which is a dumb thing to do anyway, and a sure sign that a hash table
is not the right data structure).  That would kinda defeat the purpose
of having a hash table in the first place.
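
A toy userspace demonstration of that difference (entirely illustrative:
the mixer stands in for the real Jenkins hash, and the two tuples are
just a source/destination swap, the simplest pair that XOR-folds
identically):

    #include <stdint.h>
    #include <stdio.h>

    #define NBUCKETS 4096u           /* power of two */

    /* XOR-fold of (saddr, daddr, ports), "salted" by XORing afterwards */
    static uint32_t xor_bucket(uint32_t s, uint32_t d, uint32_t p, uint32_t salt)
    {
        return (s ^ d ^ p ^ salt) & (NBUCKETS - 1);
    }

    /* toy stand-in for a salted Jenkins-style hash: the salt participates
     * in every mixing step instead of being XORed onto the result */
    static uint32_t mix_bucket(uint32_t s, uint32_t d, uint32_t p, uint32_t salt)
    {
        uint32_t h = salt;
        h ^= s; h *= 0x9e3779b1u; h ^= h >> 15;
        h ^= d; h *= 0x85ebca77u; h ^= h >> 13;
        h ^= p; h *= 0xc2b2ae3du; h ^= h >> 16;
        return h & (NBUCKETS - 1);
    }

    int main(void)
    {
        /* two different tuples that XOR-fold identically (src/dst swapped) */
        uint32_t s = 0x0a000001, d = 0x0a000002, p = 0x12345678;

        unsigned xor_coll = 0, mix_coll = 0;
        for (uint32_t salt = 0; salt < 10000; salt++) {
            xor_coll += xor_bucket(s, d, p, salt) == xor_bucket(d, s, p, salt);
            mix_coll += mix_bucket(s, d, p, salt) == mix_bucket(d, s, p, salt);
        }
        printf("xor-fold: %u/10000 salts collide; mixed: %u/10000 salts collide\n",
               xor_coll, mix_coll);
        return 0;
    }

The XOR-fold column collides on all 10000 salts; the mixed column
collides at roughly chance level, a handful out of 10000.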

If you can't assimilate this and articulate it back to us in your own
words instead of repeating "any hash that doesn't totally, utterly
suck slows my meaningless artificial benchmark by 50%", you may not be
the right person to design and implement a new RCU data structure in a
kernel that thousands of us trust to withstand exposure to the open
Internet to the best of a commodity processor's ability.

Cheers,
- Michael


Re: Extensible hashing and RCU

2007-02-20 Thread Michael K. Edwards

On 2/20/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:

Jenkins _does_ have them, I showed tests half a year ago and in this
thread too. Actually _any_ hash has them; it is just a matter of time
to find one.


I think you misunderstood me.  If you are trying to DoS me from
outside with a hash collision attack, you are trying to feed me
packets that fall into the same hash bucket.  The Jenkins hash does
not have to be artifact-free, and does not have to be
cryptographically strong.  It just has to do a passable job of mixing
a random salt into the tuple, so you don't know which string of
packets to feed me in order to fill one (or a few) of my buckets.
XORing salt into a folded tuple doesn't help; it just permutes the
buckets.

Cheers,
- Michael


Re: Extensible hashing and RCU

2007-02-20 Thread Michael K. Edwards

On 2/20/07, David Miller <[EMAIL PROTECTED]> wrote:

Actually someone (I think it was Evgeniy in fact) made such
comparisons and found in his studies that not only does the current
ehash xor hash function distribute about as well as jenkins, it's
significantly cheaper to calculate :-)


However, it's vulnerable to hash collision attacks, which the Jenkins
hash (used as a poor man's HMAC with a boot-time random salt) is not.
But if you don't care about DoS ...

Cheers,
- Michael


Re: Extensible hashing and RCU

2007-02-19 Thread Michael K. Edwards

On 19 Feb 2007 13:04:12 +0100, Andi Kleen <[EMAIL PROTECTED]> wrote:

LRU tends to be hell for caches in MP systems, because it writes to
the cache lines too and makes them exclusive and more expensive.


That's why you let the hardware worry about LRU.  You don't write to
the upper layers of the splay tree when you don't have to.  It's the
mere traversal of the upper layers that keeps them in cache, causing
the cache hierarchy to mimic the data structure hierarchy.

RCU changes the whole game, of course, because you don't write to the
old copy at all; you have to clone the altered node and all its
ancestors and swap out the root node itself under a spinlock.  Except
you don't use a spinlock; you have a ring buffer of root nodes and
atomically increment the writer index.  That atomically incremented
index is the only thing on which there's any write contention.
(Obviously you need a completion flag on the new root node for the
next writer to poll on, so the sequence is atomic-increment ... copy
and alter from leaf to root ... wmb() ... mark new root complete.)
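
For concreteness, a heavily simplified single-writer sketch of that
publication scheme in C11 atomics (my own illustration of the idea as
described above, not kernel code: the tree is reduced to an opaque root
pointer, the grace-period/reclamation half of RCU is omitted, and so is
the polling on the previous slot's completion flag that would serialize
multiple writers):

    #include <stdatomic.h>

    struct node;                              /* opaque tree node */

    #define RING 16                           /* power of two */

    static struct node *_Atomic ring[RING];   /* completed root versions */
    static atomic_uint writer_idx;            /* only write-contended word */
    static atomic_uint latest;                /* newest completed slot */

    /* Writer: claim a slot, build the new version off to the side by
     * cloning the altered leaf-to-root path, then publish it.  The
     * release stores stand in for the wmb()/"mark complete" step. */
    void publish_new_version(struct node *new_root)
    {
        unsigned idx = atomic_fetch_add(&writer_idx, 1) & (RING - 1);

        atomic_store_explicit(&ring[idx], new_root, memory_order_release);
        atomic_store_explicit(&latest, idx, memory_order_release);
    }

    /* Reader: pick up the newest completed root and walk it locklessly;
     * superseded versions stay valid until reclaimed (not shown). */
    struct node *current_root(void)
    {
        unsigned idx = atomic_load_explicit(&latest, memory_order_acquire);
        return atomic_load_explicit(&ring[idx], memory_order_acquire);
    }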

When you share TCP sessions among CPUs, and packets associated with
the same session may hit softirq in any CPU, you are going to eat a
lot of interconnect bandwidth keeping the sessions coherent.  (The
only way out of this is to partition the tuple space by CPU at the NIC
layer with separate per-core, or perhaps per-cache, receive queues; at
which point the NIC is so smart that you might as well put the DDoS
handling there.)  But at least it's cache coherency protocol bandwidth
and not bandwidth to and from DRAM, which has much nastier latencies.
The only reason the data structure matters _at_all_ is that DDoS
attacks threaten to evict the working set of real sessions out of
cache.  That's why you add new sessions at the leaves and don't rotate
them up until they're hit a second time.

Of course the leaf layer can't be RCU, but it doesn't have to be; it's
just a bucket of tuples.  You need an auxiliary structure to hold the
session handshake trackers for the leaf layer, but you assume that
you're always hitting cold cache when diving into this structure and
ration accesses accordingly.  Maybe you even explicitly evict entries
from cache after sending the SYNACK, so they don't crowd other stuff
out; they go to DRAM and get pulled into the new CPU (and rotated up)
if and when the next packet in the session arrives.  (I'm assuming
T/TCP here, so you can't skimp much on session tracker size during the
handshake.)

Every software firewall I've seen yet falls over under DDoS.  If you
want to change that, you're going to need more than the
back-of-the-napkin calculations that show that session lookup
bandwidth exceeds frame throughput for min-size packets.  You're going
to need to strategize around exploiting the cache hierarchy already
present in your commodity processor to implicitly partition real
traffic from the DDoS storm.  It's not a trivial problem, even in the
mathematician's sense (in which all problems are either trivial or
unsolved).

Cheers,
- Michael


Re: Extensible hashing and RCU

2007-02-18 Thread Michael K. Edwards

On 2/18/07, Michael K. Edwards <[EMAIL PROTECTED]> wrote:

...  Much less vulnerable to cache eviction DDoS
than a hash, because the hot connections get rotated up into non-leaf
layers and get traversed enough to keep them in the LRU set.


Let me enlarge on this a bit.  I used to work for a company that built
a custom firewall/VPN ASIC with all sorts of special sauce in it,
mostly focused on dealing with DDoS.  Some really smart guys, some
really good technology, I hope they grab the brass ring someday.  On
the scale they were dealing with, there's only one thing to do about
DDoS: bend over and take it.  Provision enough memory bandwidth to
cold-cache every packet, every session lookup, and every
packet-processing-progress structure.  Massively parallelize, spinlock
on on-chip SRAM, tune for the cold-cache case.  If you can't afford to
do that -- and if you haven't designed your own chip, with separate
cache windows for each of these use cases, you can't, because they're
all retrograde loads for an LRU cache -- then a hash is not the right
answer.  The interaction between resizing and RCU is just the canary
in the coal mine.

Cheers,
- Michael


Re: Extensible hashing and RCU

2007-02-18 Thread Michael K. Edwards

A better data structure for RCU, even with a fixed key space, is
probably a splay tree.  Much less vulnerable to cache eviction DDoS
than a hash, because the hot connections get rotated up into non-leaf
layers and get traversed enough to keep them in the LRU set.  But I am
not a data structures guru; a stratified tree or van Emde-Boas
priority queue might be a better choice.  RCU splay tree has lots of
other potential applications inside the kernel, so it would make a fun
project for someone with time to burn.
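
To give a flavor of "rotating the hot entries up", here is a minimal
userspace sketch -- a single parent/child rotation per hit rather than a
real splay step with its zig-zig/zig-zag cases, and a placeholder key
type:

    #include <stdint.h>

    struct tnode {
        uint32_t key;
        struct tnode *left, *right, *parent;
    };

    /* Rotate n one level up, preserving binary-search-tree order. */
    static void rotate_up(struct tnode *n, struct tnode **rootp)
    {
        struct tnode *p = n->parent;
        if (!p)
            return;
        struct tnode *g = p->parent;

        if (p->left == n) {              /* right rotation about p */
            p->left = n->right;
            if (n->right)
                n->right->parent = p;
            n->right = p;
        } else {                         /* left rotation about p */
            p->right = n->left;
            if (n->left)
                n->left->parent = p;
            n->left = p;
        }
        p->parent = n;
        n->parent = g;
        if (!g)
            *rootp = n;
        else if (g->left == p)
            g->left = n;
        else
            g->right = n;
    }

    /* Lookup that promotes an entry one level on every hit, so frequently
     * referenced keys migrate toward the root over time. */
    struct tnode *lookup(struct tnode **rootp, uint32_t key)
    {
        struct tnode *n = *rootp;
        while (n && n->key != key)
            n = key < n->key ? n->left : n->right;
        if (n)
            rotate_up(n, rootp);
        return n;
    }

A real splay tree does more (and handles insertion and misses), but this
is the property being appealed to: entries that keep getting hit drift
toward the root, so the upper layers that stay warm in cache are exactly
the ones covering the live connections.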

Cheers,
- Michael