Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas

2016-06-27 Thread Andy Lutomirski
On Mon, Jun 27, 2016 at 9:51 AM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> Some arches have virtually mapped kernel stacks, or will soon have.
>
> tcp_md5_hash_header() uses an automatic variable to copy tcp header
> before mangling th->check and calling crypto function, which might
> be problematic on such arches.
>
> David says that using percpu storage is also problematic on non SMP
> builds.
>
> Just use kmalloc() to allocate scratch areas.

Seems reasonable.

I wonder if it's worth switching from ahash to shash, though.  It
would probably be simpler and faster.

--Andy


Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas

2016-06-27 Thread Cong Wang
On Mon, Jun 27, 2016 at 9:51 AM, Eric Dumazet  wrote:
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 
> 5c7ed147449c1b7ba029b12e033ad779a631460a..fddc0ab76c1df82cb05dba03271b773e3b2d
>  100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2969,8 +2969,18 @@ static void __tcp_alloc_md5sig_pool(void)
> return;
>
> for_each_possible_cpu(cpu) {
> +   void *scratch = per_cpu(tcp_md5sig_pool, cpu).scratch;
> struct ahash_request *req;
>
> +   if (!scratch) {
> +   scratch = kmalloc_node(sizeof(union tcp_md5sum_block) 
> +
> +  sizeof(struct tcphdr),
> +  GFP_KERNEL,
> +  cpu_to_node(cpu));
> +   if (!scratch)
> +   return;
> +   per_cpu(tcp_md5sig_pool, cpu).scratch = scratch;
> +   }
> if (per_cpu(tcp_md5sig_pool, cpu).md5_req)
> continue;

Not a problem of your patch, but it seems these allocations never
get freed once we start using tcp md5. Maybe we should free them
when the last socket using tcp md5 is gone?


Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas

2016-06-27 Thread Eric Dumazet
On Mon, 2016-06-27 at 10:58 -0700, Andy Lutomirski wrote:

> Seems reasonable.
> 
> I wonder if it's worth switching from ahash to shash, though.  It
> would probably be simpler and faster.

Well, I have no opinion on this, I will let a crypto guy doing this
change if he cares ;)

Thanks.




Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas

2016-06-27 Thread Eric Dumazet
On Mon, 2016-06-27 at 11:31 -0700, Cong Wang wrote:

> Not a problem of your patch, but it seems these allocations never
> get freed once we start using tcp md5. Maybe we should free them
> when the last socket using tcp md5 is gone?

If we constantly allocate-deallocate these tiny blocks for occasional
TCP MD5 use, it becomes quite expensive.

With current code, only first TCP MD5 usage trigger an extra setup cost.




Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas

2016-06-27 Thread Herbert Xu
On Mon, Jun 27, 2016 at 10:58:42AM -0700, Andy Lutomirski wrote:
>
> I wonder if it's worth switching from ahash to shash, though.  It
> would probably be simpler and faster.

No shash is not appropriate here because it needs to hash skb
frags which are SG lists.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas

2016-06-28 Thread Andy Lutomirski
On Mon, Jun 27, 2016 at 8:41 PM, Herbert Xu  wrote:
> On Mon, Jun 27, 2016 at 10:58:42AM -0700, Andy Lutomirski wrote:
>>
>> I wonder if it's worth switching from ahash to shash, though.  It
>> would probably be simpler and faster.
>
> No shash is not appropriate here because it needs to hash skb
> frags which are SG lists.
>

Do you mean this code:

for (i = 0; i < shi->nr_frags; ++i) {
const struct skb_frag_struct *f = &shi->frags[i];
unsigned int offset = f->page_offset;
struct page *page = skb_frag_page(f) + (offset >> PAGE_SHIFT);

sg_set_page(&sg, page, skb_frag_size(f),
offset_in_page(offset));
ahash_request_set_crypt(req, &sg, NULL, skb_frag_size(f));
if (crypto_ahash_update(req))
return 1;
}

I'm wondering why support for scatterlists is all-or-nothing.  Why
can't we initialize a hash object and then alternate between passing
it scatterlists and pointers?

I'm guessing that ahash enables async operation and shash is
synchronous only.  If I'm right, I understand why ahash requires a
scatterlist.  What I don't understand is why shash can't also accept a
scatterlist.  It appears that most of the ahash users in the tree
actually want synchronous crypto and are presumably using ahash for
some other reason such as ahash's ability to hash via scatterlist (in
this case, struct page *).

--Andy


Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas

2016-06-28 Thread Herbert Xu
On Tue, Jun 28, 2016 at 10:35:31AM -0700, Andy Lutomirski wrote:
>
> Do you mean this code:

Yes.
 
> I'm wondering why support for scatterlists is all-or-nothing.  Why
> can't we initialize a hash object and then alternate between passing
> it scatterlists and pointers?

Because once you have started hashing the hash state is not stored
in a consistent format.  Our software code may maintain one format
while a hardware implementation could do something else altogether.
So you have to stick with one implementation throughout a particular
hashing session.

> I'm guessing that ahash enables async operation and shash is
> synchronous only.  If I'm right, I understand why ahash requires a
> scatterlist.  What I don't understand is why shash can't also accept a
> scatterlist.  It appears that most of the ahash users in the tree
> actually want synchronous crypto and are presumably using ahash for
> some other reason such as ahash's ability to hash via scatterlist (in
> this case, struct page *).

ahash is meant to be the interface everyone uses regardless of
whether they want sync-only or async.  shash should only be used
for small amounts of hashing on virtual addresses.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas

2016-06-29 Thread Andy Lutomirski
On Tue, Jun 28, 2016 at 7:23 PM, Herbert Xu  wrote:
> On Tue, Jun 28, 2016 at 10:35:31AM -0700, Andy Lutomirski wrote:
>>
>> Do you mean this code:
>
> Yes.
>
>> I'm wondering why support for scatterlists is all-or-nothing.  Why
>> can't we initialize a hash object and then alternate between passing
>> it scatterlists and pointers?
>
> Because once you have started hashing the hash state is not stored
> in a consistent format.  Our software code may maintain one format
> while a hardware implementation could do something else altogether.
> So you have to stick with one implementation throughout a particular
> hashing session.
>
>> I'm guessing that ahash enables async operation and shash is
>> synchronous only.  If I'm right, I understand why ahash requires a
>> scatterlist.  What I don't understand is why shash can't also accept a
>> scatterlist.  It appears that most of the ahash users in the tree
>> actually want synchronous crypto and are presumably using ahash for
>> some other reason such as ahash's ability to hash via scatterlist (in
>> this case, struct page *).
>
> ahash is meant to be the interface everyone uses regardless of
> whether they want sync-only or async.  shash should only be used
> for small amounts of hashing on virtual addresses.

I suspect that, if you compare a synchronous implementation that can
use virtual addresses to a DMA based implementation that can't, you'll
find that, for short messages like tcp md5 uses, the synchronous
implementation would win every time.  I'm wondering if shash should
gain the ability to use scatterlists and, if it doesn't already have
it, the ability to use synchronous hardware implementations (like
SHA-NI, for example, although I don't think that's useful for MD5).

--Andy


Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas

2016-06-29 Thread Herbert Xu
On Wed, Jun 29, 2016 at 07:59:22AM -0700, Andy Lutomirski wrote:
>> I suspect that, if you compare a synchronous implementation that can
> use virtual addresses to a DMA based implementation that can't, you'll
> find that, for short messages like tcp md5 uses, the synchronous
> implementation would win every time.  I'm wondering if shash should
> gain the ability to use scatterlists and, if it doesn't already have
> it, the ability to use synchronous hardware implementations (like
> SHA-NI, for example, although I don't think that's useful for MD5).

I don't understand, if you add SGs to shash you get ahash.  So
why wouldn't you just use ahash?

AFAICS tcp md5 already uses ahash in sync mode so there is nothing
asynchronous here at all.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas

2016-06-29 Thread Andy Lutomirski
On Wed, Jun 29, 2016 at 8:02 AM, Herbert Xu  wrote:
> On Wed, Jun 29, 2016 at 07:59:22AM -0700, Andy Lutomirski wrote:
>>> I suspect that, if you compare a synchronous implementation that can
>> use virtual addresses to a DMA based implementation that can't, you'll
>> find that, for short messages like tcp md5 uses, the synchronous
>> implementation would win every time.  I'm wondering if shash should
>> gain the ability to use scatterlists and, if it doesn't already have
>> it, the ability to use synchronous hardware implementations (like
>> SHA-NI, for example, although I don't think that's useful for MD5).
>
> I don't understand, if you add SGs to shash you get ahash.  So
> why wouldn't you just use ahash?

Two reasons:

1. Code like tcp md5 would be simpler if it could pass a scatterlist
to hash the skb but use a virtual address for the header.

2. The actual calling sequence and the amount of indirection is much
less for shash, so hashing short buffer is probably *much* faster.

ahash is very featureful, but it's also quite heavyweight and it's
missing the ability to use virtual addresses directly (for good
reason).  shash is simpler and probably much faster on short buffers,
but the only feature it's missing for most callers (the ones that want
synchronous operation) is the ability to use a scatterlist.  Given
that the crypto code already has the ability to walk a scatterlist,
map it, and hash it, it seems like it might be a nice addition to let
shash objects invoke that code path if they want
(crypto_shash_update_sg?).  This would have no overhead for users that
don't call it, and I bet it would both speed up and reduce the amount
of code in users like tcp md5.

--Andy


Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas

2016-06-29 Thread Herbert Xu
On Wed, Jun 29, 2016 at 08:26:43AM -0700, Andy Lutomirski wrote:
> 
> Two reasons:
> 
> 1. Code like tcp md5 would be simpler if it could pass a scatterlist
> to hash the skb but use a virtual address for the header.

True.  But I bet we can make it simpler in other ways without
creating special interfaces for it.  Look at how we do IPsec
encryption/hashing, this is what TCP md5 should look like.  But
nobody in their right mind would bother doing this because this
is dead code.

> 2. The actual calling sequence and the amount of indirection is much
> less for shash, so hashing short buffer is probably *much* faster.

Really?

Have you measured the speed difference between the ahash and shash
interfaces? Are you sure that this would matter when compared
against the speed of hashing a single MD5 block?

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas

2016-06-29 Thread Andy Lutomirski
On Wed, Jun 29, 2016 at 8:38 AM, Herbert Xu  wrote:
> On Wed, Jun 29, 2016 at 08:26:43AM -0700, Andy Lutomirski wrote:
>>
>> Two reasons:
>>
>> 1. Code like tcp md5 would be simpler if it could pass a scatterlist
>> to hash the skb but use a virtual address for the header.
>
> True.  But I bet we can make it simpler in other ways without
> creating special interfaces for it.  Look at how we do IPsec
> encryption/hashing, this is what TCP md5 should look like.  But
> nobody in their right mind would bother doing this because this
> is dead code.
>
>> 2. The actual calling sequence and the amount of indirection is much
>> less for shash, so hashing short buffer is probably *much* faster.
>
> Really?
>
> Have you measured the speed difference between the ahash and shash
> interfaces? Are you sure that this would matter when compared
> against the speed of hashing a single MD5 block?
>

It's non-negligible.  If I had SHA-NI, it would probably be a bigger
relative difference.

[0.535717] SHA1 ahash: warming up
[0.737471] SHA1 ahash: 50 loops, 386 ns/loop
[0.737937] SHA1 shash: warming up
[0.903131] SHA1 shash: 50 loops, 323 ns/loop
[1.055409] SHA1 shash_digest: 50 loops, 303 ns/loop

[0.574410] SHA1 ahash: warming up
[0.796927] SHA1 ahash: 50 loops, 414 ns/loop
[0.797695] SHA1 shash: warming up
[0.959772] SHA1 shash: 50 loops, 316 ns/loop
[1.100886] SHA1 shash_digest: 50 loops, 280 ns/loop

Here's MD5:

[0.504708] MD5 ahash: warming up
[0.688155] MD5 ahash: 50 loops, 344 ns/loop
[0.688971] MD5 shash: warming up
[0.834953] MD5 shash: 50 loops, 285 ns/loop
[0.982021] MD5 shash_digest: 50 loops, 292 ns/loop

[0.570780] MD5 ahash: warming up
[0.754325] MD5 ahash: 50 loops, 357 ns/loop
[0.754807] MD5 shash: warming up
[0.906861] MD5 shash: 50 loops, 297 ns/loop
[1.059057] MD5 shash_digest: 50 loops, 303 ns/loop

If you throw in sub-one-block requests, it'll be worse, because those
requests don't do real work.

Overall, it looks like there's overhead of something like 50ns for
each ahash invocation vs the shash equivalent.  It's not huge, but
it's there.  (This is cache-hot.  I bet it's considerably worse if
cache-cold, because ahash will require a lot more code cache lines as
well as the extra cache lines involved in the scatterlist and whatever
arch stuff is needed to map back and forth between virtual and
physical addresses.

Here's the code:

static void test_ahash(void)
{
struct crypto_ahash *hash;
struct ahash_request *req;
struct scatterlist sg_in;
char *buf, *out;
int i;
u64 end, start;
unsigned loops = 50;

hash = crypto_alloc_ahash("sha1", 0, CRYPTO_ALG_ASYNC);
if (IS_ERR(hash))
return;

req = ahash_request_alloc(hash, GFP_KERNEL);
if (!req)
return;

ahash_request_set_callback(req, 0, NULL, NULL);

buf = kmalloc(64, GFP_KERNEL);
memset(buf, 0, 64);
out = kmalloc(20, GFP_KERNEL);

pr_err("SHA1 ahash: warming up\n");

for (i = 0; i < 1; i++) {
crypto_ahash_init(req);
sg_init_one(&sg_in, buf, 64);
ahash_request_set_crypt(req, &sg_in, NULL, 64);
if (crypto_ahash_update(req))
return;
ahash_request_set_crypt(req, NULL, out, 0);
if (crypto_ahash_final(req))
return;
}

start = ktime_get_ns();
for (i = 0; i < loops; i++) {
crypto_ahash_init(req);
sg_init_one(&sg_in, buf, 64);
ahash_request_set_crypt(req, &sg_in, NULL, 64);
if (crypto_ahash_update(req))
return;
ahash_request_set_crypt(req, NULL, out, 0);
if (crypto_ahash_final(req))
return;
}
end = ktime_get_ns();

pr_err("SHA1 ahash: %u loops, %llu ns/loop\n", loops, (unsigned
long long)((end-start)/loops));
}

static void test_shash(void)
{
struct crypto_shash *hash;
struct shash_desc *desc;
char *buf, *out;
int i;
u64 end, start;
unsigned loops = 50;

hash = crypto_alloc_shash("sha1", 0, CRYPTO_ALG_ASYNC);
if (IS_ERR(hash))
return;

desc = kmalloc(crypto_shash_descsize(hash), GFP_KERNEL);
if (!desc)
return;
desc->tfm = hash;
desc->flags = 0;

/*
 * Hmm.  This interface isn't conducive to the use of hardware that
 * needs state beyond what lives in a single memory buffer -- how would
 * the shash implementation know when to free the state?
 */

buf = kmalloc(64, GFP_KERNEL);
memset(buf, 0, 64);
out = kmalloc(20, GFP_KERNEL);

pr_err("SHA1 shash: warming up\n");

for (i = 0; i < 1; i++) {
crypto_shash_init(desc);
if (crypto_shash_update(desc, buf, 64))
return;
if (crypto_shash_final(desc, out))
return;
}

start = ktime_get_ns();
for (i = 0; i < loops; i++) {
/*
 * For fairness, this does init; update;

Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas

2016-06-29 Thread Eric Dumazet
On Wed, 2016-06-29 at 09:41 -0700, Andy Lutomirski wrote:

> Overall, it looks like there's overhead of something like 50ns for
> each ahash invocation vs the shash equivalent.  It's not huge, but
> it's there.  (This is cache-hot.  I bet it's considerably worse if
> cache-cold, because ahash will require a lot more code cache lines as
> well as the extra cache lines involved in the scatterlist and whatever
> arch stuff is needed to map back and forth between virtual and
> physical addresses.

I am kind of mystified seeing someone caring about TCP MD5, other than
just making sure it wont crash the host when it needs to be used ;)

The real useful work would be to use a jump label so that we can avoid
spending cycles for non TCP MD5 sessions, when a host never had to use
any MD5 negotiation.





Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas

2016-06-29 Thread Andy Lutomirski
On Wed, Jun 29, 2016 at 2:44 PM, Eric Dumazet  wrote:
> On Wed, 2016-06-29 at 09:41 -0700, Andy Lutomirski wrote:
>
>> Overall, it looks like there's overhead of something like 50ns for
>> each ahash invocation vs the shash equivalent.  It's not huge, but
>> it's there.  (This is cache-hot.  I bet it's considerably worse if
>> cache-cold, because ahash will require a lot more code cache lines as
>> well as the extra cache lines involved in the scatterlist and whatever
>> arch stuff is needed to map back and forth between virtual and
>> physical addresses.
>
> I am kind of mystified seeing someone caring about TCP MD5, other than
> just making sure it wont crash the host when it needs to be used ;)
>
> The real useful work would be to use a jump label so that we can avoid
> spending cycles for non TCP MD5 sessions, when a host never had to use
> any MD5 negotiation.
>
>
>

I don't care about TCP MD5 performance at all.  Ease of maintenance is
nice, though, and maybe there are other places in the kernel where
performance does matter.

--Andy


Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas

2016-06-29 Thread Herbert Xu
On Wed, Jun 29, 2016 at 03:39:37PM -0700, Andy Lutomirski wrote:
>
> I don't care about TCP MD5 performance at all.  Ease of maintenance is
> nice, though, and maybe there are other places in the kernel where
> performance does matter.

TCP MD5 is using ahash because it touches SG lists.  Touching
SG lists is a pretty reliable indicator for hashing a large amount
of data.  In fact TCP MD5 hashes the entire packet content so we're
talking about ~1000 bytes, just like IPsec.

Therefore it is completely pointless to use something like shash
for it as shash is only meant to be used for those hashing less
one block of data or less.

If you're aware of any other user in the kernel that is using
ahash and is hashing a small amount of data in aggregate (not
per update) please let me know.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas

2016-07-01 Thread David Miller
From: Eric Dumazet 
Date: Mon, 27 Jun 2016 18:51:53 +0200

> From: Eric Dumazet 
> 
> Some arches have virtually mapped kernel stacks, or will soon have.
> 
> tcp_md5_hash_header() uses an automatic variable to copy tcp header
> before mangling th->check and calling crypto function, which might
> be problematic on such arches.
> 
> David says that using percpu storage is also problematic on non SMP
> builds.
> 
> Just use kmalloc() to allocate scratch areas.
> 
> Signed-off-by: Eric Dumazet 
> Reported-by: Andy Lutomirski 

Applied, thanks.