Re: [PATCH v3] crypto: gf128mul - define gf128mul_x_* in gf128mul.h

2017-03-31 Thread Eric Biggers
On Fri, Mar 31, 2017 at 11:27:03AM +0200, Ondrej Mosnacek wrote:
> The gf128mul_x_ble function is currently defined in gf128mul.c, because
> it depends on the gf128mul_table_be multiplication table.
> 
> However, since the function is very small and only uses two values from
> the table, it is better for it to be defined as inline function in
> gf128mul.h. That way, the function can be inlined by the compiler for
> better performance.
> 
> For consistency, the other gf128mul_x_* functions are also moved to the
> header file. In addition, the code is rewritten to be constant-time.
> 
> After this change, the speed of the generic 'xts(aes)' implementation
> increased from ~225 MiB/s to ~235 MiB/s (measured using 'cryptsetup
> benchmark -c aes-xts-plain64' on an Intel system with CRYPTO_AES_X86_64
> and CRYPTO_AES_NI_INTEL disabled).
> 
> Signed-off-by: Ondrej Mosnacek 
> Cc: Eric Biggers 

Reviewed-by: Eric Biggers 

Also, I realized that for gf128mul_x_lle() now that we aren't using the table we
don't need to shift '_tt' but rather can use the constant 0xe100:

/* equivalent to (u64)gf128mul_table_le[(b << 7) & 0xff] << 48
 * (see crypto/gf128mul.c): */
u64 _tt = gf128mul_mask_from_bit(b, 0) & 0xe100;

r->b = cpu_to_be64((b >> 1) | (a << 63));
r->a = cpu_to_be64((a >> 1) ^ _tt);

I think that would be better and you could send a v4 to do it that way if you
want.  It's not a huge deal though.

Thanks!

- Eric


Re: [PATCH] OF: mark released devices as no longer populated

2017-03-31 Thread Rob Herring
On Fri, Mar 31, 2017 at 10:23 AM, Horia Geantă  wrote:
> On 3/31/2017 1:40 PM, Russell King - ARM Linux wrote:
>> Ping, this issue still exists with 4.11-rc4 - and there's been no
>> reaction from the alleged CAAM maintainers.
>>
> Sorry, this somehow slipped through (Cc vs. To, no linux-crypto).
>
>> On Tue, Aug 09, 2016 at 11:48:38AM -0500, Rob Herring wrote:
>>> On Tue, Aug 9, 2016 at 4:33 AM, Russell King  
>>> wrote:
 When a Linux device is released and cleaned up, we left the OF device
 node marked as populated.  This causes the Freescale CAAM driver
 (drivers/crypto/caam) problems when the module is removed and re-
 inserted:

 JR0 Platform device creation error
 JR0 Platform device creation error
 caam 210.caam: no queues configured, terminating
 caam: probe of 210.caam failed with error -12

 The reason is that CAAM creates platform devices for each job ring:

 for_each_available_child_of_node(nprop, np)
 if (of_device_is_compatible(np, "fsl,sec-v4.0-job-ring") ||
 of_device_is_compatible(np, "fsl,sec4.0-job-ring")) {
 ctrlpriv->jrpdev[ring] =
 of_platform_device_create(np, NULL, dev);

 which sets OF_POPULATED on the device node, but then it cleans these
 up:

 /* Remove platform devices for JobRs */
 for (ring = 0; ring < ctrlpriv->total_jobrs; ring++) {
 if (ctrlpriv->jrpdev[ring])
 of_device_unregister(ctrlpriv->jrpdev[ring]);
>>>
>>> This looks a bit asymmetrical to me with a of_platform_device_* call
>>> and a of_device_* call.
>>>
>>> I think you could use of_platform_{de}populate here instead. That
>>> would simplify things in the driver a bit too as you wouldn't need to
>>> store jrpdev. It wouldn't work if there are other child nodes with
> Indeed, this would clean-up the driver a bit. However, the driver needs
> to know how many of the devices probed successfully - to print the
> number and more importantly to exit in case total_jobrs = 0.

The only thing you are guaranteed is the OF code created some platform
devices. That's it. Whether any driver probed successfully is separate
and a lot more things can go wrong there. The only thing you are
checking is that your dtb is not crap.

> Thus, I would keep the one-by-one probing of the devices.
> What options are there in this case?
> Should a function symmetric to of_platform_device_create() be added - to
> replace of_device_unregister() - or rely on an open-coded solution?

Certainly not the latter. We don't want drivers mucking with flags
internal to the DT code.

Rob


Re: [PATCH 1/6] virtio: wrap find_vqs

2017-03-31 Thread Michael S. Tsirkin
On Fri, Mar 31, 2017 at 12:04:55PM +0800, Jason Wang wrote:
> 
> 
> On 2017年03月30日 22:32, Michael S. Tsirkin wrote:
> > On Thu, Mar 30, 2017 at 02:00:08PM +0800, Jason Wang wrote:
> > > 
> > > On 2017年03月30日 04:48, Michael S. Tsirkin wrote:
> > > > We are going to add more parameters to find_vqs, let's wrap the call so
> > > > we don't need to tweak all drivers every time.
> > > > 
> > > > Signed-off-by: Michael S. Tsirkin
> > > > ---
> > > A quick glance and it looks ok, but what the benefit of this series, is it
> > > required by other changes?
> > > 
> > > Thanks
> > Yes - to avoid touching all devices when doing the rest of
> > the patchset.
> 
> Maybe I'm not clear. I mean the benefit of this series not this single
> patch. I guess it may be used by you proposal that avoid reset when set XDP?

In particular, yes. It generally simplifies things significantly if
we can get the true buffer size back.

> If yes, do we really want to drop some packets after XDP is set?
> 
> Thanks

We would rather not drop packets. We could detect and copy them to make
XDP work.

-- 
MST


Re: [PATCH] OF: mark released devices as no longer populated

2017-03-31 Thread Horia Geantă
On 3/31/2017 1:40 PM, Russell King - ARM Linux wrote:
> Ping, this issue still exists with 4.11-rc4 - and there's been no
> reaction from the alleged CAAM maintainers.
> 
Sorry, this somehow slipped through (Cc vs. To, no linux-crypto).

> On Tue, Aug 09, 2016 at 11:48:38AM -0500, Rob Herring wrote:
>> On Tue, Aug 9, 2016 at 4:33 AM, Russell King  
>> wrote:
>>> When a Linux device is released and cleaned up, we left the OF device
>>> node marked as populated.  This causes the Freescale CAAM driver
>>> (drivers/crypto/caam) problems when the module is removed and re-
>>> inserted:
>>>
>>> JR0 Platform device creation error
>>> JR0 Platform device creation error
>>> caam 210.caam: no queues configured, terminating
>>> caam: probe of 210.caam failed with error -12
>>>
>>> The reason is that CAAM creates platform devices for each job ring:
>>>
>>> for_each_available_child_of_node(nprop, np)
>>> if (of_device_is_compatible(np, "fsl,sec-v4.0-job-ring") ||
>>> of_device_is_compatible(np, "fsl,sec4.0-job-ring")) {
>>> ctrlpriv->jrpdev[ring] =
>>> of_platform_device_create(np, NULL, dev);
>>>
>>> which sets OF_POPULATED on the device node, but then it cleans these
>>> up:
>>>
>>> /* Remove platform devices for JobRs */
>>> for (ring = 0; ring < ctrlpriv->total_jobrs; ring++) {
>>> if (ctrlpriv->jrpdev[ring])
>>> of_device_unregister(ctrlpriv->jrpdev[ring]);
>>
>> This looks a bit asymmetrical to me with a of_platform_device_* call
>> and a of_device_* call.
>>
>> I think you could use of_platform_{de}populate here instead. That
>> would simplify things in the driver a bit too as you wouldn't need to
>> store jrpdev. It wouldn't work if there are other child nodes with
Indeed, this would clean-up the driver a bit. However, the driver needs
to know how many of the devices probed successfully - to print the
number and more importantly to exit in case total_jobrs = 0.

Thus, I would keep the one-by-one probing of the devices.
What options are there in this case?
Should a function symmetric to of_platform_device_create() be added - to
replace of_device_unregister() - or rely on an open-coded solution?

Thanks,
Horia

>> compatible strings which you don't want devices created.
>>
>>> }
>>>
>>> which leaves OF_POPULATED set.
>>>
>>> Arrange for platform devices with a device node to clear the
>>> OF_POPULATED bit when they are released.
>>>
>>> Signed-off-by: Russell King 
>>> ---
>>> Please check this carefully - it may have issues where an of_node
>>> pointer is copied from one platform device to another, but IMHO
>>> doing that is itself buggy behaviour.
>>
>> Agreed, that is wrong.
>>
>>>
>>> Resending due to wrong list address, sorry.
>>>
>>>  include/linux/of_device.h | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/include/linux/of_device.h b/include/linux/of_device.h
>>> index cc7dd687a89d..7a8362d0c6d2 100644
>>> --- a/include/linux/of_device.h
>>> +++ b/include/linux/of_device.h
>>> @@ -43,6 +43,7 @@ extern int of_device_uevent_modalias(struct device *dev, 
>>> struct kobj_uevent_env
>>>
>>>  static inline void of_device_node_put(struct device *dev)
>>>  {
>>> +   of_node_clear_flag(dev->of_node, OF_POPULATED);
>>
>> This would result in clearing the flag twice in the
>> of_platform_populate/of_platform_depopulate case. It would do the same
>> for other bus types like i2c as well. That doesn't really hurt
>> anything that I can think of, but just not the best implementation. I
>> think adding a of_platform_device_unregister() call that wraps
>> of_platform_device_destroy would be more balanced.
>>
>> I looked thru all the callers of of_platform_device_create. The only
>> other ones affected by this are:
>>
>> drivers/macintosh/ams/ams-core.c
>> drivers/macintosh/therm_adt746x.c
>> drivers/macintosh/therm_windtunnel.c
>>
>> The others either have no remove path or a buggy remove path.
>>
>> Rob
> 


Re: [PATCH v5 1/2] crypto: skcipher AF_ALG - overhaul memory management

2017-03-31 Thread Herbert Xu
On Thu, Mar 16, 2017 at 11:18:33AM +0100, Stephan Müller wrote:
> Am Donnerstag, 16. März 2017, 10:52:48 CET schrieb Herbert Xu:
> 
> Hi Herbert,
> 
> > First of all you're only limiting the amount of memory occupied
> > by the SG list which is not the same thing as the memory pinned
> > down by the actual recvmsg.
> 
> I am fully aware of that. As this was present in the code, I thought I could 
> reuse that approach.
> 
> Are you saying that you want to stop this approach? 

No you're confusing things.  Previously there was an explicit
limit on the number of pages that can be pinned.  Now you're
only indirectly limiting it by limiting the size of the metadata
through sock_kmalloc.

The end result is that you're now allowing a huge amount of user
memory to be pinned down by the system call.  This is *unacceptable*.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Crypto Fixes for 4.11

2017-03-31 Thread Herbert Xu
Hi Linus:

This push fixes the following issues:

- Memory corruption when kmalloc fails in xts/lrw.
- Mark some CCP DMA channels as private.
- Fix reordering race in padata.
- Regression in omap-rng DT description.


Please pull from

git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6.git linus


Eric Biggers (1):
  crypto: xts,lrw - fix out-of-bounds write after kmalloc failure

Gary R Hook (1):
  crypto: ccp - Make some CCP DMA channels private

Jason A. Donenfeld (1):
  padata: avoid race in reordering

Thomas Petazzoni (1):
  dt-bindings: rng: clocks property on omap_rng not always mandatory

 Documentation/devicetree/bindings/rng/omap_rng.txt |3 +-
 crypto/lrw.c   |7 +++-
 crypto/xts.c   |7 +++-
 drivers/crypto/ccp/ccp-dev-v5.c|1 +
 drivers/crypto/ccp/ccp-dev.h   |5 +++
 drivers/crypto/ccp/ccp-dmaengine.c |   41 
 kernel/padata.c|5 ++-
 7 files changed, 62 insertions(+), 7 deletions(-)

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[PATCH v3] crypto: gf128mul - define gf128mul_x_* in gf128mul.h

2017-03-31 Thread Ondrej Mosnacek
The gf128mul_x_ble function is currently defined in gf128mul.c, because
it depends on the gf128mul_table_be multiplication table.

However, since the function is very small and only uses two values from
the table, it is better for it to be defined as inline function in
gf128mul.h. That way, the function can be inlined by the compiler for
better performance.

For consistency, the other gf128mul_x_* functions are also moved to the
header file. In addition, the code is rewritten to be constant-time.

After this change, the speed of the generic 'xts(aes)' implementation
increased from ~225 MiB/s to ~235 MiB/s (measured using 'cryptsetup
benchmark -c aes-xts-plain64' on an Intel system with CRYPTO_AES_X86_64
and CRYPTO_AES_NI_INTEL disabled).

Signed-off-by: Ondrej Mosnacek 
Cc: Eric Biggers 
---
v2 -> v3: constant-time implementation
v1 -> v2: move all _x_ functions to the header, not just gf128mul_x_ble

 crypto/gf128mul.c | 33 +---
 include/crypto/gf128mul.h | 55 +--
 2 files changed, 54 insertions(+), 34 deletions(-)

diff --git a/crypto/gf128mul.c b/crypto/gf128mul.c
index 04facc0..dc01212 100644
--- a/crypto/gf128mul.c
+++ b/crypto/gf128mul.c
@@ -130,43 +130,12 @@ static const u16 gf128mul_table_le[256] = 
gf128mul_dat(xda_le);
 static const u16 gf128mul_table_be[256] = gf128mul_dat(xda_be);
 
 /*
- * The following functions multiply a field element by x or by x^8 in
+ * The following functions multiply a field element by x^8 in
  * the polynomial field representation.  They use 64-bit word operations
  * to gain speed but compensate for machine endianness and hence work
  * correctly on both styles of machine.
  */
 
-static void gf128mul_x_lle(be128 *r, const be128 *x)
-{
-   u64 a = be64_to_cpu(x->a);
-   u64 b = be64_to_cpu(x->b);
-   u64 _tt = gf128mul_table_le[(b << 7) & 0xff];
-
-   r->b = cpu_to_be64((b >> 1) | (a << 63));
-   r->a = cpu_to_be64((a >> 1) ^ (_tt << 48));
-}
-
-static void gf128mul_x_bbe(be128 *r, const be128 *x)
-{
-   u64 a = be64_to_cpu(x->a);
-   u64 b = be64_to_cpu(x->b);
-   u64 _tt = gf128mul_table_be[a >> 63];
-
-   r->a = cpu_to_be64((a << 1) | (b >> 63));
-   r->b = cpu_to_be64((b << 1) ^ _tt);
-}
-
-void gf128mul_x_ble(be128 *r, const be128 *x)
-{
-   u64 a = le64_to_cpu(x->a);
-   u64 b = le64_to_cpu(x->b);
-   u64 _tt = gf128mul_table_be[b >> 63];
-
-   r->a = cpu_to_le64((a << 1) ^ _tt);
-   r->b = cpu_to_le64((b << 1) | (a >> 63));
-}
-EXPORT_SYMBOL(gf128mul_x_ble);
-
 static void gf128mul_x8_lle(be128 *x)
 {
u64 a = be64_to_cpu(x->a);
diff --git a/include/crypto/gf128mul.h b/include/crypto/gf128mul.h
index 0bc9b5f..6e43be5 100644
--- a/include/crypto/gf128mul.h
+++ b/include/crypto/gf128mul.h
@@ -49,6 +49,7 @@
 #ifndef _CRYPTO_GF128MUL_H
 #define _CRYPTO_GF128MUL_H
 
+#include 
 #include 
 #include 
 
@@ -163,8 +164,58 @@ void gf128mul_lle(be128 *a, const be128 *b);
 
 void gf128mul_bbe(be128 *a, const be128 *b);
 
-/* multiply by x in ble format, needed by XTS */
-void gf128mul_x_ble(be128 *a, const be128 *b);
+/*
+ * The following functions multiply a field element by x in
+ * the polynomial field representation.  They use 64-bit word operations
+ * to gain speed but compensate for machine endianness and hence work
+ * correctly on both styles of machine.
+ *
+ * They are defined here for performance.
+ */
+
+static inline u64 gf128mul_mask_from_bit(u64 x, int which)
+{
+   /* a constant-time version of 'x & ((u64)1 << which) ? (u64)-1 : 0' */
+   return ((s64)(x << (63 - which)) >> 63);
+}
+
+static inline void gf128mul_x_lle(be128 *r, const be128 *x)
+{
+   u64 a = be64_to_cpu(x->a);
+   u64 b = be64_to_cpu(x->b);
+
+   /* equivalent to gf128mul_table_le[(b << 7) & 0xff] >> 8
+* (see crypto/gf128mul.c): */
+   u64 _tt = gf128mul_mask_from_bit(b, 0) & 0xe1;
+
+   r->b = cpu_to_be64((b >> 1) | (a << 63));
+   r->a = cpu_to_be64((a >> 1) ^ (_tt << 56));
+}
+
+static inline void gf128mul_x_bbe(be128 *r, const be128 *x)
+{
+   u64 a = be64_to_cpu(x->a);
+   u64 b = be64_to_cpu(x->b);
+
+   /* equivalent to gf128mul_table_be[a >> 63] (see crypto/gf128mul.c): */
+   u64 _tt = gf128mul_mask_from_bit(a, 63) & 0x87;
+
+   r->a = cpu_to_be64((a << 1) | (b >> 63));
+   r->b = cpu_to_be64((b << 1) ^ _tt);
+}
+
+/* needed by XTS */
+static inline void gf128mul_x_ble(be128 *r, const be128 *x)
+{
+   u64 a = le64_to_cpu(x->a);
+   u64 b = le64_to_cpu(x->b);
+
+   /* equivalent to gf128mul_table_be[b >> 63] (see crypto/gf128mul.c): */
+   u64 _tt = gf128mul_mask_from_bit(b, 63) & 0x87;
+
+   r->a = cpu_to_le64((a << 1) ^ _tt);
+   r->b = cpu_to_le64((b << 1) | (a >> 63));
+}
 
 /* 4k table optimization */
 
-- 
2.9.3



Re: [PATCH] crypto: gf128mul - define gf128mul_x_ble in gf128mul.h

2017-03-31 Thread Ondrej Mosnáček
Hi Jeff,

2017-03-31 8:05 GMT+02:00 Jeffrey Walton :
>>> Also note that '(b & ((u64)1 << 63)) ? 0x87 : 0x00;' is actually getting
>>> compiled as '((s64)b >> 63) & 0x87', which is branchless and therefore 
>>> makes the
>>> new version more efficient than one might expect:
>>>
>>> sar$0x3f,%rax
>>> and$0x87,%eax
>>>
>>> It could even be written the branchless way explicitly, but it shouldn't 
>>> matter.
>>
>> I think the definition using unsigned operations is more intuitive...
>> Let's just leave the clever tricks up to the compiler :)
>
> It may be a good idea to use the one that provides constant time-ness
> to help avoid leaking information.

That's a good point... I played around with various ways to write the
expression in Compiler Explorer [1] and indeed GCC fails to produce
constant-time code from my version on some architectures (e.g. the
32-bit ARM). The version with an explicit arithmetic right shift seems
to produce the most efficient code across platforms, so I'll rewrite
it like that for v3.

Thanks,
O.M.

[1] https://gcc.godbolt.org/


Re: [PATCH] crypto: gf128mul - define gf128mul_x_ble in gf128mul.h

2017-03-31 Thread Jeffrey Walton
>> Also note that '(b & ((u64)1 << 63)) ? 0x87 : 0x00;' is actually getting
>> compiled as '((s64)b >> 63) & 0x87', which is branchless and therefore makes 
>> the
>> new version more efficient than one might expect:
>>
>> sar$0x3f,%rax
>> and$0x87,%eax
>>
>> It could even be written the branchless way explicitly, but it shouldn't 
>> matter.
>
> I think the definition using unsigned operations is more intuitive...
> Let's just leave the clever tricks up to the compiler :)

It may be a good idea to use the one that provides constant time-ness
to help avoid leaking information.

Jeff