Re: [PATCH v3] crypto: gf128mul - define gf128mul_x_* in gf128mul.h
On Fri, Mar 31, 2017 at 11:27:03AM +0200, Ondrej Mosnacek wrote: > The gf128mul_x_ble function is currently defined in gf128mul.c, because > it depends on the gf128mul_table_be multiplication table. > > However, since the function is very small and only uses two values from > the table, it is better for it to be defined as inline function in > gf128mul.h. That way, the function can be inlined by the compiler for > better performance. > > For consistency, the other gf128mul_x_* functions are also moved to the > header file. In addition, the code is rewritten to be constant-time. > > After this change, the speed of the generic 'xts(aes)' implementation > increased from ~225 MiB/s to ~235 MiB/s (measured using 'cryptsetup > benchmark -c aes-xts-plain64' on an Intel system with CRYPTO_AES_X86_64 > and CRYPTO_AES_NI_INTEL disabled). > > Signed-off-by: Ondrej Mosnacek> Cc: Eric Biggers Reviewed-by: Eric Biggers Also, I realized that for gf128mul_x_lle() now that we aren't using the table we don't need to shift '_tt' but rather can use the constant 0xe100: /* equivalent to (u64)gf128mul_table_le[(b << 7) & 0xff] << 48 * (see crypto/gf128mul.c): */ u64 _tt = gf128mul_mask_from_bit(b, 0) & 0xe100; r->b = cpu_to_be64((b >> 1) | (a << 63)); r->a = cpu_to_be64((a >> 1) ^ _tt); I think that would be better and you could send a v4 to do it that way if you want. It's not a huge deal though. Thanks! - Eric
Re: [PATCH] OF: mark released devices as no longer populated
On Fri, Mar 31, 2017 at 10:23 AM, Horia Geantăwrote: > On 3/31/2017 1:40 PM, Russell King - ARM Linux wrote: >> Ping, this issue still exists with 4.11-rc4 - and there's been no >> reaction from the alleged CAAM maintainers. >> > Sorry, this somehow slipped through (Cc vs. To, no linux-crypto). > >> On Tue, Aug 09, 2016 at 11:48:38AM -0500, Rob Herring wrote: >>> On Tue, Aug 9, 2016 at 4:33 AM, Russell King >>> wrote: When a Linux device is released and cleaned up, we left the OF device node marked as populated. This causes the Freescale CAAM driver (drivers/crypto/caam) problems when the module is removed and re- inserted: JR0 Platform device creation error JR0 Platform device creation error caam 210.caam: no queues configured, terminating caam: probe of 210.caam failed with error -12 The reason is that CAAM creates platform devices for each job ring: for_each_available_child_of_node(nprop, np) if (of_device_is_compatible(np, "fsl,sec-v4.0-job-ring") || of_device_is_compatible(np, "fsl,sec4.0-job-ring")) { ctrlpriv->jrpdev[ring] = of_platform_device_create(np, NULL, dev); which sets OF_POPULATED on the device node, but then it cleans these up: /* Remove platform devices for JobRs */ for (ring = 0; ring < ctrlpriv->total_jobrs; ring++) { if (ctrlpriv->jrpdev[ring]) of_device_unregister(ctrlpriv->jrpdev[ring]); >>> >>> This looks a bit asymmetrical to me with a of_platform_device_* call >>> and a of_device_* call. >>> >>> I think you could use of_platform_{de}populate here instead. That >>> would simplify things in the driver a bit too as you wouldn't need to >>> store jrpdev. It wouldn't work if there are other child nodes with > Indeed, this would clean-up the driver a bit. However, the driver needs > to know how many of the devices probed successfully - to print the > number and more importantly to exit in case total_jobrs = 0. The only thing you are guaranteed is the OF code created some platform devices. That's it. Whether any driver probed successfully is separate and a lot more things can go wrong there. The only thing you are checking is that your dtb is not crap. > Thus, I would keep the one-by-one probing of the devices. > What options are there in this case? > Should a function symmetric to of_platform_device_create() be added - to > replace of_device_unregister() - or rely on an open-coded solution? Certainly not the latter. We don't want drivers mucking with flags internal to the DT code. Rob
Re: [PATCH 1/6] virtio: wrap find_vqs
On Fri, Mar 31, 2017 at 12:04:55PM +0800, Jason Wang wrote: > > > On 2017年03月30日 22:32, Michael S. Tsirkin wrote: > > On Thu, Mar 30, 2017 at 02:00:08PM +0800, Jason Wang wrote: > > > > > > On 2017年03月30日 04:48, Michael S. Tsirkin wrote: > > > > We are going to add more parameters to find_vqs, let's wrap the call so > > > > we don't need to tweak all drivers every time. > > > > > > > > Signed-off-by: Michael S. Tsirkin> > > > --- > > > A quick glance and it looks ok, but what the benefit of this series, is it > > > required by other changes? > > > > > > Thanks > > Yes - to avoid touching all devices when doing the rest of > > the patchset. > > Maybe I'm not clear. I mean the benefit of this series not this single > patch. I guess it may be used by you proposal that avoid reset when set XDP? In particular, yes. It generally simplifies things significantly if we can get the true buffer size back. > If yes, do we really want to drop some packets after XDP is set? > > Thanks We would rather not drop packets. We could detect and copy them to make XDP work. -- MST
Re: [PATCH] OF: mark released devices as no longer populated
On 3/31/2017 1:40 PM, Russell King - ARM Linux wrote: > Ping, this issue still exists with 4.11-rc4 - and there's been no > reaction from the alleged CAAM maintainers. > Sorry, this somehow slipped through (Cc vs. To, no linux-crypto). > On Tue, Aug 09, 2016 at 11:48:38AM -0500, Rob Herring wrote: >> On Tue, Aug 9, 2016 at 4:33 AM, Russell King>> wrote: >>> When a Linux device is released and cleaned up, we left the OF device >>> node marked as populated. This causes the Freescale CAAM driver >>> (drivers/crypto/caam) problems when the module is removed and re- >>> inserted: >>> >>> JR0 Platform device creation error >>> JR0 Platform device creation error >>> caam 210.caam: no queues configured, terminating >>> caam: probe of 210.caam failed with error -12 >>> >>> The reason is that CAAM creates platform devices for each job ring: >>> >>> for_each_available_child_of_node(nprop, np) >>> if (of_device_is_compatible(np, "fsl,sec-v4.0-job-ring") || >>> of_device_is_compatible(np, "fsl,sec4.0-job-ring")) { >>> ctrlpriv->jrpdev[ring] = >>> of_platform_device_create(np, NULL, dev); >>> >>> which sets OF_POPULATED on the device node, but then it cleans these >>> up: >>> >>> /* Remove platform devices for JobRs */ >>> for (ring = 0; ring < ctrlpriv->total_jobrs; ring++) { >>> if (ctrlpriv->jrpdev[ring]) >>> of_device_unregister(ctrlpriv->jrpdev[ring]); >> >> This looks a bit asymmetrical to me with a of_platform_device_* call >> and a of_device_* call. >> >> I think you could use of_platform_{de}populate here instead. That >> would simplify things in the driver a bit too as you wouldn't need to >> store jrpdev. It wouldn't work if there are other child nodes with Indeed, this would clean-up the driver a bit. However, the driver needs to know how many of the devices probed successfully - to print the number and more importantly to exit in case total_jobrs = 0. Thus, I would keep the one-by-one probing of the devices. What options are there in this case? Should a function symmetric to of_platform_device_create() be added - to replace of_device_unregister() - or rely on an open-coded solution? Thanks, Horia >> compatible strings which you don't want devices created. >> >>> } >>> >>> which leaves OF_POPULATED set. >>> >>> Arrange for platform devices with a device node to clear the >>> OF_POPULATED bit when they are released. >>> >>> Signed-off-by: Russell King >>> --- >>> Please check this carefully - it may have issues where an of_node >>> pointer is copied from one platform device to another, but IMHO >>> doing that is itself buggy behaviour. >> >> Agreed, that is wrong. >> >>> >>> Resending due to wrong list address, sorry. >>> >>> include/linux/of_device.h | 1 + >>> 1 file changed, 1 insertion(+) >>> >>> diff --git a/include/linux/of_device.h b/include/linux/of_device.h >>> index cc7dd687a89d..7a8362d0c6d2 100644 >>> --- a/include/linux/of_device.h >>> +++ b/include/linux/of_device.h >>> @@ -43,6 +43,7 @@ extern int of_device_uevent_modalias(struct device *dev, >>> struct kobj_uevent_env >>> >>> static inline void of_device_node_put(struct device *dev) >>> { >>> + of_node_clear_flag(dev->of_node, OF_POPULATED); >> >> This would result in clearing the flag twice in the >> of_platform_populate/of_platform_depopulate case. It would do the same >> for other bus types like i2c as well. That doesn't really hurt >> anything that I can think of, but just not the best implementation. I >> think adding a of_platform_device_unregister() call that wraps >> of_platform_device_destroy would be more balanced. >> >> I looked thru all the callers of of_platform_device_create. The only >> other ones affected by this are: >> >> drivers/macintosh/ams/ams-core.c >> drivers/macintosh/therm_adt746x.c >> drivers/macintosh/therm_windtunnel.c >> >> The others either have no remove path or a buggy remove path. >> >> Rob >
Re: [PATCH v5 1/2] crypto: skcipher AF_ALG - overhaul memory management
On Thu, Mar 16, 2017 at 11:18:33AM +0100, Stephan Müller wrote: > Am Donnerstag, 16. März 2017, 10:52:48 CET schrieb Herbert Xu: > > Hi Herbert, > > > First of all you're only limiting the amount of memory occupied > > by the SG list which is not the same thing as the memory pinned > > down by the actual recvmsg. > > I am fully aware of that. As this was present in the code, I thought I could > reuse that approach. > > Are you saying that you want to stop this approach? No you're confusing things. Previously there was an explicit limit on the number of pages that can be pinned. Now you're only indirectly limiting it by limiting the size of the metadata through sock_kmalloc. The end result is that you're now allowing a huge amount of user memory to be pinned down by the system call. This is *unacceptable*. Cheers, -- Email: Herbert XuHome Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Crypto Fixes for 4.11
Hi Linus: This push fixes the following issues: - Memory corruption when kmalloc fails in xts/lrw. - Mark some CCP DMA channels as private. - Fix reordering race in padata. - Regression in omap-rng DT description. Please pull from git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6.git linus Eric Biggers (1): crypto: xts,lrw - fix out-of-bounds write after kmalloc failure Gary R Hook (1): crypto: ccp - Make some CCP DMA channels private Jason A. Donenfeld (1): padata: avoid race in reordering Thomas Petazzoni (1): dt-bindings: rng: clocks property on omap_rng not always mandatory Documentation/devicetree/bindings/rng/omap_rng.txt |3 +- crypto/lrw.c |7 +++- crypto/xts.c |7 +++- drivers/crypto/ccp/ccp-dev-v5.c|1 + drivers/crypto/ccp/ccp-dev.h |5 +++ drivers/crypto/ccp/ccp-dmaengine.c | 41 kernel/padata.c|5 ++- 7 files changed, 62 insertions(+), 7 deletions(-) Thanks, -- Email: Herbert XuHome Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
[PATCH v3] crypto: gf128mul - define gf128mul_x_* in gf128mul.h
The gf128mul_x_ble function is currently defined in gf128mul.c, because it depends on the gf128mul_table_be multiplication table. However, since the function is very small and only uses two values from the table, it is better for it to be defined as inline function in gf128mul.h. That way, the function can be inlined by the compiler for better performance. For consistency, the other gf128mul_x_* functions are also moved to the header file. In addition, the code is rewritten to be constant-time. After this change, the speed of the generic 'xts(aes)' implementation increased from ~225 MiB/s to ~235 MiB/s (measured using 'cryptsetup benchmark -c aes-xts-plain64' on an Intel system with CRYPTO_AES_X86_64 and CRYPTO_AES_NI_INTEL disabled). Signed-off-by: Ondrej MosnacekCc: Eric Biggers --- v2 -> v3: constant-time implementation v1 -> v2: move all _x_ functions to the header, not just gf128mul_x_ble crypto/gf128mul.c | 33 +--- include/crypto/gf128mul.h | 55 +-- 2 files changed, 54 insertions(+), 34 deletions(-) diff --git a/crypto/gf128mul.c b/crypto/gf128mul.c index 04facc0..dc01212 100644 --- a/crypto/gf128mul.c +++ b/crypto/gf128mul.c @@ -130,43 +130,12 @@ static const u16 gf128mul_table_le[256] = gf128mul_dat(xda_le); static const u16 gf128mul_table_be[256] = gf128mul_dat(xda_be); /* - * The following functions multiply a field element by x or by x^8 in + * The following functions multiply a field element by x^8 in * the polynomial field representation. They use 64-bit word operations * to gain speed but compensate for machine endianness and hence work * correctly on both styles of machine. */ -static void gf128mul_x_lle(be128 *r, const be128 *x) -{ - u64 a = be64_to_cpu(x->a); - u64 b = be64_to_cpu(x->b); - u64 _tt = gf128mul_table_le[(b << 7) & 0xff]; - - r->b = cpu_to_be64((b >> 1) | (a << 63)); - r->a = cpu_to_be64((a >> 1) ^ (_tt << 48)); -} - -static void gf128mul_x_bbe(be128 *r, const be128 *x) -{ - u64 a = be64_to_cpu(x->a); - u64 b = be64_to_cpu(x->b); - u64 _tt = gf128mul_table_be[a >> 63]; - - r->a = cpu_to_be64((a << 1) | (b >> 63)); - r->b = cpu_to_be64((b << 1) ^ _tt); -} - -void gf128mul_x_ble(be128 *r, const be128 *x) -{ - u64 a = le64_to_cpu(x->a); - u64 b = le64_to_cpu(x->b); - u64 _tt = gf128mul_table_be[b >> 63]; - - r->a = cpu_to_le64((a << 1) ^ _tt); - r->b = cpu_to_le64((b << 1) | (a >> 63)); -} -EXPORT_SYMBOL(gf128mul_x_ble); - static void gf128mul_x8_lle(be128 *x) { u64 a = be64_to_cpu(x->a); diff --git a/include/crypto/gf128mul.h b/include/crypto/gf128mul.h index 0bc9b5f..6e43be5 100644 --- a/include/crypto/gf128mul.h +++ b/include/crypto/gf128mul.h @@ -49,6 +49,7 @@ #ifndef _CRYPTO_GF128MUL_H #define _CRYPTO_GF128MUL_H +#include #include #include @@ -163,8 +164,58 @@ void gf128mul_lle(be128 *a, const be128 *b); void gf128mul_bbe(be128 *a, const be128 *b); -/* multiply by x in ble format, needed by XTS */ -void gf128mul_x_ble(be128 *a, const be128 *b); +/* + * The following functions multiply a field element by x in + * the polynomial field representation. They use 64-bit word operations + * to gain speed but compensate for machine endianness and hence work + * correctly on both styles of machine. + * + * They are defined here for performance. + */ + +static inline u64 gf128mul_mask_from_bit(u64 x, int which) +{ + /* a constant-time version of 'x & ((u64)1 << which) ? (u64)-1 : 0' */ + return ((s64)(x << (63 - which)) >> 63); +} + +static inline void gf128mul_x_lle(be128 *r, const be128 *x) +{ + u64 a = be64_to_cpu(x->a); + u64 b = be64_to_cpu(x->b); + + /* equivalent to gf128mul_table_le[(b << 7) & 0xff] >> 8 +* (see crypto/gf128mul.c): */ + u64 _tt = gf128mul_mask_from_bit(b, 0) & 0xe1; + + r->b = cpu_to_be64((b >> 1) | (a << 63)); + r->a = cpu_to_be64((a >> 1) ^ (_tt << 56)); +} + +static inline void gf128mul_x_bbe(be128 *r, const be128 *x) +{ + u64 a = be64_to_cpu(x->a); + u64 b = be64_to_cpu(x->b); + + /* equivalent to gf128mul_table_be[a >> 63] (see crypto/gf128mul.c): */ + u64 _tt = gf128mul_mask_from_bit(a, 63) & 0x87; + + r->a = cpu_to_be64((a << 1) | (b >> 63)); + r->b = cpu_to_be64((b << 1) ^ _tt); +} + +/* needed by XTS */ +static inline void gf128mul_x_ble(be128 *r, const be128 *x) +{ + u64 a = le64_to_cpu(x->a); + u64 b = le64_to_cpu(x->b); + + /* equivalent to gf128mul_table_be[b >> 63] (see crypto/gf128mul.c): */ + u64 _tt = gf128mul_mask_from_bit(b, 63) & 0x87; + + r->a = cpu_to_le64((a << 1) ^ _tt); + r->b = cpu_to_le64((b << 1) | (a >> 63)); +} /* 4k table optimization */ -- 2.9.3
Re: [PATCH] crypto: gf128mul - define gf128mul_x_ble in gf128mul.h
Hi Jeff, 2017-03-31 8:05 GMT+02:00 Jeffrey Walton: >>> Also note that '(b & ((u64)1 << 63)) ? 0x87 : 0x00;' is actually getting >>> compiled as '((s64)b >> 63) & 0x87', which is branchless and therefore >>> makes the >>> new version more efficient than one might expect: >>> >>> sar$0x3f,%rax >>> and$0x87,%eax >>> >>> It could even be written the branchless way explicitly, but it shouldn't >>> matter. >> >> I think the definition using unsigned operations is more intuitive... >> Let's just leave the clever tricks up to the compiler :) > > It may be a good idea to use the one that provides constant time-ness > to help avoid leaking information. That's a good point... I played around with various ways to write the expression in Compiler Explorer [1] and indeed GCC fails to produce constant-time code from my version on some architectures (e.g. the 32-bit ARM). The version with an explicit arithmetic right shift seems to produce the most efficient code across platforms, so I'll rewrite it like that for v3. Thanks, O.M. [1] https://gcc.godbolt.org/
Re: [PATCH] crypto: gf128mul - define gf128mul_x_ble in gf128mul.h
>> Also note that '(b & ((u64)1 << 63)) ? 0x87 : 0x00;' is actually getting >> compiled as '((s64)b >> 63) & 0x87', which is branchless and therefore makes >> the >> new version more efficient than one might expect: >> >> sar$0x3f,%rax >> and$0x87,%eax >> >> It could even be written the branchless way explicitly, but it shouldn't >> matter. > > I think the definition using unsigned operations is more intuitive... > Let's just leave the clever tricks up to the compiler :) It may be a good idea to use the one that provides constant time-ness to help avoid leaking information. Jeff