Re: [PATCH net-next v2 6/9] net: macsec: hardware offloading infrastructure

2019-08-20 Thread Sabrina Dubroca
2019-08-20, 12:01:40 +0200, Antoine Tenart wrote:
> So it seems the ability to enable or disable the offloading on a given
> interface is the main missing feature. I'll add that, however I'll
> probably (at least at first):
> 
> - Have the interface to be fully offloaded or fully handled in s/w (with
>   errors being thrown if a given configuration isn't supported). Having
>   both at the same time on a given interface would be tricky because of
>   the MACsec validation parameter.
> 
> - Won't allow to enable/disable the offloading if there are rules in
>   place, as we're not sure the same rules would be accepted by the other
>   implementation.

That's probably quite problematic actually, because to do that you
need to be able to resync the state between software and hardware,
particularly packet numbers. So, yeah, we're better off having it
completely blocked until we have a working implementation (if that
ever happens).

However, that means in case the user wants to set up something that's
not offloadable, they'll have to:
 - configure the offloaded version until EOPNOTSUPP
 - tear everything down
 - restart from scratch without offloading

That's inconvenient.

Talking about packet numbers, can you describe how PN exhaustion is
handled?  I couldn't find much about packet numbers at all in the
driver patches (I hope the hw doesn't wrap around from 2^32-1 to 0 on
the same SA).  At some point userspace needs to know that we're
getting close to 2^32 and that it's time to re-key.  Since the whole
TX path of the software implementation is bypassed, it looks like the
PN (as far as drivers/net/macsec.c is concerned) never increases, so
userspace can't know when to negotiate a new SA.
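
To make the concern concrete, here is a minimal sketch of the PN bookkeeping
that something has to do when the software TX path is bypassed. It is not
taken from drivers/net/macsec.c or from the driver patches: the names
pn_take(), pn_needs_rekey() and PN_REKEY_THRESHOLD are hypothetical, and the
threshold value is an arbitrary assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold: ask for a re-key once 3/4 of the PN space is used. */
#define PN_REKEY_THRESHOLD 0xC0000000u

/*
 * Take the PN for the next packet on this SA.
 * Returns 0 (never a valid PN) once the SA is exhausted, so the caller
 * has to stop transmitting on it instead of reusing PN values.
 */
static uint32_t pn_take(uint32_t *next_pn)
{
	uint32_t pn = *next_pn;

	if (pn == 0)
		return 0;		/* already exhausted */

	/* Unsigned wrap from 2^32-1 to 0 marks the SA as exhausted. */
	*next_pn = pn + 1;
	return pn;
}

/* True when userspace should be told to negotiate a new SA. */
static bool pn_needs_rekey(uint32_t next_pn)
{
	return next_pn == 0 || next_pn >= PN_REKEY_THRESHOLD;
}
```

With offloading, the counter lives in the hardware, so the equivalent of
pn_needs_rekey() would have to be fed from a value the driver reports back.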

> I'm not sure if we should allow to mix the implementations on a given
> physical interface (by having two MACsec interfaces attached) as the
> validation would be impossible to do (we would have no idea if a
> packet was correctly handled by the offloading part or just not being
> a MACsec packet in the first place, in Rx).

That's something that really bothers me with this proposal. Quoting
from the commit message:

> the packets seen by the networking stack on both the physical and
> MACsec virtual interface are exactly the same

If the HW/driver is expected to strip the sectag, I don't see how we
could ever have multiple secy's at the same time and demultiplex
properly between them. That's part of the reason why I chose to have
virtual interfaces: without them, picking the right secy on TX gets
weird.

AFAICT, it means that any filters (tc, tcpdump) need to be different
between offloaded and non-offloaded cases.

And it's going to be confusing to the administrator when they look at
tcpdumps expecting to see MACsec frames.

I don't see how this implementation handles non-macsec traffic (on TX,
that would be packets sent directly through the real interface, for
example by wpa_supplicant - on RX, incoming MKA traffic for
wpa_supplicant). Unless I missed something, incoming MKA traffic will
end up on the macsec interface as well as the lower interface (not
entirely critical, as long as wpa_supplicant can grab it on the lower
device, but not consistent with the software implementation). How does
the driver distinguish traffic that should pass through unmodified
from traffic that the HW needs to encapsulate and encrypt?

If you look at IPsec offloading, the networking stack builds up the
ESP header, and passes the unencrypted data down to the driver. I'm
wondering if the same would be possible with MACsec offloading: the
macsec virtual interface adds the header (and maybe a dummy ICV), and
then the HW does the encryption. In case of HW that needs to add the
sectag itself, the driver would first strip the headers that the stack
created. On receive, the driver would recreate a sectag and the macsec
interface would just skip all verification (decrypt, PN).

-- 
Sabrina


Re: [PATCH net-next v2 6/9] net: macsec: hardware offloading infrastructure

2019-08-16 Thread Sabrina Dubroca
2019-08-13, 16:18:40 +0000, Igor Russkikh wrote:
> On 13.08.2019 16:17, Andrew Lunn wrote:
> > On Tue, Aug 13, 2019 at 10:58:17AM +0200, Antoine Tenart wrote:
> >> I think this question is linked to the use of a MACsec virtual interface
> >> when using h/w offloading. The starting point for me was that I wanted
> >> to reuse the data structures and the API exposed to the userspace by the
> >> s/w implementation of MACsec. I then had two choices: keeping the exact
> >> same interface for the user (having a virtual MACsec interface), or
> >> registering the MACsec genl ops onto the real net devices (and making
> >> the s/w implementation a virtual net dev and a provider of the MACsec
> >> "offloading" ops).
> >>
> >> The advantages of the first option were that nearly all the logic of the
> >> s/w implementation could be kept and especially that it would be
> >> transparent for the user to use both implementations of MACsec.
> > 
> > Hi Antoine
> > 
> > We have always talked about offloading operations to the hardware,
> > accelerating what the linux stack can do by making use of hardware
> > accelerators. The basic user API should not change because of
> > acceleration. Those are the general guidelines.
> > 
> > It would however be interesting to get comments from those who did the
> > software implementation and what they think of this architecture. I've
> > no personal experience with MACSec, so it is hard for me to say if the
> > current architecture makes sense when using accelerators.
> 
> In terms of overall concepts, I'd add the following:
> 
> 1) With current implementation it's impossible to install SW macsec engine onto
> the device which supports HW offload.

You mean how it's implemented in this patchset?

> That could be a strong limitation in
> cases when user sees HW macsec offload is broken or work differently, and he/she
> wants to replace it with SW one.

Agreed, I think an offload that cannot be disabled is quite problematic.

> MACSec is a complex feature, and it may happen something is missing in HW.
> Trivial example is 256bit encryption, which is not always a musthave in HW
> implementations.

+1

> 2) I think, Antoine, it's not totally true that otherwise the user macsec API
> will be broken/changed. netlink api is the same, the only thing we may want to
> add is an optional parameter to force selection of SW macsec engine.

Yes, I think we need an offload on/off parameter (and IMO it should
probably be off by default). Then, if offloading is requested but
cannot be satisfied (unsupported key length, too many SAs, etc), or if
incompatible settings are requested (mixing offloaded and
non-offloaded SCs on a device that cannot do it), return an error.

If we also export that offload parameter during netlink dumps, we can
inspect the state of the system, which helps for debugging.
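
A rough sketch of the behaviour described above, assuming a hypothetical
capability structure; none of these names (offload_caps, offload_request,
check_offload_request) exist in the kernel or in the patchset. The point is
only that offload is opt-in and that an unsatisfiable request fails instead
of silently falling back.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical description of what a device is able to offload. */
struct offload_caps {
	bool supported;
	unsigned int max_key_len;	/* in bytes */
	unsigned int max_sas;
};

/* Hypothetical user request; 'offload' is the on/off parameter. */
struct offload_request {
	bool offload;
	unsigned int key_len;		/* in bytes */
	unsigned int num_sas;
};

/* Returns 0 if the request can be honoured, -EOPNOTSUPP otherwise. */
static int check_offload_request(const struct offload_caps *caps,
				 const struct offload_request *req)
{
	if (!req->offload)
		return 0;	/* software implementation, nothing to check */

	if (!caps->supported ||
	    req->key_len > caps->max_key_len ||
	    req->num_sas > caps->max_sas)
		return -EOPNOTSUPP;

	return 0;
}
```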

> I'm also eager to hear from sw macsec users/devs on whats better here.

I don't do much development on MACsec these days, and I don't
personally use it outside of testing and development.

-- 
Sabrina


Re: [PATCH net-next v2 6/9] net: macsec: hardware offloading infrastructure

2019-08-16 Thread Sabrina Dubroca
2019-08-13, 18:28:23 +0200, Andrew Lunn wrote:
> > 1) With current implementation it's impossible to install SW macsec engine onto
> > the device which supports HW offload. That could be a strong limitation in
> > cases when user sees HW macsec offload is broken or work differently, and he/she
> > wants to replace it with SW one.
> > MACSec is a complex feature, and it may happen something is missing in HW.
> > Trivial example is 256bit encryption, which is not always a musthave in HW
> > implementations.
> 
> Ideally, we want the driver to return EOPNOTSUPP if it does not
> support something and the software implementation should be used.
> 
> If the offload is broken, we want a bug report! And if it works
> differently, it suggests there is also a bug we need to fix, or the
> standard is ambiguous.

Yes. But in the meantime, we want the user to be able to disable the
offload. It's helpful for debugging purposes, and it can provide some
level of functionality until the bug is fixed or non-buggy hardware
becomes available.

> It would also be nice to add extra information to the netlink API to
> indicate if HW or SW is being used. In other places where we offload
> to accelerators we have such additional information.

+1

-- 
Sabrina


Re: [PATCH net-next v2 6/9] net: macsec: hardware offloading infrastructure

2019-08-16 Thread Sabrina Dubroca
2019-08-13, 10:58:17 +0200, Antoine Tenart wrote:
> Hi Igor,
> 
> On Sat, Aug 10, 2019 at 01:20:32PM +0000, Igor Russkikh wrote:
> > On 08.08.2019 17:05, Antoine Tenart wrote:
> > 
> > > The Rx and TX handlers are modified to take into account the special case
> > > where the MACsec transformation happens in the hardware, whether in a PHY
> > > or in a MAC, as the packets seen by the networking stack on both the
> > 
> > Don't you think we may eventually need xmit / handle_frame ops to be
> > a part of macsec_ops?
> > 
> > That way software macsec could be extracted to just another type of offload.
> > The drawback of current code is it doesn't show explicitly the path of
> > offloaded packets. It is hidden in `handle_not_macsec` and in
> > `macsec_start_xmit` branch. This makes incorrect counters to tick (see my
> > comment below)
> > 
> > Another thing is that both xmit / macsec_handle_frame can't now be customized
> > by device driver. But this may be required.
> > We for example have usecases and HW features to allow specific flows to bypass
> > macsec encryption. This is normally used for macsec key control protocols,
> > identified by ethertype. Your phy is also capable on that as I see.
> 
> I think this question is linked to the use of a MACsec virtual interface
> when using h/w offloading. The starting point for me was that I wanted
> to reuse the data structures and the API exposed to the userspace by the
> s/w implementation of MACsec. I then had two choices: keeping the exact
> same interface for the user (having a virtual MACsec interface), or

Unless it's really infeasible, yes, that's how things should be done IMO.

> registering the MACsec genl ops onto the real net devices (and making
> the s/w implementation a virtual net dev and a provider of the MACsec
> "offloading" ops).

Please, no :( Let's keep it as close as possible to the software
implementation, unless there's a really good reason not to. It's not
just "ip macsec" btw, wpa_supplicant can also configure MACsec and
would also need some logic to pick the device on which to do the genl
operations in that case.

> The advantages of the first option were that nearly all the logic of the
> s/w implementation could be kept and especially that it would be
> transparent for the user to use both implementations of MACsec. But this
> raised an issue as I had to modify the xmit / handle_frame ops to let
> all the traffic pass. This is because we have no way of knowing if a
> frame was handled by the MACsec h/w or not in ingress. So the virtual
> interface here only serve as the entrypoint for the API...

It's also the interface on which you'll run DHCP or install IP addresses.

> The second option would have the advantage to better represent the actual
> flow, but the way of configuring MACsec would be a bit different for the
> user, whether he wants to use s/w or h/w MACsec. If we were to do this I
> think we could extract the genl functions from the MACsec s/w
> implementation, and let it implement the MACsec ops (exactly as the
> offloading drivers).
> 
> I'm open to discussing this :)
> 
> As for the need for xmit / handle_frame ops (for a MAC w/ MACsec
> offloading), I'd say the xmit / handle_frame ops of the real net device
> driver could be used as the one of the MACsec virtual interface do not
> do much (regardless of the implementation choice discussed above).

There's no "handle_frame" op on a real device. macsec_handle_frame is
an rx_handler specificity that grabs packets from a real device and
sends them into a virtual device stacked on top of it. A real device
just hands packets over to the stack via NAPI.


> > > @@ -2546,11 +2814,15 @@ static netdev_tx_t macsec_start_xmit(struct sk_buff *skb,
> > >  {
> > >   struct macsec_dev *macsec = netdev_priv(dev);
> > >   struct macsec_secy *secy = &macsec->secy;
> > > + struct macsec_tx_sc *tx_sc = &secy->tx_sc;
> > >   struct pcpu_secy_stats *secy_stats;
> > > + struct macsec_tx_sa *tx_sa;
> > >   int ret, len;
> > >  
> > > + tx_sa = macsec_txsa_get(tx_sc->sa[tx_sc->encoding_sa]);
> > 
> > Declared, but not used?
> 
> I'll remove it then.

That's also a refcount leak, so, yes, please get rid of it.
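
For readers wondering why an unused variable is a leak here: the "get"
helper takes a reference that must be paired with a "put". A toy model,
with hypothetical names (toy_sa, toy_sa_get, toy_sa_put) standing in for
the real macsec helpers and a plain int instead of a real refcount type:

```c
/* Toy stand-in for a reference-counted SA; not the kernel structure. */
struct toy_sa {
	int refcnt;
};

/* Take a reference, in the spirit of a *_get() helper. */
static struct toy_sa *toy_sa_get(struct toy_sa *sa)
{
	if (sa)
		sa->refcnt++;
	return sa;
}

/* Drop a reference, in the spirit of a *_put() helper. */
static void toy_sa_put(struct toy_sa *sa)
{
	if (sa)
		sa->refcnt--;
}

/* Shape of the quoted hunk: reference taken, never used, never dropped. */
static void xmit_leaky(struct toy_sa *sa)
{
	struct toy_sa *ref = toy_sa_get(sa);

	(void)ref;	/* every call leaves refcnt one higher */
}

/* Balanced version: the reference is released before returning. */
static void xmit_balanced(struct toy_sa *sa)
{
	struct toy_sa *ref = toy_sa_get(sa);

	toy_sa_put(ref);
}
```

In the leaky variant the count never returns to its starting value, so the
object could never be freed.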

[I'll answer the rest of the patch separately]

-- 
Sabrina


Re: linux-next: Signed-off-by missing for commits in the net-next tree

2018-12-10 Thread Sabrina Dubroca
2018-12-10, 14:35:00 +0100, Andrew Lunn wrote:
> > The problem here is the '--' delimiter, Andrew should have either
> > used nothing or something else.
> 
> I picked -- because it was not --- !
> 
> Anyway, lesson learned. But i kind of expect it will happen again to
> others, since the "Submitting Patches" documentation just mentions ---
> and does not say that other similar markers may also trigger the magic
> scissors.

-- doesn't trigger the scissors. It looks like patchwork replaced your
-- with a ---, as I wrote in my other reply.

-- 
Sabrina


Re: linux-next: Signed-off-by missing for commits in the net-next tree

2018-12-09 Thread Sabrina Dubroca
2018-12-09, 22:49:07 +0100, Andrew Lunn wrote:
> On Sun, Dec 09, 2018 at 10:33:10PM +0100, Heiner Kallweit wrote:
> > On 09.12.2018 22:11, Andrew Lunn wrote:
> > > On Mon, Dec 10, 2018 at 08:00:45AM +1100, Stephen Rothwell wrote:
> > >> Hi all,
> > >>
> > >> Commits
> > >>
> > >>   dc9d38cec71c ("net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data")
> > >>   04fa26bab06d ("net: phy: mdio-gpio: Add platform_data support for phy_mask")
> > >>
> > >> are missing a Signed-off-by from their author.
> > > 
> > > Hi David.
> > > 
> > > Any idea what happened here? The version in my git repo has SOB.
> > > 
> > > https://patchwork.ozlabs.org/patch/1009811/ also has my SOB.
> > > 
> > Instead of
> > v2
> > --
> > int -> u32 in platform data structure
> > 
> > Signed-off-by: Andrew Lunn 
> > 
> > shouldn't it be
> > 
> > Signed-off-by: Andrew Lunn 
> > ---
> > v2
> > - int -> u32 in platform data structure
> 
> Hi Heiner
> 
> David said he wanted to see the version history. So i deliberately put
> it above the ---.
> 
> I'm just wondering if -- was enough to trigger something in David's
> scripts? Or patchwork. The -- has disappeared in the commit which made
> it into net-next.
> 
>Andrew

If you fetch the mbox from patchwork at

https://patchwork.ozlabs.org/patch/1009811/mbox/

it contains:

 8< 
v2
Reviewed-by: Florian Fainelli 
---
int -> u32 in platform data structure

Signed-off-by: Andrew Lunn 
---
[diffstat]
 8< 

That's 3 dashes instead of the 2 from your mail. If you "git am" that,
the sign-off and actual history will get chopped off.
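
A simplified model of the cut, to show why 2 dashes are harmless and 3 are
not. This is only a sketch under the assumption that the message ends at
the first line consisting of exactly "---"; git's real patch-break
detection is more involved, and message_len_before_cut() is a made-up name.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return how many bytes of 'msg' would be kept as the commit message:
 * everything before the first line that is exactly "---", or the whole
 * string if there is no such line.
 */
static size_t message_len_before_cut(const char *msg)
{
	const char *line = msg;

	while (*line) {
		const char *eol = strchr(line, '\n');
		size_t len = eol ? (size_t)(eol - line) : strlen(line);

		if (len == 3 && memcmp(line, "---", 3) == 0)
			return (size_t)(line - msg);

		if (!eol)
			break;
		line = eol + 1;
	}

	return strlen(msg);
}
```

Fed the text from the original mail (with "--"), the cut lands after the
Signed-off-by line. Fed the patchwork mbox (with "---"), it lands right
after "v2", which matches what ended up in net-next.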

Maybe Stephen's script to detect those missing sign-offs could be run
as a commit/apply hook by David? This happens regularly, sometimes
dropping more than just a sign-off.

-- 
Sabrina


Re: Regression with 5dcd8400884c ("macsec: missing dev_put() on error in macsec_newlink()")

2018-04-14 Thread Sabrina Dubroca
Hello Laura,

2018-04-14, 10:56:55 -0700, Laura Abbott wrote:
> Hi,
> 
> Fedora got a bug report of a regression when trying to remove the
> macsec module (https://bugzilla.redhat.com/show_bug.cgi?id=1566410).
> I did a bisect and found
> 
> commit 5dcd8400884cc4a043a6d4617e042489e5d566a9
> Author: Dan Carpenter 
> Date:   Wed Mar 21 11:09:01 2018 +0300
> 
> macsec: missing dev_put() on error in macsec_newlink()
> We moved the dev_hold(real_dev); call earlier in the function but forgot
> to update the error paths.
> Fixes: 0759e552bce7 ("macsec: fix negative refcnt on parent link")
> Signed-off-by: Dan Carpenter 
> Signed-off-by: David S. Miller 
> 
> The script I used for testing based on the reporter is attached. It
> looks like modprobe is stuck in the D state. Any idea?

I don't think that reference was actually leaked. It gets released in
macsec_free_netdev() when the device is deleted.

modprobe getting stuck is just a side-effect of the refcount going
negative on the parent device, since removing the module needs to take
the lock that is held by device deletion.

I'll send a revert tomorrow.

Thanks for the report,

-- 
Sabrina


Re: [PATCH] net: ipv4: avoid unused variable warning for sysctl

2018-02-28 Thread Sabrina Dubroca
2018-02-28, 14:32:48 +0100, Arnd Bergmann wrote:
> The newly introduced ip_min_valid_pmtu variable is only used when
> CONFIG_SYSCTL is set:
> 
> net/ipv4/route.c:135:12: error: 'ip_min_valid_pmtu' defined but not used [-Werror=unused-variable]
> 
> This moves it to the other variables like it, to avoid the harmless
> warning.
> 
> Fixes: c7272c2f1229 ("net: ipv4: don't allow setting net.ipv4.route.min_pmtu below 68")
> Signed-off-by: Arnd Bergmann <a...@arndb.de>

Crap. Thanks, and sorry for the mess.

Acked-by: Sabrina Dubroca <s...@queasysnail.net>

-- 
Sabrina


[PATCH crypto] crypto: aesni - add wrapper for generic gcm(aes)

2017-12-13 Thread Sabrina Dubroca
When I added generic-gcm-aes I didn't add a wrapper like the one
provided for rfc4106(gcm(aes)). We need to add a cryptd wrapper to fall
back on in case the FPU is not available, otherwise we might corrupt the
FPU state.

Fixes: cce2ea8d90fe ("crypto: aesni - add generic gcm(aes)")
Reported-by: Ilya Lesokhin <il...@mellanox.com>
Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
Reviewed-by: Stefano Brivio <sbri...@redhat.com>
---
 arch/x86/crypto/aesni-intel_glue.c | 66 +++---
 1 file changed, 54 insertions(+), 12 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index 8981ed1eb7ad..a5ee78d723cd 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -690,8 +690,8 @@ static int common_rfc4106_set_key(struct crypto_aead *aead, const u8 *key,
   rfc4106_set_hash_subkey(ctx->hash_subkey, key, key_len);
 }
 
-static int rfc4106_set_key(struct crypto_aead *parent, const u8 *key,
-  unsigned int key_len)
+static int gcmaes_wrapper_set_key(struct crypto_aead *parent, const u8 *key,
+ unsigned int key_len)
 {
struct cryptd_aead **ctx = crypto_aead_ctx(parent);
struct cryptd_aead *cryptd_tfm = *ctx;
@@ -716,8 +716,8 @@ static int common_rfc4106_set_authsize(struct crypto_aead *aead,
 
 /* This is the Integrity Check Value (aka the authentication tag length and can
  * be 8, 12 or 16 bytes long. */
-static int rfc4106_set_authsize(struct crypto_aead *parent,
-   unsigned int authsize)
+static int gcmaes_wrapper_set_authsize(struct crypto_aead *parent,
+  unsigned int authsize)
 {
struct cryptd_aead **ctx = crypto_aead_ctx(parent);
struct cryptd_aead *cryptd_tfm = *ctx;
@@ -929,7 +929,7 @@ static int helper_rfc4106_decrypt(struct aead_request *req)
  aes_ctx);
 }
 
-static int rfc4106_encrypt(struct aead_request *req)
+static int gcmaes_wrapper_encrypt(struct aead_request *req)
 {
struct crypto_aead *tfm = crypto_aead_reqtfm(req);
struct cryptd_aead **ctx = crypto_aead_ctx(tfm);
@@ -945,7 +945,7 @@ static int rfc4106_encrypt(struct aead_request *req)
return crypto_aead_encrypt(req);
 }
 
-static int rfc4106_decrypt(struct aead_request *req)
+static int gcmaes_wrapper_decrypt(struct aead_request *req)
 {
struct crypto_aead *tfm = crypto_aead_reqtfm(req);
struct cryptd_aead **ctx = crypto_aead_ctx(tfm);
@@ -1128,6 +1128,30 @@ static int generic_gcmaes_decrypt(struct aead_request *req)
  aes_ctx);
 }
 
+static int generic_gcmaes_init(struct crypto_aead *aead)
+{
+   struct cryptd_aead *cryptd_tfm;
+   struct cryptd_aead **ctx = crypto_aead_ctx(aead);
+
+   cryptd_tfm = cryptd_alloc_aead("__driver-generic-gcm-aes-aesni",
+  CRYPTO_ALG_INTERNAL,
+  CRYPTO_ALG_INTERNAL);
+   if (IS_ERR(cryptd_tfm))
+   return PTR_ERR(cryptd_tfm);
+
+   *ctx = cryptd_tfm;
+   crypto_aead_set_reqsize(aead, crypto_aead_reqsize(&cryptd_tfm->base));
+
+   return 0;
+}
+
+static void generic_gcmaes_exit(struct crypto_aead *aead)
+{
+   struct cryptd_aead **ctx = crypto_aead_ctx(aead);
+
+   cryptd_free_aead(*ctx);
+}
+
 static struct aead_alg aesni_aead_algs[] = { {
.setkey = common_rfc4106_set_key,
.setauthsize= common_rfc4106_set_authsize,
@@ -1147,10 +1171,10 @@ static struct aead_alg aesni_aead_algs[] = { {
 }, {
.init   = rfc4106_init,
.exit   = rfc4106_exit,
-   .setkey = rfc4106_set_key,
-   .setauthsize= rfc4106_set_authsize,
-   .encrypt= rfc4106_encrypt,
-   .decrypt= rfc4106_decrypt,
+   .setkey = gcmaes_wrapper_set_key,
+   .setauthsize= gcmaes_wrapper_set_authsize,
+   .encrypt= gcmaes_wrapper_encrypt,
+   .decrypt= gcmaes_wrapper_decrypt,
.ivsize = GCM_RFC4106_IV_SIZE,
.maxauthsize= 16,
.base = {
@@ -1169,14 +1193,32 @@ static struct aead_alg aesni_aead_algs[] = { {
.decrypt= generic_gcmaes_decrypt,
.ivsize = GCM_AES_IV_SIZE,
.maxauthsize= 16,
+   .base = {
+   .cra_name   = "__generic-gcm-aes-aesni",
+   .cra_driver_name= "__driver-generic-gcm-aes-aesni",
+   .cra_priority   = 0,
+   .cra_flags  = CRYPTO_ALG_INTERNAL,
+   .cra_blocksize  = 1,
+   .cra_ctxsize= sizeof(struct generic_gcmaes_ctx),
+   .cra_alignmask  = AESNI_ALIGN - 1,

[PATCH crypto] crypto: aesni - fix typo in generic_gcmaes_decrypt

2017-12-13 Thread Sabrina Dubroca
generic_gcmaes_decrypt needs to use generic_gcmaes_ctx, not
aesni_rfc4106_gcm_ctx. This is actually harmless because the fields in
struct generic_gcmaes_ctx share the layout of the same fields in
aesni_rfc4106_gcm_ctx.

Fixes: cce2ea8d90fe ("crypto: aesni - add generic gcm(aes)")
Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
Reviewed-by: Stefano Brivio <sbri...@redhat.com>
---
 arch/x86/crypto/aesni-intel_glue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index 3bf3dcf29825..8981ed1eb7ad 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -1117,7 +1117,7 @@ static int generic_gcmaes_decrypt(struct aead_request *req)
 {
__be32 counter = cpu_to_be32(1);
struct crypto_aead *tfm = crypto_aead_reqtfm(req);
-   struct aesni_rfc4106_gcm_ctx *ctx = aesni_rfc4106_gcm_ctx_get(tfm);
+   struct generic_gcmaes_ctx *ctx = generic_gcmaes_ctx_get(tfm);
void *aes_ctx = &(ctx->aes_key_expanded);
u8 iv[16] __attribute__ ((__aligned__(AESNI_ALIGN)));
 
-- 
2.15.1
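
Editorial sketch of the "harmless because the layouts match" argument in
the commit message above. The two structures below are toy stand-ins with
assumed member sizes, not the real definitions from aesni-intel_glue.c; the
only point is that when the shared members sit at the same offsets,
reading them through the wrong type happens to give the same result.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the rfc4106 context (member sizes are assumptions). */
struct toy_rfc4106_ctx {
	uint8_t hash_subkey[16];
	uint8_t aes_key_expanded[480];
	uint8_t nonce[4];
};

/* Toy stand-in for the generic gcm(aes) context: same leading members. */
struct toy_generic_ctx {
	uint8_t hash_subkey[16];
	uint8_t aes_key_expanded[480];
};

/* Compile-time checks that the common members share their offsets. */
_Static_assert(offsetof(struct toy_rfc4106_ctx, hash_subkey) ==
	       offsetof(struct toy_generic_ctx, hash_subkey),
	       "hash_subkey offsets differ");
_Static_assert(offsetof(struct toy_rfc4106_ctx, aes_key_expanded) ==
	       offsetof(struct toy_generic_ctx, aes_key_expanded),
	       "aes_key_expanded offsets differ");

/* Runtime version of the same comparison, returns 1 when layouts agree. */
static int ctx_layouts_match(void)
{
	return offsetof(struct toy_rfc4106_ctx, hash_subkey) ==
	       offsetof(struct toy_generic_ctx, hash_subkey) &&
	       offsetof(struct toy_rfc4106_ctx, aes_key_expanded) ==
	       offsetof(struct toy_generic_ctx, aes_key_expanded);
}
```

Using the right type is still the correct fix, since nothing forces the two
layouts to stay in sync.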



[PATCH] tracing/kprobes: allow to create probe with a module name starting with a digit

2017-06-22 Thread Sabrina Dubroca
Always try to parse an address, since kstrtoul() will safely fail when
given a symbol as input. If that fails (which will be the case for a
symbol), try to parse a symbol instead.

This allows creating a probe such as:

p:probe/vlan_gro_receive 8021q:vlan_gro_receive+0

Which is necessary for this command to work:

perf probe -m 8021q -a vlan_gro_receive

Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
---
 kernel/trace/trace_kprobe.c | 14 +-
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index c129fca6ec99..b53c8d369163 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -707,20 +707,16 @@ static int create_trace_kprobe(int argc, char **argv)
pr_info("Probe point is not specified.\n");
return -EINVAL;
}
-   if (isdigit(argv[1][0])) {
-   /* an address specified */
-   ret = kstrtoul(&argv[1][0], 0, (unsigned long *)&addr);
-   if (ret) {
-   pr_info("Failed to parse address.\n");
-   return ret;
-   }
-   } else {
+
+   /* try to parse an address. if that fails, try to read the
+* input as a symbol. */
+   if (kstrtoul(argv[1], 0, (unsigned long *)&addr)) {
/* a symbol specified */
symbol = argv[1];
/* TODO: support .init module functions */
ret = traceprobe_split_symbol_offset(symbol, &offset);
if (ret) {
-   pr_info("Failed to parse symbol.\n");
+   pr_info("Failed to parse either an address or a symbol.\n");
return ret;
}
if (offset && is_return &&
-- 
2.13.1
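
Editorial sketch of the "try an address first, fall back to a symbol" logic
in userspace C. It uses strtoul() with a full-string check as an
approximation of kstrtoul(); the two are not identical (strtoul() accepts
leading whitespace and a sign, kstrtoul() tolerates one trailing newline),
and parse_address() and classify_probe_point() are hypothetical names.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Strict parse: succeeds only if the whole string is a number. */
static bool parse_address(const char *s, unsigned long *addr)
{
	char *end;
	unsigned long v;

	errno = 0;
	v = strtoul(s, &end, 0);
	if (errno || end == s || *end != '\0')
		return false;

	*addr = v;
	return true;
}

enum probe_target {
	TARGET_ADDRESS,
	TARGET_SYMBOL,
};

/* Anything that is not entirely a number is handed to the symbol parser. */
static enum probe_target classify_probe_point(const char *s,
					      unsigned long *addr)
{
	return parse_address(s, addr) ? TARGET_ADDRESS : TARGET_SYMBOL;
}
```

"8021q:vlan_gro_receive+0" starts with digits but is not a complete number,
so the strict parse fails and it is treated as a symbol, which the old
isdigit() check on the first character could not do.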



Re: [PATCH 08/10] efi/x86: Move EFI BGRT init code to early init code

2017-05-15 Thread Sabrina Dubroca
2017-05-15, 21:18:35 +0800, Dave Young wrote:
> On 05/15/17 at 01:10pm, Sabrina Dubroca wrote:
> > 2017-05-15, 16:37:40 +0800, Dave Young wrote:
> > > diff --git a/arch/x86/platform/efi/efi-bgrt.c b/arch/x86/platform/efi/efi-bgrt.c
> > > index 04ca876..b986e26 100644
> > > --- a/arch/x86/platform/efi/efi-bgrt.c
> > > +++ b/arch/x86/platform/efi/efi-bgrt.c
> > > @@ -36,6 +36,9 @@ void __init efi_bgrt_init(struct acpi_table_header *table)
> > >   if (acpi_disabled)
> > >   return;
> > >  
> > > + if (!efi_enabled(EFI_CONFIG_TABLES))
> 
> A better version should be checking EFI_BOOT, could you retest with
> below instead? If it works I can send a patch with your Tested-by:
> if (!efi_enabled(EFI_BOOT))

Yes, that works. Thanks for the fix :)

> > > + return;
> > > +
> > >   if (table->length < sizeof(bgrt_tab)) {
> > >   pr_notice("Ignoring BGRT: invalid length %u (expected %zu)\n",
> > >  table->length, sizeof(bgrt_tab));
> > > 
> > 
> > -- 
> > Sabrina
> 
> Thanks
> Dave

-- 
Sabrina


Re: [PATCH 08/10] efi/x86: Move EFI BGRT init code to early init code

2017-05-15 Thread Sabrina Dubroca
2017-05-15, 16:37:40 +0800, Dave Young wrote:
> Hi,
> 
> Thanks for the report.
> On 05/14/17 at 01:18am, Sabrina Dubroca wrote:
> > 2017-01-31, 13:21:40 +, Ard Biesheuvel wrote:
> > > From: Dave Young <dyo...@redhat.com>
> > > 
> > > Before invoking the arch specific handler, efi_mem_reserve() reserves
> > > the given memory region through memblock.
> > > 
> > > efi_bgrt_init() will call efi_mem_reserve() after mm_init(), at which
> > > time memblock is dead and should not be used anymore.
> > > 
> > > The EFI BGRT code depends on ACPI initialization to get the BGRT ACPI
> > > table, so move parsing of the BGRT table to ACPI early boot code to
> > > ensure that efi_mem_reserve() in EFI BGRT code still use memblock safely.
> > > 
> > > Signed-off-by: Dave Young <dyo...@redhat.com>
> > > Cc: Matt Fleming <m...@codeblueprint.co.uk>
> > > Cc: "Rafael J. Wysocki" <r...@rjwysocki.net>
> > > Cc: Len Brown <l...@kernel.org>
> > > Cc: linux-a...@vger.kernel.org
> > > Tested-by: Bhupesh Sharma <bhsha...@redhat.com>
> > > Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
> > 
> > I have a box that panics in early boot after this patch. The kernel
> > config is based on a Fedora 25 kernel + localmodconfig.
> > 
> > BUG: unable to handle kernel paging request at ff240001
> > IP: efi_bgrt_init+0xdc/0x134
> > PGD 1ac0c067
> > PUD 1ac0e067
> > PMD 1aee9067
> > PTE 938070180163
> > 
> > Oops: 0009 [#1] SMP
> > Modules linked in:
> > CPU: 0 PID: 0 Comm: swapper Not tainted 4.10.0-rc5-00116-g7b0a911 #19
> > Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.02 05/03/2012
> > task: 9fc10500 task.stack: 9fc0
> > RIP: 0010:efi_bgrt_init+0xdc/0x134
> > RSP: :9fc03d58 EFLAGS: 00010082
> > RAX: ff240001 RBX:  RCX: 138070180006
> > RDX: 8163 RSI: 938070180163 RDI: 05be
> > RBP: 9fc03d70 R08: 138070181000 R09: 0002
> > R10: 0002d000 R11: 98a3dedd2fc6 R12: 9f9f22b6
> > R13: 9ff49480 R14: 0010 R15: 
> > FS:  () GS:9fd2() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: ff240001 CR3: 1ac09000 CR4: 000406b0
> > Call Trace:
> >  ? acpi_parse_ioapic+0x98/0x98
> >  acpi_parse_bgrt+0x9/0xd
> >  acpi_table_parse+0x7a/0xa9
> >  acpi_boot_init+0x3c7/0x4f9
> >  ? acpi_parse_x2apic+0x74/0x74
> >  ? acpi_parse_x2apic_nmi+0x46/0x46
> >  setup_arch+0xb4b/0xc6f
> >  ? printk+0x52/0x6e
> >  start_kernel+0xb2/0x47b
> >  ? early_idt_handler_array+0x120/0x120
> >  x86_64_start_reservations+0x24/0x26
> >  x86_64_start_kernel+0xf7/0x11a
> >  start_cpu+0x14/0x14
> > Code: 48 c7 c7 10 16 a0 9f e8 4e 94 40 ff eb 62 be 06 00 00 00 e8 f9 ff 00 00 48 85 c0 75 0e 48 c7 c7 40 16 a0 9f e8 31 94 40 ff eb 45 <66> 44 8b 20 be 06 00 00 00 48 89 c7 8b 58 02 e8 87 00 01 00 66
> > RIP: efi_bgrt_init+0xdc/0x134 RSP: 9fc03d58
> > CR2: ff240001
> > ---[ end trace f68728a0d3053b52 ]---
> > Kernel panic - not syncing: Attempted to kill the idle task!
> > ---[ end Kernel panic - not syncing: Attempted to kill the idle task!
> > 
> > 
> > That code is:
> > 
> > 
> > All code
> > 
> >    0:   48 c7 c7 10 16 a0 9f    mov    $0x9fa01610,%rdi
> >    7:   e8 4e 94 40 ff          callq  0xff40945a
> >    c:   eb 62                   jmp    0x70
> >    e:   be 06 00 00 00          mov    $0x6,%esi
> >   13:   e8 f9 ff 00 00          callq  0x10011
> >   18:   48 85 c0                test   %rax,%rax
> >   1b:   75 0e                   jne    0x2b
> >   1d:   48 c7 c7 40 16 a0 9f    mov    $0x9fa01640,%rdi
> >   24:   e8 31 94 40 ff          callq  0xff40945a
> >   29:   eb 45                   jmp    0x70
> >   2b:*  66 44 8b 20             mov    (%rax),%r12w     <-- trapping instruction
> >   2f:   be 06 00 00 00          mov    $0x6,%esi
> >   34:   48 89 c7                mov    %rax,%rdi
> >   37:   8b 58 02                mov    0x2(%rax),%ebx
> >   3a:   e8 87 00 01 00          callq  0x100c6
> >   3f:   66                      data16
> > 

Re: [PATCH 08/10] efi/x86: Move EFI BGRT init code to early init code

2017-05-13 Thread Sabrina Dubroca
2017-01-31, 13:21:40 +, Ard Biesheuvel wrote:
> From: Dave Young 
> 
> Before invoking the arch specific handler, efi_mem_reserve() reserves
> the given memory region through memblock.
> 
> efi_bgrt_init() will call efi_mem_reserve() after mm_init(), at which
> time memblock is dead and should not be used anymore.
> 
> The EFI BGRT code depends on ACPI initialization to get the BGRT ACPI
> table, so move parsing of the BGRT table to ACPI early boot code to
> ensure that efi_mem_reserve() in EFI BGRT code still use memblock safely.
> 
> Signed-off-by: Dave Young 
> Cc: Matt Fleming 
> Cc: "Rafael J. Wysocki" 
> Cc: Len Brown 
> Cc: linux-a...@vger.kernel.org
> Tested-by: Bhupesh Sharma 
> Signed-off-by: Ard Biesheuvel 

I have a box that panics in early boot after this patch. The kernel
config is based on a Fedora 25 kernel + localmodconfig.

BUG: unable to handle kernel paging request at ff240001
IP: efi_bgrt_init+0xdc/0x134
PGD 1ac0c067
PUD 1ac0e067
PMD 1aee9067
PTE 938070180163

Oops: 0009 [#1] SMP
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.10.0-rc5-00116-g7b0a911 #19
Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.02 05/03/2012
task: 9fc10500 task.stack: 9fc0
RIP: 0010:efi_bgrt_init+0xdc/0x134
RSP: :9fc03d58 EFLAGS: 00010082
RAX: ff240001 RBX:  RCX: 138070180006
RDX: 8163 RSI: 938070180163 RDI: 05be
RBP: 9fc03d70 R08: 138070181000 R09: 0002
R10: 0002d000 R11: 98a3dedd2fc6 R12: 9f9f22b6
R13: 9ff49480 R14: 0010 R15: 
FS:  () GS:9fd2() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: ff240001 CR3: 1ac09000 CR4: 000406b0
Call Trace:
 ? acpi_parse_ioapic+0x98/0x98
 acpi_parse_bgrt+0x9/0xd
 acpi_table_parse+0x7a/0xa9
 acpi_boot_init+0x3c7/0x4f9
 ? acpi_parse_x2apic+0x74/0x74
 ? acpi_parse_x2apic_nmi+0x46/0x46
 setup_arch+0xb4b/0xc6f
 ? printk+0x52/0x6e
 start_kernel+0xb2/0x47b
 ? early_idt_handler_array+0x120/0x120
 x86_64_start_reservations+0x24/0x26
 x86_64_start_kernel+0xf7/0x11a
 start_cpu+0x14/0x14
Code: 48 c7 c7 10 16 a0 9f e8 4e 94 40 ff eb 62 be 06 00 00 00 e8 f9 ff 00 00 48 85 c0 75 0e 48 c7 c7 40 16 a0 9f e8 31 94 40 ff eb 45 <66> 44 8b 20 be 06 00 00 00 48 89 c7 8b 58 02 e8 87 00 01 00 66
RIP: efi_bgrt_init+0xdc/0x134 RSP: 9fc03d58
CR2: ff240001
---[ end trace f68728a0d3053b52 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
---[ end Kernel panic - not syncing: Attempted to kill the idle task!


That code is:


All code

   0:   48 c7 c7 10 16 a0 9f    mov    $0x9fa01610,%rdi
   7:   e8 4e 94 40 ff          callq  0xff40945a
   c:   eb 62                   jmp    0x70
   e:   be 06 00 00 00          mov    $0x6,%esi
  13:   e8 f9 ff 00 00          callq  0x10011
  18:   48 85 c0                test   %rax,%rax
  1b:   75 0e                   jne    0x2b
  1d:   48 c7 c7 40 16 a0 9f    mov    $0x9fa01640,%rdi
  24:   e8 31 94 40 ff          callq  0xff40945a
  29:   eb 45                   jmp    0x70
  2b:*  66 44 8b 20             mov    (%rax),%r12w     <-- trapping instruction
  2f:   be 06 00 00 00          mov    $0x6,%esi
  34:   48 89 c7                mov    %rax,%rdi
  37:   8b 58 02                mov    0x2(%rax),%ebx
  3a:   e8 87 00 01 00          callq  0x100c6
  3f:   66                      data16

Code starting with the faulting instruction
===
   0:   66 44 8b 20             mov    (%rax),%r12w
   4:   be 06 00 00 00          mov    $0x6,%esi
   9:   48 89 c7                mov    %rax,%rdi
   c:   8b 58 02                mov    0x2(%rax),%ebx
   f:   e8 87 00 01 00          callq  0x1009b
  14:   66                      data16


which is just after the early_memremap() call.
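
To make the disassembly easier to relate to C, here is a rough sketch of
the access pattern. The two loads read a 16-bit value at offset 0 and a
32-bit value at offset 2 from the pointer in %rax, and 0x6 is passed as
an argument to the calls on either side, which would fit a 6-byte object.
One layout that matches is a packed two-field header. The struct and
function names below are made up for illustration and are an assumption,
not taken from the kernel source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, named here for illustration only. */
struct image_header {
	uint16_t id;    /* matches: mov (%rax),%r12w   */
	uint32_t size;  /* matches: mov 0x2(%rax),%ebx */
} __attribute__((packed));

/* Read both fields through a mapped pointer, in the same order. */
static uint32_t header_size(const void *mapped, uint16_t *id)
{
	const struct image_header *h = mapped;

	assert(mapped && id);
	*id = h->id;        /* first dereference of the mapping */
	return h->size;
}
```

If the mapping itself is bad, the very first dereference faults, which is
what the trapping instruction above looks like.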

I enabled early_ioremap_debug and the last warning had:

__early_ioremap(138070181000, 1000) [1] => 0001 + ff24



Rest of the log, in case there's anything useful in there:


Linux version 4.10.0-rc5-00116-g7b0a911 (root@netdev4) (gcc version 6.3.1 20161221 (Red Hat 6.3.1-1) (GCC) ) #19 SMP Sat May 13 23:16:09 CEST 2017
Command line: BOOT_IMAGE=/vmlinuz-4.10.0-rc5-00116-g7b0a911 root=UUID=3b849e12-46bd-4406-a2ec-f44238a55d56 ro console=ttyS0,115200 earlyprintk=serial,0x03F8,115200
x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 

Re: [PATCH v9 3/3] printk: fix double printing with earlycon

2017-05-09 Thread Sabrina Dubroca
Hi Aleksey,

2017-04-05, 23:20:00 +0300, Aleksey Makarov wrote:
> If a console was specified by ACPI SPCR table _and_ command line
> parameters like "console=ttyAMA0" _and_ "earlycon" were specified,
> then log messages appear twice.
> 
> The root cause is that the code traverses the list of specified
> consoles (the `console_cmdline` array) and stops at the first match.
> But it may happen that the same console is referred by the elements
> of this array twice:
> 
>   pl011,mmio,0x87e02400,115200 -- from SPCR
>   ttyAMA0 -- from command line
> 
> but in this case `preferred_console` points to the second entry and
> the flag CON_CONSDEV is not set, so bootconsole is not deregistered.
> 
> To fix that, introduce an invariant "The last non-braille console
> is always the preferred one" on the entries of the console_cmdline
> array.  Then traverse it in reverse order to be sure that if
> the console is preferred then it will be the first matching entry.
> Introduce variable console_cmdline_cnt that keeps the number
> of elements of the console_cmdline array (Petr Mladek).  It helps
> to get rid of the loop that searches for the end of this array.

That's caused a change of behavior in my qemu setup, with this cmdline:

root=/dev/sda1 console=ttyS1 console=ttyS0

Before, the kernel logs appeared on ttyS1, and I logged in with ttyS0
(with my setup, ttyS1 is a file and ttyS0 is a unix socket). Now, the
kernel logs go to ttyS0. I need to swap the two console= parameters to
restore the previous behavior.

There might be some other problem (in qemu?) though, because adding
console=tty0 anywhere on that cmdline makes the logs appear on both
tty0 and one ttyS* (but only one of them, and the ordering of the
ttyS* matters).


Thanks,

-- 
Sabrina
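
A toy model of why the scan direction could matter here. This is not the
printk code: struct cmdline_console and first_match() are made up for
illustration, and it assumes a driver that matches entries by name only
and takes the index from whichever matching entry it meets first. Under
that assumption, scanning the same two entries forward or in reverse
picks a different ttyS.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for one console= entry. */
struct cmdline_console {
	const char *name;  /* e.g. "ttyS" */
	int index;         /* e.g. 0 for ttyS0 */
};

/*
 * Return the index of the first entry whose name matches, scanning either
 * forward (reverse == 0) or backward (reverse != 0). Returns -1 if none.
 */
static int first_match(const struct cmdline_console *c, size_t n,
		       const char *name, int reverse)
{
	assert(c && name);
	for (size_t k = 0; k < n; k++) {
		size_t i = reverse ? n - 1 - k : k;

		if (strcmp(c[i].name, name) == 0)
			return c[i].index;
	}
	return -1;
}
```

For "console=ttyS1 console=ttyS0", a forward scan for "ttyS" lands on
index 1 and a reverse scan lands on index 0, which lines up with the
change described above.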


Re: [PATCH v6 1/5] skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow

2017-04-28 Thread Sabrina Dubroca
2017-04-25, 20:47:30 +0200, Jason A. Donenfeld wrote:
> This is a defense-in-depth measure in response to bugs like
> 4d6fa57b4dab ("macsec: avoid heap overflow in skb_to_sgvec"). While
> we're at it, we also limit the amount of recursion this function is
> allowed to do. Not actually providing a bounded base case is a future
> disaster that we can easily avoid here.
> 
> Signed-off-by: Jason A. Donenfeld 
> ---
> Changes v5->v6:
>   * Use unlikely() for the rare overflow conditions.
>   * Also bound recursion, since this is a potential disaster we can avert.
> 
>  net/core/skbuff.c | 31 ++++++++++++++++++++++++-------
>  1 file changed, 24 insertions(+), 7 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f86bf69cfb8d..24fb53f8534e 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3489,16 +3489,22 @@ void __init skb_init(void)
>   *   @len: Length of buffer space to be mapped
>   *
>   *   Fill the specified scatter-gather list with mappings/pointers into a
> - *   region of the buffer space attached to a socket buffer.
> + *   region of the buffer space attached to a socket buffer. Returns either
> + *   the number of scatterlist items used, or -EMSGSIZE if the contents
> + *   could not fit.
>   */

One small thing here: since you're touching this comment, could you
move it next to skb_to_sgvec, since that's the function it's supposed
to document?

Thanks!

>  static int
> -__skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len)
> +__skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len,
> +	       unsigned int recursion_level)

-- 
Sabrina
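
For reference, the two ideas in the quoted patch (fail with -EMSGSIZE
instead of writing past the output array, and cap the recursion depth)
can be shown on a toy structure. This is only a sketch: struct buf,
count_slots() and the MAX_DEPTH value are invented for the example and
are not the skbuff API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_DEPTH 24  /* arbitrary bound chosen for this sketch */

/* Hypothetical nested buffer: some segments of its own, plus children. */
struct buf {
	int nsegs;                 /* segments held directly */
	const struct buf *child;   /* first nested buffer, or NULL */
	const struct buf *next;    /* next sibling, or NULL */
};

/*
 * Count how many output slots b needs, refusing to exceed cap slots
 * or MAX_DEPTH levels of nesting. Returns the count or -EMSGSIZE.
 */
static int count_slots(const struct buf *b, int cap, unsigned int depth)
{
	int used;

	assert(b && cap >= 0);
	if (depth >= MAX_DEPTH)
		return -EMSGSIZE;
	if (b->nsegs > cap)
		return -EMSGSIZE;
	used = b->nsegs;

	for (const struct buf *c = b->child; c; c = c->next) {
		/* Children only get the slots that are still free. */
		int ret = count_slots(c, cap - used, depth + 1);

		if (ret < 0)
			return ret;
		used += ret;
	}
	return used;
}
```

The caller has to check for a negative return before using the result as
a count, which is the main thing users of such an interface need to adapt.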


[PATCH 1/7] crypto: aesni: make non-AVX AES-GCM work with any aadlen

2017-04-28 Thread Sabrina Dubroca
This is the first step to make the aesni AES-GCM implementation
generic. The current code was written for rfc4106, so it handles only
some specific sizes of associated data.

Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
---
 arch/x86/crypto/aesni-intel_asm.S | 169 +-
 1 file changed, 132 insertions(+), 37 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S
index 3c465184ff8a..605726aaf0a2 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -89,6 +89,29 @@ SHIFT_MASK: .octa 0x0f0e0d0c0b0a09080706050403020100
 ALL_F:      .octa 0xffffffffffffffffffffffffffffffff
             .octa 0x00000000000000000000000000000000
 
+.section .rodata
+.align 16
+.type aad_shift_arr, @object
+.size aad_shift_arr, 272
+aad_shift_arr:
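+.octa 0xffffffffffffffffffffffffffffffff
+.octa 0xffffffffffffffffffffffffffffff0C
+.octa 0xffffffffffffffffffffffffffff0D0C
+.octa 0xffffffffffffffffffffffffff0E0D0C
+.octa 0xffffffffffffffffffffffff0F0E0D0C
+.octa 0xffffffffffffffffffffff0C0B0A0908
+.octa 0xffffffffffffffffffff0D0C0B0A0908
+.octa 0xffffffffffffffffff0E0D0C0B0A0908
+.octa 0xffffffffffffffff0F0E0D0C0B0A0908
+.octa 0xffffffffffffff0C0B0A090807060504
+.octa 0xffffffffffff0D0C0B0A090807060504
+.octa 0xffffffffff0E0D0C0B0A090807060504
+.octa 0xffffffff0F0E0D0C0B0A090807060504
+.octa 0xffffff0C0B0A09080706050403020100
+.octa 0xffff0D0C0B0A09080706050403020100
+.octa 0xff0E0D0C0B0A09080706050403020100
+.octa 0x0F0E0D0C0B0A09080706050403020100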
+
 
 .text
 
@@ -252,32 +275,66 @@ XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation
 	mov        arg8, %r12           # %r12 = aadLen
 	mov        %r12, %r11
 	pxor       %xmm\i, %xmm\i
+	pxor       \XMM2, \XMM2
 
-_get_AAD_loop\num_initial_blocks\operation:
-	movd       (%r10), \TMP1
-	pslldq     $12, \TMP1
-	psrldq     $4, %xmm\i
+	cmp        $16, %r11
+	jl         _get_AAD_rest8\num_initial_blocks\operation
+_get_AAD_blocks\num_initial_blocks\operation:
+	movdqu     (%r10), %xmm\i
+	PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data
+	pxor       %xmm\i, \XMM2
+	GHASH_MUL  \XMM2, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
+	add        $16, %r10
+	sub        $16, %r12
+	sub        $16, %r11
+	cmp        $16, %r11
+	jge        _get_AAD_blocks\num_initial_blocks\operation
+
+	movdqu     \XMM2, %xmm\i
+	cmp        $0, %r11
+	je         _get_AAD_done\num_initial_blocks\operation
+
+	pxor       %xmm\i,%xmm\i
+
+	/* read the last <16B of AAD. since we have at least 4B of
+	data right after the AAD (the ICV, and maybe some CT), we can
+	read 4B/8B blocks safely, and then get rid of the extra stuff */
+_get_AAD_rest8\num_initial_blocks\operation:
+	cmp        $4, %r11
+	jle        _get_AAD_rest4\num_initial_blocks\operation
+	movq       (%r10), \TMP1
+	add        $8, %r10
+	sub        $8, %r11
+	pslldq     $8, \TMP1
+	psrldq     $8, %xmm\i
 	pxor       \TMP1, %xmm\i
+	jmp        _get_AAD_rest8\num_initial_blocks\operation
+_get_AAD_rest4\num_initial_blocks\operation:
+	cmp        $0, %r11
+	jle        _get_AAD_rest0\num_initial_blocks\operation
+	mov        (%r10), %eax
+	movq       %rax, \TMP1
 	add        $4, %r10
-	sub        $4, %r12
-	jne        _get_AAD_loop\num_initial_blocks\operation
-
-	cmp        $16, %r11
-	je         _get_AAD_loop2_done\num_initial_blocks\operation
-
-	mov        $16, %r12
-_get_AAD_loop2\num_initial_blocks\operation:
+	sub        $4, %r10
+	pslldq     $12, \TMP1
 	psrldq     $4, %xmm\i
-	sub        $4, %r12
-	cmp        %r11, %r12
-	jne        _get_AAD_loop2\num_initial_blocks\operation
-
-_get_AAD_loop2_done\num_initial_blocks\operation:
+	pxor       \TMP1, %xmm\i
+_get_AAD_rest0\num_initial_blocks\operation:
+	/* finalize: shift out the extra bytes we read, and align
+	left. since pslldq can only shift by an immediate, we use
+	vpshufb and an array of shuffle masks */
+	movq       %r12, %r11
+	salq       $4, %r11
+	movdqu     aad_shift_arr(%r11), \TMP1
+	PSHUFB_XMM \TMP1, %xmm\i
+_get_AAD_rest_final\num_initial_blocks\operation:
 	PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data
+	pxor       \XMM2, %xmm\i
+	GHASH_MUL  %xmm\i, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
 
+_get_AAD_done\num_initial_blocks\operation:
 	xor        %r11, %r11 # initialise the data pointer offset as zero
-
-# start AES for num_initial_blocks blocks
+   # 
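
As a plain-C picture of what the new AAD path has to achieve: hash every
full 16-byte block, then hash one more block holding the remaining
len % 16 bytes padded with zeroes. The sketch below only does the
splitting and padding, not GHASH. aad_tail_block() and BLOCK are names
invented for this example, and unlike the assembly it copies exactly the
remaining bytes instead of over-reading and masking with a shuffle table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16

/*
 * Copy the trailing partial block of aad (len % 16 bytes) into out,
 * zero-padded to 16 bytes. Returns the number of full blocks that
 * precede it. When len is a multiple of 16, out is all zeroes and
 * the caller has no partial block to hash.
 */
static size_t aad_tail_block(const uint8_t *aad, size_t len,
			     uint8_t out[BLOCK])
{
	size_t full = len / BLOCK;
	size_t rest = len % BLOCK;

	assert(out && (aad || len == 0));
	memset(out, 0, BLOCK);
	if (rest)
		memcpy(out, aad + full * BLOCK, rest);
	return full;
}
```

With a 20-byte AAD this gives one full block plus a tail block whose first
four bytes are data and whose last twelve are zero.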

[PATCH 2/7] crypto: aesni: make non-AVX AES-GCM work with all valid auth_tag_len

2017-04-28 Thread Sabrina Dubroca
Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
---
 arch/x86/crypto/aesni-intel_asm.S | 62 ++-
 1 file changed, 48 insertions(+), 14 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S
index 605726aaf0a2..16627fec80b2 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -1549,18 +1549,35 @@ ENTRY(aesni_gcm_dec)
    mov arg10, %r11   # %r11 = auth_tag_len
    cmp $16, %r11
    je  _T_16_decrypt
-   cmp $12, %r11
-   je  _T_12_decrypt
+   cmp $8, %r11
+   jl  _T_4_decrypt
 _T_8_decrypt:
    MOVQ_R64_XMM    %xmm0, %rax
    mov %rax, (%r10)
-   jmp _return_T_done_decrypt
-_T_12_decrypt:
-   MOVQ_R64_XMM    %xmm0, %rax
-   mov %rax, (%r10)
+   add $8, %r10
+   sub $8, %r11
    psrldq  $8, %xmm0
+   cmp $0, %r11
+   je  _return_T_done_decrypt
+_T_4_decrypt:
+   movd    %xmm0, %eax
+   mov %eax, (%r10)
+   add $4, %r10
+   sub $4, %r11
+   psrldq  $4, %xmm0
+   cmp $0, %r11
+   je  _return_T_done_decrypt
+_T_123_decrypt:
    movd    %xmm0, %eax
-   mov %eax, 8(%r10)
+   cmp $2, %r11
+   jl  _T_1_decrypt
+   mov %ax, (%r10)
+   cmp $2, %r11
+   je  _return_T_done_decrypt
+   add $2, %r10
+   sar $16, %eax
+_T_1_decrypt:
+   mov %al, (%r10)
    jmp _return_T_done_decrypt
 _T_16_decrypt:
    movdqu  %xmm0, (%r10)
@@ -1813,18 +1830,35 @@ ENTRY(aesni_gcm_enc)
    mov arg10, %r11   # %r11 = auth_tag_len
    cmp $16, %r11
    je  _T_16_encrypt
-   cmp $12, %r11
-   je  _T_12_encrypt
+   cmp $8, %r11
+   jl  _T_4_encrypt
 _T_8_encrypt:
    MOVQ_R64_XMM    %xmm0, %rax
    mov %rax, (%r10)
-   jmp _return_T_done_encrypt
-_T_12_encrypt:
-   MOVQ_R64_XMM    %xmm0, %rax
-   mov %rax, (%r10)
+   add $8, %r10
+   sub $8, %r11
    psrldq  $8, %xmm0
+   cmp $0, %r11
+   je  _return_T_done_encrypt
+_T_4_encrypt:
+   movd    %xmm0, %eax
+   mov %eax, (%r10)
+   add $4, %r10
+   sub $4, %r11
+   psrldq  $4, %xmm0
+   cmp $0, %r11
+   je  _return_T_done_encrypt
+_T_123_encrypt:
    movd    %xmm0, %eax
-   mov %eax, 8(%r10)
+   cmp $2, %r11
+   jl  _T_1_encrypt
+   mov %ax, (%r10)
+   cmp $2, %r11
+   je  _return_T_done_encrypt
+   add $2, %r10
+   sar $16, %eax
+_T_1_encrypt:
+   mov %al, (%r10)
    jmp _return_T_done_encrypt
 _T_16_encrypt:
    movdqu  %xmm0, (%r10)
-- 
2.12.2
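
Editorial note: the tag-store logic in the patch above can be modelled in C. This is only a sketch under stated assumptions, not code from the series: the names gcm_tag_len_valid() and store_tag() are hypothetical, and the sketch re-checks the remaining byte count at every step, which is slightly more defensive than the assembly (the assembly relies on the caller passing only 4, 8 or 12..16).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the tag lengths GCM allows (4, 8, 12..16 bytes). */
static int gcm_tag_len_valid(size_t len)
{
    return len == 4 || len == 8 || (len >= 12 && len <= 16);
}

/*
 * Hypothetical model of the _T_16 / _T_8 / _T_4 / _T_123 / _T_1 labels:
 * copy the first `len` bytes of a 16-byte tag using 16-, 8-, 4-, 2- and
 * 1-byte stores, never writing past dst[len - 1].
 */
static void store_tag(uint8_t *dst, const uint8_t tag[16], size_t len)
{
    size_t off = 0;

    assert(gcm_tag_len_valid(len));

    if (len == 16) {                 /* _T_16: one full 16-byte store */
        memcpy(dst, tag, 16);
        return;
    }
    if (len >= 8) {                  /* _T_8: low 8 bytes */
        memcpy(dst, tag, 8);
        off = 8;
    }
    if (len - off >= 4) {            /* _T_4: next 4 bytes */
        memcpy(dst + off, tag + off, 4);
        off += 4;
    }
    if (len - off >= 2) {            /* _T_123: 2-byte store */
        memcpy(dst + off, tag + off, 2);
        off += 2;
    }
    if (len - off == 1)              /* _T_1: final single byte */
        dst[off] = tag[off];
}
```

With len = 15, for example, the model performs an 8-byte, a 4-byte, a 2-byte and a 1-byte store, which is the same sequence the assembly takes through _T_8, _T_4, _T_123 and _T_1.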



[PATCH 6/7] crypto: aesni: make AVX2 AES-GCM work with all valid auth_tag_len

2017-04-28 Thread Sabrina Dubroca
Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 31 ---
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index 7230808a7cef..faecb1518bf8 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -2804,19 +2804,36 @@ ENDPROC(aesni_gcm_dec_avx_gen2)
 cmp $16, %r11
 je  _T_16\@
 
-cmp $12, %r11
-je  _T_12\@
+cmp $8, %r11
+jl  _T_4\@
 
 _T_8\@:
 vmovq   %xmm9, %rax
 mov %rax, (%r10)
-jmp _return_T_done\@
-_T_12\@:
-vmovq   %xmm9, %rax
-mov %rax, (%r10)
+add $8, %r10
+sub $8, %r11
 vpsrldq $8, %xmm9, %xmm9
+cmp $0, %r11
+je _return_T_done\@
+_T_4\@:
 vmovd   %xmm9, %eax
-mov %eax, 8(%r10)
+mov %eax, (%r10)
+add $4, %r10
+sub $4, %r11
+vpsrldq $4, %xmm9, %xmm9
+cmp $0, %r11
+je _return_T_done\@
+_T_123\@:
+vmovd %xmm9, %eax
+cmp $2, %r11
+jl _T_1\@
+mov %ax, (%r10)
+cmp $2, %r11
+je _return_T_done\@
+add $2, %r10
+sar $16, %eax
+_T_1\@:
+mov %al, (%r10)
 jmp _return_T_done\@
 
 _T_16\@:
-- 
2.12.2



[PATCH 4/7] crypto: aesni: make AVX AES-GCM work with all valid auth_tag_len

2017-04-28 Thread Sabrina Dubroca
Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 31 ---
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index a73117c84904..ee6283120f83 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -1481,19 +1481,36 @@ VARIABLE_OFFSET = 16*8
 cmp $16, %r11
 je  _T_16\@
 
-cmp $12, %r11
-je  _T_12\@
+cmp $8, %r11
+jl  _T_4\@
 
 _T_8\@:
 vmovq   %xmm9, %rax
 mov %rax, (%r10)
-jmp _return_T_done\@
-_T_12\@:
-vmovq   %xmm9, %rax
-mov %rax, (%r10)
+add $8, %r10
+sub $8, %r11
 vpsrldq $8, %xmm9, %xmm9
+cmp $0, %r11
+je _return_T_done\@
+_T_4\@:
 vmovd   %xmm9, %eax
-mov %eax, 8(%r10)
+mov %eax, (%r10)
+add $4, %r10
+sub $4, %r11
+vpsrldq $4, %xmm9, %xmm9
+cmp $0, %r11
+je _return_T_done\@
+_T_123\@:
+vmovd %xmm9, %eax
+cmp $2, %r11
+jl _T_1\@
+mov %ax, (%r10)
+cmp $2, %r11
+je _return_T_done\@
+add $2, %r10
+sar $16, %eax
+_T_1\@:
+mov %al, (%r10)
 jmp _return_T_done\@
 
 _T_16\@:
-- 
2.12.2



[PATCH 7/7] crypto: aesni: add generic gcm(aes)

2017-04-28 Thread Sabrina Dubroca
Now that the asm side of things can support all the valid lengths of ICV
and all lengths of associated data, provide the glue code to expose a
generic gcm(aes) crypto algorithm.

Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
---
 arch/x86/crypto/aesni-intel_glue.c | 208 -
 1 file changed, 158 insertions(+), 50 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index 93de8ea51548..4a55cdcdc008 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -61,6 +61,11 @@ struct aesni_rfc4106_gcm_ctx {
u8 nonce[4];
 };
 
+struct generic_gcmaes_ctx {
+   u8 hash_subkey[16] AESNI_ALIGN_ATTR;
+   struct crypto_aes_ctx aes_key_expanded AESNI_ALIGN_ATTR;
+};
+
 struct aesni_xts_ctx {
u8 raw_tweak_ctx[sizeof(struct crypto_aes_ctx)] AESNI_ALIGN_ATTR;
u8 raw_crypt_ctx[sizeof(struct crypto_aes_ctx)] AESNI_ALIGN_ATTR;
@@ -102,13 +107,11 @@ asmlinkage void aesni_xts_crypt8(struct crypto_aes_ctx *ctx, u8 *out,
  * u8 *out, Ciphertext output. Encrypt in-place is allowed.
  * const u8 *in, Plaintext input
  * unsigned long plaintext_len, Length of data in bytes for encryption.
- * u8 *iv, Pre-counter block j0: 4 byte salt (from Security Association)
- * concatenated with 8 byte Initialisation Vector (from IPSec ESP
- * Payload) concatenated with 0x00000001. 16-byte aligned pointer.
+ * u8 *iv, Pre-counter block j0: 12 byte IV concatenated with 0x00000001.
+ * 16-byte aligned pointer.
  * u8 *hash_subkey, the Hash sub key input. Data starts on a 16-byte boundary.
  * const u8 *aad, Additional Authentication Data (AAD)
- * unsigned long aad_len, Length of AAD in bytes. With RFC4106 this
- *  is going to be 8 or 12 bytes
+ * unsigned long aad_len, Length of AAD in bytes.
  * u8 *auth_tag, Authenticated Tag output.
  * unsigned long auth_tag_len), Authenticated Tag Length in bytes.
  *  Valid values are 16 (most likely), 12 or 8.
@@ -123,9 +126,8 @@ asmlinkage void aesni_gcm_enc(void *ctx, u8 *out,
  * u8 *out, Plaintext output. Decrypt in-place is allowed.
  * const u8 *in, Ciphertext input
  * unsigned long ciphertext_len, Length of data in bytes for decryption.
- * u8 *iv, Pre-counter block j0: 4 byte salt (from Security Association)
- * concatenated with 8 byte Initialisation Vector (from IPSec ESP
- * Payload) concatenated with 0x00000001. 16-byte aligned pointer.
+ * u8 *iv, Pre-counter block j0: 12 byte IV concatenated with 0x00000001.
+ * 16-byte aligned pointer.
  * u8 *hash_subkey, the Hash sub key input. Data starts on a 16-byte boundary.
  * const u8 *aad, Additional Authentication Data (AAD)
  * unsigned long aad_len, Length of AAD in bytes. With RFC4106 this is going
@@ -275,6 +277,16 @@ aesni_rfc4106_gcm_ctx *aesni_rfc4106_gcm_ctx_get(struct crypto_aead *tfm)
        align = 1;
    return PTR_ALIGN(crypto_aead_ctx(tfm), align);
 }
+
+static inline struct
+generic_gcmaes_ctx *generic_gcmaes_ctx_get(struct crypto_aead *tfm)
+{
+   unsigned long align = AESNI_ALIGN;
+
+   if (align <= crypto_tfm_ctx_alignment())
+       align = 1;
+   return PTR_ALIGN(crypto_aead_ctx(tfm), align);
+}
 #endif
 
 static inline struct crypto_aes_ctx *aes_ctx(void *raw_ctx)
@@ -712,32 +724,34 @@ static int rfc4106_set_authsize(struct crypto_aead *parent,
    return crypto_aead_setauthsize(&cryptd_tfm->base, authsize);
 }
 
-static int helper_rfc4106_encrypt(struct aead_request *req)
+static int generic_gcmaes_set_authsize(struct crypto_aead *tfm,
+                                       unsigned int authsize)
+{
+   switch (authsize) {
+   case 4:
+   case 8:
+   case 12:
+   case 13:
+   case 14:
+   case 15:
+   case 16:
+       break;
+   default:
+       return -EINVAL;
+   }
+
+   return 0;
+}
+
+static int gcmaes_encrypt(struct aead_request *req, unsigned int assoclen,
+                          u8 *hash_subkey, u8 *iv, void *aes_ctx)
 {
    u8 one_entry_in_sg = 0;
    u8 *src, *dst, *assoc;
-   __be32 counter = cpu_to_be32(1);
    struct crypto_aead *tfm = crypto_aead_reqtfm(req);
-   struct aesni_rfc4106_gcm_ctx *ctx = aesni_rfc4106_gcm_ctx_get(tfm);
-   void *aes_ctx = &(ctx->aes_key_expanded);
    unsigned long auth_tag_len = crypto_aead_authsize(tfm);
-   u8 iv[16] __attribute__ ((__aligned__(AESNI_ALIGN)));
    struct scatter_walk src_sg_walk;
    struct scatter_walk dst_sg_walk = {};
-   unsigned int i;
-
-   /* Assuming we are supporting rfc4106 64-bit extended */
-   /* sequence numbers We need to have the AAD length equal */
-   /* to 16 or 20 bytes */
-   if (unlikely(req->assoclen != 16 && req->assoclen != 20))
-       return -EINVAL;
-
-   /* IV below built */
-   for (i = 0; i < 4; i++)
- 
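
Editorial note: the updated comment in this patch describes the pre-counter block j0 as a 12-byte IV concatenated with 0x00000001, where the rfc4106 path assembles those 12 bytes from a 4-byte salt and an 8-byte IV. A minimal C sketch of that layout follows. The function name gcm_build_j0() is hypothetical and does not appear in the patch; the trailing counter is written big-endian, matching the cpu_to_be32(1) used by the code being removed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: build the 16-byte pre-counter block j0 from a
 * 12-byte IV, i.e. IV || 0x00000001 with the counter in big-endian order.
 */
static void gcm_build_j0(uint8_t j0[16], const uint8_t iv[12])
{
    assert(j0 != NULL && iv != NULL);

    memcpy(j0, iv, 12);   /* bytes 0..11: the IV, unchanged */
    j0[12] = 0x00;        /* bytes 12..15: 32-bit counter = 1 */
    j0[13] = 0x00;
    j0[14] = 0x00;
    j0[15] = 0x01;
}
```

The sketch ignores the 16-byte alignment the assembly expects for the iv pointer; a caller feeding the result to aesni_gcm_enc() would have to provide an aligned buffer.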

[PATCH 5/7] crypto: aesni: make AVX2 AES-GCM work with any aadlen

2017-04-28 Thread Sabrina Dubroca
This is the first step to make the aesni AES-GCM implementation
generic. The current code was written for rfc4106, so it handles only
some specific sizes of associated data.

Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 85 ++--
 1 file changed, 58 insertions(+), 27 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index ee6283120f83..7230808a7cef 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -1702,41 +1702,73 @@ ENDPROC(aesni_gcm_dec_avx_gen2)
 
 .macro INITIAL_BLOCKS_AVX2 num_initial_blocks T1 T2 T3 T4 T5 CTR XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 T6 T_key ENC_DEC VER
    i = (8-\num_initial_blocks)
+   j = 0
    setreg
 
-mov arg6, %r10   # r10 = AAD
-mov arg7, %r12   # r12 = aadLen
-
-
-mov %r12, %r11
-
-vpxor   reg_i, reg_i, reg_i
-_get_AAD_loop\@:
-vmovd   (%r10), \T1
-vpslldq $12, \T1, \T1
-vpsrldq $4, reg_i, reg_i
-vpxor   \T1, reg_i, reg_i
+   mov arg6, %r10   # r10 = AAD
+   mov arg7, %r12   # r12 = aadLen
 
-add $4, %r10
-sub $4, %r12
-jg  _get_AAD_loop\@
 
+   mov %r12, %r11
 
-cmp $16, %r11
-je  _get_AAD_loop2_done\@
-mov $16, %r12
+   vpxor   reg_j, reg_j, reg_j
+   vpxor   reg_i, reg_i, reg_i
 
-_get_AAD_loop2\@:
-vpsrldq $4, reg_i, reg_i
-sub $4, %r12
-cmp %r11, %r12
-jg  _get_AAD_loop2\@
+   cmp $16, %r11
+   jl  _get_AAD_rest8\@
+_get_AAD_blocks\@:
+   vmovdqu (%r10), reg_i
+   vpshufb SHUF_MASK(%rip), reg_i, reg_i
+   vpxor   reg_i, reg_j, reg_j
+   GHASH_MUL_AVX2  reg_j, \T2, \T1, \T3, \T4, \T5, \T6
+   add $16, %r10
+   sub $16, %r12
+   sub $16, %r11
+   cmp $16, %r11
+   jge _get_AAD_blocks\@
+   vmovdqu reg_j, reg_i
+   cmp $0, %r11
+   je  _get_AAD_done\@
 
-_get_AAD_loop2_done\@:
+   vpxor   reg_i, reg_i, reg_i
 
-#byte-reflect the AAD data
-vpshufb SHUF_MASK(%rip), reg_i, reg_i
+   /* read the last <16B of AAD. since we have at least 4B of
+   data right after the AAD (the ICV, and maybe some CT), we can
+   read 4B/8B blocks safely, and then get rid of the extra stuff */
+_get_AAD_rest8\@:
+   cmp $4, %r11
+   jle _get_AAD_rest4\@
+   movq    (%r10), \T1
+   add $8, %r10
+   sub $8, %r11
+   vpslldq $8, \T1, \T1
+   vpsrldq $8, reg_i, reg_i
+   vpxor   \T1, reg_i, reg_i
+   jmp _get_AAD_rest8\@
+_get_AAD_rest4\@:
+   cmp $0, %r11
+   jle _get_AAD_rest0\@
+   mov (%r10), %eax
+   movq    %rax, \T1
+   add $4, %r10
+   sub $4, %r11
+   vpslldq $12, \T1, \T1
+   vpsrldq $4, reg_i, reg_i
+   vpxor   \T1, reg_i, reg_i
+_get_AAD_rest0\@:
+   /* finalize: shift out the extra bytes we read, and align
+   left. since pslldq can only shift by an immediate, we use
+   vpshufb and an array of shuffle masks */
+   movq    %r12, %r11
+   salq    $4, %r11
+   movdqu  aad_shift_arr(%r11), \T1
+   vpshufb \T1, reg_i, reg_i
+_get_AAD_rest_final\@:
+   vpshufb SHUF_MASK(%rip), reg_i, reg_i
+   vpxor   reg_j, reg_i, reg_i
+   GHASH_MUL_AVX2  reg_i, \T2, \T1, \T3, \T4, \T5, \T6
 
+_get_AAD_done\@:
    # initialize the data pointer offset as zero
    xor %r11, %r11
 
@@ -1811,7 +1843,6 @@ ENDPROC(aesni_gcm_dec_avx_gen2)
    i = (8-\num_initial_blocks)
    j = (9-\num_initial_blocks)
    setreg
-GHASH_MUL_AVX2   reg_i, \T2, \T1, \T3, \T4, \T5, \T6
 
 .rep \num_initial_blocks
    vpxor    reg_i, reg_j, reg_j
-- 
2.12.2
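
Editorial note: the handling of the last <16 bytes of AAD (labels _get_AAD_rest8, _get_AAD_rest4 and _get_AAD_rest0) can be followed with a byte-level C model. This is a sketch under stated assumptions, not code from the series. The type xmm_t and all function names are hypothetical. The shuffle mask is computed instead of being read from aad_shift_arr. As in the assembly, up to 3 bytes past the end of the AAD are read and then discarded, so the caller must guarantee they are readable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } xmm_t;   /* b[0] is the lowest byte */

/* psrldq model: shift the register right by n bytes, filling with zeros. */
static xmm_t shift_right_bytes(xmm_t v, unsigned n)
{
    xmm_t r = {{0}};
    for (unsigned i = 0; i + n < 16; i++)
        r.b[i] = v.b[i + n];
    return r;
}

/* pshufb model: out[i] = 0 if mask bit 7 is set, else in[mask & 15]. */
static xmm_t shuffle_bytes(xmm_t v, xmm_t mask)
{
    xmm_t r;
    for (unsigned i = 0; i < 16; i++)
        r.b[i] = (mask.b[i] & 0x80) ? 0 : v.b[mask.b[i] & 0x0f];
    return r;
}

/* Shift acc down by n bytes and place n new bytes in its top n bytes
 * (models movq/mov + pslldq + psrldq + pxor). */
static xmm_t push_high(xmm_t acc, const uint8_t *src, unsigned n)
{
    acc = shift_right_bytes(acc, n);
    memcpy(&acc.b[16 - n], src, n);
    return acc;
}

/* Load rem (< 16) bytes of AAD into the low bytes of a zeroed register. */
static xmm_t load_aad_tail(const uint8_t *aad, size_t rem)
{
    xmm_t acc = {{0}}, mask;
    long left = (long)rem;
    unsigned loaded = 0;

    assert(rem < 16);

    while (left > 4) {               /* _get_AAD_rest8: 8-byte reads */
        acc = push_high(acc, aad + loaded, 8);
        loaded += 8;
        left -= 8;
    }
    if (left > 0) {                  /* _get_AAD_rest4: one 4-byte read */
        acc = push_high(acc, aad + loaded, 4);
        loaded += 4;
    }
    /* _get_AAD_rest0: the data sits in the top `loaded` bytes; move the
     * first `rem` of them to the bottom and zero everything else. */
    for (size_t i = 0; i < 16; i++)
        mask.b[i] = (i < rem) ? (uint8_t)(16 - loaded + i) : 0xff;
    return shuffle_bytes(acc, mask);
}
```

For rem = 5 the model reads 8 bytes, leaving the data in bytes 8..15, and the computed mask starts 08 09 0A 0B 0C followed by 0xff bytes. The byte reflection and the final GHASH multiplication that follow in the assembly are not modelled here.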



[PATCH 3/7] crypto: aesni: make AVX AES-GCM work with any aadlen

2017-04-28 Thread Sabrina Dubroca
This is the first step to make the aesni AES-GCM implementation
generic. The current code was written for rfc4106, so it handles
only some specific sizes of associated data.

Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 122 ++-
 1 file changed, 88 insertions(+), 34 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index d664382c6e56..a73117c84904 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -155,6 +155,30 @@ SHIFT_MASK:  .octa 0x0f0e0d0c0b0a09080706050403020100
 ALL_F:   .octa 0xffffffffffffffffffffffffffffffff
          .octa 0x00000000000000000000000000000000
 
+.section .rodata
+.align 16
+.type aad_shift_arr, @object
+.size aad_shift_arr, 272
+aad_shift_arr:
+.octa 0xffffffffffffffffffffffffffffffff
+.octa 0xffffffffffffffffffffffffffffff0C
+.octa 0xffffffffffffffffffffffffffff0D0C
+.octa 0xffffffffffffffffffffffffff0E0D0C
+.octa 0xffffffffffffffffffffffff0F0E0D0C
+.octa 0xffffffffffffffffffffff0C0B0A0908
+.octa 0xffffffffffffffffffff0D0C0B0A0908
+.octa 0xffffffffffffffffff0E0D0C0B0A0908
+.octa 0xffffffffffffffff0F0E0D0C0B0A0908
+.octa 0xffffffffffffff0C0B0A090807060504
+.octa 0xffffffffffff0D0C0B0A090807060504
+.octa 0xffffffffff0E0D0C0B0A090807060504
+.octa 0xffffffff0F0E0D0C0B0A090807060504
+.octa 0xffffff0C0B0A09080706050403020100
+.octa 0xffff0D0C0B0A09080706050403020100
+.octa 0xff0E0D0C0B0A09080706050403020100
+.octa 0x0F0E0D0C0B0A09080706050403020100
+
+
 .text
 
 
@@ -372,41 +396,72 @@ VARIABLE_OFFSET = 16*8
 
 .macro INITIAL_BLOCKS_AVX num_initial_blocks T1 T2 T3 T4 T5 CTR XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 T6 T_key ENC_DEC
    i = (8-\num_initial_blocks)
+   j = 0
    setreg
 
-mov arg6, %r10  # r10 = AAD
-mov arg7, %r12  # r12 = aadLen
-
-
-mov %r12, %r11
-
-vpxor   reg_i, reg_i, reg_i
-_get_AAD_loop\@:
-vmovd   (%r10), \T1
-vpslldq $12, \T1, \T1
-vpsrldq $4, reg_i, reg_i
-vpxor   \T1, reg_i, reg_i
-
-add $4, %r10
-sub $4, %r12
-jg  _get_AAD_loop\@
-
-
-cmp $16, %r11
-je  _get_AAD_loop2_done\@
-mov $16, %r12
-
-_get_AAD_loop2\@:
-vpsrldq $4, reg_i, reg_i
-sub $4, %r12
-cmp %r11, %r12
-jg  _get_AAD_loop2\@
-
-_get_AAD_loop2_done\@:
-
-#byte-reflect the AAD data
-vpshufb SHUF_MASK(%rip), reg_i, reg_i
-
+   mov arg6, %r10  # r10 = AAD
+   mov arg7, %r12  # r12 = aadLen
+
+
+   mov %r12, %r11
+
+   vpxor   reg_j, reg_j, reg_j
+   vpxor   reg_i, reg_i, reg_i
+   cmp $16, %r11
+   jl  _get_AAD_rest8\@
+_get_AAD_blocks\@:
+   vmovdqu (%r10), reg_i
+   vpshufb SHUF_MASK(%rip), reg_i, reg_i
+   vpxor   reg_i, reg_j, reg_j
+   GHASH_MUL_AVX   reg_j, \T2, \T1, \T3, \T4, \T5, \T6
+   add $16, %r10
+   sub $16, %r12
+   sub $16, %r11
+   cmp $16, %r11
+   jge _get_AAD_blocks\@
+   vmovdqu reg_j, reg_i
+   cmp $0, %r11
+   je  _get_AAD_done\@
+
+   vpxor   reg_i, reg_i, reg_i
+
+   /* read the last <16B of AAD. since we have at least 4B of
+   data right after the AAD (the ICV, and maybe some CT), we can
+   read 4B/8B blocks safely, and then get rid of the extra stuff */
+_get_AAD_rest8\@:
+   cmp $4, %r11
+   jle _get_AAD_rest4\@
+   movq    (%r10), \T1
+   add $8, %r10
+   sub $8, %r11
+   vpslldq $8, \T1, \T1
+   vpsrldq $8, reg_i, reg_i
+   vpxor   \T1, reg_i, reg_i
+   jmp _get_AAD_rest8\@
+_get_AAD_rest4\@:
+   cmp $0, %r11
+   jle  _get_AAD_rest0\@
+   mov (%r10), %eax
+   movq    %rax, \T1
+   add $4, %r10
+   sub $4, %r11
+   vpslldq $12, \T1, \T1
+   vpsrldq $4, reg_i, reg_i
+   vpxor   \T1, reg_i, reg_i
+_get_AAD_rest0\@:
+   /* finalize: shift out the extra bytes we read, and align
+   left. since pslldq can only shift by an immediate, we use
+   vpshufb and an array of shuffle masks */
+   movq    %r12, %r11
+   salq    $4, %r11
+   movdqu  aad_shift_arr(%r11), \T1
+   vpshufb \T1, reg_i, reg_i
+_get_AAD_rest_final\@:
+   vpshufb SHUF_MASK(%rip), reg_i, reg_i
+   vpxor   reg_j, reg_i, reg_i
+   GHASH_MUL_AVX   reg_i, \T2, \T1, \T3, \T4, \T5, \T6
+
+_get_AAD_done\@:
    # initialize the data pointer offset as zero
    xor %r11, %r11
 

[PATCH 3/7] crypto: aesni: make AVX AES-GCM work with any aadlen

2017-04-28 Thread Sabrina Dubroca
This is the first step to make the aesni AES-GCM implementation
generic. The current code was written for rfc4106, so it handles
only some specific sizes of associated data.

Signed-off-by: Sabrina Dubroca 
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 122 ++-
 1 file changed, 88 insertions(+), 34 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index d664382c6e56..a73117c84904 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -155,6 +155,30 @@ SHIFT_MASK:  .octa 0x0f0e0d0c0b0a09080706050403020100
 ALL_F:   .octa 0xffffffffffffffffffffffffffffffff
  .octa 0x00000000000000000000000000000000
 
+.section .rodata
+.align 16
+.type aad_shift_arr, @object
+.size aad_shift_arr, 272
+aad_shift_arr:
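+.octa 0xffffffffffffffffffffffffffffffff
+.octa 0xffffffffffffffffffffffffffffff0C
+.octa 0xffffffffffffffffffffffffffff0D0C
+.octa 0xffffffffffffffffffffffffff0E0D0C
+.octa 0xffffffffffffffffffffffff0F0E0D0C
+.octa 0xffffffffffffffffffffff0C0B0A0908
+.octa 0xffffffffffffffffffff0D0C0B0A0908
+.octa 0xffffffffffffffffff0E0D0C0B0A0908
+.octa 0xffffffffffffffff0F0E0D0C0B0A0908
+.octa 0xffffffffffffff0C0B0A090807060504
+.octa 0xffffffffffff0D0C0B0A090807060504
+.octa 0xffffffffff0E0D0C0B0A090807060504
+.octa 0xffffffff0F0E0D0C0B0A090807060504
+.octa 0xffffff0C0B0A09080706050403020100
+.octa 0xffff0D0C0B0A09080706050403020100
+.octa 0xff0E0D0C0B0A09080706050403020100
+.octa 0x0F0E0D0C0B0A09080706050403020100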
+
+
 .text
 
 
@@ -372,41 +396,72 @@ VARIABLE_OFFSET = 16*8
 
 .macro INITIAL_BLOCKS_AVX num_initial_blocks T1 T2 T3 T4 T5 CTR XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 T6 T_key ENC_DEC
i = (8-\num_initial_blocks)
+   j = 0
setreg
 
-mov arg6, %r10  # r10 = AAD
-mov arg7, %r12  # r12 = aadLen
-
-
-mov %r12, %r11
-
-vpxor   reg_i, reg_i, reg_i
-_get_AAD_loop\@:
-vmovd   (%r10), \T1
-vpslldq $12, \T1, \T1
-vpsrldq $4, reg_i, reg_i
-vpxor   \T1, reg_i, reg_i
-
-add $4, %r10
-sub $4, %r12
-jg  _get_AAD_loop\@
-
-
-cmp $16, %r11
-je  _get_AAD_loop2_done\@
-mov $16, %r12
-
-_get_AAD_loop2\@:
-vpsrldq $4, reg_i, reg_i
-sub $4, %r12
-cmp %r11, %r12
-jg  _get_AAD_loop2\@
-
-_get_AAD_loop2_done\@:
-
-#byte-reflect the AAD data
-vpshufb SHUF_MASK(%rip), reg_i, reg_i
-
+   mov arg6, %r10  # r10 = AAD
+   mov arg7, %r12  # r12 = aadLen
+
+
+   mov %r12, %r11
+
+   vpxor   reg_j, reg_j, reg_j
+   vpxor   reg_i, reg_i, reg_i
+   cmp $16, %r11
+   jl  _get_AAD_rest8\@
+_get_AAD_blocks\@:
+   vmovdqu (%r10), reg_i
+   vpshufb SHUF_MASK(%rip), reg_i, reg_i
+   vpxor   reg_i, reg_j, reg_j
+   GHASH_MUL_AVX   reg_j, \T2, \T1, \T3, \T4, \T5, \T6
+   add $16, %r10
+   sub $16, %r12
+   sub $16, %r11
+   cmp $16, %r11
+   jge _get_AAD_blocks\@
+   vmovdqu reg_j, reg_i
+   cmp $0, %r11
+   je  _get_AAD_done\@
+
+   vpxor   reg_i, reg_i, reg_i
+
+   /* read the last <16B of AAD. since we have at least 4B of
+   data right after the AAD (the ICV, and maybe some CT), we can
+   read 4B/8B blocks safely, and then get rid of the extra stuff */
+_get_AAD_rest8\@:
+   cmp $4, %r11
+   jle _get_AAD_rest4\@
+   movq    (%r10), \T1
+   add $8, %r10
+   sub $8, %r11
+   vpslldq $8, \T1, \T1
+   vpsrldq $8, reg_i, reg_i
+   vpxor   \T1, reg_i, reg_i
+   jmp _get_AAD_rest8\@
+_get_AAD_rest4\@:
+   cmp $0, %r11
+   jle  _get_AAD_rest0\@
+   mov (%r10), %eax
+   movq    %rax, \T1
+   add $4, %r10
+   sub $4, %r11
+   vpslldq $12, \T1, \T1
+   vpsrldq $4, reg_i, reg_i
+   vpxor   \T1, reg_i, reg_i
+_get_AAD_rest0\@:
+   /* finalize: shift out the extra bytes we read, and align
+   left. since pslldq can only shift by an immediate, we use
+   vpshufb and an array of shuffle masks */
+   movq    %r12, %r11
+   salq    $4, %r11
+   movdqu  aad_shift_arr(%r11), \T1
+   vpshufb \T1, reg_i, reg_i
+_get_AAD_rest_final\@:
+   vpshufb SHUF_MASK(%rip), reg_i, reg_i
+   vpxor   reg_j, reg_i, reg_i
+   GHASH_MUL_AVX   reg_i, \T2, \T1, \T3, \T4, \T5, \T6
+
+_get_AAD_done\@:
# initialize the data pointer offset as zero
xor %r11, %r11
 
@@ -480,7 +535,6 @@ VARIABLE_OFF
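
For readers following the tail handling in the hunk above, here is a rough userspace C model of what _get_AAD_rest8, _get_AAD_rest4 and _get_AAD_rest0 do to the last <16B of AAD. It is only a sketch, not the kernel code: load_partial_block() is a made-up name, byte arrays stand in for the XMM registers, and the final memcpy() stands in for the vpshufb with the aad_shift_arr entry selected by the remaining length. The later byte-reflection with SHUF_MASK is left out.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical model of the tail read. `p` points at the last `len` bytes of
 * AAD (0 < len < 16). Like the assembly, it may read up to 3 bytes past
 * p[len - 1], so the caller must guarantee those bytes are readable.
 */
static void load_partial_block(const uint8_t *p, size_t len, uint8_t out[16])
{
	uint8_t reg[16] = { 0 };	/* models reg_i */
	long rem = (long)len;		/* models %r11 */
	size_t nread = 0;

	while (rem > 4) {		/* rest8: new 8 bytes enter at the top */
		memmove(reg, reg + 8, 8);	/* vpsrldq $8 */
		memcpy(reg + 8, p + nread, 8);	/* vpslldq $8, then vpxor */
		nread += 8;
		rem -= 8;
	}
	if (rem > 0) {			/* rest4: one final 4-byte read */
		memmove(reg, reg + 4, 12);	/* vpsrldq $4 */
		memcpy(reg + 12, p + nread, 4);	/* vpslldq $12, then vpxor */
		nread += 4;
	}

	/* rest0: drop the extra bytes and left-align, as the shuffle does */
	memset(out, 0, 16);
	memcpy(out, reg + 16 - nread, len);
}
```

The bytes read always end up at the top of the register in memory order, which is why a single table lookup indexed by the remaining length is enough to realign them.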

[PATCH 0/7] crypto: aesni: provide generic gcm(aes)

2017-04-28 Thread Sabrina Dubroca
The current aesni AES-GCM implementation only offers support for
rfc4106(gcm(aes)).  This makes some things a little bit simpler
(handling of associated data and authentication tag), but it means
that non-IPsec users of gcm(aes) have to rely on
gcm_base(ctr-aes-aesni,ghash-clmulni), which is much slower.

This patchset adds handling of all valid authentication tag lengths
and of any associated data length to the assembly code, and exposes a
generic gcm(aes) AEAD algorithm to the crypto API.

With these patches, performance of MACsec on a single core increases
by 40% (from 4.5Gbps to around 6.3Gbps).

Sabrina Dubroca (7):
  crypto: aesni: make non-AVX AES-GCM work with any aadlen
  crypto: aesni: make non-AVX AES-GCM work with all valid auth_tag_len
  crypto: aesni: make AVX AES-GCM work with any aadlen
  crypto: aesni: make AVX AES-GCM work with all valid auth_tag_len
  crypto: aesni: make AVX2 AES-GCM work with any aadlen
  crypto: aesni: make AVX2 AES-GCM work with all valid auth_tag_len
  crypto: aesni: add generic gcm(aes)

 arch/x86/crypto/aesni-intel_asm.S| 231 +++--
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 283 ++-
 arch/x86/crypto/aesni-intel_glue.c   | 208 +--
 3 files changed, 539 insertions(+), 183 deletions(-)

-- 
2.12.2



Re: [PATCH] iov_iter: don't revert if csum error

2017-04-28 Thread Sabrina Dubroca
2017-04-28, 20:48:45 +0800, Ding Tianhong wrote:
> The patch 3278682 (make skb_copy_datagram_msg() et.al. preserve
> ->msg_iter on error) will revert the iov buffer if copy to iter
> failed, but it looks no need to revert for csum error, so fix it.
> 
> Fixes: 3278682 ("make skb_copy_datagram_msg() et.al. preserve->msg_iter on 
> error")

Please use 12 digits, i.e. 327868212381.

> Signed-off-by: Ding Tianhong 
> ---
>  net/core/datagram.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index f4947e7..475a8e9 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -760,7 +760,7 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff *skb,
> 
>   if (msg_data_left(msg) < chunk) {
>   if (__skb_checksum_complete(skb))
> - goto csum_error;
> + goto fault;

With this patch, skb_copy_and_csum_datagram_msg() will return -EFAULT
for an incorrect checksum, which doesn't seem right.
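
To make the concern concrete, here is a minimal sketch of the two exit labels, assuming (as I recall from the function's documentation) that csum_error maps to -EINVAL and fault to -EFAULT. copy_and_csum() and its flags are invented for illustration and this is not the kernel function.

```c
#include <errno.h>

/* Hypothetical stand-in for the copy-and-checksum path; not kernel code. */
static int copy_and_csum(int csum_ok, int copy_ok)
{
	if (!csum_ok)
		goto csum_error;
	if (!copy_ok)
		goto fault;
	return 0;

csum_error:
	return -EINVAL;	/* bad checksum: the packet itself is invalid */
fault:
	return -EFAULT;	/* copying to the user buffer failed */
}
```

Redirecting the checksum failure to the fault label would make callers see -EFAULT for a corrupted packet, which is the behavior change being questioned.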

>   if (skb_copy_datagram_msg(skb, hlen, msg, chunk))
>   goto fault;
>   } else {
> -- 
> 1.8.3.1
> 

-- 
Sabrina


Re: [PATCH v6 3/5] rxrpc: check return value of skb_to_sgvec always

2017-04-28 Thread Sabrina Dubroca
2017-04-25, 20:47:32 +0200, Jason A. Donenfeld wrote:
> Signed-off-by: Jason A. Donenfeld 
> ---
>  net/rxrpc/rxkad.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c
> index 4374e7b9c7bf..dcf46c9c3ece 100644
> --- a/net/rxrpc/rxkad.c
> +++ b/net/rxrpc/rxkad.c
[...]
> @@ -429,7 +432,8 @@ static int rxkad_verify_packet_2(struct rxrpc_call *call, 
> struct sk_buff *skb,
>   }
>  

Adding a few more lines of context:

	sg = _sg;
	if (unlikely(nsg > 4)) {
		sg = kmalloc(sizeof(*sg) * nsg, GFP_NOIO);
		if (!sg)
			goto nomem;
	}

>   sg_init_table(sg, nsg);
> - skb_to_sgvec(skb, sg, offset, len);
> + if (unlikely(skb_to_sgvec(skb, sg, offset, len) < 0))
> + goto nomem;

You're leaking sg when nsg > 4; you'll need to add this:

	if (sg != _sg)
		kfree(sg);
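
As a userspace illustration of the pattern (stack array for the common case, heap array otherwise, freed on every exit path), something like the following would do. All names are hypothetical: verify_packet() only mimics the shape of the function, fill_fails simulates skb_to_sgvec() returning < 0, and live_allocs exists purely so the leak can be observed.

```c
#include <stdlib.h>

struct toy_sg { void *page; unsigned int len; };

static int live_allocs;	/* outstanding heap arrays, for the demo only */

/* Hypothetical: `fill_fails` simulates skb_to_sgvec() returning < 0. */
static int verify_packet(unsigned int nsg, int fill_fails)
{
	struct toy_sg _sg[4], *sg = _sg;
	int ret = 0;

	if (nsg > 4) {
		sg = malloc(sizeof(*sg) * nsg);
		if (!sg)
			return -1;
		live_allocs++;
	}

	if (fill_fails)
		ret = -1;	/* the error path still reaches the cleanup below */

	if (sg != _sg) {
		free(sg);
		live_allocs--;
	}
	return ret;
}
```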



BTW, when you resubmit, please Cc: the maintainers of the files you're
changing for each patch, so that they can review this stuff. And send
patch 1 to all of them, otherwise they might be surprised that we even
need <0 checking after calls to skb_to_sgvec.

You might also want to add a cover letter.

-- 
Sabrina


Re: [PATCH v6 1/5] skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow

2017-04-27 Thread Sabrina Dubroca
2017-04-27, 11:21:51 +0200, Jason A. Donenfeld wrote:
> However, perhaps there's the chance that fraglist skbs having
> separate fraglists are actually forbidden? Is this the case?

Hmm, I think this can actually happen:

/*  net/ipv4/ip_fragment.c  */
static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
			 struct net_device *dev)
{

...

	/* If the first fragment is fragmented itself, we split
	 * it to two chunks: the first with data and paged part
	 * and the second, holding only fragments. */
	if (skb_has_frag_list(head)) {
		struct sk_buff *clone;
		int i, plen = 0;

		clone = alloc_skb(0, GFP_ATOMIC);
		if (!clone)
			goto out_nomem;
		clone->next = head->next;
		head->next = clone;
		skb_shinfo(clone)->frag_list = skb_shinfo(head)->frag_list;
		skb_frag_list_init(head);
		for (i = 0; i < skb_shinfo(head)->nr_frags; i++)
			plen += skb_frag_size(&skb_shinfo(head)->frags[i]);
		clone->len = clone->data_len = head->data_len - plen;
		head->data_len -= clone->len;
		head->len -= clone->len;
		clone->csum = 0;
		clone->ip_summed = head->ip_summed;
		add_frag_mem_limit(qp->q.net, clone->truesize);
	}

...
}


You can test that with a vxlan tunnel on top of a vxlan tunnel ("real"
MTU is 1500, first tunnel MTU set to 1, second tunnel MTU set to
4 -- or anything, as long as they both get fragmented).

-- 
Sabrina


Re: [PATCH v2] macsec: dynamically allocate space for sglist

2017-04-25 Thread Sabrina Dubroca
2017-04-25, 19:08:18 +0200, Jason A. Donenfeld wrote:
> We call skb_cow_data, which is good anyway to ensure we can actually
> modify the skb as such (another error from prior). Now that we have the
> number of fragments required, we can safely allocate exactly that amount
> of memory.
> 
> Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
> Cc: Sabrina Dubroca <s...@queasysnail.net>
> Cc: secur...@kernel.org
> Cc: sta...@vger.kernel.org

Acked-by: Sabrina Dubroca <s...@queasysnail.net>

Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Fixes: CVE-2017-7477

David, this fix is essentially equivalent to my patch "macsec: avoid
heap overflow in skb_to_sgvec on receive".  Feel free to pick my patch
if you prefer (it's smaller), but this looks ok to me.


Thanks,

-- 
Sabrina


Re: [PATCH] macsec: dynamically allocate space for sglist

2017-04-25 Thread Sabrina Dubroca
2017-04-25, 17:23:00 +0200, Jason A. Donenfeld wrote:
> We call skb_cow_data, which is good anyway to ensure we can actually
> modify the skb as such (another error from prior). Now that we have the
> number of fragments required, we can safely allocate exactly that amount
> of memory.
> 
> Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
> Cc: Sabrina Dubroca <s...@queasysnail.net>
> Cc: secur...@kernel.org
> Cc: sta...@vger.kernel.org
> ---
>  drivers/net/macsec.c | 25 -
>  1 file changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
> index dbab05afcdbe..56dafdee4c9c 100644
> --- a/drivers/net/macsec.c
> +++ b/drivers/net/macsec.c
[...]
> @@ -917,6 +926,7 @@ static struct sk_buff *macsec_decrypt(struct sk_buff *skb,
>  {
>   int ret;
>   struct scatterlist *sg;
> + struct sk_buff *trailer;
>   unsigned char *iv;
>   struct aead_request *req;
>   struct macsec_eth_header *hdr;
> @@ -927,7 +937,12 @@ static struct sk_buff *macsec_decrypt(struct sk_buff 
> *skb,
>   if (!skb)
>   return ERR_PTR(-ENOMEM);
>  
> - req = macsec_alloc_req(rx_sa->key.tfm, &iv, &sg);
> + ret = skb_cow_data(skb, 0, &trailer);
> + if (unlikely(ret < 0)) {
> + kfree_skb(skb);
> + return ERR_PTR(ret);
> + }
> + req = macsec_alloc_req(rx_sa->key.tfm, &iv, &sg, ret);
>   if (!req) {
>   kfree_skb(skb);
>   return ERR_PTR(-ENOMEM);

There's a problem here (and in macsec_encrypt): you need to update the
call to sg_init_table, like I did in my patch.  Otherwise,
sg_init_table() is going to access sg[MAX_SKB_FRAGS], which may be
past what you allocated.
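
A toy model of why the two counts have to agree, assuming the relevant effect of sg_init_table() is "zero nents entries and mark the last one as the end". The toy_* names are invented and nothing here is the real scatterlist API.

```c
#include <stdlib.h>
#include <string.h>

struct toy_sg { unsigned int len; int is_last; };

/* Models the effect described above: zero nents entries, mark the last. */
static void toy_sg_init_table(struct toy_sg *sg, unsigned int nents)
{
	memset(sg, 0, sizeof(*sg) * nents);
	sg[nents - 1].is_last = 1;
}

/* Hypothetical helper: allocate and initialize with the same count (>= 1). */
static struct toy_sg *toy_alloc_sg(unsigned int nfrags)
{
	struct toy_sg *sg = malloc(sizeof(*sg) * nfrags);

	if (sg)
		toy_sg_init_table(sg, nfrags);	/* not a larger fixed constant */
	return sg;
}
```

If the init call kept using a compile-time maximum while the allocation used the smaller runtime count, the write to the last entry would land outside the allocation.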

How did you test this? ;)

-- 
Sabrina


Re: [PATCH] macsec: avoid heap overflow in skb_to_sgvec

2017-04-25 Thread Sabrina Dubroca
2017-04-25, 17:08:28 +0200, Jason A. Donenfeld wrote:
> Hi Sabrina,
> 
> On Tue, Apr 25, 2017 at 4:53 PM, Sabrina Dubroca <s...@queasysnail.net> wrote:
> > Ugh, good catch :/
> >
> > AFAICT this patch doesn't really help, because NETIF_F_FRAGLIST
> > doesn't get tested in paths that can lead to triggering this.
> 
> You're right. This fixes the xmit() path, but not the receive path,
> which appears to take skbs directly from the upper device.
> 
> > I'll post a patch to allocate a properly-sized sg array.
> 
> I just posted this series, which should fix things in a robust way:
> 
> https://patchwork.ozlabs.org/patch/754861/

Yes, that prevents the overflow, but now you're just dropping
packets. I'll review that later, let's fix the overflow without
breaking connectivity for now.

-- 
Sabrina


Re: [PATCH] macsec: avoid heap overflow in skb_to_sgvec

2017-04-25 Thread Sabrina Dubroca
2017-04-21, 23:14:48 +0200, Jason A. Donenfeld wrote:
> While this may appear as a humdrum one line change, it's actually quite
> important. An sk_buff stores data in three places:
> 
> 1. A linear chunk of allocated memory in skb->data. This is the easiest
>one to work with, but it precludes using scatterdata since the memory
>must be linear.
> 2. The array skb_shinfo(skb)->frags, which is of maximum length
>MAX_SKB_FRAGS. This is nice for scattergather, since these fragments
>can point to different pages.
> 3. skb_shinfo(skb)->frag_list, which is a pointer to another sk_buff,
>which in turn can have data in either (1) or (2).
> 
> The first two are rather easy to deal with, since they're of a fixed
> maximum length, while the third one is not, since there can be
> potentially limitless chains of fragments. Fortunately dealing with
> frag_list is opt-in for drivers, so drivers don't actually have to deal
> with this mess. For whatever reason, macsec decided it wanted pain, and
> so it explicitly specified NETIF_F_FRAGLIST.
> 
> Because dealing with (1), (2), and (3) is insane, most users of sk_buff
> doing any sort of crypto or paging operation calls a convenient function
> called skb_to_sgvec (which happens to be recursive if (3) is in use!).
> This takes a sk_buff as input, and writes into its output pointer an
> array of scattergather list items. Sometimes people like to declare a
> fixed size scattergather list on the stack; othertimes people like to
> allocate a fixed size scattergather list on the heap. However, if you're
> doing it in a fixed-size fashion, you really shouldn't be using
> NETIF_F_FRAGLIST too (unless you're also ensuring the sk_buff and its
> frag_list children arent't shared and then you check the number of
> fragments in total required.)
> 
> Macsec specifically does this:
> 
> size += sizeof(struct scatterlist) * (MAX_SKB_FRAGS + 1);
> tmp = kmalloc(size, GFP_ATOMIC);
> *sg = (struct scatterlist *)(tmp + sg_offset);
>   ...
> sg_init_table(sg, MAX_SKB_FRAGS + 1);
> skb_to_sgvec(skb, sg, 0, skb->len);
> 
> Specifying MAX_SKB_FRAGS + 1 is the right answer usually, but not if you're
> using NETIF_F_FRAGLIST, in which case the call to skb_to_sgvec will
> overflow the heap, and disaster ensues.

Ugh, good catch :/

AFAICT this patch doesn't really help, because NETIF_F_FRAGLIST
doesn't get tested in paths that can lead to triggering this.

I'll post a patch to allocate a properly-sized sg array.
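
To illustrate why a fixed MAX_SKB_FRAGS + 1 array is not enough once frag_list is involved, here is a toy count of the scatterlist entries a buffer would need. The struct and the value 17 are assumptions made for the example; only the recursion into frag_list mirrors what the quoted text says about skb_to_sgvec.

```c
#include <stddef.h>

#define TOY_MAX_FRAGS 17	/* assumption: stands in for MAX_SKB_FRAGS */

struct toy_skb {
	int has_linear;			/* (1) linear data present */
	int nr_frags;			/* (2) page fragments, <= TOY_MAX_FRAGS */
	struct toy_skb *frag_list;	/* (3) first child buffer */
	struct toy_skb *next;		/* next sibling in the parent's frag_list */
};

/* Count the entries needed, recursing into frag_list children. */
static int toy_count_sg(const struct toy_skb *skb)
{
	int n = (skb->has_linear ? 1 : 0) + skb->nr_frags;
	const struct toy_skb *child;

	for (child = skb->frag_list; child; child = child->next)
		n += toy_count_sg(child);
	return n;
}
```

A head buffer with two full children already needs three times MAX_SKB_FRAGS + 1 entries, so the array has to be sized from the actual count.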

-- 
Sabrina


Re: net/xfrm: stack-out-of-bounds in xfrm_state_find

2017-04-20 Thread Sabrina Dubroca
2017-04-20, 19:30:27 +0200, Andrey Konovalov wrote:
> On Thu, Apr 20, 2017 at 6:47 PM, Andrey Konovalov  
> wrote:
> > Hi,
> >
> > I've got the following error report while fuzzing the kernel with syzkaller.
> >
> > On linux-next commit 4f7d029b9bf009fbee76bb10c0c4351a1870d2f3 (4.11-rc7).
> >
> > A reproducer and .config are attached.
> >
> > ==
> > BUG: KASAN: stack-out-of-bounds in xfrm_state_find+0x2ce7/0x2f70 at
> > addr 88006654f790
> > Read of size 4 by task a.out/4065
> > page:ea00019953c0 count:0 mapcount:0 mapping:  (null) index:0x0
> > flags: 0x100()
> > raw: 0100   
> > raw:  ea00019953e0  
> > page dumped because: kasan: bad access detected
> > CPU: 1 PID: 4065 Comm: a.out Not tainted 4.11.0-rc7+ #251
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:16
> >  dump_stack+0x292/0x398 lib/dump_stack.c:52
> >  kasan_report_error mm/kasan/report.c:212
> >  kasan_report+0x4d8/0x510 mm/kasan/report.c:347
> >  __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:367
> >  xfrm_state_find+0x2ce7/0x2f70 net/xfrm/xfrm_state.c:897
> 
> I'm not sure if the line numbers in the report are correct.
> 
> My guess is that the guilty line is actually this one:
> 
> h = xfrm_dst_hash(net, daddr, saddr, tmpl->reqid, encap_family);
> 
> but I might be wrong.

I think you're right. From udp_sendmsg we can get a flowi4 allocated
on the stack, and that's where saddr and daddr come from (in
xfrm_tmpl_resolve_one). Then we feed that to xfrm_dst_hash(), but we
ignore family (AF_INET) and use encap_family (AF_INET6), and then
xfrm_dst_hash treats both addresses as if they were IPv6, so we read
past the end of the flowi4.
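
A small userspace sketch of the mismatch, with invented names (addr_len_for(), toy_addr_hash()) and no claim about what the right fix in xfrm is: the number of bytes a hash reads is dictated by the family it is told, so passing AF_INET6 for storage that only holds an IPv4 address asks for 16 bytes out of 4.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>

/* Bytes of address a given family implies; 0 for families not handled here. */
static size_t addr_len_for(int family)
{
	switch (family) {
	case AF_INET:
		return 4;
	case AF_INET6:
		return 16;
	default:
		return 0;
	}
}

/*
 * Hypothetical toy hash that refuses to read more bytes than the storage
 * holds. Returns 0 and sets *h on success, -1 on a family/storage mismatch.
 */
static int toy_addr_hash(const uint8_t *addr, size_t storage, int family,
			 uint32_t *h)
{
	size_t i, len = addr_len_for(family);

	if (len == 0 || len > storage)
		return -1;
	*h = 0;
	for (i = 0; i < len; i++)
		*h = *h * 31 + addr[i];
	return 0;
}
```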

I don't know what the correct behavior would be.


BTW, I ran into a different stack-out-of-bounds (in
xfrm_dst_update_origin), also due to a flowi4 on stack being treated
as something bigger. I'll send the patch for that one.

> >  xfrm_tmpl_resolve_one net/xfrm/xfrm_policy.c:1470
> >  xfrm_tmpl_resolve+0x308/0xc90 net/xfrm/xfrm_policy.c:1514
> >  xfrm_resolve_and_create_bundle+0x16e/0x2590 net/xfrm/xfrm_policy.c:1889
> >  xfrm_lookup+0xd72/0x1170 net/xfrm/xfrm_policy.c:2253
> >  xfrm_lookup_route+0x39/0x1a0 net/xfrm/xfrm_policy.c:2375
> >  ip_route_output_flow+0x7f/0xa0 net/ipv4/route.c:2483
> >  udp_sendmsg+0x1565/0x2cd0 net/ipv4/udp.c:1015
> >  udpv6_sendmsg+0x8af/0x3500 net/ipv6/udp.c:1083
> >  inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:762
> >  sock_sendmsg_nosec net/socket.c:633
> >  sock_sendmsg+0xca/0x110 net/socket.c:643
> >  SYSC_sendto+0x660/0x810 net/socket.c:1696
> >  SyS_sendto+0x40/0x50 net/socket.c:1664
> >  entry_SYSCALL_64_fastpath+0x1f/0xc2 arch/x86/entry/entry_64.S:204
> > RIP: 0033:0x7f3daefd0b79
> > RSP: 002b:7ffdb39bb0b8 EFLAGS: 0206 ORIG_RAX: 002c
> > RAX: ffda RBX: 7ffdb39bb210 RCX: 7f3daefd0b79
> > RDX:  RSI: 20001000 RDI: 0003
> > RBP: 004004a0 R08: 20013ff0 R09: 0010
> > R10: 2000 R11: 0206 R12: 
> > R13: 7ffdb39bb210 R14:  R15: 
> > Memory state around the buggy address:
> >  88006654f680: f1 f1 f1 00 f2 f2 f2 f2 f2 f2 f2 f8 f2 f2 f2 f2
> >  88006654f700: f2 f2 f2 00 00 00 00 f2 f2 f2 f2 00 00 00 00 00
> >>88006654f780: 00 00 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 00 00
> >  ^
> >  88006654f800: f2 f2 f2 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00
> >  88006654f880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > ==

-- 
Sabrina


Re: [PATCH net-next 6/6] net: use core MTU range checking in misc drivers

2016-10-19 Thread Sabrina Dubroca
2016-10-18, 22:33:33 -0400, Jarod Wilson wrote:
[...]
> diff --git a/drivers/firewire/net.c b/drivers/firewire/net.c
> index 309311b..b5f125c 100644
> --- a/drivers/firewire/net.c
> +++ b/drivers/firewire/net.c
> @@ -1349,15 +1349,6 @@ static netdev_tx_t fwnet_tx(struct sk_buff *skb, struct net_device *net)
>   return NETDEV_TX_OK;
>  }
>  
> -static int fwnet_change_mtu(struct net_device *net, int new_mtu)
> -{
> - if (new_mtu < 68)
> - return -EINVAL;
> -
> - net->mtu = new_mtu;
> - return 0;
> -}
> -

This doesn't do any upper bound checking.

>  static const struct ethtool_ops fwnet_ethtool_ops = {
>   .get_link   = ethtool_op_get_link,
>  };
> @@ -1366,7 +1357,6 @@ static const struct net_device_ops fwnet_netdev_ops = {
>   .ndo_open   = fwnet_open,
>   .ndo_stop   = fwnet_stop,
>   .ndo_start_xmit = fwnet_tx,
> - .ndo_change_mtu = fwnet_change_mtu,
>  };
>  
>  static void fwnet_init_dev(struct net_device *net)
> @@ -1481,6 +1471,8 @@ static int fwnet_probe(struct fw_unit *unit,
>   max_mtu = (1 << (card->max_receive + 1))
>             - sizeof(struct rfc2734_header) - IEEE1394_GASP_HDR_SIZE;
>   net->mtu = min(1500U, max_mtu);
> + net->min_mtu = ETH_MIN_MTU;
> + net->max_mtu = net->mtu;

But that will now prevent increasing the MTU above the initial value?
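
A small userspace model of the concern, as a sketch only: the model_* names
are hypothetical, the range check is written in the spirit of the core
checking this series introduces, and "0 means no upper bound" is an
assumption of the model.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, simplified model of a net device; not struct net_device. */
struct model_netdev {
	unsigned int mtu;
	unsigned int min_mtu;
	unsigned int max_mtu;	/* 0 means "no upper bound" in this model */
};

/* Range check in the spirit of core MTU checking. */
static int model_set_mtu(struct model_netdev *dev, int new_mtu)
{
	assert(dev);
	if (new_mtu < 0 || (unsigned int)new_mtu < dev->min_mtu)
		return -EINVAL;
	if (dev->max_mtu > 0 && (unsigned int)new_mtu > dev->max_mtu)
		return -EINVAL;
	dev->mtu = (unsigned int)new_mtu;
	return 0;
}

/* Mirrors the quoted probe code: mtu = min(1500, hw_max), max_mtu = mtu. */
static void model_probe_as_posted(struct model_netdev *dev, unsigned int hw_max)
{
	dev->mtu = hw_max < 1500U ? hw_max : 1500U;
	dev->min_mtu = 68;
	dev->max_mtu = dev->mtu;
}

/* Alternative: bound the MTU by what the hardware can receive. */
static void model_probe_hw_bound(struct model_netdev *dev, unsigned int hw_max)
{
	dev->mtu = hw_max < 1500U ? hw_max : 1500U;
	dev->min_mtu = 68;
	dev->max_mtu = hw_max;
}
```

With a card that could take 4096 bytes, the "as posted" variant rejects a
request for 2000, while the hardware-bound variant accepts it and still
rejects 5000.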

-- 
Sabrina


Re: [PATCH net-next 4/6] net: use core MTU range checking in core net infra

2016-10-19 Thread Sabrina Dubroca
2016-10-19, 10:40:06 -0400, Jarod Wilson wrote:
> On Wed, Oct 19, 2016 at 03:55:29PM +0200, Sabrina Dubroca wrote:
> > 2016-10-18, 22:33:31 -0400, Jarod Wilson wrote:
> > > geneve:
> > > - Merge __geneve_change_mtu back into geneve_change_mtu, set max_mtu
> > > - This one isn't quite as straight-forward as others, could use some
> > >   closer inspection and testing
> > > 
> > > macvlan:
> > > - set min/max_mtu
> > > 
> > > tun:
> > > - set min/max_mtu, remove tun_net_change_mtu
> > > 
> > > vxlan:
> > > - Merge __vxlan_change_mtu back into vxlan_change_mtu, set min/max_mtu
> > > - This one is also not as straight-forward and could use closer inspection
> > >   and testing from vxlan folks
> > > 
> > > bridge:
> > > - set max_mtu via br_min_mtu()
> > > 
> > > openvswitch:
> > > - set min/max_mtu, remove internal_dev_change_mtu
> > > - note: max_mtu wasn't checked previously, it's been set to 65535, which
> > >   is the largest possible size supported
> > > 
> > > sch_teql:
> > > - set min/max_mtu (note: max_mtu previously unchecked, used max of 65535)
> > 
> > Nothing for other virtual netdevices? (dummy, veth, bond, etc) Their
> > MTU is limited to 1500 now.  Also missing macsec and ip_gre, probably
> > others that are using ether_setup.
> 
> Yeah, I've clearly missed more than I thought. Doing another sweep now.

Thanks.


> I'm thinking more and more that we ought to back out the patch that sets
> min/max in ether_setup, save it for last, after we're sure everyone that
> calls it has been prepared.

I'm not sure how that would work now, if some of the patches that
already went in for ethernet drivers assume that ether_setup will
configure a basic {min,max}_mtu pair (at least e100 makes that
assumption, but that might be the only one).

> > [...]
> > > diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
> > > index 89a687f..81fc79a 100644
> > > --- a/net/bridge/br_device.c
> > > +++ b/net/bridge/br_device.c
> > > @@ -184,17 +184,15 @@ static struct rtnl_link_stats64 *br_get_stats64(struct net_device *dev,
> > >  
> > >  static int br_change_mtu(struct net_device *dev, int new_mtu)
> > >  {
> > > +#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
> > >   struct net_bridge *br = netdev_priv(dev);
> > > - if (new_mtu < 68 || new_mtu > br_min_mtu(br))
> > > - return -EINVAL;
> > > -
> > > - dev->mtu = new_mtu;
> > >  
> > > -#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
> > >   /* remember the MTU in the rtable for PMTU */
> > >   dst_metric_set(&br->fake_rtable.dst, RTAX_MTU, new_mtu);
> > >  #endif
> > >  
> > > + dev->mtu = new_mtu;
> > > +
> > >   return 0;
> > >  }
> > >  
> > > @@ -390,6 +388,7 @@ void br_dev_setup(struct net_device *dev)
> > >   dev->hw_features = COMMON_FEATURES | NETIF_F_HW_VLAN_CTAG_TX |
> > >  NETIF_F_HW_VLAN_STAG_TX;
> > >   dev->vlan_features = COMMON_FEATURES;
> > > + dev->max_mtu = br_min_mtu(br);
> > 
> > br_min_mtu uses br->port_list, which is only initialized a few lines
> > later (right after the spin_lock_init() at the end of the context of
> > this diff).
> 
> Ah, okay, I'd just grouped it with the other dev->foo settings.
> 
> > Besides, I don't think this works: br_min_mtu(br) changes when you add
> > and remove ports, or when you change the MTU of an enslaved
> > device. But this makes the max MTU for the bridge fixed (to 1500).
> 
> Okay, how about this: set no max_mtu (or set it to IP_MAX_MTU/65535), and
> then retain a check against the possibly ever-changing br_min_mtu(br) in
> br_change_mtu()?

Sounds good to me.
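
For illustration, a userspace sketch of that arrangement. The model_* names
are hypothetical, and the bridge is reduced to an array of port MTUs; the
point is only that the upper bound is recomputed on every change request
instead of being frozen at setup time.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_ETH_DATA_LEN 1500

/* Hypothetical model of a bridge: just its MTU and its ports' MTUs. */
struct model_bridge {
	int mtu;
	const int *port_mtus;	/* assumed to hold positive values */
	size_t nr_ports;
};

/* Smallest port MTU, or the Ethernet default when there are no ports. */
static int model_br_min_mtu(const struct model_bridge *br)
{
	int mtu = 0;
	size_t i;

	assert(br);
	if (br->nr_ports == 0)
		return MODEL_ETH_DATA_LEN;
	for (i = 0; i < br->nr_ports; i++)
		if (mtu == 0 || br->port_mtus[i] < mtu)
			mtu = br->port_mtus[i];
	return mtu;
}

/* The lower bound is left to the core; only the dynamic bound is checked. */
static int model_br_change_mtu(struct model_bridge *br, int new_mtu)
{
	if (new_mtu > model_br_min_mtu(br))
		return -EINVAL;
	br->mtu = new_mtu;
	return 0;
}
```

With ports at 9000 and 4000, a request for 4001 fails; once the 4000 port is
gone, 9000 is accepted without touching any stored max_mtu.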


-- 
Sabrina


Re: [PATCH net-next 4/6] net: use core MTU range checking in core net infra

2016-10-19 Thread Sabrina Dubroca
2016-10-18, 22:33:31 -0400, Jarod Wilson wrote:
> geneve:
> - Merge __geneve_change_mtu back into geneve_change_mtu, set max_mtu
> - This one isn't quite as straight-forward as others, could use some
>   closer inspection and testing
> 
> macvlan:
> - set min/max_mtu
> 
> tun:
> - set min/max_mtu, remove tun_net_change_mtu
> 
> vxlan:
> - Merge __vxlan_change_mtu back into vxlan_change_mtu, set min/max_mtu
> - This one is also not as straight-forward and could use closer inspection
>   and testing from vxlan folks
> 
> bridge:
> - set max_mtu via br_min_mtu()
> 
> openvswitch:
> - set min/max_mtu, remove internal_dev_change_mtu
> - note: max_mtu wasn't checked previously, it's been set to 65535, which
>   is the largest possible size supported
> 
> sch_teql:
> - set min/max_mtu (note: max_mtu previously unchecked, used max of 65535)

Nothing for other virtual netdevices? (dummy, veth, bond, etc) Their
MTU is limited to 1500 now.  Also missing macsec and ip_gre, probably
others that are using ether_setup.


[...]
> diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
> index 89a687f..81fc79a 100644
> --- a/net/bridge/br_device.c
> +++ b/net/bridge/br_device.c
> @@ -184,17 +184,15 @@ static struct rtnl_link_stats64 *br_get_stats64(struct net_device *dev,
>  
>  static int br_change_mtu(struct net_device *dev, int new_mtu)
>  {
> +#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
>   struct net_bridge *br = netdev_priv(dev);
> - if (new_mtu < 68 || new_mtu > br_min_mtu(br))
> - return -EINVAL;
> -
> - dev->mtu = new_mtu;
>  
> -#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
>   /* remember the MTU in the rtable for PMTU */
>   dst_metric_set(&br->fake_rtable.dst, RTAX_MTU, new_mtu);
>  #endif
>  
> + dev->mtu = new_mtu;
> +
>   return 0;
>  }
>  
> @@ -390,6 +388,7 @@ void br_dev_setup(struct net_device *dev)
>   dev->hw_features = COMMON_FEATURES | NETIF_F_HW_VLAN_CTAG_TX |
>  NETIF_F_HW_VLAN_STAG_TX;
>   dev->vlan_features = COMMON_FEATURES;
> + dev->max_mtu = br_min_mtu(br);

br_min_mtu uses br->port_list, which is only initialized a few lines
later (right after the spin_lock_init() at the end of the context of
this diff).

Besides, I don't think this works: br_min_mtu(br) changes when you add
and remove ports, or when you change the MTU of an enslaved
device. But this makes the max MTU for the bridge fixed (to 1500).

>  
>   br->dev = dev;
>   spin_lock_init(&br->lock);

> diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c
> index e7da290..d5d6cae 100644
> --- a/net/openvswitch/vport-internal_dev.c
> +++ b/net/openvswitch/vport-internal_dev.c
> @@ -89,15 +89,6 @@ static const struct ethtool_ops internal_dev_ethtool_ops = {
>   .get_link   = ethtool_op_get_link,
>  };
>  
> -static int internal_dev_change_mtu(struct net_device *netdev, int new_mtu)
> -{
> - if (new_mtu < 68)
> - return -EINVAL;
> -
> - netdev->mtu = new_mtu;
> - return 0;
> -}
> -
>  static void internal_dev_destructor(struct net_device *dev)
>  {
>   struct vport *vport = ovs_internal_dev_get_vport(dev);
> @@ -148,7 +139,6 @@ static const struct net_device_ops internal_dev_netdev_ops = {
>   .ndo_stop = internal_dev_stop,
>   .ndo_start_xmit = internal_dev_xmit,
>   .ndo_set_mac_address = eth_mac_addr,
> - .ndo_change_mtu = internal_dev_change_mtu,
>   .ndo_get_stats64 = internal_get_stats,
>   .ndo_set_rx_headroom = internal_set_rx_headroom,
>  };

vport-internal uses ether_setup, so the MTU is currently limited to
1500, no?


-- 
Sabrina


Re: [e1000_netpoll] BUG: sleeping function called from invalid context at kernel/irq/manage.c:110

2016-07-28 Thread Sabrina Dubroca
2016-07-28, 07:43:55 +0200, Eric Dumazet wrote:
> On Wed, 2016-07-27 at 14:38 -0700, Jeff Kirsher wrote:
> > On Tue, 2016-07-26 at 11:14 +0200, Eric Dumazet wrote:
> > > Could you try this ?
> > > 
> > > diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
> > > index f42129d09e2c23ba9fdb5cde890d50ecb7166a42..a53c41c4c4f7d1fe52f95a2cab8784a938b3820b 100644
> > > --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> > > +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> > > @@ -5257,9 +5257,13 @@ static void e1000_netpoll(struct net_device *netdev)
> > >  {
> > > struct e1000_adapter *adapter = netdev_priv(netdev);
> > >  
> > > -   disable_irq(adapter->pdev->irq);
> > > -   e1000_intr(adapter->pdev->irq, netdev);
> > > -   enable_irq(adapter->pdev->irq);
> > > +   if (napi_schedule_prep(&adapter->napi)) {
> > > +           adapter->total_tx_bytes = 0;
> > > +           adapter->total_tx_packets = 0;
> > > +           adapter->total_rx_bytes = 0;
> > > +           adapter->total_rx_packets = 0;
> > > +           __napi_schedule(&adapter->napi);
> > > +   }
> > >  }
> > >  #endif
> > >  
> > 
> > Since this fixes the issue Fengguang saw, will you be submitting a formal
> > patch Eric? (please) I can get this queued up for Dave's net tree as soon
> > as I receive the formal patch.
> 
> I would prefer having a definitive advice from Thomas Gleixner and/or
> others if disable_irq() is forbidden from IRQ path.
> 
> As I said, about all netpoll() methods in net drivers use disable_irq()
> so a lot of patches would be needed.
> 
> disable_irq() should then test this condition earlier, so that we can
> detect potential bug, even if the IRQ is not (yet) threaded.

The idea when this first came up was to skip the sleeping part of
disable_irq():

http://marc.info/?l=linux-netdev&m=142314159626052

This fell off my todolist and I didn't send the conversion patches,
which would basically look like this:


diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 41f32c0b341e..b022691e680b 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -6713,20 +6713,20 @@ static irqreturn_t e1000_intr_msix(int __always_unused irq, void *data)
 
 		vector = 0;
 		msix_irq = adapter->msix_entries[vector].vector;
-		disable_irq(msix_irq);
-		e1000_intr_msix_rx(msix_irq, netdev);
+		if (disable_hardirq(msix_irq))
+			e1000_intr_msix_rx(msix_irq, netdev);
 		enable_irq(msix_irq);
 
 		vector++;
 		msix_irq = adapter->msix_entries[vector].vector;
-		disable_irq(msix_irq);
-		e1000_intr_msix_tx(msix_irq, netdev);
+		if (disable_hardirq(msix_irq))
+			e1000_intr_msix_tx(msix_irq, netdev);
 		enable_irq(msix_irq);
 
 		vector++;
 		msix_irq = adapter->msix_entries[vector].vector;
-		disable_irq(msix_irq);
-		e1000_msix_other(msix_irq, netdev);
+		if (disable_hardirq(msix_irq))
+			e1000_msix_other(msix_irq, netdev);
 		enable_irq(msix_irq);
 	}
 
@@ -6750,13 +6750,13 @@ static void e1000_netpoll(struct net_device *netdev)
 		e1000_intr_msix(adapter->pdev->irq, netdev);
 		break;
 	case E1000E_INT_MODE_MSI:
-		disable_irq(adapter->pdev->irq);
-		e1000_intr_msi(adapter->pdev->irq, netdev);
+		if (disable_hardirq(adapter->pdev->irq))
+			e1000_intr_msi(adapter->pdev->irq, netdev);
 		enable_irq(adapter->pdev->irq);
 		break;
 	default:	/* E1000E_INT_MODE_LEGACY */
-		disable_irq(adapter->pdev->irq);
-		e1000_intr(adapter->pdev->irq, netdev);
+		if (disable_hardirq(adapter->pdev->irq))
+			e1000_intr(adapter->pdev->irq, netdev);
 		enable_irq(adapter->pdev->irq);
 		break;
 	}


-- 
Sabrina


Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.

2016-04-28 Thread Sabrina Dubroca
Hello,

2016-04-27, 17:14:44 -0700, Ben Greear wrote:
> On 04/27/2016 05:00 PM, Hannes Frederic Sowa wrote:
> > Hi Ben,
> > 
> > On Wed, Apr 27, 2016, at 20:07, Ben Hutchings wrote:
> > > On Wed, 2016-04-27 at 08:59 -0700, Ben Greear wrote:
> > > > On 04/26/2016 04:02 PM, Ben Hutchings wrote:
> > > > > 
> > > > > 3.2.80-rc1 review patch.  If anyone has any objections, please let me 
> > > > > know.
> > > > I would be careful about this.  It causes regressions when sending
> > > > PACKET_SOCKET buffers from user-space to veth devices.
> > > > 
> > > > There was a proposed upstream fix for the regression, but it has not 
> > > > gone
> > > > into the tree as far as I know.
> > > > 
> > > > http://www.spinics.net/lists/netdev/msg370436.html
> > > [...]
> > > 
> > > OK, I'll drop this for now.
> > 
> > The fall out from not having this patch is in my opinion a bigger
> > fallout than not having this patch. This patch fixes silent data
> > corruption vs. the problem Ben Greear is talking about, which might not
> > be that a common usage.
> > 
> > What do others think?
> > 
> > Bye,
> > Hannes
> > 
> 
> This patch from Cong Wang seems to fix the regression for me, I think it 
> should be added and
> tested in the main tree, and then apply them to stable as a pair.
> 
> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a

Actually, no, this is not really a regression.

If you capture packets on a device with checksum offloading enabled,
the TCP/UDP checksum isn't filled.  veth also behaves that way.  What
the "veth: don't modify ip_summed" patch does is enable proper
checksum validation on veth.  This really was a bug in veth.

Cong's patch would also break cases where we choose to inject packets
with invalid checksums, and they would now be accepted as correct.

Your use case is invalid, it just happened to work because of a
bug.  If you want the stack to fill checksums so that you can capture
and reinject packets, you have to disable checksum offloading (or
compute the checksum yourself in userspace).
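
For the userspace option, a minimal sketch of the standard Internet
checksum (the RFC 1071 one's-complement sum). The function name is
hypothetical, and the caller is assumed to pass everything the protocol
covers: for TCP/UDP that includes the pseudo-header, with the checksum
field zeroed beforehand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One's-complement Internet checksum over `len` bytes.
 * Bytes are summed as big-endian 16-bit words; an odd trailing byte is
 * padded with a zero byte. The result is the value to store in the
 * header in network byte order. The 32-bit accumulator is sufficient
 * for buffers up to 64 KiB.
 */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(data != NULL || len == 0);

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8;

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

For the byte sequence 00 01 f2 03 f4 f5 f6 f7 used as the example in
RFC 1071, the folded sum is 0xddf2 and the function returns 0x220d.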

Thanks.

-- 
Sabrina


Re: [PATCH v3 3/3] netconsole: implement extended console support

2015-05-10 Thread Sabrina Dubroca
Hi Tejun,


2015-05-04, 16:04:56 -0400, Tejun Heo wrote:

[...]

> +/**
> + * send_ext_msg_udp - send extended log message to target
> + * @nt: target to send message to
> + * @msg: extended log message to send
> + * @msg_len: length of message
> + *
> + * Transfer extended log @msg to @nt.  If @msg is longer than
> + * MAX_PRINT_CHUNK, it'll be split and transmitted in multiple chunks with
> + * ncfrag header field added to identify them.
> + */
> +static void send_ext_msg_udp(struct netconsole_target *nt, const char *msg,
> +  int msg_len)
> +{
> + static char buf[MAX_PRINT_CHUNK];
> + const int max_extra_len = sizeof(",ncfrag=/");

Is msg_len guaranteed < 1?  Otherwise I think the WARN in the send
loop can trigger.

Also, I think your count is correct because sizeof adds one to the
string's length, but you don't explicitly account for the ';' between
header and body fragment here (and in chunk_len). header_len will stop
before the ;.


> + const char *header, *body;
> + int header_len = msg_len, body_len = 0;
> + int chunk_len, nr_chunks, i;
> +
> + if (msg_len <= MAX_PRINT_CHUNK) {
> + netpoll_send_udp(&nt->np, msg, msg_len);
> + return;
> + }
> +
> + /* need to insert extra header fields, detect header and body */
> + header = msg;
> + body = memchr(msg, ';', msg_len);
> + if (body) {
> + header_len = body - header;
> + body_len = msg_len - header_len - 1;
> + body++;
> + }
> +
> + chunk_len = MAX_PRINT_CHUNK - header_len - max_extra_len;
> + if (WARN_ON_ONCE(chunk_len <= 0))
> + return;
> +
> + /*
> +  * Transfer possibly multiple chunks with extra header fields.
> +  *
> +  * If @msg needs to be split to fit MAX_PRINT_CHUNK, add
> +  * "ncfrag=/" to identify each chunk.
> +  */
> + memcpy(buf, header, header_len);
> + nr_chunks = DIV_ROUND_UP(body_len, chunk_len);

Wouldn't it be simpler to loop on the remaining size, instead of
doing a division?
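
Something along these lines is what I have in mind, as a userspace
sketch rather than a replacement patch.  MAX_CHUNK, send_fn, struct sink
and sink_send() are stand-ins invented for the example; the real code
would call netpoll_send_udp().  Note that this version also sizes each
fragment from the room actually left after the fragment header, instead
of using a fixed chunk_len.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_CHUNK 64	/* small stand-in for MAX_PRINT_CHUNK */

/* Hypothetical transmit hook standing in for netpoll_send_udp(). */
typedef void (*send_fn)(const char *buf, int len, void *ctx);

/* Test sink: strips "header,ncfrag=off/total;" and reassembles the body. */
struct sink {
	char body[256];
	int body_len;
	int nr_frags;
};

static void sink_send(const char *buf, int len, void *ctx)
{
	struct sink *s = ctx;
	const char *semi = memchr(buf, ';', len);
	int n;

	if (!semi)
		return;
	n = len - (int)(semi + 1 - buf);
	if (n > (int)sizeof(s->body) - s->body_len)
		return;
	memcpy(s->body + s->body_len, semi + 1, n);
	s->body_len += n;
	s->nr_frags++;
}

/* Returns the number of fragments sent, or -1 if a header cannot fit. */
static int send_fragments(const char *header, int header_len,
			  const char *body, int body_len,
			  send_fn send, void *ctx)
{
	char buf[MAX_CHUNK];
	int offset = 0, sent = 0;

	if (header_len <= 0 || header_len >= MAX_CHUNK)
		return -1;
	memcpy(buf, header, header_len);

	while (offset < body_len) {	/* loop on what is left, no division */
		int remaining = body_len - offset;
		int this_header = header_len;
		int room, this_chunk;

		this_header += snprintf(buf + this_header,
					sizeof(buf) - this_header,
					",ncfrag=%d/%d;", offset, body_len);
		if (this_header >= MAX_CHUNK)
			return -1;	/* fragment header truncated or no room left */

		room = MAX_CHUNK - this_header;
		this_chunk = remaining < room ? remaining : room;
		assert(this_header + this_chunk <= MAX_CHUNK);

		memcpy(buf + this_header, body + offset, this_chunk);
		send(buf, this_header + this_chunk, ctx);
		offset += this_chunk;
		sent++;
	}
	return sent;
}
```

With a 10-byte header and a 100-byte body this produces three fragments,
and the length check sits right next to the computation it protects.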


> +
> + for (i = 0; i < nr_chunks; i++) {
> + int offset = i * chunk_len;
> + int this_header = header_len;
> + int this_chunk = min(body_len - offset, chunk_len);
> +
> + if (nr_chunks > 1)

We already know that there will be more than one chunk, since
you handle msg_len <= MAX_PRINT_CHUNK at the beginning?


> + this_header += scnprintf(buf + this_header,
> +  sizeof(buf) - this_header,
> +  ",ncfrag=%d/%d;",
> +  offset, body_len);
> +
> + if (WARN_ON_ONCE(this_header + chunk_len > MAX_PRINT_CHUNK))
> + return;

This WARN doesn't really seem necessary to me, except for the msg_len
maximum I mentioned earlier.
And if we don't use nr_chunks, we could compute the fragment's length
here in case some computation went wrong.


> +
> + memcpy(buf + this_header, body, this_chunk);
> +
> + netpoll_send_udp(&nt->np, buf, this_header + this_chunk);
> +

netpoll_send_udp already does a memcpy (in skb_copy_to_linear_data).
Maybe it would be better to modify netpoll_send_udp, or add a variant
that takes two buffers? or more than two, with something like an iovec?
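
To illustrate the iovec idea: the sketch below gathers several segments
straight into the destination with one copy per segment, which is what a
multi-buffer variant could do when filling the skb.  gather_copy() is a
hypothetical userspace helper, not an existing netpoll function, and it
uses the POSIX struct iovec only for its layout.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: copy iovcnt segments back to back into dst.
 * Returns the total number of bytes copied, or -1 if they do not fit.
 */
static long gather_copy(char *dst, size_t dst_len,
			const struct iovec *iov, int iovcnt)
{
	size_t off = 0;
	int i;

	assert(dst && iov);
	for (i = 0; i < iovcnt; i++) {
		if (iov[i].iov_len > dst_len - off)
			return -1;
		memcpy(dst + off, iov[i].iov_base, iov[i].iov_len);
		off += iov[i].iov_len;
	}
	return (long)off;
}
```

The caller would pass the header (with its ncfrag field) as the first
segment and a pointer into the body as the second, so the static
buffer and the extra memcpy of the body would no longer be needed.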

> + body += this_chunk;
> + }
> +}
>
> [...]




Thanks,

-- 
Sabrina
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-26 Thread Sabrina Dubroca
2015-02-26, 14:54:33 +0100, Denys Vlasenko wrote:
> On Thu, Feb 26, 2015 at 1:11 PM, Denys Vlasenko
>  wrote:
> > On Thu, Feb 26, 2015 at 10:55 AM, Denys Vlasenko
> >  wrote:
> >> On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski  
> >> wrote:
> >> In addition to my previous tests, I ran my home machine with
> >> patched kernel. Unfortunately, it works for me :(
> >>
> >> Will try on yet another machine.
> >
> > And voila, it does happen on another machine :)
> >
> > I'm debugging it right now. Looks like 64-bit syscalls just stop working
> > at some point in new processes. That is, existing process is alive and well,
> > but children get SEGV after fork (most likely on any syscall64 they do,
> > not after fork per se. They eventually manage to kill themselves -
> > not trivial when exit syscall isn't working either - by tripping on HLT 
> > insn).
> >
> > 32-bit syscalls (int 80) continue to work. Fork, exec, whatever you want.
> > I have static 32-bit busybox binary and everything works there.
> >
> > Also, any 64-bit process which was under strace continues to work correctly,
> > including forks and execs.
> >
> > This points towards some bug on fast path sysret64 code. Looking for it.
> 
> audit=0 makes crashes disappear.

Ah, yes.

> I found the problem. If syscall_trace_enter_phase1 returns 0,
> I restore %rax from pt_regs->ax, but should restore it from
> pt_regs->orig_ax:
> 
> call syscall_trace_enter_phase1
> test %rax, %rax
> jnz tracesys_phase2 /* if needed, run the slow path */
> -   RESTORE_C_REGS  /* else restore clobbered regs */
> +   RESTORE_C_REGS_EXCEPT_RAX   /* else restore clobbered regs */
> +   movq ORIG_RAX-ARGOFFSET(%rsp),%rax
> jmp system_call_fastpath/*  and return to the fast path */

with s/-ARGOFFSET// on top of next-20150224, that works.
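
For anyone following along, here is a toy C model of why the source of
%rax matters.  It assumes (as I understand the entry code) that on
syscall entry pt_regs->ax is preset to -ENOSYS while pt_regs->orig_ax
keeps the syscall number; struct regs_model and the helpers are invented
for the illustration.

```c
#include <assert.h>
#include <errno.h>

/* Only the two slots that matter for this bug. */
struct regs_model {
	long ax;	/* return-value slot, assumed preset to -ENOSYS */
	long orig_ax;	/* syscall number as passed by userspace */
};

static void syscall_entry(struct regs_model *r, long nr)
{
	assert(nr >= 0);
	r->orig_ax = nr;
	r->ax = -ENOSYS;
}

/* What the fast path would dispatch on after each kind of restore. */
static long restore_rax_buggy(const struct regs_model *r)
{
	return r->ax;
}

static long restore_rax_fixed(const struct regs_model *r)
{
	return r->orig_ax;
}
```

With the buggy restore the fast path sees -ENOSYS instead of a syscall
number, which would fit the picture of every syscall failing (including
exit) once the audit phase1 hook is involved.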

Thanks, Denys.

-- 
Sabrina


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-25 Thread Sabrina Dubroca
2015-02-25, 23:40:55 +0100, Sabrina Dubroca wrote:
> I can run some userspace programs, but I have no idea what would be
> helpful.
> I can also try booting a real machine with archlinux/systemd tomorrow.

I got a good boot out of kernels that normally fail.  I booted
systemd's emergency shell and enabled a few services, in the same
order they normally start.  journald started cleanly, but after that,
every single command produced a "traps:" output and an "audit:" line.

I disabled systemd-journald (chmod -x, because `systemctl disable`
didn't really disable it), and now it boots, no "traps:" in the log.
If I run it, everything fails again (zsh has traps for simply pressing
enter on an empty cmd).

-- 
Sabrina


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-25 Thread Sabrina Dubroca
2015-02-25, 13:59:06 -0800, Andy Lutomirski wrote:
> On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko  wrote:
> > On 02/25/2015 09:10 PM, Andy Lutomirski wrote:
> >> On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin  wrote:
> >>> 2015-02-25 21:42 GMT+03:00 Denys Vlasenko :
>  On 02/25/2015 01:37 PM, Andrey Wagin wrote:
> > 2015-02-13 0:54 GMT+03:00 Denys Vlasenko :
> >> 64-bit code was using six stack slots less by not saving/restoring
> >> registers which are callee-preserved according to C ABI,
> >> and not allocating space for them.
> >> Only when syscall needed a complete "struct pt_regs",
> >> the complete area was allocated and filled in.
> >> As an additional twist, on interrupt entry a "slightly less truncated 
> >> pt_regs"
> >> trick is used, to make nested interrupt stacks easier to unwind.
> >>
> >> This proved to be a source of significant obfuscation and subtle bugs.
> >> For example, stub_fork had to pop the return address,
> >> extend the struct, save registers, and push return address back. Ugly.
> >> ia32_ptregs_common pops return address and "returns" via jmp insn,
> >> throwing a wrench into CPU return stack cache.
> >>
> >> This patch changes code to always allocate a complete "struct pt_regs".
> >> The saving of registers is still done lazily.
> >>
> >> "Partial pt_regs" trick on interrupt stack is retained.
> >>
> >> Macros which manipulate "struct pt_regs" on stack are reworked:
> >> ALLOC_PT_GPREGS_ON_STACK allocates the structure.
> >> SAVE_C_REGS saves to it those registers which are clobbered by C code.
> >> SAVE_EXTRA_REGS saves to it all other registers.
> >> Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse 
> >> it.
> >>
> >> ia32_ptregs_common, stub_fork and friends lost their ugly dance with
> >> return pointer.
> >>
> >> LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
> >> instead of magic numbers.
> >>
> >> error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
> >> instead of having it open-coded yet again.
> >>
> >> Patch was run-tested: 64-bit executables, 32-bit executables,
> >> strace works.
> >> Timing tests did not show measurable difference in 32-bit
> >> and 64-bit syscalls.
> >
> > Hello Denys,
> >
> > My test vm doesn't boot with this patch. Could you help to investigate
> > this issue?
> 
>  I think I found it. This part of my patch is possibly wrong:
> 
>  @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void)
>   #define ARCH_LOCKDEP_SYS_EXIT_IRQ  \
>  TRACE_IRQS_ON; \
>  sti; \
>  -   SAVE_REST; \
>  +   SAVE_EXTRA_REGS; \
>  LOCKDEP_SYS_EXIT; \
>  -   RESTORE_REST; \
>  +   RESTORE_EXTRA_REGS; \
>  cli; \
>  TRACE_IRQS_OFF;
> 
>  The "SAVE_REST" here is intended to really *push* extra regs on stack,
>  but the patch changed it so that they are written to existing stack
>  slots above.
> 
>  From code inspection it should work in almost all cases, but some
>  locations where it is used are really obscure.
> 
>  If there are places where *pushing* regs is really necessary,
>  this can corrupt rbp,rbx,r12-15 registers.
> 
>  Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the 
>  bug
>  was here.
>  Please find updated patch attached. Can you try it?
> >>>
> >>> It doesn't work
> >
> > Thanks for testing it anyway.
> >
> >
> >>> [3.016262] traps: systemd-cgroups[390] general protection
> >>> ip:7f456f7b6028 sp:7fffdc059718 error:0 in
> >>> ld-2.18.so[7f456f79e000+2]
> >
> > This is what I know about these crashes. The SEGV itself is caused by
> > HLT instruction executed by dynamic loader, ld-2.NN.so.
> > The instruction is in _exit function, and is only reachable if
> > exit_group and exit syscalls fail to terminate the process.
> > So it seems that syscall execution is getting badly broken somehow
> > at some point.
> >
> > This happens to both reporters.
> >
> > My theory that it is related to lockdep seems to be wrong, because
> > Sabrina's kernel is not lockdep-enabled, yet it sees the same failure.
> >
> > Both kernels are paravirtualized, both are booted under KVM,
> > Andrey runs it with four virtual CPUs, Sabrina runs with two.
> >
> > My next theory is that I missed something related to paravirt.
> > I am looking at that code, so far I don't see anything suspicious.
> >
> > Unfortunately, it doesn't happen to me: I have Sabrina's bzImage,
> > I run it under "qemu-system-x86_64 -enable-kvm -smp 2",
> > I see in dmesg that kernel does detect that it is being run under KVM,
> > but it works for me. No mysterious segfaults.
> >
> > Andrey, can you send me your bzImage? Maybe it will trigger
> > the problem for me.
> >

Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-25 Thread Sabrina Dubroca
Hello,

I'm seeing the same symptoms on next-2015022{4,5}, also with systemd in a VM:

traps: fsck[99] general protection ip:7fccb2401270 sp:7fffea3b8938 error:0 in 
libc-2.21.so[7fccb2349000+199000]
traps: systemd-cgroups[100] general protection ip:7fdd8ff784f8 sp:7ffcf6e27ad8 
error:0 in ld-2.21.so[7fdd8ff6+22000]
traps: systemd-cgroups[94] general protection ip:7f9f23bd24f8 sp:74fc5578 
error:0 in ld-2.21.so[7f9f23bba000+22000]
traps: systemd-cgroups[102] general protection ip:7f211e6574f8 sp:7ffdb8e0d538 
error:0 in ld-2.21.so[7f211e63f000+22000]
traps: systemd-cgroups[103] general protection ip:7f80627c34f8 sp:7ffc7fa4cff8 
error:0 in ld-2.21.so[7f80627ab000+22000]


2015-02-25, 14:55:34 +0100, Denys Vlasenko wrote:
> On 02/25/2015 01:37 PM, Andrey Wagin wrote:
> > 2015-02-13 0:54 GMT+03:00 Denys Vlasenko :
> > My test vm doesn't boot with this patch. Could you help to investigate
> > this issue?
> 
> Hi Andrey, thanks for testing!
> 
> > I have attached a kernel config and console log.
> 
> Looking at the logs, it seems that regular syscalls do work:
> systemd managed to function for some time, even spawned
> a few children.
> 
> It might be that the bug is somewhere in signal delivery code.
> This would explain why oops got delayed.

It doesn't oops here, it just tries to load other bits of systemd and hangs.
I've noticed that "ip:" - "the address after ld-2.21.so[" is always
the same value, I don't know if that's expected or relevant.
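
Spelled out, the subtraction looks like this, using the three
systemd-cgroups lines above whose mapping base is printed in full.
fault_offset() is just a helper name for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of the faulting instruction inside the mapped object. */
static uint64_t fault_offset(uint64_t ip, uint64_t map_base)
{
	assert(ip >= map_base);
	return ip - map_base;
}
```

All three give 0x184f8, so despite the randomized load addresses it is
the same instruction inside ld-2.21.so that faults every time.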

(full log below)

> I am trying to reproduce it. My gcc seems to be a bit old -
> it can't digest CONFIG_CC_STACKPROTECTOR_STRONG=y in your .config.
> 
> I switched to using "only" CONFIG_CC_STACKPROTECTOR_REGULAR=y:
> 
> CONFIG_CC_STACKPROTECTOR=y
> # CONFIG_CC_STACKPROTECTOR_NONE is not set
> CONFIG_CC_STACKPROTECTOR_REGULAR=y
> # CONFIG_CC_STACKPROTECTOR_STRONG is not set
> 
> and resulting kernel works for me.

I don't have any STACKPROTECTOR in my config:

# CONFIG_CC_STACKPROTECTOR is not set
CONFIG_CC_STACKPROTECTOR_NONE=y
# CONFIG_CC_STACKPROTECTOR_REGULAR is not set
# CONFIG_CC_STACKPROTECTOR_STRONG is not set

(full config after the log)


I can start systemd's emergency shell (systemd.unit=emergency.target),
if running test programs helps.


Thanks,
Sabrina

[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Initializing cgroup subsys cpuacct
[0.00] Linux version 4.0.0-rc1-next-20150225 (zappy@kria) (gcc version 
4.9.2 20150204 (prerelease) (GCC) ) #636 SMP PREEMPT Wed Feb 25 13:45:23 CET 
2015
[0.00] Command line: root=/dev/sda1 
netconsole=@10.0.1.23/,@10.0.1.10/ console=ttyS0
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000f-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x1ffd] usable
[0.00] BIOS-e820: [mem 0x1ffe-0x1fff] reserved
[0.00] BIOS-e820: [mem 0xfeffc000-0xfeff] reserved
[0.00] BIOS-e820: [mem 0xfffc-0x] reserved
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.8 present.
[0.00] Hypervisor detected: KVM
[0.00] AGP: No AGP bridge found
[0.00] e820: last_pfn = 0x1ffe0 max_arch_pfn = 0x4
[0.00] PAT configuration [0-7]: WB  WC  UC- UC  WB  WC  UC- UC  
[0.00] found SMP MP-table at [mem 0x000f1010-0x000f101f] mapped at 
[880f1010]
[0.00] Scanning 1 areas for low memory corruption
[0.00] init_memory_mapping: [mem 0x-0x000f]
[0.00] init_memory_mapping: [mem 0x1fc0-0x1fdf]
[0.00] init_memory_mapping: [mem 0x0010-0x1fbf]
[0.00] init_memory_mapping: [mem 0x1fe0-0x1ffd]
[0.00] ACPI: Early table checksum verification disabled
[0.00] ACPI: RSDP 0x000F0DD0 14 (v00 BOCHS )
[0.00] ACPI: RSDT 0x1FFE18BC 34 (v01 BOCHS  BXPCRSDT 
0001 BXPC 0001)
[0.00] ACPI: FACP 0x1FFE0E48 74 (v01 BOCHS  BXPCFACP 
0001 BXPC 0001)
[0.00] ACPI: DSDT 0x1FFE0040 000E08 (v01 BOCHS  BXPCDSDT 
0001 BXPC 0001)
[0.00] ACPI: FACS 0x1FFE 40
[0.00] ACPI: SSDT 0x1FFE0EBC 000948 (v01 BOCHS  BXPCSSDT 
0001 BXPC 0001)
[0.00] ACPI: APIC 0x1FFE1804 80 (v01 BOCHS  BXPCAPIC 
0001 BXPC 0001)
[0.00] ACPI: HPET 0x1FFE1884 38 (v01 BOCHS  BXPCHPET 
0001 BXPC 0001)
[0.00] kvm-clock: Using msrs 4b564d01 and 4b564d00
[0.00] kvm-clock: cpu 0, msr 0:1ffdf001, primary cpu clock
[0.00] Zone ranges:
[0.00]   DMA  [mem 0x1000-0x00ff]
[0.00]   DMA32 

Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs

2015-02-25 Thread Sabrina Dubroca
2015-02-25, 13:59:06 -0800, Andy Lutomirski wrote:
 On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko dvlas...@redhat.com wrote:
  On 02/25/2015 09:10 PM, Andy Lutomirski wrote:
  On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin ava...@gmail.com wrote:
  2015-02-25 21:42 GMT+03:00 Denys Vlasenko dvlas...@redhat.com:
  On 02/25/2015 01:37 PM, Andrey Wagin wrote:
  2015-02-13 0:54 GMT+03:00 Denys Vlasenko dvlas...@redhat.com:
  64-bit code was using six stack slots less by not saving/restoring
  registers which are callee-preserved according to C ABI,
  and not allocating space for them.
  Only when syscall needed a complete struct pt_regs,
  the complete area was allocated and filled in.
  As an additional twist, on interrupt entry a slightly less truncated 
  pt_regs
  trick is used, to make nested interrupt stacks easier to unwind.
 
  This proved to be a source of significant obfuscation and subtle bugs.
  For example, stub_fork had to pop the return address,
  extend the struct, save registers, and push return address back. Ugly.
  ia32_ptregs_common pops return address and returns via jmp insn,
  throwing a wrench into CPU return stack cache.
 
  This patch changes code to always allocate a complete struct pt_regs.
  The saving of registers is still done lazily.
 
  Partial pt_regs trick on interrupt stack is retained.
 
  Macros which manipulate struct pt_regs on stack are reworked:
  ALLOC_PT_GPREGS_ON_STACK allocates the structure.
  SAVE_C_REGS saves to it those registers which are clobbered by C code.
  SAVE_EXTRA_REGS saves to it all other registers.
  Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse 
  it.
 
  ia32_ptregs_common, stub_fork and friends lost their ugly dance with
  return pointer.
 
  LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
  instead of magic numbers.
 
  error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
  instead of having it open-coded yet again.
 
  Patch was run-tested: 64-bit executables, 32-bit executables,
  strace works.
  Timing tests did not show measurable difference in 32-bit
  and 64-bit syscalls.
 
  Hello Denys,
 
  My test vm doesn't boot with this patch. Could you help to investigate
  this issue?
 
  I think I found it. This part of my patch is possibly wrong:
 
  @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void)
   #define ARCH_LOCKDEP_SYS_EXIT_IRQ  \
  TRACE_IRQS_ON; \
  sti; \
  -   SAVE_REST; \
  +   SAVE_EXTRA_REGS; \
  LOCKDEP_SYS_EXIT; \
  -   RESTORE_REST; \
  +   RESTORE_EXTRA_REGS; \
  cli; \
  TRACE_IRQS_OFF;
 
  The SAVE_REST here is intended to really *push* extra regs on stack,
  but the patch changed it so that they are written to existing stack
  slots above.
 
  From code inspection it should work in almost all cases, but some
  locations where it is used are really obscure.
 
  If there are places where *pushing* regs is really necessary,
  this can corrupt rbp,rbx,r12-15 registers.
 
  Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the 
  bug
  was here.
  Please find updated patch attached. Can you try it?
 
  It doesn't work
 
  Thanks for testing it anyway.
 
 
  [3.016262] traps: systemd-cgroups[390] general protection
  ip:7f456f7b6028 sp:7fffdc059718 error:0 in
  ld-2.18.so[7f456f79e000+2]
 
  This is what I know about these crashes. The SEGV itself is caused by
  HLT instruction executed by dynamic loader, ld-2.NN.so.
  The instruction is in _exit function, and is only reachable if
  exit_group and exit syscalls fail to terminate the process.
  So it seems that syscall execution is getting badly broken somehow
  at some point.
 
  This happens to both reporters.
 
  My theory that it is related to lockdep seems to be wrong, because
  Sabrina's kernel is not lockdep-enabled, yet it sees the same failure.
 
  Both kernels are paravirtualized, both are booted under KVM,
  Andrey runs it with four virtual CPUs, Sabrina runs with two.
 
  My next theory is that I missed something related to paravirt.
  I am looking at that code, so far I don't see anything suspicious.
 
  Unfortunately, it doesn't happen to me: I have Sabrina's bzImage,
  I run it under qemu-system-x86_64 -enable-kvm -smp 2,
  I see in dmesg that kernel does detect that it is being run under KVM,
  but it works for me. No mysterious segfaults.
 
  Andrey, can you send me your bzImage? Maybe it will trigger
  the problem for me.
 
 
  The change to stub_\func looks wrong to me.  It saves and restores
  regs, but those regs might already have been saved if we're on the
  slow path.  (Yes, all that code is quite buggy even without all these
  patches.)  So is execve.
 
  This means that, for example, execve called in the slow path will
  save/restore regs twice.  If the values in the regs after the first
  save and before the second save are different, then we corrupt user
  state.
 
  This part?
 
  .macro 

Re: linux-next: Tree for Feb 13

2015-02-13 Thread Sabrina Dubroca
2015-02-13, 16:56:15 +1100, Stephen Rothwell wrote:
> Hi all,
> 
> Please do not add any material destined for v3.21 to your linux-next
> included trees until after v3.20-rc1 has been released.
> 
> Changes since 20150212:

Hi Stephen,

Your conflict resolution in

8fe7fba50596 "Merge branch 'akpm-current/current'"

for mm/memory.c looks a bit off.  I get flooded with these messages:

  BUG: non-zero nr_pmds on freeing mm: 4

and fixed it with:


diff --git a/mm/memory.c b/mm/memory.c
index 450e4952c5ef..802adda2b0b6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3350,7 +3350,6 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, 
unsigned long address)
smp_wmb(); /* See comment in __pte_alloc */
 
spin_lock(&mm->page_table_lock);
-   mm_inc_nr_pmds(mm);
 #ifndef __ARCH_HAS_4LEVEL_HACK
if (!pud_present(*pud)) {
mm_inc_nr_pmds(mm);


references:
http://www.spinics.net/lists/linux-mm/msg84294.html
dc6c9a35b66b "mm: account pmd page tables to the process"

[CC'ed Kirill A. Shutemov]


Thanks

-- 
Sabrina


Re: linux-next: Tree for Jan 20 -- Kernel panic - Unable to mount root fs

2015-01-21 Thread Sabrina Dubroca
2015-01-21, 21:28:33 +0000, Al Viro wrote:
> On Wed, Jan 21, 2015 at 01:03:20PM -0800, Guenter Roeck wrote:
> > ok case (putname commented out):
> > 
> > user_path_at_empty lookup usr flags 0x0
> > path_lookupat: calling path_init 'usr' flags=40
> > path_init: link_path_walk() returned 0
> > path_lookupat: path_init 'usr' flags=40[50] returned 0
> > walk_component: lookup_fast() returned 1
> > walk_component: lookup_slow() returned 0
> > walk_component: inode=  (null), negative=1
> > do_path_lookup(usr, 0x10)
> > path_lookupat: calling path_init 'usr' flags=50
> > path_init: link_path_walk() returned 0
> > path_lookupat: path_init 'usr' flags=50[50] returned 0
> > mkdir[c74012a0,/usr] => 0
> > user_path_at_empty lookup usr flags 0x1
> > path_lookupat: calling path_init 'usr' flags=41
> > path_init: link_path_walk() returned 0
> > path_lookupat: path_init 'usr' flags=41[51] returned 0
> > walk_component: inode=c74004a0, negative=0
> > user_path_at_empty lookup usr flags 0x1
> > path_lookupat: calling path_init 'usr' flags=41
> > path_init: link_path_walk() returned 0
> > path_lookupat: path_init 'usr' flags=41[51] returned 0
> > 
> > failing case:
> > 
> > path_lookupat: calling path_init 'usr' flags=40
> > path_init: link_path_walk() returned 0
> > path_lookupat: path_init 'usr' flags=40[50] returned 0
> > walk_component: lookup_fast() returned 1
> > walk_component: lookup_slow() returned 0
> > walk_component: inode=  (null), negative=1
> > do_path_lookup(usr, 0x10)
> > path_lookupat: calling path_init 'usr' flags=50
> > path_init: link_path_walk() returned 0
> > path_lookupat: path_init 'usr' flags=50[50] returned 0
> > mkdir[c74012a0,/kkk] => 0   <===== SIC!
> 
> Cute. 'k' being 0x6b, aka POISON_FREE...  OK, the next question is what's
> been freed under us - I don't believe that it's dentry itself...
> Oh, fuck.  OK, I see what happens.  Look at kern_path_create(); it does
> LOOKUP_PARENT walk, leaving nd->last pointing to the last component of
> the *COPY* of the name it's just created, walked and freed.
> 
> OK...  Fortunately, struct nameidata is completely opaque outside of
> fs/namei.c, so we only need to care about a couple of codepaths.
> 
> Folks, could you check if the following on top of linux-next fixes the
> problem?

Yes, it works.
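
For readers following the diagnosis quoted above, here is a minimal userspace sketch of the failure mode Al describes: a lookup leaves a pointer into a private copy of the name, the copy is released, and the pointer then reads poison. Everything named fake_* is hypothetical and only imitates the shape of getname_kernel(), putname() and nd->last. The simulated release overwrites the buffer with 0x6b instead of calling free(), so the stale read stays well defined in C.

```c
#include <stdlib.h>
#include <string.h>

#define FAKE_POISON_FREE 0x6b	/* ASCII 'k', the value Al identifies as POISON_FREE */

struct fake_filename {
	char *name;	/* private copy of the caller's string */
	size_t len;
};

struct fake_nameidata {
	const char *last;	/* points INTO fake_filename.name, not a copy */
};

/* Imitates getname_kernel(): duplicate the caller's string. */
static struct fake_filename *fake_getname_kernel(const char *s)
{
	struct fake_filename *fn = malloc(sizeof(*fn));

	if (!fn)
		return NULL;
	fn->len = strlen(s);
	fn->name = malloc(fn->len + 1);
	if (!fn->name) {
		free(fn);
		return NULL;
	}
	memcpy(fn->name, s, fn->len + 1);
	return fn;
}

/* Imitates a LOOKUP_PARENT walk: remember where the last component starts. */
static void fake_lookup_parent(const struct fake_filename *fn,
			       struct fake_nameidata *nd)
{
	const char *slash = strrchr(fn->name, '/');

	nd->last = slash ? slash + 1 : fn->name;
}

/* Imitates putname() on a poisoning allocator: scribble, keep the NUL. */
static void fake_putname(struct fake_filename *fn)
{
	memset(fn->name, FAKE_POISON_FREE, fn->len);
}

/* Really give the memory back once the demonstration is over. */
static void fake_release(struct fake_filename *fn)
{
	free(fn->name);
	free(fn);
}
```

Calling fake_getname_kernel("usr"), then fake_lookup_parent(), then fake_putname() leaves nd.last reading as "kkk", which is the same shape as the mkdir[...,/kkk] line in the failing log.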


-- 
Sabrina


Re: linux-next: Tree for Jan 20 -- Kernel panic - Unable to mount root fs

2015-01-21 Thread Sabrina Dubroca
2015-01-21, 13:03:20 -0800, Guenter Roeck wrote:
> On 01/21/2015 12:06 PM, Al Viro wrote:
> >On Wed, Jan 21, 2015 at 11:06:27AM -0800, Guenter Roeck wrote:
> >>On 01/21/2015 10:29 AM, Al Viro wrote:
> >>>On Wed, Jan 21, 2015 at 05:32:13AM -0800, Guenter Roeck wrote:
> Another data point (though I have no idea if it is useful or what it means):
> 
> In the working case, path_init sets nd->flags to 0x50 or 0x51.
> In the non-working case (ie for all files with a '/' in the name),
> it sets nd->flags to 0x10 or 0x11, even though it is always called
> with the LOOKUP_RCU bit set in flags.
> >>>
> >>>Umm...  Are those path_init() succeeding or failing?  Note that path_init()
> >>>includes "walk everything except for the last component", so your non-working
> >>>case is "have it walk anything at all".  What's failing there?  path_init()
> >>>or handling the remaining component?
> >>>
> >>path_init() returns -2. Guess that explains the unexpected flags ;-).
> >>The failure is from
> >>link_path_walk()
> >>walk_component()
> >
> >Which is to say, lookup gave it a negative dentry.  OK, let's just make
> >vfs_mkdir() and walk_component() print what they are doing; on top of
> >linux-next
> >
> >diff --git a/fs/namei.c b/fs/namei.c
> >index 323957f..8a4e22f 100644
> >--- a/fs/namei.c
> >+++ b/fs/namei.c
> >@@ -1586,8 +1586,11 @@ static inline int walk_component(struct nameidata *nd, struct path *path,
> > 		inode = path->dentry->d_inode;
> > 	}
> > 	err = -ENOENT;
> >-	if (!inode || d_is_negative(path->dentry))
> >+	if (!inode || d_is_negative(path->dentry)) {
> >+		printk(KERN_ERR "walk_component[%p,%pd4] -> negative\n",
> >+			path->dentry, path->dentry);
> > 		goto out_path_put;
> >+	}
> >
> > 	if (should_follow_link(path->dentry, follow)) {
> > 		if (nd->flags & LOOKUP_RCU) {
> >@@ -3521,6 +3524,7 @@ int vfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
> > 	error = dir->i_op->mkdir(dir, dentry, mode);
> > 	if (!error)
> > 		fsnotify_mkdir(dir, dentry);
> >+	printk(KERN_ERR "mkdir[%p,%pd4] => %d\n", dentry, dentry, error);
> > 	return error;
> >  }
> >  EXPORT_SYMBOL(vfs_mkdir);
> >
> 
> ok case (putname commented out):
> 
> user_path_at_empty lookup usr flags 0x0
> path_lookupat: calling path_init 'usr' flags=40
> path_init: link_path_walk() returned 0
> path_lookupat: path_init 'usr' flags=40[50] returned 0
> walk_component: lookup_fast() returned 1
> walk_component: lookup_slow() returned 0
> walk_component: inode=  (null), negative=1
> do_path_lookup(usr, 0x10)
> path_lookupat: calling path_init 'usr' flags=50
> path_init: link_path_walk() returned 0
> path_lookupat: path_init 'usr' flags=50[50] returned 0
> mkdir[c74012a0,/usr] => 0
> user_path_at_empty lookup usr flags 0x1
> path_lookupat: calling path_init 'usr' flags=41
> path_init: link_path_walk() returned 0
> path_lookupat: path_init 'usr' flags=41[51] returned 0
> walk_component: inode=c74004a0, negative=0
> user_path_at_empty lookup usr flags 0x1
> path_lookupat: calling path_init 'usr' flags=41
> path_init: link_path_walk() returned 0
> path_lookupat: path_init 'usr' flags=41[51] returned 0
> 
> failing case:
> 
> path_lookupat: calling path_init 'usr' flags=40
> path_init: link_path_walk() returned 0
> path_lookupat: path_init 'usr' flags=40[50] returned 0
> walk_component: lookup_fast() returned 1
> walk_component: lookup_slow() returned 0
> walk_component: inode=  (null), negative=1
> do_path_lookup(usr, 0x10)
> path_lookupat: calling path_init 'usr' flags=50
> path_init: link_path_walk() returned 0
> path_lookupat: path_init 'usr' flags=50[50] returned 0
> mkdir[c74012a0,/kkk] => 0 <===== SIC!
> user_path_at_empty lookup usr flags 0x1
> path_lookupat: calling path_init 'usr' flags=41
> path_init: link_path_walk() returned 0
> path_lookupat: path_init 'usr' flags=41[51] returned 0
> walk_component: lookup_fast() returned 1
> walk_component: lookup_slow() returned 0
> walk_component: inode=  (null), negative=1
> user_path_at_empty lookup usr flags 0x1
> path_lookupat: calling path_init 'usr' flags=41
> path_init: link_path_walk() returned 0
> path_lookupat: path_init 'usr' flags=41[51] returned 0
> walk_component: lookup_fast() returned 1
> walk_component: lookup_slow() returned 0
> walk_component: inode=  (null), negative=1

Yep, I get some "kkk" too.

With that patch:

## panic

[0.544839] walk_component[88001d6edbd8,/dev] -> negative
[0.545507] mkdir[88001d6ed1b8,/kkk] => 0
[0.545886] sys_mkdir dev:40755 returned 0
[0.546275] walk_component[88001d6ec288,/dev] -> negative
[0.546835] walk_component[88001d6eca20,/dev] -> negative
[0.547403] walk_component[88001d6ed950,/dev] -> negative
[0.547954] walk_component[88001d6ed440,/dev] -> negative
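[0.549260] walk_component[88001d6ec510,/dev] -> negative
[0.551161] walk_component[88001d6ec798,/dev] -> negative
[0.551719] walk_component[88001d6ed6c8,/dev] -> negative
[0.552281] walk_component[88001d6eef30,/root] -> negative
[0.552866] 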

Re: linux-next: Tree for Jan 20 -- Kernel panic - Unable to mount root fs

2015-01-21 Thread Sabrina Dubroca
2015-01-21, 16:39:12 +0100, Thierry Reding wrote:
> On Wed, Jan 21, 2015 at 10:24:11AM -0500, Paul Moore wrote:
> > On Wednesday, January 21, 2015 03:42:16 PM Thierry Reding wrote:
> > > On Wed, Jan 21, 2015 at 12:05:39PM +0100, Sabrina Dubroca wrote:
> > > > 2015-01-21, 04:36:38 +0000, Al Viro wrote:
> > > > > On Tue, Jan 20, 2015 at 08:01:26PM -0800, Guenter Roeck wrote:
> > > > > > With this patch:
> > > > > > 
> > > > > > sys_mkdir .:40775 returned -17
> > > > > > sys_mkdir usr:40775 returned 0
> > > > > > sys_mkdir usr/lib:40775 returned 0
> > > > > > sys_mkdir usr/share:40755 returned 0
> > > > > > sys_mkdir usr/share/udhcpc:40755 returned 0
> > > > > > sys_mkdir usr/bin:40775 returned 0
> > > > > > sys_mkdir usr/sbin:40775 returned 0
> > > > > > sys_mkdir mnt:40775 returned 0
> > > > > > sys_mkdir proc:40775 returned 0
> > > > > > sys_mkdir root:40775 returned 0
> > > > > > sys_mkdir lib:40775 returned 0
> > > > > > sys_mkdir lib/modules:40775 returned 0
> > > > > > ...
> > > > > > 
> > > > > > and the problem is fixed.
> > > > 
> > > > This patch also works for me.
> > > > 
> > > > > ... except that it simply confirms that something's fishy with
> > > > > getname_kernel() of ->name of struct filename returned by getname(). 
> > > > > IOW, I still do not understand the mechanism of breakage there.
> > > > 
> > > > I'm not so sure about that.  I tried to copy name to a new string in
> > > > do_path_lookup and that didn't help.
> > > > 
> > > > Now, I've removed the
> > > > 
> > > > putname(filename);
> > > > 
> > > > line from do_path_lookup and I don't get the panic.
> > > 
> > > That would indicate that somehow the refcount got unbalanced. Looking
> > > more closely it seems like the various audit_*() functions do take a
> > > reference, but maybe that's not enough.
> > 
> > I'm thinking the same thing and I think the problem may be that
> > __audit_reusename() is not bumping the filename->refcnt.  Can someone who
> > is seeing this problem bump the refcnt in __audit_reusename()?
> > 
> >   struct filename *
> >   __audit_reusename(const __user char *uptr)
> >   {
> > 	struct audit_context *context = current->audit_context;
> > 	struct audit_names *n;
> > 
> > 	list_for_each_entry(n, &context->names_list, list) {
> > 		if (!n->name)
> > 			continue;
> > 		if (n->name->uptr == uptr) {
> > +			n->name->refcnt++;
> > 			return n->name;
> > 		}
> > 	}
> > 	return NULL;
> >   }
> 
> That doesn't seem to help, at least in my case.

Same here.

Well, it's probably not an audit issue.  I tried audit=0 on the
command line, and I just rebuilt a kernel with CONFIG_AUDIT=n, and it's
still panicking.  This should have fixed any audit-related issue,
right?
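
To make the hypothesis being tested here concrete, the following is a small userspace sketch of a reference-count imbalance of the kind Paul suspected: an object is handed out a second time without bumping its count, so the second holder's put releases it under the first holder. All demo_* names are hypothetical and do not correspond to the kernel's audit or filename code; "released" is a flag rather than a real free() so the check stays well defined. Note that the thread goes on to rule this explanation out.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_name {
	int refcnt;
	bool released;	/* set instead of freeing, so tests can inspect it */
};

static void demo_init(struct demo_name *n)
{
	n->refcnt = 1;		/* the creator holds one reference */
	n->released = false;
}

/* Suspected pattern: hand the object out again without taking a reference. */
static struct demo_name *demo_reuse_unbalanced(struct demo_name *n)
{
	return n;
}

/* Proposed fix in the quoted snippet: bump the count on reuse. */
static struct demo_name *demo_reuse_balanced(struct demo_name *n)
{
	n->refcnt++;
	return n;
}

/* Drop one reference; the object is released when the count reaches zero. */
static void demo_put(struct demo_name *n)
{
	if (--n->refcnt <= 0)
		n->released = true;
}
```

With demo_reuse_unbalanced(), a single demo_put() by the second user releases the object while the creator still believes it holds a reference; with demo_reuse_balanced(), it takes both puts.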

-- 
Sabrina


Re: linux-next: Tree for Jan 20 -- Kernel panic - Unable to mount root fs

2015-01-21 Thread Sabrina Dubroca
2015-01-21, 04:36:38 +0000, Al Viro wrote:
> On Tue, Jan 20, 2015 at 08:01:26PM -0800, Guenter Roeck wrote:
> > With this patch:
> > 
> > sys_mkdir .:40775 returned -17
> > sys_mkdir usr:40775 returned 0
> > sys_mkdir usr/lib:40775 returned 0
> > sys_mkdir usr/share:40755 returned 0
> > sys_mkdir usr/share/udhcpc:40755 returned 0
> > sys_mkdir usr/bin:40775 returned 0
> > sys_mkdir usr/sbin:40775 returned 0
> > sys_mkdir mnt:40775 returned 0
> > sys_mkdir proc:40775 returned 0
> > sys_mkdir root:40775 returned 0
> > sys_mkdir lib:40775 returned 0
> > sys_mkdir lib/modules:40775 returned 0
> > ...
> > 
> > and the problem is fixed.

This patch also works for me.


> ... except that it simply confirms that something's fishy with
> getname_kernel() of ->name of struct filename returned by getname().
> IOW, I still do not understand the mechanism of breakage there.

I'm not so sure about that.  I tried to copy name to a new string in
do_path_lookup and that didn't help.

Now, I've removed the

putname(filename);

line from do_path_lookup and I don't get the panic.


And BTW, I added Guenter's debugging to init/initramfs.c and got:
sys_mkdir dev:40755 returned 0
sys_mkdir root:40700 returned 0

even if it ends up panic'ing.

-- 
Sabrina


Re: linux-next: Tree for Jan 20 -- Kernel panic - Unable to mount root fs

2015-01-20 Thread Sabrina Dubroca
2015-01-20, 23:17:25 +0000, Al Viro wrote:
> On Tue, Jan 20, 2015 at 10:50:41PM +0000, Al Viro wrote:
> > doesn't look at _anything_ other than name->name other than for
> > audit_inode().  And name->name is apparently the same.
> > 
> > It looks like something ends up buggering name->name in process, but then
> > the damn thing appears to be normal after return from filename_lookup()...
> 
> If my reconstruction of what's going on is correct, the call chain here
> is do_path_lookup() <- kern_path() <- lookup_bdev() <- blkdev_get_by_path()
> <- mount_bdev() <- some_type.mount() <- mount_fs()
> <- vfs_kern_mount() <- do_new_mount() <- do_mount() <- sys_mount()
> <- do_mount_root() <- mount_block_root() <- mount_root().  Which is
> obscenely long, BTW, but that's a separate story...
> 
> Could you slap
>   struct stat buf;
>   int n = sys_newstat(name, &buf);
>   printk(KERN_ERR "stat(\"%s\") -> %d\n", name, n);
>   n = sys_newstat("/dev", &buf);
>   printk(KERN_ERR "stat(\"dev\") -> %d\n", n);
> 
> in the beginning of mount_block_root() (init/do_mounts.c) and see what it
> prints?

I get

stat("/dev/root") -> -2
stat("dev") -> -2
with the patch applied (+panic)


and:

stat("/dev/root") -> 0
stat("dev") -> 0
with the old version of do_path_lookup.
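
For anyone wanting to reproduce the same check outside the kernel, here is a rough userspace analogue of the two stat calls Al asked for. It is only a sketch: demo_stat() is a hypothetical helper that wraps the libc stat() and returns 0 or a negative errno, mirroring the convention in the output above, where -2 corresponds to -ENOENT.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Returns 0 if the path can be stat'ed, otherwise the negated errno,
 * which is how the in-kernel sys_newstat() result reads in the log.
 */
static int demo_stat(const char *path)
{
	struct stat buf;

	if (stat(path, &buf) == 0)
		return 0;
	return -errno;
}
```

On a normally booted system demo_stat("/dev") would be expected to return 0, and a path that does not exist returns -ENOENT.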

-- 
Sabrina


Re: linux-next: Tree for Jan 20 -- Kernel panic - Unable to mount root fs

2015-01-20 Thread Sabrina Dubroca
2015-01-20, 21:58:31 +0000, Al Viro wrote:
> On Tue, Jan 20, 2015 at 10:38:58PM +0100, Sabrina Dubroca wrote:
> 
> > [1.538646] fn_lookup bsg/0:0:0:0 -2, 88001f718000 bsg/0:0:0:0
> > [1.539704] fn_lookup bsg 0, 88001f718000 bsg
> > [1.540559] fn_lookup bsg/0:0:0:0 -2, 88001f718000 bsg/0:0:0:0
> > [1.552611] fn_lookup bsg/1:0:0:0 -2, 88001f718000 bsg/1:0:0:0
> > [1.553689] fn_lookup bsg 0, 88001f718000 bsg
> > [1.554505] fn_lookup bsg/1:0:0:0 -2, 88001f718000 bsg/1:0:0:0
> > [1.557554] fn_lookup sda 0, 88001f718000 sda
> > [1.558368] fn_lookup sda 0, 88001f718000 sda
> > [1.564190] fn_lookup sda1 0, 88001f718000 sda1
> > [1.565008] fn_lookup sda1 0, 88001f718000 sda1
> > [1.570751] fn_lookup /dev/ram -2, 88001f71a300 /dev/ram
> > [1.571786] fn_lookup /dev/root -2, 88001f71b480 /dev/root
> 
> Nuts...  Is reverting just this (do_path_lookup()) part of commit sufficient
> to recover the normal behaviour?

Yes.

-- 
Sabrina


Re: linux-next: Tree for Jan 20 -- Kernel panic - Unable to mount root fs

2015-01-20 Thread Sabrina Dubroca
2015-01-20, 21:02:03 +0000, Al Viro wrote:
> On Tue, Jan 20, 2015 at 09:45:04PM +0100, Sabrina Dubroca wrote:
> 
> > printk(KERN_ERR "fn_lookup %s %d\n", name, retval);
> > 
> > and I get:
> > 
> > [1.618558] fn_lookup bsg/0:0:0:0 -2
> > [1.619437] fn_lookup bsg 0
> > [1.620236] fn_lookup bsg/0:0:0:0 -2
> > [1.625996] fn_lookup sda 0
> > [1.626609] fn_lookup sda 0
> > [1.639007] fn_lookup sda1 0
> > [1.639691] fn_lookup sda1 0
> > [1.643656] fn_lookup bsg/1:0:0:0 -2
> > [1.644974] fn_lookup bsg 0
> > [1.645928] fn_lookup bsg/1:0:0:0 -2
> > [1.649483] fn_lookup /dev/ram -2
> > [1.650424] fn_lookup /dev/root -2
> > [1.651234] VFS: Cannot open root device "sda1" or unknown-block(8,1): error -2
> 
> That -2 is -ENOENT...  Wait a sec, what's in filename, filename->name and
> what do you get from your printk on kernel with that commit reverted?

filename->name matches name.  With
printk(KERN_ERR "fn_lookup %s %d, %p %s\n", name, retval, filename, filename->name);

[1.538646] fn_lookup bsg/0:0:0:0 -2, 88001f718000 bsg/0:0:0:0
[1.539704] fn_lookup bsg 0, 88001f718000 bsg
[1.540559] fn_lookup bsg/0:0:0:0 -2, 88001f718000 bsg/0:0:0:0
[1.552611] fn_lookup bsg/1:0:0:0 -2, 88001f718000 bsg/1:0:0:0
[1.553689] fn_lookup bsg 0, 88001f718000 bsg
[1.554505] fn_lookup bsg/1:0:0:0 -2, 88001f718000 bsg/1:0:0:0
[1.557554] fn_lookup sda 0, 88001f718000 sda
[1.558368] fn_lookup sda 0, 88001f718000 sda
[1.564190] fn_lookup sda1 0, 88001f718000 sda1
[1.565008] fn_lookup sda1 0, 88001f718000 sda1
[1.570751] fn_lookup /dev/ram -2, 88001f71a300 /dev/ram
[1.571786] fn_lookup /dev/root -2, 88001f71b480 /dev/root


and with
printk(KERN_ERR "fn_lookup %s %d, %s\n", name, retval, filename.name);
in the original do_path_lookup:

[1.426101] fn_lookup bsg/0:0:0:0 -2, bsg/0:0:0:0
[1.426893] fn_lookup bsg 0, bsg
[1.427406] fn_lookup bsg/0:0:0:0 0, bsg/0:0:0:0
[1.431530] fn_lookup sda 0, sda
[1.438346] fn_lookup bsg/1:0:0:0 0, bsg/1:0:0:0
[1.443658] fn_lookup sda1 0, sda1
[1.448344] fn_lookup /dev/ram 0, /dev/ram
[1.449148] fn_lookup /dev/root 0, /dev/root
[1.449835] fn_lookup /dev/root 0, /dev/root
[1.451586] EXT4-fs (sda1): couldn't mount as ext3 due to feature incompatibilities
[1.452954] fn_lookup /dev/root 0, /dev/root
[1.454292] EXT4-fs (sda1): couldn't mount as ext2 due to feature incompatibilities
[1.456331] fn_lookup /dev/root 0, /dev/root
[1.480208] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
[1.481323] VFS: Mounted root (ext4 filesystem) readonly on device 8:1.


-- 
Sabrina


Re: linux-next: Tree for Jan 20 -- Kernel panic - Unable to mount root fs

2015-01-20 Thread Sabrina Dubroca
2015-01-20, 19:54:32 +0000, Al Viro wrote:
> On Tue, Jan 20, 2015 at 06:51:35PM +0100, Sabrina Dubroca wrote:
> > 2015-01-20, 12:39:08 -0500, Paul Moore wrote:
> > > On Tuesday, January 20, 2015 05:56:55 PM Sabrina Dubroca wrote:
> > > > Hello,
> > > > 
> > > > Today's linux-next doesn't boot on my qemu VM:
> > > 
> > > ...
> > >  
> > > > I bisected it down to:
> > > > 
> > > > 5dc5218840e1  fs: create proper filename objects using getname_kernel()
> > > > 
> > > > I reverted then reapplied each part of that patch.  It works if I
> > > > leave out the hunk for do_path_lookup:
> > > > 
> > > > diff --git a/fs/namei.c b/fs/namei.c
> > > > index eeb3b83661f8..c3d21b79090e 100644
> > > > --- a/fs/namei.c
> > > > +++ b/fs/namei.c
> > > > @@ -2001,9 +2001,15 @@ static int filename_lookup(int dfd, struct filename *name,
> > > >  static int do_path_lookup(int dfd, const char *name,
> > > >                            unsigned int flags, struct nameidata *nd)
> > > >  {
> > > > -   struct filename filename = { .name = name };
> > > > +   int retval;
> > > > +   struct filename *filename;
> > > > 
> > > > -   return filename_lookup(dfd, &filename, flags, nd);
> > > > +   filename = getname_kernel(name);
> > > > +   if (unlikely(IS_ERR(filename)))
> > > > +   return PTR_ERR(filename);
> > > > +   retval = filename_lookup(dfd, filename, flags, nd);
> > > > +   putname(filename);
> > > > +   return retval;
> > > >  }
> > > > 
> > > > I don't know what other info you may need.
> > > > Full dmesg for the failed boot included below.
> > > 
> > > Thanks for testing this and reporting the problem, especially such a
> > > small bisection.  Unfortunately nothing is immediately obvious to me,
> > > would you mind sharing your kernel config so I can try to reproduce and
> > > debug the problem?
> > 
> > Sure.
> > 
> > I run qemu with:
> > 
> > qemu-system-x86_64 -enable-kvm -cpu host  -m 512 -kernel bzImage -append 'root=/dev/sda1' $IMG
> > 
> > and the image contains a single ext4 partition with a basic ArchLinux
> > install.
> 
> Could you turn that return PTR_ERR(filename); into 
> {
>   printk(KERN_ERR "failed(%p -> %d)", name, PTR_ERR(filename));
>   return PTR_ERR(filename);
> }
> reproduce the panic and see what has it produced?

Nothing.

Not sure if it helps, but I added after filename_lookup:

printk(KERN_ERR "fn_lookup %s %d\n", name, retval);

and I get:

[1.618558] fn_lookup bsg/0:0:0:0 -2
[1.619437] fn_lookup bsg 0
[1.620236] fn_lookup bsg/0:0:0:0 -2
[1.625996] fn_lookup sda 0
[1.626609] fn_lookup sda 0
[1.639007] fn_lookup sda1 0
[1.639691] fn_lookup sda1 0
[1.643656] fn_lookup bsg/1:0:0:0 -2
[1.644974] fn_lookup bsg 0
[1.645928] fn_lookup bsg/1:0:0:0 -2
[1.649483] fn_lookup /dev/ram -2
[1.650424] fn_lookup /dev/root -2
[1.651234] VFS: Cannot open root device "sda1" or unknown-block(8,1): error 
-2
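
For readers less familiar with the idiom in the hunk above: getname_kernel() reports failure by encoding a negative errno in the pointer it returns. That is why a silent printk in the IS_ERR() branch suggests the allocation itself succeeded. Below is a minimal userspace sketch of that convention. The helper names mirror the kernel's linux/err.h, but the bodies are simplified approximations, and lookup_name() and try_lookup() are hypothetical stand-ins, not kernel functions.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Userspace imitation of the kernel's error-pointer helpers.
 * The top MAX_ERRNO values of the address space are reserved for errors. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for an allocator such as getname_kernel():
 * returns a usable pointer, or an errno encoded as a pointer. */
static const char *lookup_name(const char *name)
{
	if (!name || !*name)
		return ERR_PTR(-ENOENT);
	return name;
}

/* Returns 0 on success or a negative errno, in the style of do_path_lookup(). */
static long try_lookup(const char *name)
{
	const char *res = lookup_name(name);

	if (IS_ERR(res)) {
		/* Same shape as the debugging printk suggested in the thread. */
		fprintf(stderr, "failed(%p -> %ld)\n", (const void *)name, PTR_ERR(res));
		return PTR_ERR(res);
	}
	return 0;
}
```

With these definitions, try_lookup("sda1") returns 0, while try_lookup("") takes the error branch and returns -ENOENT.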


-- 
Sabrina
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: linux-next: Tree for Jan 20 -- Kernel panic - Unable to mount root fs

2015-01-20 Thread Sabrina Dubroca
2015-01-20, 12:39:08 -0500, Paul Moore wrote:
> On Tuesday, January 20, 2015 05:56:55 PM Sabrina Dubroca wrote:
> > Hello,
> > 
> > Today's linux-next doesn't boot on my qemu VM:
> 
> ...
>  
> > I bisected it down to:
> > 
> > 5dc5218840e1  fs: create proper filename objects using getname_kernel()
> > 
> > I reverted then reapplied each part of that patch.  It works if I
> > leave out the hunk for do_path_lookup:
> > 
> > diff --git a/fs/namei.c b/fs/namei.c
> > index eeb3b83661f8..c3d21b79090e 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -2001,9 +2001,15 @@ static int filename_lookup(int dfd, struct filename *name,
> >  static int do_path_lookup(int dfd, const char *name,
> >                            unsigned int flags, struct nameidata *nd)
> >  {
> > -   struct filename filename = { .name = name };
> > +   int retval;
> > +   struct filename *filename;
> > 
> > -   return filename_lookup(dfd, &filename, flags, nd);
> > +   filename = getname_kernel(name);
> > +   if (unlikely(IS_ERR(filename)))
> > +           return PTR_ERR(filename);
> > +   retval = filename_lookup(dfd, filename, flags, nd);
> > +   putname(filename);
> > +   return retval;
> >  }
> > 
> > I don't know what other info you may need.
> > Full dmesg for the failed boot included below.
> 
> Thanks for testing this and reporting the problem, especially such a small 
> bisection.  Unfortunately nothing is immediately obvious to me, would you mind 
> sharing your kernel config so I can try to reproduce and debug the problem?

Sure.

I run qemu with:

qemu-system-x86_64 -enable-kvm -cpu host  -m 512 -kernel bzImage -append 'root=/dev/sda1' $IMG

and the image contains a single ext4 partition with a basic ArchLinux install.

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 3.19.0-rc5 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11"
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="earth"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_FHANDLE=y
CONFIG_USELIB=y
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_LEGACY_ALLOC_HWIRQ=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
CONFIG_GENERIC_MSI_IRQ=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
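CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y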

Re: linux-next: Tree for Jan 20 -- Kernel panic - Unable to mount root fs

2015-01-20 Thread Sabrina Dubroca
Hello,

Today's linux-next doesn't boot on my qemu VM:

[1.248357] scsi 0:0:0:0: Direct-Access ATA  QEMU HARDDISK0
PQ: 0 ANSI: 5
[1.255899] sd 0:0:0:0: [sda] 8388608 512-byte logical blocks: (4.29 GB/4.00 
GiB)
[1.258333] sd 0:0:0:0: [sda] Write Protect is off
[1.259475] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, 
doesn't support DPO or FUA
[1.268417] scsi 1:0:0:0: CD-ROMQEMU QEMU DVD-ROM 2.2. 
PQ: 0 ANSI: 5
[1.271673]  sda: sda1
[1.281061] sd 0:0:0:0: [sda] Attached SCSI disk
[1.282320] VFS: Cannot open root device "sda1" or unknown-block(8,1): error 
-2
[1.283484] Please append a correct "root=" boot option; here are the 
available partitions:
[1.284748] 01004096 ram0  (driver?)
[1.285479] 01014096 ram1  (driver?)
[1.286218] 01024096 ram2  (driver?)
[1.286992] 01034096 ram3  (driver?)
[1.287741] 01044096 ram4  (driver?)
[1.288640] 01054096 ram5  (driver?)
[1.289394] 01064096 ram6  (driver?)
[1.290195] 01074096 ram7  (driver?)
[1.290962] 01084096 ram8  (driver?)
[1.291695] 01094096 ram9  (driver?)
[1.292404] 010a4096 ram10  (driver?)
[1.293114] 010b4096 ram11  (driver?)
[1.293922] 010c4096 ram12  (driver?)
[1.294643] 010d4096 ram13  (driver?)
[1.295401] 010e4096 ram14  (driver?)
[1.296167] 010f4096 ram15  (driver?)
[1.296975] 0800 4194304 sda  driver: sd
[1.297697]   0801 4194272 sda1 -01
[1.298418] Kernel panic - not syncing: VFS: Unable to mount root fs on 
unknown-block(8,1)
[1.300034] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
3.19.0-rc5-next-20150120-dirty #410
[1.300039] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.7.5-20140617_173321-var-lib-archbuild-testing-x86_64-tobias 04/01/2014
[1.300039]  ea001340 88001f673dd8 817e1197 
004e
[1.300039]  81a6f2c8 88001f673e58 817dfd43 
81c9a860
[1.300039]  8810 88001f673e68 88001f673e08 
31616473
[1.300039] Call Trace:
[1.300039]  [817e1197] dump_stack+0x4f/0x7b
[1.300039]  [817dfd43] panic+0xd2/0x217
[1.300039]  [81efd58b] mount_block_root+0x200/0x28d
[1.300039]  [81efd78b] mount_root+0x54/0x58
[1.300039]  [81efd8f7] prepare_namespace+0x168/0x1a1
[1.300039]  [81efd2b1] kernel_init_freeable+0x29d/0x2ad
[1.300039]  [817d7440] ? rest_init+0x140/0x140
[1.300039]  [817d744e] kernel_init+0xe/0xf0
[1.300039]  [817eb87c] ret_from_fork+0x7c/0xb0
[1.300039]  [817d7440] ? rest_init+0x140/0x140
[1.300039] Kernel Offset: 0x0 from 0x8100 (relocation range: 
0x8000-0x9fff)
[1.300039] ---[ end Kernel panic - not syncing: VFS: Unable to mount root 
fs on unknown-block(8,1)


I bisected it down to:

5dc5218840e1  fs: create proper filename objects using getname_kernel()

I reverted then reapplied each part of that patch.  It works if I
leave out the hunk for do_path_lookup:

diff --git a/fs/namei.c b/fs/namei.c
index eeb3b83661f8..c3d21b79090e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2001,9 +2001,15 @@ static int filename_lookup(int dfd, struct filename *name,
 static int do_path_lookup(int dfd, const char *name,
                           unsigned int flags, struct nameidata *nd)
 {
-   struct filename filename = { .name = name };
+   int retval;
+   struct filename *filename;
 
-   return filename_lookup(dfd, &filename, flags, nd);
+   filename = getname_kernel(name);
+   if (unlikely(IS_ERR(filename)))
+           return PTR_ERR(filename);
+   retval = filename_lookup(dfd, filename, flags, nd);
+   putname(filename);
+   return retval;
 }
 
 /* does lookup, returns the object with parent locked */



I don't know what other info you may need.
Full dmesg for the failed boot included below.

Thanks.


[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Initializing cgroup subsys cpuacct
[0.00] Linux version 3.19.0-rc5-next-20150120-dirty (zappy@kria) (gcc 
version 4.9.2 20141224 (prerelease) (GCC) ) #410 SMP PREEMPT Tue Jan 20 
17:27:49 CET 2015
[0.00] Command line: root=/dev/sda1 console=ttyS0
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000f-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x1ffd] usable
[0.00] BIOS-e820: [mem 0x1ffe-0x1fff] reserved
[0.00] BIOS-e820: [mem 0xfeffc000-0xfeff] reserved
[0.00] BIOS-e820: [mem 

Re: linux-next: Tree for Jan 20 -- Kernel panic - Unable to mount root fs

2015-01-20 Thread Sabrina Dubroca
2015-01-20, 21:02:03 +0000, Al Viro wrote:
> On Tue, Jan 20, 2015 at 09:45:04PM +0100, Sabrina Dubroca wrote:
> 
> > printk(KERN_ERR "fn_lookup %s %d\n", name, retval);
> > 
> > and I get:
> > 
> > [1.618558] fn_lookup bsg/0:0:0:0 -2
> > [1.619437] fn_lookup bsg 0
> > [1.620236] fn_lookup bsg/0:0:0:0 -2
> > [1.625996] fn_lookup sda 0
> > [1.626609] fn_lookup sda 0
> > [1.639007] fn_lookup sda1 0
> > [1.639691] fn_lookup sda1 0
> > [1.643656] fn_lookup bsg/1:0:0:0 -2
> > [1.644974] fn_lookup bsg 0
> > [1.645928] fn_lookup bsg/1:0:0:0 -2
> > [1.649483] fn_lookup /dev/ram -2
> > [1.650424] fn_lookup /dev/root -2
> > [1.651234] VFS: Cannot open root device "sda1" or unknown-block(8,1): error -2
> 
> That -2 is -ENOENT...  Wait a sec, what's in filename, filename->name and
> what do you get from your printk on kernel with that commit reverted?

filename->name matches name. With
printk(KERN_ERR "fn_lookup %s %d, %p %s\n", name, retval, filename, filename->name);

[1.538646] fn_lookup bsg/0:0:0:0 -2, 88001f718000 bsg/0:0:0:0
[1.539704] fn_lookup bsg 0, 88001f718000 bsg
[1.540559] fn_lookup bsg/0:0:0:0 -2, 88001f718000 bsg/0:0:0:0
[1.552611] fn_lookup bsg/1:0:0:0 -2, 88001f718000 bsg/1:0:0:0
[1.553689] fn_lookup bsg 0, 88001f718000 bsg
[1.554505] fn_lookup bsg/1:0:0:0 -2, 88001f718000 bsg/1:0:0:0
[1.557554] fn_lookup sda 0, 88001f718000 sda
[1.558368] fn_lookup sda 0, 88001f718000 sda
[1.564190] fn_lookup sda1 0, 88001f718000 sda1
[1.565008] fn_lookup sda1 0, 88001f718000 sda1
[1.570751] fn_lookup /dev/ram -2, 88001f71a300 /dev/ram
[1.571786] fn_lookup /dev/root -2, 88001f71b480 /dev/root


and with
printk(KERN_ERR "fn_lookup %s %d, %s\n", name, retval, filename.name);
in the original do_path_lookup:

[1.426101] fn_lookup bsg/0:0:0:0 -2, bsg/0:0:0:0
[1.426893] fn_lookup bsg 0, bsg
[1.427406] fn_lookup bsg/0:0:0:0 0, bsg/0:0:0:0
[1.431530] fn_lookup sda 0, sda
[1.438346] fn_lookup bsg/1:0:0:0 0, bsg/1:0:0:0
[1.443658] fn_lookup sda1 0, sda1
[1.448344] fn_lookup /dev/ram 0, /dev/ram
[1.449148] fn_lookup /dev/root 0, /dev/root
[1.449835] fn_lookup /dev/root 0, /dev/root
[1.451586] EXT4-fs (sda1): couldn't mount as ext3 due to feature 
incompatibilities
[1.452954] fn_lookup /dev/root 0, /dev/root
[1.454292] EXT4-fs (sda1): couldn't mount as ext2 due to feature 
incompatibilities
[1.456331] fn_lookup /dev/root 0, /dev/root
[1.480208] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: 
(null)
[1.481323] VFS: Mounted root (ext4 filesystem) readonly on device 8:1.
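
The -2 in the failing runs is a negated errno value, which Al identified as -ENOENT. A small userspace sketch of decoding such return values follows. describe_kernel_error() is a hypothetical helper name, and the numeric value of ENOENT is assumed to be 2, as on Linux.

```c
#include <errno.h>
#include <string.h>

/* Kernel-style return values are 0 on success or a negative errno,
 * so the sign has to be flipped before asking libc for a description. */
static const char *describe_kernel_error(int retval)
{
	if (retval >= 0)
		return "success";
	return strerror(-retval);
}
```

Under that assumption, describe_kernel_error(-2) yields the same text as strerror(ENOENT), and describe_kernel_error(0) yields "success".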


-- 
Sabrina


Re: linux-next: Tree for Jan 20 -- Kernel panic - Unable to mount root fs

2015-01-20 Thread Sabrina Dubroca
2015-01-20, 21:58:31 +0000, Al Viro wrote:
> On Tue, Jan 20, 2015 at 10:38:58PM +0100, Sabrina Dubroca wrote:
> 
> > [1.538646] fn_lookup bsg/0:0:0:0 -2, 88001f718000 bsg/0:0:0:0
> > [1.539704] fn_lookup bsg 0, 88001f718000 bsg
> > [1.540559] fn_lookup bsg/0:0:0:0 -2, 88001f718000 bsg/0:0:0:0
> > [1.552611] fn_lookup bsg/1:0:0:0 -2, 88001f718000 bsg/1:0:0:0
> > [1.553689] fn_lookup bsg 0, 88001f718000 bsg
> > [1.554505] fn_lookup bsg/1:0:0:0 -2, 88001f718000 bsg/1:0:0:0
> > [1.557554] fn_lookup sda 0, 88001f718000 sda
> > [1.558368] fn_lookup sda 0, 88001f718000 sda
> > [1.564190] fn_lookup sda1 0, 88001f718000 sda1
> > [1.565008] fn_lookup sda1 0, 88001f718000 sda1
> > [1.570751] fn_lookup /dev/ram -2, 88001f71a300 /dev/ram
> > [1.571786] fn_lookup /dev/root -2, 88001f71b480 /dev/root
> 
> Nuts...  Is reverting just this (do_path_lookup()) part of commit sufficient
> to recover the normal behaviour?

Yes.

-- 
Sabrina


Re: linux-next: Tree for Jan 20 -- Kernel panic - Unable to mount root fs

2015-01-20 Thread Sabrina Dubroca
2015-01-20, 23:17:25 +0000, Al Viro wrote:
> On Tue, Jan 20, 2015 at 10:50:41PM +0000, Al Viro wrote:
> > doesn't look at _anything_ other than name->name other than for audit_inode().
> > And name->name is apparently the same.
> > 
> > It looks like something ends up buggering name->name in process, but then
> > the damn thing appears to be normal after return from filename_lookup()...
> 
> If my reconstruction of what's going on is correct, the call chain here
> is do_path_lookup() <- kern_path() <- lookup_bdev() <- blkdev_get_by_path()
> <- mount_bdev() <- some_type.mount() <- mount_fs()
> <- vfs_kern_mount() <- do_new_mount() <- do_mount() <- sys_mount()
> <- do_mount_root() <- mount_block_root() <- mount_root().  Which is
> obscenely long, BTW, but that's a separate story...
> 
> Could you slap
>   struct stat buf;
>   int n = sys_newstat(name, &buf);
>   printk(KERN_ERR "stat(\"%s\") -> %d\n", name, n);
>   n = sys_newstat("/dev", &buf);
>   printk(KERN_ERR "stat(\"dev\") -> %d\n", n);
> 
> in the beginning of mount_block_root() (init/do_mounts.c) and see what it
> prints?

I get

stat("/dev/root") -> -2
stat("dev") -> -2
with the patch applied (+panic)


and:

stat("/dev/root") -> 0
stat("dev") -> 0
with the old version of do_path_lookup.
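
The probe above can be approximated from userspace with stat(2), which may help when comparing against a booted system. This is only a sketch: stat_result() is a hypothetical helper that folds the libc return value and errno into the kernel-style convention of 0 on success and -errno on failure. It is not the in-kernel sys_newstat() call.

```c
#include <errno.h>
#include <sys/stat.h>

/* Userspace analogue of the debugging probe:
 * returns 0 if the path can be stat'ed, otherwise the negated errno. */
static int stat_result(const char *path)
{
	struct stat buf;

	if (stat(path, &buf) == 0)
		return 0;
	return -errno;
}
```

On a normally booted system stat_result("/dev") would be expected to return 0, while a path that does not exist yields -ENOENT, the same -2 seen in the failing output.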

-- 
Sabrina

