from:"Petr Tesarik"

Re: drivers/char/random.c needs a (new) maintainer

2020-12-23 Thread Petr Tesarik

On Wed, 23 Dec 2020 15:32:55 +0100
"Jason A. Donenfeld"  wrote:

> On Wed, Dec 23, 2020 at 3:17 PM Petr Tesarik  wrote:
> > Upfront, let me admit that SUSE has a vested interest in a FIPS-certifiable 
> > Linux kernel.  
> 
> Sorry, but just because you have a "vested interest", or a financial
> interest, or because you want it does not suddenly make it a good
> idea. The idea is to have good crypto, not to merely check some boxes

I never suggested that this should serve as a supportive argument. I was just 
trying to be honest about our motivations.

I'm a bit sad that this discussion has quickly gone back to the choice of 
algorithms and how they can be implemented. The real issue is that the RNG 
subsystem has not developed as fast as it could. This had not been much of an 
issue as long as nobody was really interested in making any substantial changes 
to that code, but it is more apparent now. Torsten believes it can be partly 
because of a maintainer who is too busy with other tasks, and he suggested we 
try to improve the situation by giving the RNG-related tasks to someone else.

I have not seen a clear answer to this suggestion, except Jason offering his 
helping hand with Nicolai's cleanup patches, but nothing wrt Stephan's patches. 
So, what is the plan?

Petr Tesarik
SUSE HW Enablement Team

pgpw9IwlgRr2W.pgp
Description: Digitální podpis OpenPGP

Re: drivers/char/random.c needs a (new) maintainer

2020-12-23 Thread Petr Tesarik

On Wed, 23 Dec 2020 13:28:51 +0100
Torsten Duwe  wrote:

>[...]
> > collaboration and disengage people. More than simply reviewing patches
> > I would expect a maintainer to give directions and drive the
> > community. Asking Jason to review Nicolai's patches was a step towards
> > that, but I believe we still could benefit from better communication.  
> 
> Even regarding this I'm not so sure it was a good idea. Jason seems to
> narrow the proposed changes down to "FIPS certification", when it
> actually is a lot more. I think his motivation suffers because of his
> personal dislike.

Upfront, let me admit that SUSE has a vested interest in a FIPS-certifiable 
Linux kernel.

However, it seems to me that nobody can be happy about keeping the current 
status quo forever. Even in the hypothetical case that the RNG maintainer 
rejected the whole idea merely because it makes it possible to achieve NIST 
compliance, and he detests standards compliance, it would still be better than 
no decision at all. The silence is paralyzing, as it blocks any changes in 
upstream, while also making it difficult to maintain an out-of-tree 
implementation that aims at becoming upstream eventually.

The only option ATM is a fork (similar to what the Xen folks did with XenLinux 
many years ago). IOW the current situation demotivates contributors from being 
good citizens. I hope we can find a better solution together.

Petr Tesarik
SUSE HW Enablement Team

pgpFeg_6a6sxB.pgp
Description: Digitální podpis OpenPGP

Re: [PATCH 01/11] kexec_file: allow archs to handle special regions while locating memory hole

2020-06-29 Thread Petr Tesarik

Hi Hari,

is there any good reason to add two more functions with a very similar
name to an existing function? AFAICS all you need is a way to call a
PPC64-specific function from within kexec_add_buffer (PATCH 4/11), so
you could add something like this:

int __weak arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
{
return 0;
}

Call this function from kexec_add_buffer where appropriate and then
override it for PPC64 (it roughly corresponds to your
kexec_locate_mem_hole_ppc64() from PATCH 4/11).

FWIW it would make it easier for me to follow the resulting code.

Petr T

On Sat, 27 Jun 2020 00:34:43 +0530
Hari Bathini  wrote:

> Some archs can have special memory regions, within the given memory
> range, which can't be used for the buffer in a kexec segment. As
> kexec_add_buffer() function is being called from generic code as well,
> add weak arch_kexec_add_buffer definition for archs to override & take
> care of special regions before trying to locate a memory hole.
> 
> Signed-off-by: Hari Bathini 
> ---
>  include/linux/kexec.h |5 +
>  kernel/kexec_file.c   |   37 +
>  2 files changed, 38 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 1776eb2..1237682 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -195,6 +195,11 @@ int __weak arch_kexec_apply_relocations(struct 
> purgatory_info *pi,
>   const Elf_Shdr *relsec,
>   const Elf_Shdr *symtab);
>  
> +extern int arch_kexec_add_buffer(struct kexec_buf *kbuf);
> +
> +/* arch_kexec_add_buffer calls this when it is ready */
> +extern int __kexec_add_buffer(struct kexec_buf *kbuf);
> +
>  extern int kexec_add_buffer(struct kexec_buf *kbuf);
>  int kexec_locate_mem_hole(struct kexec_buf *kbuf);
>  
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index bb05fd5..a0b4f7f 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -669,10 +669,6 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf)
>   */
>  int kexec_add_buffer(struct kexec_buf *kbuf)
>  {
> -
> - struct kexec_segment *ksegment;
> - int ret;
> -
>   /* Currently adding segment this way is allowed only in file mode */
>   if (!kbuf->image->file_mode)
>   return -EINVAL;
> @@ -696,6 +692,25 @@ int kexec_add_buffer(struct kexec_buf *kbuf)
>   kbuf->memsz = ALIGN(kbuf->memsz, PAGE_SIZE);
>   kbuf->buf_align = max(kbuf->buf_align, PAGE_SIZE);
>  
> + return arch_kexec_add_buffer(kbuf);
> +}
> +
> +/**
> + * __kexec_add_buffer - arch_kexec_add_buffer would call this function after
> + *  updating kbuf, to place a buffer in a kexec segment.
> + * @kbuf:   Buffer contents and memory parameters.
> + *
> + * This function assumes that kexec_mutex is held.
> + * On successful return, @kbuf->mem will have the physical address of
> + * the buffer in memory.
> + *
> + * Return: 0 on success, negative errno on error.
> + */
> +int __kexec_add_buffer(struct kexec_buf *kbuf)
> +{
> + struct kexec_segment *ksegment;
> + int ret;
> +
>   /* Walk the RAM ranges and allocate a suitable range for the buffer */
>   ret = kexec_locate_mem_hole(kbuf);
>   if (ret)
> @@ -711,6 +726,20 @@ int kexec_add_buffer(struct kexec_buf *kbuf)
>   return 0;
>  }
>  
> +/**
> + * arch_kexec_add_buffer - Some archs have memory regions within the given
> + * range that can't be used to place a kexec segment.
> + * Such archs can override this function to take care
> + * of them before trying to locate the memory hole.
> + * @kbuf:  Buffer contents and memory parameters.
> + *
> + * Return: 0 on success, negative errno on error.
> + */
> +int __weak arch_kexec_add_buffer(struct kexec_buf *kbuf)
> +{
> + return __kexec_add_buffer(kbuf);
> +}
> +
>  /* Calculate and store the digest of segments */
>  static int kexec_calculate_store_digests(struct kimage *image)
>  {
> 



pgpS2PzQ0a79y.pgp
Description: Digitální podpis OpenPGP

Re: [PATCH 1/1] s390/pci: Log new handle in clp_disable_fh()

2020-05-28 Thread Petr Tesarik

Hi all,

just a gentle ping.

If the current behaviour (logging the original handle) was intended,
then it was worth mentioning in the commit message for 17cdec960cf77,
which made the change, but since that's no longer an option, I'd be
happy with an explanation in email.

Petr T

On Fri, 22 May 2020 20:39:22 +0200
Petr Tesarik  wrote:

> After disabling a function, the original handle is logged instead of
> the disabled handle.
> 
> Fixes: 17cdec960cf77 (s390/pci: Recover handle in clp_set_pci_fn())
> Signed-off-by: Petr Tesarik 
> ---
>  arch/s390/pci/pci_clp.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/arch/s390/pci/pci_clp.c b/arch/s390/pci/pci_clp.c
> index ea794ae755ae..179bcecefdee 100644
> --- a/arch/s390/pci/pci_clp.c
> +++ b/arch/s390/pci/pci_clp.c
> @@ -309,14 +309,13 @@ int clp_enable_fh(struct zpci_dev *zdev, u8 nr_dma_as)
>  
>  int clp_disable_fh(struct zpci_dev *zdev)
>  {
> - u32 fh = zdev->fh;
>   int rc;
>  
>   if (!zdev_enabled(zdev))
>   return 0;
>  
>   rc = clp_set_pci_fn(zdev, 0, CLP_SET_DISABLE_PCI_FN);
> - zpci_dbg(3, "dis fid:%x, fh:%x, rc:%d\n", zdev->fid, fh, rc);
> + zpci_dbg(3, "dis fid:%x, fh:%x, rc:%d\n", zdev->fid, zdev->fh, rc);
>   return rc;
>  }
>  



pgp_NrWiSJwQm.pgp
Description: Digitální podpis OpenPGP

[PATCH 1/1] s390/pci: Log new handle in clp_disable_fh()

2020-05-22 Thread Petr Tesarik

After disabling a function, the original handle is logged instead of
the disabled handle.

Fixes: 17cdec960cf77 (s390/pci: Recover handle in clp_set_pci_fn())
Signed-off-by: Petr Tesarik 
---
 arch/s390/pci/pci_clp.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/s390/pci/pci_clp.c b/arch/s390/pci/pci_clp.c
index ea794ae755ae..179bcecefdee 100644
--- a/arch/s390/pci/pci_clp.c
+++ b/arch/s390/pci/pci_clp.c
@@ -309,14 +309,13 @@ int clp_enable_fh(struct zpci_dev *zdev, u8 nr_dma_as)
 
 int clp_disable_fh(struct zpci_dev *zdev)
 {
-   u32 fh = zdev->fh;
int rc;
 
if (!zdev_enabled(zdev))
return 0;
 
rc = clp_set_pci_fn(zdev, 0, CLP_SET_DISABLE_PCI_FN);
-   zpci_dbg(3, "dis fid:%x, fh:%x, rc:%d\n", zdev->fid, fh, rc);
+   zpci_dbg(3, "dis fid:%x, fh:%x, rc:%d\n", zdev->fid, zdev->fh, rc);
return rc;
 }
 
-- 
2.26.1

Re: [PATCH] s390: enable detection of kernel version from bzImage

2019-07-16 Thread Petr Tesarik

On Tue, 16 Jul 2019 15:11:38 +0200
Vasily Gorbik  wrote:

> On Tue, Jul 16, 2019 at 10:30:14AM +0000, Petr Tesarik wrote:
> > On Tue, 16 Jul 2019 00:12:19 +0200
> > Vasily Gorbik  wrote:
> >   
> > > Extend "parmarea" to include an offset of the version string, which is
> > > stored as 8-byte big endian value.
> > > 
> > > To retrieve version string from bzImage reliably, one should check the
> > > presence of "S390EP" ascii string at 0x10008 (available since v3.2),
> > > then read the version string offset from 0x10428 (which has been 0
> > > since v3.2 up to now). The string is null terminated.
> > > 
> > > Could be retrieved with the following "file" command magic (requires
> > > file v5.34):
> > > 8 string 
> > > \x02\x00\x00\x18\x60\x00\x00\x50\x02\x00\x00\x68\x60\x00\x00\x50\x40\x40\x40\x40\x40\x40\x40\x40
> > >  Linux S390  
> > > >0x10008   string  S390EP
> > > >>0x10428  bequad  >0
> > > >>>(0x10428.Q) string  >\0 \b, version %s    
> > > 
> > > Signed-off-by: Vasily Gorbik   
> > 
> > This looks great! Much cleaner than the original approach.
> > 
> > Thank you,
> > Petr T  
> 
> Then I'll add
> Reported-by: Petr Tesarik 
> Suggested-by: Petr Tesarik 
> if you don't mind and try to queue that for 5.3.

Oh, sure, please add these lines and go ahead.

Thank you again,
Petr T


pgp9O8CIofvAm.pgp
Description: Digitální podpis OpenPGP

Re: [PATCH] s390: enable detection of kernel version from bzImage

2019-07-16 Thread Petr Tesarik

On Tue, 16 Jul 2019 00:12:19 +0200
Vasily Gorbik  wrote:

> Extend "parmarea" to include an offset of the version string, which is
> stored as 8-byte big endian value.
> 
> To retrieve version string from bzImage reliably, one should check the
> presence of "S390EP" ascii string at 0x10008 (available since v3.2),
> then read the version string offset from 0x10428 (which has been 0
> since v3.2 up to now). The string is null terminated.
> 
> Could be retrieved with the following "file" command magic (requires
> file v5.34):
> 8 string 
> \x02\x00\x00\x18\x60\x00\x00\x50\x02\x00\x00\x68\x60\x00\x00\x50\x40\x40\x40\x40\x40\x40\x40\x40
>  Linux S390
> >0x10008   string  S390EP  
> >>0x10428  bequad  >0  
> >>>(0x10428.Q) string  >\0 \b, version %s  
> 
> Signed-off-by: Vasily Gorbik 

This looks great! Much cleaner than the original approach.

Thank you,
Petr T

> ---
>  arch/s390/boot/Makefile   | 2 +-
>  arch/s390/boot/head.S | 1 +
>  arch/s390/boot/version.c  | 6 ++
>  arch/s390/include/asm/setup.h | 4 +++-
>  4 files changed, 11 insertions(+), 2 deletions(-)
>  create mode 100644 arch/s390/boot/version.c
> 
> diff --git a/arch/s390/boot/Makefile b/arch/s390/boot/Makefile
> index 7cba96e7587b..4cf0bddb7d92 100644
> --- a/arch/s390/boot/Makefile
> +++ b/arch/s390/boot/Makefile
> @@ -36,7 +36,7 @@ CFLAGS_sclp_early_core.o += -I$(srctree)/drivers/s390/char
>  
>  obj-y:= head.o als.o startup.o mem_detect.o ipl_parm.o ipl_report.o
>  obj-y+= string.o ebcdic.o sclp_early_core.o mem.o ipl_vmparm.o 
> cmdline.o
> -obj-y+= ctype.o text_dma.o
> +obj-y+= version.o ctype.o text_dma.o
>  obj-$(CONFIG_PROTECTED_VIRTUALIZATION_GUEST) += uv.o
>  obj-$(CONFIG_RELOCATABLE)+= machine_kexec_reloc.o
>  obj-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
> diff --git a/arch/s390/boot/head.S b/arch/s390/boot/head.S
> index 028aab03a9e7..2087bed6e60f 100644
> --- a/arch/s390/boot/head.S
> +++ b/arch/s390/boot/head.S
> @@ -361,6 +361,7 @@ ENTRY(startup_kdump)
>   .quad   0   # INITRD_SIZE
>   .quad   0   # OLDMEM_BASE
>   .quad   0   # OLDMEM_SIZE
> + .quad   kernel_version  # points to kernel version string
>  
>   .orgCOMMAND_LINE
>   .byte   "root=/dev/ram0 ro"
> diff --git a/arch/s390/boot/version.c b/arch/s390/boot/version.c
> new file mode 100644
> index ..ea5e49651931
> --- /dev/null
> +++ b/arch/s390/boot/version.c
> @@ -0,0 +1,6 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include 
> +#include 
> +
> +const char kernel_version[] = UTS_RELEASE
> + " (" LINUX_COMPILE_BY "@" LINUX_COMPILE_HOST ") " UTS_VERSION;
> diff --git a/arch/s390/include/asm/setup.h b/arch/s390/include/asm/setup.h
> index 925889d360c1..e5d28a475f76 100644
> --- a/arch/s390/include/asm/setup.h
> +++ b/arch/s390/include/asm/setup.h
> @@ -54,6 +54,7 @@
>  #define INITRD_SIZE_OFFSET   0x10410
>  #define OLDMEM_BASE_OFFSET   0x10418
>  #define OLDMEM_SIZE_OFFSET   0x10420
> +#define KERNEL_VERSION_OFFSET0x10428
>  #define COMMAND_LINE_OFFSET  0x10480
>  
>  #ifndef __ASSEMBLY__
> @@ -74,7 +75,8 @@ struct parmarea {
>   unsigned long initrd_size;  /* 0x10410 */
>   unsigned long oldmem_base;  /* 0x10418 */
>   unsigned long oldmem_size;  /* 0x10420 */
> - char pad1[0x10480 - 0x10428];   /* 0x10428 - 0x10480 */
> + unsigned long kernel_version;   /* 0x10428 */
> + char pad1[0x10480 - 0x10430];   /* 0x10430 - 0x10480 */
>   char command_line[ARCH_COMMAND_LINE_SIZE];  /* 0x10480 */
>  };
>

Re: [PATCH 2/2] s390: add Linux banner to the compressed image

2019-07-14 Thread Petr Tesarik

On Sun, 14 Jul 2019 16:35:33 +0200
Vasily Gorbik  wrote:

> On Fri, Jul 12, 2019 at 07:21:01PM +0200, Petr Tesarik wrote:
> > Various tools determine the kernel version from a given binary by
> > scanning for the Linux banner string. This does not work if the
> > banner string is compressed, but we can link it once more into the
> > uncompressed portion of bzImage.
> > 
> > Signed-off-by: Petr Tesarik 
> > ---
> >  arch/s390/boot/compressed/Makefile | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/s390/boot/compressed/Makefile 
> > b/arch/s390/boot/compressed/Makefile
> > index fa529c5b4486..9bc4685477c5 100644
> > --- a/arch/s390/boot/compressed/Makefile
> > +++ b/arch/s390/boot/compressed/Makefile
> > @@ -11,6 +11,7 @@ UBSAN_SANITIZE := n
> >  KASAN_SANITIZE := n
> >  
> >  obj-y  := $(if $(CONFIG_KERNEL_UNCOMPRESSED),,decompressor.o) piggy.o 
> > info.o
> > +obj-y   += ../../../../init/banner.o  
> 
> We don't reuse objects from another build stage, we rebuild them with
> distinct decompressor's build flags.
> $ git grep "ctype.[oc]" -- arch/s390/boot
> arch/s390/boot/Makefile:obj-y   += ctype.o text_dma.o
> arch/s390/boot/ctype.c:#include "../../../lib/ctype.c"

Those flags do not make a difference for a simple object file like the
Linux banner, but I get your point, and I cannot see an issues with
rebuilding the banner for the decompressor.

> Besides that, there is a special CONFIG_KERNEL_UNCOMPRESSED mode, with
> which "strings vmlinuz | grep 'Linux version'" I assume you are using
> would still yield result. Adding the second version of banner would
> produce duplicated result in this case.

Sure, and AFAICT that's not a problem. But the point here is that the
production kernel should be compressed for all those well-known
reasons, but such image is then not recognized.

> But even before discussing solutions I would like to understand the
> problem first. Which specific tools are you referring to? What are they
> good for? And how do they get the kernel version from other architectures
> compressed images?

The tool I'm aware of is called get_kernel_version. It's built as part
of openSUSE aaa_base and is used at install time. I'm not quite sure
how it is used, but I have added Raymund Will to Cc; he can provide
more information. There's also an open bug for it:

  https://bugzilla.opensuse.org/show_bug.cgi?id=1139939

As for other architectures, I only know about x86. The x86 compressed
binary contains a header with various pointers, among them a pointer to
the kernel version string, so it must be added to the decompressor (cf.
arch/x86/boot/version.c).

If you prefer to have a similar header for other architectures, then
that would be fine with me, but it's a bit more involved, because it
would set up a new ABI...

HTH,
Petr T

[PATCH 2/2] s390: add Linux banner to the compressed image

2019-07-12 Thread Petr Tesarik

Various tools determine the kernel version from a given binary by
scanning for the Linux banner string. This does not work if the
banner string is compressed, but we can link it once more into the
uncompressed portion of bzImage.

Signed-off-by: Petr Tesarik 
---
 arch/s390/boot/compressed/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/s390/boot/compressed/Makefile 
b/arch/s390/boot/compressed/Makefile
index fa529c5b4486..9bc4685477c5 100644
--- a/arch/s390/boot/compressed/Makefile
+++ b/arch/s390/boot/compressed/Makefile
@@ -11,6 +11,7 @@ UBSAN_SANITIZE := n
 KASAN_SANITIZE := n
 
 obj-y  := $(if $(CONFIG_KERNEL_UNCOMPRESSED),,decompressor.o) piggy.o info.o
+obj-y   += ../../../../init/banner.o
 targets:= vmlinux.lds vmlinux vmlinux.bin vmlinux.bin.gz 
vmlinux.bin.bz2
 targets += vmlinux.bin.xz vmlinux.bin.lzma vmlinux.bin.lzo vmlinux.bin.lz4
 targets += info.bin $(obj-y)
-- 
2.16.4

[PATCH 1/2] init: Separate banner from init_uts_ns

2019-07-12 Thread Petr Tesarik

The Linux banner is self-contained, so it could be theoretically
added to separately linked binaries (e.g. the kernel decompressor).
Unfortunately, it lives in the same object file as init_uts_ns,
which needs additional symbols from other object files.

Let's divorce the two.

Signed-off-by: Petr Tesarik 
---
 init/Makefile  |  2 +-
 init/banner.c  | 21 +
 init/version.c | 10 --
 3 files changed, 22 insertions(+), 11 deletions(-)
 create mode 100644 init/banner.c

diff --git a/init/Makefile b/init/Makefile
index a3e5ce2bcf08..5009ea808ce4 100644
--- a/init/Makefile
+++ b/init/Makefile
@@ -5,7 +5,7 @@
 
 ccflags-y := -fno-function-sections -fno-data-sections
 
-obj-y  := main.o version.o mounts.o
+obj-y  := main.o version.o banner.o mounts.o
 ifneq ($(CONFIG_BLK_DEV_INITRD),y)
 obj-y  += noinitramfs.o
 else
diff --git a/init/banner.c b/init/banner.c
new file mode 100644
index ..653ee081d474
--- /dev/null
+++ b/init/banner.c
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  linux/init/banner.c
+ *
+ *  Copyright (C) 1992  Theodore Ts'o
+ *
+ *  May be freely distributed as part of Linux.
+ */
+
+#include 
+#include 
+
+/* FIXED STRINGS! Don't touch! */
+const char linux_banner[] =
+   "Linux version " UTS_RELEASE " (" LINUX_COMPILE_BY "@"
+   LINUX_COMPILE_HOST ") (" LINUX_COMPILER ") " UTS_VERSION "\n";
+
+const char linux_proc_banner[] =
+   "%s version %s"
+   " (" LINUX_COMPILE_BY "@" LINUX_COMPILE_HOST ")"
+   " (" LINUX_COMPILER ") %s\n";
diff --git a/init/version.c b/init/version.c
index cba341161b58..989cf9f9fe2c 100644
--- a/init/version.c
+++ b/init/version.c
@@ -42,14 +42,4 @@ struct uts_namespace init_uts_ns = {
 };
 EXPORT_SYMBOL_GPL(init_uts_ns);
 
-/* FIXED STRINGS! Don't touch! */
-const char linux_banner[] =
-   "Linux version " UTS_RELEASE " (" LINUX_COMPILE_BY "@"
-   LINUX_COMPILE_HOST ") (" LINUX_COMPILER ") " UTS_VERSION "\n";
-
-const char linux_proc_banner[] =
-   "%s version %s"
-   " (" LINUX_COMPILE_BY "@" LINUX_COMPILE_HOST ")"
-   " (" LINUX_COMPILER ") %s\n";
-
 BUILD_SALT;
-- 
2.16.4

[PATCH 0/2] Add uncompressed Linux banner to s390 bzImage

2019-07-12 Thread Petr Tesarik

These patches make it easy to determine the kernel version from a
compressed binary by scanning for the Linux banner string in the
uncompressed portion of bzImage.

Petr Tesarik (2):
  init: Separate banner from init_uts_ns
  s390: add Linux banner to the compressed image

 arch/s390/boot/compressed/Makefile |  1 +
 init/Makefile  |  2 +-
 init/banner.c  | 21 +
 init/version.c | 10 --
 4 files changed, 23 insertions(+), 11 deletions(-)
 create mode 100644 init/banner.c

-- 
2.16.4

Re: [RFC v2 3/5] clk: bcm2835: use firmware interface to update pllb

2019-05-21 Thread Petr Tesarik

On Tue, 21 May 2019 13:39:31 +0200
Nicolas Saenz Julienne  wrote:

> Hi Oliver, thanks for the review.
> 
> On Mon, 2019-05-20 at 14:43 +0200, Oliver Neukum wrote:
> > On Mo, 2019-05-20 at 12:47 +0200, Nicolas Saenz Julienne wrote:  
> > > + * For more information on the firmware interface check:
> > > + * 
> > > https://github.com/raspberrypi/firmware/wiki/Mailbox-property-interface
> > > + */
> > > +struct bcm2835_firmware_prop {
> > > +   u32 id;
> > > +   u32 val;
> > > +   u32 disable_turbo;
> > > +} __packed;  
> > 
> > Hi,
> > 
> > technically we are not in arch and those fields have a defined
> > endianness.
> >   
> 
> Well I set it as packed since it's 'sent' through a memory mapped firmware
> interface. Hence the need for the structure format to be fixed. So I guessed
> we're safer with it, as I'm not 100% sure what the different compilers are
> going to do with it (although it's very likely it'll stay the same). BTW this
> will be built both for arm & arm64.

I believe that's not the point Oliver was trying to make. You should
use __le32 instead of u32.

That's because u32 means "host byte order" and this code is not located
under arch/, so host endianness is unknown, but the mailbox interface
requires little-endian.

It's nit-picking, and that's why Oliver writes 'technically'; there is
probably no way this firmware interface could be used on a big-endian
CPU...

Petr T


pgp591b1KgKR4.pgp
Description: Digitální podpis OpenPGP

Re: [PATCH 1/1] s390: vfio-ap: include for test_facility()

2018-11-16 Thread Petr Tesarik

On Fri, 16 Nov 2018 13:20:33 +0100
Halil Pasic  wrote:

> On Fri, 16 Nov 2018 11:47:48 +0100
> Petr Tesarik  wrote:
> 
> > The driver uses test_facility(), but does not include the
> > corresponding include file explicitly. The driver currently builds
> > only thanks to the following include chain:
> > 
> >   vfio_ap_drv.c
> > 
> >   
> > 
> >   
> > 
> >   
> > 
> > Files should not rely on such fragile implicit includes.
> > 
> > Signed-off-by: Petr Tesarik 
> > ---
> >  drivers/s390/crypto/vfio_ap_drv.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/drivers/s390/crypto/vfio_ap_drv.c
> > b/drivers/s390/crypto/vfio_ap_drv.c index 7667b38728f0..31c6c847eaca
> > 100644 --- a/drivers/s390/crypto/vfio_ap_drv.c
> > +++ b/drivers/s390/crypto/vfio_ap_drv.c
> > @@ -11,6 +11,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include "vfio_ap_private.h"
> >  
> >  #define VFIO_AP_ROOT_NAME "vfio_ap"  
> 
> Applied. Is going to go via Martins S390 tree.

Thank you for the quick turnaround.

Petr Tesarik
SUSE HW Enablement Team

Re: [PATCH 1/1] s390: vfio-ap: include for test_facility()

2018-11-16 Thread Petr Tesarik

On Fri, 16 Nov 2018 13:20:33 +0100
Halil Pasic  wrote:

> On Fri, 16 Nov 2018 11:47:48 +0100
> Petr Tesarik  wrote:
> 
> > The driver uses test_facility(), but does not include the
> > corresponding include file explicitly. The driver currently builds
> > only thanks to the following include chain:
> > 
> >   vfio_ap_drv.c
> > 
> >   
> > 
> >   
> > 
> >   
> > 
> > Files should not rely on such fragile implicit includes.
> > 
> > Signed-off-by: Petr Tesarik 
> > ---
> >  drivers/s390/crypto/vfio_ap_drv.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/drivers/s390/crypto/vfio_ap_drv.c
> > b/drivers/s390/crypto/vfio_ap_drv.c index 7667b38728f0..31c6c847eaca
> > 100644 --- a/drivers/s390/crypto/vfio_ap_drv.c
> > +++ b/drivers/s390/crypto/vfio_ap_drv.c
> > @@ -11,6 +11,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include "vfio_ap_private.h"
> >  
> >  #define VFIO_AP_ROOT_NAME "vfio_ap"  
> 
> Applied. Is going to go via Martins S390 tree.

Thank you for the quick turnaround.

Petr Tesarik
SUSE HW Enablement Team

[PATCH 1/1] s390: vfio-ap: include for test_facility()

2018-11-16 Thread Petr Tesarik

The driver uses test_facility(), but does not include the
corresponding include file explicitly. The driver currently builds
only thanks to the following include chain:

  vfio_ap_drv.c

  

  

  

Files should not rely on such fragile implicit includes.

Signed-off-by: Petr Tesarik 
---
 drivers/s390/crypto/vfio_ap_drv.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/s390/crypto/vfio_ap_drv.c 
b/drivers/s390/crypto/vfio_ap_drv.c
index 7667b38728f0..31c6c847eaca 100644
--- a/drivers/s390/crypto/vfio_ap_drv.c
+++ b/drivers/s390/crypto/vfio_ap_drv.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "vfio_ap_private.h"
 
 #define VFIO_AP_ROOT_NAME "vfio_ap"
-- 
2.16.4

[PATCH 1/1] s390: vfio-ap: include for test_facility()

2018-11-16 Thread Petr Tesarik

The driver uses test_facility(), but does not include the
corresponding include file explicitly. The driver currently builds
only thanks to the following include chain:

  vfio_ap_drv.c

  

  

  

Files should not rely on such fragile implicit includes.

Signed-off-by: Petr Tesarik 
---
 drivers/s390/crypto/vfio_ap_drv.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/s390/crypto/vfio_ap_drv.c 
b/drivers/s390/crypto/vfio_ap_drv.c
index 7667b38728f0..31c6c847eaca 100644
--- a/drivers/s390/crypto/vfio_ap_drv.c
+++ b/drivers/s390/crypto/vfio_ap_drv.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "vfio_ap_private.h"
 
 #define VFIO_AP_ROOT_NAME "vfio_ap"
-- 
2.16.4

Re: [PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-28 Thread Petr Tesarik

On Fri, 25 May 2018 15:00:13 -0500
ebied...@xmission.com (Eric W. Biederman) wrote:

>[...]
> The ultimate point is that the absolute best we can do is to run a
> kernel in memory that we never use for anything else and then we have a
> fighting chance of getting the system working and getting a report of
> the failure out to somewhere.
>
> > Anyway, of course we would still have to keep the current method,
> > because user pages are not always filtered. For example, a major SUSE
> > account runs a database in user space and also inspects its data
> > structures in case of a system crash.  
> 
> And I understand the memory pressures that will encourage people to use
> user pages for extra memory to run the dump capture kernel in.  Short of
> the presence of an IOMMU that all DMA transfers must go through I don't
> see how those user pages could reliably be used.

Absolutely. I fully understand that a system which reuses first
kernel's memory in some way must be less reliable than the current
state. However, some people are willing to trade less reliability for
reduced resource consumption.

Note that we're not talking about reserving a few gigs on a single
machine with some terabytes of memory (i.e. less than 1% of total RAM),
rather a few hundred megs of each 4-gig VM on an s390x machine (i.e.
about 10% of total RAM).

Petr T

Re: [PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-28 Thread Petr Tesarik

On Fri, 25 May 2018 15:00:13 -0500
ebied...@xmission.com (Eric W. Biederman) wrote:

>[...]
> The ultimate point is that the absolute best we can do is to run a
> kernel in memory that we never use for anything else and then we have a
> fighting chance of getting the system working and getting a report of
> the failure out to somewhere.
>
> > Anyway, of course we would still have to keep the current method,
> > because user pages are not always filtered. For example, a major SUSE
> > account runs a database in user space and also inspects its data
> > structures in case of a system crash.  
> 
> And I understand the memory pressures that will encourage people to use
> user pages for extra memory to run the dump capture kernel in.  Short of
> the presence of an IOMMU that all DMA transfers must go through I don't
> see how those user pages could reliably be used.

Absolutely. I fully understand that a system which reuses first
kernel's memory in some way must be less reliable than the current
state. However, some people are willing to trade less reliability for
reduced resource consumption.

Note that we're not talking about reserving a few gigs on a single
machine with some terabytes of memory (i.e. less than 1% of total RAM),
rather a few hundred megs of each 4-gig VM on an s390x machine (i.e.
about 10% of total RAM).

Petr T

Re: [PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-24 Thread Petr Tesarik

V Thu, 24 May 2018 11:34:05 -0500
ebied...@xmission.com (Eric W. Biederman) napsáno:

> Petr Tesarik <ptesa...@suse.cz> writes:
> 
> 2> On Thu, 24 May 2018 09:49:05 +0800
> > Dave Young <dyo...@redhat.com> wrote:
> >  
> >> Hi Petr,
> >> 
> >> On 05/23/18 at 10:22pm, Petr Tesarik wrote:
> >>[...]  
> >> > In short, if one size fits none, what good is it to hardcode that "one
> >> > size" into the kernel image?
> >> 
> >> I agreed with all the things that we can not know the exact memory
> >> requirement for 100% use cases.  But that does not means this is useless
> >> it is still useful for common use cases of no special and memory hog
> >> requirements as I mentioned in another reply it can simplify the kdump
> >> deployment for those people who do not need the special setup.  
> >
> > I still tend to disagree. This "common-case" reservation depends on
> > things that are defined by user space. It surely does not make it
> > easier to build a distribution kernel. Today, I get bug reports that
> > the number calculated and added to the boot loader configuration by the
> > installer is inaccurate. If I put a fixed number into a kernel config
> > option, I will start getting bugs that this number is incorrect (for
> > some systems).
> >  
> >> For example, if this is a workstation I just want to break into a shell
> >> to collect some panic info, then I just need a very minimal initrd, then
> >> the Kconfig will work just fine.  
> >
> > What is "a very minimal initrd"? Last time I had to make a significant
> > adjustment to the estimation for openSUSE, this was caused by growing
> > user-space requirements (systemd in this case, but I don't want to
> > start flamewars on that topic, please).
> >
> > Anyway, if you want to improve the "common case", then look how IBM
> > tries to solve it for firmware-assisted dump (fadump) on powerpc:
> >
> > https://patchwork.ozlabs.org/patch/905026/
> >
> > The main idea is:
> >  
> >> Instead of setting aside a significant chunk of memory nobody can use,
> >> [...] reserve a significant chunk of memory that the kernel is prevented
> >> from using [...], but applications are free to use it.  
> >
> > That works great, because user space pages are filtered out in the
> > common case, so they can be used freely by the panic kernel.  
> 
> They absolutely can not be used in the kdump case.
> 
> The kdump requirement is that they are pages no-one initiates any I/O
> to.  To avoid the problem of devices doing DMA as the new kernel starts
> and runs.

Good point. This means that memory reserved for this purpose would also
have to be excluded from allocations that may be eventually used for
DMA transfers.

>  Secondarily to avoid problems with cpus that refused to halt.

Let's face it - if some CPUs refused to halt, all bets are off. The
code running on such a CPU can break many other things besides memory,
most importantly, it may meddle with the HW registers of crucial
devices in the system. To be less abstract, I have seen a failure to
stop a CPU in the crashed kernel a few times, and the panic kernel
could never successfully save anything; it always crashed at boot or a
little bit later.

Anyway, of course we would still have to keep the current method,
because user pages are not always filtered. For example, a major SUSE
account runs a database in user space and also inspects its data
structures in case of a system crash.

Petr T

Re: [PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-24 Thread Petr Tesarik

V Thu, 24 May 2018 11:34:05 -0500
ebied...@xmission.com (Eric W. Biederman) napsáno:

> Petr Tesarik  writes:
> 
> 2> On Thu, 24 May 2018 09:49:05 +0800
> > Dave Young  wrote:
> >  
> >> Hi Petr,
> >> 
> >> On 05/23/18 at 10:22pm, Petr Tesarik wrote:
> >>[...]  
> >> > In short, if one size fits none, what good is it to hardcode that "one
> >> > size" into the kernel image?
> >> 
> >> I agreed with all the things that we can not know the exact memory
> >> requirement for 100% use cases.  But that does not means this is useless
> >> it is still useful for common use cases of no special and memory hog
> >> requirements as I mentioned in another reply it can simplify the kdump
> >> deployment for those people who do not need the special setup.  
> >
> > I still tend to disagree. This "common-case" reservation depends on
> > things that are defined by user space. It surely does not make it
> > easier to build a distribution kernel. Today, I get bug reports that
> > the number calculated and added to the boot loader configuration by the
> > installer is inaccurate. If I put a fixed number into a kernel config
> > option, I will start getting bugs that this number is incorrect (for
> > some systems).
> >  
> >> For example, if this is a workstation I just want to break into a shell
> >> to collect some panic info, then I just need a very minimal initrd, then
> >> the Kconfig will work just fine.  
> >
> > What is "a very minimal initrd"? Last time I had to make a significant
> > adjustment to the estimation for openSUSE, this was caused by growing
> > user-space requirements (systemd in this case, but I don't want to
> > start flamewars on that topic, please).
> >
> > Anyway, if you want to improve the "common case", then look how IBM
> > tries to solve it for firmware-assisted dump (fadump) on powerpc:
> >
> > https://patchwork.ozlabs.org/patch/905026/
> >
> > The main idea is:
> >  
> >> Instead of setting aside a significant chunk of memory nobody can use,
> >> [...] reserve a significant chunk of memory that the kernel is prevented
> >> from using [...], but applications are free to use it.  
> >
> > That works great, because user space pages are filtered out in the
> > common case, so they can be used freely by the panic kernel.  
> 
> They absolutely can not be used in the kdump case.
> 
> The kdump requirement is that they are pages no-one initiates any I/O
> to.  To avoid the problem of devices doing DMA as the new kernel starts
> and runs.

Good point. This means that memory reserved for this purpose would also
have to be excluded from allocations that may be eventually used for
DMA transfers.

>  Secondarily to avoid problems with cpus that refused to halt.

Let's face it - if some CPUs refused to halt, all bets are off. The
code running on such a CPU can break many other things besides memory,
most importantly, it may meddle with the HW registers of crucial
devices in the system. To be less abstract, I have seen a failure to
stop a CPU in the crashed kernel a few times, and the panic kernel
could never successfully save anything; it always crashed at boot or a
little bit later.

Anyway, of course we would still have to keep the current method,
because user pages are not always filtered. For example, a major SUSE
account runs a database in user space and also inspects its data
structures in case of a system crash.

Petr T

Re: [PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-24 Thread Petr Tesarik

On Thu, 24 May 2018 15:26:27 +0800
Dave Young <dyo...@redhat.com> wrote:

> On 05/24/18 at 08:57am, Petr Tesarik wrote:
>[...]
> > What is "a very minimal initrd"? Last time I had to make a significant
> > adjustment to the estimation for openSUSE, this was caused by growing
> > user-space requirements (systemd in this case, but I don't want to
> > start flamewars on that topic, please).  
> 
> Still I think we have agreement and same feeling about the userspace
> memory requirement.   I think although it is hard, we have been still
> trying to shrink the initramfs memory use.
> 
> Besides of distribution use,  why people can not use some minimal
> initrd?  For example only a basic shell and some necessary tools and
> basic storage eg. raw disks supported, and he/she can just collect the
> panic infomation by himself in a shell.

Again, I'm having trouble with the definition of a "minimal initrd" and
also with the definition of a "workstation". I have already seen a sad
case where kdump started going OOM after connecting a 4K monitor,
because, well, it needed a bigger framebuffer...

OTOH you wrote in another mail that RH has tested some values on a
variety of hardware, so you seem to have a clue. Good for you. I still
believe it is moving policy into the kernel.

Based on past experience, I expect that certain users will argue that
"crashkernel=auto" should work out of the box on their HPE Superdome
with 600+ LUNs attached...

As you wrote elsewhere in the thread:

> I means this patch is not trying to force add a fixed value for crashkernel
> in kernel code. It provides another way one can use on kernel build time
> the value just works.

I don't mind if it is added, although I don't find it very useful.

Petr T

Re: [PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-24 Thread Petr Tesarik

On Thu, 24 May 2018 15:26:27 +0800
Dave Young  wrote:

> On 05/24/18 at 08:57am, Petr Tesarik wrote:
>[...]
> > What is "a very minimal initrd"? Last time I had to make a significant
> > adjustment to the estimation for openSUSE, this was caused by growing
> > user-space requirements (systemd in this case, but I don't want to
> > start flamewars on that topic, please).  
> 
> Still I think we have agreement and same feeling about the userspace
> memory requirement.   I think although it is hard, we have been still
> trying to shrink the initramfs memory use.
> 
> Besides of distribution use,  why people can not use some minimal
> initrd?  For example only a basic shell and some necessary tools and
> basic storage eg. raw disks supported, and he/she can just collect the
> panic infomation by himself in a shell.

Again, I'm having trouble with the definition of a "minimal initrd" and
also with the definition of a "workstation". I have already seen a sad
case where kdump started going OOM after connecting a 4K monitor,
because, well, it needed a bigger framebuffer...

OTOH you wrote in another mail that RH has tested some values on a
variety of hardware, so you seem to have a clue. Good for you. I still
believe it is moving policy into the kernel.

Based on past experience, I expect that certain users will argue that
"crashkernel=auto" should work out of the box on their HPE Superdome
with 600+ LUNs attached...

As you wrote elsewhere in the thread:

> I means this patch is not trying to force add a fixed value for crashkernel
> in kernel code. It provides another way one can use on kernel build time
> the value just works.

I don't mind if it is added, although I don't find it very useful.

Petr T

Re: [PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-24 Thread Petr Tesarik

On Thu, 24 May 2018 09:49:05 +0800
Dave Young <dyo...@redhat.com> wrote:

> Hi Petr,
> 
> On 05/23/18 at 10:22pm, Petr Tesarik wrote:
>[...]
> > In short, if one size fits none, what good is it to hardcode that "one
> > size" into the kernel image?  
> 
> I agreed with all the things that we can not know the exact memory
> requirement for 100% use cases.  But that does not means this is useless
> it is still useful for common use cases of no special and memory hog
> requirements as I mentioned in another reply it can simplify the kdump
> deployment for those people who do not need the special setup.

I still tend to disagree. This "common-case" reservation depends on
things that are defined by user space. It surely does not make it
easier to build a distribution kernel. Today, I get bug reports that
the number calculated and added to the boot loader configuration by the
installer is inaccurate. If I put a fixed number into a kernel config
option, I will start getting bugs that this number is incorrect (for
some systems).

> For example, if this is a workstation I just want to break into a shell
> to collect some panic info, then I just need a very minimal initrd, then
> the Kconfig will work just fine.

What is "a very minimal initrd"? Last time I had to make a significant
adjustment to the estimation for openSUSE, this was caused by growing
user-space requirements (systemd in this case, but I don't want to
start flamewars on that topic, please).

Anyway, if you want to improve the "common case", then look how IBM
tries to solve it for firmware-assisted dump (fadump) on powerpc:

https://patchwork.ozlabs.org/patch/905026/

The main idea is:

> Instead of setting aside a significant chunk of memory nobody can use,
> [...] reserve a significant chunk of memory that the kernel is prevented
> from using [...], but applications are free to use it.

That works great, because user space pages are filtered out in the
common case, so they can be used freely by the panic kernel.

Just my two cents,
Petr T

Re: [PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-24 Thread Petr Tesarik

On Thu, 24 May 2018 09:49:05 +0800
Dave Young  wrote:

> Hi Petr,
> 
> On 05/23/18 at 10:22pm, Petr Tesarik wrote:
>[...]
> > In short, if one size fits none, what good is it to hardcode that "one
> > size" into the kernel image?  
> 
> I agreed with all the things that we can not know the exact memory
> requirement for 100% use cases.  But that does not means this is useless
> it is still useful for common use cases of no special and memory hog
> requirements as I mentioned in another reply it can simplify the kdump
> deployment for those people who do not need the special setup.

I still tend to disagree. This "common-case" reservation depends on
things that are defined by user space. It surely does not make it
easier to build a distribution kernel. Today, I get bug reports that
the number calculated and added to the boot loader configuration by the
installer is inaccurate. If I put a fixed number into a kernel config
option, I will start getting bugs that this number is incorrect (for
some systems).

> For example, if this is a workstation I just want to break into a shell
> to collect some panic info, then I just need a very minimal initrd, then
> the Kconfig will work just fine.

What is "a very minimal initrd"? Last time I had to make a significant
adjustment to the estimation for openSUSE, this was caused by growing
user-space requirements (systemd in this case, but I don't want to
start flamewars on that topic, please).

Anyway, if you want to improve the "common case", then look how IBM
tries to solve it for firmware-assisted dump (fadump) on powerpc:

https://patchwork.ozlabs.org/patch/905026/

The main idea is:

> Instead of setting aside a significant chunk of memory nobody can use,
> [...] reserve a significant chunk of memory that the kernel is prevented
> from using [...], but applications are free to use it.

That works great, because user space pages are filtered out in the
common case, so they can be used freely by the panic kernel.

Just my two cents,
Petr T

Re: [PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-23 Thread Petr Tesarik

On Wed, 23 May 2018 10:53:55 -0500
ebied...@xmission.com (Eric W. Biederman) wrote:

> Dave Young  writes:
> 
> > [snip]
> >  
> >> >  
> >> > +config CRASHKERNEL_DEFAULT_THRESHOLD_MB
> >> > +int "System memory size threshold for kdump memory default 
> >> > reserving"
> >> > +depends on CRASH_CORE
> >> > +default 0
> >> > +help
> >> > +  CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
> >> > +  the system memory size is equal or bigger than the threshold. 
> >> >  
> >> 
> >> "the threshold" is rather vague.  Can it be clarified?
> >> 
> >> In fact I'm really struggling to understand the logic here
> >> 
> >>   
> >> > +config CRASHKERNEL_DEFAULT_MB
> >> > +int "Default crashkernel memory size reserved for kdump"
> >> > +depends on CRASH_CORE
> >> > +default 0
> >> > +help
> >> > +  This is used as the default kdump reserved memory size in MB.
> >> > +  crashkernel=X kernel cmdline can overwrite this value.
> >> > +
> >> >  config HAVE_IMA_KEXEC
> >> >  bool
> >> >  
> >> > @@ -143,6 +144,24 @@ static int __init parse_crashkernel_simp
> >> >  return 0;
> >> >  }
> >> >  
> >> > +static int __init get_crashkernel_default(unsigned long long system_ram,
> >> > +  unsigned long long *size)
> >> > +{
> >> > +unsigned long long sz = CONFIG_CRASHKERNEL_DEFAULT_MB;
> >> > +unsigned long long thres = 
> >> > CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB;
> >> > +
> >> > +thres *= SZ_1M;
> >> > +sz *= SZ_1M;
> >> > +
> >> > +if (sz >= system_ram || system_ram < thres) {
> >> > +pr_debug("crashkernel default size can not be used.\n");
> >> > +return -EINVAL;  
> >> 
> >> In other words,
> >> 
> >>if (system_ram <= CONFIG_CRASHKERNEL_DEFAULT_MB ||
> >>system_ram < CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB)
> >>fail;
> >> 
> >> yes?
> >> 
> >> How come?  What's happening here?  Perhaps a (good) explanatory comment
> >> is needed.  And clearer Kconfig text.
> >> 
> >> All confused :(  
> >
> > Andrew, I tuned it a bit, removed the check of sz >= system_ram, so if
> > the size is too large and kernel can not find enough memory it will
> > still fail in latter code.
> >
> > Is below version looks clearer?  
> 
> What is the advantage of providing this in a kconfig option rather
> than on the kernel command line as we can now?

Yeah, I was about to ask the very same question.

Having spent quite some time on estimating RAM required to save a crash
dump, I can tell you that there is no silver bullet. My main objection
is that core dumps are saved from user space, and the kernel cannot
have a clue what it is going to be.

First, the primary kernel cannot know how much memory will be needed
for the panic kernel (not necessarily same as the primary kernel) and
the panic initrd. If you build a minimal initrd for your system, then
at least it depends on which modules must be included, which in turn
depends on where you want to store the resulting dump. Mounting a local
ext2 partition will require less software than mounting an LVM logical
volume in a PV accessed through iSCSI over two bonded Ethernet NICs.

Second, run-time requirements may vary wildly. While sending the data
over a simple TCP connection (e.g. using FTP) consumes just a few
megabytes even on 10G Ethernet, dm block devices tend to consume much
more, because of the additional buffers allocated by device mapper.

Third, systems should be treated as "big" not so much because of the
amount of RAM, but more so because of the amount of attached devices.
I've seen a machine with devices from /dev/sda to /dev/sdvm; try to
calculate how much kernel memory is taken just by their in-kernel
representation...

Fourth, quite often there is a trade-off between how much memory is
reserved for the panic environment, and how long dumping will take. For
example, you may take advantage of multi-threading in makedumpfile, but
obviously, the additional threads need more memory (or makedumpfile
will have to do its job in more cycles, reducing speed again). Oh, did
I mention that even bringing up more CPUs has an impact on kernel
runtime memory requirements?

In short, if one size fits none, what good is it to hardcode that "one
size" into the kernel image?

Petr T

Re: [PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-23 Thread Petr Tesarik

On Wed, 23 May 2018 10:53:55 -0500
ebied...@xmission.com (Eric W. Biederman) wrote:

> Dave Young  writes:
> 
> > [snip]
> >  
> >> >  
> >> > +config CRASHKERNEL_DEFAULT_THRESHOLD_MB
> >> > +int "System memory size threshold for kdump memory default 
> >> > reserving"
> >> > +depends on CRASH_CORE
> >> > +default 0
> >> > +help
> >> > +  CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
> >> > +  the system memory size is equal or bigger than the threshold. 
> >> >  
> >> 
> >> "the threshold" is rather vague.  Can it be clarified?
> >> 
> >> In fact I'm really struggling to understand the logic here
> >> 
> >>   
> >> > +config CRASHKERNEL_DEFAULT_MB
> >> > +int "Default crashkernel memory size reserved for kdump"
> >> > +depends on CRASH_CORE
> >> > +default 0
> >> > +help
> >> > +  This is used as the default kdump reserved memory size in MB.
> >> > +  crashkernel=X kernel cmdline can overwrite this value.
> >> > +
> >> >  config HAVE_IMA_KEXEC
> >> >  bool
> >> >  
> >> > @@ -143,6 +144,24 @@ static int __init parse_crashkernel_simp
> >> >  return 0;
> >> >  }
> >> >  
> >> > +static int __init get_crashkernel_default(unsigned long long system_ram,
> >> > +  unsigned long long *size)
> >> > +{
> >> > +unsigned long long sz = CONFIG_CRASHKERNEL_DEFAULT_MB;
> >> > +unsigned long long thres = 
> >> > CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB;
> >> > +
> >> > +thres *= SZ_1M;
> >> > +sz *= SZ_1M;
> >> > +
> >> > +if (sz >= system_ram || system_ram < thres) {
> >> > +pr_debug("crashkernel default size can not be used.\n");
> >> > +return -EINVAL;  
> >> 
> >> In other words,
> >> 
> >>if (system_ram <= CONFIG_CRASHKERNEL_DEFAULT_MB ||
> >>system_ram < CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB)
> >>fail;
> >> 
> >> yes?
> >> 
> >> How come?  What's happening here?  Perhaps a (good) explanatory comment
> >> is needed.  And clearer Kconfig text.
> >> 
> >> All confused :(  
> >
> > Andrew, I tuned it a bit, removed the check of sz >= system_ram, so if
> > the size is too large and kernel can not find enough memory it will
> > still fail in latter code.
> >
> > Is below version looks clearer?  
> 
> What is the advantage of providing this in a kconfig option rather
> than on the kernel command line as we can now?

Yeah, I was about to ask the very same question.

Having spent quite some time on estimating RAM required to save a crash
dump, I can tell you that there is no silver bullet. My main objection
is that core dumps are saved from user space, and the kernel cannot
have a clue what it is going to be.

First, the primary kernel cannot know how much memory will be needed
for the panic kernel (not necessarily same as the primary kernel) and
the panic initrd. If you build a minimal initrd for your system, then
at least it depends on which modules must be included, which in turn
depends on where you want to store the resulting dump. Mounting a local
ext2 partition will require less software than mounting an LVM logical
volume in a PV accessed through iSCSI over two bonded Ethernet NICs.

Second, run-time requirements may vary wildly. While sending the data
over a simple TCP connection (e.g. using FTP) consumes just a few
megabytes even on 10G Ethernet, dm block devices tend to consume much
more, because of the additional buffers allocated by device mapper.

Third, systems should be treated as "big" not so much because of the
amount of RAM, but more so because of the amount of attached devices.
I've seen a machine with devices from /dev/sda to /dev/sdvm; try to
calculate how much kernel memory is taken just by their in-kernel
representation...

Fourth, quite often there is a trade-off between how much memory is
reserved for the panic environment, and how long dumping will take. For
example, you may take advantage of multi-threading in makedumpfile, but
obviously, the additional threads need more memory (or makedumpfile
will have to do its job in more cycles, reducing speed again). Oh, did
I mention that even bringing up more CPUs has an impact on kernel
runtime memory requirements?

In short, if one size fits none, what good is it to hardcode that "one
size" into the kernel image?

Petr T

[tip:x86/urgent] x86/setup: Do not reserve a crash kernel region if booted on Xen PV

2018-04-27 Thread tip-bot for Petr Tesarik

Commit-ID:  3db3eb285259ac129f7aec6b814b3e9f6c1b372b
Gitweb: https://git.kernel.org/tip/3db3eb285259ac129f7aec6b814b3e9f6c1b372b
Author: Petr Tesarik <ptesa...@suse.cz>
AuthorDate: Wed, 25 Apr 2018 12:08:35 +0200
Committer:  Thomas Gleixner <t...@linutronix.de>
CommitDate: Fri, 27 Apr 2018 17:06:28 +0200

x86/setup: Do not reserve a crash kernel region if booted on Xen PV

Xen PV domains cannot shut down and start a crash kernel. Instead,
the crashing kernel makes a SCHEDOP_shutdown hypercall with the
reason code SHUTDOWN_crash, cf. xen_crash_shutdown() machine op in
arch/x86/xen/enlighten_pv.c.

A crash kernel reservation is merely a waste of RAM in this case. It
may also confuse users of kexec_load(2) and/or kexec_file_load(2).
When flags include KEXEC_ON_CRASH or KEXEC_FILE_ON_CRASH,
respectively, these syscalls return success, which is technically
correct, but the crash kexec image will never be actually used.

Signed-off-by: Petr Tesarik <ptesa...@suse.com>
Signed-off-by: Thomas Gleixner <t...@linutronix.de>
Reviewed-by: Juergen Gross <jgr...@suse.com>
Cc: Tom Lendacky <thomas.lenda...@amd.com>
Cc: Dou Liyang <douly.f...@cn.fujitsu.com>
Cc: Mikulas Patocka <mpato...@redhat.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: xen-de...@lists.xenproject.org
Cc: Boris Ostrovsky <boris.ostrov...@oracle.com>
Cc: Borislav Petkov <b...@suse.de>
Cc: Jean Delvare <jdelv...@suse.de>
Link: https://lkml.kernel.org/r/20180425120835.23cef...@ezekiel.suse.cz

---
 arch/x86/kernel/setup.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 6285697b6e56..5c623dfe39d1 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -534,6 +535,11 @@ static void __init reserve_crashkernel(void)
high = true;
}
 
+   if (xen_pv_domain()) {
+   pr_info("Ignoring crashkernel for a Xen PV domain\n");
+   return;
+   }
+
/* 0 means: find the address automatically */
if (crash_base <= 0) {
/*

[tip:x86/urgent] x86/setup: Do not reserve a crash kernel region if booted on Xen PV

2018-04-27 Thread tip-bot for Petr Tesarik

Commit-ID:  3db3eb285259ac129f7aec6b814b3e9f6c1b372b
Gitweb: https://git.kernel.org/tip/3db3eb285259ac129f7aec6b814b3e9f6c1b372b
Author: Petr Tesarik 
AuthorDate: Wed, 25 Apr 2018 12:08:35 +0200
Committer:  Thomas Gleixner 
CommitDate: Fri, 27 Apr 2018 17:06:28 +0200

x86/setup: Do not reserve a crash kernel region if booted on Xen PV

Xen PV domains cannot shut down and start a crash kernel. Instead,
the crashing kernel makes a SCHEDOP_shutdown hypercall with the
reason code SHUTDOWN_crash, cf. xen_crash_shutdown() machine op in
arch/x86/xen/enlighten_pv.c.

A crash kernel reservation is merely a waste of RAM in this case. It
may also confuse users of kexec_load(2) and/or kexec_file_load(2).
When flags include KEXEC_ON_CRASH or KEXEC_FILE_ON_CRASH,
respectively, these syscalls return success, which is technically
correct, but the crash kexec image will never be actually used.

Signed-off-by: Petr Tesarik 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Juergen Gross 
Cc: Tom Lendacky 
Cc: Dou Liyang 
Cc: Mikulas Patocka 
Cc: Andy Lutomirski 
Cc: "H. Peter Anvin" 
Cc: xen-de...@lists.xenproject.org
Cc: Boris Ostrovsky 
Cc: Borislav Petkov 
Cc: Jean Delvare 
Link: https://lkml.kernel.org/r/20180425120835.23cef...@ezekiel.suse.cz

---
 arch/x86/kernel/setup.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 6285697b6e56..5c623dfe39d1 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -534,6 +535,11 @@ static void __init reserve_crashkernel(void)
high = true;
}
 
+   if (xen_pv_domain()) {
+   pr_info("Ignoring crashkernel for a Xen PV domain\n");
+   return;
+   }
+
/* 0 means: find the address automatically */
if (crash_base <= 0) {
/*

[PATCH] x86: Do not reserve a crash kernel region if booted on Xen PV

2018-04-25 Thread Petr Tesarik

Xen PV domains cannot shut down and start a crash kernel. Instead,
the crashing kernel makes a SCHEDOP_shutdown hypercall with the
reason code SHUTDOWN_crash, cf. xen_crash_shutdown() machine op in
arch/x86/xen/enlighten_pv.c.

A crash kernel reservation is merely a waste of RAM in this case. It
may also confuse users of kexec_load(2) and/or kexec_file_load(2).
When flags include KEXEC_ON_CRASH or KEXEC_FILE_ON_CRASH,
respectively, these syscalls return success, which is technically
correct, but the crash kexec image will never be actually used.

Signed-off-by: Petr Tesarik <ptesa...@suse.com>
---
 arch/x86/kernel/setup.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 6285697b6e56..5c623dfe39d1 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -534,6 +535,11 @@ static void __init reserve_crashkernel(void)
high = true;
}
 
+   if (xen_pv_domain()) {
+   pr_info("Ignoring crashkernel for a Xen PV domain\n");
+   return;
+   }
+
/* 0 means: find the address automatically */
if (crash_base <= 0) {
/*
-- 
2.13.6

[PATCH] x86: Do not reserve a crash kernel region if booted on Xen PV

2018-04-25 Thread Petr Tesarik

Xen PV domains cannot shut down and start a crash kernel. Instead,
the crashing kernel makes a SCHEDOP_shutdown hypercall with the
reason code SHUTDOWN_crash, cf. xen_crash_shutdown() machine op in
arch/x86/xen/enlighten_pv.c.

A crash kernel reservation is merely a waste of RAM in this case. It
may also confuse users of kexec_load(2) and/or kexec_file_load(2).
When flags include KEXEC_ON_CRASH or KEXEC_FILE_ON_CRASH,
respectively, these syscalls return success, which is technically
correct, but the crash kexec image will never be actually used.

Signed-off-by: Petr Tesarik 
---
 arch/x86/kernel/setup.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 6285697b6e56..5c623dfe39d1 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -534,6 +535,11 @@ static void __init reserve_crashkernel(void)
high = true;
}
 
+   if (xen_pv_domain()) {
+   pr_info("Ignoring crashkernel for a Xen PV domain\n");
+   return;
+   }
+
/* 0 means: find the address automatically */
if (crash_base <= 0) {
/*
-- 
2.13.6

[PATCH] kexec: export PG_swapbacked to VMCOREINFO

2018-04-10 Thread Petr Tesarik

Since commit 6326fec1122cde256bd2a8c63f2606e08e44ce1d PG_swapcache
is an alias for PG_owner_priv_1, which may be also used for other
purposes. To know whether the bit indeed has the PG_swapcache
meaning, it is necessary to check PG_swapbacked, hence this bit must
be exported.

Signed-off-by: Petr Tesarik <ptesa...@suse.com>
---
 kernel/crash_core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index a93590cdd9e1..f7674d676889 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -454,6 +454,7 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_NUMBER(PG_lru);
VMCOREINFO_NUMBER(PG_private);
VMCOREINFO_NUMBER(PG_swapcache);
+   VMCOREINFO_NUMBER(PG_swapbacked);
VMCOREINFO_NUMBER(PG_slab);
 #ifdef CONFIG_MEMORY_FAILURE
VMCOREINFO_NUMBER(PG_hwpoison);
-- 
2.13.6

[PATCH] kexec: export PG_swapbacked to VMCOREINFO

2018-04-10 Thread Petr Tesarik

Since commit 6326fec1122cde256bd2a8c63f2606e08e44ce1d PG_swapcache
is an alias for PG_owner_priv_1, which may be also used for other
purposes. To know whether the bit indeed has the PG_swapcache
meaning, it is necessary to check PG_swapbacked, hence this bit must
be exported.

Signed-off-by: Petr Tesarik 
---
 kernel/crash_core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index a93590cdd9e1..f7674d676889 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -454,6 +454,7 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_NUMBER(PG_lru);
VMCOREINFO_NUMBER(PG_private);
VMCOREINFO_NUMBER(PG_swapcache);
+   VMCOREINFO_NUMBER(PG_swapbacked);
VMCOREINFO_NUMBER(PG_slab);
 #ifdef CONFIG_MEMORY_FAILURE
VMCOREINFO_NUMBER(PG_hwpoison);
-- 
2.13.6

[PATCH v2] Fix explanation of lower bits in the SPARSEMEM mem_map pointer

2018-01-25 Thread Petr Tesarik

The comment is confusing. On the one hand, it refers to 32-bit
alignment (struct page alignment on 32-bit platforms), but this
would only guarantee that the 2 lowest bits must be zero. On the
other hand, it claims that at least 3 bits are available, and 3 bits
are actually used.

This is not broken, because there is a stronger alignment guarantee,
just less obvious. Let's fix the comment to make it clear how many
bits are available and why.

Although memmap arrays are allocated in various places, the
resulting pointer is encoded eventually, so I am adding a BUG_ON()
here to enforce at runtime that all expected bits are indeed
available.

I have also added a BUILD_BUG_ON to check that PFN_SECTION_SHIFT is
sufficient, because this part of the calculation can be easily
checked at build time.

Signed-off-by: Petr Tesarik <ptesa...@suse.com>
---
 include/linux/mmzone.h | 12 ++--
 mm/sparse.c|  6 +-
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 67f2e3c38939..7522a6987595 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1166,8 +1166,16 @@ extern unsigned long usemap_size(void);
 
 /*
  * We use the lower bits of the mem_map pointer to store
- * a little bit of information.  There should be at least
- * 3 bits here due to 32-bit alignment.
+ * a little bit of information.  The pointer is calculated
+ * as mem_map - section_nr_to_pfn(pnum).  The result is
+ * aligned to the minimum alignment of the two values:
+ *   1. All mem_map arrays are page-aligned.
+ *   2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT
+ *  lowest bits.  PFN_SECTION_SHIFT is arch-specific
+ *  (equal SECTION_SIZE_BITS - PAGE_SHIFT), and the
+ *  worst combination is powerpc with 256k pages,
+ *  which results in PFN_SECTION_SHIFT equal 6.
+ * To sum it up, at least 6 bits are available.
  */
 #defineSECTION_MARKED_PRESENT  (1UL<<0)
 #define SECTION_HAS_MEM_MAP(1UL<<1)
diff --git a/mm/sparse.c b/mm/sparse.c
index 2609aba121e8..6b8b5e91ceef 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -264,7 +264,11 @@ unsigned long __init node_memmap_size_bytes(int nid, 
unsigned long start_pfn,
  */
 static unsigned long sparse_encode_mem_map(struct page *mem_map, unsigned long 
pnum)
 {
-   return (unsigned long)(mem_map - (section_nr_to_pfn(pnum)));
+   unsigned long coded_mem_map =
+   (unsigned long)(mem_map - (section_nr_to_pfn(pnum)));
+   BUILD_BUG_ON(SECTION_MAP_LAST_BIT > (1UL<<PFN_SECTION_SHIFT));
+   BUG_ON(coded_mem_map & ~SECTION_MAP_MASK);
+   return coded_mem_map;
 }
 
 /*
-- 
2.13.6

[PATCH v2] Fix explanation of lower bits in the SPARSEMEM mem_map pointer

2018-01-25 Thread Petr Tesarik

The comment is confusing. On the one hand, it refers to 32-bit
alignment (struct page alignment on 32-bit platforms), but this
would only guarantee that the 2 lowest bits must be zero. On the
other hand, it claims that at least 3 bits are available, and 3 bits
are actually used.

This is not broken, because there is a stronger alignment guarantee,
just less obvious. Let's fix the comment to make it clear how many
bits are available and why.

Although memmap arrays are allocated in various places, the
resulting pointer is encoded eventually, so I am adding a BUG_ON()
here to enforce at runtime that all expected bits are indeed
available.

I have also added a BUILD_BUG_ON to check that PFN_SECTION_SHIFT is
sufficient, because this part of the calculation can be easily
checked at build time.

Signed-off-by: Petr Tesarik 
---
 include/linux/mmzone.h | 12 ++--
 mm/sparse.c|  6 +-
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 67f2e3c38939..7522a6987595 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1166,8 +1166,16 @@ extern unsigned long usemap_size(void);
 
 /*
  * We use the lower bits of the mem_map pointer to store
- * a little bit of information.  There should be at least
- * 3 bits here due to 32-bit alignment.
+ * a little bit of information.  The pointer is calculated
+ * as mem_map - section_nr_to_pfn(pnum).  The result is
+ * aligned to the minimum alignment of the two values:
+ *   1. All mem_map arrays are page-aligned.
+ *   2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT
+ *  lowest bits.  PFN_SECTION_SHIFT is arch-specific
+ *  (equal SECTION_SIZE_BITS - PAGE_SHIFT), and the
+ *  worst combination is powerpc with 256k pages,
+ *  which results in PFN_SECTION_SHIFT equal 6.
+ * To sum it up, at least 6 bits are available.
  */
 #defineSECTION_MARKED_PRESENT  (1UL<<0)
 #define SECTION_HAS_MEM_MAP(1UL<<1)
diff --git a/mm/sparse.c b/mm/sparse.c
index 2609aba121e8..6b8b5e91ceef 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -264,7 +264,11 @@ unsigned long __init node_memmap_size_bytes(int nid, 
unsigned long start_pfn,
  */
 static unsigned long sparse_encode_mem_map(struct page *mem_map, unsigned long 
pnum)
 {
-   return (unsigned long)(mem_map - (section_nr_to_pfn(pnum)));
+   unsigned long coded_mem_map =
+   (unsigned long)(mem_map - (section_nr_to_pfn(pnum)));
+   BUILD_BUG_ON(SECTION_MAP_LAST_BIT > (1UL<

Re: [PATCH] Fix explanation of lower bits in the SPARSEMEM mem_map pointer

2018-01-24 Thread Petr Tesarik

On Wed, 24 Jan 2018 13:43:53 +0100
Michal Hocko <mho...@kernel.org> wrote:

> On Fri 19-01-18 14:21:33, Petr Tesarik wrote:
> > On Fri, 19 Jan 2018 13:39:56 +0100
> > Michal Hocko <mho...@kernel.org> wrote:
> >   
> > > On Fri 19-01-18 08:09:08, Petr Tesarik wrote:
> > > [...]  
> > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > > index 67f2e3c38939..7522a6987595 100644
> > > > --- a/include/linux/mmzone.h
> > > > +++ b/include/linux/mmzone.h
> > > > @@ -1166,8 +1166,16 @@ extern unsigned long usemap_size(void);
> > > >  
> > > >  /*
> > > >   * We use the lower bits of the mem_map pointer to store
> > > > - * a little bit of information.  There should be at least
> > > > - * 3 bits here due to 32-bit alignment.
> > > > + * a little bit of information.  The pointer is calculated
> > > > + * as mem_map - section_nr_to_pfn(pnum).  The result is
> > > > + * aligned to the minimum alignment of the two values:
> > > > + *   1. All mem_map arrays are page-aligned.
> > > > + *   2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT
> > > > + *  lowest bits.  PFN_SECTION_SHIFT is arch-specific
> > > > + *  (equal SECTION_SIZE_BITS - PAGE_SHIFT), and the
> > > > + *  worst combination is powerpc with 256k pages,
> > > > + *  which results in PFN_SECTION_SHIFT equal 6.
> > > > + * To sum it up, at least 6 bits are available.
> > > >   */
> > > 
> > > This is _much_ better indeed. Do you think we can go one step further
> > > and add BUG_ON into the sparse code to guarantee that every mmemap
> > > is indeed aligned properly so that SECTION_MAP_LAST_BIT-1 bits are never
> > > used?  
> > 
> > This is easy for the section_nr_to_pfn() part. I'd just add:
> > 
> >   BUILD_BUG_ON(PFN_SECTION_SHIFT < SECTION_MAP_LAST_BIT);
> > 
> > But for the mem_map arrays... Do you mean adding a run-time BUG_ON into
> > all allocation paths?
> > 
> > Note that mem_map arrays can be allocated by:
> > 
> >   a) __earlyonly_bootmem_alloc
> >   b) memblock_virt_alloc_try_nid
> >   c) memblock_virt_alloc_try_nid_raw
> >   d) alloc_remap (only arch/tile still has it)
> > 
> > Some allocation paths are in mm/sparse.c, others are
> > mm/sparse-vmemmap.c, so it becomes a bit messy, but since it's
> > a single line in each, it may work.  
> 
> Yeah, it is a mess. So I will leave it up to you. I do not want to block
> your comment update which is a nice improvement. So with or without the
> runtime check feel free to add

Hell, since I have already taken the time to review all the allocation
paths, I can just also add those BUG_ONs. I was just curious if you had
a better idea than spraying them all around the place, but it seems you
don't. ;-)

In short, stay tuned for v2, which is now WIP.

Petr T

Re: [PATCH] Fix explanation of lower bits in the SPARSEMEM mem_map pointer

2018-01-24 Thread Petr Tesarik

On Wed, 24 Jan 2018 13:43:53 +0100
Michal Hocko  wrote:

> On Fri 19-01-18 14:21:33, Petr Tesarik wrote:
> > On Fri, 19 Jan 2018 13:39:56 +0100
> > Michal Hocko  wrote:
> >   
> > > On Fri 19-01-18 08:09:08, Petr Tesarik wrote:
> > > [...]  
> > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > > index 67f2e3c38939..7522a6987595 100644
> > > > --- a/include/linux/mmzone.h
> > > > +++ b/include/linux/mmzone.h
> > > > @@ -1166,8 +1166,16 @@ extern unsigned long usemap_size(void);
> > > >  
> > > >  /*
> > > >   * We use the lower bits of the mem_map pointer to store
> > > > - * a little bit of information.  There should be at least
> > > > - * 3 bits here due to 32-bit alignment.
> > > > + * a little bit of information.  The pointer is calculated
> > > > + * as mem_map - section_nr_to_pfn(pnum).  The result is
> > > > + * aligned to the minimum alignment of the two values:
> > > > + *   1. All mem_map arrays are page-aligned.
> > > > + *   2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT
> > > > + *  lowest bits.  PFN_SECTION_SHIFT is arch-specific
> > > > + *  (equal SECTION_SIZE_BITS - PAGE_SHIFT), and the
> > > > + *  worst combination is powerpc with 256k pages,
> > > > + *  which results in PFN_SECTION_SHIFT equal 6.
> > > > + * To sum it up, at least 6 bits are available.
> > > >   */
> > > 
> > > This is _much_ better indeed. Do you think we can go one step further
> > > and add BUG_ON into the sparse code to guarantee that every mmemap
> > > is indeed aligned properly so that SECTION_MAP_LAST_BIT-1 bits are never
> > > used?  
> > 
> > This is easy for the section_nr_to_pfn() part. I'd just add:
> > 
> >   BUILD_BUG_ON(PFN_SECTION_SHIFT < SECTION_MAP_LAST_BIT);
> > 
> > But for the mem_map arrays... Do you mean adding a run-time BUG_ON into
> > all allocation paths?
> > 
> > Note that mem_map arrays can be allocated by:
> > 
> >   a) __earlyonly_bootmem_alloc
> >   b) memblock_virt_alloc_try_nid
> >   c) memblock_virt_alloc_try_nid_raw
> >   d) alloc_remap (only arch/tile still has it)
> > 
> > Some allocation paths are in mm/sparse.c, others are
> > mm/sparse-vmemmap.c, so it becomes a bit messy, but since it's
> > a single line in each, it may work.  
> 
> Yeah, it is a mess. So I will leave it up to you. I do not want to block
> your comment update which is a nice improvement. So with or without the
> runtime check feel free to add

Hell, since I have already taken the time to review all the allocation
paths, I can just also add those BUG_ONs. I was just curious if you had
a better idea than spraying them all around the place, but it seems you
don't. ;-)

In short, stay tuned for v2, which is now WIP.

Petr T

Re: [PATCH] Fix explanation of lower bits in the SPARSEMEM mem_map pointer

2018-01-19 Thread Petr Tesarik

On Fri, 19 Jan 2018 13:39:56 +0100
Michal Hocko <mho...@kernel.org> wrote:

> On Fri 19-01-18 08:09:08, Petr Tesarik wrote:
> [...]
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 67f2e3c38939..7522a6987595 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -1166,8 +1166,16 @@ extern unsigned long usemap_size(void);
> >  
> >  /*
> >   * We use the lower bits of the mem_map pointer to store
> > - * a little bit of information.  There should be at least
> > - * 3 bits here due to 32-bit alignment.
> > + * a little bit of information.  The pointer is calculated
> > + * as mem_map - section_nr_to_pfn(pnum).  The result is
> > + * aligned to the minimum alignment of the two values:
> > + *   1. All mem_map arrays are page-aligned.
> > + *   2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT
> > + *  lowest bits.  PFN_SECTION_SHIFT is arch-specific
> > + *  (equal SECTION_SIZE_BITS - PAGE_SHIFT), and the
> > + *  worst combination is powerpc with 256k pages,
> > + *  which results in PFN_SECTION_SHIFT equal 6.
> > + * To sum it up, at least 6 bits are available.
> >   */  
> 
> This is _much_ better indeed. Do you think we can go one step further
> and add BUG_ON into the sparse code to guarantee that every mmemap
> is indeed aligned properly so that SECTION_MAP_LAST_BIT-1 bits are never
> used?

This is easy for the section_nr_to_pfn() part. I'd just add:

  BUILD_BUG_ON(PFN_SECTION_SHIFT < SECTION_MAP_LAST_BIT);

But for the mem_map arrays... Do you mean adding a run-time BUG_ON into
all allocation paths?

Note that mem_map arrays can be allocated by:

  a) __earlyonly_bootmem_alloc
  b) memblock_virt_alloc_try_nid
  c) memblock_virt_alloc_try_nid_raw
  d) alloc_remap (only arch/tile still has it)

Some allocation paths are in mm/sparse.c, others are
mm/sparse-vmemmap.c, so it becomes a bit messy, but since it's
a single line in each, it may work.

Petr T

Re: [PATCH] Fix explanation of lower bits in the SPARSEMEM mem_map pointer

2018-01-19 Thread Petr Tesarik

On Fri, 19 Jan 2018 13:39:56 +0100
Michal Hocko  wrote:

> On Fri 19-01-18 08:09:08, Petr Tesarik wrote:
> [...]
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 67f2e3c38939..7522a6987595 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -1166,8 +1166,16 @@ extern unsigned long usemap_size(void);
> >  
> >  /*
> >   * We use the lower bits of the mem_map pointer to store
> > - * a little bit of information.  There should be at least
> > - * 3 bits here due to 32-bit alignment.
> > + * a little bit of information.  The pointer is calculated
> > + * as mem_map - section_nr_to_pfn(pnum).  The result is
> > + * aligned to the minimum alignment of the two values:
> > + *   1. All mem_map arrays are page-aligned.
> > + *   2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT
> > + *  lowest bits.  PFN_SECTION_SHIFT is arch-specific
> > + *  (equal SECTION_SIZE_BITS - PAGE_SHIFT), and the
> > + *  worst combination is powerpc with 256k pages,
> > + *  which results in PFN_SECTION_SHIFT equal 6.
> > + * To sum it up, at least 6 bits are available.
> >   */  
> 
> This is _much_ better indeed. Do you think we can go one step further
> and add BUG_ON into the sparse code to guarantee that every mmemap
> is indeed aligned properly so that SECTION_MAP_LAST_BIT-1 bits are never
> used?

This is easy for the section_nr_to_pfn() part. I'd just add:

  BUILD_BUG_ON(PFN_SECTION_SHIFT < SECTION_MAP_LAST_BIT);

But for the mem_map arrays... Do you mean adding a run-time BUG_ON into
all allocation paths?

Note that mem_map arrays can be allocated by:

  a) __earlyonly_bootmem_alloc
  b) memblock_virt_alloc_try_nid
  c) memblock_virt_alloc_try_nid_raw
  d) alloc_remap (only arch/tile still has it)

Some allocation paths are in mm/sparse.c, others are
mm/sparse-vmemmap.c, so it becomes a bit messy, but since it's
a single line in each, it may work.

Petr T

[PATCH] Fix explanation of lower bits in the SPARSEMEM mem_map pointer

2018-01-18 Thread Petr Tesarik

The comment is confusing. On the one hand, it refers to 32-bit
alignment (struct page alignment on 32-bit platforms), but this
would only guarantee that the 2 lowest bits must be zero. On the
other hand, it claims that at least 3 bits are available, and 3 bits
are actually used.

This is not broken, because there is a stronger alignment guarantee,
just less obvious. Let's fix the comment to make it clear how many
bits are available and why.

Signed-off-by: Petr Tesarik <ptesa...@suse.com>
---
 include/linux/mmzone.h | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 67f2e3c38939..7522a6987595 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1166,8 +1166,16 @@ extern unsigned long usemap_size(void);
 
 /*
  * We use the lower bits of the mem_map pointer to store
- * a little bit of information.  There should be at least
- * 3 bits here due to 32-bit alignment.
+ * a little bit of information.  The pointer is calculated
+ * as mem_map - section_nr_to_pfn(pnum).  The result is
+ * aligned to the minimum alignment of the two values:
+ *   1. All mem_map arrays are page-aligned.
+ *   2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT
+ *  lowest bits.  PFN_SECTION_SHIFT is arch-specific
+ *  (equal SECTION_SIZE_BITS - PAGE_SHIFT), and the
+ *  worst combination is powerpc with 256k pages,
+ *  which results in PFN_SECTION_SHIFT equal 6.
+ * To sum it up, at least 6 bits are available.
  */
 #defineSECTION_MARKED_PRESENT  (1UL<<0)
 #define SECTION_HAS_MEM_MAP(1UL<<1)
-- 
2.13.6

[PATCH] Fix explanation of lower bits in the SPARSEMEM mem_map pointer

2018-01-18 Thread Petr Tesarik

The comment is confusing. On the one hand, it refers to 32-bit
alignment (struct page alignment on 32-bit platforms), but this
would only guarantee that the 2 lowest bits must be zero. On the
other hand, it claims that at least 3 bits are available, and 3 bits
are actually used.

This is not broken, because there is a stronger alignment guarantee,
just less obvious. Let's fix the comment to make it clear how many
bits are available and why.

Signed-off-by: Petr Tesarik 
---
 include/linux/mmzone.h | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 67f2e3c38939..7522a6987595 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1166,8 +1166,16 @@ extern unsigned long usemap_size(void);
 
 /*
  * We use the lower bits of the mem_map pointer to store
- * a little bit of information.  There should be at least
- * 3 bits here due to 32-bit alignment.
+ * a little bit of information.  The pointer is calculated
+ * as mem_map - section_nr_to_pfn(pnum).  The result is
+ * aligned to the minimum alignment of the two values:
+ *   1. All mem_map arrays are page-aligned.
+ *   2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT
+ *  lowest bits.  PFN_SECTION_SHIFT is arch-specific
+ *  (equal SECTION_SIZE_BITS - PAGE_SHIFT), and the
+ *  worst combination is powerpc with 256k pages,
+ *  which results in PFN_SECTION_SHIFT equal 6.
+ * To sum it up, at least 6 bits are available.
  */
 #defineSECTION_MARKED_PRESENT  (1UL<<0)
 #define SECTION_HAS_MEM_MAP(1UL<<1)
-- 
2.13.6

Re: [Xen-devel] [PATCH v2] xen, kdump: handle pv domain in paddr_vmcoreinfo_note()

2017-04-15 Thread Petr Tesarik

On Sat, 15 Apr 2017 00:26:05 +0200
Daniel Kiper <daniel.ki...@oracle.com> wrote:

> On Fri, Apr 14, 2017 at 06:53:36PM +0200, Petr Tesarik wrote:
>[...]
> > shifted towards libkdumpfile (https://github.com/ptesarik/libkdumpfile),
> > and this library can open PV guest dump files without any issues.
> 
> Great! AIUI, it reminds my idea to make such think. However, I have not
> have time to make it happen. Is it based on makedumpfile or written from
> scratch? Do you plan support for Linux kernel dumps and/or Xen ones?

Some ideas are borrowed from existing tools (makedumpfile, crash). All
code is written from scratch, however.

The kdumpfile library itself is designed for use with any platform and
operating system. Xen is treated as just another type of operating
system. Based on which OS type is initialized, the library is able to
provide a hypervisor view or a Dom0 view.

There is another project led by Jeff Mahoney, which extends standard
gdb with semantic commands. This project supports only x86_64 Linux
right now. See https://github.com/jeffmahoney/crash-python.

Petr T

Re: [Xen-devel] [PATCH v2] xen, kdump: handle pv domain in paddr_vmcoreinfo_note()

2017-04-15 Thread Petr Tesarik

On Sat, 15 Apr 2017 00:26:05 +0200
Daniel Kiper  wrote:

> On Fri, Apr 14, 2017 at 06:53:36PM +0200, Petr Tesarik wrote:
>[...]
> > shifted towards libkdumpfile (https://github.com/ptesarik/libkdumpfile),
> > and this library can open PV guest dump files without any issues.
> 
> Great! AIUI, it reminds my idea to make such think. However, I have not
> have time to make it happen. Is it based on makedumpfile or written from
> scratch? Do you plan support for Linux kernel dumps and/or Xen ones?

Some ideas are borrowed from existing tools (makedumpfile, crash). All
code is written from scratch, however.

The kdumpfile library itself is designed for use with any platform and
operating system. Xen is treated as just another type of operating
system. Based on which OS type is initialized, the library is able to
provide a hypervisor view or a Dom0 view.

There is another project led by Jeff Mahoney, which extends standard
gdb with semantic commands. This project supports only x86_64 Linux
right now. See https://github.com/jeffmahoney/crash-python.

Petr T

Re: [Xen-devel] [PATCH v2] xen, kdump: handle pv domain in paddr_vmcoreinfo_note()

2017-04-14 Thread Petr Tesarik

On Tue, 11 Apr 2017 19:20:08 +0200
Daniel Kiper <daniel.ki...@oracle.com> wrote:

> On Tue, Apr 11, 2017 at 04:59:16PM +0200, Petr Tesarik wrote:
>[...]
> > Tested-by: Petr Tesarik <ptesa...@suse.com>
> >
> > I copied the complete /proc/vmcore to a directory on disk. Exactly
> > as expected, crash works both without the patch and with the patch, as
> > it does not use VMCOREINFO at all (instead, crash obtains the
> > information from kernel debuginfo directly).
> 
> Thanks for doing the tests. I suppose that you have tested HVM guests.

Not really. I crashed Dom0, which is in turn sent to the hypervisor, so
the result is a complete host dump, including Xen hypervisor data and
all domains.

> IIRC, PV guests are not supported by crash right now due to p2m VMA
> mapping. At least it was an issue some time ago. Is it still valid?

Yes, this is correct. I tested this behaviour a few weeks ago.

> Anyway, one guy in Oracle works on fix for that issue and I do review.
> We are going to post it in 2-3 weeks.

All right. FYI I do not plan to put much effort into it, as my focus has
shifted towards libkdumpfile (https://github.com/ptesarik/libkdumpfile),
and this library can open PV guest dump files without any issues.

Petr T

Re: [Xen-devel] [PATCH v2] xen, kdump: handle pv domain in paddr_vmcoreinfo_note()

2017-04-14 Thread Petr Tesarik

On Tue, 11 Apr 2017 19:20:08 +0200
Daniel Kiper  wrote:

> On Tue, Apr 11, 2017 at 04:59:16PM +0200, Petr Tesarik wrote:
>[...]
> > Tested-by: Petr Tesarik 
> >
> > I copied the complete /proc/vmcore to a directory on disk. Exactly
> > as expected, crash works both without the patch and with the patch, as
> > it does not use VMCOREINFO at all (instead, crash obtains the
> > information from kernel debuginfo directly).
> 
> Thanks for doing the tests. I suppose that you have tested HVM guests.

Not really. I crashed Dom0, which is in turn sent to the hypervisor, so
the result is a complete host dump, including Xen hypervisor data and
all domains.

> IIRC, PV guests are not supported by crash right now due to p2m VMA
> mapping. At least it was an issue some time ago. Is it still valid?

Yes, this is correct. I tested this behaviour a few weeks ago.

> Anyway, one guy in Oracle works on fix for that issue and I do review.
> We are going to post it in 2-3 weeks.

All right. FYI I do not plan to put much effort into it, as my focus has
shifted towards libkdumpfile (https://github.com/ptesarik/libkdumpfile),
and this library can open PV guest dump files without any issues.

Petr T

Re: [Xen-devel] [PATCH v2] xen, kdump: handle pv domain in paddr_vmcoreinfo_note()

2017-04-11 Thread Petr Tesarik

On Tue, 11 Apr 2017 15:00:58 +0200
Daniel Kiper <daniel.ki...@oracle.com> wrote:

> On Tue, Apr 11, 2017 at 02:45:43PM +0200, Juergen Gross wrote:
> > On 03/04/17 14:42, Daniel Kiper wrote:
> > > On Fri, Mar 31, 2017 at 12:14:38PM +0200, Juergen Gross wrote:
> > >> For kdump to work correctly it needs the physical address of
> > >> vmcoreinfo_note. When running as dom0 this means the virtual address
> > >> has to be translated to the related machine address.
> > >>
> > >> paddr_vmcoreinfo_note() is meant to do the translation via
> > >> __pa_symbol() only, but being attributed "weak" it can be replaced
> > >> easily in Xen case.
> > >>
> > >> Signed-off-by: Juergen Gross <jgr...@suse.com>
> > >
> > > Have you tested this patch with latest crash tool? Do dom0 and Xen
> > > hypervisor analysis work without any issue (at least basic commands
> > > like dmesg, bt, ps, etc.)? If yes for both you can add:
> > >
> > > Reviewed-by: Daniel Kiper <daniel.ki...@oracle.com>
> >
> > So can I add your R-b: now?
> 
> My R-b is still valid. Though, let's wait for Petr's Tested-by. He
> did makedumpfile tests but I asked him to do crash tool tests too.
> I think it is important.

Tested-by: Petr Tesarik <ptesa...@suse.com>

I copied the complete /proc/vmcore to a directory on disk. Exactly
as expected, crash works both without the patch and with the patch, as
it does not use VMCOREINFO at all (instead, crash obtains the
information from kernel debuginfo directly).

Without the patch:

morricone:/abuild/dumps/2017-04-11-16:38/:[0]# crash vmlinux vmcore 

crash 7.1.8
Copyright (C) 2002-2016  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

crash: vmlinux: no .gnu_debuglink section
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

  KERNEL: vmlinux   
DUMPFILE: vmcore
CPUS: 12 [OFFLINE: 11]
DATE: Tue Apr 11 16:37:34 2017
  UPTIME: 00:11:51
LOAD AVERAGE: 0.04, 0.08, 0.08
   TASKS: 269
NODENAME: morricone
 RELEASE: 4.11.0-rc5-default-pt+
 VERSION: #1 SMP PREEMPT Thu Apr 6 17:40:53 CEST 2017
 MACHINE: x86_64  (2400 Mhz)
  MEMORY: 16 GB
   PANIC: "sysrq: SysRq : Trigger a crash"
 PID: 12009
 COMMAND: "bash"
TASK: 8804065d2080  [THREAD_INFO: 8804065d2080]
 CPU: 7
   STATE: TASK_RUNNING (SYSRQ)

crash> 

With the patch included:

morricone:/abuild/dumps/2017-04-11-12:52/:[0]# crash vmlinux vmcore  

crash 7.1.8
Copyright (C) 2002-2016  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

crash: vmlinux: no .gnu_debuglink section
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

  KERNEL: vmlinux   
DUMPFILE: vmcore
CPUS: 12 [OFFLINE: 11]

Re: [Xen-devel] [PATCH v2] xen, kdump: handle pv domain in paddr_vmcoreinfo_note()

2017-04-11 Thread Petr Tesarik

On Tue, 11 Apr 2017 15:00:58 +0200
Daniel Kiper  wrote:

> On Tue, Apr 11, 2017 at 02:45:43PM +0200, Juergen Gross wrote:
> > On 03/04/17 14:42, Daniel Kiper wrote:
> > > On Fri, Mar 31, 2017 at 12:14:38PM +0200, Juergen Gross wrote:
> > >> For kdump to work correctly it needs the physical address of
> > >> vmcoreinfo_note. When running as dom0 this means the virtual address
> > >> has to be translated to the related machine address.
> > >>
> > >> paddr_vmcoreinfo_note() is meant to do the translation via
> > >> __pa_symbol() only, but being attributed "weak" it can be replaced
> > >> easily in Xen case.
> > >>
> > >> Signed-off-by: Juergen Gross 
> > >
> > > Have you tested this patch with latest crash tool? Do dom0 and Xen
> > > hypervisor analysis work without any issue (at least basic commands
> > > like dmesg, bt, ps, etc.)? If yes for both you can add:
> > >
> > > Reviewed-by: Daniel Kiper 
> >
> > So can I add your R-b: now?
> 
> My R-b is still valid. Though, let's wait for Petr's Tested-by. He
> did makedumpfile tests but I asked him to do crash tool tests too.
> I think it is important.

Tested-by: Petr Tesarik 

I copied the complete /proc/vmcore to a directory on disk. Exactly
as expected, crash works both without the patch and with the patch, as
it does not use VMCOREINFO at all (instead, crash obtains the
information from kernel debuginfo directly).

Without the patch:

morricone:/abuild/dumps/2017-04-11-16:38/:[0]# crash vmlinux vmcore 

crash 7.1.8
Copyright (C) 2002-2016  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

crash: vmlinux: no .gnu_debuglink section
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

  KERNEL: vmlinux   
DUMPFILE: vmcore
CPUS: 12 [OFFLINE: 11]
DATE: Tue Apr 11 16:37:34 2017
  UPTIME: 00:11:51
LOAD AVERAGE: 0.04, 0.08, 0.08
   TASKS: 269
NODENAME: morricone
 RELEASE: 4.11.0-rc5-default-pt+
 VERSION: #1 SMP PREEMPT Thu Apr 6 17:40:53 CEST 2017
 MACHINE: x86_64  (2400 Mhz)
  MEMORY: 16 GB
   PANIC: "sysrq: SysRq : Trigger a crash"
 PID: 12009
 COMMAND: "bash"
TASK: 8804065d2080  [THREAD_INFO: 8804065d2080]
 CPU: 7
   STATE: TASK_RUNNING (SYSRQ)

crash> 

With the patch included:

morricone:/abuild/dumps/2017-04-11-12:52/:[0]# crash vmlinux vmcore  

crash 7.1.8
Copyright (C) 2002-2016  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

crash: vmlinux: no .gnu_debuglink section
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

  KERNEL: vmlinux   
DUMPFILE: vmcore
CPUS: 12 [OFFLINE: 11]
DATE: Tue Apr 11 12:51:15 2017
  UPTIME: 4 days, 01:31:42
LOAD AVERAGE: 0.04, 0.01, 0.00
   T

Re: [Xen-devel] [PATCH v2] xen, kdump: handle pv domain in paddr_vmcoreinfo_note()

2017-04-11 Thread Petr Tesarik

On Mon, 10 Apr 2017 22:49:33 +0200
Daniel Kiper <daniel.ki...@oracle.com> wrote:

> On Fri, Apr 07, 2017 at 11:16:22AM +0200, Petr Tesarik wrote:
> > On Wed, 5 Apr 2017 13:13:00 +0200
> > Petr Tesarik <ptesa...@suse.com> wrote:
> >
> > > On Tue, 4 Apr 2017 12:42:53 -0700 (PDT)
> > > Daniel Kiper <daniel.ki...@oracle.com> wrote:
> > >
> > >[...]
> > > > So, if Petr did relevant tests that is nice. However, then, IMO, this
> > > > patch begs Petr Tested-by.
> > >
> > > Actually, I tested with this patch applied on top of kernel 4.4 (SLES
> > > 12 SP2). It matches what traditional Xen had always done, so I am quite
> > > confident it will work with a later kernel, but to give my Tested-by,
> > > let me first re-run the test on master, hopefully until today EOB.
> >
> > It took me much longer than anticipated (I had some trouble setting up
> > the host again), but I can confirm that the patch works as expected on
> 
> No problem. I know how it works.
> 
> > top of 4.11-rc5.
> 
> Great!
> 
> > Without the patch, makedumpfile in the crash kernel complains:
> >
> > /proc/vmcore doesn't contain vmcoreinfo.
> 
> Though, I would like to ask you to do crash tool tests too.
> Could you do that?

I'm not sure what kind of tests you want me to perform. I'm not able to
save a dump file without the patch.

Well ... I could reserve enough memory to launch a crash session
directly in the kdump environment. I'll get to you later.

> > With the patch applied, dumping still fails later because of an
> > unrelated bug in makedumpfile, but I was able to extract the kernel
> > message buffer with "makedumpfile --dump-dmesg". This already confirms
> > VMCOREINFO presence and usability.
> 
> Is it Xen specific issue or more generic one?

It is Xen-specific, and I even have a patch now, about to be sent to
the kexec ML.

Petr T

Re: [Xen-devel] [PATCH v2] xen, kdump: handle pv domain in paddr_vmcoreinfo_note()

2017-04-11 Thread Petr Tesarik

On Mon, 10 Apr 2017 22:49:33 +0200
Daniel Kiper  wrote:

> On Fri, Apr 07, 2017 at 11:16:22AM +0200, Petr Tesarik wrote:
> > On Wed, 5 Apr 2017 13:13:00 +0200
> > Petr Tesarik  wrote:
> >
> > > On Tue, 4 Apr 2017 12:42:53 -0700 (PDT)
> > > Daniel Kiper  wrote:
> > >
> > >[...]
> > > > So, if Petr did relevant tests that is nice. However, then, IMO, this
> > > > patch begs Petr Tested-by.
> > >
> > > Actually, I tested with this patch applied on top of kernel 4.4 (SLES
> > > 12 SP2). It matches what traditional Xen had always done, so I am quite
> > > confident it will work with a later kernel, but to give my Tested-by,
> > > let me first re-run the test on master, hopefully until today EOB.
> >
> > It took me much longer than anticipated (I had some trouble setting up
> > the host again), but I can confirm that the patch works as expected on
> 
> No problem. I know how it works.
> 
> > top of 4.11-rc5.
> 
> Great!
> 
> > Without the patch, makedumpfile in the crash kernel complains:
> >
> > /proc/vmcore doesn't contain vmcoreinfo.
> 
> Though, I would like to ask you to do crash tool tests too.
> Could you do that?

I'm not sure what kind of tests you want me to perform. I'm not able to
save a dump file without the patch.

Well ... I could reserve enough memory to launch a crash session
directly in the kdump environment. I'll get to you later.

> > With the patch applied, dumping still fails later because of an
> > unrelated bug in makedumpfile, but I was able to extract the kernel
> > message buffer with "makedumpfile --dump-dmesg". This already confirms
> > VMCOREINFO presence and usability.
> 
> Is it Xen specific issue or more generic one?

It is Xen-specific, and I even have a patch now, about to be sent to
the kexec ML.

Petr T

Re: [Xen-devel] [PATCH v2] xen, kdump: handle pv domain in paddr_vmcoreinfo_note()

2017-04-07 Thread Petr Tesarik

On Wed, 5 Apr 2017 13:13:00 +0200
Petr Tesarik <ptesa...@suse.com> wrote:

> On Tue, 4 Apr 2017 12:42:53 -0700 (PDT)
> Daniel Kiper <daniel.ki...@oracle.com> wrote:
> 
>[...]
> > So, if Petr did relevant tests that is nice. However, then, IMO, this
> > patch begs Petr Tested-by.
> 
> Actually, I tested with this patch applied on top of kernel 4.4 (SLES
> 12 SP2). It matches what traditional Xen had always done, so I am quite
> confident it will work with a later kernel, but to give my Tested-by,
> let me first re-run the test on master, hopefully until today EOB.

It took me much longer than anticipated (I had some trouble setting up
the host again), but I can confirm that the patch works as expected on
top of 4.11-rc5.

Without the patch, makedumpfile in the crash kernel complains:

/proc/vmcore doesn't contain vmcoreinfo.

With the patch applied, dumping still fails later because of an
unrelated bug in makedumpfile, but I was able to extract the kernel
message buffer with "makedumpfile --dump-dmesg". This already confirms
VMCOREINFO presence and usability.

So finally,

Tested-by: Petr Tesarik <ptesa...@suse.com>

Petr Tesarik

Re: [Xen-devel] [PATCH v2] xen, kdump: handle pv domain in paddr_vmcoreinfo_note()

2017-04-07 Thread Petr Tesarik

On Wed, 5 Apr 2017 13:13:00 +0200
Petr Tesarik  wrote:

> On Tue, 4 Apr 2017 12:42:53 -0700 (PDT)
> Daniel Kiper  wrote:
> 
>[...]
> > So, if Petr did relevant tests that is nice. However, then, IMO, this
> > patch begs Petr Tested-by.
> 
> Actually, I tested with this patch applied on top of kernel 4.4 (SLES
> 12 SP2). It matches what traditional Xen had always done, so I am quite
> confident it will work with a later kernel, but to give my Tested-by,
> let me first re-run the test on master, hopefully until today EOB.

It took me much longer than anticipated (I had some trouble setting up
the host again), but I can confirm that the patch works as expected on
top of 4.11-rc5.

Without the patch, makedumpfile in the crash kernel complains:

/proc/vmcore doesn't contain vmcoreinfo.

With the patch applied, dumping still fails later because of an
unrelated bug in makedumpfile, but I was able to extract the kernel
message buffer with "makedumpfile --dump-dmesg". This already confirms
VMCOREINFO presence and usability.

So finally,

Tested-by: Petr Tesarik 

Petr Tesarik

Re: [Xen-devel] [PATCH v2] xen, kdump: handle pv domain in paddr_vmcoreinfo_note()

2017-04-05 Thread Petr Tesarik

On Tue, 4 Apr 2017 12:42:53 -0700 (PDT)
Daniel Kiper  wrote:

> > On 03/04/17 14:42, Daniel Kiper wrote:
> > > On Fri, Mar 31, 2017 at 12:14:38PM +0200, Juergen Gross wrote:
> > >> For kdump to work correctly it needs the physical address of
> > >> vmcoreinfo_note. When running as dom0 this means the virtual address
> > >> has to be translated to the related machine address.
> > >>
> > >> paddr_vmcoreinfo_note() is meant to do the translation via
> > >> __pa_symbol() only, but being attributed "weak" it can be replaced
> > >> easily in Xen case.
> > >>
> > >> Signed-off-by: Juergen Gross 
> > >
> > > Have you tested this patch with latest crash tool? Do dom0 and Xen
> > > hypervisor analysis work without any issue (at least basic commands
> > > like dmesg, bt, ps, etc.)? If yes for both you can add:
> > >
> > > Reviewed-by: Daniel Kiper 
> >
> > This patch isn't for dump analysis, but for dump creation. Petr has
> 
> I know that. However, it may have impact on crash analysis. So,
> I would expect that you or anybody else in your behalf will do
> at least minimal crash tool tests.
> 
> > verified that the dump is in the expected format. Please ask Petr
> > for further details, e.g. user side modifications being necessary.
> 
> So, if Petr did relevant tests that is nice. However, then, IMO, this
> patch begs Petr Tested-by.

Actually, I tested with this patch applied on top of kernel 4.4 (SLES
12 SP2). It matches what traditional Xen had always done, so I am quite
confident it will work with a later kernel, but to give my Tested-by,
let me first re-run the test on master, hopefully until today EOB.

HTH,
Petr T

Re: [Xen-devel] [PATCH v2] xen, kdump: handle pv domain in paddr_vmcoreinfo_note()

2017-04-05 Thread Petr Tesarik

On Tue, 4 Apr 2017 12:42:53 -0700 (PDT)
Daniel Kiper  wrote:

> > On 03/04/17 14:42, Daniel Kiper wrote:
> > > On Fri, Mar 31, 2017 at 12:14:38PM +0200, Juergen Gross wrote:
> > >> For kdump to work correctly it needs the physical address of
> > >> vmcoreinfo_note. When running as dom0 this means the virtual address
> > >> has to be translated to the related machine address.
> > >>
> > >> paddr_vmcoreinfo_note() is meant to do the translation via
> > >> __pa_symbol() only, but being attributed "weak" it can be replaced
> > >> easily in Xen case.
> > >>
> > >> Signed-off-by: Juergen Gross 
> > >
> > > Have you tested this patch with latest crash tool? Do dom0 and Xen
> > > hypervisor analysis work without any issue (at least basic commands
> > > like dmesg, bt, ps, etc.)? If yes for both you can add:
> > >
> > > Reviewed-by: Daniel Kiper 
> >
> > This patch isn't for dump analysis, but for dump creation. Petr has
> 
> I know that. However, it may have impact on crash analysis. So,
> I would expect that you or anybody else in your behalf will do
> at least minimal crash tool tests.
> 
> > verified that the dump is in the expected format. Please ask Petr
> > for further details, e.g. user side modifications being necessary.
> 
> So, if Petr did relevant tests that is nice. However, then, IMO, this
> patch begs Petr Tested-by.

Actually, I tested with this patch applied on top of kernel 4.4 (SLES
12 SP2). It matches what traditional Xen had always done, so I am quite
confident it will work with a later kernel, but to give my Tested-by,
let me first re-run the test on master, hopefully until today EOB.

HTH,
Petr T

Re: [PATCH v3 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-03-21 Thread Petr Tesarik

On Mon, 20 Mar 2017 13:50:31 +0800
Xunlei Pang  wrote:

> As Eric said,
> "what we need to do is move the variable vmcoreinfo_note out
> of the kernel's .bss section.  And modify the code to regenerate
> and keep this information in something like the control page.
> 
> Definitely something like this needs a page all to itself, and ideally
> far away from any other kernel data structures.  I clearly was not
> watching closely the data someone decided to keep this silly thing
> in the kernel's .bss section."
> 
> This patch allocates extra pages for these vmcoreinfo_XXX variables,
> one advantage is that it enhances some safety of vmcoreinfo, because
> vmcoreinfo now is kept far away from other kernel data structures.

Yes, I like this patch set very much now. Thank you!

Petr T

Re: [PATCH v3 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-03-21 Thread Petr Tesarik

On Mon, 20 Mar 2017 13:50:31 +0800
Xunlei Pang  wrote:

> As Eric said,
> "what we need to do is move the variable vmcoreinfo_note out
> of the kernel's .bss section.  And modify the code to regenerate
> and keep this information in something like the control page.
> 
> Definitely something like this needs a page all to itself, and ideally
> far away from any other kernel data structures.  I clearly was not
> watching closely the data someone decided to keep this silly thing
> in the kernel's .bss section."
> 
> This patch allocates extra pages for these vmcoreinfo_XXX variables,
> one advantage is that it enhances some safety of vmcoreinfo, because
> vmcoreinfo now is kept far away from other kernel data structures.

Yes, I like this patch set very much now. Thank you!

Petr T

Re: [PATCH] kexec: Update vmcoreinfo after crash happened

2017-03-20 Thread Petr Tesarik

On Mon, 20 Mar 2017 10:17:42 +0800
Xunlei Pang <xp...@redhat.com> wrote:

> On 03/19/2017 at 02:23 AM, Petr Tesarik wrote:
> > On Thu, 16 Mar 2017 21:40:58 +0800
> > Xunlei Pang <xp...@redhat.com> wrote:
> >
> >> On 03/16/2017 at 09:18 PM, Baoquan He wrote:
> >>> On 03/16/17 at 08:36pm, Xunlei Pang wrote:
> >>>> On 03/16/2017 at 08:27 PM, Baoquan He wrote:
> >>>>> Hi Xunlei,
> >>>>>
> >>>>> Did you really see this ever happened? Because the vmcore size estimate
> >>>>> feature, namely --mem-usage option of makedumpfile, depends on the
> >>>>> vmcoreinfo in 1st kernel, your change will break it.
> >>>> Hi Baoquan,
> >>>>
> >>>> I can reproduce it using a kernel module which modifies the vmcoreinfo,
> >>>> so it's a problem can actually happen.
> >>>>
> >>>>> If not, it could be not good to change that.
> >>>> That's a good point, then I guess we can keep the 
> >>>> crash_save_vmcoreinfo_init(),
> >>>> and store again all the vmcoreinfo after crash. What do you think?
> >>> Well, then it will make makedumpfile segfault happen too when execute
> >>> below command in 1st kernel if it existed:
> >>>   makedumpfile --mem-usage /proc/kcore
> >> Yes, if the initial vmcoreinfo data was modified before "makedumpfile 
> >> --mem-usage", it might happen,
> >> after all the system is going something wrong. And that's why we deploy 
> >> kdump service at the very
> >> beginning when the system has a low possibility of going wrong.
> >>
> >> But we have to guarantee kdump vmcore can be generated correctly as 
> >> possible as it can.
> >>
> >>> So we still need to face that problem and need fix it. vmcoreinfo_note
> >>> is in kernel data area, how does module intrude into this area? And can
> >>> we fix the module code?
> >>>
> >> Bugs always exist in products, we can't know what will happen and fix all 
> >> the errors,
> >> that's why we need kdump.
> >>
> >> I think the following update should guarantee the correct vmcoreinfo for 
> >> kdump.
> > I'm still not convinced. I would probably have more trust in a clean
> > kernel (after boot) than a kernel that has already crashed (presumably
> > because of a serious bug). How can be reliability improved by running
> > more code in unsafe environment?
> 
> Correct, I realized that, so used crc32 to protect the original data,
> but since Eric left a more reasonable idea, I will try that later.
> 
> >
> > If some code overwrites reserved areas (such as vmcoreinfo), then it's
> > seriously buggy. And in my opinion, it is more difficult to identify
> > such bugs if they are masked by re-initializing vmcoreinfo after crash.
> > In fact, if makedumpfile in the kexec'ed kernel complains that it
> > didn't find valid VMCOREINFO content, that's already a hint.
> >
> > As a side note, if you're debugging a vmcoreinfo corruption, it's
> > possible to use a standalone VMCOREINFO file with makedumpfile, so you
> > can pre-generate it and save it in the kdump initrd.
> >
> > In short, I don't see a compelling case for this change.
> 
> E.g. 1) wrong code overwrites vmcoreinfo_data; 2) further crashes the
> system; 3) trigger kdump, then we obviously will fail to recognize the
> crash context correctly due to the corrupted vmcoreinfo.  Everyone
> will get confused if met such unfortunate customer-side issue.
> 
> Although it's corner case, if it's easy to fix, then I think we better do it.
> 
> Now except for vmcoreinfo, all the crash data is well protected (including
> cpu note which is fully updated in the crash path, thus its correctness is
> guaranteed).

Hm, I think we shouldn't combine the two things.

Protecting VMCOREINFO with SHA (just as the other information passed to
the secondary kernel) sounds right to me. Re-creating the info while
the kernel is already crashing does not sound particularly good.

Yes, your patch may help in some scenarios, but in general it also
increases the amount of code that must reliably work in a crashed
environment. I can still recall why the LKCD approach (save the dump
directly from the crashed kernel) was abandoned...

Apart, there's a lot of other information that might be corrupted (e.g.
the purgatory code, elfcorehdr, secondary kernel, or the initrd).

Why is this VMCOREINFO so special?

Regards,
Petr Tesarik

Re: [PATCH] kexec: Update vmcoreinfo after crash happened

2017-03-20 Thread Petr Tesarik

On Mon, 20 Mar 2017 10:17:42 +0800
Xunlei Pang  wrote:

> On 03/19/2017 at 02:23 AM, Petr Tesarik wrote:
> > On Thu, 16 Mar 2017 21:40:58 +0800
> > Xunlei Pang  wrote:
> >
> >> On 03/16/2017 at 09:18 PM, Baoquan He wrote:
> >>> On 03/16/17 at 08:36pm, Xunlei Pang wrote:
> >>>> On 03/16/2017 at 08:27 PM, Baoquan He wrote:
> >>>>> Hi Xunlei,
> >>>>>
> >>>>> Did you really see this ever happened? Because the vmcore size estimate
> >>>>> feature, namely --mem-usage option of makedumpfile, depends on the
> >>>>> vmcoreinfo in 1st kernel, your change will break it.
> >>>> Hi Baoquan,
> >>>>
> >>>> I can reproduce it using a kernel module which modifies the vmcoreinfo,
> >>>> so it's a problem can actually happen.
> >>>>
> >>>>> If not, it could be not good to change that.
> >>>> That's a good point, then I guess we can keep the 
> >>>> crash_save_vmcoreinfo_init(),
> >>>> and store again all the vmcoreinfo after crash. What do you think?
> >>> Well, then it will make makedumpfile segfault happen too when execute
> >>> below command in 1st kernel if it existed:
> >>>   makedumpfile --mem-usage /proc/kcore
> >> Yes, if the initial vmcoreinfo data was modified before "makedumpfile 
> >> --mem-usage", it might happen,
> >> after all the system is going something wrong. And that's why we deploy 
> >> kdump service at the very
> >> beginning when the system has a low possibility of going wrong.
> >>
> >> But we have to guarantee kdump vmcore can be generated correctly as 
> >> possible as it can.
> >>
> >>> So we still need to face that problem and need fix it. vmcoreinfo_note
> >>> is in kernel data area, how does module intrude into this area? And can
> >>> we fix the module code?
> >>>
> >> Bugs always exist in products, we can't know what will happen and fix all 
> >> the errors,
> >> that's why we need kdump.
> >>
> >> I think the following update should guarantee the correct vmcoreinfo for 
> >> kdump.
> > I'm still not convinced. I would probably have more trust in a clean
> > kernel (after boot) than a kernel that has already crashed (presumably
> > because of a serious bug). How can be reliability improved by running
> > more code in unsafe environment?
> 
> Correct, I realized that, so used crc32 to protect the original data,
> but since Eric left a more reasonable idea, I will try that later.
> 
> >
> > If some code overwrites reserved areas (such as vmcoreinfo), then it's
> > seriously buggy. And in my opinion, it is more difficult to identify
> > such bugs if they are masked by re-initializing vmcoreinfo after crash.
> > In fact, if makedumpfile in the kexec'ed kernel complains that it
> > didn't find valid VMCOREINFO content, that's already a hint.
> >
> > As a side note, if you're debugging a vmcoreinfo corruption, it's
> > possible to use a standalone VMCOREINFO file with makedumpfile, so you
> > can pre-generate it and save it in the kdump initrd.
> >
> > In short, I don't see a compelling case for this change.
> 
> E.g. 1) wrong code overwrites vmcoreinfo_data; 2) further crashes the
> system; 3) trigger kdump, then we obviously will fail to recognize the
> crash context correctly due to the corrupted vmcoreinfo.  Everyone
> will get confused if met such unfortunate customer-side issue.
> 
> Although it's corner case, if it's easy to fix, then I think we better do it.
> 
> Now except for vmcoreinfo, all the crash data is well protected (including
> cpu note which is fully updated in the crash path, thus its correctness is
> guaranteed).

Hm, I think we shouldn't combine the two things.

Protecting VMCOREINFO with SHA (just as the other information passed to
the secondary kernel) sounds right to me. Re-creating the info while
the kernel is already crashing does not sound particularly good.

Yes, your patch may help in some scenarios, but in general it also
increases the amount of code that must reliably work in a crashed
environment. I can still recall why the LKCD approach (save the dump
directly from the crashed kernel) was abandoned...

Apart, there's a lot of other information that might be corrupted (e.g.
the purgatory code, elfcorehdr, secondary kernel, or the initrd).

Why is this VMCOREINFO so special?

Regards,
Petr Tesarik

Re: [PATCH] kexec: Update vmcoreinfo after crash happened

2017-03-18 Thread Petr Tesarik

On Thu, 16 Mar 2017 21:40:58 +0800
Xunlei Pang  wrote:

> On 03/16/2017 at 09:18 PM, Baoquan He wrote:
> > On 03/16/17 at 08:36pm, Xunlei Pang wrote:
> >> On 03/16/2017 at 08:27 PM, Baoquan He wrote:
> >>> Hi Xunlei,
> >>>
> >>> Did you really see this ever happened? Because the vmcore size estimate
> >>> feature, namely --mem-usage option of makedumpfile, depends on the
> >>> vmcoreinfo in 1st kernel, your change will break it.
> >> Hi Baoquan,
> >>
> >> I can reproduce it using a kernel module which modifies the vmcoreinfo,
> >> so it's a problem can actually happen.
> >>
> >>> If not, it could be not good to change that.
> >> That's a good point, then I guess we can keep the 
> >> crash_save_vmcoreinfo_init(),
> >> and store again all the vmcoreinfo after crash. What do you think?
> > Well, then it will make makedumpfile segfault happen too when execute
> > below command in 1st kernel if it existed:
> > makedumpfile --mem-usage /proc/kcore
> 
> Yes, if the initial vmcoreinfo data was modified before "makedumpfile 
> --mem-usage", it might happen,
> after all the system is going something wrong. And that's why we deploy kdump 
> service at the very
> beginning when the system has a low possibility of going wrong.
> 
> But we have to guarantee kdump vmcore can be generated correctly as possible 
> as it can.
> 
> >
> > So we still need to face that problem and need fix it. vmcoreinfo_note
> > is in kernel data area, how does module intrude into this area? And can
> > we fix the module code?
> >
> 
> Bugs always exist in products, we can't know what will happen and fix all the 
> errors,
> that's why we need kdump.
> 
> I think the following update should guarantee the correct vmcoreinfo for 
> kdump.

I'm still not convinced. I would probably have more trust in a clean
kernel (after boot) than a kernel that has already crashed (presumably
because of a serious bug). How can be reliability improved by running
more code in unsafe environment?

If some code overwrites reserved areas (such as vmcoreinfo), then it's
seriously buggy. And in my opinion, it is more difficult to identify
such bugs if they are masked by re-initializing vmcoreinfo after crash.
In fact, if makedumpfile in the kexec'ed kernel complains that it
didn't find valid VMCOREINFO content, that's already a hint.

As a side note, if you're debugging a vmcoreinfo corruption, it's
possible to use a standalone VMCOREINFO file with makedumpfile, so you
can pre-generate it and save it in the kdump initrd.

In short, I don't see a compelling case for this change.

Just my two cents,
Petr T

Re: [PATCH] kexec: Update vmcoreinfo after crash happened

2017-03-18 Thread Petr Tesarik

On Thu, 16 Mar 2017 21:40:58 +0800
Xunlei Pang  wrote:

> On 03/16/2017 at 09:18 PM, Baoquan He wrote:
> > On 03/16/17 at 08:36pm, Xunlei Pang wrote:
> >> On 03/16/2017 at 08:27 PM, Baoquan He wrote:
> >>> Hi Xunlei,
> >>>
> >>> Did you really see this ever happened? Because the vmcore size estimate
> >>> feature, namely --mem-usage option of makedumpfile, depends on the
> >>> vmcoreinfo in 1st kernel, your change will break it.
> >> Hi Baoquan,
> >>
> >> I can reproduce it using a kernel module which modifies the vmcoreinfo,
> >> so it's a problem can actually happen.
> >>
> >>> If not, it could be not good to change that.
> >> That's a good point, then I guess we can keep the 
> >> crash_save_vmcoreinfo_init(),
> >> and store again all the vmcoreinfo after crash. What do you think?
> > Well, then it will make makedumpfile segfault happen too when execute
> > below command in 1st kernel if it existed:
> > makedumpfile --mem-usage /proc/kcore
> 
> Yes, if the initial vmcoreinfo data was modified before "makedumpfile 
> --mem-usage", it might happen,
> after all the system is going something wrong. And that's why we deploy kdump 
> service at the very
> beginning when the system has a low possibility of going wrong.
> 
> But we have to guarantee kdump vmcore can be generated correctly as possible 
> as it can.
> 
> >
> > So we still need to face that problem and need fix it. vmcoreinfo_note
> > is in kernel data area, how does module intrude into this area? And can
> > we fix the module code?
> >
> 
> Bugs always exist in products, we can't know what will happen and fix all the 
> errors,
> that's why we need kdump.
> 
> I think the following update should guarantee the correct vmcoreinfo for 
> kdump.

I'm still not convinced. I would probably have more trust in a clean
kernel (after boot) than a kernel that has already crashed (presumably
because of a serious bug). How can be reliability improved by running
more code in unsafe environment?

If some code overwrites reserved areas (such as vmcoreinfo), then it's
seriously buggy. And in my opinion, it is more difficult to identify
such bugs if they are masked by re-initializing vmcoreinfo after crash.
In fact, if makedumpfile in the kexec'ed kernel complains that it
didn't find valid VMCOREINFO content, that's already a hint.

As a side note, if you're debugging a vmcoreinfo corruption, it's
possible to use a standalone VMCOREINFO file with makedumpfile, so you
can pre-generate it and save it in the kdump initrd.

In short, I don't see a compelling case for this change.

Just my two cents,
Petr T

Re: [PATCH 1/2] Add a kexec_crash_loaded() function

2016-07-13 Thread Petr Tesarik

On Wed, 13 Jul 2016 05:52:33 -0700
Josh Triplett <j...@joshtriplett.org> wrote:

> On Wed, Jul 13, 2016 at 02:19:55PM +0200, Petr Tesarik wrote:
> > --- a/kernel/kexec_core.c
> > +++ b/kernel/kexec_core.c
> > @@ -95,6 +95,12 @@ int kexec_should_crash(struct task_struct *p)
> > return 0;
> >  }
> >  
> > +int kexec_crash_loaded(void)
> > +{
> > +   return !!kexec_crash_image;
> > +}
> 
> Nit: this function should return bool rather than int, since it
> effectively returns true/false.

I thought about this for a moment. Since the return value should also
be used for the corresponding sysctl, which is an integer, I wasn't
sure if bool is the correct type, especially since other boolean
functions in kexec.h also return int... But that could be legacy.

I've got no problem changing the type to 'bool' if that's what it takes.

Petr T

Re: [PATCH 1/2] Add a kexec_crash_loaded() function

2016-07-13 Thread Petr Tesarik

On Wed, 13 Jul 2016 05:52:33 -0700
Josh Triplett  wrote:

> On Wed, Jul 13, 2016 at 02:19:55PM +0200, Petr Tesarik wrote:
> > --- a/kernel/kexec_core.c
> > +++ b/kernel/kexec_core.c
> > @@ -95,6 +95,12 @@ int kexec_should_crash(struct task_struct *p)
> > return 0;
> >  }
> >  
> > +int kexec_crash_loaded(void)
> > +{
> > +   return !!kexec_crash_image;
> > +}
> 
> Nit: this function should return bool rather than int, since it
> effectively returns true/false.

I thought about this for a moment. Since the return value should also
be used for the corresponding sysctl, which is an integer, I wasn't
sure if bool is the correct type, especially since other boolean
functions in kexec.h also return int... But that could be legacy.

I've got no problem changing the type to 'bool' if that's what it takes.

Petr T

Re: [PATCH 0/2] Allow crash dumps with crash_kexec_post_notifiers

2016-07-13 Thread Petr Tesarik

On Wed, 13 Jul 2016 14:19:50 +0200
Petr Tesarik <ptesa...@suse.com> wrote:

> Hello all,
> 
> this patch series makes it possible to save a kernel crash dump when the
> kernel command line includes "crash_kexec_post_notifiers".

Oh ... I forgot to add: This only applies to running Linux under Xen.
If you run on bare metal, you can always save the dump already, as you
certainly know.

Petr T

> There might
> be other approaches, but mine has the advantage that no new sysctl is
> required, and the behaviour is the same whether panic notifiers are run
> or not: If you load a crash kernel with kexec, it will be used, otherwise
> the hypervisor facility is used (using a hypercall).
> 
> Feedback welcome!
> 
> Petr T
> 
> ---
> 
> Petr Tesarik (2):
>   Add a kexec_crash_loaded() function
>   Allow kdump with crash_kexec_post_notifiers
> 
> 
>  arch/x86/xen/enlighten.c |3 ++-
>  include/linux/kexec.h|2 ++
>  kernel/kexec_core.c  |5 +
>  kernel/ksysfs.c  |2 +-
>  4 files changed, 10 insertions(+), 2 deletions(-)
> 
> --
> Signature
> 
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 0/2] Allow crash dumps with crash_kexec_post_notifiers

2016-07-13 Thread Petr Tesarik

On Wed, 13 Jul 2016 14:19:50 +0200
Petr Tesarik  wrote:

> Hello all,
> 
> this patch series makes it possible to save a kernel crash dump when the
> kernel command line includes "crash_kexec_post_notifiers".

Oh ... I forgot to add: This only applies to running Linux under Xen.
If you run on bare metal, you can always save the dump already, as you
certainly know.

Petr T

> There might
> be other approaches, but mine has the advantage that no new sysctl is
> required, and the behaviour is the same whether panic notifiers are run
> or not: If you load a crash kernel with kexec, it will be used, otherwise
> the hypervisor facility is used (using a hypercall).
> 
> Feedback welcome!
> 
> Petr T
> 
> ---
> 
> Petr Tesarik (2):
>   Add a kexec_crash_loaded() function
>   Allow kdump with crash_kexec_post_notifiers
> 
> 
>  arch/x86/xen/enlighten.c |3 ++-
>  include/linux/kexec.h|2 ++
>  kernel/kexec_core.c  |5 +
>  kernel/ksysfs.c  |2 +-
>  4 files changed, 10 insertions(+), 2 deletions(-)
> 
> --
> Signature
> 
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

[PATCH 0/2] Allow crash dumps with crash_kexec_post_notifiers

2016-07-13 Thread Petr Tesarik

Hello all,

this patch series makes it possible to save a kernel crash dump when the
kernel command line includes "crash_kexec_post_notifiers". There might
be other approaches, but mine has the advantage that no new sysctl is
required, and the behaviour is the same whether panic notifiers are run
or not: If you load a crash kernel with kexec, it will be used, otherwise
the hypervisor facility is used (using a hypercall).

Feedback welcome!

Petr T

---

Petr Tesarik (2):
  Add a kexec_crash_loaded() function
  Allow kdump with crash_kexec_post_notifiers


 arch/x86/xen/enlighten.c |3 ++-
 include/linux/kexec.h|2 ++
 kernel/kexec_core.c  |5 +
 kernel/ksysfs.c  |2 +-
 4 files changed, 10 insertions(+), 2 deletions(-)

--
Signature

[PATCH 1/2] Add a kexec_crash_loaded() function

2016-07-13 Thread Petr Tesarik

Provide a wrapper function to be used by kernel code to check whether
a crash kernel is loaded. It returns the same value that can be seen
in /sys/kernel/kexec_crash_loaded by userspace programs.

I'm exporting the function, because it will be used by Xen, and it is
possible to compile Xen modules separately to enable the use of PV
drivers with unmodified bare-metal kernels.

Signed-off-by: Petr Tesarik <ptesa...@suse.com>
---
 include/linux/kexec.h |2 ++
 kernel/kexec_core.c   |6 ++
 kernel/ksysfs.c   |2 +-
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index e8acb2b..d461d02 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -228,6 +228,7 @@ extern void *kexec_purgatory_get_symbol_addr(struct kimage 
*image,
 extern void __crash_kexec(struct pt_regs *);
 extern void crash_kexec(struct pt_regs *);
 int kexec_should_crash(struct task_struct *);
+int kexec_crash_loaded(void);
 void crash_save_cpu(struct pt_regs *regs, int cpu);
 void crash_save_vmcoreinfo(void);
 void arch_crash_save_vmcoreinfo(void);
@@ -324,6 +325,7 @@ struct task_struct;
 static inline void __crash_kexec(struct pt_regs *regs) { }
 static inline void crash_kexec(struct pt_regs *regs) { }
 static inline int kexec_should_crash(struct task_struct *p) { return 0; }
+static inline int kexec_crash_loaded(void) { return 0; }
 #define kexec_in_progress false
 #endif /* CONFIG_KEXEC_CORE */
 
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 56b3ed0..a303dce 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -95,6 +95,12 @@ int kexec_should_crash(struct task_struct *p)
return 0;
 }
 
+int kexec_crash_loaded(void)
+{
+   return !!kexec_crash_image;
+}
+EXPORT_SYMBOL_GPL(kexec_crash_loaded);
+
 /*
  * When kexec transitions to the new kernel there is a one-to-one
  * mapping between physical and virtual addresses.  On processors
diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
index 152da4a..bf9fc9d 100644
--- a/kernel/ksysfs.c
+++ b/kernel/ksysfs.c
@@ -101,7 +101,7 @@ KERNEL_ATTR_RO(kexec_loaded);
 static ssize_t kexec_crash_loaded_show(struct kobject *kobj,
   struct kobj_attribute *attr, char *buf)
 {
-   return sprintf(buf, "%d\n", !!kexec_crash_image);
+   return sprintf(buf, "%d\n", kexec_crash_loaded());
 }
 KERNEL_ATTR_RO(kexec_crash_loaded);

[PATCH 0/2] Allow crash dumps with crash_kexec_post_notifiers

2016-07-13 Thread Petr Tesarik

Hello all,

this patch series makes it possible to save a kernel crash dump when the
kernel command line includes "crash_kexec_post_notifiers". There might
be other approaches, but mine has the advantage that no new sysctl is
required, and the behaviour is the same whether panic notifiers are run
or not: If you load a crash kernel with kexec, it will be used, otherwise
the hypervisor facility is used (using a hypercall).

Feedback welcome!

Petr T

---

Petr Tesarik (2):
  Add a kexec_crash_loaded() function
  Allow kdump with crash_kexec_post_notifiers


 arch/x86/xen/enlighten.c |3 ++-
 include/linux/kexec.h|2 ++
 kernel/kexec_core.c  |5 +
 kernel/ksysfs.c  |2 +-
 4 files changed, 10 insertions(+), 2 deletions(-)

--
Signature

[PATCH 1/2] Add a kexec_crash_loaded() function

2016-07-13 Thread Petr Tesarik

Provide a wrapper function to be used by kernel code to check whether
a crash kernel is loaded. It returns the same value that can be seen
in /sys/kernel/kexec_crash_loaded by userspace programs.

I'm exporting the function, because it will be used by Xen, and it is
possible to compile Xen modules separately to enable the use of PV
drivers with unmodified bare-metal kernels.

Signed-off-by: Petr Tesarik 
---
 include/linux/kexec.h |2 ++
 kernel/kexec_core.c   |6 ++
 kernel/ksysfs.c   |2 +-
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index e8acb2b..d461d02 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -228,6 +228,7 @@ extern void *kexec_purgatory_get_symbol_addr(struct kimage 
*image,
 extern void __crash_kexec(struct pt_regs *);
 extern void crash_kexec(struct pt_regs *);
 int kexec_should_crash(struct task_struct *);
+int kexec_crash_loaded(void);
 void crash_save_cpu(struct pt_regs *regs, int cpu);
 void crash_save_vmcoreinfo(void);
 void arch_crash_save_vmcoreinfo(void);
@@ -324,6 +325,7 @@ struct task_struct;
 static inline void __crash_kexec(struct pt_regs *regs) { }
 static inline void crash_kexec(struct pt_regs *regs) { }
 static inline int kexec_should_crash(struct task_struct *p) { return 0; }
+static inline int kexec_crash_loaded(void) { return 0; }
 #define kexec_in_progress false
 #endif /* CONFIG_KEXEC_CORE */
 
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 56b3ed0..a303dce 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -95,6 +95,12 @@ int kexec_should_crash(struct task_struct *p)
return 0;
 }
 
+int kexec_crash_loaded(void)
+{
+   return !!kexec_crash_image;
+}
+EXPORT_SYMBOL_GPL(kexec_crash_loaded);
+
 /*
  * When kexec transitions to the new kernel there is a one-to-one
  * mapping between physical and virtual addresses.  On processors
diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
index 152da4a..bf9fc9d 100644
--- a/kernel/ksysfs.c
+++ b/kernel/ksysfs.c
@@ -101,7 +101,7 @@ KERNEL_ATTR_RO(kexec_loaded);
 static ssize_t kexec_crash_loaded_show(struct kobject *kobj,
   struct kobj_attribute *attr, char *buf)
 {
-   return sprintf(buf, "%d\n", !!kexec_crash_image);
+   return sprintf(buf, "%d\n", kexec_crash_loaded());
 }
 KERNEL_ATTR_RO(kexec_crash_loaded);

[PATCH 2/2] Allow kdump with crash_kexec_post_notifiers

2016-07-13 Thread Petr Tesarik

If a crash kernel is loaded, do not crash the running domain. This is
needed if the kernel is loaded with crash_kexec_post_notifiers, because
panic notifiers are run before __crash_kexec() in that case, and this
Xen hook prevents its being called later.

Signed-off-by: Petr Tesarik <ptesa...@suse.com>
---
 arch/x86/xen/enlighten.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 760789a..6e3e7c6 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1325,7 +1325,8 @@ static void xen_crash_shutdown(struct pt_regs *regs)
 static int
 xen_panic_event(struct notifier_block *this, unsigned long event, void *ptr)
 {
-   xen_reboot(SHUTDOWN_crash);
+   if (!kexec_crash_loaded())
+   xen_reboot(SHUTDOWN_crash);
return NOTIFY_DONE;
 }

[PATCH 2/2] Allow kdump with crash_kexec_post_notifiers

2016-07-13 Thread Petr Tesarik

If a crash kernel is loaded, do not crash the running domain. This is
needed if the kernel is loaded with crash_kexec_post_notifiers, because
panic notifiers are run before __crash_kexec() in that case, and this
Xen hook prevents its being called later.

Signed-off-by: Petr Tesarik 
---
 arch/x86/xen/enlighten.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 760789a..6e3e7c6 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1325,7 +1325,8 @@ static void xen_crash_shutdown(struct pt_regs *regs)
 static int
 xen_panic_event(struct notifier_block *this, unsigned long event, void *ptr)
 {
-   xen_reboot(SHUTDOWN_crash);
+   if (!kexec_crash_loaded())
+   xen_reboot(SHUTDOWN_crash);
return NOTIFY_DONE;
 }

Re: [RFC 0/3] extend kexec_file_load system call

2016-07-13 Thread Petr Tesarik

On Wed, 13 Jul 2016 09:26:39 +0100
Russell King - ARM Linux <li...@armlinux.org.uk> wrote:

> On Wed, Jul 13, 2016 at 05:55:33PM +1000, Stewart Smith wrote:
> > Russell King - ARM Linux <li...@armlinux.org.uk> writes:
> > > On Wed, Jul 13, 2016 at 02:59:51PM +1000, Stewart Smith wrote:
> > >> Russell King - ARM Linux <li...@armlinux.org.uk> writes:
> > >> > On Tue, Jul 12, 2016 at 10:58:05PM +0200, Petr Tesarik wrote:
> > >> >> I'm not an expert on DTB, so I can't provide an example of code
> > >> >> execution, but you have already mentioned the 
> > >> >> /chosen/linux,stdout-path
> > >> >> property. If an attacker redirects the bootloader to an insecure
> > >> >> console, they may get access to the system that would otherwise be
> > >> >> impossible.
> > >> >
> > >> > I fail to see how kexec connects with the boot loader - the DTB image
> > >> > that's being talked about is one which is passed from the currently
> > >> > running kernel to the to-be-kexec'd kernel.  For ARM (and I suspect
> > >> > also ARM64) that's a direct call chain which doesn't involve any
> > >> > boot loader or firmware, and certainly none that would involve the
> > >> > passed DTB image.
> > >> 
> > >> For OpenPOWER machines, kexec is the bootloader. Our bootloader is a
> > >> linux kernel and initramfs with a UI (petitboot) - this means we never
> > >> have to write a device driver twice: write a kernel one and you're done
> > >> (for booting from the device and using it in your OS).
> > >
> > > I think you misunderstood my point.
> > >
> > > On ARM, we do not go:
> > >
> > >   kernel (kexec'd from) -> boot loader -> kernel (kexec'd to)
> > >
> > > but we go:
> > >
> > >   kernel (kexec'd from) -> kernel (kexec'd to)
> > >
> > > There's no intermediate step involving any bootloader.
> > >
> > > Hence, my point is that the dtb loaded by kexec is _only_ used by the
> > > kernel which is being kexec'd to, not by the bootloader, nor indeed
> > > the kernel which it is loaded into.
> > >
> > > Moreover, if you read the bit that I quoted (which is what I was
> > > replying to), you'll notice that it is talking about the DTB loaded
> > > by kexec somehow causing the _bootloader_ to be redirected to an
> > > alternative console.  This point is wholely false on ARM.
> > 
> > Ahh.. I missed the bootloader bit there.
> > 
> > In which case, we're the same on OpenPOWER, there is no intermediate
> > bootloader - in our case we have linux (with kexec) taking on what uboot
> > or grub is typically used for on other platforms.
> 
> Indeed - maybe Eric knows better, but I can't see any situation where
> the dtb we load via kexec should ever affect "the bootloader", unless
> the "kernel" that's being loaded into kexec is "the bootloader".
> 
> Now, going back to the more fundamental issue raised in my first reply,
> about the kernel command line.
> 
> On x86, I can see that it _is_ possible for userspace to specify a
> command line, and the kernel loading the image provides the command
> line to the to-be-kexeced kernel with very little checking.  So, if
> your kernel is signed, what stops the "insecure userspace" loading
> a signed kernel but giving it an insecure rootfs and/or console?

This is a valid point. If there are kernel options that can be misused
to defeat the purpose of UEFI SecureBoot, then we're in trouble.
Generally, the Linux kernel should treat the command line as untrusted
source.

My point is that modifying the DTB opens a completely new attack
vector. And the goal is not extending the attack surface (because there
are holes in it already), but reducing the attack surface (e.g. by
limiting available kernel command line options).

Petr T

Re: [RFC 0/3] extend kexec_file_load system call

2016-07-13 Thread Petr Tesarik

On Wed, 13 Jul 2016 09:26:39 +0100
Russell King - ARM Linux  wrote:

> On Wed, Jul 13, 2016 at 05:55:33PM +1000, Stewart Smith wrote:
> > Russell King - ARM Linux  writes:
> > > On Wed, Jul 13, 2016 at 02:59:51PM +1000, Stewart Smith wrote:
> > >> Russell King - ARM Linux  writes:
> > >> > On Tue, Jul 12, 2016 at 10:58:05PM +0200, Petr Tesarik wrote:
> > >> >> I'm not an expert on DTB, so I can't provide an example of code
> > >> >> execution, but you have already mentioned the 
> > >> >> /chosen/linux,stdout-path
> > >> >> property. If an attacker redirects the bootloader to an insecure
> > >> >> console, they may get access to the system that would otherwise be
> > >> >> impossible.
> > >> >
> > >> > I fail to see how kexec connects with the boot loader - the DTB image
> > >> > that's being talked about is one which is passed from the currently
> > >> > running kernel to the to-be-kexec'd kernel.  For ARM (and I suspect
> > >> > also ARM64) that's a direct call chain which doesn't involve any
> > >> > boot loader or firmware, and certainly none that would involve the
> > >> > passed DTB image.
> > >> 
> > >> For OpenPOWER machines, kexec is the bootloader. Our bootloader is a
> > >> linux kernel and initramfs with a UI (petitboot) - this means we never
> > >> have to write a device driver twice: write a kernel one and you're done
> > >> (for booting from the device and using it in your OS).
> > >
> > > I think you misunderstood my point.
> > >
> > > On ARM, we do not go:
> > >
> > >   kernel (kexec'd from) -> boot loader -> kernel (kexec'd to)
> > >
> > > but we go:
> > >
> > >   kernel (kexec'd from) -> kernel (kexec'd to)
> > >
> > > There's no intermediate step involving any bootloader.
> > >
> > > Hence, my point is that the dtb loaded by kexec is _only_ used by the
> > > kernel which is being kexec'd to, not by the bootloader, nor indeed
> > > the kernel which it is loaded into.
> > >
> > > Moreover, if you read the bit that I quoted (which is what I was
> > > replying to), you'll notice that it is talking about the DTB loaded
> > > by kexec somehow causing the _bootloader_ to be redirected to an
> > > alternative console.  This point is wholely false on ARM.
> > 
> > Ahh.. I missed the bootloader bit there.
> > 
> > In which case, we're the same on OpenPOWER, there is no intermediate
> > bootloader - in our case we have linux (with kexec) taking on what uboot
> > or grub is typically used for on other platforms.
> 
> Indeed - maybe Eric knows better, but I can't see any situation where
> the dtb we load via kexec should ever affect "the bootloader", unless
> the "kernel" that's being loaded into kexec is "the bootloader".
> 
> Now, going back to the more fundamental issue raised in my first reply,
> about the kernel command line.
> 
> On x86, I can see that it _is_ possible for userspace to specify a
> command line, and the kernel loading the image provides the command
> line to the to-be-kexeced kernel with very little checking.  So, if
> your kernel is signed, what stops the "insecure userspace" loading
> a signed kernel but giving it an insecure rootfs and/or console?

This is a valid point. If there are kernel options that can be misused
to defeat the purpose of UEFI SecureBoot, then we're in trouble.
Generally, the Linux kernel should treat the command line as untrusted
source.

My point is that modifying the DTB opens a completely new attack
vector. And the goal is not extending the attack surface (because there
are holes in it already), but reducing the attack surface (e.g. by
limiting available kernel command line options).

Petr T

Re: [RFC 0/3] extend kexec_file_load system call

2016-07-12 Thread Petr Tesarik

On Tue, 12 Jul 2016 16:22:07 -0500
ebied...@xmission.com (Eric W. Biederman) wrote:

> Petr Tesarik <ptesa...@suse.cz> writes:
> 
> > On Tue, 12 Jul 2016 13:25:11 -0300
> > Thiago Jung Bauermann <bauer...@linux.vnet.ibm.com> wrote:
>[...]
> >> I also don't understand what you mean by code execution. How does passing 
> >> a 
> >> device tree blob via kexec enables code execution? How can the signature 
> >> scheme be defeated?
> >
> > I'm not an expert on DTB, so I can't provide an example of code
> > execution, but you have already mentioned the /chosen/linux,stdout-path
> > property. If an attacker redirects the bootloader to an insecure
> > console, they may get access to the system that would otherwise be
> > impossible.
> >
> > In general, tampering with the hardware inventory of a machine opens up
> > a security hole, and one must be very cautious which modifications are
> > allowed. You're giving this power to an (unsigned, hence untrusted)
> > userspace application; Eric argues that only the kernel should have
> > this power.
> 
> At the very least it should be signed.  And of course the more signed
> images we have in different combinations the more easily someone can
> find a combination that does things the people performing the signing
> didn't realizing they were allowing.

Exactly. Reminds me of nasty setuid application exploits when one or
more of stdin, stdout and stderr are closed before exec(), so the first
file to be opened gets one of those special file descriptors. Imagine
what happens if the application opens a secret file for reading (now
file descriptor 0), then expects user input on stdin, detects a syntax
error and complains on stderr, including the full input for reference
("%s is not a valid command")...

No one has designed bootloaders to cope with similar unexpected
situations.

> So if we can not add an extra variable into the mix it would be good.

Indeed. Writing boot loaders is difficult enough already. Adding the
same kind of precautions that are necessary to write secure setuid
applications is over the top IMO.

Petr T

Re: [RFC 0/3] extend kexec_file_load system call

2016-07-12 Thread Petr Tesarik

On Tue, 12 Jul 2016 16:22:07 -0500
ebied...@xmission.com (Eric W. Biederman) wrote:

> Petr Tesarik  writes:
> 
> > On Tue, 12 Jul 2016 13:25:11 -0300
> > Thiago Jung Bauermann  wrote:
>[...]
> >> I also don't understand what you mean by code execution. How does passing 
> >> a 
> >> device tree blob via kexec enables code execution? How can the signature 
> >> scheme be defeated?
> >
> > I'm not an expert on DTB, so I can't provide an example of code
> > execution, but you have already mentioned the /chosen/linux,stdout-path
> > property. If an attacker redirects the bootloader to an insecure
> > console, they may get access to the system that would otherwise be
> > impossible.
> >
> > In general, tampering with the hardware inventory of a machine opens up
> > a security hole, and one must be very cautious which modifications are
> > allowed. You're giving this power to an (unsigned, hence untrusted)
> > userspace application; Eric argues that only the kernel should have
> > this power.
> 
> At the very least it should be signed.  And of course the more signed
> images we have in different combinations the more easily someone can
> find a combination that does things the people performing the signing
> didn't realizing they were allowing.

Exactly. Reminds me of nasty setuid application exploits when one or
more of stdin, stdout and stderr are closed before exec(), so the first
file to be opened gets one of those special file descriptors. Imagine
what happens if the application opens a secret file for reading (now
file descriptor 0), then expects user input on stdin, detects a syntax
error and complains on stderr, including the full input for reference
("%s is not a valid command")...

No one has designed bootloaders to cope with similar unexpected
situations.

> So if we can not add an extra variable into the mix it would be good.

Indeed. Writing boot loaders is difficult enough already. Adding the
same kind of precautions that are necessary to write secure setuid
applications is over the top IMO.

Petr T

Re: [RFC 0/3] extend kexec_file_load system call

2016-07-12 Thread Petr Tesarik

On Tue, 12 Jul 2016 13:25:11 -0300
Thiago Jung Bauermann  wrote:

> Hi Eric,
> 
> I'm trying to understand your concerns leading to your nack. I hope you 
> don't mind expanding your thoughts on them a bit.
> 
> Am Dienstag, 12 Juli 2016, 08:25:48 schrieb Eric W. Biederman:
> > AKASHI Takahiro  writes:
> > > Device tree blob must be passed to a second kernel on DTB-capable
> > > archs, like powerpc and arm64, but the current kernel interface
> > > lacks this support.
> > > 
> > > This patch extends kexec_file_load system call by adding an extra
> > > argument to this syscall so that an arbitrary number of file descriptors
> > > can be handed out from user space to the kernel.
> > > 
> > > See the background [1].
> > > 
> > > Please note that the new interface looks quite similar to the current
> > > system call, but that it won't always mean that it provides the "binary
> > > compatibility."
> > > 
> > > [1] http://lists.infradead.org/pipermail/kexec/2016-June/016276.html
> > 
> > So this design is wrong.  The kernel already has the device tree blob,
> > you should not be extracting it from the kernel munging it, and then
> > reinserting it in the kernel if you want signatures and everything to
> > pass.
> 
> I don't understand how the kernel signature will be invalidated. 
> 
> There are some types of boot images that can embed a device tree blob in 
> them, but the kernel can also be handed a separate device tree blob from 
> firmware, the boot loader, or kexec. This latter case is what we are 
> discussing, so we are not talking about modifying an embedded blob in the 
> kernel image.
> 
> > What x86 does is pass it's equivalent of the device tree blob from one
> > kernel to another directly and behind the scenes.  It does not go
> > through userspace for this.
> > 
> > Until a persuasive case can be made for going around the kernel and
> > probably adding a feature (like code execution) that can be used to
> > defeat the signature scheme I am going to nack this.
> 
> I also don't understand what you mean by code execution. How does passing a 
> device tree blob via kexec enables code execution? How can the signature 
> scheme be defeated?

I'm not an expert on DTB, so I can't provide an example of code
execution, but you have already mentioned the /chosen/linux,stdout-path
property. If an attacker redirects the bootloader to an insecure
console, they may get access to the system that would otherwise be
impossible.

In general, tampering with the hardware inventory of a machine opens up
a security hole, and one must be very cautious which modifications are
allowed. You're giving this power to an (unsigned, hence untrusted)
userspace application; Eric argues that only the kernel should have
this power.

Just my two cents,
Petr T

Re: [RFC 0/3] extend kexec_file_load system call

2016-07-12 Thread Petr Tesarik

On Tue, 12 Jul 2016 13:25:11 -0300
Thiago Jung Bauermann  wrote:

> Hi Eric,
> 
> I'm trying to understand your concerns leading to your nack. I hope you 
> don't mind expanding your thoughts on them a bit.
> 
> Am Dienstag, 12 Juli 2016, 08:25:48 schrieb Eric W. Biederman:
> > AKASHI Takahiro  writes:
> > > Device tree blob must be passed to a second kernel on DTB-capable
> > > archs, like powerpc and arm64, but the current kernel interface
> > > lacks this support.
> > > 
> > > This patch extends kexec_file_load system call by adding an extra
> > > argument to this syscall so that an arbitrary number of file descriptors
> > > can be handed out from user space to the kernel.
> > > 
> > > See the background [1].
> > > 
> > > Please note that the new interface looks quite similar to the current
> > > system call, but that it won't always mean that it provides the "binary
> > > compatibility."
> > > 
> > > [1] http://lists.infradead.org/pipermail/kexec/2016-June/016276.html
> > 
> > So this design is wrong.  The kernel already has the device tree blob,
> > you should not be extracting it from the kernel munging it, and then
> > reinserting it in the kernel if you want signatures and everything to
> > pass.
> 
> I don't understand how the kernel signature will be invalidated. 
> 
> There are some types of boot images that can embed a device tree blob in 
> them, but the kernel can also be handed a separate device tree blob from 
> firmware, the boot loader, or kexec. This latter case is what we are 
> discussing, so we are not talking about modifying an embedded blob in the 
> kernel image.
> 
> > What x86 does is pass it's equivalent of the device tree blob from one
> > kernel to another directly and behind the scenes.  It does not go
> > through userspace for this.
> > 
> > Until a persuasive case can be made for going around the kernel and
> > probably adding a feature (like code execution) that can be used to
> > defeat the signature scheme I am going to nack this.
> 
> I also don't understand what you mean by code execution. How does passing a 
> device tree blob via kexec enables code execution? How can the signature 
> scheme be defeated?

I'm not an expert on DTB, so I can't provide an example of code
execution, but you have already mentioned the /chosen/linux,stdout-path
property. If an attacker redirects the bootloader to an insecure
console, they may get access to the system that would otherwise be
impossible.

In general, tampering with the hardware inventory of a machine opens up
a security hole, and one must be very cautious which modifications are
allowed. You're giving this power to an (unsigned, hence untrusted)
userspace application; Eric argues that only the kernel should have
this power.

Just my two cents,
Petr T

Re: [PATCH 1/1] Btrfs: Code Cleanup

2016-03-24 Thread Petr Tesarik

On Thu, 24 Mar 2016 16:03:07 +0100
David Sterba  wrote:

> On Sun, Mar 20, 2016 at 03:11:11PM +0800, Flex Liu wrote:
>[...]
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -2325,7 +2325,10 @@ int btrfs_init_new_device(struct btrfs_root *root, 
> > char *device_path)
> > if (seeding_dev) {
> > sb->s_flags &= ~MS_RDONLY;
> > ret = btrfs_prepare_sprout(root);
> > -   BUG_ON(ret); /* -ENOMEM */
> > +   if (ret) {
> > +   btrfs_abort_transaction(trans, root, ret);
> 
> The transaction abort seems a bit heavy as it will take down the whole
> filesystem. It's called from the device add ioctl, this is a restartable
> operation.
> 
> Unfortunatelly btrfs_prepare_sprout is called after the transaction
> start so btrfs_abort_transaction must be called. To avoid it, the code
> would need to be reorganized, so the memory allocations happen in
> advance.

On the other hand, an abort is still better than a BUG_ON(), and it may
be easier to make incremental improvements.

Just my 2 cents (I haven't tried it),
Petr T

Re: [PATCH 1/1] Btrfs: Code Cleanup

2016-03-24 Thread Petr Tesarik

On Thu, 24 Mar 2016 16:03:07 +0100
David Sterba  wrote:

> On Sun, Mar 20, 2016 at 03:11:11PM +0800, Flex Liu wrote:
>[...]
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -2325,7 +2325,10 @@ int btrfs_init_new_device(struct btrfs_root *root, 
> > char *device_path)
> > if (seeding_dev) {
> > sb->s_flags &= ~MS_RDONLY;
> > ret = btrfs_prepare_sprout(root);
> > -   BUG_ON(ret); /* -ENOMEM */
> > +   if (ret) {
> > +   btrfs_abort_transaction(trans, root, ret);
> 
> The transaction abort seems a bit heavy as it will take down the whole
> filesystem. It's called from the device add ioctl, this is a restartable
> operation.
> 
> Unfortunatelly btrfs_prepare_sprout is called after the transaction
> start so btrfs_abort_transaction must be called. To avoid it, the code
> would need to be reorganized, so the memory allocations happen in
> advance.

On the other hand, an abort is still better than a BUG_ON(), and it may
be easier to make incremental improvements.

Just my 2 cents (I haven't tried it),
Petr T

Re: [Xen-devel] crash tool - problem with new Xen linear virtual mapped sparse p2m list

2015-11-24 Thread Petr Tesarik

V Tue, 24 Nov 2015 10:35:03 +
Andrew Cooper  napsáno:

> On 24/11/15 10:17, Petr Tesarik wrote:
> > On Tue, 24 Nov 2015 10:09:01 +
> > David Vrabel  wrote:
> >
> >> On 24/11/15 09:55, Malcolm Crossley wrote:
> >>> On 24/11/15 08:59, Jan Beulich wrote:
> >>>>>>> On 24.11.15 at 07:55,  wrote:
> >>>>> What about:
> >>>>>
> >>>>> 4) Instead of relying on the kernel maintained p2m list for m2p
> >>>>>conversion use the hypervisor maintained m2p list which should be
> >>>>>available in the dump as well. This is the way the alive kernel is
> >>>>>working, so mimic it during crash dump analysis.
> >>>> I fully agree; I have to admit that looking at the p2m when doing page
> >>>> table walks for a PV Dom0 (having all machine addresses in page table
> >>>> entries) seems kind of backwards. (But I say this knowing nothing
> >>>> about the tool.)
> >>>>
> >>> I don't think we can reliably use the m2p for PV domains because
> >>> PV domains don't always issue a m2p update hypercall when they change
> >>> their p2m mapping.
> >> This only applies to foreign pages which won't be very interesting to a
> >> crash tool.
> > True. I think the main reason crash hasn't done this is that it cannot
> > find the hypervisor maintained m2p list. It should be sufficient to add
> > some more fields to XEN_VMCOREINFO, so that crash can locate the
> > mapping in the dump.
> 
> The M2P lives at an ABI-specified location in all virtual address spaces
> for PV guests.
> 
> Either 0xF580 or 0x8000 depending on bitness.

Hm, this is nice, but kind of chicken-and-egg problem. A system dump
contains a snapshot of the machine's RAM. But the addresses you
mentioned are virtual addresses. How do I translate them to physical
addresses without an m2p mapping? I need at least the value of CR3 for
that domain, and most likely a way to determine if it is a PV domain.

Petr T
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Xen-devel] crash tool - problem with new Xen linear virtual mapped sparse p2m list

2015-11-24 Thread Petr Tesarik

On Tue, 24 Nov 2015 10:09:01 +
David Vrabel  wrote:

> On 24/11/15 09:55, Malcolm Crossley wrote:
> > On 24/11/15 08:59, Jan Beulich wrote:
> > On 24.11.15 at 07:55,  wrote:
> >>> What about:
> >>>
> >>> 4) Instead of relying on the kernel maintained p2m list for m2p
> >>>conversion use the hypervisor maintained m2p list which should be
> >>>available in the dump as well. This is the way the alive kernel is
> >>>working, so mimic it during crash dump analysis.
> >>
> >> I fully agree; I have to admit that looking at the p2m when doing page
> >> table walks for a PV Dom0 (having all machine addresses in page table
> >> entries) seems kind of backwards. (But I say this knowing nothing
> >> about the tool.)
> >>
> > I don't think we can reliably use the m2p for PV domains because
> > PV domains don't always issue a m2p update hypercall when they change
> > their p2m mapping.
> 
> This only applies to foreign pages which won't be very interesting to a
> crash tool.

True. I think the main reason crash hasn't done this is that it cannot
find the hypervisor maintained m2p list. It should be sufficient to add
some more fields to XEN_VMCOREINFO, so that crash can locate the
mapping in the dump.

Petr T
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Xen-devel] crash tool - problem with new Xen linear virtual mapped sparse p2m list

2015-11-24 Thread Petr Tesarik

On Tue, 24 Nov 2015 10:09:01 +
David Vrabel  wrote:

> On 24/11/15 09:55, Malcolm Crossley wrote:
> > On 24/11/15 08:59, Jan Beulich wrote:
> > On 24.11.15 at 07:55,  wrote:
> >>> What about:
> >>>
> >>> 4) Instead of relying on the kernel maintained p2m list for m2p
> >>>conversion use the hypervisor maintained m2p list which should be
> >>>available in the dump as well. This is the way the alive kernel is
> >>>working, so mimic it during crash dump analysis.
> >>
> >> I fully agree; I have to admit that looking at the p2m when doing page
> >> table walks for a PV Dom0 (having all machine addresses in page table
> >> entries) seems kind of backwards. (But I say this knowing nothing
> >> about the tool.)
> >>
> > I don't think we can reliably use the m2p for PV domains because
> > PV domains don't always issue a m2p update hypercall when they change
> > their p2m mapping.
> 
> This only applies to foreign pages which won't be very interesting to a
> crash tool.

True. I think the main reason crash hasn't done this is that it cannot
find the hypervisor maintained m2p list. It should be sufficient to add
some more fields to XEN_VMCOREINFO, so that crash can locate the
mapping in the dump.

Petr T
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Xen-devel] crash tool - problem with new Xen linear virtual mapped sparse p2m list

2015-11-24 Thread Petr Tesarik

V Tue, 24 Nov 2015 10:35:03 +
Andrew Cooper <andrew.coop...@citrix.com> napsáno:

> On 24/11/15 10:17, Petr Tesarik wrote:
> > On Tue, 24 Nov 2015 10:09:01 +
> > David Vrabel <david.vra...@citrix.com> wrote:
> >
> >> On 24/11/15 09:55, Malcolm Crossley wrote:
> >>> On 24/11/15 08:59, Jan Beulich wrote:
> >>>>>>> On 24.11.15 at 07:55, <jgr...@suse.com> wrote:
> >>>>> What about:
> >>>>>
> >>>>> 4) Instead of relying on the kernel maintained p2m list for m2p
> >>>>>conversion use the hypervisor maintained m2p list which should be
> >>>>>available in the dump as well. This is the way the alive kernel is
> >>>>>working, so mimic it during crash dump analysis.
> >>>> I fully agree; I have to admit that looking at the p2m when doing page
> >>>> table walks for a PV Dom0 (having all machine addresses in page table
> >>>> entries) seems kind of backwards. (But I say this knowing nothing
> >>>> about the tool.)
> >>>>
> >>> I don't think we can reliably use the m2p for PV domains because
> >>> PV domains don't always issue a m2p update hypercall when they change
> >>> their p2m mapping.
> >> This only applies to foreign pages which won't be very interesting to a
> >> crash tool.
> > True. I think the main reason crash hasn't done this is that it cannot
> > find the hypervisor maintained m2p list. It should be sufficient to add
> > some more fields to XEN_VMCOREINFO, so that crash can locate the
> > mapping in the dump.
> 
> The M2P lives at an ABI-specified location in all virtual address spaces
> for PV guests.
> 
> Either 0xF580 or 0x8000 depending on bitness.

Hm, this is nice, but kind of chicken-and-egg problem. A system dump
contains a snapshot of the machine's RAM. But the addresses you
mentioned are virtual addresses. How do I translate them to physical
addresses without an m2p mapping? I need at least the value of CR3 for
that domain, and most likely a way to determine if it is a PV domain.

Petr T
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3/4] cp210x: Store part number

2015-07-27 Thread Petr Tesarik

On Sun, 26 Jul 2015 15:32:54 +0200
Oliver Neukum  wrote:

> On Fri, 2015-07-24 at 08:48 +0200, Petr Tesarik wrote:
> > @@ -872,6 +886,12 @@ static int cp210x_startup(struct usb_serial
> > *serial)
> >  
> > usb_set_serial_data(serial, spriv);
> >  
> > +   /* Get the 1-byte part number of the cp210x device */
> > +   cp210x_control_msg(serial->port[0], CP210X_VENDOR_SPECIFIC,
> > +  REQTYPE_DEVICE_TO_HOST, CP210X_GET_PARTNUM,
> > +  , 1, USB_CTRL_GET_TIMEOUT);
> > +   spriv->bPartNumber = partnum & 0xFF;
> 
> DMA on the stack. That breaks the cache coherency rules.
> You must kmalloc the buffer.

I don't understand. While you're right that I copied this part from
Sillicon Labs' driver without much thinking, and >bPartNumber
can be used directly, I can't see any DMA on stack. FWIW
cp210x_control_msg always allocates a buffer using kcalloc:

buf = kcalloc(length, sizeof(__le32), GFP_KERNEL);
/* ... */
result = usb_control_msg(serial->dev, usb_rcvctrlpipe(serial->dev, 0),
 request, requesttype, value,
 spriv->bInterfaceNumber, buf, size, timeout);

Is that what you mean?

TIA,
Petr T
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3/4] cp210x: Store part number

2015-07-27 Thread Petr Tesarik

On Sun, 26 Jul 2015 15:32:54 +0200
Oliver Neukum oneu...@suse.com wrote:

 On Fri, 2015-07-24 at 08:48 +0200, Petr Tesarik wrote:
  @@ -872,6 +886,12 @@ static int cp210x_startup(struct usb_serial
  *serial)
   
  usb_set_serial_data(serial, spriv);
   
  +   /* Get the 1-byte part number of the cp210x device */
  +   cp210x_control_msg(serial-port[0], CP210X_VENDOR_SPECIFIC,
  +  REQTYPE_DEVICE_TO_HOST, CP210X_GET_PARTNUM,
  +  partnum, 1, USB_CTRL_GET_TIMEOUT);
  +   spriv-bPartNumber = partnum  0xFF;
 
 DMA on the stack. That breaks the cache coherency rules.
 You must kmalloc the buffer.

I don't understand. While you're right that I copied this part from
Sillicon Labs' driver without much thinking, and spriv-bPartNumber
can be used directly, I can't see any DMA on stack. FWIW
cp210x_control_msg always allocates a buffer using kcalloc:

buf = kcalloc(length, sizeof(__le32), GFP_KERNEL);
/* ... */
result = usb_control_msg(serial-dev, usb_rcvctrlpipe(serial-dev, 0),
 request, requesttype, value,
 spriv-bInterfaceNumber, buf, size, timeout);

Is that what you mean?

TIA,
Petr T
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 4/4] cp210x: Expose the part number in sysfs

2015-07-24 Thread Petr Tesarik

On Fri, 24 Jul 2015 14:33:23 -0700
Greg Kroah-Hartman  wrote:

> On Fri, Jul 24, 2015 at 11:00:38PM +0200, Petr Tesarik wrote:
> > On Fri, 24 Jul 2015 11:17:55 -0700
> > Greg Kroah-Hartman  wrote:
> > 
> > > On Fri, Jul 24, 2015 at 08:48:11AM +0200, Petr Tesarik wrote:
> > > > From: Petr Tesarik 
> > > > 
> > > > Make it possible to read the cp210x part number from userspace by making
> > > > it a sysfs attribute.
> > > > 
> > > > Signed-off-by: Petr Tesarik 
> > > 
> > > All sysfs files need to be documented in Documentation/ABI/
> > 
> > Is this a recently added requirement? FWIW there are many undocumented
> > sysfs attributes, even in code maintained by you. E.g. each usbserial
> > (ttyUSB*) device has an attribute called "port_number" which is not
> > documented. Or I'm blind...
> 
> It's been a requirement for years.  If we have missed any, please let me
> know and we will add them.  Sometimes we miss this when adding new
> attributes, and many very old attributes never got documented.

Fair enough. The example I gave is ancient...

>[...]
> > > > @@ -892,13 +905,19 @@ static int cp210x_startup(struct usb_serial
> > > > *serial) , 1, USB_CTRL_GET_TIMEOUT);
> > > > spriv->bPartNumber = partnum & 0xFF;
> > > >  
> > > > -   return 0;
> > > > +   result = device_create_file(>interface->dev,
> > > > +   _attr_part_number);
> > > 
> > > You just raced with userspace, it will not properly see this attribute
> > > :(
> > 
> > Can you elaborate on this, please? AFAICS the file is created after all
> > required objects had been instantiated already. Where's the race?
> 
> That's the race.  See this blog post for all the details:
>   
> http://kroah.com/log/blog/2013/06/26/how-to-create-a-sysfs-file-correctly/

Thank you for the link!

> > > Please never use device_create_file, use an attribute group assigned
> > > to the tty device.  Not the USB interface, that is only for USB
> > > interface "things".
>[...]
> > OK, if the USB interface is the wrong place, what's a good place for
> > such a thing? I don't insist on a sysfs attribute, but I don't agree
> > with the tty device.
> 
> Being a usb-serial driver, you don't have "access" to the main USB
> device, so you are kind of violating some layering rules by taking this
> on to the interface.

True. This is still one of my concerns, but the GPIO functionality is
minor compared to the serial bridge, so strictly following layering
rules would be overkill. See Johan Hovold's answer in the thread about
GPIO support for Silicon Labs cp210x USB serial:

  Indeed, you should just hang the gpio device directly off the
  usb-serial device [...]

> Who / what is going to use this file?  What is going to be done with
> the information and to what device is that information going to be
> associated with?

Many thanks again. Thinking about it some more, once I'm done with my
work, userspace can enumerate the gpio devices using sysfs without
having to care about the specific model. The part number is only
relevant internally to the cp210x driver.

In short, there is in fact no user of this file, so the best option is
to report the model in syslog and forget about a sysfs file. Simple.

Petr
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 4/4] cp210x: Expose the part number in sysfs

2015-07-24 Thread Petr Tesarik

On Fri, 24 Jul 2015 11:17:55 -0700
Greg Kroah-Hartman  wrote:

> On Fri, Jul 24, 2015 at 08:48:11AM +0200, Petr Tesarik wrote:
> > From: Petr Tesarik 
> > 
> > Make it possible to read the cp210x part number from userspace by making
> > it a sysfs attribute.
> > 
> > Signed-off-by: Petr Tesarik 
> 
> All sysfs files need to be documented in Documentation/ABI/

Is this a recently added requirement? FWIW there are many undocumented
sysfs attributes, even in code maintained by you. E.g. each usbserial
(ttyUSB*) device has an attribute called "port_number" which is not
documented. Or I'm blind...

> > ---
> >  drivers/usb/serial/cp210x.c | 21 -
> >  1 file changed, 20 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/usb/serial/cp210x.c
> > b/drivers/usb/serial/cp210x.c index dbfc722..66de350 100644
> > --- a/drivers/usb/serial/cp210x.c
> > +++ b/drivers/usb/serial/cp210x.c
> > @@ -871,11 +871,24 @@ static void cp210x_break_ctl(struct
> > tty_struct *tty, int break_state) cp210x_set_config(port,
> > CP210X_SET_BREAK, , 2); }
> >  
> > +static ssize_t part_number_show(struct device *dev,
> > +   struct device_attribute *attr,
> > char *buf) +{
> > +   struct usb_interface *intf = to_usb_interface(dev);
> > +   struct usb_serial *serial = usb_get_intfdata(intf);
> > +   struct cp210x_serial_private *spriv =
> > usb_get_serial_data(serial); +
> > +   return sprintf(buf, "%i\n", spriv->bPartNumber);
> > +}
> > +
> > +static DEVICE_ATTR_RO(part_number);
> > +
> >  static int cp210x_startup(struct usb_serial *serial)
> >  {
> > struct usb_host_interface *cur_altsetting;
> > struct cp210x_serial_private *spriv;
> > unsigned int partnum;
> > +   int result;
> >  
> > spriv = kzalloc(sizeof(*spriv), GFP_KERNEL);
> > if (!spriv)
> > @@ -892,13 +905,19 @@ static int cp210x_startup(struct usb_serial
> > *serial) , 1, USB_CTRL_GET_TIMEOUT);
> > spriv->bPartNumber = partnum & 0xFF;
> >  
> > -   return 0;
> > +   result = device_create_file(>interface->dev,
> > +   _attr_part_number);
> 
> You just raced with userspace, it will not properly see this attribute
> :(

Can you elaborate on this, please? AFAICS the file is created after all
required objects had been instantiated already. Where's the race?

> Please never use device_create_file, use an attribute group assigned
> to the tty device.  Not the USB interface, that is only for USB
> interface "things".

Well, I hesitated with adding it to the USB interface, but adding it to
the tty device is definitely wrong. This is indeed an attribute of the
device, not of the tty. If you look at the other CP210x thread, there's
also a gpio device in the chip. I think it's totally silly to look
inside a tty interface to see if there are any GPIO pins...

OK, if the USB interface is the wrong place, what's a good place for
such a thing? I don't insist on a sysfs attribute, but I don't agree
with the tty device.

Petr T
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 2/4] cp210x: Unify code for set/get config control messages

2015-07-24 Thread Petr Tesarik

From: Petr Tesarik 

There is a lot of overlap between the two functions (e.g. calculation
of the buffer size), so this removes a bit of code duplication, but
most importantly, a more generic function can be easily reused for
other message types.

Signed-off-by: Petr Tesarik 
---
 drivers/usb/serial/cp210x.c | 109 
 1 file changed, 49 insertions(+), 60 deletions(-)

diff --git a/drivers/usb/serial/cp210x.c b/drivers/usb/serial/cp210x.c
index 1bae015..69f03b6 100644
--- a/drivers/usb/serial/cp210x.c
+++ b/drivers/usb/serial/cp210x.c
@@ -307,14 +307,17 @@ enum cp210x_request_type {
 #define CONTROL_WRITE_RTS  0x0200
 
 /*
- * cp210x_get_config
- * Reads from the CP210x configuration registers
+ * cp210x_control_msg
+ * Sends a generic control message, taking care of endianness
+ * and error messages.
  * 'size' is specified in bytes.
- * 'data' is a pointer to a pre-allocated array of integers large
- * enough to hold 'size' bytes (with 4 bytes to each integer)
+ * 'data' is a pointer to the input/output buffer. For output, it holds
+ * the data (in host order) to be sent. For input, it receives data from
+ * the device and must be big enough to hold 'size' bytes.
  */
-static int cp210x_get_config(struct usb_serial_port *port, u8 request,
-   unsigned int *data, int size)
+static int cp210x_control_msg(struct usb_serial_port *port, u8 request,
+ u8 requesttype, u16 value, u32 *data, int size,
+ int timeout)
 {
struct usb_serial *serial = port->serial;
struct cp210x_serial_private *spriv = usb_get_serial_data(serial);
@@ -328,20 +331,22 @@ static int cp210x_get_config(struct usb_serial_port 
*port, u8 request,
if (!buf)
return -ENOMEM;
 
-   /* Issue the request, attempting to read 'size' bytes */
-   result = usb_control_msg(serial->dev, usb_rcvctrlpipe(serial->dev, 0),
-   request, REQTYPE_INTERFACE_TO_HOST, 0x,
-   spriv->bInterfaceNumber, buf, size,
-   USB_CTRL_GET_TIMEOUT);
+   if (!(requesttype & USB_DIR_IN)) {
+   for (i = 0; i < length; i++)
+   buf[i] = cpu_to_le32(data[i]);
+   }
 
-   /* Convert data into an array of integers */
-   for (i = 0; i < length; i++)
-   data[i] = le32_to_cpu(buf[i]);
+   result = usb_control_msg(serial->dev, usb_rcvctrlpipe(serial->dev, 0),
+request, requesttype, value,
+spriv->bInterfaceNumber, buf, size, timeout);
 
-   kfree(buf);
+   if (requesttype & USB_DIR_IN) {
+   for (i = 0; i < length; i++)
+   data[i] = le32_to_cpu(buf[i]);
+   }
 
if (result != size) {
-   dev_dbg(>dev, "%s - Unable to send config request, 
request=0x%x size=%d result=%d\n",
+   dev_dbg(>dev, "%s - Unable to send request, request=0x%x 
size=%d result=%d\n",
__func__, request, size, result);
if (result > 0)
result = -EPROTO;
@@ -349,7 +354,23 @@ static int cp210x_get_config(struct usb_serial_port *port, 
u8 request,
return result;
}
 
-   return 0;
+   kfree(buf);
+
+   return result;
+}
+
+/*
+ * cp210x_get_config
+ * Reads from the CP210x configuration registers
+ * 'size' is specified in bytes.
+ * 'data' is a pointer to a pre-allocated array of integers large
+ * enough to hold 'size' bytes (with 4 bytes to each integer)
+ */
+static int cp210x_get_config(struct usb_serial_port *port, u8 request,
+   unsigned int *data, int size)
+{
+   return cp210x_control_msg(port, request, REQTYPE_INTERFACE_TO_HOST,
+   0x, data, size, USB_CTRL_GET_TIMEOUT);
 }
 
 /*
@@ -361,48 +382,14 @@ static int cp210x_get_config(struct usb_serial_port 
*port, u8 request,
 static int cp210x_set_config(struct usb_serial_port *port, u8 request,
unsigned int *data, int size)
 {
-   struct usb_serial *serial = port->serial;
-   struct cp210x_serial_private *spriv = usb_get_serial_data(serial);
-   __le32 *buf;
-   int result, i, length;
-
-   /* Number of integers required to contain the array */
-   length = (((size - 1) | 3) + 1) / 4;
-
-   buf = kmalloc(length * sizeof(__le32), GFP_KERNEL);
-   if (!buf)
-   return -ENOMEM;
-
-   /* Array of integers into bytes */
-   for (i = 0; i < length; i++)
-   buf[i] = cpu_to_le32(data[i]);
-
-   if (size > 2) {
-   result = usb_control_msg(serial->dev,
-   usb_sndctrlpipe(serial->dev, 0),
-   request, REQTYPE_HOST_TO_INT

[PATCH 1/4] cp210x: Replace USB magic numbers with symbolic names

2015-07-24 Thread Petr Tesarik

From: Petr Tesarik 

The request type is in fact made of three fields that already have
symbolic constants.

While I was rewriting those lines, I also converted the pre-processor
defines into an enum, so they are seen by debuggers.

Signed-off-by: Petr Tesarik 
---
 drivers/usb/serial/cp210x.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/usb/serial/cp210x.c b/drivers/usb/serial/cp210x.c
index eac7cca..1bae015 100644
--- a/drivers/usb/serial/cp210x.c
+++ b/drivers/usb/serial/cp210x.c
@@ -226,10 +226,16 @@ static struct usb_serial_driver * const serial_drivers[] 
= {
 };
 
 /* Config request types */
-#define REQTYPE_HOST_TO_INTERFACE  0x41
-#define REQTYPE_INTERFACE_TO_HOST  0xc1
-#define REQTYPE_HOST_TO_DEVICE 0x40
-#define REQTYPE_DEVICE_TO_HOST 0xc0
+enum cp210x_request_type {
+   REQTYPE_HOST_TO_INTERFACE =
+   USB_DIR_OUT | USB_TYPE_VENDOR | USB_RECIP_INTERFACE,
+   REQTYPE_INTERFACE_TO_HOST =
+   USB_DIR_IN  | USB_TYPE_VENDOR | USB_RECIP_INTERFACE,
+   REQTYPE_HOST_TO_DEVICE =
+   USB_DIR_OUT | USB_TYPE_VENDOR | USB_RECIP_DEVICE,
+   REQTYPE_DEVICE_TO_HOST =
+   USB_DIR_IN  | USB_TYPE_VENDOR | USB_RECIP_DEVICE,
+};
 
 /* Config request codes */
 #define CP210X_IFC_ENABLE  0x00
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 3/4] cp210x: Store part number

2015-07-24 Thread Petr Tesarik

From: Petr Tesarik 

Query and store the CP210x part number.

Signed-off-by: Petr Tesarik 
---
 drivers/usb/serial/cp210x.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/drivers/usb/serial/cp210x.c b/drivers/usb/serial/cp210x.c
index 69f03b6..dbfc722 100644
--- a/drivers/usb/serial/cp210x.c
+++ b/drivers/usb/serial/cp210x.c
@@ -199,6 +199,7 @@ MODULE_DEVICE_TABLE(usb, id_table);
 
 struct cp210x_serial_private {
__u8bInterfaceNumber;
+   __u8bPartNumber;
 };
 
 static struct usb_serial_driver cp210x_device = {
@@ -264,6 +265,7 @@ enum cp210x_request_type {
 #define CP210X_SET_CHARS   0x19
 #define CP210X_GET_BAUDRATE0x1D
 #define CP210X_SET_BAUDRATE0x1E
+#define CP210X_VENDOR_SPECIFIC 0xFF
 
 /* CP210X_IFC_ENABLE */
 #define UART_ENABLE0x0001
@@ -306,6 +308,17 @@ enum cp210x_request_type {
 #define CONTROL_WRITE_DTR  0x0100
 #define CONTROL_WRITE_RTS  0x0200
 
+/* CP210X_VENDOR_SPECIFIC */
+#define CP210X_GET_PARTNUM 0x370B
+
+/* Part number definitions */
+#define CP2101_PARTNUM 0x01
+#define CP2102_PARTNUM 0x02
+#define CP2103_PARTNUM 0x03
+#define CP2104_PARTNUM 0x04
+#define CP2105_PARTNUM 0x05
+#define CP2108_PARTNUM 0x08
+
 /*
  * cp210x_control_msg
  * Sends a generic control message, taking care of endianness
@@ -862,6 +875,7 @@ static int cp210x_startup(struct usb_serial *serial)
 {
struct usb_host_interface *cur_altsetting;
struct cp210x_serial_private *spriv;
+   unsigned int partnum;
 
spriv = kzalloc(sizeof(*spriv), GFP_KERNEL);
if (!spriv)
@@ -872,6 +886,12 @@ static int cp210x_startup(struct usb_serial *serial)
 
usb_set_serial_data(serial, spriv);
 
+   /* Get the 1-byte part number of the cp210x device */
+   cp210x_control_msg(serial->port[0], CP210X_VENDOR_SPECIFIC,
+  REQTYPE_DEVICE_TO_HOST, CP210X_GET_PARTNUM,
+  , 1, USB_CTRL_GET_TIMEOUT);
+   spriv->bPartNumber = partnum & 0xFF;
+
return 0;
 }
 
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 0/4] Show CP210x part number in sysfs

2015-07-24 Thread Petr Tesarik

The cp210x driver can be used for several devices (CP2101/2/3/4). It
is sometimes useful to know the actual part number, because there are
slight differences in their capabilities.

The first two patches are cleanups and not necessary to implement the
feature. I can send them in a separate patch set if that's preferred.

Petr Tesarik (4):
  cp210x: Replace USB magic numbers with symbolic names
  cp210x: Unify code for set/get config control messages
  cp210x: Store part number
  cp210x: Expose the part number in sysfs

 drivers/usb/serial/cp210x.c | 164 ++--
 1 file changed, 99 insertions(+), 65 deletions(-)

-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 4/4] cp210x: Expose the part number in sysfs

2015-07-24 Thread Petr Tesarik

From: Petr Tesarik 

Make it possible to read the cp210x part number from userspace by making
it a sysfs attribute.

Signed-off-by: Petr Tesarik 
---
 drivers/usb/serial/cp210x.c | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/serial/cp210x.c b/drivers/usb/serial/cp210x.c
index dbfc722..66de350 100644
--- a/drivers/usb/serial/cp210x.c
+++ b/drivers/usb/serial/cp210x.c
@@ -871,11 +871,24 @@ static void cp210x_break_ctl(struct tty_struct *tty, int 
break_state)
cp210x_set_config(port, CP210X_SET_BREAK, , 2);
 }
 
+static ssize_t part_number_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct usb_interface *intf = to_usb_interface(dev);
+   struct usb_serial *serial = usb_get_intfdata(intf);
+   struct cp210x_serial_private *spriv = usb_get_serial_data(serial);
+
+   return sprintf(buf, "%i\n", spriv->bPartNumber);
+}
+
+static DEVICE_ATTR_RO(part_number);
+
 static int cp210x_startup(struct usb_serial *serial)
 {
struct usb_host_interface *cur_altsetting;
struct cp210x_serial_private *spriv;
unsigned int partnum;
+   int result;
 
spriv = kzalloc(sizeof(*spriv), GFP_KERNEL);
if (!spriv)
@@ -892,13 +905,19 @@ static int cp210x_startup(struct usb_serial *serial)
   , 1, USB_CTRL_GET_TIMEOUT);
spriv->bPartNumber = partnum & 0xFF;
 
-   return 0;
+   result = device_create_file(>interface->dev,
+   _attr_part_number);
+   if (result)
+   kfree(spriv);
+
+   return result;
 }
 
 static void cp210x_release(struct usb_serial *serial)
 {
struct cp210x_serial_private *spriv;
 
+   device_remove_file(>interface->dev, _attr_part_number);
spriv = usb_get_serial_data(serial);
kfree(spriv);
 }
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 0/4] Show CP210x part number in sysfs

2015-07-24 Thread Petr Tesarik

The cp210x driver can be used for several devices (CP2101/2/3/4). It
is sometimes useful to know the actual part number, because there are
slight differences in their capabilities.

The first two patches are cleanups and not necessary to implement the
feature. I can send them in a separate patch set if that's preferred.

Petr Tesarik (4):
  cp210x: Replace USB magic numbers with symbolic names
  cp210x: Unify code for set/get config control messages
  cp210x: Store part number
  cp210x: Expose the part number in sysfs

 drivers/usb/serial/cp210x.c | 164 ++--
 1 file changed, 99 insertions(+), 65 deletions(-)

-- 
2.1.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 4/4] cp210x: Expose the part number in sysfs

2015-07-24 Thread Petr Tesarik

From: Petr Tesarik ptesa...@suse.cz

Make it possible to read the cp210x part number from userspace by making
it a sysfs attribute.

Signed-off-by: Petr Tesarik ptesa...@suse.com
---
 drivers/usb/serial/cp210x.c | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/serial/cp210x.c b/drivers/usb/serial/cp210x.c
index dbfc722..66de350 100644
--- a/drivers/usb/serial/cp210x.c
+++ b/drivers/usb/serial/cp210x.c
@@ -871,11 +871,24 @@ static void cp210x_break_ctl(struct tty_struct *tty, int 
break_state)
cp210x_set_config(port, CP210X_SET_BREAK, state, 2);
 }
 
+static ssize_t part_number_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct usb_interface *intf = to_usb_interface(dev);
+   struct usb_serial *serial = usb_get_intfdata(intf);
+   struct cp210x_serial_private *spriv = usb_get_serial_data(serial);
+
+   return sprintf(buf, %i\n, spriv-bPartNumber);
+}
+
+static DEVICE_ATTR_RO(part_number);
+
 static int cp210x_startup(struct usb_serial *serial)
 {
struct usb_host_interface *cur_altsetting;
struct cp210x_serial_private *spriv;
unsigned int partnum;
+   int result;
 
spriv = kzalloc(sizeof(*spriv), GFP_KERNEL);
if (!spriv)
@@ -892,13 +905,19 @@ static int cp210x_startup(struct usb_serial *serial)
   partnum, 1, USB_CTRL_GET_TIMEOUT);
spriv-bPartNumber = partnum  0xFF;
 
-   return 0;
+   result = device_create_file(serial-interface-dev,
+   dev_attr_part_number);
+   if (result)
+   kfree(spriv);
+
+   return result;
 }
 
 static void cp210x_release(struct usb_serial *serial)
 {
struct cp210x_serial_private *spriv;
 
+   device_remove_file(serial-interface-dev, dev_attr_part_number);
spriv = usb_get_serial_data(serial);
kfree(spriv);
 }
-- 
2.1.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 3/4] cp210x: Store part number

2015-07-24 Thread Petr Tesarik

From: Petr Tesarik ptesa...@suse.cz

Query and store the CP210x part number.

Signed-off-by: Petr Tesarik ptesa...@suse.com
---
 drivers/usb/serial/cp210x.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/drivers/usb/serial/cp210x.c b/drivers/usb/serial/cp210x.c
index 69f03b6..dbfc722 100644
--- a/drivers/usb/serial/cp210x.c
+++ b/drivers/usb/serial/cp210x.c
@@ -199,6 +199,7 @@ MODULE_DEVICE_TABLE(usb, id_table);
 
 struct cp210x_serial_private {
__u8bInterfaceNumber;
+   __u8bPartNumber;
 };
 
 static struct usb_serial_driver cp210x_device = {
@@ -264,6 +265,7 @@ enum cp210x_request_type {
 #define CP210X_SET_CHARS   0x19
 #define CP210X_GET_BAUDRATE0x1D
 #define CP210X_SET_BAUDRATE0x1E
+#define CP210X_VENDOR_SPECIFIC 0xFF
 
 /* CP210X_IFC_ENABLE */
 #define UART_ENABLE0x0001
@@ -306,6 +308,17 @@ enum cp210x_request_type {
 #define CONTROL_WRITE_DTR  0x0100
 #define CONTROL_WRITE_RTS  0x0200
 
+/* CP210X_VENDOR_SPECIFIC */
+#define CP210X_GET_PARTNUM 0x370B
+
+/* Part number definitions */
+#define CP2101_PARTNUM 0x01
+#define CP2102_PARTNUM 0x02
+#define CP2103_PARTNUM 0x03
+#define CP2104_PARTNUM 0x04
+#define CP2105_PARTNUM 0x05
+#define CP2108_PARTNUM 0x08
+
 /*
  * cp210x_control_msg
  * Sends a generic control message, taking care of endianness
@@ -862,6 +875,7 @@ static int cp210x_startup(struct usb_serial *serial)
 {
struct usb_host_interface *cur_altsetting;
struct cp210x_serial_private *spriv;
+   unsigned int partnum;
 
spriv = kzalloc(sizeof(*spriv), GFP_KERNEL);
if (!spriv)
@@ -872,6 +886,12 @@ static int cp210x_startup(struct usb_serial *serial)
 
usb_set_serial_data(serial, spriv);
 
+   /* Get the 1-byte part number of the cp210x device */
+   cp210x_control_msg(serial-port[0], CP210X_VENDOR_SPECIFIC,
+  REQTYPE_DEVICE_TO_HOST, CP210X_GET_PARTNUM,
+  partnum, 1, USB_CTRL_GET_TIMEOUT);
+   spriv-bPartNumber = partnum  0xFF;
+
return 0;
 }
 
-- 
2.1.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/4] cp210x: Replace USB magic numbers with symbolic names

2015-07-24 Thread Petr Tesarik

From: Petr Tesarik ptesa...@suse.cz

The request type is in fact made of three fields that already have
symbolic constants.

While I was rewriting those lines, I also converted the pre-processor
defines into an enum, so they are seen by debuggers.

Signed-off-by: Petr Tesarik ptesa...@suse.com
---
 drivers/usb/serial/cp210x.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/usb/serial/cp210x.c b/drivers/usb/serial/cp210x.c
index eac7cca..1bae015 100644
--- a/drivers/usb/serial/cp210x.c
+++ b/drivers/usb/serial/cp210x.c
@@ -226,10 +226,16 @@ static struct usb_serial_driver * const serial_drivers[] 
= {
 };
 
 /* Config request types */
-#define REQTYPE_HOST_TO_INTERFACE  0x41
-#define REQTYPE_INTERFACE_TO_HOST  0xc1
-#define REQTYPE_HOST_TO_DEVICE 0x40
-#define REQTYPE_DEVICE_TO_HOST 0xc0
+enum cp210x_request_type {
+   REQTYPE_HOST_TO_INTERFACE =
+   USB_DIR_OUT | USB_TYPE_VENDOR | USB_RECIP_INTERFACE,
+   REQTYPE_INTERFACE_TO_HOST =
+   USB_DIR_IN  | USB_TYPE_VENDOR | USB_RECIP_INTERFACE,
+   REQTYPE_HOST_TO_DEVICE =
+   USB_DIR_OUT | USB_TYPE_VENDOR | USB_RECIP_DEVICE,
+   REQTYPE_DEVICE_TO_HOST =
+   USB_DIR_IN  | USB_TYPE_VENDOR | USB_RECIP_DEVICE,
+};
 
 /* Config request codes */
 #define CP210X_IFC_ENABLE  0x00
-- 
2.1.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 2/4] cp210x: Unify code for set/get config control messages

2015-07-24 Thread Petr Tesarik

From: Petr Tesarik ptesa...@suse.cz

There is a lot of overlap between the two functions (e.g. calculation
of the buffer size), so this removes a bit of code duplication, but
most importantly, a more generic function can be easily reused for
other message types.

Signed-off-by: Petr Tesarik ptesa...@suse.com
---
 drivers/usb/serial/cp210x.c | 109 
 1 file changed, 49 insertions(+), 60 deletions(-)

diff --git a/drivers/usb/serial/cp210x.c b/drivers/usb/serial/cp210x.c
index 1bae015..69f03b6 100644
--- a/drivers/usb/serial/cp210x.c
+++ b/drivers/usb/serial/cp210x.c
@@ -307,14 +307,17 @@ enum cp210x_request_type {
 #define CONTROL_WRITE_RTS  0x0200
 
 /*
- * cp210x_get_config
- * Reads from the CP210x configuration registers
+ * cp210x_control_msg
+ * Sends a generic control message, taking care of endianness
+ * and error messages.
  * 'size' is specified in bytes.
- * 'data' is a pointer to a pre-allocated array of integers large
- * enough to hold 'size' bytes (with 4 bytes to each integer)
+ * 'data' is a pointer to the input/output buffer. For output, it holds
+ * the data (in host order) to be sent. For input, it receives data from
+ * the device and must be big enough to hold 'size' bytes.
  */
-static int cp210x_get_config(struct usb_serial_port *port, u8 request,
-   unsigned int *data, int size)
+static int cp210x_control_msg(struct usb_serial_port *port, u8 request,
+ u8 requesttype, u16 value, u32 *data, int size,
+ int timeout)
 {
struct usb_serial *serial = port-serial;
struct cp210x_serial_private *spriv = usb_get_serial_data(serial);
@@ -328,20 +331,22 @@ static int cp210x_get_config(struct usb_serial_port 
*port, u8 request,
if (!buf)
return -ENOMEM;
 
-   /* Issue the request, attempting to read 'size' bytes */
-   result = usb_control_msg(serial-dev, usb_rcvctrlpipe(serial-dev, 0),
-   request, REQTYPE_INTERFACE_TO_HOST, 0x,
-   spriv-bInterfaceNumber, buf, size,
-   USB_CTRL_GET_TIMEOUT);
+   if (!(requesttype  USB_DIR_IN)) {
+   for (i = 0; i  length; i++)
+   buf[i] = cpu_to_le32(data[i]);
+   }
 
-   /* Convert data into an array of integers */
-   for (i = 0; i  length; i++)
-   data[i] = le32_to_cpu(buf[i]);
+   result = usb_control_msg(serial-dev, usb_rcvctrlpipe(serial-dev, 0),
+request, requesttype, value,
+spriv-bInterfaceNumber, buf, size, timeout);
 
-   kfree(buf);
+   if (requesttype  USB_DIR_IN) {
+   for (i = 0; i  length; i++)
+   data[i] = le32_to_cpu(buf[i]);
+   }
 
if (result != size) {
-   dev_dbg(port-dev, %s - Unable to send config request, 
request=0x%x size=%d result=%d\n,
+   dev_dbg(port-dev, %s - Unable to send request, request=0x%x 
size=%d result=%d\n,
__func__, request, size, result);
if (result  0)
result = -EPROTO;
@@ -349,7 +354,23 @@ static int cp210x_get_config(struct usb_serial_port *port, 
u8 request,
return result;
}
 
-   return 0;
+   kfree(buf);
+
+   return result;
+}
+
+/*
+ * cp210x_get_config
+ * Reads from the CP210x configuration registers
+ * 'size' is specified in bytes.
+ * 'data' is a pointer to a pre-allocated array of integers large
+ * enough to hold 'size' bytes (with 4 bytes to each integer)
+ */
+static int cp210x_get_config(struct usb_serial_port *port, u8 request,
+   unsigned int *data, int size)
+{
+   return cp210x_control_msg(port, request, REQTYPE_INTERFACE_TO_HOST,
+   0x, data, size, USB_CTRL_GET_TIMEOUT);
 }
 
 /*
@@ -361,48 +382,14 @@ static int cp210x_get_config(struct usb_serial_port 
*port, u8 request,
 static int cp210x_set_config(struct usb_serial_port *port, u8 request,
unsigned int *data, int size)
 {
-   struct usb_serial *serial = port-serial;
-   struct cp210x_serial_private *spriv = usb_get_serial_data(serial);
-   __le32 *buf;
-   int result, i, length;
-
-   /* Number of integers required to contain the array */
-   length = (((size - 1) | 3) + 1) / 4;
-
-   buf = kmalloc(length * sizeof(__le32), GFP_KERNEL);
-   if (!buf)
-   return -ENOMEM;
-
-   /* Array of integers into bytes */
-   for (i = 0; i  length; i++)
-   buf[i] = cpu_to_le32(data[i]);
-
-   if (size  2) {
-   result = usb_control_msg(serial-dev,
-   usb_sndctrlpipe(serial-dev, 0),
-   request, REQTYPE_HOST_TO_INTERFACE, 0x,
-   spriv-bInterfaceNumber, buf, size

Re: [PATCH 4/4] cp210x: Expose the part number in sysfs

2015-07-24 Thread Petr Tesarik

On Fri, 24 Jul 2015 11:17:55 -0700
Greg Kroah-Hartman gre...@linuxfoundation.org wrote:

 On Fri, Jul 24, 2015 at 08:48:11AM +0200, Petr Tesarik wrote:
  From: Petr Tesarik ptesa...@suse.cz
  
  Make it possible to read the cp210x part number from userspace by making
  it a sysfs attribute.
  
  Signed-off-by: Petr Tesarik ptesa...@suse.com
 
 All sysfs files need to be documented in Documentation/ABI/

Is this a recently added requirement? FWIW there are many undocumented
sysfs attributes, even in code maintained by you. E.g. each usbserial
(ttyUSB*) device has an attribute called port_number which is not
documented. Or I'm blind...

  ---
   drivers/usb/serial/cp210x.c | 21 -
   1 file changed, 20 insertions(+), 1 deletion(-)
  
  diff --git a/drivers/usb/serial/cp210x.c
  b/drivers/usb/serial/cp210x.c index dbfc722..66de350 100644
  --- a/drivers/usb/serial/cp210x.c
  +++ b/drivers/usb/serial/cp210x.c
  @@ -871,11 +871,24 @@ static void cp210x_break_ctl(struct
  tty_struct *tty, int break_state) cp210x_set_config(port,
  CP210X_SET_BREAK, state, 2); }
   
  +static ssize_t part_number_show(struct device *dev,
  +   struct device_attribute *attr,
  char *buf) +{
  +   struct usb_interface *intf = to_usb_interface(dev);
  +   struct usb_serial *serial = usb_get_intfdata(intf);
  +   struct cp210x_serial_private *spriv =
  usb_get_serial_data(serial); +
  +   return sprintf(buf, %i\n, spriv-bPartNumber);
  +}
  +
  +static DEVICE_ATTR_RO(part_number);
  +
   static int cp210x_startup(struct usb_serial *serial)
   {
  struct usb_host_interface *cur_altsetting;
  struct cp210x_serial_private *spriv;
  unsigned int partnum;
  +   int result;
   
  spriv = kzalloc(sizeof(*spriv), GFP_KERNEL);
  if (!spriv)
  @@ -892,13 +905,19 @@ static int cp210x_startup(struct usb_serial
  *serial) partnum, 1, USB_CTRL_GET_TIMEOUT);
  spriv-bPartNumber = partnum  0xFF;
   
  -   return 0;
  +   result = device_create_file(serial-interface-dev,
  +   dev_attr_part_number);
 
 You just raced with userspace, it will not properly see this attribute
 :(

Can you elaborate on this, please? AFAICS the file is created after all
required objects had been instantiated already. Where's the race?

 Please never use device_create_file, use an attribute group assigned
 to the tty device.  Not the USB interface, that is only for USB
 interface things.

Well, I hesitated with adding it to the USB interface, but adding it to
the tty device is definitely wrong. This is indeed an attribute of the
device, not of the tty. If you look at the other CP210x thread, there's
also a gpio device in the chip. I think it's totally silly to look
inside a tty interface to see if there are any GPIO pins...

OK, if the USB interface is the wrong place, what's a good place for
such a thing? I don't insist on a sysfs attribute, but I don't agree
with the tty device.

Petr T
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 4/4] cp210x: Expose the part number in sysfs

2015-07-24 Thread Petr Tesarik

On Fri, 24 Jul 2015 14:33:23 -0700
Greg Kroah-Hartman gre...@linuxfoundation.org wrote:

 On Fri, Jul 24, 2015 at 11:00:38PM +0200, Petr Tesarik wrote:
  On Fri, 24 Jul 2015 11:17:55 -0700
  Greg Kroah-Hartman gre...@linuxfoundation.org wrote:
  
   On Fri, Jul 24, 2015 at 08:48:11AM +0200, Petr Tesarik wrote:
From: Petr Tesarik ptesa...@suse.cz

Make it possible to read the cp210x part number from userspace by making
it a sysfs attribute.

Signed-off-by: Petr Tesarik ptesa...@suse.com
   
   All sysfs files need to be documented in Documentation/ABI/
  
  Is this a recently added requirement? FWIW there are many undocumented
  sysfs attributes, even in code maintained by you. E.g. each usbserial
  (ttyUSB*) device has an attribute called port_number which is not
  documented. Or I'm blind...
 
 It's been a requirement for years.  If we have missed any, please let me
 know and we will add them.  Sometimes we miss this when adding new
 attributes, and many very old attributes never got documented.

Fair enough. The example I gave is ancient...

[...]
@@ -892,13 +905,19 @@ static int cp210x_startup(struct usb_serial
*serial) partnum, 1, USB_CTRL_GET_TIMEOUT);
spriv-bPartNumber = partnum  0xFF;
 
-   return 0;
+   result = device_create_file(serial-interface-dev,
+   dev_attr_part_number);
   
   You just raced with userspace, it will not properly see this attribute
   :(
  
  Can you elaborate on this, please? AFAICS the file is created after all
  required objects had been instantiated already. Where's the race?
 
 That's the race.  See this blog post for all the details:
   
 http://kroah.com/log/blog/2013/06/26/how-to-create-a-sysfs-file-correctly/

Thank you for the link!

   Please never use device_create_file, use an attribute group assigned
   to the tty device.  Not the USB interface, that is only for USB
   interface things.
[...]
  OK, if the USB interface is the wrong place, what's a good place for
  such a thing? I don't insist on a sysfs attribute, but I don't agree
  with the tty device.
 
 Being a usb-serial driver, you don't have access to the main USB
 device, so you are kind of violating some layering rules by taking this
 on to the interface.

True. This is still one of my concerns, but the GPIO functionality is
minor compared to the serial bridge, so strictly following layering
rules would be overkill. See Johan Hovold's answer in the thread about
GPIO support for Silicon Labs cp210x USB serial:

  Indeed, you should just hang the gpio device directly off the
  usb-serial device [...]

 Who / what is going to use this file?  What is going to be done with
 the information and to what device is that information going to be
 associated with?

Many thanks again. Thinking about it some more, once I'm done with my
work, userspace can enumerate the gpio devices using sysfs without
having to care about the specific model. The part number is only
relevant internally to the cp210x driver.

In short, there is in fact no user of this file, so the best option is
to report the model in syslog and forget about a sysfs file. Simple.

Petr
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/1] pty: BREAK for pseudoterminals

2015-02-17 Thread Petr Tesarik

On Tue, 17 Feb 2015 14:03:58 -0500
Peter Hurley  wrote:

> On 02/16/2015 08:04 AM, Peter Hurley wrote:
> > On 02/05/2015 02:11 PM, Nan Li wrote:
> >> This will greatly enhance the usefulness of QEMU virtual serial ports, 
> >> because the Linux kernel interprets a break on the serial console as a 
> >> SysRq, but there is currently no way to pass this signal over a 
> >> pseudo-terminal. This patch will work for transmitting BREAK from master 
> >> to slave through pseudo-terminal.
> > 
> > pty is an IPC mechanism, not a virtualization driver.
> > 
> > Those programs that know they're consuming from a slave pty may not
> > have bothered to set termios for BREAKs, and may not be prepared
> > for NULLs in the input stream.
> 
> I'm not even seeing how this works as advertised.
> 
> In QEMU, the *master* end is the source/sink for the guest VM, while the
> *slave* pty is sink/source for the host. Like this:

Yes, you're totally right. This patch implements break handling on the
"wrong" side of the pty pair. I think you have already pointed this
out in one of the previous mails, and we're waiting for a patch that
implements it on the slave side.

Regards,
Petr Tesarik

> Host <--> slave pty <===> master pty <--> qemu <==> emulated serial
> port
> 
> So this patch enables the guest VM to transmit BREAK to the host.
> Why is that useful?
> 
> Regards,
> Peter Hurley
> 
> >> Signed-off-by: Nan Li 
> >> ---
> >>  drivers/tty/pty.c | 19 +++
> >>  1 file changed, 19 insertions(+)
> >>
> >> diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
> >> index e72ee62..ac8893a 100644
> >> --- a/drivers/tty/pty.c
> >> +++ b/drivers/tty/pty.c
> >> @@ -31,6 +31,7 @@
> >>  static struct tty_driver *ptm_driver;
> >>  static struct tty_driver *pts_driver;
> >>  static DEFINE_MUTEX(devpts_mutex);
> >> +#define BREAK_STRING '\0'
> >>  #endif
> >>  
> >>  static void pty_close(struct tty_struct *tty, struct file *filp)
> >> @@ -674,6 +675,23 @@ static void pty_unix98_shutdown(struct
> >> tty_struct *tty) devpts_kill_index(tty->driver_data, tty->index);
> >>  }
> >>  
> >> +static int pty_unix98_break_ctl(struct tty_struct *tty, int
> >> break_state) +{
> >> +  int c;
> >> +  struct tty_struct *to = tty->link;
> >> +
> >> +  if (break_state) {
> >> +  /* Stuff the break data into the input queue of
> >> the other end */
> >> +  c = tty_insert_flip_char(to->port, BREAK_STRING,
> >> TTY_BREAK);
> >> +  /* And shovel */
> >> +  if (c)
> >> +  tty_flip_buffer_push(to->port);
> >> +  else
> >> +  return -ENOMEM;
> >> +  }
> >> +  return 0;
> >> +}
> >> +
> >>  static const struct tty_operations ptm_unix98_ops = {
> >>.lookup = ptm_unix98_lookup,
> >>.install = pty_unix98_install,
> >> @@ -686,6 +704,7 @@ static const struct tty_operations
> >> ptm_unix98_ops = { .chars_in_buffer = pty_chars_in_buffer,
> >>.unthrottle = pty_unthrottle,
> >>.ioctl = pty_unix98_ioctl,
> >> +  .break_ctl = pty_unix98_break_ctl,
> >>.resize = pty_resize,
> >>.shutdown = pty_unix98_shutdown,
> >>.cleanup = pty_cleanup
> >>
> > 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/1] pty: BREAK for pseudoterminals

2015-02-17 Thread Petr Tesarik

On Tue, 17 Feb 2015 14:03:58 -0500
Peter Hurley pe...@hurleysoftware.com wrote:

 On 02/16/2015 08:04 AM, Peter Hurley wrote:
  On 02/05/2015 02:11 PM, Nan Li wrote:
  This will greatly enhance the usefulness of QEMU virtual serial ports, 
  because the Linux kernel interprets a break on the serial console as a 
  SysRq, but there is currently no way to pass this signal over a 
  pseudo-terminal. This patch will work for transmitting BREAK from master 
  to slave through pseudo-terminal.
  
  pty is an IPC mechanism, not a virtualization driver.
  
  Those programs that know they're consuming from a slave pty may not
  have bothered to set termios for BREAKs, and may not be prepared
  for NULLs in the input stream.
 
 I'm not even seeing how this works as advertised.
 
 In QEMU, the *master* end is the source/sink for the guest VM, while the
 *slave* pty is sink/source for the host. Like this:

Yes, you're totally right. This patch implements break handling on the
wrong side of the pty pair. I think you have already pointed this
out in one of the previous mails, and we're waiting for a patch that
implements it on the slave side.

Regards,
Petr Tesarik

 Host -- slave pty === master pty -- qemu == emulated serial
 port
 
 So this patch enables the guest VM to transmit BREAK to the host.
 Why is that useful?
 
 Regards,
 Peter Hurley
 
  Signed-off-by: Nan Li n...@suse.com
  ---
   drivers/tty/pty.c | 19 +++
   1 file changed, 19 insertions(+)
 
  diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
  index e72ee62..ac8893a 100644
  --- a/drivers/tty/pty.c
  +++ b/drivers/tty/pty.c
  @@ -31,6 +31,7 @@
   static struct tty_driver *ptm_driver;
   static struct tty_driver *pts_driver;
   static DEFINE_MUTEX(devpts_mutex);
  +#define BREAK_STRING '\0'
   #endif
   
   static void pty_close(struct tty_struct *tty, struct file *filp)
  @@ -674,6 +675,23 @@ static void pty_unix98_shutdown(struct
  tty_struct *tty) devpts_kill_index(tty-driver_data, tty-index);
   }
   
  +static int pty_unix98_break_ctl(struct tty_struct *tty, int
  break_state) +{
  +  int c;
  +  struct tty_struct *to = tty-link;
  +
  +  if (break_state) {
  +  /* Stuff the break data into the input queue of
  the other end */
  +  c = tty_insert_flip_char(to-port, BREAK_STRING,
  TTY_BREAK);
  +  /* And shovel */
  +  if (c)
  +  tty_flip_buffer_push(to-port);
  +  else
  +  return -ENOMEM;
  +  }
  +  return 0;
  +}
  +
   static const struct tty_operations ptm_unix98_ops = {
 .lookup = ptm_unix98_lookup,
 .install = pty_unix98_install,
  @@ -686,6 +704,7 @@ static const struct tty_operations
  ptm_unix98_ops = { .chars_in_buffer = pty_chars_in_buffer,
 .unthrottle = pty_unthrottle,
 .ioctl = pty_unix98_ioctl,
  +  .break_ctl = pty_unix98_break_ctl,
 .resize = pty_resize,
 .shutdown = pty_unix98_shutdown,
 .cleanup = pty_cleanup
 
  
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/1] pty: BREAK for pseudoterminals

2015-02-16 Thread Petr Tesarik

On Mon, 16 Feb 2015 11:24:16 -0500
Peter Hurley  wrote:

> Hi Petr,
> 
> On 02/16/2015 08:22 AM, Petr Tesarik wrote:
> > On Mon, 16 Feb 2015 08:04:02 -0500
> > Peter Hurley  wrote:
> > 
> >> On 02/05/2015 02:11 PM, Nan Li wrote:
> >>> This will greatly enhance the usefulness of QEMU virtual serial ports, 
> >>> because the Linux kernel interprets a break on the serial console as a 
> >>> SysRq, but there is currently no way to pass this signal over a 
> >>> pseudo-terminal. This patch will work for transmitting BREAK from master 
> >>> to slave through pseudo-terminal.
> >>
> >> pty is an IPC mechanism, not a virtualization driver.
> > 
> > No, but it can be used as a TTY. Teletypes have always had the capacity
> > to send and receive BREAK.
> 
> In some general-purpose but restricted capacity, the *slave* end _mimics_
> a tty. That doesn't mean that it is suitable for every conceivable
> use as a tty, nor should it.

Unless there's some specification of what should and what should not be
implemented, this question is open for discussion, methinks.

> If BREAK was actually useful for real terminal i/o, the pty driver
> would already support this.

If I strictly followed this statement, no improvement would ever be
possible. Or did I miss something? Are Linux PTYs a legacy subsystem
that never gets any new features?

>[...]
> > Well, the default termios includes IGNBRK, so unless they bothered
> > to do anything about BREAKs, they won't see any change.
> 
> Userspace programs are sloppy, especially with terminal i/o and
> settings. Unlikely is not the same as not possible.

Sure. New features may break sloppy programs. OTOH, the obvious
workaround is not using such programs together with new programs that
actually use tcsendbreak() for something... until those sloppy programs
are fixed. It's not like the whole system stops working once this patch
is applied.

> > Anyway, the current kernel behaviour is clearly suboptimal. Calling
> > tcsendbreak() on a pty descriptor does nothing but reports success.
> > There are obviously two ways to fix it: either report an error, or
> > deliver the BREAK for real.
> 
> The pty master end is even less of a tty than the slave end, but this
> isn't really about errno. This patch doesn't address either of your
> points wrt tcsendbreak() on the slave descriptor which is the actual
> terminal end.

That's a valid point. And, indeed, the terminal end actually needs the
handling of BREAK to make it useful.

> > This patch implements the latter, adding at least one valid use case
> > to explain why it is better than the former.
> 
> I disagree that this is a valid use case for the _pty driver_.
> 
> AFAICT this is simply for convenience, as sysrq functionality is
> already available via sendkey.

That's a completely different story. This patch (after fixing it to
work with the terminal end) would allow me to set up a QEMU emulated
serial port using a pty (i.e. "-chardev pty") and send a BREAK signal
to it, no matter what is running in the guest.

I mean, I can run an emulated MIPS64 as a QEMU guest on an x86_64 host,
and still somehow pass SysRq to it. IIUC this will never be possible
with KVP.

Another use case: In my job, I'm struggling with different serial
consoles (some using ipmi SoL, some using telnet to a service
processor, some connected with a real RS-232 link). If I could send
BREAK over a pty, I could extend ipmiconsole to translate it to the SOL
message, telnet to translate it to the telnet escape, amtterm to send a
corresponding message... Then I could send a BREAK to any of my systems
simply by pressing 'C-A b' in screen(1) without having to think how is
this particular machine connected and what the correct sequence is for
that protocol.

Just my two cents,
Petr Tesarik
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/1] pty: BREAK for pseudoterminals

2015-02-16 Thread Petr Tesarik

On Mon, 16 Feb 2015 08:04:02 -0500
Peter Hurley  wrote:

> On 02/05/2015 02:11 PM, Nan Li wrote:
> > This will greatly enhance the usefulness of QEMU virtual serial ports, 
> > because the Linux kernel interprets a break on the serial console as a 
> > SysRq, but there is currently no way to pass this signal over a 
> > pseudo-terminal. This patch will work for transmitting BREAK from master to 
> > slave through pseudo-terminal.
> 
> pty is an IPC mechanism, not a virtualization driver.

No, but it can be used as a TTY. Teletypes have always had the capacity
to send and receive BREAK.

> Those programs that know they're consuming from a slave pty may not
> have bothered to set termios for BREAKs, and may not be prepared
> for NULLs in the input stream.

Well, the default termios includes IGNBRK, so unless they bothered to
do anything about BREAKs, they won't see any change.

Anyway, the current kernel behaviour is clearly suboptimal. Calling
tcsendbreak() on a pty descriptor does nothing but reports success.
There are obviously two ways to fix it: either report an error, or
deliver the BREAK for real.

This patch implements the latter, adding at least one valid use case
to explain why it is better than the former.

Regards,
Petr Tesarik

> > Signed-off-by: Nan Li 
> > ---
> >  drivers/tty/pty.c | 19 +++
> >  1 file changed, 19 insertions(+)
> > 
> > diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
> > index e72ee62..ac8893a 100644
> > --- a/drivers/tty/pty.c
> > +++ b/drivers/tty/pty.c
> > @@ -31,6 +31,7 @@
> >  static struct tty_driver *ptm_driver;
> >  static struct tty_driver *pts_driver;
> >  static DEFINE_MUTEX(devpts_mutex);
> > +#define BREAK_STRING '\0'
> >  #endif
> >  
> >  static void pty_close(struct tty_struct *tty, struct file *filp)
> > @@ -674,6 +675,23 @@ static void pty_unix98_shutdown(struct
> > tty_struct *tty) devpts_kill_index(tty->driver_data, tty->index);
> >  }
> >  
> > +static int pty_unix98_break_ctl(struct tty_struct *tty, int
> > break_state) +{
> > +   int c;
> > +   struct tty_struct *to = tty->link;
> > +
> > +   if (break_state) {
> > +   /* Stuff the break data into the input queue of
> > the other end */
> > +   c = tty_insert_flip_char(to->port, BREAK_STRING,
> > TTY_BREAK);
> > +   /* And shovel */
> > +   if (c)
> > +   tty_flip_buffer_push(to->port);
> > +   else
> > +   return -ENOMEM;
> > +   }
> > +   return 0;
> > +}
> > +
> >  static const struct tty_operations ptm_unix98_ops = {
> > .lookup = ptm_unix98_lookup,
> > .install = pty_unix98_install,
> > @@ -686,6 +704,7 @@ static const struct tty_operations
> > ptm_unix98_ops = { .chars_in_buffer = pty_chars_in_buffer,
> > .unthrottle = pty_unthrottle,
> > .ioctl = pty_unix98_ioctl,
> > +   .break_ctl = pty_unix98_break_ctl,
> > .resize = pty_resize,
> > .shutdown = pty_unix98_shutdown,
> > .cleanup = pty_cleanup
> > 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

1 2 >

1 - 100 of 190 matches

Mail list logo