[dm-devel] [PATCH v8 3/7] mce: fix set_mce_nospec to always unmap the whole page

2022-04-19 Thread Jane Chu
The set_memory_uc() approach doesn't work well in all cases.
As Dan pointed out: when "The VMM unmapped the bad page from
guest physical space and passed the machine check to the guest",
"The guest gets virtual #MC on an access to that page. When
the guest tries to do set_memory_uc() and instructs cpa_flush()
to clean the caches, that results in taking another fault /
exception, perhaps because the VMM unmapped the page from the
guest."

Since the driver has the special knowledge to handle NP or UC,
mark the poisoned page as not-present (NP) and let the driver
handle it when it comes down to repair.

Please refer to discussions here for more details.
https://lore.kernel.org/all/capcyv4hrxpb1tasbzug-ggdvs0oofkxmxlihmktg_kfi7yb...@mail.gmail.com/

Now that the poisoned page is marked not-present, also fix
pmem_do_write() to avoid writing to a not-present page and
triggering a kernel Oops.

Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Signed-off-by: Jane Chu 
---
 arch/x86/kernel/cpu/mce/core.c |  6 +++---
 arch/x86/mm/pat/set_memory.c   | 23 +++
 drivers/nvdimm/pmem.c  | 30 +++---
 include/linux/set_memory.h |  4 ++--
 4 files changed, 23 insertions(+), 40 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 981496e6bc0e..fa67bb9d1afe 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -579,7 +579,7 @@ static int uc_decode_notifier(struct notifier_block *nb, 
unsigned long val,
 
pfn = mce->addr >> PAGE_SHIFT;
if (!memory_failure(pfn, 0)) {
-   set_mce_nospec(pfn, whole_page(mce));
+   set_mce_nospec(pfn);
mce->kflags |= MCE_HANDLED_UC;
}
 
@@ -1316,7 +1316,7 @@ static void kill_me_maybe(struct callback_head *cb)
 
ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
if (!ret) {
-   set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+   set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
sync_core();
return;
}
@@ -1342,7 +1342,7 @@ static void kill_me_never(struct callback_head *cb)
p->mce_count = 0;
pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr);
if (!memory_failure(p->mce_addr >> PAGE_SHIFT, 0))
-   set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+   set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
 }
 
 static void queue_task_work(struct mce *m, char *msg, void (*func)(struct 
callback_head *))
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 978cf5bd2ab6..e3a5e55f3e08 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1925,13 +1925,8 @@ int set_memory_wb(unsigned long addr, int numpages)
 }
 EXPORT_SYMBOL(set_memory_wb);
 
-/*
- * Prevent speculative access to the page by either unmapping
- * it (if we do not require access to any part of the page) or
- * marking it uncacheable (if we want to try to retrieve data
- * from non-poisoned lines in the page).
- */
-int set_mce_nospec(unsigned long pfn, bool unmap)
+/* Prevent speculative access to a page by marking it not-present */
+int set_mce_nospec(unsigned long pfn)
 {
unsigned long decoy_addr;
int rc;
@@ -1956,19 +1951,23 @@ int set_mce_nospec(unsigned long pfn, bool unmap)
 */
decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
 
-   if (unmap)
-   rc = set_memory_np(decoy_addr, 1);
-   else
-   rc = set_memory_uc(decoy_addr, 1);
+   rc = set_memory_np(decoy_addr, 1);
if (rc)
pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
return rc;
 }
 
+static int set_memory_present(unsigned long *addr, int numpages)
+{
+   return change_page_attr_set(addr, numpages, __pgprot(_PAGE_PRESENT), 0);
+}
+
 /* Restore full speculative operation to the pfn. */
 int clear_mce_nospec(unsigned long pfn)
 {
-   return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
+   unsigned long addr = (unsigned long) pfn_to_kaddr(pfn);
+
+   return set_memory_present(&addr, 1);
 }
 EXPORT_SYMBOL_GPL(clear_mce_nospec);
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 58d95242a836..4aa17132a557 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -158,36 +158,20 @@ static blk_status_t pmem_do_write(struct pmem_device 
*pmem,
struct page *page, unsigned int page_off,
sector_t sector, unsigned int len)
 {
-   blk_status_t rc = BLK_STS_OK;
-   bool bad_pmem = false;
phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
void *pmem_addr = pmem->virt_addr + pmem_off;
 
-   if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
-   bad_pmem = true;
+   if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) {
+   
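
The hunk is cut off above. Per the commit message, the intended
write-path behavior is to clear the poison up front rather than oops
on the now not-present page. A minimal sketch of that flow, assuming
the pmem_clear_poison() helper (refactored in patch 6/7) and the
usual write_pmem() tail:

	/* Sketch only: clear poison before touching a not-present page. */
	if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) {
		blk_status_t rc = pmem_clear_poison(pmem, pmem_off, len);

		if (rc != BLK_STS_OK)
			return rc;	/* poison remains, refuse the write */
	}

	flush_dcache_page(page);
	write_pmem(pmem_addr, page, page_off, len);
	return BLK_STS_OK;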

[dm-devel] [PATCH v8 1/7] acpi/nfit: rely on mce->misc to determine poison granularity

2022-04-19 Thread Jane Chu
nfit_handle_mce() hardcodes the poison granularity at L1_CACHE_BYTES.
Instead, let the driver rely on the mce->misc register to determine
the poison granularity.
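
For context, bits 5:0 of the IA32_MCi_MISC register encode the least
significant valid bit of the recoverable address, so the granularity
falls out directly. A short sketch (MCI_MISC_ADDR_LSB() is the existing
helper from arch/x86/include/asm/mce.h; the example values are
illustrative):

	/* Recoverable address LSB lives in bits 5:0 of MCi_MISC. */
	#define MCI_MISC_ADDR_LSB(m)	((m) & 0x3f)

	/* LSB field == 6  -> 64-byte (cacheline) granularity,
	 * LSB field == 12 -> 4K (page) granularity. */
	unsigned int align = 1UL << MCI_MISC_ADDR_LSB(mce->misc);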

Suggested-by: Dan Williams 
Signed-off-by: Jane Chu 
---
 drivers/acpi/nfit/mce.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index ee8d9973f60b..d48a388b796e 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -32,6 +32,7 @@ static int nfit_handle_mce(struct notifier_block *nb, 
unsigned long val,
 */
	mutex_lock(&acpi_desc_lock);
	list_for_each_entry(acpi_desc, &acpi_descs, list) {
+   unsigned int align = 1UL << MCI_MISC_ADDR_LSB(mce->misc);
struct device *dev = acpi_desc->dev;
int found_match = 0;
 
@@ -63,8 +64,7 @@ static int nfit_handle_mce(struct notifier_block *nb, 
unsigned long val,
 
/* If this fails due to an -ENOMEM, there is little we can do */
nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
-   ALIGN(mce->addr, L1_CACHE_BYTES),
-   L1_CACHE_BYTES);
+   ALIGN_DOWN(mce->addr, align), align);
nvdimm_region_notify(nfit_spa->nd_region,
NVDIMM_REVALIDATE_POISON);
 
-- 
2.18.4




[PATCH v8 6/7] pmem: refactor pmem_clear_poison()

2022-04-19 Thread Jane Chu
Refactor the pmem_clear_poison() function such that the common
shared code between the typical write path and the recovery write
path is factored out.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Dan Williams 
Signed-off-by: Jane Chu 
---
 drivers/nvdimm/pmem.c | 73 ---
 1 file changed, 48 insertions(+), 25 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3c0cad38ec33..c3772304d417 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -45,9 +45,25 @@ static struct nd_region *to_region(struct pmem_device *pmem)
return to_nd_region(to_dev(pmem)->parent);
 }
 
-static void hwpoison_clear(struct pmem_device *pmem,
-   phys_addr_t phys, unsigned int len)
+static phys_addr_t to_phys(struct pmem_device *pmem, phys_addr_t offset)
 {
+   return pmem->phys_addr + offset;
+}
+
+static sector_t to_sect(struct pmem_device *pmem, phys_addr_t offset)
+{
+   return (offset - pmem->data_offset) >> SECTOR_SHIFT;
+}
+
+static phys_addr_t to_offset(struct pmem_device *pmem, sector_t sector)
+{
+   return (sector << SECTOR_SHIFT) + pmem->data_offset;
+}
+
+static void pmem_mkpage_present(struct pmem_device *pmem, phys_addr_t offset,
+   unsigned int len)
+{
+   phys_addr_t phys = to_phys(pmem, offset);
unsigned long pfn_start, pfn_end, pfn;
 
/* only pmem in the linear map supports HWPoison */
@@ -69,33 +85,40 @@ static void hwpoison_clear(struct pmem_device *pmem,
}
 }
 
-static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
-   phys_addr_t offset, unsigned int len)
+static void pmem_clear_bb(struct pmem_device *pmem, sector_t sector, long blks)
 {
-   struct device *dev = to_dev(pmem);
-   sector_t sector;
-   long cleared;
-   blk_status_t rc = BLK_STS_OK;
+   if (blks == 0)
+   return;
+   badblocks_clear(&pmem->bb, sector, blks);
+   if (pmem->bb_state)
+   sysfs_notify_dirent(pmem->bb_state);
+}
 
-   sector = (offset - pmem->data_offset) / 512;
+static long __pmem_clear_poison(struct pmem_device *pmem,
+   phys_addr_t offset, unsigned int len)
+{
+   phys_addr_t phys = to_phys(pmem, offset);
+   long cleared = nvdimm_clear_poison(to_dev(pmem), phys, len);
 
-   cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
-   if (cleared < len)
-   rc = BLK_STS_IOERR;
-   if (cleared > 0 && cleared / 512) {
-   hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
-   cleared /= 512;
-   dev_dbg(dev, "%#llx clear %ld sector%s\n",
-   (unsigned long long) sector, cleared,
-   cleared > 1 ? "s" : "");
-   badblocks_clear(&pmem->bb, sector, cleared);
-   if (pmem->bb_state)
-   sysfs_notify_dirent(pmem->bb_state);
+   if (cleared > 0) {
+   pmem_mkpage_present(pmem, offset, cleared);
+   arch_invalidate_pmem(pmem->virt_addr + offset, len);
}
+   return cleared;
+}
 
-   arch_invalidate_pmem(pmem->virt_addr + offset, len);
+static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
+   phys_addr_t offset, unsigned int len)
+{
+   long cleared = __pmem_clear_poison(pmem, offset, len);
 
-   return rc;
+   if (cleared < 0)
+   return BLK_STS_IOERR;
+
+   pmem_clear_bb(pmem, to_sect(pmem, offset), cleared >> SECTOR_SHIFT);
+   if (cleared < len)
+   return BLK_STS_IOERR;
+   return BLK_STS_OK;
 }
 
 static void write_pmem(void *pmem_addr, struct page *page,
@@ -143,7 +166,7 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem,
sector_t sector, unsigned int len)
 {
blk_status_t rc;
-   phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
+   phys_addr_t pmem_off = to_offset(pmem, sector);
void *pmem_addr = pmem->virt_addr + pmem_off;
 
	if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
@@ -158,7 +181,7 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem,
struct page *page, unsigned int page_off,
sector_t sector, unsigned int len)
 {
-   phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
+   phys_addr_t pmem_off = to_offset(pmem, sector);
void *pmem_addr = pmem->virt_addr + pmem_off;
 
	if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) {
-- 
2.18.4




[dm-devel] [PATCH v8 4/7] dax: introduce DAX_RECOVERY_WRITE dax access mode

2022-04-19 Thread Jane Chu
Up till now, dax_direct_access() is used implicitly for normal
access, but for the purpose of recovery write, a dax range with
poison is requested.  To make the interface clear, introduce
enum dax_access_mode {
	DAX_ACCESS,
	DAX_RECOVERY_WRITE,
}
where DAX_ACCESS is used for normal dax access and
DAX_RECOVERY_WRITE is used for dax recovery write.
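
A hypothetical call site illustrating the two modes (only
dax_direct_access() and the enum come from this series; the local
variables are made up for the example):

	void *kaddr;
	pfn_t pfn;
	long nr;

	/* Normal access: the driver may refuse a range containing poison. */
	nr = dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS, &kaddr, &pfn);

	/* Recovery write: ask for a mapping even if the range has poison,
	 * so the caller can repair it via a subsequent recovery write. */
	nr = dax_direct_access(dax_dev, pgoff, 1, DAX_RECOVERY_WRITE, &kaddr, &pfn);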

Suggested-by: Dan Williams 
Signed-off-by: Jane Chu 
---
 drivers/dax/super.c |  5 ++--
 drivers/md/dm-linear.c  |  5 ++--
 drivers/md/dm-log-writes.c  |  5 ++--
 drivers/md/dm-stripe.c  |  5 ++--
 drivers/md/dm-target.c  |  4 ++-
 drivers/md/dm-writecache.c  |  7 +++---
 drivers/md/dm.c |  5 ++--
 drivers/nvdimm/pmem.c   | 44 +
 drivers/nvdimm/pmem.h   |  5 +++-
 drivers/s390/block/dcssblk.c|  9 ---
 fs/dax.c|  9 ---
 fs/fuse/dax.c   |  4 +--
 include/linux/dax.h |  9 +--
 include/linux/device-mapper.h   |  4 ++-
 tools/testing/nvdimm/pmem-dax.c |  3 ++-
 15 files changed, 85 insertions(+), 38 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0211e6f7b47a..5405eb553430 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -117,6 +117,7 @@ enum dax_device_flags {
  * @dax_dev: a dax_device instance representing the logical memory range
  * @pgoff: offset in pages from the start of the device to translate
  * @nr_pages: number of consecutive pages caller can handle relative to @pfn
+ * @mode: indicator on normal access or recovery write
  * @kaddr: output parameter that returns a virtual address mapping of pfn
  * @pfn: output parameter that returns an absolute pfn translation of @pgoff
  *
@@ -124,7 +125,7 @@ enum dax_device_flags {
  * pages accessible at the device relative @pgoff.
  */
 long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
-   void **kaddr, pfn_t *pfn)
+   enum dax_access_mode mode, void **kaddr, pfn_t *pfn)
 {
long avail;
 
@@ -138,7 +139,7 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t 
pgoff, long nr_pages,
return -EINVAL;
 
avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages,
-   kaddr, pfn);
+   mode, kaddr, pfn);
if (!avail)
return -ERANGE;
return min(avail, nr_pages);
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 76b486e4d2be..13e263299c9c 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -172,11 +172,12 @@ static struct dax_device *linear_dax_pgoff(struct 
dm_target *ti, pgoff_t *pgoff)
 }
 
 static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
-   long nr_pages, void **kaddr, pfn_t *pfn)
+   long nr_pages, enum dax_access_mode mode, void **kaddr,
+   pfn_t *pfn)
 {
	struct dax_device *dax_dev = linear_dax_pgoff(ti, &pgoff);
 
-   return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
+   return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
 }
 
 static int linear_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index c9d036d6bb2e..06bdbed65eb1 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -889,11 +889,12 @@ static struct dax_device *log_writes_dax_pgoff(struct 
dm_target *ti,
 }
 
 static long log_writes_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
-long nr_pages, void **kaddr, pfn_t *pfn)
+   long nr_pages, enum dax_access_mode mode, void **kaddr,
+   pfn_t *pfn)
 {
	struct dax_device *dax_dev = log_writes_dax_pgoff(ti, &pgoff);
 
-   return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
+   return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
 }
 
 static int log_writes_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index c81d331d1afe..77d72900e997 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -315,11 +315,12 @@ static struct dax_device *stripe_dax_pgoff(struct 
dm_target *ti, pgoff_t *pgoff)
 }
 
 static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
-   long nr_pages, void **kaddr, pfn_t *pfn)
+   long nr_pages, enum dax_access_mode mode, void **kaddr,
+   pfn_t *pfn)
 {
	struct dax_device *dax_dev = stripe_dax_pgoff(ti, &pgoff);
 
-   return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
+   return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
 }
 
 static int stripe_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
diff --git a/drivers/md/dm-target.c 

[PATCH v8 5/7] dax: add .recovery_write dax_operation

2022-04-19 Thread Jane Chu
Introduce the dax_recovery_write() operation. The function is used
to recover a dax range that contains poison. The typical use case
is when a user process receives a SIGBUS with si_code BUS_MCEERR_AR
indicating poison(s) in a dax range; in response, the user process
issues a pwrite() to the page-aligned dax range, thus clearing the
poison and putting valid data in the range.
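
A sketch of the userspace side of that flow (the fd, buf, and
file_offset variables are illustrative only):

	/* In the SIGBUS handler, si_code == BUS_MCEERR_AR and si_addr
	 * identify the poisoned page; the process then rewrites the
	 * whole page with known-good data. */
	size_t pgsz = sysconf(_SC_PAGESIZE);
	off_t aligned = file_offset & ~((off_t)pgsz - 1);

	if (pwrite(fd, buf, pgsz, aligned) != (ssize_t)pgsz)
		err(1, "recovery pwrite failed");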

Signed-off-by: Jane Chu 
---
 drivers/dax/super.c   |  9 +
 drivers/md/dm-linear.c| 10 ++
 drivers/md/dm-log-writes.c| 10 ++
 drivers/md/dm-stripe.c| 10 ++
 drivers/md/dm.c   | 20 
 drivers/nvdimm/pmem.c |  7 +++
 fs/dax.c  | 13 -
 include/linux/dax.h   | 13 +
 include/linux/device-mapper.h |  9 +
 9 files changed, 100 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 5405eb553430..50a08b2ec247 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -195,6 +195,15 @@ int dax_zero_page_range(struct dax_device *dax_dev, 
pgoff_t pgoff,
 }
 EXPORT_SYMBOL_GPL(dax_zero_page_range);
 
+size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+   void *addr, size_t bytes, struct iov_iter *iter)
+{
+   if (!dax_dev->ops->recovery_write)
+   return 0;
+   return dax_dev->ops->recovery_write(dax_dev, pgoff, addr, bytes, iter);
+}
+EXPORT_SYMBOL_GPL(dax_recovery_write);
+
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 13e263299c9c..cdf48bc8c5b0 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -188,9 +188,18 @@ static int linear_dax_zero_page_range(struct dm_target 
*ti, pgoff_t pgoff,
return dax_zero_page_range(dax_dev, pgoff, nr_pages);
 }
 
+static size_t linear_dax_recovery_write(struct dm_target *ti, pgoff_t pgoff,
+   void *addr, size_t bytes, struct iov_iter *i)
+{
	struct dax_device *dax_dev = linear_dax_pgoff(ti, &pgoff);
+
+   return dax_recovery_write(dax_dev, pgoff, addr, bytes, i);
+}
+
 #else
 #define linear_dax_direct_access NULL
 #define linear_dax_zero_page_range NULL
+#define linear_dax_recovery_write NULL
 #endif
 
 static struct target_type linear_target = {
@@ -208,6 +217,7 @@ static struct target_type linear_target = {
.iterate_devices = linear_iterate_devices,
.direct_access = linear_dax_direct_access,
.dax_zero_page_range = linear_dax_zero_page_range,
+   .dax_recovery_write = linear_dax_recovery_write,
 };
 
 int __init dm_linear_init(void)
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index 06bdbed65eb1..22739dccdd17 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -905,9 +905,18 @@ static int log_writes_dax_zero_page_range(struct dm_target 
*ti, pgoff_t pgoff,
return dax_zero_page_range(dax_dev, pgoff, nr_pages << PAGE_SHIFT);
 }
 
+static size_t log_writes_dax_recovery_write(struct dm_target *ti,
+   pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i)
+{
	struct dax_device *dax_dev = log_writes_dax_pgoff(ti, &pgoff);
+
+   return dax_recovery_write(dax_dev, pgoff, addr, bytes, i);
+}
+
 #else
 #define log_writes_dax_direct_access NULL
 #define log_writes_dax_zero_page_range NULL
+#define log_writes_dax_recovery_write NULL
 #endif
 
 static struct target_type log_writes_target = {
@@ -925,6 +934,7 @@ static struct target_type log_writes_target = {
.io_hints = log_writes_io_hints,
.direct_access = log_writes_dax_direct_access,
.dax_zero_page_range = log_writes_dax_zero_page_range,
+   .dax_recovery_write = log_writes_dax_recovery_write,
 };
 
 static int __init dm_log_writes_init(void)
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 77d72900e997..baa085cc67bd 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -331,9 +331,18 @@ static int stripe_dax_zero_page_range(struct dm_target 
*ti, pgoff_t pgoff,
return dax_zero_page_range(dax_dev, pgoff, nr_pages);
 }
 
+static size_t stripe_dax_recovery_write(struct dm_target *ti, pgoff_t pgoff,
+   void *addr, size_t bytes, struct iov_iter *i)
+{
	struct dax_device *dax_dev = stripe_dax_pgoff(ti, &pgoff);
+
+   return dax_recovery_write(dax_dev, pgoff, addr, bytes, i);
+}
+
 #else
 #define stripe_dax_direct_access NULL
 #define stripe_dax_zero_page_range NULL
+#define stripe_dax_recovery_write NULL
 #endif
 
 /*
@@ -470,6 +479,7 @@ static struct target_type stripe_target = {
.io_hints = stripe_io_hints,
.direct_access = stripe_dax_direct_access,
.dax_zero_page_range = stripe_dax_zero_page_range,
+   .dax_recovery_write = stripe_dax_recovery_write,
 };
 
 int __init dm_stripe_init(void)
diff --git 

[dm-devel] [PATCH v8 7/7] pmem: implement pmem_recovery_write()

2022-04-19 Thread Jane Chu
A recovery write thread starts out as a normal pwrite thread; when
the filesystem is told about a potential media error in the range,
it turns the normal pwrite into a dax_recovery_write.

The recovery write consists of clearing the media poison, clearing
the page HWPoison bit, re-enabling page-wide read-write permission,
flushing the caches, and finally writing.  A competing pread thread
is held off during the recovery process since data read back might
not be valid; this is achieved by clearing the badblock records only
after the recovery write is complete. Competing recovery write
threads are serialized by the pmem device level .recovery_lock.

Signed-off-by: Jane Chu 
---
 drivers/nvdimm/pmem.c | 56 ++-
 drivers/nvdimm/pmem.h |  1 +
 2 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index c3772304d417..134f8909eb65 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -332,10 +332,63 @@ static long pmem_dax_direct_access(struct dax_device 
*dax_dev,
return __pmem_direct_access(pmem, pgoff, nr_pages, mode, kaddr, pfn);
 }
 
+/*
+ * A recovery write thread starts out as a normal pwrite thread; when
+ * the filesystem is told about a potential media error in the range,
+ * it turns the normal pwrite into a dax_recovery_write.
+ *
+ * The recovery write consists of clearing the media poison, clearing
+ * the page HWPoison bit, re-enabling page-wide read-write permission,
+ * flushing the caches, and finally writing.  A competing pread thread
+ * is held off during the recovery process since data read back might
+ * not be valid; this is achieved by clearing the badblock records only
+ * after the recovery write is complete. Competing recovery write
+ * threads are serialized by the pmem device level .recovery_lock.
+ */
 static size_t pmem_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i)
 {
-   return 0;
+   struct pmem_device *pmem = dax_get_private(dax_dev);
+   size_t olen, len, off;
+   phys_addr_t pmem_off;
+   struct device *dev = pmem->bb.dev;
+   long cleared;
+
+   off = offset_in_page(addr);
+   len = PFN_PHYS(PFN_UP(off + bytes));
+   if (!is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) >> SECTOR_SHIFT, len))
+   return _copy_from_iter_flushcache(addr, bytes, i);
+
+   /*
+* A non page-aligned range cannot be recovered. This should not
+* happen unless something else went wrong.
+*/
+   if (off || !PAGE_ALIGNED(bytes)) {
+   dev_warn(dev, "Found poison, but addr(%p) or bytes(%#lx) not 
page aligned\n",
+   addr, bytes);
+   return 0;
+   }
+
+   mutex_lock(&pmem->recovery_lock);
+   pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
+   cleared = __pmem_clear_poison(pmem, pmem_off, len);
+   if (cleared > 0 && cleared < len) {
+   dev_warn(dev, "poison cleared only %ld out of %lu bytes\n",
+   cleared, len);
+   mutex_unlock(&pmem->recovery_lock);
+   return 0;
+   }
+   if (cleared < 0) {
+   dev_warn(dev, "poison clear failed: %ld\n", cleared);
+   mutex_unlock(&pmem->recovery_lock);
+   return 0;
+   }
+
+   olen = _copy_from_iter_flushcache(addr, bytes, i);
+   pmem_clear_bb(pmem, to_sect(pmem, pmem_off), cleared >> SECTOR_SHIFT);
+
+   mutex_unlock(&pmem->recovery_lock);
+   return olen;
 }
 
 static const struct dax_operations pmem_dax_ops = {
@@ -525,6 +578,7 @@ static int pmem_attach_disk(struct device *dev,
if (rc)
goto out_cleanup_dax;
dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
+   mutex_init(&pmem->recovery_lock);
pmem->dax_dev = dax_dev;
 
rc = device_add_disk(dev, disk, pmem_attribute_groups);
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 392b0b38acb9..91e40f5e3c0e 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -27,6 +27,7 @@ struct pmem_device {
struct dax_device   *dax_dev;
struct gendisk  *disk;
struct dev_pagemap  pgmap;
+   struct mutexrecovery_lock;
 };
 
 long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
-- 
2.18.4




[PATCH v8 2/7] x86/mce: relocate set{clear}_mce_nospec() functions

2022-04-19 Thread Jane Chu
Relocate the twin mce functions to arch/x86/mm/pat/set_memory.c
file where they belong.

While at it, fixup a function name in a comment.

Reviewed-by: Dan Williams 
Signed-off-by: Jane Chu 
---
 arch/x86/include/asm/set_memory.h | 52 ---
 arch/x86/mm/pat/set_memory.c  | 49 -
 include/linux/set_memory.h|  8 ++---
 3 files changed, 52 insertions(+), 57 deletions(-)

diff --git a/arch/x86/include/asm/set_memory.h 
b/arch/x86/include/asm/set_memory.h
index 78ca53512486..b45c4d27fd46 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -86,56 +86,4 @@ bool kernel_page_present(struct page *page);
 
 extern int kernel_set_to_readonly;
 
-#ifdef CONFIG_X86_64
-/*
- * Prevent speculative access to the page by either unmapping
- * it (if we do not require access to any part of the page) or
- * marking it uncacheable (if we want to try to retrieve data
- * from non-poisoned lines in the page).
- */
-static inline int set_mce_nospec(unsigned long pfn, bool unmap)
-{
-   unsigned long decoy_addr;
-   int rc;
-
-   /* SGX pages are not in the 1:1 map */
-   if (arch_is_platform_page(pfn << PAGE_SHIFT))
-   return 0;
-   /*
-* We would like to just call:
-*  set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1);
-* but doing that would radically increase the odds of a
-* speculative access to the poison page because we'd have
-* the virtual address of the kernel 1:1 mapping sitting
-* around in registers.
-* Instead we get tricky.  We create a non-canonical address
-* that looks just like the one we want, but has bit 63 flipped.
-* This relies on set_memory_XX() properly sanitizing any __pa()
-* results with __PHYSICAL_MASK or PTE_PFN_MASK.
-*/
-   decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
-
-   if (unmap)
-   rc = set_memory_np(decoy_addr, 1);
-   else
-   rc = set_memory_uc(decoy_addr, 1);
-   if (rc)
-   pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
-   return rc;
-}
-#define set_mce_nospec set_mce_nospec
-
-/* Restore full speculative operation to the pfn. */
-static inline int clear_mce_nospec(unsigned long pfn)
-{
-   return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
-}
-#define clear_mce_nospec clear_mce_nospec
-#else
-/*
- * Few people would run a 32-bit kernel on a machine that supports
- * recoverable errors because they have too much memory to boot 32-bit.
- */
-#endif
-
 #endif /* _ASM_X86_SET_MEMORY_H */
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index abf5ed76e4b7..978cf5bd2ab6 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1816,7 +1816,7 @@ static inline int cpa_clear_pages_array(struct page 
**pages, int numpages,
 }
 
 /*
- * _set_memory_prot is an internal helper for callers that have been passed
+ * __set_memory_prot is an internal helper for callers that have been passed
  * a pgprot_t value from upper layers and a reservation has already been taken.
  * If you want to set the pgprot to a specific page protocol, use the
  * set_memory_xx() functions.
@@ -1925,6 +1925,53 @@ int set_memory_wb(unsigned long addr, int numpages)
 }
 EXPORT_SYMBOL(set_memory_wb);
 
+/*
+ * Prevent speculative access to the page by either unmapping
+ * it (if we do not require access to any part of the page) or
+ * marking it uncacheable (if we want to try to retrieve data
+ * from non-poisoned lines in the page).
+ */
+int set_mce_nospec(unsigned long pfn, bool unmap)
+{
+   unsigned long decoy_addr;
+   int rc;
+
+   if (!IS_ENABLED(CONFIG_64BIT))
+   return 0;
+
+   /* SGX pages are not in the 1:1 map */
+   if (arch_is_platform_page(pfn << PAGE_SHIFT))
+   return 0;
+   /*
+* We would like to just call:
+*  set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1);
+* but doing that would radically increase the odds of a
+* speculative access to the poison page because we'd have
+* the virtual address of the kernel 1:1 mapping sitting
+* around in registers.
+* Instead we get tricky.  We create a non-canonical address
+* that looks just like the one we want, but has bit 63 flipped.
+* This relies on set_memory_XX() properly sanitizing any __pa()
+* results with __PHYSICAL_MASK or PTE_PFN_MASK.
+*/
+   decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
+
+   if (unmap)
+   rc = set_memory_np(decoy_addr, 1);
+   else
+   rc = set_memory_uc(decoy_addr, 1);
+   if (rc)
+   pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
+   return rc;
+}
+
+/* Restore full speculative operation to the pfn. */
+int clear_mce_nospec(unsigned long pfn)
+{
+   return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
+}

[PATCH v8 0/7] DAX poison recovery

2022-04-19 Thread Jane Chu
In this series, the dax recovery code path is independent of the
normal write path. Competing dax recovery threads are serialized,
and racing read threads are guaranteed not to overlap with the
recovery process.

In this phase, the recovery granularity is a page; a future patch
will explore recovery at finer granularity.

Changelog:
v7 -> v8:
  - add a patch to teach the nfit driver to rely on mce->misc for poison
granularity, suggested by Dan
  - add set_memory_present() helper to be invoked by set_mce_nospec() for
better readability, suggested by Dan
  - folded a trivial fix to comments in patch 2/NN, suggested by Boris,
and more commit message comments from Boris
  - moved .recovery_write to dax_operations, as dev_pagemap_ops is meant for
device agnostic operations, suggested by Christoph
  - replace DAX_RECOVERY flag with enum dax_access_mode suggested by Dan
  - re-organized __pmem_direct_access as provided by Christoph
  - split [PATCH v7 4/6] into two patches: one introduces
DAX_RECOVERY_WRITE, and the other introduces .recovery_write operation
  
v6 -> v7:
 . incorporated comments from Christoph, and picked up a reviewed-by
 . add x...@kernel.org per Boris
 . discovered pmem firmware doesn't reliably handle a request to clear
   poison over a large range (such as 2M), hence worked around it by
   limiting the size of the requested range to the kernel page size.

v5->v6:
  . per Christoph, move set{clear}_mce_nospec() inline functions out
of include/linux/set_memory.h and into arch/x86/mm/pat/set_memory.c
file, so that no need to export _set_memory_present().
  . per Christoph, ratelimit warning message in pmem_do_write()
  . per both Christoph and Dan, switch back to adding a flag to
dax_direct_access() instead of embedding the flag in kvaddr
  . suggestions from Christoph for improving code structure and
readability
  . per Dan, add .recovery_write to dev_pagemap.ops instead of adding
it to dax_operations, such that the DM layer doesn't need to be
involved explicitly in dax recovery write
  . per Dan, is_bad_pmem() takes a seqlock, so no need to place it
under recovery_lock.
  Many thanks for both reviewers!

v4->v5:
  Fixed build errors reported by kernel test robot

v3->v4:
  Rebased to v5.17-rc1-81-g0280e3c58f92

References:
v4 https://lore.kernel.org/lkml/2022012626.860012-1-jane@oracle.com/T/
v3 https://lkml.org/lkml/2022/1/11/900
v2 https://lore.kernel.org/all/20211106011638.2613039-1-jane@oracle.com/
Discussions about marking the poisoned page as 'np'
https://lore.kernel.org/all/capcyv4hrxpb1tasbzug-ggdvs0oofkxmxlihmktg_kfi7yb...@mail.gmail.com/

Jane Chu (7):
  acpi/nfit: rely on mce->misc to determine poison granularity
  x86/mce: relocate set{clear}_mce_nospec() functions
  mce: fix set_mce_nospec to always unmap the whole page
  dax: introduce DAX_RECOVERY_WRITE dax access mode
  dax: add .recovery_write dax_operation
  pmem: refactor pmem_clear_poison()
  pmem: implement pmem_recovery_write()

 arch/x86/include/asm/set_memory.h |  52 
 arch/x86/kernel/cpu/mce/core.c|   6 +-
 arch/x86/mm/pat/set_memory.c  |  48 ++-
 drivers/acpi/nfit/mce.c   |   4 +-
 drivers/dax/super.c   |  14 +-
 drivers/md/dm-linear.c|  15 ++-
 drivers/md/dm-log-writes.c|  15 ++-
 drivers/md/dm-stripe.c|  15 ++-
 drivers/md/dm-target.c|   4 +-
 drivers/md/dm-writecache.c|   7 +-
 drivers/md/dm.c   |  25 +++-
 drivers/nvdimm/pmem.c | 207 +-
 drivers/nvdimm/pmem.h |   6 +-
 drivers/s390/block/dcssblk.c  |   9 +-
 fs/dax.c  |  22 +++-
 fs/fuse/dax.c |   4 +-
 include/linux/dax.h   |  22 +++-
 include/linux/device-mapper.h |  13 +-
 include/linux/set_memory.h|  10 +-
 tools/testing/nvdimm/pmem-dax.c   |   3 +-
 20 files changed, 351 insertions(+), 150 deletions(-)


base-commit: 028192fea1de083f4f12bfb1eb7c4d7beb5c8ecd
-- 
2.18.4




Re: [dm-devel] [PATCH 08/27] btrfs: use bdev_max_active_zones instead of open coding it

2022-04-19 Thread Anand Jain

On 4/15/22 12:52, Christoph Hellwig wrote:

Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
Acked-by: David Sterba 


LGTM.

Reviewed-by: Anand Jain 



---
  fs/btrfs/zoned.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 1b1b310c3c510..f72cad7391a11 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -350,7 +350,6 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, 
bool populate_cache)
struct btrfs_fs_info *fs_info = device->fs_info;
struct btrfs_zoned_device_info *zone_info = NULL;
struct block_device *bdev = device->bdev;
-   struct request_queue *queue = bdev_get_queue(bdev);
unsigned int max_active_zones;
unsigned int nactive;
sector_t nr_sectors;
@@ -410,7 +409,7 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, 
bool populate_cache)
if (!IS_ALIGNED(nr_sectors, zone_sectors))
zone_info->nr_zones++;
  
-	max_active_zones = queue_max_active_zones(queue);
+   max_active_zones = bdev_max_active_zones(bdev);
if (max_active_zones && max_active_zones < BTRFS_MIN_ACTIVE_ZONES) {
btrfs_err_in_rcu(fs_info,
  "zoned: %s: max active zones %u is too small, need at least %u active zones",





[dm-devel] [PATCH 3/3] dm: verity-loadpin: Use CONFIG_SECURITY_LOADPIN_VERITY for conditional compilation

2022-04-19 Thread Matthias Kaehlcke
The verity glue for LoadPin is only needed when CONFIG_SECURITY_LOADPIN_VERITY
is set; use this option for conditional compilation instead of the combination
of CONFIG_DM_VERITY and CONFIG_SECURITY_LOADPIN.

Signed-off-by: Matthias Kaehlcke 
---

 drivers/md/Makefile   | 7 +--
 include/linux/dm-verity-loadpin.h | 2 +-
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index e12cd004d375..a96441752ec7 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -83,6 +83,7 @@ obj-$(CONFIG_DM_LOG_WRITES)   += dm-log-writes.o
 obj-$(CONFIG_DM_INTEGRITY) += dm-integrity.o
 obj-$(CONFIG_DM_ZONED) += dm-zoned.o
 obj-$(CONFIG_DM_WRITECACHE)+= dm-writecache.o
+obj-$(CONFIG_SECURITY_LOADPIN_VERITY)  += dm-verity-loadpin.o
 
 ifeq ($(CONFIG_DM_INIT),y)
 dm-mod-objs+= dm-init.o
@@ -100,12 +101,6 @@ ifeq ($(CONFIG_IMA),y)
 dm-mod-objs+= dm-ima.o
 endif
 
-ifeq ($(CONFIG_DM_VERITY),y)
-ifeq ($(CONFIG_SECURITY_LOADPIN),y)
-dm-mod-objs+= dm-verity-loadpin.o
-endif
-endif
-
 ifeq ($(CONFIG_DM_VERITY_FEC),y)
 dm-verity-objs += dm-verity-fec.o
 endif
diff --git a/include/linux/dm-verity-loadpin.h 
b/include/linux/dm-verity-loadpin.h
index 12a86911d05a..be63ac76f98d 100644
--- a/include/linux/dm-verity-loadpin.h
+++ b/include/linux/dm-verity-loadpin.h
@@ -13,7 +13,7 @@ struct trusted_root_digest {
struct list_head node;
 };
 
-#if IS_ENABLED(CONFIG_SECURITY_LOADPIN) && IS_BUILTIN(CONFIG_DM_VERITY)
+#if IS_ENABLED(CONFIG_SECURITY_LOADPIN_VERITY)
 void dm_verity_loadpin_set_trusted_root_digests(struct list_head *digests);
 bool dm_verity_loadpin_is_md_trusted(struct mapped_device *md);
 #else
-- 
2.36.0.rc0.470.gd361397f0d-goog




[dm-devel] [PATCH 1/3] dm: Add verity helpers for LoadPin

2022-04-19 Thread Matthias Kaehlcke
LoadPin limits loading of kernel modules, firmware and certain
other files to a 'pinned' file system (typically a read-only
rootfs). To provide more flexibility LoadPin is being extended
to also allow loading these files from trusted dm-verity
devices. For that purpose LoadPin can be provided with a list
of verity root digests that it should consider as trusted.

Add a bunch of helpers to allow LoadPin to check whether a DM
device is a trusted verity device. The new functions broadly
fall in two categories: those that need access to verity
internals (like the root digest), and the 'glue' between
LoadPin and verity. The new file dm-verity-loadpin.c contains
the glue functions.

Signed-off-by: Matthias Kaehlcke 
---

 drivers/md/Makefile   |  6 +++
 drivers/md/dm-verity-loadpin.c| 80 +++
 drivers/md/dm-verity-target.c | 34 +
 drivers/md/dm-verity.h|  4 ++
 include/linux/dm-verity-loadpin.h | 27 +++
 5 files changed, 151 insertions(+)
 create mode 100644 drivers/md/dm-verity-loadpin.c
 create mode 100644 include/linux/dm-verity-loadpin.h

diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 0454b0885b01..e12cd004d375 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -100,6 +100,12 @@ ifeq ($(CONFIG_IMA),y)
 dm-mod-objs+= dm-ima.o
 endif
 
+ifeq ($(CONFIG_DM_VERITY),y)
+ifeq ($(CONFIG_SECURITY_LOADPIN),y)
+dm-mod-objs+= dm-verity-loadpin.o
+endif
+endif
+
 ifeq ($(CONFIG_DM_VERITY_FEC),y)
 dm-verity-objs += dm-verity-fec.o
 endif
diff --git a/drivers/md/dm-verity-loadpin.c b/drivers/md/dm-verity-loadpin.c
new file mode 100644
index ..972ca93a2231
--- /dev/null
+++ b/drivers/md/dm-verity-loadpin.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include 
+#include 
+#include 
+
+#include "dm.h"
+#include "dm-verity.h"
+
+static struct list_head *trusted_root_digests;
+
+/*
+ * Sets the root digests of verity devices which LoadPin considers as trusted.
+ *
+ * This function must only be called once.
+ */
+void dm_verity_loadpin_set_trusted_root_digests(struct list_head *digests)
+{
+   if (!trusted_root_digests)
+   trusted_root_digests = digests;
+   else
+   pr_warn("verity root digests trusted by LoadPin are already 
set!!!\n");
+}
+
+static bool is_trusted_verity_target(struct dm_target *ti)
+{
+   u8 *root_digest;
+   unsigned int digest_size;
+   struct trusted_root_digest *trd;
+   bool trusted = false;
+
+   if (!dm_is_verity_target(ti))
+   return false;
+
+   if (dm_verity_get_root_digest(ti, &root_digest, &digest_size))
+   return false;
+
+   list_for_each_entry(trd, trusted_root_digests, node) {
+   if ((trd->len == digest_size) &&
+   !memcmp(trd->data, root_digest, digest_size)) {
+   trusted = true;
+   break;
+   }
+   }
+
+   kfree(root_digest);
+
+   return trusted;
+}
+
+/*
+ * Determines whether a mapped device is a verity device that is trusted
+ * by LoadPin.
+ */
+bool dm_verity_loadpin_is_md_trusted(struct mapped_device *md)
+{
+   int srcu_idx;
+   struct dm_table *table;
+   unsigned int num_targets;
+   bool trusted = false;
+   int i;
+
+   if (!trusted_root_digests || list_empty(trusted_root_digests))
+   return false;
+
+   table = dm_get_live_table(md, &srcu_idx);
+   num_targets = dm_table_get_num_targets(table);
+   for (i = 0; i < num_targets; i++) {
+   struct dm_target *ti = dm_table_get_target(table, i);
+
+   if (is_trusted_verity_target(ti)) {
+   trusted = true;
+   break;
+   }
+   }
+
+   dm_put_live_table(md, srcu_idx);
+
+   return trusted;
+}
diff --git a/drivers/md/dm-verity-target.c b/drivers/md/dm-verity-target.c
index 80133aae0db3..6bea9692cd39 100644
--- a/drivers/md/dm-verity-target.c
+++ b/drivers/md/dm-verity-target.c
@@ -16,9 +16,11 @@
 #include "dm-verity.h"
 #include "dm-verity-fec.h"
 #include "dm-verity-verify-sig.h"
+#include 
 #include 
 #include 
 #include 
+#include 
 
 #define DM_MSG_PREFIX  "verity"
 
@@ -1310,6 +1312,38 @@ static int verity_ctr(struct dm_target *ti, unsigned 
argc, char **argv)
return r;
 }
 
+/*
+ * Check whether a DM target is a verity target.
+ */
+bool dm_is_verity_target(struct dm_target *ti)
+{
+   return ti->type->module == THIS_MODULE;
+}
+EXPORT_SYMBOL_GPL(dm_is_verity_target);
+
+/*
+ * Get the root digest of a verity target.
+ *
+ * Returns a copy of the root digests, the caller is responsible for
+ * freeing the memory of the digest.
+ */
+int dm_verity_get_root_digest(struct dm_target *ti, u8 **root_digest, unsigned int *digest_size)
+{
+   struct dm_verity *v = ti->private;
+
+   if (!dm_is_verity_target(ti))
+  

Re: [dm-devel] [PATCH 10/11] rnbd-srv: use bdev_discard_alignment

2022-04-19 Thread Jinpu Wang
On Mon, Apr 18, 2022 at 6:53 AM Christoph Hellwig  wrote:
>
> Use bdev_discard_alignment to calculate the correct discard alignment
> offset even for partitions instead of just looking at the queue limit.
>
> Signed-off-by: Christoph Hellwig 
Thx!
Acked-by: Jack Wang 
> ---
>  drivers/block/rnbd/rnbd-srv-dev.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/block/rnbd/rnbd-srv-dev.h 
> b/drivers/block/rnbd/rnbd-srv-dev.h
> index d080a0de59225..4309e52524691 100644
> --- a/drivers/block/rnbd/rnbd-srv-dev.h
> +++ b/drivers/block/rnbd/rnbd-srv-dev.h
> @@ -59,7 +59,7 @@ static inline int rnbd_dev_get_discard_granularity(const 
> struct rnbd_dev *dev)
>
>  static inline int rnbd_dev_get_discard_alignment(const struct rnbd_dev *dev)
>  {
> -   return bdev_get_queue(dev->bdev)->limits.discard_alignment;
> +   return bdev_discard_alignment(dev->bdev);
>  }
>
>  #endif /* RNBD_SRV_DEV_H */
> --
> 2.30.2
>




Re: [dm-devel] [f2fs-dev] [PATCH 26/27] block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD

2022-04-19 Thread Chao Yu

On 2022/4/15 12:52, Christoph Hellwig wrote:

Secure erase is a very different operation from discard in that it is
a data integrity operation vs hint.  Fully split the limits and helper
infrastructure to make the separation more clear.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Acked-by: Christoph Böhmwalder  [drbd]
Acked-by: Ryusuke Konishi  [nilfs2]
Acked-by: Jaegeuk Kim  [f2fs]
Acked-by: Coly Li  [bcache]
Acked-by: David Sterba  [btrfs]


For f2fs part,

Acked-by: Chao Yu 

Thanks,



[dm-devel] ext4_writepages: jbd2_start: 5120 pages, ino 11; err -5

2022-04-19 Thread Donald Buczek

We have a cluster scheduler which provides each cluster job with a private scratch
filesystem (TMPDIR). These are created when a job starts and removed when a job
completes. The setup works by fallocate, losetup, mkfs.ext4, mkdir, mount, "losetup
-d", and rm; the teardown just does a umount and rmdir.

This works but there is one nuisance: The systems usually have a lot of memory 
and some jobs write a lot of data to their scratch filesystems. So when a job 
finishes, there often is a lot to sync by umount which sometimes takes many 
minutes and wastes a lot of I/O bandwidth. Additionally, the reserved space 
can't be returned and reused until the umount is finished and the backing file 
is deleted.

So I was looking for a way to avoid that but didn't find something 
straightforward. The workaround I've found so far is using a dm-device (linear 
target) between the filesystem and the loop device and then use this sequence 
for teardown:

- ioctl EXT4_IOC_SHUTDOWN with EXT4_GOING_FLAGS_NOLOGFLUSH
- dmsetup reload $dmname --table "0 $sectors zero"
- dmsetup resume $dmname --noflush
- umount $mountpoint
- dmsetup remove --deferred $dmname
- rmdir $mountpoint

This seems to do what I want. The unnecessary flushing of the temporary data is 
redirected from the backing file into the zero target and it works really fast. 
There is one remaining problem though, which might be just a cosmetic one: 
Although ext4 is shut down to prevent it from writing, I sometimes get the 
error message from the subject in the logs:

[2963044.462043] EXT4-fs (dm-1): mounted filesystem without journal. Opts: 
(null)
[2963044.686994] EXT4-fs (dm-0): mounted filesystem without journal. Opts: 
(null)
[2963044.728391] EXT4-fs (dm-2): mounted filesystem without journal. Opts: 
(null)
[2963055.585198] EXT4-fs (dm-2): shut down requested (2)
[2963064.821246] EXT4-fs (dm-2): mounted filesystem without journal. Opts: 
(null)
[2963074.838259] EXT4-fs (dm-2): shut down requested (2)
[2963095.979089] EXT4-fs (dm-0): shut down requested (2)
[2963096.066376] EXT4-fs (dm-0): ext4_writepages: jbd2_start: 5120 pages, ino 
11; err -5
[2963108.636648] EXT4-fs (dm-0): mounted filesystem without journal. Opts: 
(null)
[2963125.194740] EXT4-fs (dm-0): shut down requested (2)
[2963166.708088] EXT4-fs (dm-1): shut down requested (2)
[2963169.334437] EXT4-fs (dm-0): mounted filesystem without journal. Opts: 
(null)
[2963227.515974] EXT4-fs (dm-0): shut down requested (2)
[2966222.515143] EXT4-fs (dm-0): mounted filesystem without journal. Opts: 
(null)
[2966222.523390] EXT4-fs (dm-1): mounted filesystem without journal. Opts: 
(null)
[2966222.598071] EXT4-fs (dm-2): mounted filesystem without journal. Opts: 
(null)

So I'd like to ask a few questions:

- Is this error message expected or is it a bug?
- Can it be ignored or is there a leak or something on that error path.
- Is there a better way to do what I want? Something I've overlooked?
- I'm considering creating a new dm target or adding an option to an existing one,
because "zero" underneath a filesystem asks for problems (a filesystem expects to read
back the data that it wrote), and the "error" target would trigger lots of errors during
the writeback attempts. What I really want is a target which silently discards writes and
returns errors on reads. Any opinion about that?
- But using devicemapper to eat away the I/O is also just a workaround for the
fact that we can't pass some flag to umount to say that we are okay to lose
all data and leave the filesystem in a corrupted state if this was the last
reference to it. Would this be a useful feature?

Best
  Donald
--
Donald Buczek
buc...@molgen.mpg.de




[dm-devel] [PATCH 2/3] LoadPin: Enable loading from trusted dm-verity devices

2022-04-19 Thread Matthias Kaehlcke
Extend LoadPin to allow loading of kernel files from trusted
dm-verity [1] devices.

This change adds the concept of trusted verity devices to
LoadPin. Userspace can use the new sysctl file
'loadpin/trusted_verity_root_digests' to provide LoadPin
with a list of root digests from dm-verity devices that
LoadPin should consider as trusted. When a kernel file is
read, LoadPin first checks (as usual) whether the file is
located on the pinned root; if so, the file can be loaded.
Otherwise, if the verity extension is enabled, LoadPin
determines whether the file is located on a verity backed
device and whether the root digest of that device is in
the list of trusted digests. The file can be loaded if
the verity device has a trusted root digest.

The list of trusted root digests can only be written once
(typically at boot time), to limit the possibility of
attackers setting up rogue verity devices and marking them
as trusted.
Background:

As of now LoadPin restricts loading of kernel files to a single
pinned filesystem, typically the rootfs. This works for many
systems, however it can result in a bloated rootfs (and OTA
updates) on platforms where multiple boards with different
hardware configurations use the same rootfs image. Especially
when 'optional' files are large it may be preferable to
download/install them only when they are actually needed by a
given board. Chrome OS uses Downloadable Content (DLC) [2] to
deploy certain 'packages' at runtime. As an example a DLC
package could contain firmware for a peripheral that is not
present on all boards. DLCs use dm-verity to verify the
integrity of the DLC content.

[1] https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/verity.html
[2] 
https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/dlcservice/docs/developer.md

Signed-off-by: Matthias Kaehlcke 
---

 security/loadpin/Kconfig   |  11 +++
 security/loadpin/loadpin.c | 168 -
 2 files changed, 178 insertions(+), 1 deletion(-)

diff --git a/security/loadpin/Kconfig b/security/loadpin/Kconfig
index 91be65dec2ab..cf3e6431e02d 100644
--- a/security/loadpin/Kconfig
+++ b/security/loadpin/Kconfig
@@ -18,3 +18,14 @@ config SECURITY_LOADPIN_ENFORCE
  If selected, LoadPin will enforce pinning at boot. If not
  selected, it can be enabled at boot with the kernel parameter
  "loadpin.enforce=1".
+
+config SECURITY_LOADPIN_VERITY
+   bool "Allow reading files from certain other filesystems that use 
dm-verity"
+   depends on DM_VERITY=y
+   help
+ If selected LoadPin can allow reading files from filesystems
+ that use dm-verity. A verity filesystem can be configured as
+ being trusted by LoadPin by writing its root digest to the
+ sysctl file 'trusted_verity_root_digests'. The sysctl file
+ can only be written once (typically at boot) and accepts a
+ list of comma separated digests.
diff --git a/security/loadpin/loadpin.c b/security/loadpin/loadpin.c
index b12f7d986b1e..b9c174f8687d 100644
--- a/security/loadpin/loadpin.c
+++ b/security/loadpin/loadpin.c
@@ -18,6 +18,8 @@
 #include 
 #include/* current */
 #include 
+#include 
+#include 
 
 static void report_load(const char *origin, struct file *file, char *operation)
 {
@@ -43,6 +45,10 @@ static char *exclude_read_files[READING_MAX_ID];
 static int ignore_read_file_id[READING_MAX_ID] __ro_after_init;
 static struct super_block *pinned_root;
 static DEFINE_SPINLOCK(pinned_root_spinlock);
+#ifdef CONFIG_SECURITY_LOADPIN_VERITY
+static bool verity_digests_set;
+static LIST_HEAD(trusted_verity_root_digests);
+#endif
 
 #ifdef CONFIG_SYSCTL
 
@@ -65,6 +71,144 @@ static struct ctl_table loadpin_sysctl_table[] = {
{ }
 };
 
+#ifdef CONFIG_SECURITY_LOADPIN_VERITY
+
+static int proc_verity_root_digests(struct ctl_table *table, int write,
+   void *buffer, size_t *lenp, loff_t *ppos)
+{
+   struct trusted_root_digest *trd;
+   char *buf;
+   int rc;
+
+   if (write) {
+   struct ctl_table tbl = *table;
+   char *p, *d;
+
+   if (*ppos)
+   return -EINVAL;
+
+   if (verity_digests_set)
+   return -EPERM;
+
+   buf = kzalloc(tbl.maxlen, GFP_KERNEL);
+   if (!buf)
+   return -ENOMEM;
+
+   tbl.data = buf;
+
+   rc = proc_dostring(&tbl, write, buffer, lenp, ppos);
+   if (rc) {
+   kfree(buf);
+   return rc;
+   }
+
+   if (strlen(buf) == 0) {
+   verity_digests_set = true;
+   return false;
+   }
+
+   p = buf;
+   while ((d = strsep(&p, ",")) != NULL) {
+   int len = strlen(d);
+
+   if (len % 2) {
+   rc = -EINVAL;
+   
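
The rest of the handler is cut off above. The consumer side of the
dm-verity glue is worth sketching: roughly, LoadPin maps a file's
superblock to its device-mapper device and asks the patch 1/3 helper
whether that device is a trusted verity target. A minimal sketch under
that assumption (the function name is illustrative; dm_get_md() and
dm_put() are existing DM APIs):

	static bool loadpin_is_fs_trusted(struct super_block *sb)
	{
		struct mapped_device *md = dm_get_md(sb->s_bdev->bd_dev);
		bool trusted;

		if (!md)
			return false;

		trusted = dm_verity_loadpin_is_md_trusted(md);
		dm_put(md);

		return trusted;
	}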

[dm-devel] [PATCH 0/3] LoadPin: Enable loading from trusted dm-verity devices

2022-04-19 Thread Matthias Kaehlcke
As of now LoadPin restricts loading of kernel files to a single
pinned filesystem, typically the rootfs. This works for many
systems, however it can result in a bloated rootfs (and OTA
updates) on platforms where multiple boards with different
hardware configurations use the same rootfs image. Especially
when 'optional' files are large it may be preferable to
download/install them only when they are actually needed by a
given board. Chrome OS uses Downloadable Content (DLC) [1] to
deploy certain 'packages' at runtime. As an example a DLC
package could contain firmware for a peripheral that is not
present on all boards. DLCs use dm-verity [2] to verify the
integrity of the DLC content.

This series extends LoadPin to allow loading of kernel files
from trusted dm-verity devices. It adds the concept of
trusted verity devices to LoadPin. Userspace can use the
new sysctl file 'loadpin/trusted_verity_root_digests' to
provide LoadPin with a list of root digests from dm-verity
devices that LoadPin should consider as trusted. When a
kernel file is read, LoadPin first checks (as usual) whether
the file is located on the pinned root; if so, the file can
be loaded. Otherwise, if the verity extension is enabled,
LoadPin determines whether the file is located on a verity
backed device and whether the root digest of that device
is in the list of trusted digests. The file can be loaded
if the verity device has a trusted root digest.

The list of trusted root digests can only be written once
(typically at boot time), to limit the possibility of
attackers setting up rogue verity devices and marking them
as trusted.

[1] 
https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/dlcservice/docs/developer.md
[2] https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/verity.html


Matthias Kaehlcke (3):
  dm: Add verity helpers for LoadPin
  LoadPin: Enable loading from trusted dm-verity devices
  dm: verity-loadpin: Use CONFIG_SECURITY_LOADPIN_VERITY for conditional
compilation

 drivers/md/Makefile   |   1 +
 drivers/md/dm-verity-loadpin.c|  80 ++
 drivers/md/dm-verity-target.c |  34 ++
 drivers/md/dm-verity.h|   4 +
 include/linux/dm-verity-loadpin.h |  27 +
 security/loadpin/Kconfig  |  11 ++
 security/loadpin/loadpin.c| 168 +-
 7 files changed, 324 insertions(+), 1 deletion(-)
 create mode 100644 drivers/md/dm-verity-loadpin.c
 create mode 100644 include/linux/dm-verity-loadpin.h

-- 
2.36.0.rc0.470.gd361397f0d-goog




Re: [dm-devel] [PATCH 0/3] LoadPin: Enable loading from trusted dm-verity devices

2022-04-19 Thread Matthias Kaehlcke
Hi Kees,

On Mon, Apr 18, 2022 at 03:14:14PM -0700, Kees Cook wrote:
> [oops, resending to actual CC list]
> 
> On Mon, Apr 18, 2022 at 02:15:56PM -0700, Matthias Kaehlcke wrote:
> > This series extends LoadPin to allow loading of kernel files
> > from trusted dm-verity devices. It adds the concept of
> > trusted verity devices to LoadPin. Userspace can use the
> > new systl file 'loadpin/trusted_verity_root_digests' to
> > provide LoadPin with a list of root digests from dm-verity
> > devices that LoadPin should consider as trusted. When a
> > kernel file is read LoadPin first checks (as usual) whether
> > the file is located on the pinned root, if so the file can
> > be loaded. Otherwise, if the verity extension is enabled,
> > LoadPin determines whether the file is located on a verity
> > backed device and whether the root digest of that device
> > is in the list of trusted digests. The file can be loaded
> > if the verity device has a trusted root digest.
> > 
> > The list of trusted root digests can only be written once
> > (typically at boot time), to limit the possiblity of
> > attackers setting up rogue verity devices and marking them
> > as trusted.


> Thanks for working all this out! Where does the list of trusted
> roothashes come from? I assume some chain of trust exists. Is the list
> maybe already stored on the rootfs?

Yes, at least the content of the list comes from the rootfs. The
userspace part is still TBD (also pending on the evolution of this
patchset); having the list pre-formatted in a single file on the
rootfs should be fine.

> It'd be nice if there was some way to pass the trust chain to LoadPin
> more directly.

I imagine you envision LoadPin reading the file itself, instead of
just processing the content. That should be doable. One option would
be to pass the path of the file with the hashes through the sysctl
file and use kernel_read_file_from_path() to read it if the path is
in the pinned root (or maybe even in any trusted file system ;-)




[dm-devel] [PATCH] dm: fix bio length of empty flush

2022-04-19 Thread Shin'ichiro Kawasaki
Commit 92986f6b4c8a ("dm: use bio_clone_fast in alloc_io/alloc_tio")
removed the bio_clone_fast() call from alloc_tio() when ci->io->tio is
available. In this case, ci->bio is not copied to ci->io->tio.clone.
This is fine since init_clone_info() sets the same values to ci->bio and
ci->io->tio.clone.

However, when incoming bios have the REQ_PREFLUSH flag, __send_empty_flush()
prepares a zero length bio on the stack and sets it as ci->bio. At this time,
ci->io->tio.clone still keeps a non-zero length. When alloc_tio() chooses
this ci->io->tio.clone as the bio to map, it is passed to targets as a
non-empty flush bio. This causes a bio length check failure in dm-zoned and
unexpected operation such as a dm_accept_partial_bio() call.

To avoid the non-empty flush bio, set the length of ci->io->tio.clone
to zero in __send_empty_flush().

Fixes: 92986f6b4c8a ("dm: use bio_clone_fast in alloc_io/alloc_tio")
Signed-off-by: Shin'ichiro Kawasaki 
---
 drivers/md/dm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3c5fad7c4ee6..f2397546b93f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1392,6 +1392,7 @@ static void __send_empty_flush(struct clone_info *ci)
 
	ci->bio = &flush_bio;
ci->sector_count = 0;
+   ci->io->tio.clone.bi_iter.bi_size = 0;
 
while ((ti = dm_table_get_target(ci->map, target_nr++)))
__send_duplicate_bios(ci, ti, ti->num_flush_bios, NULL);
-- 
2.35.1
