[dm-devel] [PATCH v8 3/7] mce: fix set_mce_nospec to always unmap the whole page
The set_memory_uc() approach doesn't work well in all cases. As Dan pointed out when "The VMM unmapped the bad page from guest physical space and passed the machine check to the guest." "The guest gets virtual #MC on an access to that page. When the guest tries to do set_memory_uc() and instructs cpa_flush() to do clean caches that results in taking another fault / exception perhaps because the VMM unmapped the page from the guest." Since the driver has special knowledge to handle NP or UC, mark the poisoned page with NP and let driver handle it when it comes down to repair. Please refer to discussions here for more details. https://lore.kernel.org/all/capcyv4hrxpb1tasbzug-ggdvs0oofkxmxlihmktg_kfi7yb...@mail.gmail.com/ Now since poisoned page is marked as not-present, in order to avoid writing to a not-present page and trigger kernel Oops, also fix pmem_do_write(). Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()") Signed-off-by: Jane Chu --- arch/x86/kernel/cpu/mce/core.c | 6 +++--- arch/x86/mm/pat/set_memory.c | 23 +++ drivers/nvdimm/pmem.c | 30 +++--- include/linux/set_memory.h | 4 ++-- 4 files changed, 23 insertions(+), 40 deletions(-) diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 981496e6bc0e..fa67bb9d1afe 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -579,7 +579,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val, pfn = mce->addr >> PAGE_SHIFT; if (!memory_failure(pfn, 0)) { - set_mce_nospec(pfn, whole_page(mce)); + set_mce_nospec(pfn); mce->kflags |= MCE_HANDLED_UC; } @@ -1316,7 +1316,7 @@ static void kill_me_maybe(struct callback_head *cb) ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags); if (!ret) { - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); + set_mce_nospec(p->mce_addr >> PAGE_SHIFT); sync_core(); return; } @@ -1342,7 +1342,7 @@ static void kill_me_never(struct callback_head *cb) p->mce_count = 0; pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr); if (!memory_failure(p->mce_addr >> PAGE_SHIFT, 0)) - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); + set_mce_nospec(p->mce_addr >> PAGE_SHIFT); } static void queue_task_work(struct mce *m, char *msg, void (*func)(struct callback_head *)) diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index 978cf5bd2ab6..e3a5e55f3e08 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -1925,13 +1925,8 @@ int set_memory_wb(unsigned long addr, int numpages) } EXPORT_SYMBOL(set_memory_wb); -/* - * Prevent speculative access to the page by either unmapping - * it (if we do not require access to any part of the page) or - * marking it uncacheable (if we want to try to retrieve data - * from non-poisoned lines in the page). - */ -int set_mce_nospec(unsigned long pfn, bool unmap) +/* Prevent speculative access to a page by marking it not-present */ +int set_mce_nospec(unsigned long pfn) { unsigned long decoy_addr; int rc; @@ -1956,19 +1951,23 @@ int set_mce_nospec(unsigned long pfn, bool unmap) */ decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63)); - if (unmap) - rc = set_memory_np(decoy_addr, 1); - else - rc = set_memory_uc(decoy_addr, 1); + rc = set_memory_np(decoy_addr, 1); if (rc) pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn); return rc; } +static int set_memory_present(unsigned long *addr, int numpages) +{ + return change_page_attr_set(addr, numpages, __pgprot(_PAGE_PRESENT), 0); +} + /* Restore full speculative operation to the pfn. */ int clear_mce_nospec(unsigned long pfn) { - return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1); + unsigned long addr = (unsigned long) pfn_to_kaddr(pfn); + + return set_memory_present(, 1); } EXPORT_SYMBOL_GPL(clear_mce_nospec); diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 58d95242a836..4aa17132a557 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -158,36 +158,20 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem, struct page *page, unsigned int page_off, sector_t sector, unsigned int len) { - blk_status_t rc = BLK_STS_OK; - bool bad_pmem = false; phys_addr_t pmem_off = sector * 512 + pmem->data_offset; void *pmem_addr = pmem->virt_addr + pmem_off; - if (unlikely(is_bad_pmem(>bb, sector, len))) - bad_pmem = true; + if (unlikely(is_bad_pmem(>bb, sector, len))) { +
[dm-devel] [PATCH v8 1/7] acpi/nfit: rely on mce->misc to determine poison granularity
nfit_handle_mec() hardcode poison granularity at L1_CACHE_BYTES. Instead, let the driver rely on mce->misc register to determine the poison granularity. Suggested-by: Dan Williams Signed-off-by: Jane Chu --- drivers/acpi/nfit/mce.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c index ee8d9973f60b..d48a388b796e 100644 --- a/drivers/acpi/nfit/mce.c +++ b/drivers/acpi/nfit/mce.c @@ -32,6 +32,7 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val, */ mutex_lock(_desc_lock); list_for_each_entry(acpi_desc, _descs, list) { + unsigned int align = 1UL << MCI_MISC_ADDR_LSB(mce->misc); struct device *dev = acpi_desc->dev; int found_match = 0; @@ -63,8 +64,7 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val, /* If this fails due to an -ENOMEM, there is little we can do */ nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus, - ALIGN(mce->addr, L1_CACHE_BYTES), - L1_CACHE_BYTES); + ALIGN_DOWN(mce->addr, align), align); nvdimm_region_notify(nfit_spa->nd_region, NVDIMM_REVALIDATE_POISON); -- 2.18.4 -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
[PATCH v8 6/7] pmem: refactor pmem_clear_poison()
Refactor the pmem_clear_poison() function such that the common shared code between the typical write path and the recovery write path is factored out. Reviewed-by: Christoph Hellwig Reviewed-by: Dan Williams Signed-off-by: Jane Chu --- drivers/nvdimm/pmem.c | 73 --- 1 file changed, 48 insertions(+), 25 deletions(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 3c0cad38ec33..c3772304d417 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -45,9 +45,25 @@ static struct nd_region *to_region(struct pmem_device *pmem) return to_nd_region(to_dev(pmem)->parent); } -static void hwpoison_clear(struct pmem_device *pmem, - phys_addr_t phys, unsigned int len) +static phys_addr_t to_phys(struct pmem_device *pmem, phys_addr_t offset) { + return pmem->phys_addr + offset; +} + +static sector_t to_sect(struct pmem_device *pmem, phys_addr_t offset) +{ + return (offset - pmem->data_offset) >> SECTOR_SHIFT; +} + +static phys_addr_t to_offset(struct pmem_device *pmem, sector_t sector) +{ + return (sector << SECTOR_SHIFT) + pmem->data_offset; +} + +static void pmem_mkpage_present(struct pmem_device *pmem, phys_addr_t offset, + unsigned int len) +{ + phys_addr_t phys = to_phys(pmem, offset); unsigned long pfn_start, pfn_end, pfn; /* only pmem in the linear map supports HWPoison */ @@ -69,33 +85,40 @@ static void hwpoison_clear(struct pmem_device *pmem, } } -static blk_status_t pmem_clear_poison(struct pmem_device *pmem, - phys_addr_t offset, unsigned int len) +static void pmem_clear_bb(struct pmem_device *pmem, sector_t sector, long blks) { - struct device *dev = to_dev(pmem); - sector_t sector; - long cleared; - blk_status_t rc = BLK_STS_OK; + if (blks == 0) + return; + badblocks_clear(>bb, sector, blks); + if (pmem->bb_state) + sysfs_notify_dirent(pmem->bb_state); +} - sector = (offset - pmem->data_offset) / 512; +static long __pmem_clear_poison(struct pmem_device *pmem, + phys_addr_t offset, unsigned int len) +{ + phys_addr_t phys = to_phys(pmem, offset); + long cleared = nvdimm_clear_poison(to_dev(pmem), phys, len); - cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len); - if (cleared < len) - rc = BLK_STS_IOERR; - if (cleared > 0 && cleared / 512) { - hwpoison_clear(pmem, pmem->phys_addr + offset, cleared); - cleared /= 512; - dev_dbg(dev, "%#llx clear %ld sector%s\n", - (unsigned long long) sector, cleared, - cleared > 1 ? "s" : ""); - badblocks_clear(>bb, sector, cleared); - if (pmem->bb_state) - sysfs_notify_dirent(pmem->bb_state); + if (cleared > 0) { + pmem_mkpage_present(pmem, offset, cleared); + arch_invalidate_pmem(pmem->virt_addr + offset, len); } + return cleared; +} - arch_invalidate_pmem(pmem->virt_addr + offset, len); +static blk_status_t pmem_clear_poison(struct pmem_device *pmem, + phys_addr_t offset, unsigned int len) +{ + long cleared = __pmem_clear_poison(pmem, offset, len); - return rc; + if (cleared < 0) + return BLK_STS_IOERR; + + pmem_clear_bb(pmem, to_sect(pmem, offset), cleared >> SECTOR_SHIFT); + if (cleared < len) + return BLK_STS_IOERR; + return BLK_STS_OK; } static void write_pmem(void *pmem_addr, struct page *page, @@ -143,7 +166,7 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem, sector_t sector, unsigned int len) { blk_status_t rc; - phys_addr_t pmem_off = sector * 512 + pmem->data_offset; + phys_addr_t pmem_off = to_offset(pmem, sector); void *pmem_addr = pmem->virt_addr + pmem_off; if (unlikely(is_bad_pmem(>bb, sector, len))) @@ -158,7 +181,7 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem, struct page *page, unsigned int page_off, sector_t sector, unsigned int len) { - phys_addr_t pmem_off = sector * 512 + pmem->data_offset; + phys_addr_t pmem_off = to_offset(pmem, sector); void *pmem_addr = pmem->virt_addr + pmem_off; if (unlikely(is_bad_pmem(>bb, sector, len))) { -- 2.18.4
[dm-devel] [PATCH v8 4/7] dax: introduce DAX_RECOVERY_WRITE dax access mode
Up till now, dax_direct_access() is used implicitly for normal access, but for the purpose of recovery write, dax range with poison is requested. To make the interface clear, introduce enum dax_access_mode { DAX_ACCESS, DAX_RECOVERY_WRITE, } where DAX_ACCESS is used for normal dax access, and DAX_RECOVERY_WRITE is used for dax recovery write. Suggested-by: Dan Williams Signed-off-by: Jane Chu --- drivers/dax/super.c | 5 ++-- drivers/md/dm-linear.c | 5 ++-- drivers/md/dm-log-writes.c | 5 ++-- drivers/md/dm-stripe.c | 5 ++-- drivers/md/dm-target.c | 4 ++- drivers/md/dm-writecache.c | 7 +++--- drivers/md/dm.c | 5 ++-- drivers/nvdimm/pmem.c | 44 + drivers/nvdimm/pmem.h | 5 +++- drivers/s390/block/dcssblk.c| 9 --- fs/dax.c| 9 --- fs/fuse/dax.c | 4 +-- include/linux/dax.h | 9 +-- include/linux/device-mapper.h | 4 ++- tools/testing/nvdimm/pmem-dax.c | 3 ++- 15 files changed, 85 insertions(+), 38 deletions(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 0211e6f7b47a..5405eb553430 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -117,6 +117,7 @@ enum dax_device_flags { * @dax_dev: a dax_device instance representing the logical memory range * @pgoff: offset in pages from the start of the device to translate * @nr_pages: number of consecutive pages caller can handle relative to @pfn + * @mode: indicator on normal access or recovery write * @kaddr: output parameter that returns a virtual address mapping of pfn * @pfn: output parameter that returns an absolute pfn translation of @pgoff * @@ -124,7 +125,7 @@ enum dax_device_flags { * pages accessible at the device relative @pgoff. */ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages, - void **kaddr, pfn_t *pfn) + enum dax_access_mode mode, void **kaddr, pfn_t *pfn) { long avail; @@ -138,7 +139,7 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages, return -EINVAL; avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages, - kaddr, pfn); + mode, kaddr, pfn); if (!avail) return -ERANGE; return min(avail, nr_pages); diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c index 76b486e4d2be..13e263299c9c 100644 --- a/drivers/md/dm-linear.c +++ b/drivers/md/dm-linear.c @@ -172,11 +172,12 @@ static struct dax_device *linear_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff) } static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, - long nr_pages, void **kaddr, pfn_t *pfn) + long nr_pages, enum dax_access_mode mode, void **kaddr, + pfn_t *pfn) { struct dax_device *dax_dev = linear_dax_pgoff(ti, ); - return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn); + return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn); } static int linear_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff, diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c index c9d036d6bb2e..06bdbed65eb1 100644 --- a/drivers/md/dm-log-writes.c +++ b/drivers/md/dm-log-writes.c @@ -889,11 +889,12 @@ static struct dax_device *log_writes_dax_pgoff(struct dm_target *ti, } static long log_writes_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, -long nr_pages, void **kaddr, pfn_t *pfn) + long nr_pages, enum dax_access_mode mode, void **kaddr, + pfn_t *pfn) { struct dax_device *dax_dev = log_writes_dax_pgoff(ti, ); - return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn); + return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn); } static int log_writes_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff, diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c index c81d331d1afe..77d72900e997 100644 --- a/drivers/md/dm-stripe.c +++ b/drivers/md/dm-stripe.c @@ -315,11 +315,12 @@ static struct dax_device *stripe_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff) } static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, - long nr_pages, void **kaddr, pfn_t *pfn) + long nr_pages, enum dax_access_mode mode, void **kaddr, + pfn_t *pfn) { struct dax_device *dax_dev = stripe_dax_pgoff(ti, ); - return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn); + return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn); } static int stripe_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff, diff --git a/drivers/md/dm-target.c
[PATCH v8 5/7] dax: add .recovery_write dax_operation
Introduce dax_recovery_write() operation. The function is used to recover a dax range that contains poison. Typical use case is when a user process receives a SIGBUS with si_code BUS_MCEERR_AR indicating poison(s) in a dax range, in response, the user process issues a pwrite() to the page-aligned dax range, thus clears the poison and puts valid data in the range. Signed-off-by: Jane Chu --- drivers/dax/super.c | 9 + drivers/md/dm-linear.c| 10 ++ drivers/md/dm-log-writes.c| 10 ++ drivers/md/dm-stripe.c| 10 ++ drivers/md/dm.c | 20 drivers/nvdimm/pmem.c | 7 +++ fs/dax.c | 13 - include/linux/dax.h | 13 + include/linux/device-mapper.h | 9 + 9 files changed, 100 insertions(+), 1 deletion(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 5405eb553430..50a08b2ec247 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -195,6 +195,15 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff, } EXPORT_SYMBOL_GPL(dax_zero_page_range); +size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff, + void *addr, size_t bytes, struct iov_iter *iter) +{ + if (!dax_dev->ops->recovery_write) + return 0; + return dax_dev->ops->recovery_write(dax_dev, pgoff, addr, bytes, iter); +} +EXPORT_SYMBOL_GPL(dax_recovery_write); + #ifdef CONFIG_ARCH_HAS_PMEM_API void arch_wb_cache_pmem(void *addr, size_t size); void dax_flush(struct dax_device *dax_dev, void *addr, size_t size) diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c index 13e263299c9c..cdf48bc8c5b0 100644 --- a/drivers/md/dm-linear.c +++ b/drivers/md/dm-linear.c @@ -188,9 +188,18 @@ static int linear_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff, return dax_zero_page_range(dax_dev, pgoff, nr_pages); } +static size_t linear_dax_recovery_write(struct dm_target *ti, pgoff_t pgoff, + void *addr, size_t bytes, struct iov_iter *i) +{ + struct dax_device *dax_dev = linear_dax_pgoff(ti, ); + + return dax_recovery_write(dax_dev, pgoff, addr, bytes, i); +} + #else #define linear_dax_direct_access NULL #define linear_dax_zero_page_range NULL +#define linear_dax_recovery_write NULL #endif static struct target_type linear_target = { @@ -208,6 +217,7 @@ static struct target_type linear_target = { .iterate_devices = linear_iterate_devices, .direct_access = linear_dax_direct_access, .dax_zero_page_range = linear_dax_zero_page_range, + .dax_recovery_write = linear_dax_recovery_write, }; int __init dm_linear_init(void) diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c index 06bdbed65eb1..22739dccdd17 100644 --- a/drivers/md/dm-log-writes.c +++ b/drivers/md/dm-log-writes.c @@ -905,9 +905,18 @@ static int log_writes_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff, return dax_zero_page_range(dax_dev, pgoff, nr_pages << PAGE_SHIFT); } +static size_t log_writes_dax_recovery_write(struct dm_target *ti, + pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i) +{ + struct dax_device *dax_dev = log_writes_dax_pgoff(ti, ); + + return dax_recovery_write(dax_dev, pgoff, addr, bytes, i); +} + #else #define log_writes_dax_direct_access NULL #define log_writes_dax_zero_page_range NULL +#define log_writes_dax_recovery_write NULL #endif static struct target_type log_writes_target = { @@ -925,6 +934,7 @@ static struct target_type log_writes_target = { .io_hints = log_writes_io_hints, .direct_access = log_writes_dax_direct_access, .dax_zero_page_range = log_writes_dax_zero_page_range, + .dax_recovery_write = log_writes_dax_recovery_write, }; static int __init dm_log_writes_init(void) diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c index 77d72900e997..baa085cc67bd 100644 --- a/drivers/md/dm-stripe.c +++ b/drivers/md/dm-stripe.c @@ -331,9 +331,18 @@ static int stripe_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff, return dax_zero_page_range(dax_dev, pgoff, nr_pages); } +static size_t stripe_dax_recovery_write(struct dm_target *ti, pgoff_t pgoff, + void *addr, size_t bytes, struct iov_iter *i) +{ + struct dax_device *dax_dev = stripe_dax_pgoff(ti, ); + + return dax_recovery_write(dax_dev, pgoff, addr, bytes, i); +} + #else #define stripe_dax_direct_access NULL #define stripe_dax_zero_page_range NULL +#define stripe_dax_recovery_write NULL #endif /* @@ -470,6 +479,7 @@ static struct target_type stripe_target = { .io_hints = stripe_io_hints, .direct_access = stripe_dax_direct_access, .dax_zero_page_range = stripe_dax_zero_page_range, + .dax_recovery_write = stripe_dax_recovery_write, }; int __init dm_stripe_init(void) diff --git
[dm-devel] [PATCH v8 7/7] pmem: implement pmem_recovery_write()
The recovery write thread started out as a normal pwrite thread and when the filesystem was told about potential media error in the range, filesystem turns the normal pwrite to a dax_recovery_write. The recovery write consists of clearing media poison, clearing page HWPoison bit, reenable page-wide read-write permission, flush the caches and finally write. A competing pread thread will be held off during the recovery process since data read back might not be valid, and this is achieved by clearing the badblock records after the recovery write is complete. Competing recovery write threads are serialized by pmem device level .recovery_lock. Signed-off-by: Jane Chu --- drivers/nvdimm/pmem.c | 56 ++- drivers/nvdimm/pmem.h | 1 + 2 files changed, 56 insertions(+), 1 deletion(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index c3772304d417..134f8909eb65 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -332,10 +332,63 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev, return __pmem_direct_access(pmem, pgoff, nr_pages, mode, kaddr, pfn); } +/* + * The recovery write thread started out as a normal pwrite thread and + * when the filesystem was told about potential media error in the + * range, filesystem turns the normal pwrite to a dax_recovery_write. + * + * The recovery write consists of clearing media poison, clearing page + * HWPoison bit, reenable page-wide read-write permission, flush the + * caches and finally write. A competing pread thread will be held + * off during the recovery process since data read back might not be + * valid, and this is achieved by clearing the badblock records after + * the recovery write is complete. Competing recovery write threads + * are serialized by pmem device level .recovery_lock. + */ static size_t pmem_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i) { - return 0; + struct pmem_device *pmem = dax_get_private(dax_dev); + size_t olen, len, off; + phys_addr_t pmem_off; + struct device *dev = pmem->bb.dev; + long cleared; + + off = offset_in_page(addr); + len = PFN_PHYS(PFN_UP(off + bytes)); + if (!is_bad_pmem(>bb, PFN_PHYS(pgoff) >> SECTOR_SHIFT, len)) + return _copy_from_iter_flushcache(addr, bytes, i); + + /* +* Not page-aligned range cannot be recovered. This should not +* happen unless something else went wrong. +*/ + if (off || !PAGE_ALIGNED(bytes)) { + dev_warn(dev, "Found poison, but addr(%p) or bytes(%#lx) not page aligned\n", + addr, bytes); + return 0; + } + + mutex_lock(>recovery_lock); + pmem_off = PFN_PHYS(pgoff) + pmem->data_offset; + cleared = __pmem_clear_poison(pmem, pmem_off, len); + if (cleared > 0 && cleared < len) { + dev_warn(dev, "poison cleared only %ld out of %lu bytes\n", + cleared, len); + mutex_unlock(>recovery_lock); + return 0; + } + if (cleared < 0) { + dev_warn(dev, "poison clear failed: %ld\n", cleared); + mutex_unlock(>recovery_lock); + return 0; + } + + olen = _copy_from_iter_flushcache(addr, bytes, i); + pmem_clear_bb(pmem, to_sect(pmem, pmem_off), cleared >> SECTOR_SHIFT); + + mutex_unlock(>recovery_lock); + return olen; } static const struct dax_operations pmem_dax_ops = { @@ -525,6 +578,7 @@ static int pmem_attach_disk(struct device *dev, if (rc) goto out_cleanup_dax; dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); + mutex_init(>recovery_lock); pmem->dax_dev = dax_dev; rc = device_add_disk(dev, disk, pmem_attribute_groups); diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h index 392b0b38acb9..91e40f5e3c0e 100644 --- a/drivers/nvdimm/pmem.h +++ b/drivers/nvdimm/pmem.h @@ -27,6 +27,7 @@ struct pmem_device { struct dax_device *dax_dev; struct gendisk *disk; struct dev_pagemap pgmap; + struct mutexrecovery_lock; }; long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, -- 2.18.4 -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
[PATCH v8 2/7] x86/mce: relocate set{clear}_mce_nospec() functions
Relocate the twin mce functions to arch/x86/mm/pat/set_memory.c file where they belong. While at it, fixup a function name in a comment. Reviewed-by: Dan Williams Signed-off-by: Jane Chu --- arch/x86/include/asm/set_memory.h | 52 --- arch/x86/mm/pat/set_memory.c | 49 - include/linux/set_memory.h| 8 ++--- 3 files changed, 52 insertions(+), 57 deletions(-) diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h index 78ca53512486..b45c4d27fd46 100644 --- a/arch/x86/include/asm/set_memory.h +++ b/arch/x86/include/asm/set_memory.h @@ -86,56 +86,4 @@ bool kernel_page_present(struct page *page); extern int kernel_set_to_readonly; -#ifdef CONFIG_X86_64 -/* - * Prevent speculative access to the page by either unmapping - * it (if we do not require access to any part of the page) or - * marking it uncacheable (if we want to try to retrieve data - * from non-poisoned lines in the page). - */ -static inline int set_mce_nospec(unsigned long pfn, bool unmap) -{ - unsigned long decoy_addr; - int rc; - - /* SGX pages are not in the 1:1 map */ - if (arch_is_platform_page(pfn << PAGE_SHIFT)) - return 0; - /* -* We would like to just call: -* set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1); -* but doing that would radically increase the odds of a -* speculative access to the poison page because we'd have -* the virtual address of the kernel 1:1 mapping sitting -* around in registers. -* Instead we get tricky. We create a non-canonical address -* that looks just like the one we want, but has bit 63 flipped. -* This relies on set_memory_XX() properly sanitizing any __pa() -* results with __PHYSICAL_MASK or PTE_PFN_MASK. -*/ - decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63)); - - if (unmap) - rc = set_memory_np(decoy_addr, 1); - else - rc = set_memory_uc(decoy_addr, 1); - if (rc) - pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn); - return rc; -} -#define set_mce_nospec set_mce_nospec - -/* Restore full speculative operation to the pfn. */ -static inline int clear_mce_nospec(unsigned long pfn) -{ - return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1); -} -#define clear_mce_nospec clear_mce_nospec -#else -/* - * Few people would run a 32-bit kernel on a machine that supports - * recoverable errors because they have too much memory to boot 32-bit. - */ -#endif - #endif /* _ASM_X86_SET_MEMORY_H */ diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index abf5ed76e4b7..978cf5bd2ab6 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -1816,7 +1816,7 @@ static inline int cpa_clear_pages_array(struct page **pages, int numpages, } /* - * _set_memory_prot is an internal helper for callers that have been passed + * __set_memory_prot is an internal helper for callers that have been passed * a pgprot_t value from upper layers and a reservation has already been taken. * If you want to set the pgprot to a specific page protocol, use the * set_memory_xx() functions. @@ -1925,6 +1925,53 @@ int set_memory_wb(unsigned long addr, int numpages) } EXPORT_SYMBOL(set_memory_wb); +/* + * Prevent speculative access to the page by either unmapping + * it (if we do not require access to any part of the page) or + * marking it uncacheable (if we want to try to retrieve data + * from non-poisoned lines in the page). + */ +int set_mce_nospec(unsigned long pfn, bool unmap) +{ + unsigned long decoy_addr; + int rc; + + if (!IS_ENABLED(CONFIG_64BIT)) + return 0; + + /* SGX pages are not in the 1:1 map */ + if (arch_is_platform_page(pfn << PAGE_SHIFT)) + return 0; + /* +* We would like to just call: +* set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1); +* but doing that would radically increase the odds of a +* speculative access to the poison page because we'd have +* the virtual address of the kernel 1:1 mapping sitting +* around in registers. +* Instead we get tricky. We create a non-canonical address +* that looks just like the one we want, but has bit 63 flipped. +* This relies on set_memory_XX() properly sanitizing any __pa() +* results with __PHYSICAL_MASK or PTE_PFN_MASK. +*/ + decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63)); + + if (unmap) + rc = set_memory_np(decoy_addr, 1); + else + rc = set_memory_uc(decoy_addr, 1); + if (rc) + pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn); + return rc; +} + +/* Restore full speculative operation to the pfn. */ +int clear_mce_nospec(unsigned
[PATCH v8 0/7] DAX poison recovery
In this series, dax recovery code path is independent of that of normal write. Competing dax recovery threads are serialized, racing read threads are guaranteed not overlapping with the recovery process. In this phase, the recovery granularity is page, future patch will explore recovery in finer granularity. Changelog: v7 -> v8: - add a patch to teach the nfit driver to rely on mce->misc for poison granularity, suggested by Dan - add set_memory_present() helper to be invoked by set_mce_nospec() for better readibility, suggested by Dan - folded a trivial fix to comments in patch 2/NN, suggested by Boris, and more commit message comments from Boris - mode .recovery_write to dax_operation as dev_pagemap_ops is meant for device agnostic operations, suggested by Christoph - replace DAX_RECOVERY flag with enum dax_access_mode suggested by Dan - re-organized __pmem_direct_access as provided by Christoph - split [PATCH v7 4/6] into two patches: one introduces DAX_RECOVERY_WRITE, and the other introduces .recovery_write operation v6 -> v7: . incorporated comments from Christoph, and picked up a reviewed-by . add x...@kernel.org per Boris . discovered pmem firmware doesn't reliably handle a request to clear poison over a large range (such as 2M), hence worked around the the feature by limiting the size of the requested range to kernel page size. v5->v6: . per Christoph, move set{clear}_mce_nospec() inline functions out of include/linux/set_memory.h and into arch/x86/mm/pat/set_memory.c file, so that no need to export _set_memory_present(). . per Christoph, ratelimit warning message in pmem_do_write() . per both Christoph and Dan, switch back to adding a flag to dax_direct_access() instead of embedding the flag in kvaddr . suggestions from Christoph for improving code structure and readability . per Dan, add .recovery_write to dev_pagemap.ops instead of adding it to dax_operations, such that, the DM layer doesn't need to be involved explicitly in dax recoovery write . per Dan, is_bad_pmem() takes a seqlock, so no need to place it under recovery_lock. Many thanks for both reviewers! v4->v5: Fixed build errors reported by kernel test robot v3->v4: Rebased to v5.17-rc1-81-g0280e3c58f92 References: v4 https://lore.kernel.org/lkml/2022012626.860012-1-jane@oracle.com/T/ v3 https://lkml.org/lkml/2022/1/11/900 v2 https://lore.kernel.org/all/20211106011638.2613039-1-jane@oracle.com/ Disussions about marking poisoned page as 'np' https://lore.kernel.org/all/capcyv4hrxpb1tasbzug-ggdvs0oofkxmxlihmktg_kfi7yb...@mail.gmail.com/ Jane Chu (7): acpi/nfit: rely on mce->misc to determine poison granularity x86/mce: relocate set{clear}_mce_nospec() functions mce: fix set_mce_nospec to always unmap the whole page dax: introduce DAX_RECOVERY_WRITE dax access mode dax: add .recovery_write dax_operation pmem: refactor pmem_clear_poison() pmem: implement pmem_recovery_write() arch/x86/include/asm/set_memory.h | 52 arch/x86/kernel/cpu/mce/core.c| 6 +- arch/x86/mm/pat/set_memory.c | 48 ++- drivers/acpi/nfit/mce.c | 4 +- drivers/dax/super.c | 14 +- drivers/md/dm-linear.c| 15 ++- drivers/md/dm-log-writes.c| 15 ++- drivers/md/dm-stripe.c| 15 ++- drivers/md/dm-target.c| 4 +- drivers/md/dm-writecache.c| 7 +- drivers/md/dm.c | 25 +++- drivers/nvdimm/pmem.c | 207 +- drivers/nvdimm/pmem.h | 6 +- drivers/s390/block/dcssblk.c | 9 +- fs/dax.c | 22 +++- fs/fuse/dax.c | 4 +- include/linux/dax.h | 22 +++- include/linux/device-mapper.h | 13 +- include/linux/set_memory.h| 10 +- tools/testing/nvdimm/pmem-dax.c | 3 +- 20 files changed, 351 insertions(+), 150 deletions(-) base-commit: 028192fea1de083f4f12bfb1eb7c4d7beb5c8ecd -- 2.18.4
Re: [dm-devel] [PATCH 08/27] btrfs: use bdev_max_active_zones instead of open coding it
On 4/15/22 12:52, Christoph Hellwig wrote: Signed-off-by: Christoph Hellwig Reviewed-by: Johannes Thumshirn Acked-by: David Sterba LGTM. Reviewed-by: Anand Jain --- fs/btrfs/zoned.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 1b1b310c3c510..f72cad7391a11 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -350,7 +350,6 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache) struct btrfs_fs_info *fs_info = device->fs_info; struct btrfs_zoned_device_info *zone_info = NULL; struct block_device *bdev = device->bdev; - struct request_queue *queue = bdev_get_queue(bdev); unsigned int max_active_zones; unsigned int nactive; sector_t nr_sectors; @@ -410,7 +409,7 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache) if (!IS_ALIGNED(nr_sectors, zone_sectors)) zone_info->nr_zones++; - max_active_zones = queue_max_active_zones(queue); + max_active_zones = bdev_max_active_zones(bdev); if (max_active_zones && max_active_zones < BTRFS_MIN_ACTIVE_ZONES) { btrfs_err_in_rcu(fs_info, "zoned: %s: max active zones %u is too small, need at least %u active zones", -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
[dm-devel] [PATCH 3/3] dm: verity-loadpin: Use CONFIG_SECURITY_LOADPIN_VERITY for conditional compilation
The verity glue for LoadPin is only needed when CONFIG_SECURITY_LOADPIN_VERITY is set, use this option for conditional compilation instead of the combo of CONFIG_DM_VERITY and CONFIG_SECURITY_LOADPIN. Signed-off-by: Matthias Kaehlcke --- drivers/md/Makefile | 7 +-- include/linux/dm-verity-loadpin.h | 2 +- 2 files changed, 2 insertions(+), 7 deletions(-) diff --git a/drivers/md/Makefile b/drivers/md/Makefile index e12cd004d375..a96441752ec7 100644 --- a/drivers/md/Makefile +++ b/drivers/md/Makefile @@ -83,6 +83,7 @@ obj-$(CONFIG_DM_LOG_WRITES) += dm-log-writes.o obj-$(CONFIG_DM_INTEGRITY) += dm-integrity.o obj-$(CONFIG_DM_ZONED) += dm-zoned.o obj-$(CONFIG_DM_WRITECACHE)+= dm-writecache.o +obj-$(CONFIG_SECURITY_LOADPIN_VERITY) += dm-verity-loadpin.o ifeq ($(CONFIG_DM_INIT),y) dm-mod-objs+= dm-init.o @@ -100,12 +101,6 @@ ifeq ($(CONFIG_IMA),y) dm-mod-objs+= dm-ima.o endif -ifeq ($(CONFIG_DM_VERITY),y) -ifeq ($(CONFIG_SECURITY_LOADPIN),y) -dm-mod-objs+= dm-verity-loadpin.o -endif -endif - ifeq ($(CONFIG_DM_VERITY_FEC),y) dm-verity-objs += dm-verity-fec.o endif diff --git a/include/linux/dm-verity-loadpin.h b/include/linux/dm-verity-loadpin.h index 12a86911d05a..be63ac76f98d 100644 --- a/include/linux/dm-verity-loadpin.h +++ b/include/linux/dm-verity-loadpin.h @@ -13,7 +13,7 @@ struct trusted_root_digest { struct list_head node; }; -#if IS_ENABLED(CONFIG_SECURITY_LOADPIN) && IS_BUILTIN(CONFIG_DM_VERITY) +#if IS_ENABLED(CONFIG_SECURITY_LOADPIN_VERITY) void dm_verity_loadpin_set_trusted_root_digests(struct list_head *digests); bool dm_verity_loadpin_is_md_trusted(struct mapped_device *md); #else -- 2.36.0.rc0.470.gd361397f0d-goog -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
[dm-devel] [PATCH 1/3] dm: Add verity helpers for LoadPin
LoadPin limits loading of kernel modules, firmware and certain other files to a 'pinned' file system (typically a read-only rootfs). To provide more flexibility LoadPin is being extended to also allow loading these files from trusted dm-verity devices. For that purpose LoadPin can be provided with a list of verity root digests that it should consider as trusted. Add a bunch of helpers to allow LoadPin to check whether a DM device is a trusted verity device. The new functions broadly fall in two categories: those that need access to verity internals (like the root digest), and the 'glue' between LoadPin and verity. The new file dm-verity-loadpin.c contains the glue functions. Signed-off-by: Matthias Kaehlcke --- drivers/md/Makefile | 6 +++ drivers/md/dm-verity-loadpin.c| 80 +++ drivers/md/dm-verity-target.c | 34 + drivers/md/dm-verity.h| 4 ++ include/linux/dm-verity-loadpin.h | 27 +++ 5 files changed, 151 insertions(+) create mode 100644 drivers/md/dm-verity-loadpin.c create mode 100644 include/linux/dm-verity-loadpin.h diff --git a/drivers/md/Makefile b/drivers/md/Makefile index 0454b0885b01..e12cd004d375 100644 --- a/drivers/md/Makefile +++ b/drivers/md/Makefile @@ -100,6 +100,12 @@ ifeq ($(CONFIG_IMA),y) dm-mod-objs+= dm-ima.o endif +ifeq ($(CONFIG_DM_VERITY),y) +ifeq ($(CONFIG_SECURITY_LOADPIN),y) +dm-mod-objs+= dm-verity-loadpin.o +endif +endif + ifeq ($(CONFIG_DM_VERITY_FEC),y) dm-verity-objs += dm-verity-fec.o endif diff --git a/drivers/md/dm-verity-loadpin.c b/drivers/md/dm-verity-loadpin.c new file mode 100644 index ..972ca93a2231 --- /dev/null +++ b/drivers/md/dm-verity-loadpin.c @@ -0,0 +1,80 @@ +// SPDX-License-Identifier: GPL-2.0-only + +#include +#include +#include + +#include "dm.h" +#include "dm-verity.h" + +static struct list_head *trusted_root_digests; + +/* + * Sets the root digests of verity devices which LoadPin considers as trusted. + * + * This function must only be called once. + */ +void dm_verity_loadpin_set_trusted_root_digests(struct list_head *digests) +{ + if (!trusted_root_digests) + trusted_root_digests = digests; + else + pr_warn("verity root digests trusted by LoadPin are already set!!!\n"); +} + +static bool is_trusted_verity_target(struct dm_target *ti) +{ + u8 *root_digest; + unsigned int digest_size; + struct trusted_root_digest *trd; + bool trusted = false; + + if (!dm_is_verity_target(ti)) + return false; + + if (dm_verity_get_root_digest(ti, _digest, _size)) + return false; + + list_for_each_entry(trd, trusted_root_digests, node) { + if ((trd->len == digest_size) && + !memcmp(trd->data, root_digest, digest_size)) { + trusted = true; + break; + } + } + + kfree(root_digest); + + return trusted; +} + +/* + * Determines whether a mapped device is a verity device that is trusted + * by LoadPin. + */ +bool dm_verity_loadpin_is_md_trusted(struct mapped_device *md) +{ + int srcu_idx; + struct dm_table *table; + unsigned int num_targets; + bool trusted = false; + int i; + + if (!trusted_root_digests || list_empty(trusted_root_digests)) + return false; + + table = dm_get_live_table(md, _idx); + num_targets = dm_table_get_num_targets(table); + for (i = 0; i < num_targets; i++) { + struct dm_target *ti = dm_table_get_target(table, i); + + if (is_trusted_verity_target(ti)) { + trusted = true; + break; + } + } + + dm_put_live_table(md, srcu_idx); + + return trusted; +} diff --git a/drivers/md/dm-verity-target.c b/drivers/md/dm-verity-target.c index 80133aae0db3..6bea9692cd39 100644 --- a/drivers/md/dm-verity-target.c +++ b/drivers/md/dm-verity-target.c @@ -16,9 +16,11 @@ #include "dm-verity.h" #include "dm-verity-fec.h" #include "dm-verity-verify-sig.h" +#include #include #include #include +#include #define DM_MSG_PREFIX "verity" @@ -1310,6 +1312,38 @@ static int verity_ctr(struct dm_target *ti, unsigned argc, char **argv) return r; } +/* + * Check whether a DM target is a verity target. + */ +bool dm_is_verity_target(struct dm_target *ti) +{ + return ti->type->module == THIS_MODULE; +} +EXPORT_SYMBOL_GPL(dm_is_verity_target); + +/* + * Get the root digest of a verity target. + * + * Returns a copy of the root digests, the caller is responsible for + * freeing the memory of the digest. + */ +int dm_verity_get_root_digest(struct dm_target *ti, u8 **root_digest, unsigned int *digest_size) +{ + struct dm_verity *v = ti->private; + + if (!dm_is_verity_target(ti)) +
Re: [dm-devel] [PATCH 10/11] rnbd-srv: use bdev_discard_alignment
On Mon, Apr 18, 2022 at 6:53 AM Christoph Hellwig wrote: > > Use bdev_discard_alignment to calculate the correct discard alignment > offset even for partitions instead of just looking at the queue limit. > > Signed-off-by: Christoph Hellwig Thx! Acked-by: Jack Wang > --- > drivers/block/rnbd/rnbd-srv-dev.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/block/rnbd/rnbd-srv-dev.h > b/drivers/block/rnbd/rnbd-srv-dev.h > index d080a0de59225..4309e52524691 100644 > --- a/drivers/block/rnbd/rnbd-srv-dev.h > +++ b/drivers/block/rnbd/rnbd-srv-dev.h > @@ -59,7 +59,7 @@ static inline int rnbd_dev_get_discard_granularity(const > struct rnbd_dev *dev) > > static inline int rnbd_dev_get_discard_alignment(const struct rnbd_dev *dev) > { > - return bdev_get_queue(dev->bdev)->limits.discard_alignment; > + return bdev_discard_alignment(dev->bdev); > } > > #endif /* RNBD_SRV_DEV_H */ > -- > 2.30.2 > -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [f2fs-dev] [PATCH 26/27] block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD
On 2022/4/15 12:52, Christoph Hellwig wrote: Secure erase is a very different operation from discard in that it is a data integrity operation vs hint. Fully split the limits and helper infrastructure to make the separation more clear. Signed-off-by: Christoph Hellwig Reviewed-by: Martin K. Petersen Acked-by: Christoph Böhmwalder [drbd] Acked-by: Ryusuke Konishi [nifs2] Acked-by: Jaegeuk Kim [f2fs] Acked-by: Coly Li [bcache] Acked-by: David Sterba [btrfs] For f2fs part, Acked-by: Chao Yu Thanks, -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
[dm-devel] ext4_writepages: jbd2_start: 5120 pages, ino 11; err -5
We have a cluster scheduler which provides each cluster job with a private scratch filesystem (TMPDIR). These are created when a job starts and removed when a job completes. The setup works by fallocate, losetup, mkfs.ext4, mkdir, mount, "losetup -d", rm and the teardown just does a umount and rmdir. This works but there is one nuisance: The systems usually have a lot of memory and some jobs write a lot of data to their scratch filesystems. So when a job finishes, there often is a lot to sync by umount which sometimes takes many minutes and wastes a lot of I/O bandwidth. Additionally, the reserved space can't be returned and reused until the umount is finished and the backing file is deleted. So I was looking for a way to avoid that but didn't find something straightforward. The workaround I've found so far is using a dm-device (linear target) between the filesystem and the loop device and then use this sequence for teardown: - fcntl EXT4_IOC_SHUTDOWN with EXT4_GOING_FLAGS_NOLOGFLUSH - dmestup reload $dmname --table "0 $sectors zero" - dmsetup resume $dmname --noflush - umount $mountpoint - dmsetup remove --deferred $dmname - rmdir $mountpoint This seems to do what I want. The unnecessary flushing of the temporary data is redirected from the backing file into the zero target and it works really fast. There is one remaining problem though, which might be just a cosmetic one: Although ext4 is shut down to prevent it from writing, I sometimes get the error message from the subject in the logs: [2963044.462043] EXT4-fs (dm-1): mounted filesystem without journal. Opts: (null) [2963044.686994] EXT4-fs (dm-0): mounted filesystem without journal. Opts: (null) [2963044.728391] EXT4-fs (dm-2): mounted filesystem without journal. Opts: (null) [2963055.585198] EXT4-fs (dm-2): shut down requested (2) [2963064.821246] EXT4-fs (dm-2): mounted filesystem without journal. Opts: (null) [2963074.838259] EXT4-fs (dm-2): shut down requested (2) [2963095.979089] EXT4-fs (dm-0): shut down requested (2) [2963096.066376] EXT4-fs (dm-0): ext4_writepages: jbd2_start: 5120 pages, ino 11; err -5 [2963108.636648] EXT4-fs (dm-0): mounted filesystem without journal. Opts: (null) [2963125.194740] EXT4-fs (dm-0): shut down requested (2) [2963166.708088] EXT4-fs (dm-1): shut down requested (2) [2963169.334437] EXT4-fs (dm-0): mounted filesystem without journal. Opts: (null) [2963227.515974] EXT4-fs (dm-0): shut down requested (2) [2966222.515143] EXT4-fs (dm-0): mounted filesystem without journal. Opts: (null) [2966222.523390] EXT4-fs (dm-1): mounted filesystem without journal. Opts: (null) [2966222.598071] EXT4-fs (dm-2): mounted filesystem without journal. Opts: (null) So I'd like to ask a few questions: - Is this error message expected or is it a bug? - Can it be ignored or is there a leak or something on that error path. - Is there a better way to do what I want? Something I've overlooked? - I consider to create a new dm target or add an option to an existing one, because I feel that "zero" underneath a filesystem asks for problems because a filesystem expects to read back the data that it wrote, and the "error" target would trigger lots of errors during the writeback attempts. What I really want is a target which silently discard writes and returns errors on reads. Any opinion about that? - But to use devicemapper to eat away the I/O is also just a workaround to the fact that we can't parse some flag to umount to say that we are okay to lose all data and leave the filesystem in a corrupted state if this was the last reference to it. Would this be a useful feature? Best Donald -- Donald Buczek buc...@molgen.mpg.de -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
[dm-devel] [PATCH 2/3] LoadPin: Enable loading from trusted dm-verity devices
Extend LoadPin to allow loading of kernel files from trusted dm-verity [1] devices. This change adds the concept of trusted verity devices to LoadPin. Userspace can use the new systl file 'loadpin/trusted_verity_root_digests' to provide LoadPin with a list of root digests from dm-verity devices that LoadPin should consider as trusted. When a kernel file is read LoadPin first checks (as usual) whether the file is located on the pinned root, if so the file can be loaded. Otherwise, if the verity extension is enabled, LoadPin determines whether the file is located on a verity backed device and whether the root digest of that device is in the list of trusted digests. The file can be loaded if the verity device has a trusted root digest. The list of trusted root digests can only be written once (typically at boot time), to limit the possiblity of attackers setting up rogue verity devices and marking them as trusted. Background: As of now LoadPin restricts loading of kernel files to a single pinned filesystem, typically the rootfs. This works for many systems, however it can result in a bloated rootfs (and OTA updates) on platforms where multiple boards with different hardware configurations use the same rootfs image. Especially when 'optional' files are large it may be preferable to download/install them only when they are actually needed by a given board. Chrome OS uses Downloadable Content (DLC) [2] to deploy certain 'packages' at runtime. As an example a DLC package could contain firmware for a peripheral that is not present on all boards. DLCs use dm-verity to verify the integrity of the DLC content. [1] https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/verity.html [2] https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/dlcservice/docs/developer.md Signed-off-by: Matthias Kaehlcke --- security/loadpin/Kconfig | 11 +++ security/loadpin/loadpin.c | 168 - 2 files changed, 178 insertions(+), 1 deletion(-) diff --git a/security/loadpin/Kconfig b/security/loadpin/Kconfig index 91be65dec2ab..cf3e6431e02d 100644 --- a/security/loadpin/Kconfig +++ b/security/loadpin/Kconfig @@ -18,3 +18,14 @@ config SECURITY_LOADPIN_ENFORCE If selected, LoadPin will enforce pinning at boot. If not selected, it can be enabled at boot with the kernel parameter "loadpin.enforce=1". + +config SECURITY_LOADPIN_VERITY + bool "Allow reading files from certain other filesystems that use dm-verity" + depends on DM_VERITY=y + help + If selected LoadPin can allow reading files from filesystems + that use dm-verity. A verity filesystem can be configured as + being trusted by LoadPin by writing its root digest to the + sysctl file 'trusted_verity_root_digests'. The sysctl file + can only be written once (typically at boot) and accepts a + list of comma separated digests. diff --git a/security/loadpin/loadpin.c b/security/loadpin/loadpin.c index b12f7d986b1e..b9c174f8687d 100644 --- a/security/loadpin/loadpin.c +++ b/security/loadpin/loadpin.c @@ -18,6 +18,8 @@ #include #include/* current */ #include +#include +#include static void report_load(const char *origin, struct file *file, char *operation) { @@ -43,6 +45,10 @@ static char *exclude_read_files[READING_MAX_ID]; static int ignore_read_file_id[READING_MAX_ID] __ro_after_init; static struct super_block *pinned_root; static DEFINE_SPINLOCK(pinned_root_spinlock); +#ifdef CONFIG_SECURITY_LOADPIN_VERITY +static bool verity_digests_set; +static LIST_HEAD(trusted_verity_root_digests); +#endif #ifdef CONFIG_SYSCTL @@ -65,6 +71,144 @@ static struct ctl_table loadpin_sysctl_table[] = { { } }; +#ifdef CONFIG_SECURITY_LOADPIN_VERITY + +static int proc_verity_root_digests(struct ctl_table *table, int write, + void *buffer, size_t *lenp, loff_t *ppos) +{ + struct trusted_root_digest *trd; + char *buf; + int rc; + + if (write) { + struct ctl_table tbl = *table; + char *p, *d; + + if (*ppos) + return -EINVAL; + + if (verity_digests_set) + return -EPERM; + + buf = kzalloc(tbl.maxlen, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + tbl.data = buf; + + rc = proc_dostring(, write, buffer, lenp, ppos); + if (rc) { + kfree(buf); + return rc; + } + + if (strlen(buf) == 0) { + verity_digests_set = true; + return false; + } + + p = buf; + while ((d = strsep(, ",")) != NULL) { + int len = strlen(d); + + if (len % 2) { + rc = -EINVAL; +
[dm-devel] [PATCH 0/3] LoadPin: Enable loading from trusted dm-verity devices
As of now LoadPin restricts loading of kernel files to a single pinned filesystem, typically the rootfs. This works for many systems, however it can result in a bloated rootfs (and OTA updates) on platforms where multiple boards with different hardware configurations use the same rootfs image. Especially when 'optional' files are large it may be preferable to download/install them only when they are actually needed by a given board. Chrome OS uses Downloadable Content (DLC) [1] to deploy certain 'packages' at runtime. As an example a DLC package could contain firmware for a peripheral that is not present on all boards. DLCs use dm-verity [2] to verify the integrity of the DLC content. This series extends LoadPin to allow loading of kernel files from trusted dm-verity devices. It adds the concept of trusted verity devices to LoadPin. Userspace can use the new systl file 'loadpin/trusted_verity_root_digests' to provide LoadPin with a list of root digests from dm-verity devices that LoadPin should consider as trusted. When a kernel file is read LoadPin first checks (as usual) whether the file is located on the pinned root, if so the file can be loaded. Otherwise, if the verity extension is enabled, LoadPin determines whether the file is located on a verity backed device and whether the root digest of that device is in the list of trusted digests. The file can be loaded if the verity device has a trusted root digest. The list of trusted root digests can only be written once (typically at boot time), to limit the possiblity of attackers setting up rogue verity devices and marking them as trusted. [1] https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/dlcservice/docs/developer.md [2] https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/verity.html Matthias Kaehlcke (3): dm: Add verity helpers for LoadPin LoadPin: Enable loading from trusted dm-verity devices dm: verity-loadpin: Use CONFIG_SECURITY_LOADPIN_VERITY for conditional compilation drivers/md/Makefile | 1 + drivers/md/dm-verity-loadpin.c| 80 ++ drivers/md/dm-verity-target.c | 34 ++ drivers/md/dm-verity.h| 4 + include/linux/dm-verity-loadpin.h | 27 + security/loadpin/Kconfig | 11 ++ security/loadpin/loadpin.c| 168 +- 7 files changed, 324 insertions(+), 1 deletion(-) create mode 100644 drivers/md/dm-verity-loadpin.c create mode 100644 include/linux/dm-verity-loadpin.h -- 2.36.0.rc0.470.gd361397f0d-goog -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 0/3] LoadPin: Enable loading from trusted dm-verity devices
Hi Kees, On Mon, Apr 18, 2022 at 03:14:14PM -0700, Kees Cook wrote: > [oops, resending to actual CC list] > > On Mon, Apr 18, 2022 at 02:15:56PM -0700, Matthias Kaehlcke wrote: > > This series extends LoadPin to allow loading of kernel files > > from trusted dm-verity devices. It adds the concept of > > trusted verity devices to LoadPin. Userspace can use the > > new systl file 'loadpin/trusted_verity_root_digests' to > > provide LoadPin with a list of root digests from dm-verity > > devices that LoadPin should consider as trusted. When a > > kernel file is read LoadPin first checks (as usual) whether > > the file is located on the pinned root, if so the file can > > be loaded. Otherwise, if the verity extension is enabled, > > LoadPin determines whether the file is located on a verity > > backed device and whether the root digest of that device > > is in the list of trusted digests. The file can be loaded > > if the verity device has a trusted root digest. > > > > The list of trusted root digests can only be written once > > (typically at boot time), to limit the possiblity of > > attackers setting up rogue verity devices and marking them > > as trusted. > Thanks for working all this out! Where does the list of trusted > roothashes come from? I assume some chain of trust exists. Is the list > maybe already stored on the rootfs? Yes, at least the content of the list comes from the rootfs. The userspace part is still TBD (also pending on the evolution of this patchset), having the list pre-formatted in a single file on the rootfs should be fine. > It'd be nice if there was some way to pass the trust chain to LoadPin > more directly. I imagine you envision LoadPin reading the file itself, instead of just processing the content. That should be doable. One option would be to pass the path of the file with the hashes through the sysctl file and use kernel_read_file_from_path() to read it if the path is in the pinned root (or maybe even in any trusted file system ;-) -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
[dm-devel] [PATCH] dm: fix bio length of empty flush
The commit 92986f6b4c8a ("dm: use bio_clone_fast in alloc_io/alloc_tio") removed bio_clone_fast() call from alloc_tio() when ci->io->tio is available. In this case, ci->bio is not copied to ci->io->tio.clone. This is fine since init_clone_info() sets same values to ci->bio and ci->io->tio.clone. However, when incoming bios have REQ_PREFLUSH flag, __send_empty_flush() prepares a zero length bio on stack and set it to ci->bio. At this time, ci->io->tio.clone still keeps non-zero length. When alloc_tio() chooses this ci->io->tio.clone as the bio to map, it is passed to targets as non-empty flush bio. It causes bio length check failure in dm-zoned and unexpected operation such as dm_accept_partial_bio() call. To avoid the non-empty flush bio, set zero length to ci->io->tio.clone in __send_empty_flush(). Fixes: 92986f6b4c8a ("dm: use bio_clone_fast in alloc_io/alloc_tio") Signed-off-by: Shin'ichiro Kawasaki --- drivers/md/dm.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 3c5fad7c4ee6..f2397546b93f 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -1392,6 +1392,7 @@ static void __send_empty_flush(struct clone_info *ci) ci->bio = _bio; ci->sector_count = 0; + ci->io->tio.clone.bi_iter.bi_size = 0; while ((ti = dm_table_get_target(ci->map, target_nr++))) __send_duplicate_bios(ci, ti, ti->num_flush_bios, NULL); -- 2.35.1 -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel