On Thu, Sep 25, 2025 at 02:27:06PM +0200, Pratyush Yadav wrote:
> I think the tables should be treated as the final serialized data
> structure, and should get all the same properties that other KHO
> serialization formats have like stable binary format, versioning, etc.
Right, that's how I see it
>first);
> +
> + while (chunk) {
> + struct kho_vmalloc_chunk *tmp = chunk;
> +
> + kho_vmalloc_unpreserve_chunk(chunk);
> +
> + chunk = KHOSER_LOAD_PTR(chunk->hdr.next);
> + kfree(tmp);
Shouldn't this be free_page()?
Otherwise looks OK
Reviewed-by: Jason Gunthorpe
Jason
t for
> vmalloc preservation.
>
> Signed-off-by: Mike Rapoport (Microsoft)
> ---
> include/linux/kexec_handover.h | 5 +++--
> kernel/kexec_handover.c | 25 +++--
> mm/memblock.c | 4 +++-
> 3 files changed, 17 insertions(+), 17 deletions(-)
Reviewed-by: Jason Gunthorpe
Jason
On Wed, Sep 10, 2025 at 09:22:03PM +0100, Lorenzo Stoakes wrote:
> +static inline void mmap_action_remap(struct mmap_action *action,
> + unsigned long addr, unsigned long pfn, unsigned long size,
> + pgprot_t pgprot)
> +{
> + action->type = MMAP_REMAP_PFN;
> +
> + ac
On Mon, Sep 08, 2025 at 02:27:12PM +0100, Lorenzo Stoakes wrote:
> It's not only remap that is a concern here, people do all kinds of weird
> and wonderful things in .mmap(), sometimes in combination with remap.
So it should really not be split this way; complete is a badly named
prepopulate and i
-
> 4 files changed, 85 insertions(+), 52 deletions(-)
Reviewed-by: Jason Gunthorpe
Jason
On Tue, Sep 09, 2025 at 10:14:41PM +0200, Andrey Ryabinin wrote:
> +static int kstate_preserve_phys(struct kstate_stream *stream, void *obj,
> + const struct kstate_field *field)
> +{
> + struct reserve_mem_table *map = obj;
> +
> + return kho_preserve_phys(map->
desc->action_error_hook to filter the remap error to
> -EAGAIN to keep behaviour consistent.
Hurm, in practice this converts reserve_pfn_range()/etc conflicts
from EINVAL into EAGAIN and converts all the unlikely OOM ENOMEM
failures to EAGAIN. Seems wrong/unnecessary to me, I wouldn
On Mon, Sep 08, 2025 at 03:09:43PM +0100, Lorenzo Stoakes wrote:
> > Perhaps
> >
> > !vma_desc_cowable()
> >
> > Is what many drivers are really trying to assert.
>
> Well no, because:
>
> static inline bool is_cow_mapping(vm_flags_t flags)
> {
> return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> }
On Mon, Sep 08, 2025 at 02:37:44PM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 08, 2025 at 10:11:21AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 08, 2025 at 12:10:41PM +0100, Lorenzo Stoakes wrote:
> > > @@ -151,20 +123,55 @@ static int hugetlbfs_file_mmap(struct file *f
> 1. Find the `start_level` from the `target_order`. (for example,
> target_order = 10, start_level = 4)
> 2. The path from the root down to the level above `start_level` is
> fixed (index 0 at each of these levels).
> 3. At `start_level`, the index is also fixed, by (1 << (63 -
> PAGE_SHIFT
On Mon, Sep 08, 2025 at 03:18:46PM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 08, 2025 at 10:35:38AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 08, 2025 at 02:27:12PM +0100, Lorenzo Stoakes wrote:
> >
> > > It's not only remap that is a concern here, people do
> Signed-off-by: Lorenzo Stoakes
> ---
> include/linux/shmem_fs.h | 3 ++-
> mm/shmem.c | 41
> 2 files changed, 35 insertions(+), 9 deletions(-)
Reviewed-by: Jason Gunthorpe
Jason
On Mon, Sep 08, 2025 at 05:50:18PM +0200, David Hildenbrand wrote:
> So in practice there is indeed not a big difference between a private and
> cow mapping.
Right and most drivers just check SHARED.
But if we are being documentative, the reason they check SHARED is
that the driver cannot tolerate CO
On Wed, Sep 17, 2025 at 08:11:08PM +0100, Lorenzo Stoakes wrote:
> -int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
> +static int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long
> addr,
> unsigned long pfn, unsigned long size, pgprot_t pr
Please drop this.
Future work will soon require something more complicated to compute
whether pgprot_decrypted() should be called, so this unused stuff isn't
going to hold up.
Otherwise looks good to me
Reviewed-by: Jason Gunthorpe
Jason
On Wed, Sep 17, 2025 at 08:11:11PM +0100, Lorenzo Stoakes wrote:
> +static int mmap_action_finish(struct mmap_action *action,
> + const struct vm_area_struct *vma, int err)
> +{
> + /*
> + * If an error occurs, unmap the VMA altogether and return an error. We
> + * only cl
On Mon, Sep 08, 2025 at 12:10:39PM +0100, Lorenzo Stoakes wrote:
> remap_pfn_range_prepare() will set the cow vma->vm_pgoff if necessary, so
> it must be supplied with a correct PFN to do so. If the caller must hold
> locks to be able to do this, those locks should be held across the
> operation, a
On Thu, Sep 18, 2025 at 09:00:31PM +0200, Andrey Ryabinin wrote:
> By contrast, KSTATE centralizes this logic. It avoids duplicating code
> and lets us express the preservation details declaratively instead
> of re-implementing them per struct.
I didn't really see it centralize much of anything,
On Mon, Sep 08, 2025 at 05:24:23PM +0200, David Hildenbrand wrote:
> >
> > > I think we need to be cautious of scope here :) I don't want to
> > > accidentally break things this way.
> >
> > IMHO it is worth doing when you get into more driver places it is far
> > more obvious why the VM_SHARED i
On Mon, Sep 08, 2025 at 12:10:41PM +0100, Lorenzo Stoakes wrote:
> @@ -151,20 +123,55 @@ static int hugetlbfs_file_mmap(struct file *file,
> struct vm_area_struct *vma)
> vm_flags |= VM_NORESERVE;
>
> if (hugetlb_reserve_pages(inode,
> - vma->vm_pg
On Tue, Sep 16, 2025 at 03:20:51PM +0200, Pratyush Yadav wrote:
> >> >> @@ -210,16 +226,16 @@ static void kho_restore_page(struct page *page,
> >> >> unsigned int order)
> >> >> struct folio *kho_restore_folio(phys_addr_t phys)
> >> >> {
> >> >> struct page *page = pfn_to_online_page(PHY
On Mon, Sep 08, 2025 at 12:10:47PM +0100, Lorenzo Stoakes wrote:
> Now we have the capacity to set up the VMA in f_op->mmap_prepare and then
> later, once the VMA is established, insert a mixed mapping in
> f_op->mmap_complete, do so for kcov.
>
> We utilise the context desc->mmap_context field to
On Tue, Sep 16, 2025 at 06:57:56PM +0100, Lorenzo Stoakes wrote:
> > > + /*
> > > + * If an error occurs, unmap the VMA altogether and return an error. We
> > > + * only clear the newly allocated VMA, since this function is only
> > > + * invoked if we do NOT merge, so we only clean up the VMA w
On Mon, Sep 15, 2025 at 01:54:05PM +0100, Lorenzo Stoakes wrote:
> > Just mark the functions as manipulating the action using the 'action'
> > in the function name.
>
> Because sub-callers that partially map using one method and partially map
> using another now need to have a desc too that the
On Mon, Sep 15, 2025 at 01:23:30PM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 15, 2025 at 09:11:12AM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 10, 2025 at 09:22:03PM +0100, Lorenzo Stoakes wrote:
> > > +static inline void mmap_action_remap(struct mmap_action *action,
> &
On Wed, Sep 17, 2025 at 02:15:28PM -0700, Andrew Morton wrote:
> On Wed, 17 Sep 2025 20:40:32 +0300 Mike Rapoport wrote:
> > +struct kho_vmalloc_chunk;
> > +struct kho_vmalloc {
> > +DECLARE_KHOSER_PTR(first, struct kho_vmalloc_chunk *);
>
> offtopic nit: DECLARE_KHOSER_PTR() *defines* a
> Signed-off-by: Lorenzo Stoakes
> Reviewed-by: David Hildenbrand
> ---
> mm/vma.c | 8
> 1 file changed, 4 insertions(+), 4 deletions(-)
Reviewed-by: Jason Gunthorpe
Jason
On Mon, Sep 08, 2025 at 12:10:43PM +0100, Lorenzo Stoakes wrote:
> resctl uses remap_pfn_range(), but holds a mutex over the
> operation. Therefore, establish the mutex in mmap_prepare(), release it in
> mmap_complete() and release it in mmap_abort() should the operation fail.
The mutex can't do a
On Mon, Sep 15, 2025 at 02:51:52PM +0100, Lorenzo Stoakes wrote:
> > vmcore is a true MIXEDMAP, it isn't doing two actions. These mixedmap
> > helpers just aren't good for what mixedmap needs.. Mixed map need a
> > list of physical pfns with a bit indicating if they are "special" or
> > not. If you
y: Reinette Chatre
> ---
> fs/resctrl/pseudo_lock.c | 20 +---
> 1 file changed, 9 insertions(+), 11 deletions(-)
Reviewed-by: Jason Gunthorpe
Jason
On Wed, Sep 17, 2025 at 12:18:39PM -0400, Pasha Tatashin wrote:
> On Wed, Sep 17, 2025 at 8:22 AM Jason Gunthorpe wrote:
> >
> > On Tue, Sep 16, 2025 at 07:50:16PM -0700, Jason Miu wrote:
> > > + * kho_order_table
> > > + * +---+--
On Tue, Sep 16, 2025 at 07:50:16PM -0700, Jason Miu wrote:
> + * kho_order_table
> + * +---------+---------+-------------+--------------------+
> + * | 0 order | 1 order | 2 order ... | HUGETLB_PAGE_ORDER |
> + * +---------+---------+-------------+--------------------+
> + * |
> + * |
> + * v
> + * ++
On Tue, Sep 16, 2025 at 07:50:15PM -0700, Jason Miu wrote:
> This series transitions KHO from an xarray-based metadata tracking
> system with serialization to using page table like data structures
> that can be passed directly to the next kernel.
>
> The key motivations for this change are to:
> -
> ---
> kernel/relay.c | 33 +
> 1 file changed, 17 insertions(+), 16 deletions(-)
Reviewed-by: Jason Gunthorpe
Jason
>
> Signed-off-by: Lorenzo Stoakes
> Reviewed-by: Jan Kara
> Acked-by: David Hildenbrand
> ---
> fs/ntfs3/file.c | 2 +-
> include/linux/mm.h | 10 ++
> mm/secretmem.c | 2 +-
> 3 files changed, 12 insertions(+), 2 deletions(-)
Reviewed-by: Jason Gunthorpe
Jason
t *vma, unsigned long addr,
		   unsigned long pfn, unsigned long size, pgprot_t prot)
{
	int err;

	err = remap_pfn_range_prepare_vma(vma, addr, pfn, size);
	if (err)
		return err;

	if (IS_ENABLED(__HAVE_PFNMAP_TRACKING))
		return remap_pfn_range_track(vma, addr, pfn, size, prot);
	return remap_pfn_range_notrack(vma, addr, pfn, size, prot);
}
(fix pgtable_types.h to #define to 1 so IS_ENABLED works)
But the logic here is all fine
Reviewed-by: Jason Gunthorpe
Jason
| 9 +
> 1 file changed, 5 insertions(+), 4 deletions(-)
Reviewed-by: Jason Gunthorpe
Jason
On Tue, Sep 16, 2025 at 03:11:59PM +0100, Lorenzo Stoakes wrote:
> -static int iommufd_fops_mmap(struct file *filp, struct vm_area_struct *vma)
> +static int iommufd_fops_mmap_prepare(struct vm_area_desc *desc)
> {
> + struct file *filp = desc->file;
> struct iommufd_ctx *ictx = filp->p
On Tue, Sep 16, 2025 at 03:11:53PM +0100, Lorenzo Stoakes wrote:
>
> -int io_remap_pfn_range(struct vm_area_struct *vma, unsigned long vaddr,
> - unsigned long pfn, unsigned long size, pgprot_t prot)
> +static unsigned long calc_pfn(unsigned long pfn, unsigned long size)
> {
>
On Tue, Sep 16, 2025 at 03:11:54PM +0100, Lorenzo Stoakes wrote:
> +/* What action should be taken after an .mmap_prepare call is complete? */
> +enum mmap_action_type {
> + MMAP_NOTHING, /* Mapping is complete, no further action. */
> + MMAP_REMAP_PFN, /* Remap PFN range
Reviewed-by: Jan Kara
> ---
> drivers/dax/device.c | 32 +---
> 1 file changed, 21 insertions(+), 11 deletions(-)
Reviewed-by: Jason Gunthorpe
Jason
On Wed, Sep 10, 2025 at 09:22:11PM +0100, Lorenzo Stoakes wrote:
> +static int kcov_mmap_prepare(struct vm_area_desc *desc)
> {
> int res = 0;
> - struct kcov *kcov = vma->vm_file->private_data;
> - unsigned long size, off;
> - struct page *page;
> + struct kcov *kcov = desc-
On Mon, Sep 15, 2025 at 01:43:50PM +0100, Lorenzo Stoakes wrote:
> > > + if (kcov->area == NULL || desc->pgoff != 0 ||
> > > + vma_desc_size(desc) != size) {
> >
> > IMHO these range checks should be cleaned up into a helper:
> >
> > /* Returns true if the VMA falls within starting_pgoff to
> >
On Mon, Sep 15, 2025 at 07:36:25PM +0300, Mike Rapoport wrote:
> > Under the covers it all uses the generic folio based code we already
> > have, but we should have appropriate wrappers around that code that
> > make clear these patterns.
>
> Right, but that does not mean that vmalloc preserve/res
On Mon, Sep 15, 2025 at 05:01:01PM +0300, Mike Rapoport wrote:
> > kzalloc() cannot be preserved, the only thing we support today is
> > alloc_page(), so this code pattern shouldn't exist.
>
> kzalloc(PAGE_SIZE) can be preserved, it's page aligned and we don't have to
> restore it into a slab cac
On Mon, Sep 15, 2025 at 05:12:27PM +0300, Mike Rapoport wrote:
> > I don't suppose I'd insist on it, but something to consider since you
> > are likely going to do another revision anyway.
>
> I think vmalloc is as basic as folio.
vmalloc() ultimately calls vm_area_alloc_pages() ->
alloc_pages_b
On Mon, Sep 08, 2025 at 12:10:34PM +0100, Lorenzo Stoakes wrote:
> static int secretmem_mmap_prepare(struct vm_area_desc *desc)
> {
> - const unsigned long len = desc->end - desc->start;
> + const unsigned long len = vma_desc_size(desc);
>
> if ((desc->vm_flags & (VM_SHARED | VM_M
On Mon, Sep 08, 2025 at 03:48:36PM +0100, Lorenzo Stoakes wrote:
> But sadly some _do need_ to do extra work afterwards, most notably,
> prepopulation.
I think Jan is suggesting something more like
mmap_op()
{
struct vma_desc desc = {};
desc.[..] = x
desc.[..] = y
desc.[..] = z
vm
On Mon, Sep 08, 2025 at 03:47:34PM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 08, 2025 at 11:20:11AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 08, 2025 at 03:09:43PM +0100, Lorenzo Stoakes wrote:
> > > > Perhaps
> > > >
> > > > !vma_desc_cowab
On Wed, Sep 10, 2025 at 05:52:04PM +0200, Pratyush Yadav wrote:
> On Wed, Sep 10 2025, Matthew Wilcox wrote:
>
> > On Wed, Sep 10, 2025 at 05:34:40PM +0200, Pratyush Yadav wrote:
> >> +#define KHO_PAGE_MAGIC 0x4b484f50U /* ASCII for 'KHOP' */
> >> +
> >> +/*
> >> + * KHO uses page->private, which
On Mon, Sep 08, 2025 at 02:12:00PM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 08, 2025 at 09:51:01AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 08, 2025 at 12:10:34PM +0100, Lorenzo Stoakes wrote:
> > > static int secretmem_mmap_prepare(struct vm_area_desc *desc)
>
On Tue, Sep 09, 2025 at 05:40:21PM +0200, Pratyush Yadav wrote:
> PS: do you know if bitfield layout is reliable for serialization? Can
> different compiler versions move them around? I always thought they can.
> If not, I can also use them in memfd code since they make the code
> neater.
It is sp
On Tue, Sep 09, 2025 at 04:44:21PM +0200, Pratyush Yadav wrote:
> The KHO Array is a data structure that behaves like a sparse array of
> pointers. It is designed to be preserved and restored over Kexec
> Handover (KHO), and targets only 64-bit platforms. It can store 8-byte
> aligned pointers. It
On Mon, Sep 08, 2025 at 12:10:44PM +0100, Lorenzo Stoakes wrote:
> We thread the state through the mmap_context, allowing for both PFN map and
> mixed mapped pre-population.
>
> Signed-off-by: Lorenzo Stoakes
> ---
> fs/cramfs/inode.c | 134 +++---
> 1 fil
On Mon, Sep 08, 2025 at 12:10:37PM +0100, Lorenzo Stoakes wrote:
> We have introduced the f_op->mmap_prepare hook to allow for setting up a
> VMA far earlier in the process of mapping memory, reducing problematic
> error handling paths, but this does not provide what all
> drivers/filesystems need.
On Mon, Sep 08, 2025 at 08:12:59PM +0200, Pratyush Yadav wrote:
> > +#define KHO_VMALLOC_FLAGS_MASK (VM_ALLOC | VM_ALLOW_HUGE_VMAP)
>
> I don't think it is a good idea to re-use VM flags. This can make adding
> more flags later down the line ugly. I think it would be better to
> define KHO_VMA
On Mon, Sep 08, 2025 at 01:35:27PM +0300, Mike Rapoport wrote:
> +static struct kho_vmalloc_chunk *new_vmalloc_chunk(struct kho_vmalloc_chunk
> *cur)
> +{
> + struct kho_vmalloc_chunk *chunk;
> + int err;
> +
> + chunk = kzalloc(PAGE_SIZE, GFP_KERNEL);
> + if (!chunk)
> +
On Wed, Sep 03, 2025 at 10:25:02PM +0300, Mike Rapoport wrote:
> It seems that our major disagreement is about using 'folio' vs 'page' in
> the naming.
It is a folio because folio is the name for something that is a high
order page and it signals that the pointer is the head page. Which is
exactly
On Wed, Sep 03, 2025 at 06:38:00PM +0300, Mike Rapoport wrote:
> > Don't call kho_preserve_phy if you already have a page!
>
> Ok, I'll add kho_preserve_page() ;-P.
Cast it to a folio :P
> Now seriously, by no means this is a folio,
It really is. The entire bitmap thing is about preserving fo
On Wed, Sep 03, 2025 at 09:30:17AM +0300, Mike Rapoport wrote:
> +int kho_preserve_vmalloc(void *ptr, phys_addr_t *preservation)
> +{
> + struct kho_vmalloc_chunk *chunk, *first_chunk;
> + struct vm_struct *vm = find_vm_area(ptr);
> + int err;
> +
> + if (!vm)
> + return
On Sat, May 31, 2025 at 08:16:14PM -0700, David Rientjes wrote:
> Pratyush asked about the relationship between KHO and LUO. Pasha noted
> that KHO provides a state machine and in RFC v2 of LUO, LUO can drive KHO
> which makes the KHO debugfs interface optional. KHO activate will cause
> LUO to s
On Fri, Apr 04, 2025 at 04:24:54PM +, Pratyush Yadav wrote:
> Only if the objects in the slab cache are of a format that doesn't
> change, and I am not sure if that is the case anywhere. Maybe a driver
> written with KHO in mind would find it useful, but that's way down the
> line.
Things like
On Sun, Apr 06, 2025 at 07:11:14PM +0300, Mike Rapoport wrote:
> > > > We know what the future use case is for the folio preservation, all
> > > > the drivers and the iommu are going to rely on this.
> > >
> > > We don't know how much of the preservation will be based on folios.
> >
> > I think a
On Mon, Apr 07, 2025 at 07:31:21PM +0300, Mike Rapoport wrote:
> > alloc_pages is a 0 order "folio". vmalloc is an array of 0 order
> > folios (?)
>
> According to current Matthew's plan [1] vmalloc is misc memory :)
Someday! :)
> Ok, let's stick with memdesc then. Put aside the name it looks li
On Thu, Apr 10, 2025 at 05:51:51PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 09, 2025 at 01:28:37PM -0300, Jason Gunthorpe wrote:
> > On Wed, Apr 09, 2025 at 07:19:30PM +0300, Mike Rapoport wrote:
> > > But we have memdesc today, it's struct page.
> >
> > N
On Wed, Apr 09, 2025 at 07:19:30PM +0300, Mike Rapoport wrote:
> On Wed, Apr 09, 2025 at 12:37:14PM -0300, Jason Gunthorpe wrote:
> > On Wed, Apr 09, 2025 at 04:58:16PM +0300, Mike Rapoport wrote:
> > > >
> > > > I think we still don't really know what wil
On Wed, Apr 09, 2025 at 04:58:16PM +0300, Mike Rapoport wrote:
> > I'm not sure that is consistent with what Matthew is trying to build,
> > I think we are trying to remove 'struct page' usage, especially for
> > compound pages. Right now, though it is confusing, folio is the right
> > word to enco
On Wed, Apr 09, 2025 at 07:28:47PM +0300, Mike Rapoport wrote:
> On Mon, Apr 07, 2025 at 11:16:26AM -0300, Jason Gunthorpe wrote:
> > On Sun, Apr 06, 2025 at 07:11:14PM +0300, Mike Rapoport wrote:
> >
> > KHO needs to provide a way to give back an allocated struct page/folio
On Wed, Apr 09, 2025 at 12:06:27PM +0300, Mike Rapoport wrote:
> Now we've settled with terminology, and given that currently memdesc ==
> struct page, I think we need kho_preserve_folio(struct *folio) for actual
> struct folios and, apparently other high order allocations, and
> kho_preserve_page
On Sun, Apr 06, 2025 at 07:34:30PM +0300, Mike Rapoport wrote:
> It's more than 200 line longer than maple tree if we count the lines.
> My point is both table and xarrays are trying to optimize for an unknown
> goal.
Not unknown, the point of the bitmap scheme is to be memory
deterministic.
You
On Fri, Apr 04, 2025 at 04:53:13PM +0300, Mike Rapoport wrote:
> > Maybe change the reserved regions code to put the region list in a
> > folio and preserve the folio instead of using FDT as a "demo" for the
> > functionality.
>
> Folios are not available when we restore reserved regions, this jus
On Wed, Mar 19, 2025 at 01:35:31PM +, Pratyush Yadav wrote:
> On Tue, Mar 18 2025, Jason Gunthorpe wrote:
>
> > On Tue, Mar 18, 2025 at 11:02:31PM +, Pratyush Yadav wrote:
> >
> >> I suppose we can serialize all FDs when the box is sealed and get rid of
>
On Wed, Mar 19, 2025 at 06:55:42PM -0700, Changyuan Lyu wrote:
> From: Alexander Graf
>
> Add the core infrastructure to generate Kexec HandOver metadata. Kexec
> HandOver is a mechanism that allows Linux to preserve state - arbitrary
> properties as well as memory locations - across kexec.
>
>
On Wed, Mar 19, 2025 at 06:55:44PM -0700, Changyuan Lyu wrote:
> +/**
> + * kho_preserve_folio - preserve a folio across KHO.
> + * @folio: folio to preserve
> + *
> + * Records that the entire folio is preserved across KHO. The order
> + * will be preserved as well.
> + *
> + * Return: 0 on succes
On Thu, Apr 03, 2025 at 03:50:04PM +, Pratyush Yadav wrote:
> The patch currently has a limitation where it does not free any of the
> empty tables after a unpreserve operation. But Changyuan's patch also
> doesn't do it so at least it is not any worse off.
Why do we even have unpreserve? Just
On Wed, Apr 02, 2025 at 07:16:27PM +, Pratyush Yadav wrote:
> > +int kho_preserve_phys(phys_addr_t phys, size_t size)
> > +{
> > + unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size);
> > + unsigned int order = ilog2(end_pfn - pfn);
>
> This caught my eye when playing aroun
On Thu, Apr 03, 2025 at 05:37:06PM +, Pratyush Yadav wrote:
> And I think this will help make the 2 seconds much smaller as well later
> down the line since we can now find out if a given page is reserved in a
> few operations, and do it in parallel.
Yes, most certainly
> > This should be m
On Thu, Mar 27, 2025 at 05:28:40PM +, Pratyush Yadav wrote:
> > Otherwise we are going to be spending months just polishing this one
> > patch without any actual data on where the performance issues and hot
> > spots actually are.
>
> The memblock_reserve side we can optimize later, I agree. B
On Fri, Apr 04, 2025 at 12:54:25PM +0300, Mike Rapoport wrote:
> > IMHO it should not call kho_preserve_phys() at all.
>
> Do you mean that for preserving large physical ranges we need something
> entirely different?
If they don't use the buddy allocator, then yes?
> Then we don't need the bitma
On Thu, Apr 03, 2025 at 04:58:27PM +0300, Mike Rapoport wrote:
> On Thu, Apr 03, 2025 at 08:42:09AM -0300, Jason Gunthorpe wrote:
> > On Wed, Apr 02, 2025 at 07:16:27PM +, Pratyush Yadav wrote:
> > > > +int kho_preserve_phys(phys_addr_t phys, size_t size)
> > >
On Wed, Mar 26, 2025 at 10:40:29PM +, Pratyush Yadav wrote:
> Ideally, kho_preserve_folio() should be similar to freeing the folio,
> except that it doesn't go to buddy for re-allocation. In that case,
> re-using those pages should not be a problem as long as the driver made
> sure the page was
On Thu, Mar 27, 2025 at 10:03:17AM +, Pratyush Yadav wrote:
> Of course, with the current linked list structure, this cannot work. But
> I don't see why we need to have it. I think having a page-table like
> structure would be better -- only instead of having PTEs at the lowest
> levels, you h
On Mon, Mar 24, 2025 at 02:18:34PM -0400, Mike Rapoport wrote:
> On Sun, Mar 23, 2025 at 03:55:52PM -0300, Jason Gunthorpe wrote:
> > On Sat, Mar 22, 2025 at 03:12:26PM -0400, Mike Rapoport wrote:
> >
> > > > > + page->private = order;
> > >
On Sat, Mar 22, 2025 at 03:12:26PM -0400, Mike Rapoport wrote:
> This hunk actually came from me. I decided to keep it simple for now and
> check what are the alternatives, like moving away from memblock_reserve(),
> adding a maple_tree or even something else.
Okay, makes sense to me
> > > +
On Mon, Mar 24, 2025 at 05:21:45PM -0700, Changyuan Lyu wrote:
> Thanks for the suggestions! I am a little bit concerned about assuming
> every FDT fragment is smaller than PAGE_SIZE. In case a child FDT is
> larger than PAGE_SIZE, I would like to turn the single u64 in the parent
> FDT into a u64
On Sun, Mar 23, 2025 at 12:07:58PM -0700, Changyuan Lyu wrote:
> > > + down_read(&kho_out.tree_lock);
> > > + if (kho_out.fdt) {
> >
> > What is the lock and fdt test for?
>
> It is to avoid the competition between the following 2 operations,
> - converting the hashtables and mem traker to FDT,
>
On Sun, Mar 23, 2025 at 12:02:04PM -0700, Changyuan Lyu wrote:
> > Why are we changing this? I much prefered the idea of having recursive
> > FDTs than this notion copying eveything into tables then out into FDT?
> > Now that we have the preserved pages mechanism there is a pretty
> > direct path
On Wed, Mar 19, 2025 at 06:55:45PM -0700, Changyuan Lyu wrote:
> +int kho_copy_fdt(struct kimage *image)
> +{
> + int err = 0;
> + void *fdt;
> +
> + if (!kho_enable || !image->file_mode)
> + return 0;
> +
> + if (!kho_out.fdt) {
> + err = kho_finalize();
> +
> I didn't mean the exact flags value, but the ability to have
> per-folio flags. The exact bits and their meaning would of course
> need to be part of the ABI. Shmem uses the dirty and uptodate flags
> to track some state on the folios, and the flags can affect it's
> behavior (lazily zeroing ou
On Tue, Mar 18, 2025 at 11:02:31PM +, Pratyush Yadav wrote:
> I suppose we can serialize all FDs when the box is sealed and get rid of
> the struct file. If kexec fails, userspace can unseal the box, and FDs
> will be deserialized into a new struct file. This way, the behaviour
> from userspac
On Tue, Mar 18, 2025 at 03:25:25PM +0100, Christian Brauner wrote:
> > It is not really a stash, it is not keeping files, it is hardwired to
>
> Right now as written it is keeping references to files in these fdboxes
> and thus functioning both as a crippled high-privileged fdstore and a
> serial
On Sun, Mar 16, 2025 at 08:52:43PM -0700, David Rientjes wrote:
> Pasha asked how drivers would know if reservations would be denied in the
> first 4GB of memory. Mike said an error code would be returned. Pasha
> was specific about devices that wanted to preserve the memory because
> they knew
On Sun, Mar 09, 2025 at 01:03:31PM +0100, Christian Brauner wrote:
> So either that work is done right from the start or that stashing files
> goes out the window and instead that KHO part is implemented in a way
> where during a KHO dump relevant userspace is notified that they must
> now seriali
On Sat, Mar 08, 2025 at 12:09:53PM +0100, Christian Brauner wrote:
> On Fri, Mar 07, 2025 at 11:14:17AM -0400, Jason Gunthorpe wrote:
> > On Fri, Mar 07, 2025 at 10:31:39AM +0100, Christian Brauner wrote:
> > > On Fri, Mar 07, 2025 at 12:57:35AM +, Pratyush Yadav wrote
On Fri, Mar 07, 2025 at 10:31:39AM +0100, Christian Brauner wrote:
> On Fri, Mar 07, 2025 at 12:57:35AM +, Pratyush Yadav wrote:
> > The File Descriptor Box (FDBox) is a mechanism for userspace to name
> > file descriptors and give them over to the kernel to hold. They can
> > later be retrieve
On Sun, Feb 23, 2025 at 08:51:27PM +0200, Mike Rapoport wrote:
> On Wed, Feb 12, 2025 at 01:43:03PM -0400, Jason Gunthorpe wrote:
> > On Wed, Feb 12, 2025 at 06:39:06PM +0200, Mike Rapoport wrote:
> >
> > > As I've mentioned off-list earlier, KHO in its current form
On Tue, Feb 18, 2025 at 08:04:47PM -0800, David Rientjes wrote:
> - the future of guestmemfs and what it becomes, including alignment so
>prototyping can be done
IMHO we need a generic FDBOX sort of filesystem and the ability to put
guestmemfd, memfd and hugetlbfs (fd) into it. This would co
On Wed, Feb 12, 2025 at 06:39:06PM +0200, Mike Rapoport wrote:
> As I've mentioned off-list earlier, KHO in its current form is the lowest
> level of abstraction for state preservation and it is by no means is
> intended to provide complex drivers with all the tools necessary.
My point, is I thin
On Tue, Feb 11, 2025 at 12:37:20PM -0400, Jason Gunthorpe wrote:
> To do that you need to preserve folios as the basic primitive.
I made a small sketch of what I suggest.
I imagine the FDT schema for this would look something like this:
/dts-v1/;
/ {
compatible = "linux-kho,v1"