On Tue, 11 Sep 2007 18:36:34 -0700
Ethan Solomita <[EMAIL PROTECTED]> wrote:

> Add a dirty map to struct address_space

I get a tremendous number of rejects trying to wedge this stuff on top of
Peter's mm-dirty-balancing-for-tasks changes.  More rejects than I am
prepared to partially-fix so that I can usefully look at these changes in
tkdiff, so this is all based on a quick peek at the diff itself.

> In a NUMA system it is helpful to know where the dirty pages of a mapping
> are located. That way we will be able to implement writeout for applications
> that are constrained to a portion of the memory of the system as required by
> cpusets.
> 
> This patch implements the management of dirty node maps for an address
> space through the following functions:
> 
> cpuset_clear_dirty_nodes(mapping)     Clear the map of dirty nodes
> 
> cpuset_update_nodes(mapping, page)    Record a node in the dirty nodes map
> 
> cpuset_init_dirty_nodes(mapping)      First time init of the map
> 
> 
> The dirty map may be stored either directly in the mapping (for NUMA
> systems with less then BITS_PER_LONG nodes) or separately allocated
> for systems with a large number of nodes (f.e. IA64 with 1024 nodes).
> 
> Updating the dirty map may involve allocating it first for large
> configurations. Therefore we protect the allocation and setting
> of a node in the map through the tree_lock. The tree_lock is
> already taken when a page is dirtied so there is no additional
> locking overhead if we insert the updating of the nodemask there.
> 
> The dirty map is only cleared (or freed) when the inode is cleared.
> At that point no pages are attached to the inode anymore and therefore it can
> be done without any locking. The dirty map therefore records all nodes that
> have been used for dirty pages by that inode until the inode is no longer
> used.
>

It'd be nice to see some discussion regarding the memory consumption of
this patch and the associated tradeoffs.


> ...
>
> +#if MAX_NUMNODES <= BITS_PER_LONG

The patch is sprinkled full of this conditional.

  I don't understand why this is being done.  afaict it isn't described
  in a code comment (it should be) nor even in the changelogs?

  Given its overall complexity and its likelihood to change in the
  future, I'd suggest that this conditional be centralised in a single
  place.  Something like

  /*
   * nice comment goes here
   */
  #if MAX_NUMNODES <= BITS_PER_LONG
  #define CPUSET_DIRTY_LIMITS 1
  #else
  #define CPUSET_DIRTY_LIMITS 0
  #endif

  Then use #if CPUSET_DIRTY_LIMITS everywhere else.

  (This is better than #ifdef CPUSET_DIRTY_LIMITS because we'll get a
  warning if someone typos '#if CPUSET_DITRY_LIMITS')

  Even better would be to calculate CPUSET_DIRTY_LIMITS within Kconfig,
  but I suspect you'll need to jump through unfeasible hoops to do that
  sort of calculation within Kconfig.
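
To illustrate the suggestion above, here is a minimal userspace sketch of
the centralised switch.  The fallback definitions of MAX_NUMNODES and
BITS_PER_LONG and the helper dirty_map_is_embedded() are stand-ins invented
for this example and are not part of the patch; in the kernel both
constants already come from existing headers.

```c
#include <assert.h>
#include <limits.h>

/* Stand-ins for this sketch only; the kernel defines both elsewhere. */
#ifndef MAX_NUMNODES
#define MAX_NUMNODES 64
#endif
#ifndef BITS_PER_LONG
#if ULONG_MAX == 0xffffffffUL
#define BITS_PER_LONG 32
#else
#define BITS_PER_LONG 64
#endif
#endif

/*
 * The one and only place where the comparison is made.  The macro is
 * always defined (to 0 or 1), so every user tests it with #if.
 */
#if MAX_NUMNODES <= BITS_PER_LONG
#define CPUSET_DIRTY_LIMITS 1
#else
#define CPUSET_DIRTY_LIMITS 0
#endif

/* Hypothetical user of the switch: reports which layout was selected. */
static inline int dirty_map_is_embedded(void)
{
#if CPUSET_DIRTY_LIMITS
	return 1;	/* nodemask stored directly in the mapping */
#else
	return 0;	/* nodemask allocated separately, reached via a pointer */
#endif
}
```

With gcc's -Wundef, a misspelt '#if CPUSET_DITRY_LIMITS' evaluates an
undefined identifier and produces a warning, whereas the equivalent #ifdef
would silently be treated as false.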


> --- 0/include/linux/fs.h      2007-09-11 14:35:58.000000000 -0700
> +++ 1/include/linux/fs.h      2007-09-11 14:36:24.000000000 -0700
> @@ -516,6 +516,13 @@ struct address_space {
>       spinlock_t              private_lock;   /* for use by the address_space */
>       struct list_head        private_list;   /* ditto */
>       struct address_space    *assoc_mapping; /* ditto */
> +#ifdef CONFIG_CPUSETS
> +#if MAX_NUMNODES <= BITS_PER_LONG
> +     nodemask_t              dirty_nodes;    /* nodes with dirty pages */
> +#else
> +     nodemask_t              *dirty_nodes;   /* pointer to map if dirty */
> +#endif
> +#endif

afaict there is no code comment and no changelog text which explains the
above design decision?  There should be, please.

There is talk of making cpusets available with CONFIG_SMP=n.  Will this new
feature be available in that case?  (it should be).

>  } __attribute__((aligned(sizeof(long))));
>       /*
>        * On most architectures that alignment is already the case; but
> diff -uprN -X 0/Documentation/dontdiff 0/include/linux/writeback.h 1/include/linux/writeback.h
> --- 0/include/linux/writeback.h       2007-09-11 14:35:58.000000000 -0700
> +++ 1/include/linux/writeback.h       2007-09-11 14:37:46.000000000 -0700
> @@ -62,6 +62,7 @@ struct writeback_control {
>       unsigned for_writepages:1;      /* This is a writepages() call */
>       unsigned range_cyclic:1;        /* range_start is cyclic */
>       void *fs_private;               /* For use by ->writepages() */
> +     nodemask_t *nodes;              /* Set of nodes of interest */
>  };

That comment is a bit terse.  It's always good to be lavish when commenting
data structures, for understanding those is key to understanding a design.

>  /*
> diff -uprN -X 0/Documentation/dontdiff 0/kernel/cpuset.c 1/kernel/cpuset.c
> --- 0/kernel/cpuset.c 2007-09-11 14:35:58.000000000 -0700
> +++ 1/kernel/cpuset.c 2007-09-11 14:36:24.000000000 -0700
> @@ -4,7 +4,7 @@
>   *  Processor and Memory placement constraints for sets of tasks.
>   *
>   *  Copyright (C) 2003 BULL SA.
> - *  Copyright (C) 2004-2006 Silicon Graphics, Inc.
> + *  Copyright (C) 2004-2007 Silicon Graphics, Inc.
>   *  Copyright (C) 2006 Google, Inc
>   *
>   *  Portions derived from Patrick Mochel's sysfs code.
> @@ -14,6 +14,7 @@
>   *  2003-10-22 Updates by Stephen Hemminger.
>   *  2004 May-July Rework by Paul Jackson.
>   *  2006 Rework by Paul Menage to use generic containers
> + *  2007 Cpuset writeback by Christoph Lameter.
>   *
>   *  This file is subject to the terms and conditions of the GNU General Public
>   *  License.  See the file COPYING in the main directory of the Linux
> @@ -1754,6 +1755,63 @@ int cpuset_mem_spread_node(void)
>  }
>  EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
>  
> +#if MAX_NUMNODES > BITS_PER_LONG

waah.  In other places we do "MAX_NUMNODES <= BITS_PER_LONG"

> +
> +/*
> + * Special functions for NUMA systems with a large number of nodes.
> + * The nodemask is pointed to from the address space structures.
> + * The attachment of the dirty_node mask is protected by the
> + * tree_lock. The nodemask is freed only when the inode is cleared
> + * (and therefore unused, thus no locking necessary).
> + */

hmm, OK, there's a hint as to what's going on.

It's unobvious why the break point is at MAX_NUMNODES = BITS_PER_LONG and
we might want to tweak that in the future.  Yet another argument for
centralising this comparison.
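
A guess at the reasoning, since the patch doesn't state it: a nodemask is a
bitmap with one bit per possible node, so while MAX_NUMNODES <=
BITS_PER_LONG it fits in a single long and costs no more than the pointer
that would otherwise replace it.  The sketch below just does that
arithmetic in userspace; nodemask_bytes() and LONG_BITS_SKETCH are
hypothetical names made up for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's BITS_PER_LONG. */
#define LONG_BITS_SKETCH (sizeof(unsigned long) * CHAR_BIT)

/* Bytes occupied by a bitmap of nr_nodes bits, rounded up to whole longs. */
static size_t nodemask_bytes(size_t nr_nodes)
{
	size_t longs = (nr_nodes + LONG_BITS_SKETCH - 1) / LONG_BITS_SKETCH;

	return longs * sizeof(unsigned long);
}
```

For the 1024-node configuration mentioned in the changelog,
nodemask_bytes(1024) is 128, so embedding the map would add 128 bytes to
every address_space, while any node count up to the number of bits in a
long needs only sizeof(unsigned long) bytes.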

> +void cpuset_update_dirty_nodes(struct address_space *mapping,
> +                     struct page *page)
> +{
> +     nodemask_t *nodes = mapping->dirty_nodes;
> +     int node = page_to_nid(page);
> +
> +     if (!nodes) {
> +             nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);

Does it have to be atomic?  atomic is weak and can fail.

If some callers can do GFP_KERNEL and some can only do GFP_ATOMIC then we
should at least pass the gfp_t into this function so it can do the stronger
allocation when possible.
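
Roughly what I have in mind, as a userspace sketch rather than kernel code.
Every name with a _sketch suffix is a stand-in invented for this example,
and returning an error code instead of silently returning is my own
assumption, not something the patch does.  Note that the one caller visible
in this diff runs under tree_lock, so it would presumably still have to
pass the atomic flag.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Everything suffixed _sketch is a hypothetical userspace stand-in. */
typedef unsigned int gfp_sketch_t;
#define GFP_ATOMIC_SKETCH 0x1u	/* caller cannot sleep */
#define GFP_KERNEL_SKETCH 0x2u	/* caller may sleep: stronger allocation */

#define MAX_NODES_SKETCH 1024
#define LONG_BITS_SKETCH (sizeof(unsigned long) * CHAR_BIT)

struct nodemask_sketch {
	unsigned long bits[MAX_NODES_SKETCH / (sizeof(unsigned long) * CHAR_BIT)];
};

struct mapping_sketch {
	struct nodemask_sketch *dirty_nodes;	/* NULL until first dirtying */
};

/* Stand-in for kmalloc(): userspace cannot honour gfp, so just record it. */
static gfp_sketch_t last_gfp_sketch;

static void *alloc_sketch(size_t size, gfp_sketch_t gfp)
{
	last_gfp_sketch = gfp;
	return calloc(1, size);	/* zeroed, like *nodes = NODE_MASK_NONE */
}

/* Returns 0 on success, -1 if the map could not be allocated. */
static int update_dirty_nodes_sketch(struct mapping_sketch *mapping,
				     int node, gfp_sketch_t gfp)
{
	struct nodemask_sketch *nodes = mapping->dirty_nodes;

	assert(node >= 0 && node < MAX_NODES_SKETCH);

	if (!nodes) {
		/* The caller, not this function, decides how hard to try. */
		nodes = alloc_sketch(sizeof(*nodes), gfp);
		if (!nodes)
			return -1;
		mapping->dirty_nodes = nodes;
	}

	nodes->bits[node / LONG_BITS_SKETCH] |= 1UL << (node % LONG_BITS_SKETCH);
	return 0;
}

/* Usage: a caller that is allowed to sleep asks for the stronger allocation. */
static int demo_sketch(void)
{
	struct mapping_sketch mapping = { NULL };
	int ret = update_dirty_nodes_sketch(&mapping, 70, GFP_KERNEL_SKETCH);
	int ok = ret == 0 && mapping.dirty_nodes &&
		 last_gfp_sketch == GFP_KERNEL_SKETCH &&
		 (mapping.dirty_nodes->bits[70 / LONG_BITS_SKETCH] &
		  (1UL << (70 % LONG_BITS_SKETCH))) != 0;

	free(mapping.dirty_nodes);
	return ok;
}
```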


> +             if (!nodes)
> +                     return;
> +
> +             *nodes = NODE_MASK_NONE;
> +             mapping->dirty_nodes = nodes;
> +     }
> +
> +     if (!node_isset(node, *nodes))
> +             node_set(node, *nodes);
> +}
> +
> +void cpuset_clear_dirty_nodes(struct address_space *mapping)
> +{
> +     nodemask_t *nodes = mapping->dirty_nodes;
> +
> +     if (nodes) {
> +             mapping->dirty_nodes = NULL;
> +             kfree(nodes);
> +     }
> +}

Can this race with cpuset_update_dirty_nodes()?  And with itself?  If not,
a comment which describes the locking requirements would be good.

> +/*
> + * Called without the tree_lock. The nodemask is only freed when the inode
> + * is cleared and therefore this is safe.
> + */
> +int cpuset_intersects_dirty_nodes(struct address_space *mapping,
> +                     nodemask_t *mask)
> +{
> +     nodemask_t *dirty_nodes = mapping->dirty_nodes;
> +
> +     if (!mask)
> +             return 1;
> +
> +     if (!dirty_nodes)
> +             return 0;
> +
> +     return nodes_intersects(*dirty_nodes, *mask);
> +}
> +#endif
> +
>  /**
>   * cpuset_excl_nodes_overlap - Do we overlap @p's mem_exclusive ancestors?
>   * @p: pointer to task_struct of some other task.
> diff -uprN -X 0/Documentation/dontdiff 0/mm/page-writeback.c 1/mm/page-writeback.c
> --- 0/mm/page-writeback.c     2007-09-11 14:35:58.000000000 -0700
> +++ 1/mm/page-writeback.c     2007-09-11 14:36:24.000000000 -0700
> @@ -33,6 +33,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/buffer_head.h>
>  #include <linux/pagevec.h>
> +#include <linux/cpuset.h>
>  
>  /*
>   * The maximum number of pages to writeout in a single bdflush/kupdate
> @@ -832,6 +833,7 @@ int __set_page_dirty_nobuffers(struct pa
>                       radix_tree_tag_set(&mapping->page_tree,
>                               page_index(page), PAGECACHE_TAG_DIRTY);
>               }
> +             cpuset_update_dirty_nodes(mapping, page);
>               write_unlock_irq(&mapping->tree_lock);
>               if (mapping->host) {
>                       /* !PageAnon && !swapper_space */
> 
> 
> 