On Wed, Feb 14, 2007 at 03:32:04PM -0500, Lee Schermerhorn wrote: > On Tue, 2007-02-13 at 07:09 +0100, Nick Piggin wrote: > > Hi, > > > > Just tinkering around with this and got something working, so I'll see > > if anyone else wants to try it. > > > > Not proposing for inclusion, but I'd be interested in comments or results. > > > > Thanks, > > Nick > > I've included a small patch below that allow me to build and boot with > these patches on an HP NUMA platform. I'm still seeing an "unable to
Thanks Lee. Merged. > > - Would like to be able to control replication via userspace, and maybe > > even internally to the kernel. > How about per cpuset? Consider a cpuset, on a NUMA system, with cpus > and memories from a specific set of nodes. One might choose to have > page cache pages referenced by tasks in this cpuset to be pulled into > the cpuset's memories for local access. The remainder of the system may > choose not to replicate page cache pages--e.g., to conserve memory. > However, "unreplicating" on write would still need to work system wide. > > But, note: may [probably] want option to disable replication for shmem > pages? I'm thinking here of large data base shmem regions that, at any > time, might have a lot of pages accessed "read only". Probably wouldn't > want a lot of replication/unreplication happening behind the scene. Yeah cpusets is an interesting possibility. A per-inode attribute could be another one. The good old global sysctl is also a must :) > > - Ideally, reclaim might reclaim replicated pages preferentially, however > > I aim to be _minimally_ intrusive. > > - Would like to replicate PagePrivate, but filesystem may dirty page via > > buffers. Any solutions? (currently should mount with 'nobh'). > Linux migrates pages with PagePrivate using a per mapping migratepage > address space op to handle the buffers. File systems can provide their > own or use a generic version. How about a "replicatepage" aop? I guess the main problem is those filesystems which dirty the page via the buffers, via b_this_page, or b_data. However AFAIKS, these only happen for things like directories. I _think_ we can safely assume that regular file pages will not get modified (that would be data corruption!). > > +struct page * find_get_page_readonly(struct address_space *mapping, > > unsigned long offset) > > +{ > > + struct page *page; > > + > > +retry: > > + read_lock_irq(&mapping->tree_lock); > > + if (radix_tree_tag_get(&mapping->page_tree, offset, > > + PAGECACHE_TAG_REPLICATED)) { > > + int nid; > > + struct pcache_desc *pcd; > > +replicated: > > + nid = numa_node_id(); > > + pcd = radix_tree_lookup(&mapping->page_tree, offset); > ??? possible NULL pcd? I believe I'm seeing one here... Hmm, OK. I'll have to do some stress testing. I'm sure there are a few bugs left. > > > + if (!node_isset(nid, pcd->nodes_present)) { > Do this check [and possible replicate] only if replication enabled > [system wide?, per cpuset? based on explicit replication policy?, ...]? Yep. > > + struct page *repl_page; > > + > > + page = pcd->master; > > + page_cache_get(page); > > + read_unlock_irq(&mapping->tree_lock); > > + repl_page = alloc_pages_node(nid, > > + mapping_gfp_mask(mapping), 0); > ??? don't try to hard to allocate page, as it's only a performance > optimization. E.g., add in GFP_THISNODE and remove and __GFP_WAIT? I think that has merit. The problem if we remove __GFP_WAIT is that the page allocator gives us access to some reserves. __GFP_NORETRY should be reasonable? > > > + if (!repl_page) > > + return page; > > + copy_highpage(repl_page, page); > > + flush_dcache_page(repl_page); > > + page->mapping = mapping; > > + page->index = offset; > > + SetPageUptodate(repl_page); /* XXX: nonatomic */ > > + page_cache_release(page); > > + write_lock_irq(&mapping->tree_lock); > > + __insert_replicated_page(repl_page, mapping, offset, > > nid); > ??? can this fail due to race? Don't care because we retry the lookup? > page freed [released] in the function... Yeah, I told you it was ugly :P Sorry you had to wade through this, but it can be cleaned up.. > > EXPORT_SYMBOL(find_lock_page); > ??? should find_trylock_page() handle potential replicated page? > until it is removed, anyway? It is removed upstream, but in 2.6.20 it has no callers anyway so I didn't worry about it. Thanks for the comments & patch. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/