This patch introduces change brackets: begin_change and end_change, similar to Ext3's journal_start/journal_stop. A pair of brackets groups together all the backing store updates needed to bring the filesystem to the consistent state expected after a given operation.
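To illustrate the pairing, here is a minimal user-space sketch of one bracketed operation. Everything in it other than the names begin_change and end_change is hypothetical: struct sb is a stand-in that only counts open brackets, and add_inode, add_dirent and undo_inode are stubs for the real backing store updates. The point is just that every exit path passes through the closing bracket.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the superblock: "open" only counts
 * begin_change calls that have not yet been matched. */
struct sb { int open; };

void begin_change(struct sb *sb)
{
	sb->open++;
}

void end_change(struct sb *sb)
{
	assert(sb->open > 0);	/* catches an end without a begin */
	sb->open--;
}

/* Hypothetical stubs for the two backing store updates of a create. */
static int add_inode(int fail) { return fail ? -ENOSPC : 0; }
static int add_dirent(int fail) { return fail ? -EEXIST : 0; }
static void undo_inode(void) { }

/* Both updates sit inside one bracket; a single exit label keeps
 * begin_change and end_change paired on success and on failure. */
int create_entry(struct sb *sb, int fail_inode, int fail_dirent)
{
	int err;

	begin_change(sb);
	err = add_inode(fail_inode);
	if (err)
		goto out;
	err = add_dirent(fail_dirent);
	if (err)
		undo_inode();
out:
	end_change(sb);
	return err;
}
```

An early return between the brackets would leave a change open forever, which is why the sketch funnels all paths through one label.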
An end_change checks whether it is time to make a delta transition, and if so, waits for all concurrent changes to run to completion, then does the transition, including committing the current filesystem state to disk. A begin_change ensures that no commit will occur before the matching end_change is reached.

This mechanism is described in greater detail here, with skeleton code:

   http://kerneltrap.org/mailarchive/tux3/2008/12/30/4546684
   Design note: Atomic filesystem changes

(Note the correction in the followup post.)

For now the commit_delta operation in end_change will use a crude synchronous approach, including the steps:

   - Flush dirty inode table blocks
   - Flush dirty dleaf blocks
   - Flush dirty bitmap blocks
   - Flush log blocks
   - Wait for all transfers to complete
   - Update superblock pointer to new delta commit block

This is just to give an idea of what happens in end_change, which should help in understanding what is going on in the patch below.

Filesystem changes fall into three categories:

   - Name changes
      - Various operations in kernel/namei.c

   - Data changes
      - File truncate in inode.c
      - File write in filemap.c (wrong!!!)

   - Attribute changes
      - Setattr except truncate
      - Extended attributes

Name changes require grouping together the directory change and updates to the inode table in the same delta. Data changes require grouping together the data write, index update and inode attribute changes, including ->i_size, in the same delta. Attribute changes require grouping together the inode table and atom table changes.

Notes:

 * Operation brackets as currently conceived do not nest. I am not sure whether or not they must nest; this needs pondering.

 * Delta staging and delta commit are big operations that take place under ->i_mutex locks, and possibly other locks. We need to think carefully about the locks that staging and commit take.

 * map_region is not the right place to wrap data writes; this has to be done in a whole flock of higher level places.

 * The begin_change operation does not yet check that enough space is available for the worst case metadata space required by an operation. It needs to do this, to avoid commit failing on ENOSPC for metadata.

 * Setattr is not done. The VFS does truncates through setattr, creating a horrible little tangle. Truncate is handled at a lower level, and we can leave other setattr attribute updates for later.

 * Truncate... this patch assumes it is always possible to do the truncate in one commit. Is it? Anyway, it would be much better if all truncate does is log an i_size change, and let the truncate be incremental after that.

Currently, begin_change and end_change are just no-ops, which is good because the usage in map_region below is wrong and will not work: in map_region, the page for which buffers are being mapped has not been committed to disk yet, so if a delta commit occurs then, there will be a small window when some data will be referenced before it has arrived on disk. Correct usage requires wrapping the write operations at a higher level, which I will attempt in a later patch.

The purpose of the current patch is to check the usage in the more straightforward cases of name and xattr operations, and to check that all file operations are covered.

This is a call for eyeballs! This is the skeleton of atomic commit; it needs to be pondered carefully.

diff -r a24f282a8451 user/kernel/filemap.c
--- a/user/kernel/filemap.c	Mon Jan 05 03:05:20 2009 -0800
+++ b/user/kernel/filemap.c	Mon Jan 05 13:44:07 2009 -0800
@@ -20,6 +20,8 @@ void show_segs(struct seg map[], unsigne
 
 static int map_region(struct inode *inode, block_t start, unsigned count, struct seg map[], unsigned max_segs, int create)
 {
+	struct sb *sb = tux_sb(inode->i_sb);
+	begin_change(sb);
 	struct cursor *cursor = alloc_cursor(&tux_inode(inode)->btree, 1); /* allows for depth increase */
 	if (!cursor)
 		return -ENOMEM;
@@ -33,7 +35,6 @@ static int map_region(struct inode *inod
 	block_t limit = start + count;
 	trace("--- index %Lx, limit %Lx ---", (L)start, (L)limit);
 	struct btree *btree = cursor->btree;
-	struct sb *sb = btree->sb;
 	int err, segs = 0;
 
 	if (!btree->root.depth)
@@ -194,6 +195,7 @@ out_unlock:
 	else
 		up_read(&cursor->btree->lock);
 	free_cursor(cursor);
+	end_change(sb);
 	return segs;
 }
 
diff -r a24f282a8451 user/kernel/inode.c
--- a/user/kernel/inode.c	Mon Jan 05 03:05:20 2009 -0800
+++ b/user/kernel/inode.c	Mon Jan 05 13:44:07 2009 -0800
@@ -288,11 +288,13 @@ static void tux3_truncate(struct inode *
 	/* FIXME: must fix expand size */
 	WARN_ON(inode->i_size);
 	block_truncate_page(inode->i_mapping, inode->i_size, tux3_get_block);
+	begin_change(sb);
 	err = tree_chop(&tux_inode(inode)->btree, &del_info, 0);
 	inode->i_blocks = ((inode->i_size + sb->blockmask)
 			   & ~(loff_t)sb->blockmask) >> 9;
 	inode->i_mtime = inode->i_ctime = gettime();
 	mark_inode_dirty(inode);
+	end_change(sb);
 }
 
 void tux3_delete_inode(struct inode *inode)
diff -r a24f282a8451 user/kernel/namei.c
--- a/user/kernel/namei.c	Mon Jan 05 03:05:20 2009 -0800
+++ b/user/kernel/namei.c	Mon Jan 05 13:44:07 2009 -0800
@@ -60,15 +60,21 @@ static int tux3_mknod(struct inode *dir,
 //	if (!huge_valid_dev(rdev))
 //		return -EINVAL;
 
+	begin_change(tux_sb(dir->i_sb));
 	inode = tux_create_inode(dir, mode, rdev);
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		err = tux_add_dirent(dir, dentry, inode);
-		if (!err)
-			return 0;
+		if (!err) {
+			if ((inode->i_mode & S_IFMT) == S_IFDIR)
+				inode_inc_link_count(dir);
+			goto out;
+		}
 		inode_dec_link_count(inode);
 		iput(inode);
 	}
+out:
+	end_change(tux_sb(dir->i_sb));
 	return err;
 }
 
@@ -79,13 +85,9 @@ static int tux3_create(struct inode *dir
 
 static int tux3_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 {
-	int err;
 	if (dir->i_nlink >= TUX_LINK_MAX)
 		return -EMLINK;
-	err = tux3_mknod(dir, dentry, S_IFDIR | mode, 0);
-	if (!err)
-		inode_inc_link_count(dir);
-	return err;
+	return tux3_mknod(dir, dentry, S_IFDIR | mode, 0);
 }
 
 static int tux3_link(struct dentry *old_dentry, struct inode *dir,
@@ -97,6 +99,7 @@ static int tux3_link(struct dentry *old_
 	if (inode->i_nlink >= TUX_LINK_MAX)
 		return -EMLINK;
 
+	begin_change(tux_sb(inode->i_sb));
 	inode->i_ctime = gettime();
 	inode_inc_link_count(inode);
 	atomic_inc(&inode->i_count);
@@ -105,6 +108,7 @@ static int tux3_link(struct dentry *old_
 		inode_dec_link_count(inode);
 		iput(inode);
 	}
+	end_change(tux_sb(inode->i_sb));
 	return err;
 }
 
@@ -114,6 +118,7 @@ static int tux3_symlink(struct inode *di
 	struct inode *inode;
 	int err;
 
+	begin_change(tux_sb(dir->i_sb));
 	inode = tux_create_inode(dir, S_IFLNK | S_IRWXUGO, 0);
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
@@ -121,22 +126,26 @@ static int tux3_symlink(struct inode *di
 		if (!err) {
 			err = tux_add_dirent(dir, dentry, inode);
 			if (!err)
-				return 0;
+				goto out;
 		}
 		inode_dec_link_count(inode);
 		iput(inode);
 	}
+out:
+	end_change(tux_sb(dir->i_sb));
 	return err;
 }
 
 static int tux3_unlink(struct inode *dir, struct dentry *dentry)
 {
 	struct inode *inode = dentry->d_inode;
+	begin_change(tux_sb(inode->i_sb));
 	int err = tux_del_dirent(dir, dentry);
 	if (!err) {
 		inode->i_ctime = dir->i_ctime;
 		inode_dec_link_count(inode);
 	}
+	end_change(tux_sb(inode->i_sb));
 	return err;
 }
 
@@ -147,6 +156,7 @@ static int tux3_rmdir(struct inode *dir,
 
 	err = tux_dir_is_empty(inode);
 	if (!err) {
+		begin_change(tux_sb(inode->i_sb));
 		err = tux_del_dirent(dir, dentry);
 		if (!err) {
 			inode->i_ctime = dir->i_ctime;
@@ -155,6 +165,7 @@ static int tux3_rmdir(struct inode *dir,
 			mark_inode_dirty(inode);
 			inode_dec_link_count(dir);
 		}
+		end_change(tux_sb(inode->i_sb));
 	}
 	return err;
 }
@@ -176,6 +187,7 @@ static int tux3_rename(struct inode *old
 	/* FIXME: is this needed? */
 	BUG_ON(from_be_u64(old_entry->inum) != tux_inode(old_inode)->inum);
 
+	begin_change(tux_sb(old_inode->i_sb));
 	if (new_inode) {
 		int old_is_dir = S_ISDIR(old_inode->i_mode);
 		if (old_is_dir) {
@@ -225,9 +237,11 @@ static int tux3_rename(struct inode *old
 
 	if (!err && new_subdir)
 		inode_dec_link_count(old_dir);
+	end_change(tux_sb(old_inode->i_sb));
 	return err;
 
 error:
+	end_change(tux_sb(old_inode->i_sb));
 	brelse(old_buffer);
 	return err;
 }
diff -r a24f282a8451 user/kernel/tux3.h
--- a/user/kernel/tux3.h	Mon Jan 05 03:05:20 2009 -0800
+++ b/user/kernel/tux3.h	Mon Jan 05 13:44:07 2009 -0800
@@ -726,4 +726,6 @@ static inline struct inode *buffer_inode
 }
 #endif /* !__KERNEL__ */
 
+static inline void begin_change(struct sb *sb) { };
+static inline void end_change(struct sb *sb) { };
 #endif
diff -r a24f282a8451 user/kernel/xattr.c
--- a/user/kernel/xattr.c	Mon Jan 05 03:05:20 2009 -0800
+++ b/user/kernel/xattr.c	Mon Jan 05 13:44:07 2009 -0800
@@ -377,9 +377,11 @@ int set_xattr(struct inode *inode, const
 {
 	struct inode *atable = tux_sb(inode->i_sb)->atable;
 	mutex_lock(&atable->i_mutex);
+	begin_change(tux_sb(inode->i_sb));
 	atom_t atom = make_atom(atable, name, len);
 	int err = (atom == -1) ? -EINVAL :
 		xcache_update(inode, atom, data, size, flags);
+	end_change(tux_sb(inode->i_sb));
 	mutex_unlock(&atable->i_mutex);
 	return err;
 }
@@ -389,6 +391,7 @@ int del_xattr(struct inode *inode, const
 	int err = 0;
 	struct inode *atable = tux_sb(inode->i_sb)->atable;
 	mutex_lock(&atable->i_mutex);
+	begin_change(tux_sb(inode->i_sb));
 	atom_t atom = find_atom(atable, name, len);
 	if (atom == -1) {
 		err = -ENOATTR;
@@ -404,6 +407,7 @@ int del_xattr(struct inode *inode, const
 	if (used)
 		use_atom(atable, atom, -used);
 out:
+	end_change(tux_sb(inode->i_sb));
 	mutex_unlock(&atable->i_mutex);
 	return err;
 }

_______________________________________________
Tux3 mailing list
Tux3@tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3