Re: [PATCH 00/11] reiserfs: xattr rework
Hello, > Following this message is a series of 11 patches that rework the reiserfs > xattr code. The current implementation open codes a number of functions that > are well tested and stable elsewhere, but will slight modifications. It also > does a number of things suboptimally, such as operations that affect all of > the xattrs associated with an inode, as well as not handling journalling as > well as can be. > > I've run them through a weekend of stress testing successfully, but I'd like > some additional testing before considering them safe. > > Here's the run down: > > * 01 - Make internal xattr operations use vfs ops > This eliminates the open coding of the read/write loops in favor of using > vfs_read and vfs_write. Performance-wise, it's very little difference > from the open coding, and implementation-wise, it's more tested. Yes, this > violates the no-file-io-in-the-kernel rules, but this rule has been violated > since the day the original patches were accepted. I'm not completely sure but from briefly looking at the patches I think there might be the following problem: you first start a transaction and then call a VFS write operation. That will lock a page it wants to write. But if bdflush works on the xattr file and sends some dirty data to disk, it will first take page lock and then start the transaction. This introduces essentially a lock inversion (as starting a transaction behaves like taking a lock) and hence deadlocks... I've been solving these for the quota code some time ago (quota also has similar needs as your xattr code) - I really had some deadlock reports for heavily loaded machines. There the only reasonable solution was an extra write functions bypassing the page cache. Maybe in your case you can solve the problems differently as you don't need the working solution for all the filesystems but just for Reiserfs. Honza -- Jan Kara <[EMAIL PROTECTED]> SuSE CR Labs
Re: [PATCH 00/11] reiserfs: xattr rework
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Jan Kara wrote: > Hello, > >> Following this message is a series of 11 patches that rework the reiserfs >> xattr code. The current implementation open codes a number of functions that >> are well tested and stable elsewhere, but will slight modifications. It also >> does a number of things suboptimally, such as operations that affect all of >> the xattrs associated with an inode, as well as not handling journalling as >> well as can be. >> >> I've run them through a weekend of stress testing successfully, but I'd like >> some additional testing before considering them safe. >> >> Here's the run down: >> >> * 01 - Make internal xattr operations use vfs ops >> This eliminates the open coding of the read/write loops in favor of using >> vfs_read and vfs_write. Performance-wise, it's very little difference >> from the open coding, and implementation-wise, it's more tested. Yes, this >> violates the no-file-io-in-the-kernel rules, but this rule has been >> violated >> since the day the original patches were accepted. > I'm not completely sure but from briefly looking at the patches I > think there might be the following problem: you first start a transaction > and then call a VFS write operation. That will lock a page it wants to > write. But if bdflush works on the xattr file and sends some dirty data > to disk, it will first take page lock and then start the transaction. > This introduces essentially a lock inversion (as starting a transaction > behaves like taking a lock) and hence deadlocks... I've been solving > these for the quota code some time ago (quota also has similar needs as > your xattr code) - I really had some deadlock reports for heavily loaded > machines. There the only reasonable solution was an extra write > functions bypassing the page cache. Maybe in your case you can solve the > problems differently as you don't need the working solution for all the > filesystems but just for Reiserfs. Sigh. Ok. The way you describe it definitely makes it a lock inversion issue. I haven't run into it yet, but as you say, it occurs on heavily loaded machines. I've done some load testing, but apparently not enough, since your analysis is sound. But, I think there is a silver lining after all. It sounds like you've already worked around these issues for the journaled quota code. What do you think about turning the quota read/write functions into something more generic and using that for xattrs as well as quotas? Ultimately, I think that quota files and xattrs are the same things - "normal" files read from and written to during the I/O path. The changes to journaled quotas would be minimal - just turn reiserfs_quota_{read,write} into small wrappers that make the type->inode mapping and then call the remainder of the existing code as reiserfs_internal_{read,write} with the appropriate inode. The xattr code could juse the reiserfs_internal_{read,write} similarly and get all the deadlock avoidance work you've already done for free. - -Jeff - -- Jeff Mahoney SUSE Labs -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.2 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFEB2l0LPWxlyuTD7IRAsuaAKCHnEKlcgxcIGEV5n5f3m4CY5c/pACeK4CH h1cj7wLkY+irYbaagmUHtm4= =o4nE -END PGP SIGNATURE-
Re: [PATCH 00/11] reiserfs: xattr rework
> -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > Jan Kara wrote: > > Hello, > > > >> Following this message is a series of 11 patches that rework the reiserfs > >> xattr code. The current implementation open codes a number of functions > >> that > >> are well tested and stable elsewhere, but will slight modifications. It > >> also > >> does a number of things suboptimally, such as operations that affect all of > >> the xattrs associated with an inode, as well as not handling journalling as > >> well as can be. > >> > >> I've run them through a weekend of stress testing successfully, but I'd > >> like > >> some additional testing before considering them safe. > >> > >> Here's the run down: > >> > >> * 01 - Make internal xattr operations use vfs ops > >> This eliminates the open coding of the read/write loops in favor of using > >> vfs_read and vfs_write. Performance-wise, it's very little difference > >> from the open coding, and implementation-wise, it's more tested. Yes, > >> this > >> violates the no-file-io-in-the-kernel rules, but this rule has been > >> violated > >> since the day the original patches were accepted. > > I'm not completely sure but from briefly looking at the patches I > > think there might be the following problem: you first start a transaction > > and then call a VFS write operation. That will lock a page it wants to > > write. But if bdflush works on the xattr file and sends some dirty data > > to disk, it will first take page lock and then start the transaction. > > This introduces essentially a lock inversion (as starting a transaction > > behaves like taking a lock) and hence deadlocks... I've been solving > > these for the quota code some time ago (quota also has similar needs as > > your xattr code) - I really had some deadlock reports for heavily loaded > > machines. There the only reasonable solution was an extra write > > functions bypassing the page cache. Maybe in your case you can solve the > > problems differently as you don't need the working solution for all the > > filesystems but just for Reiserfs. > > Sigh. Ok. The way you describe it definitely makes it a lock inversion > issue. I haven't run into it yet, but as you say, it occurs on heavily > loaded machines. I've done some load testing, but apparently not enough, > since your analysis is sound. > > But, I think there is a silver lining after all. It sounds like you've > already worked around these issues for the journaled quota code. What do > you think about turning the quota read/write functions into something > more generic and using that for xattrs as well as quotas? > > Ultimately, I think that quota files and xattrs are the same things - > "normal" files read from and written to during the I/O path. The changes > to journaled quotas would be minimal - just turn > reiserfs_quota_{read,write} into small wrappers that make the > type->inode mapping and then call the remainder of the existing code as > reiserfs_internal_{read,write} with the appropriate inode. The xattr > code could juse the reiserfs_internal_{read,write} similarly and get all > the deadlock avoidance work you've already done for free. You are right that the quota code and xattrs need to do the same thing. We only need to do slight interface changes (currently functions take a superblock and a type and pick appropriate quota inode themselves) and function renaming. I would vote for renaming the s_op->quota_{read,write} to s_op->internal_{read,write} and pass appropriate inode directly from the quota code. The only thing I'm not sure about is how to deal with the journaling mode - quota code either uses data journaling or just ordered mode depending on mount options (journaled / non-journaled quota). So we probably also need to pass the journaling mode to the write function. BTW: note that using these functions bypassing page cache means that userspace really should not touch these files. It is asking for data corruption. Quota code does during quotaon sync the quota inode and set it as immutable to prevent accidents. Also during quotaoff it flushes the page cache of the inode so that userspace is able to see the changes made by kernel. I guess something similar will be needed for xattrs too. Honza -- Jan Kara <[EMAIL PROTECTED]> SuSE CR Labs
Re: [PATCH 00/11] reiserfs: xattr rework
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Jan Kara wrote: > You are right that the quota code and xattrs need to do the same thing. > We only need to do slight interface changes (currently functions take a > superblock and a type and pick appropriate quota inode themselves) and > function renaming. I would vote for renaming the s_op->quota_{read,write} > to s_op->internal_{read,write} and pass appropriate inode directly from > the quota code. The only thing I'm not sure about is how to deal with the > journaling mode - quota code either uses data journaling or just ordered > mode depending on mount options (journaled / non-journaled quota). So we > probably also need to pass the journaling mode to the write function. > BTW: note that using these functions bypassing page cache means that > userspace really should not touch these files. It is asking for data > corruption. Quota code does during quotaon sync the quota inode and > set it as immutable to prevent accidents. Also during quotaoff it flushes > the page cache of the inode so that userspace is able to see the changes > made by kernel. I guess something similar will be needed for xattrs too. If you feel that changing the entire quota system to reflect the change is a good plan, that's your call. Personally, I'd like to keep the patches as small as possible, but if you think there is a need for internal_{read,write} elsewhere, I wouldn't object. The data journaling mode can be set as a flag associated with the inode. Currently, i_data_log is set in REISERFS_I(inode)->i_flags. I add i_data_ordered in one of my later patches. They can be tested easily with reiserfs_file_data_{log,ordered}. There's no reason that one couldn't be moved up and made a prerequisite for the first patch. There is one difference between quota files and xattrs: The quota files are visible to userspace, xattrs are completely hidden. There's nothing needed to flush anything to the page cache. I'll work up a patch locally and do some testing. - -Jeff - -- Jeff Mahoney SUSE Labs -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.2 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFEB5RPLPWxlyuTD7IRAn94AJwKvmGcT09QtjcBOFJoyd6JrxRTywCeN5nE giNMMOkNlSsuGCwR6ad9/ao= =DOMs -END PGP SIGNATURE-
Re: [PATCH 00/11] reiserfs: xattr rework
> -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > Jan Kara wrote: > > You are right that the quota code and xattrs need to do the same thing. > > We only need to do slight interface changes (currently functions take a > > superblock and a type and pick appropriate quota inode themselves) and > > function renaming. I would vote for renaming the s_op->quota_{read,write} > > to s_op->internal_{read,write} and pass appropriate inode directly from > > the quota code. The only thing I'm not sure about is how to deal with the > > journaling mode - quota code either uses data journaling or just ordered > > mode depending on mount options (journaled / non-journaled quota). So we > > probably also need to pass the journaling mode to the write function. > > BTW: note that using these functions bypassing page cache means that > > userspace really should not touch these files. It is asking for data > > corruption. Quota code does during quotaon sync the quota inode and > > set it as immutable to prevent accidents. Also during quotaoff it flushes > > the page cache of the inode so that userspace is able to see the changes > > made by kernel. I guess something similar will be needed for xattrs too. > > If you feel that changing the entire quota system to reflect the change > is a good plan, that's your call. Personally, I'd like to keep the > patches as small as possible, but if you think there is a need for > internal_{read,write} elsewhere, I wouldn't object. OK, when I'm thinking about it in the morning, you're probably right. And I can do the bigger change if I see that more filesystems would use it too. > The data journaling mode can be set as a flag associated with the inode. > Currently, i_data_log is set in REISERFS_I(inode)->i_flags. I add > i_data_ordered in one of my later patches. They can be tested easily > with reiserfs_file_data_{log,ordered}. There's no reason that one > couldn't be moved up and made a prerequisite for the first patch. Fine. So we can just set proper journaling flags in reiserfs_quota_on and then honor them in the internal writing functions. Honza -- Jan Kara <[EMAIL PROTECTED]> SuSE CR Labs
Re: [PATCH 00/11] reiserfs: xattr rework
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Jan Kara wrote: >> The data journaling mode can be set as a flag associated with the inode. >> Currently, i_data_log is set in REISERFS_I(inode)->i_flags. I add >> i_data_ordered in one of my later patches. They can be tested easily >> with reiserfs_file_data_{log,ordered}. There's no reason that one >> couldn't be moved up and made a prerequisite for the first patch. > Fine. So we can just set proper journaling flags in reiserfs_quota_on > and then honor them in the internal writing functions. Ok, how do the attached patches look to you? The internal I/O changes need to be applied after the journaled xattr patch or we get an Oops trying to start a transaction without calling reiserfs_write_lock() first. I've modified the first patch in the xattr series to abstract out the fp->f_op->{read,write} calls to an xattr_{read,write} pair of functions. This makes it easier to move to the internal i/o code later. I've included it for clarity, but that is the only change. - -Jeff - -- Jeff Mahoney SUSE Labs -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.2 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFECMCALPWxlyuTD7IRAlqfAJ0bQnHuNqzEif4hVfvGKI8tR2bUrACfU1Mg BFbe0xayAJHhvNgtGl7N6Jk= =P6ED -END PGP SIGNATURE- From: Jeff Mahoney <[EMAIL PROTECTED]> Subject: [PATCH 10/13] reiserfs: make quota file i/o routines generic Jan Kara, with his work on journaled quotas, discovered some conditions were lock inversions can occur between pdflush and the internal file i/o paths when transactions are involved. This patch modifies the quota code in a few ways: 1) Makes the reiserfs_quota_{read,write} generic and adds wrappers to map the quota type to the appropriate inode. 2) Uses the i_data_log flag to mark whether the quota file is journaled or not. This is the standard way of doing this, and allows the generic code to work without knowledge of quotas. fs/reiserfs/super.c | 212 include/linux/reiserfs_fs.h |5 + 2 files changed, 122 insertions(+), 95 deletions(-) Signed-off-by: Jeff Mahoney <[EMAIL PROTECTED]> diff -ruNpX ../dontdiff linux-2.6.15-staging1/fs/reiserfs/super.c linux-2.6.15-staging2/fs/reiserfs/super.c --- linux-2.6.15-staging1/fs/reiserfs/super.c 2006-03-03 17:09:01.0 -0500 +++ linux-2.6.15-staging2/fs/reiserfs/super.c 2006-03-03 17:09:04.0 -0500 @@ -1949,6 +1949,109 @@ static int reiserfs_statfs(struct super_ return 0; } +#if defined(CONFIG_QUOTA) || defined(CONFIG_REISERFS_FS_XATTR) +/* Read data from quotafile - avoid pagecache and such because we cannot afford + * acquiring the locks... As quota files are never truncated and quota code + * itself serializes the operations (and noone else should touch the files) + * we don't have to be afraid of races */ +ssize_t reiserfs_internal_read(struct inode *inode, char *data, size_t len, + loff_t off) +{ + struct super_block *sb = inode->i_sb; + unsigned long blk = off >> sb->s_blocksize_bits; + int err = 0, offset = off & (sb->s_blocksize - 1), tocopy; + size_t toread; + struct buffer_head tmp_bh, *bh; + loff_t i_size = i_size_read(inode); + + if (off > i_size) + return 0; + if (off + len > i_size) + len = i_size - off; + toread = len; + while (toread > 0) { + tocopy = + sb->s_blocksize - offset < + toread ? sb->s_blocksize - offset : toread; + tmp_bh.b_state = 0; + /* Quota files are without tails so we can safely use this function */ + reiserfs_write_lock(sb); + err = reiserfs_get_block(inode, blk, &tmp_bh, 0); + reiserfs_write_unlock(sb); + if (err) + return err; + if (!buffer_mapped(&tmp_bh)) /* A hole? */ + memset(data, 0, tocopy); + else { + bh = sb_bread(sb, tmp_bh.b_blocknr); + if (!bh) +return -EIO; + memcpy(data, bh->b_data + offset, tocopy); + brelse(bh); + } + offset = 0; + toread -= tocopy; + data += tocopy; + blk++; + } + return len; +} + +/* Write to quotafile (we know the transaction is already started and has + * enough credits) */ +ssize_t reiserfs_internal_write(struct inode *inode, const char *data, +size_t len, loff_t off) +{ + struct super_block *sb = inode->i_sb; + unsigned long blk = off >> sb->s_blocksize_bits; + int err = 0, offset = off & (sb->s_blocksize - 1), tocopy; + size_t towrite = len; + struct buffer_head tmp_bh, *bh; + + mutex_lock(&inode->i_mutex); + while (towrite > 0) { + tocopy = sb->s_blocksize - offset < towrite ? + sb->s_blocksize - offset : towrite; + tmp_bh.b_state = 0; + err = reiserfs_get_block(inode, blk, &tmp_bh, GET_BLOCK_CREATE); + if (err) + goto out; + if (offset || tocopy != sb->s_blocksize) + bh = sb_bread(sb, tmp_bh.b_blocknr); + else + bh = sb_getblk(sb, tmp_bh.b_blocknr); + if (!bh) { + err = -EIO; + goto out; + } + lock_buffer(bh); + memcpy(bh->b_data + offset, data, tocopy); + flu
Re: [PATCH 00/11] reiserfs: xattr rework
> -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > Jan Kara wrote: > >> The data journaling mode can be set as a flag associated with the inode. > >> Currently, i_data_log is set in REISERFS_I(inode)->i_flags. I add > >> i_data_ordered in one of my later patches. They can be tested easily > >> with reiserfs_file_data_{log,ordered}. There's no reason that one > >> couldn't be moved up and made a prerequisite for the first patch. > > Fine. So we can just set proper journaling flags in reiserfs_quota_on > > and then honor them in the internal writing functions. > > Ok, how do the attached patches look to you? The internal I/O changes > need to be applied after the journaled xattr patch or we get an Oops > trying to start a transaction without calling reiserfs_write_lock() > first. I've modified the first patch in the xattr series to abstract out > the fp->f_op->{read,write} calls to an xattr_{read,write} pair of > functions. This makes it easier to move to the internal i/o code later. > I've included it for clarity, but that is the only change. The patch looks fine. Just two minor comments: > diff -ruNpX ../dontdiff linux-2.6.15-staging1/fs/reiserfs/super.c > linux-2.6.15-staging2/fs/reiserfs/super.c > --- linux-2.6.15-staging1/fs/reiserfs/super.c 2006-03-03 17:09:01.0 > -0500 > +++ linux-2.6.15-staging2/fs/reiserfs/super.c 2006-03-03 17:09:04.0 > -0500 > @@ -1949,6 +1949,109 @@ static int reiserfs_statfs(struct super_ > return 0; > } > > +#if defined(CONFIG_QUOTA) || defined(CONFIG_REISERFS_FS_XATTR) > +/* Read data from quotafile - avoid pagecache and such because we cannot > afford > + * acquiring the locks... As quota files are never truncated and quota code > + * itself serializes the operations (and noone else should touch the files) > + * we don't have to be afraid of races */ Update here the comment to reflect that we use this function also for xattrs now - I suppose those files cannot be truncated either and that xattr code serializes the operations there. > +ssize_t reiserfs_internal_read(struct inode *inode, char *data, size_t len, > + loff_t off) > +/* Write to quotafile (we know the transaction is already started and has > + * enough credits) */ Here again update the comment... > +ssize_t reiserfs_internal_write(struct inode *inode, const char *data, > +size_t len, loff_t off) Honza -- Jan Kara <[EMAIL PROTECTED]> SuSE CR Labs
Re: [PATCH 00/11] reiserfs: xattr rework
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Jan Kara wrote: >> -BEGIN PGP SIGNED MESSAGE- >> Hash: SHA1 >> >> Jan Kara wrote: The data journaling mode can be set as a flag associated with the inode. Currently, i_data_log is set in REISERFS_I(inode)->i_flags. I add i_data_ordered in one of my later patches. They can be tested easily with reiserfs_file_data_{log,ordered}. There's no reason that one couldn't be moved up and made a prerequisite for the first patch. >>> Fine. So we can just set proper journaling flags in reiserfs_quota_on >>> and then honor them in the internal writing functions. >> Ok, how do the attached patches look to you? The internal I/O changes >> need to be applied after the journaled xattr patch or we get an Oops >> trying to start a transaction without calling reiserfs_write_lock() >> first. I've modified the first patch in the xattr series to abstract out >> the fp->f_op->{read,write} calls to an xattr_{read,write} pair of >> functions. This makes it easier to move to the internal i/o code later. >> I've included it for clarity, but that is the only change. > The patch looks fine. Just two minor comments: > > >> diff -ruNpX ../dontdiff linux-2.6.15-staging1/fs/reiserfs/super.c >> linux-2.6.15-staging2/fs/reiserfs/super.c >> --- linux-2.6.15-staging1/fs/reiserfs/super.c2006-03-03 >> 17:09:01.0 -0500 >> +++ linux-2.6.15-staging2/fs/reiserfs/super.c2006-03-03 >> 17:09:04.0 -0500 >> @@ -1949,6 +1949,109 @@ static int reiserfs_statfs(struct super_ >> return 0; >> } >> >> +#if defined(CONFIG_QUOTA) || defined(CONFIG_REISERFS_FS_XATTR) >> +/* Read data from quotafile - avoid pagecache and such because we cannot >> afford >> + * acquiring the locks... As quota files are never truncated and quota code >> + * itself serializes the operations (and noone else should touch the files) >> + * we don't have to be afraid of races */ > Update here the comment to reflect that we use this function also for > xattrs now - I suppose those files cannot be truncated either and that > xattr code serializes the operations there. > >> +ssize_t reiserfs_internal_read(struct inode *inode, char *data, size_t len, >> + loff_t off) > > >> +/* Write to quotafile (we know the transaction is already started and has >> + * enough credits) */ > Here again update the comment... > >> +ssize_t reiserfs_internal_write(struct inode *inode, const char *data, >> +size_t len, loff_t off) > > Honza > I've updated the patches with the comment changes, though I did run into a much bigger snag. The internal i/o patches don't support tails, and that's a silver bullet against this working for xattrs. Most xattrs, such as ACLs, are likley to be only a few tens of bytes long and allocating an entire block is extremely wasteful. I've managed to alter internal read to handle tails by allocating an anonymous page and using it with the temporary buffer head to get the tail data from reiserfs_get_block back. But the rest of the tail packing code very much needs the page cache. Is there going to be any way this can be managed without reintroducing deadlocks? - -Jeff - -- Jeff Mahoney SUSE Labs -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.2 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFEDf16LPWxlyuTD7IRApj7AJ4jymXed/tOb5tpVar1SZ6ZnrYl0ACfUuxk eC6kOQW5zNuUIaoLrV0gIf8= =u6L1 -END PGP SIGNATURE-
Re: [PATCH 00/11] reiserfs: xattr rework
> -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > Jan Kara wrote: > >> -BEGIN PGP SIGNED MESSAGE- > >> Hash: SHA1 > >> > >> Jan Kara wrote: > The data journaling mode can be set as a flag associated with the inode. > Currently, i_data_log is set in REISERFS_I(inode)->i_flags. I add > i_data_ordered in one of my later patches. They can be tested easily > with reiserfs_file_data_{log,ordered}. There's no reason that one > couldn't be moved up and made a prerequisite for the first patch. > >>> Fine. So we can just set proper journaling flags in reiserfs_quota_on > >>> and then honor them in the internal writing functions. > >> Ok, how do the attached patches look to you? The internal I/O changes > >> need to be applied after the journaled xattr patch or we get an Oops > >> trying to start a transaction without calling reiserfs_write_lock() > >> first. I've modified the first patch in the xattr series to abstract out > >> the fp->f_op->{read,write} calls to an xattr_{read,write} pair of > >> functions. This makes it easier to move to the internal i/o code later. > >> I've included it for clarity, but that is the only change. > > The patch looks fine. Just two minor comments: > > > > > >> diff -ruNpX ../dontdiff linux-2.6.15-staging1/fs/reiserfs/super.c > >> linux-2.6.15-staging2/fs/reiserfs/super.c > >> --- linux-2.6.15-staging1/fs/reiserfs/super.c 2006-03-03 > >> 17:09:01.0 -0500 > >> +++ linux-2.6.15-staging2/fs/reiserfs/super.c 2006-03-03 > >> 17:09:04.0 -0500 > >> @@ -1949,6 +1949,109 @@ static int reiserfs_statfs(struct super_ > >>return 0; > >> } > >> > >> +#if defined(CONFIG_QUOTA) || defined(CONFIG_REISERFS_FS_XATTR) > >> +/* Read data from quotafile - avoid pagecache and such because we cannot > >> afford > >> + * acquiring the locks... As quota files are never truncated and quota > >> code > >> + * itself serializes the operations (and noone else should touch the > >> files) > >> + * we don't have to be afraid of races */ > > Update here the comment to reflect that we use this function also for > > xattrs now - I suppose those files cannot be truncated either and that > > xattr code serializes the operations there. > > > >> +ssize_t reiserfs_internal_read(struct inode *inode, char *data, size_t > >> len, > >> + loff_t off) > > > > > >> +/* Write to quotafile (we know the transaction is already started and has > >> + * enough credits) */ > > Here again update the comment... > > > >> +ssize_t reiserfs_internal_write(struct inode *inode, const char *data, > >> +size_t len, loff_t off) > > > > Honza > > > > I've updated the patches with the comment changes, though I did run into > a much bigger snag. > > The internal i/o patches don't support tails, and that's a silver bullet > against this working for xattrs. Most xattrs, such as ACLs, are likley > to be only a few tens of bytes long and allocating an entire block is > extremely wasteful. Umm, that is really nasty. Ext3 solves this by sharing a block among several inodes but that's far to much work to fix this bug... > I've managed to alter internal read to handle tails by allocating an > anonymous page and using it with the temporary buffer head to get the > tail data from reiserfs_get_block back. But the rest of the tail packing > code very much needs the page cache. Is there going to be any way this > can be managed without reintroducing deadlocks? I've been trying to find some other way when solving problems for quotas but find none. If you want xattr changes to be journaled with other data changes, you have to first start a transaction and then issue a write that will consequently need PageLock. So do you really need a trasaction started before a write starts? For journaled quota this was must but for xattrs it might not be necessary. Then we would still need to sort out the problems with xattr lock but that might be easier to deal with. Bye Honza -- Jan Kara <[EMAIL PROTECTED]> SuSE CR Labs
Re: [PATCH 00/11] reiserfs: xattr rework
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Jan Kara wrote: >> The internal i/o patches don't support tails, and that's a silver bullet >> against this working for xattrs. Most xattrs, such as ACLs, are likley >> to be only a few tens of bytes long and allocating an entire block is >> extremely wasteful. > Umm, that is really nasty. Ext3 solves this by sharing a block among > several inodes but that's far to much work to fix this bug... I had considered sharing files, and the code knows to drop a link to a shared file when it's changed. That's one of the features I had wanted from the beginning but never got around to implementing. >> I've managed to alter internal read to handle tails by allocating an >> anonymous page and using it with the temporary buffer head to get the >> tail data from reiserfs_get_block back. But the rest of the tail packing >> code very much needs the page cache. Is there going to be any way this >> can be managed without reintroducing deadlocks? > I've been trying to find some other way when solving problems for > quotas but find none. If you want xattr changes to be journaled with > other data changes, you have to first start a transaction and then issue > a write that will consequently need PageLock. So do you really need > a trasaction started before a write starts? For journaled quota this was > must but for xattrs it might not be necessary. Then we would still need > to sort out the problems with xattr lock but that might be easier to > deal with. The locking constraints for xattrs have changed somewhat. Have you looked at the rest of the xattr patches, or am I being dumb and missing your point? Do the quota races occur when they are completely journaled or only when ordered writes are used? I'm wondering if perhaps we could alter reiserfs_file_write a bit to not mark the buffers dirty yet for internal files, and have the ordered writeout do it then. Then, we wouldn't race against pdflush. - -Jeff - -- Jeff Mahoney SUSE Labs -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.2 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFEDyyALPWxlyuTD7IRAiCuAJ4o8/SpIk8VBAl8/98kppRB9o7VcACfSGnM pTqsa6fCMSDNBZoketOyx0I= =ZKxx -END PGP SIGNATURE-
Re: [PATCH 00/11] reiserfs: xattr rework
On Mar 08, 2006 14:12 -0500, Jeff Mahoney wrote: > Jan Kara wrote: > >> The internal i/o patches don't support tails, and that's a silver bullet > >> against this working for xattrs. Most xattrs, such as ACLs, are likley > >> to be only a few tens of bytes long and allocating an entire block is > >> extremely wasteful. > > > > Umm, that is really nasty. Ext3 solves this by sharing a block among > > several inodes but that's far to much work to fix this bug... > > I had considered sharing files, and the code knows to drop a link to a > shared file when it's changed. That's one of the features I had wanted > from the beginning but never got around to implementing. Just FYI, ext3 has recently implemented support for larger inodes exactly to store small EAs with the inode instead of an external block. For ACLs there is some benefit to sharing the block, because the overhead is amortized over many inodes. However, virtually all other EA data is unique per inode and the ext3 EA block sharing only works if ALL the EAs for an inode are identical, so that isn't very useful if you have anything other than ACLs to store. The performance of in-inode EAs is vastly better than external blocks because of seeks and not wasting 4kB of disk/RAM for 10-100 bytes of EA. Not sure if this is useful (haven't been following discussion too closely), but thought I would steer you away from the shared-block idea early. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
Re: [PATCH 00/11] reiserfs: xattr rework
All of this is why xattrs should not be implemented without first implementing plugins.:-/ Really guys, there was a reason for my not wanting xattrs to go into V3. Do you see it now? This is what happens when marketing determines feature ship schedules. You could implement this xattr stuff in V4 in 1/5th the time, 5 times the performance, and twice the elegance. Of course, implementing them as pseudo files would be far more elegant still Hans
Re: [PATCH 00/11] reiserfs: xattr rework
Jeff Mahoney wrote: > > > And as I've said before, Hans, if the original code base was capable of > supporting the plethora of items the white paper hyped, we wouldn't have > run into this problem either. This is an odd criticism. V3 never pretended to have item plugins. V4 does. If you add a new item plugin to V4, it will work quite well. > > I would have loved to have implemented xattrs as another item type, but > as soon as I did that, the kernel crashed almost instantly on not > recognizing the new item type in the balance code. While it was > certainly fixable in that version, properly fixing it would have > required a ReiserFS 3.7 with capability bits similar to ext[23]. Looking > back, maybe that wouldn't have been such a bad thing. All of the performance issues and stable code destabilization issues you are experiencing now are what I expected, and what motivated my wanting to defer acls for V4, and by so doing (and thus concentrating our development resources on V4) advance the schedule for V4. > > As for waiting for v4, we've been through this before. Users wanted ACLs > on ReiserFS yesterday, and I'd hardly brush aside features that users > have been demanding as marketing. There is a difference between brushing them aside, and choosing to put them into the next major release because proper solution of them requires deeper work than stuffing them into a file. Also, marketing is important. The problem is when marketing motivates wishful thinking about the shape of the code. Architecture is like sculpting, you must sense what the wood or marble wants to be shaped into as clearly as you see the vision in your head of what you want it to be. When it won't support the shape you want, then in sculpting you need another piece of wood, and in programming you need another major release. That said, I understand you are seeking to provide the users with what they want, and that while we disagree on the best strategy for that, we both try to do the best we can for the users. > There's no denying that a > reiser4-based solution would have been cleaner, but sometimes we just > have to make do with what we've got. > > -Jeff > > -- > Jeff Mahoney > SUSE Labs
Re: [PATCH 00/11] reiserfs: xattr rework
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hans Reiser wrote: > All of this is why xattrs should not be implemented without first > implementing plugins.:-/ > > Really guys, there was a reason for my not wanting xattrs to go into > V3. Do you see it now? This is what happens when marketing determines > feature ship schedules. You could implement this xattr stuff in V4 in > 1/5th the time, 5 times the performance, and twice the elegance. Of > course, implementing them as pseudo files would be far more elegant > still And as I've said before, Hans, if the original code base was capable of supporting the plethora of items the white paper hyped, we wouldn't have run into this problem either. I would have loved to have implemented xattrs as another item type, but as soon as I did that, the kernel crashed almost instantly on not recognizing the new item type in the balance code. While it was certainly fixable in that version, properly fixing it would have required a ReiserFS 3.7 with capability bits similar to ext[23]. Looking back, maybe that wouldn't have been such a bad thing. As for waiting for v4, we've been through this before. Users wanted ACLs on ReiserFS yesterday, and I'd hardly brush aside features that users have been demanding as marketing. There's no denying that a reiser4-based solution would have been cleaner, but sometimes we just have to make do with what we've got. - -Jeff - -- Jeff Mahoney SUSE Labs -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.2 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFEEH3lLPWxlyuTD7IRAgz6AJ4l4/f92LJAKs65OqAbl8cIIL1PRACdE5Yb MmQcruV2Nmd2l3RS+V0TYrI= =ujKr -END PGP SIGNATURE-