Re: [PATCH 00/11] reiserfs: xattr rework

2006-03-01 Thread Jan Kara
  Hello,

> Following this message is a series of 11 patches that rework the reiserfs
> xattr code. The current implementation open codes a number of functions that
> are well tested and stable elsewhere, but will slight modifications. It also
> does a number of things suboptimally, such as operations that affect all of
> the xattrs associated with an inode, as well as not handling journalling as
> well as can be.
> 
> I've run them through a weekend of stress testing successfully, but I'd like
> some additional testing before considering them safe.
> 
> Here's the run down:
> 
> * 01 - Make internal xattr operations use vfs ops
>   This eliminates the open coding of the read/write loops in favor of using
>   vfs_read and vfs_write. Performance-wise, it's very little difference
>   from the open coding, and implementation-wise, it's more tested. Yes, this
>   violates the no-file-io-in-the-kernel rules, but this rule has been violated
>   since the day the original patches were accepted.
  I'm not completely sure but from briefly looking at the patches I
think there might be the following problem: you first start a transaction
and then call a VFS write operation. That will lock a page it wants to
write. But if bdflush works on the xattr file and sends some dirty data
to disk, it will first take page lock and then start the transaction.
This introduces essentially a lock inversion (as starting a transaction
behaves like taking a lock) and hence deadlocks... I've been solving
these for the quota code some time ago (quota also has similar needs as
your xattr code) - I really had some deadlock reports for heavily loaded
machines. There the only reasonable solution was an extra write
functions bypassing the page cache. Maybe in your case you can solve the
problems differently as you don't need the working solution for all the
filesystems but just for Reiserfs. 

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs


Re: [PATCH 00/11] reiserfs: xattr rework

2006-03-02 Thread Jeff Mahoney
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Jan Kara wrote:
>   Hello,
> 
>> Following this message is a series of 11 patches that rework the reiserfs
>> xattr code. The current implementation open codes a number of functions that
>> are well tested and stable elsewhere, but will slight modifications. It also
>> does a number of things suboptimally, such as operations that affect all of
>> the xattrs associated with an inode, as well as not handling journalling as
>> well as can be.
>>
>> I've run them through a weekend of stress testing successfully, but I'd like
>> some additional testing before considering them safe.
>>
>> Here's the run down:
>>
>> * 01 - Make internal xattr operations use vfs ops
>>   This eliminates the open coding of the read/write loops in favor of using
>>   vfs_read and vfs_write. Performance-wise, it's very little difference
>>   from the open coding, and implementation-wise, it's more tested. Yes, this
>>   violates the no-file-io-in-the-kernel rules, but this rule has been 
>> violated
>>   since the day the original patches were accepted.
>   I'm not completely sure but from briefly looking at the patches I
> think there might be the following problem: you first start a transaction
> and then call a VFS write operation. That will lock a page it wants to
> write. But if bdflush works on the xattr file and sends some dirty data
> to disk, it will first take page lock and then start the transaction.
> This introduces essentially a lock inversion (as starting a transaction
> behaves like taking a lock) and hence deadlocks... I've been solving
> these for the quota code some time ago (quota also has similar needs as
> your xattr code) - I really had some deadlock reports for heavily loaded
> machines. There the only reasonable solution was an extra write
> functions bypassing the page cache. Maybe in your case you can solve the
> problems differently as you don't need the working solution for all the
> filesystems but just for Reiserfs. 

Sigh. Ok. The way you describe it definitely makes it a lock inversion
issue. I haven't run into it yet, but as you say, it occurs on heavily
loaded machines. I've done some load testing, but apparently not enough,
since your analysis is sound.

But, I think there is a silver lining after all. It sounds like you've
already worked around these issues for the journaled quota code. What do
you think about turning the quota read/write functions into something
more generic and using that for xattrs as well as quotas?

Ultimately, I think that quota files and xattrs are the same things -
"normal" files read from and written to during the I/O path. The changes
to journaled quotas would be minimal - just turn
reiserfs_quota_{read,write} into small wrappers that make the
type->inode mapping and then call the remainder of the existing code as
reiserfs_internal_{read,write} with the appropriate inode. The xattr
code could juse the reiserfs_internal_{read,write} similarly and get all
the deadlock avoidance work you've already done for free.

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFEB2l0LPWxlyuTD7IRAsuaAKCHnEKlcgxcIGEV5n5f3m4CY5c/pACeK4CH
h1cj7wLkY+irYbaagmUHtm4=
=o4nE
-END PGP SIGNATURE-


Re: [PATCH 00/11] reiserfs: xattr rework

2006-03-02 Thread Jan Kara
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Jan Kara wrote:
> >   Hello,
> > 
> >> Following this message is a series of 11 patches that rework the reiserfs
> >> xattr code. The current implementation open codes a number of functions 
> >> that
> >> are well tested and stable elsewhere, but will slight modifications. It 
> >> also
> >> does a number of things suboptimally, such as operations that affect all of
> >> the xattrs associated with an inode, as well as not handling journalling as
> >> well as can be.
> >>
> >> I've run them through a weekend of stress testing successfully, but I'd 
> >> like
> >> some additional testing before considering them safe.
> >>
> >> Here's the run down:
> >>
> >> * 01 - Make internal xattr operations use vfs ops
> >>   This eliminates the open coding of the read/write loops in favor of using
> >>   vfs_read and vfs_write. Performance-wise, it's very little difference
> >>   from the open coding, and implementation-wise, it's more tested. Yes, 
> >> this
> >>   violates the no-file-io-in-the-kernel rules, but this rule has been 
> >> violated
> >>   since the day the original patches were accepted.
> >   I'm not completely sure but from briefly looking at the patches I
> > think there might be the following problem: you first start a transaction
> > and then call a VFS write operation. That will lock a page it wants to
> > write. But if bdflush works on the xattr file and sends some dirty data
> > to disk, it will first take page lock and then start the transaction.
> > This introduces essentially a lock inversion (as starting a transaction
> > behaves like taking a lock) and hence deadlocks... I've been solving
> > these for the quota code some time ago (quota also has similar needs as
> > your xattr code) - I really had some deadlock reports for heavily loaded
> > machines. There the only reasonable solution was an extra write
> > functions bypassing the page cache. Maybe in your case you can solve the
> > problems differently as you don't need the working solution for all the
> > filesystems but just for Reiserfs. 
> 
> Sigh. Ok. The way you describe it definitely makes it a lock inversion
> issue. I haven't run into it yet, but as you say, it occurs on heavily
> loaded machines. I've done some load testing, but apparently not enough,
> since your analysis is sound.
> 
> But, I think there is a silver lining after all. It sounds like you've
> already worked around these issues for the journaled quota code. What do
> you think about turning the quota read/write functions into something
> more generic and using that for xattrs as well as quotas?
> 
> Ultimately, I think that quota files and xattrs are the same things -
> "normal" files read from and written to during the I/O path. The changes
> to journaled quotas would be minimal - just turn
> reiserfs_quota_{read,write} into small wrappers that make the
> type->inode mapping and then call the remainder of the existing code as
> reiserfs_internal_{read,write} with the appropriate inode. The xattr
> code could juse the reiserfs_internal_{read,write} similarly and get all
> the deadlock avoidance work you've already done for free.
  You are right that the quota code and xattrs need to do the same thing.
We only need to do slight interface changes (currently functions take a
superblock and a type and pick appropriate quota inode themselves) and
function renaming. I would vote for renaming the s_op->quota_{read,write}
to s_op->internal_{read,write} and pass appropriate inode directly from
the quota code. The only thing I'm not sure about is how to deal with the
journaling mode - quota code either uses data journaling or just ordered
mode depending on mount options (journaled / non-journaled quota). So we
probably also need to pass the journaling mode to the write function.
  BTW: note that using these functions bypassing page cache means that
userspace really should not touch these files. It is asking for data
corruption. Quota code does during quotaon sync the quota inode and
set it as immutable to prevent accidents. Also during quotaoff it flushes
the page cache of the inode so that userspace is able to see the changes
made by kernel. I guess something similar will be needed for xattrs too.

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs


Re: [PATCH 00/11] reiserfs: xattr rework

2006-03-02 Thread Jeff Mahoney
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Jan Kara wrote:
>   You are right that the quota code and xattrs need to do the same thing.
> We only need to do slight interface changes (currently functions take a
> superblock and a type and pick appropriate quota inode themselves) and
> function renaming. I would vote for renaming the s_op->quota_{read,write}
> to s_op->internal_{read,write} and pass appropriate inode directly from
> the quota code. The only thing I'm not sure about is how to deal with the
> journaling mode - quota code either uses data journaling or just ordered
> mode depending on mount options (journaled / non-journaled quota). So we
> probably also need to pass the journaling mode to the write function.
>   BTW: note that using these functions bypassing page cache means that
> userspace really should not touch these files. It is asking for data
> corruption. Quota code does during quotaon sync the quota inode and
> set it as immutable to prevent accidents. Also during quotaoff it flushes
> the page cache of the inode so that userspace is able to see the changes
> made by kernel. I guess something similar will be needed for xattrs too.

If you feel that changing the entire quota system to reflect the change
is a good plan, that's your call. Personally, I'd like to keep the
patches as small as possible, but if you think there is a need for
internal_{read,write} elsewhere, I wouldn't object.

The data journaling mode can be set as a flag associated with the inode.
 Currently, i_data_log is set in REISERFS_I(inode)->i_flags. I add
i_data_ordered in one of my later patches. They can be tested easily
with reiserfs_file_data_{log,ordered}. There's no reason that one
couldn't be moved up and made a prerequisite for the first patch.

There is one difference between quota files and xattrs: The quota files
are visible to userspace, xattrs are completely hidden. There's nothing
needed to flush anything to the page cache.

I'll work up a patch locally and do some testing.

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFEB5RPLPWxlyuTD7IRAn94AJwKvmGcT09QtjcBOFJoyd6JrxRTywCeN5nE
giNMMOkNlSsuGCwR6ad9/ao=
=DOMs
-END PGP SIGNATURE-


Re: [PATCH 00/11] reiserfs: xattr rework

2006-03-03 Thread Jan Kara
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Jan Kara wrote:
> >   You are right that the quota code and xattrs need to do the same thing.
> > We only need to do slight interface changes (currently functions take a
> > superblock and a type and pick appropriate quota inode themselves) and
> > function renaming. I would vote for renaming the s_op->quota_{read,write}
> > to s_op->internal_{read,write} and pass appropriate inode directly from
> > the quota code. The only thing I'm not sure about is how to deal with the
> > journaling mode - quota code either uses data journaling or just ordered
> > mode depending on mount options (journaled / non-journaled quota). So we
> > probably also need to pass the journaling mode to the write function.
> >   BTW: note that using these functions bypassing page cache means that
> > userspace really should not touch these files. It is asking for data
> > corruption. Quota code does during quotaon sync the quota inode and
> > set it as immutable to prevent accidents. Also during quotaoff it flushes
> > the page cache of the inode so that userspace is able to see the changes
> > made by kernel. I guess something similar will be needed for xattrs too.
> 
> If you feel that changing the entire quota system to reflect the change
> is a good plan, that's your call. Personally, I'd like to keep the
> patches as small as possible, but if you think there is a need for
> internal_{read,write} elsewhere, I wouldn't object.
  OK, when I'm thinking about it in the morning, you're probably right.
And I can do the bigger change if I see that more filesystems would use
it too.

> The data journaling mode can be set as a flag associated with the inode.
>  Currently, i_data_log is set in REISERFS_I(inode)->i_flags. I add
> i_data_ordered in one of my later patches. They can be tested easily
> with reiserfs_file_data_{log,ordered}. There's no reason that one
> couldn't be moved up and made a prerequisite for the first patch.
  Fine. So we can just set proper journaling flags in reiserfs_quota_on
and then honor them in the internal writing functions.

Honza
  
-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs


Re: [PATCH 00/11] reiserfs: xattr rework

2006-03-03 Thread Jeff Mahoney
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Jan Kara wrote:
>> The data journaling mode can be set as a flag associated with the inode.
>>  Currently, i_data_log is set in REISERFS_I(inode)->i_flags. I add
>> i_data_ordered in one of my later patches. They can be tested easily
>> with reiserfs_file_data_{log,ordered}. There's no reason that one
>> couldn't be moved up and made a prerequisite for the first patch.
>   Fine. So we can just set proper journaling flags in reiserfs_quota_on
> and then honor them in the internal writing functions.

Ok, how do the attached patches look to you? The internal I/O changes
need to be applied after the journaled xattr patch or we get an Oops
trying to start a transaction without calling reiserfs_write_lock()
first. I've modified the first patch in the xattr series to abstract out
the fp->f_op->{read,write} calls to an xattr_{read,write} pair of
functions. This makes it easier to move to the internal i/o code later.
I've included it for clarity, but that is the only change.

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFECMCALPWxlyuTD7IRAlqfAJ0bQnHuNqzEif4hVfvGKI8tR2bUrACfU1Mg
BFbe0xayAJHhvNgtGl7N6Jk=
=P6ED
-END PGP SIGNATURE-
From: Jeff Mahoney <[EMAIL PROTECTED]>
Subject: [PATCH 10/13] reiserfs: make quota file i/o routines generic

 Jan Kara, with his work on journaled quotas, discovered some conditions
 were lock inversions can occur between pdflush and the internal file i/o
 paths when transactions are involved.

 This patch modifies the quota code in a few ways:
 1) Makes the reiserfs_quota_{read,write} generic and adds wrappers to
map the quota type to the appropriate inode.
 2) Uses the i_data_log flag to mark whether the quota file is journaled
or not. This is the standard way of doing this, and allows the generic
code to work without knowledge of quotas.

 fs/reiserfs/super.c |  212 
 include/linux/reiserfs_fs.h |5 +
 2 files changed, 122 insertions(+), 95 deletions(-)

Signed-off-by: Jeff Mahoney <[EMAIL PROTECTED]>

diff -ruNpX ../dontdiff linux-2.6.15-staging1/fs/reiserfs/super.c linux-2.6.15-staging2/fs/reiserfs/super.c
--- linux-2.6.15-staging1/fs/reiserfs/super.c	2006-03-03 17:09:01.0 -0500
+++ linux-2.6.15-staging2/fs/reiserfs/super.c	2006-03-03 17:09:04.0 -0500
@@ -1949,6 +1949,109 @@ static int reiserfs_statfs(struct super_
 	return 0;
 }
 
+#if defined(CONFIG_QUOTA) || defined(CONFIG_REISERFS_FS_XATTR)
+/* Read data from quotafile - avoid pagecache and such because we cannot afford
+ * acquiring the locks... As quota files are never truncated and quota code
+ * itself serializes the operations (and noone else should touch the files)
+ * we don't have to be afraid of races */
+ssize_t reiserfs_internal_read(struct inode *inode, char *data, size_t len,
+   loff_t off)
+{
+	struct super_block *sb = inode->i_sb;
+	unsigned long blk = off >> sb->s_blocksize_bits;
+	int err = 0, offset = off & (sb->s_blocksize - 1), tocopy;
+	size_t toread;
+	struct buffer_head tmp_bh, *bh;
+	loff_t i_size = i_size_read(inode);
+
+	if (off > i_size)
+		return 0;
+	if (off + len > i_size)
+		len = i_size - off;
+	toread = len;
+	while (toread > 0) {
+		tocopy =
+		sb->s_blocksize - offset <
+		toread ? sb->s_blocksize - offset : toread;
+		tmp_bh.b_state = 0;
+		/* Quota files are without tails so we can safely use this function */
+		reiserfs_write_lock(sb);
+		err = reiserfs_get_block(inode, blk, &tmp_bh, 0);
+		reiserfs_write_unlock(sb);
+		if (err)
+			return err;
+		if (!buffer_mapped(&tmp_bh))	/* A hole? */
+			memset(data, 0, tocopy);
+		else {
+			bh = sb_bread(sb, tmp_bh.b_blocknr);
+			if (!bh)
+return -EIO;
+			memcpy(data, bh->b_data + offset, tocopy);
+			brelse(bh);
+		}
+		offset = 0;
+		toread -= tocopy;
+		data += tocopy;
+		blk++;
+	}
+	return len;
+}
+
+/* Write to quotafile (we know the transaction is already started and has
+ * enough credits) */
+ssize_t reiserfs_internal_write(struct inode *inode, const char *data,
+size_t len, loff_t off)
+{
+	struct super_block *sb = inode->i_sb;
+	unsigned long blk = off >> sb->s_blocksize_bits;
+	int err = 0, offset = off & (sb->s_blocksize - 1), tocopy;
+	size_t towrite = len;
+	struct buffer_head tmp_bh, *bh;
+
+	mutex_lock(&inode->i_mutex);
+	while (towrite > 0) {
+		tocopy = sb->s_blocksize - offset < towrite ?
+		sb->s_blocksize - offset : towrite;
+		tmp_bh.b_state = 0;
+		err = reiserfs_get_block(inode, blk, &tmp_bh, GET_BLOCK_CREATE);
+		if (err)
+			goto out;
+		if (offset || tocopy != sb->s_blocksize)
+			bh = sb_bread(sb, tmp_bh.b_blocknr);
+		else
+			bh = sb_getblk(sb, tmp_bh.b_blocknr);
+		if (!bh) {
+			err = -EIO;
+			goto out;
+		}
+		lock_buffer(bh);
+		memcpy(bh->b_data + offset, data, tocopy);
+		flu

Re: [PATCH 00/11] reiserfs: xattr rework

2006-03-06 Thread Jan Kara
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Jan Kara wrote:
> >> The data journaling mode can be set as a flag associated with the inode.
> >>  Currently, i_data_log is set in REISERFS_I(inode)->i_flags. I add
> >> i_data_ordered in one of my later patches. They can be tested easily
> >> with reiserfs_file_data_{log,ordered}. There's no reason that one
> >> couldn't be moved up and made a prerequisite for the first patch.
> >   Fine. So we can just set proper journaling flags in reiserfs_quota_on
> > and then honor them in the internal writing functions.
> 
> Ok, how do the attached patches look to you? The internal I/O changes
> need to be applied after the journaled xattr patch or we get an Oops
> trying to start a transaction without calling reiserfs_write_lock()
> first. I've modified the first patch in the xattr series to abstract out
> the fp->f_op->{read,write} calls to an xattr_{read,write} pair of
> functions. This makes it easier to move to the internal i/o code later.
> I've included it for clarity, but that is the only change.
  The patch looks fine. Just two minor comments:


> diff -ruNpX ../dontdiff linux-2.6.15-staging1/fs/reiserfs/super.c 
> linux-2.6.15-staging2/fs/reiserfs/super.c
> --- linux-2.6.15-staging1/fs/reiserfs/super.c 2006-03-03 17:09:01.0 
> -0500
> +++ linux-2.6.15-staging2/fs/reiserfs/super.c 2006-03-03 17:09:04.0 
> -0500
> @@ -1949,6 +1949,109 @@ static int reiserfs_statfs(struct super_
>   return 0;
>  }
>  
> +#if defined(CONFIG_QUOTA) || defined(CONFIG_REISERFS_FS_XATTR)
> +/* Read data from quotafile - avoid pagecache and such because we cannot 
> afford
> + * acquiring the locks... As quota files are never truncated and quota code
> + * itself serializes the operations (and noone else should touch the files)
> + * we don't have to be afraid of races */
 Update here the comment to reflect that we use this function also for
xattrs now - I suppose those files cannot be truncated either and that
xattr code serializes the operations there.

> +ssize_t reiserfs_internal_read(struct inode *inode, char *data, size_t len,
> +   loff_t off)
  

> +/* Write to quotafile (we know the transaction is already started and has
> + * enough credits) */
  Here again update the comment...

> +ssize_t reiserfs_internal_write(struct inode *inode, const char *data,
> +size_t len, loff_t off)

Honza

-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs


Re: [PATCH 00/11] reiserfs: xattr rework

2006-03-07 Thread Jeff Mahoney
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Jan Kara wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA1
>>
>> Jan Kara wrote:
 The data journaling mode can be set as a flag associated with the inode.
  Currently, i_data_log is set in REISERFS_I(inode)->i_flags. I add
 i_data_ordered in one of my later patches. They can be tested easily
 with reiserfs_file_data_{log,ordered}. There's no reason that one
 couldn't be moved up and made a prerequisite for the first patch.
>>>   Fine. So we can just set proper journaling flags in reiserfs_quota_on
>>> and then honor them in the internal writing functions.
>> Ok, how do the attached patches look to you? The internal I/O changes
>> need to be applied after the journaled xattr patch or we get an Oops
>> trying to start a transaction without calling reiserfs_write_lock()
>> first. I've modified the first patch in the xattr series to abstract out
>> the fp->f_op->{read,write} calls to an xattr_{read,write} pair of
>> functions. This makes it easier to move to the internal i/o code later.
>> I've included it for clarity, but that is the only change.
>   The patch looks fine. Just two minor comments:
> 
> 
>> diff -ruNpX ../dontdiff linux-2.6.15-staging1/fs/reiserfs/super.c 
>> linux-2.6.15-staging2/fs/reiserfs/super.c
>> --- linux-2.6.15-staging1/fs/reiserfs/super.c2006-03-03 
>> 17:09:01.0 -0500
>> +++ linux-2.6.15-staging2/fs/reiserfs/super.c2006-03-03 
>> 17:09:04.0 -0500
>> @@ -1949,6 +1949,109 @@ static int reiserfs_statfs(struct super_
>>  return 0;
>>  }
>>  
>> +#if defined(CONFIG_QUOTA) || defined(CONFIG_REISERFS_FS_XATTR)
>> +/* Read data from quotafile - avoid pagecache and such because we cannot 
>> afford
>> + * acquiring the locks... As quota files are never truncated and quota code
>> + * itself serializes the operations (and noone else should touch the files)
>> + * we don't have to be afraid of races */
>  Update here the comment to reflect that we use this function also for
> xattrs now - I suppose those files cannot be truncated either and that
> xattr code serializes the operations there.
> 
>> +ssize_t reiserfs_internal_read(struct inode *inode, char *data, size_t len,
>> +   loff_t off)
>   
> 
>> +/* Write to quotafile (we know the transaction is already started and has
>> + * enough credits) */
>   Here again update the comment...
> 
>> +ssize_t reiserfs_internal_write(struct inode *inode, const char *data,
>> +size_t len, loff_t off)
> 
>   Honza
> 

I've updated the patches with the comment changes, though I did run into
a much bigger snag.

The internal i/o patches don't support tails, and that's a silver bullet
against this working for xattrs. Most xattrs, such as ACLs, are likley
to be only a few tens of bytes long and allocating an entire block is
extremely wasteful.

I've managed to alter internal read to handle tails by allocating an
anonymous page and using it with the temporary buffer head to get the
tail data from reiserfs_get_block back. But the rest of the tail packing
code very much needs the page cache. Is there going to be any way this
can be managed without reintroducing deadlocks?

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFEDf16LPWxlyuTD7IRApj7AJ4jymXed/tOb5tpVar1SZ6ZnrYl0ACfUuxk
eC6kOQW5zNuUIaoLrV0gIf8=
=u6L1
-END PGP SIGNATURE-


Re: [PATCH 00/11] reiserfs: xattr rework

2006-03-08 Thread Jan Kara
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Jan Kara wrote:
> >> -BEGIN PGP SIGNED MESSAGE-
> >> Hash: SHA1
> >>
> >> Jan Kara wrote:
>  The data journaling mode can be set as a flag associated with the inode.
>   Currently, i_data_log is set in REISERFS_I(inode)->i_flags. I add
>  i_data_ordered in one of my later patches. They can be tested easily
>  with reiserfs_file_data_{log,ordered}. There's no reason that one
>  couldn't be moved up and made a prerequisite for the first patch.
> >>>   Fine. So we can just set proper journaling flags in reiserfs_quota_on
> >>> and then honor them in the internal writing functions.
> >> Ok, how do the attached patches look to you? The internal I/O changes
> >> need to be applied after the journaled xattr patch or we get an Oops
> >> trying to start a transaction without calling reiserfs_write_lock()
> >> first. I've modified the first patch in the xattr series to abstract out
> >> the fp->f_op->{read,write} calls to an xattr_{read,write} pair of
> >> functions. This makes it easier to move to the internal i/o code later.
> >> I've included it for clarity, but that is the only change.
> >   The patch looks fine. Just two minor comments:
> > 
> > 
> >> diff -ruNpX ../dontdiff linux-2.6.15-staging1/fs/reiserfs/super.c 
> >> linux-2.6.15-staging2/fs/reiserfs/super.c
> >> --- linux-2.6.15-staging1/fs/reiserfs/super.c  2006-03-03 
> >> 17:09:01.0 -0500
> >> +++ linux-2.6.15-staging2/fs/reiserfs/super.c  2006-03-03 
> >> 17:09:04.0 -0500
> >> @@ -1949,6 +1949,109 @@ static int reiserfs_statfs(struct super_
> >>return 0;
> >>  }
> >>  
> >> +#if defined(CONFIG_QUOTA) || defined(CONFIG_REISERFS_FS_XATTR)
> >> +/* Read data from quotafile - avoid pagecache and such because we cannot 
> >> afford
> >> + * acquiring the locks... As quota files are never truncated and quota 
> >> code
> >> + * itself serializes the operations (and noone else should touch the 
> >> files)
> >> + * we don't have to be afraid of races */
> >  Update here the comment to reflect that we use this function also for
> > xattrs now - I suppose those files cannot be truncated either and that
> > xattr code serializes the operations there.
> > 
> >> +ssize_t reiserfs_internal_read(struct inode *inode, char *data, size_t 
> >> len,
> >> +   loff_t off)
> >   
> > 
> >> +/* Write to quotafile (we know the transaction is already started and has
> >> + * enough credits) */
> >   Here again update the comment...
> > 
> >> +ssize_t reiserfs_internal_write(struct inode *inode, const char *data,
> >> +size_t len, loff_t off)
> > 
> > Honza
> > 
> 
> I've updated the patches with the comment changes, though I did run into
> a much bigger snag.
> 
> The internal i/o patches don't support tails, and that's a silver bullet
> against this working for xattrs. Most xattrs, such as ACLs, are likley
> to be only a few tens of bytes long and allocating an entire block is
> extremely wasteful.
  Umm, that is really nasty. Ext3 solves this by sharing a block among
several inodes but that's far to much work to fix this bug...

> I've managed to alter internal read to handle tails by allocating an
> anonymous page and using it with the temporary buffer head to get the
> tail data from reiserfs_get_block back. But the rest of the tail packing
> code very much needs the page cache. Is there going to be any way this
> can be managed without reintroducing deadlocks?
  I've been trying to find some other way when solving problems for
quotas but find none. If you want xattr changes to be journaled with
other data changes, you have to first start a transaction and then issue
a write that will consequently need PageLock. So do you really need
a trasaction started before a write starts? For journaled quota this was
must but for xattrs it might not be necessary. Then we would still need
to sort out the problems with xattr lock but that might be easier to
deal with.

Bye
Honza

-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs


Re: [PATCH 00/11] reiserfs: xattr rework

2006-03-08 Thread Jeff Mahoney
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Jan Kara wrote:
>> The internal i/o patches don't support tails, and that's a silver bullet
>> against this working for xattrs. Most xattrs, such as ACLs, are likley
>> to be only a few tens of bytes long and allocating an entire block is
>> extremely wasteful.
>   Umm, that is really nasty. Ext3 solves this by sharing a block among
> several inodes but that's far to much work to fix this bug...

I had considered sharing files, and the code knows to drop a link to a
shared file when it's changed. That's one of the features I had wanted
from the beginning but never got around to implementing.

>> I've managed to alter internal read to handle tails by allocating an
>> anonymous page and using it with the temporary buffer head to get the
>> tail data from reiserfs_get_block back. But the rest of the tail packing
>> code very much needs the page cache. Is there going to be any way this
>> can be managed without reintroducing deadlocks?
>   I've been trying to find some other way when solving problems for
> quotas but find none. If you want xattr changes to be journaled with
> other data changes, you have to first start a transaction and then issue
> a write that will consequently need PageLock. So do you really need
> a trasaction started before a write starts? For journaled quota this was
> must but for xattrs it might not be necessary. Then we would still need
> to sort out the problems with xattr lock but that might be easier to
> deal with.

The locking constraints for xattrs have changed somewhat. Have you
looked at the rest of the xattr patches, or am I being dumb and missing
your point?

Do the quota races occur when they are completely journaled or only when
ordered writes are used? I'm wondering if perhaps we could alter
reiserfs_file_write a bit to not mark the buffers dirty yet for internal
files, and have the ordered writeout do it then. Then, we wouldn't race
against pdflush.

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFEDyyALPWxlyuTD7IRAiCuAJ4o8/SpIk8VBAl8/98kppRB9o7VcACfSGnM
pTqsa6fCMSDNBZoketOyx0I=
=ZKxx
-END PGP SIGNATURE-


Re: [PATCH 00/11] reiserfs: xattr rework

2006-03-08 Thread Andreas Dilger
On Mar 08, 2006  14:12 -0500, Jeff Mahoney wrote:
> Jan Kara wrote:
> >> The internal i/o patches don't support tails, and that's a silver bullet
> >> against this working for xattrs. Most xattrs, such as ACLs, are likley
> >> to be only a few tens of bytes long and allocating an entire block is
> >> extremely wasteful.
> >
> >   Umm, that is really nasty. Ext3 solves this by sharing a block among
> > several inodes but that's far to much work to fix this bug...
> 
> I had considered sharing files, and the code knows to drop a link to a
> shared file when it's changed. That's one of the features I had wanted
> from the beginning but never got around to implementing.

Just FYI, ext3 has recently implemented support for larger inodes exactly
to store small EAs with the inode instead of an external block.  For ACLs
there is some benefit to sharing the block, because the overhead is
amortized over many inodes.  However, virtually all other EA data is
unique per inode and the ext3 EA block sharing only works if ALL the EAs
for an inode are identical, so that isn't very useful if you have anything
other than ACLs to store.

The performance of in-inode EAs is vastly better than external blocks
because of seeks and not wasting 4kB of disk/RAM for 10-100 bytes of EA.

Not sure if this is useful (haven't been following discussion too closely),
but thought I would steer you away from the shared-block idea early.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: [PATCH 00/11] reiserfs: xattr rework

2006-03-09 Thread Hans Reiser
All of this is why xattrs should not be implemented without first
implementing plugins.:-/

Really guys, there was a reason for my not wanting xattrs to go into
V3.  Do you see it now?  This is what happens when marketing determines
feature ship schedules.  You could implement this xattr stuff in V4 in
1/5th the time, 5 times the performance, and twice the elegance.   Of
course, implementing them as pseudo files would be far more elegant
still

Hans


Re: [PATCH 00/11] reiserfs: xattr rework

2006-03-09 Thread Hans Reiser
Jeff Mahoney wrote:

>
>
> And as I've said before, Hans, if the original code base was capable of
> supporting the plethora of items the white paper hyped, we wouldn't have
> run into this problem either.

This is an odd criticism.  V3 never pretended to have item plugins.  V4
does.  If you add a new item plugin to V4, it will work quite well.

>
> I would have loved to have implemented xattrs as another item type, but
> as soon as I did that, the kernel crashed almost instantly on not
> recognizing the new item type in the balance code. While it was
> certainly fixable in that version, properly fixing it would have
> required a ReiserFS 3.7 with capability bits similar to ext[23]. Looking
> back, maybe that wouldn't have been such a bad thing.

All of the performance issues and stable code destabilization issues you
are experiencing now are what I expected, and what motivated my wanting
to defer acls for V4, and by so doing (and thus concentrating our
development resources on V4) advance the schedule for V4.

>
> As for waiting for v4, we've been through this before. Users wanted ACLs
> on ReiserFS yesterday, and I'd hardly brush aside features that users
> have been demanding as marketing.

There is a difference between brushing them aside, and choosing to put
them into the next major release because proper solution of them
requires deeper work than stuffing them into a file.  

Also, marketing is important.  The problem is when marketing motivates
wishful thinking about the shape of the code.  Architecture is like
sculpting, you must sense what the wood or marble wants to be shaped
into as clearly as you see the vision in your head of what you want it
to be.   When it won't support the shape you want, then in sculpting you
need another piece of wood, and in programming you need another major
release.

That said, I understand you are seeking to provide the users with what
they want, and that while we disagree on the best strategy for that, we
both try to do the best we can for the users.

> There's no denying that a
> reiser4-based solution would have been cleaner, but sometimes we just
> have to make do with what we've got.
>
> -Jeff
>
> --
> Jeff Mahoney
> SUSE Labs



Re: [PATCH 00/11] reiserfs: xattr rework

2006-03-09 Thread Jeff Mahoney
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hans Reiser wrote:
> All of this is why xattrs should not be implemented without first
> implementing plugins.:-/
> 
> Really guys, there was a reason for my not wanting xattrs to go into
> V3.  Do you see it now?  This is what happens when marketing determines
> feature ship schedules.  You could implement this xattr stuff in V4 in
> 1/5th the time, 5 times the performance, and twice the elegance.   Of
> course, implementing them as pseudo files would be far more elegant
> still

And as I've said before, Hans, if the original code base was capable of
supporting the plethora of items the white paper hyped, we wouldn't have
run into this problem either.

I would have loved to have implemented xattrs as another item type, but
as soon as I did that, the kernel crashed almost instantly on not
recognizing the new item type in the balance code. While it was
certainly fixable in that version, properly fixing it would have
required a ReiserFS 3.7 with capability bits similar to ext[23]. Looking
back, maybe that wouldn't have been such a bad thing.

As for waiting for v4, we've been through this before. Users wanted ACLs
on ReiserFS yesterday, and I'd hardly brush aside features that users
have been demanding as marketing. There's no denying that a
reiser4-based solution would have been cleaner, but sometimes we just
have to make do with what we've got.

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFEEH3lLPWxlyuTD7IRAgz6AJ4l4/f92LJAKs65OqAbl8cIIL1PRACdE5Yb
MmQcruV2Nmd2l3RS+V0TYrI=
=ujKr
-END PGP SIGNATURE-